Message ID | 20240909133921.1141067-1-peter.maydell@linaro.org |
---|---|
State | Rejected |
Series | [RFC] tests/qtest: Don't parallelize migration-test |
Peter Maydell <peter.maydell@linaro.org> writes:

> The migration-test is a long-running test whose subtests all launch
> at least two QEMU processes. This means that if for example the host
> has 4 CPUs then 'make check' defaults to a parallelism of 5, and if
> we launch 5 migration-tests in parallel then we will be running 10
> QEMU instances on a 4 CPU system. If the system is not very fast
> then the test can spuriously time out because the different tests are
> all stealing CPU from each other. This seems to particularly be a
> problem on our S390 CI job and the cross-i686-tci CI job.
>
> Force meson to run migration-test non-parallel, so there is never any
> other test running at the same time as it. This will slow down
> overall test execution time somewhat, but hopefully make our CI less
> flaky.
>
> The downside is that because each migration-test instance runs for
> between 2 and 5 minutes and we run it for five architectures this
> significantly increases the runtime. For an all-architectures build
> on my local machine 'make check -j8' goes from
>
>   real    8m19.127s
>   user    31m47.534s
>   sys     19m42.650s
>
> to
>
>   real    20m31.218s
>   user    32m48.712s
>   sys     19m52.133s
>
> more than doubling the wallclock runtime.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> Also, looking at these figures we spend a *lot* of our overall
> 'make check' time on migration-test. Do we really need to do
> that much for every architecture?

I guess one question is are we getting value from all the extra
migration tests? There certainly seem to be some sub-tests that are
slower than the others and I assume testing a small delta on the tests
before it.

On s390x it seems the native test runs pretty much to the same time as
the other TCG guests. Do we exercise any extra migration code by running
tests for every architecture as opposed to one KVM/native hyp and one
TCG one?
On Mon, 9 Sept 2024 at 16:23, Alex Bennée <alex.bennee@linaro.org> wrote:

> I guess one question is are we getting value from all the extra
> migration tests? There certainly seem to be some sub-tests that are
> slower than the others and I assume testing a small delta on the tests
> before it.
>
> On s390x it seems the native test runs pretty much to the same time as
> the other TCG guests. Do we exercise any extra migration code by running
> tests for every architecture as opposed to one KVM/native hyp and one
> TCG one?

s390 is an interesting one because Christian pointed out that
although it has "KVM" support, we're actually running on a VM
under z/VM, and so when we run a CI test under "-accel KVM"
that's actually nested-KVM and its effects on the host CPU's TLB
could be such that it's actually worse than using TCG...

-- PMM
diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index fc852f3d8ba..dbf2b8e2be1 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -17,6 +17,21 @@ slow_qtests = {
   'vmgenid-test': 610,
 }
 
+# Tests which override the default of "can run in parallel".
+# Don't use this to work around test bugs which prevent parallelism.
+# Do document why we need to make a particular test serialized.
+# Do be sparing with use of this: tests listed here will not be
+# run in parallel with any other test, not merely not with other
+# instances of themselves.
+#
+# The migration-test's subtests will each kick off two QEMU processes,
+# so allowing multiple migration-tests in parallel can overload the
+# host system and result in intermittent timeouts. So we only want to
+# run one migration-test at once.
+qtests_parallelism = {
+  'migration-test': false,
+}
+
 qtests_generic = [
   'cdrom-test',
   'device-introspect-test',
@@ -411,6 +426,7 @@ foreach dir : target_dirs
          protocol: 'tap',
          timeout: slow_qtests.get(test, 60),
          priority: slow_qtests.get(test, 60),
+         is_parallel: qtests_parallelism.get(test, true),
          suite: ['qtest', 'qtest-' + target_base])
   endforeach
 endforeach
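
The `qtests_parallelism.get(test, true)` lookup in the diff is a default-fallback dictionary lookup: any test not listed keeps the default of running in parallel. As a rough illustration only, here is the same pattern in Python. This is an analogue, not part of the patch; the function name `is_parallel` and the upper-case constant are made up for the sketch, and the test names are taken from the diff context.

```python
# Python analogue of the Meson dictionary added by the patch.
# Illustrative only: the real logic lives in tests/qtest/meson.build.
QTESTS_PARALLELISM = {"migration-test": False}


def is_parallel(test_name):
    """Return True unless the test is explicitly listed as serialized."""
    return QTESTS_PARALLELISM.get(test_name, True)


if __name__ == "__main__":
    for name in ("cdrom-test", "migration-test"):
        print(name, is_parallel(name))
```

Keeping the exceptions in a small dictionary means the `test()` call stays uniform for every qtest, and each serialized test has one obvious place where the reason can be documented.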
The migration-test is a long-running test whose subtests all launch
at least two QEMU processes. This means that if for example the host
has 4 CPUs then 'make check' defaults to a parallelism of 5, and if
we launch 5 migration-tests in parallel then we will be running 10
QEMU instances on a 4 CPU system. If the system is not very fast
then the test can spuriously time out because the different tests are
all stealing CPU from each other. This seems to particularly be a
problem on our S390 CI job and the cross-i686-tci CI job.

Force meson to run migration-test non-parallel, so there is never any
other test running at the same time as it. This will slow down
overall test execution time somewhat, but hopefully make our CI less
flaky.

The downside is that because each migration-test instance runs for
between 2 and 5 minutes and we run it for five architectures this
significantly increases the runtime. For an all-architectures build
on my local machine 'make check -j8' goes from

  real    8m19.127s
  user    31m47.534s
  sys     19m42.650s

to

  real    20m31.218s
  user    32m48.712s
  sys     19m52.133s

more than doubling the wallclock runtime.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
Also, looking at these figures we spend a *lot* of our overall
'make check' time on migration-test. Do we really need to do
that much for every architecture?

It's unfortunate that meson doesn't let us say "parallel is OK,
but not very parallel". One other approach would be to have mtest2make
say "run tests at half the parallelism that -jN suggests, rather
than at that parallelism", I guess...
---
 tests/qtest/meson.build | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)
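
To put numbers on the trade-off described in the commit message, the following Python sketch works through the arithmetic using only the figures quoted there: 5 parallel tests with 2 QEMU processes each on a 4 CPU host, and the before/after `time` output for 'make check -j8'. The helper names (`parse_time`, `slowdown`, `qemu_instances`) are invented for this sketch and are not part of QEMU or meson.

```python
def parse_time(text):
    """Convert a `time`-style value such as '8m19.127s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Figures quoted in the commit message ('make check -j8', all architectures).
BEFORE = {"real": "8m19.127s", "user": "31m47.534s", "sys": "19m42.650s"}
AFTER = {"real": "20m31.218s", "user": "32m48.712s", "sys": "19m52.133s"}


def slowdown(key):
    """Ratio of the serialized run to the parallel run for one time field."""
    return parse_time(AFTER[key]) / parse_time(BEFORE[key])


def qemu_instances(parallel_tests, qemus_per_test=2):
    """QEMU processes alive if every running test is a migration-test."""
    return parallel_tests * qemus_per_test


if __name__ == "__main__":
    for key in ("real", "user", "sys"):
        print(f"{key}: {slowdown(key):.2f}x")
    # 5 parallel tests on a 4 CPU host, as in the commit message.
    print("QEMU instances:", qemu_instances(5), "on 4 CPUs")
```

Run as a script, this shows the wallclock time growing by a factor of roughly 2.5 while user and sys time barely move: serializing the test does not add much work, it mostly leaves CPUs idle while migration-test runs alone.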