diff mbox series

tests: Disable migration-test

Message ID 20230221132726.2892383-1-peter.maydell@linaro.org
State Superseded
Headers show
Series tests: Disable migration-test | expand

Commit Message

Peter Maydell Feb. 21, 2023, 1:27 p.m. UTC
The migration-test is annoyingly flaky. Examples:

https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
(a FreeBSD job)
  32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR

on a local macos x86 box:
▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_
str_equal(status, "failed")) ERROR
 34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-i386: Failed to peek at channel
query-migrate shows failed migration: Unable to write to socket: Broken pipe
**
ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
 37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
query-migrate shows failed migration: Unable to write to socket: Broken pipe
**
ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

I've seen this on other CI jobs as well, but Gitlab's UI makes it
pretty much impossible to re-find failed jobs, since you can't
search for them by failure reason at all.

I've also seen this fail on the OpenBSD vm build.

I've seen the migration-test hang on the s390 private CI runner
in such a way that even though the CI job has timed out, the
stale QEMU and migration-test processes are still lying around on
the host.

I've complained about these before, but nobody has either investigated
or suggested improvements to the test program that would let us gather
more information about what's happening when these fail.
 https://lore.kernel.org/qemu-devel/CAFEAcA8x_iM3hN2-P9F+huXnXFXy+D6FzE+Leq4erLdg7zkVGw@mail.gmail.com/

So this is the big hammer: disable the test entirely, so that we
don't keep getting CI job intermittent failures because of it.
When somebody has time to investigate, we can fix the underlying
cause and reenable the job.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
This is an "if you don't want this, propose something else" patch :-)
---
 tests/qtest/meson.build | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Thomas Huth Feb. 21, 2023, 2:27 p.m. UTC | #1
On 21/02/2023 14.27, Peter Maydell wrote:
> The migration-test is annoyingly flaky. Examples:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>    32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_
> str_equal(status, "failed")) ERROR
>   34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>   37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> I've seen this on other CI jobs as well, but Gitlab's UI makes it
> pretty much impossible to re-find failed jobs, since you can't
> search for them by failure reason at all.
> 
> I've also seen this fail on the OpenBSD vm build.
> 
> I've seen the migration-test hang on the s390 private CI runner
> in such a way that even though the CI job has timed out, the
> stale QEMU and migration-test processes are still lying around on
> the host.
> 
> I've complained about these before, but nobody has either investigated
> or suggested improvements to the test program that would let us gather
> more information about what's happening when these fail.
>   https://lore.kernel.org/qemu-devel/CAFEAcA8x_iM3hN2-P9F+huXnXFXy+D6FzE+Leq4erLdg7zkVGw@mail.gmail.com/
> 
> So this is the big hammer: disable the test entirely, so that we
> don't keep getting CI job intermittent failures because of it.
> When somebody has time to investigate, we can fix the underlying
> cause and reenable the job.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> This is an "if you don't want this, propose something else" patch :-)

I'm also regularly running into issues with this test, so from my side:

Acked-by: Thomas Huth <thuth@redhat.com>
Philippe Mathieu-Daudé Feb. 21, 2023, 2:36 p.m. UTC | #2
On 21/2/23 15:27, Thomas Huth wrote:
> On 21/02/2023 14.27, Peter Maydell wrote:
>> The migration-test is annoyingly flaky. Examples:
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
>> (a FreeBSD job)
>>    32/648 
>> ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>>
>> on a local macos x86 box:
>> ▶  34/621 
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_
>> str_equal(status, "failed")) ERROR
>>   34/621 qemu:qtest+qtest-i386 / 
>> qtest-i386/migration-test                         ERROR          
>> 168.12s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  
>> ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> qemu-system-i386: Failed to peek at channel
>> query-migrate shows failed migration: Unable to write to socket: 
>> Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>>
>> (test program exited with status code -6)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>
>> ▶  37/621 
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>>   37/621 qemu:qtest+qtest-x86_64 / 
>> qtest-x86_64/migration-test                     ERROR          
>> 174.37s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  
>> ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> query-migrate shows failed migration: Unable to write to socket: 
>> Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>>
>> (test program exited with status code -6)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>
>> I've seen this on other CI jobs as well, but Gitlab's UI makes it
>> pretty much impossible to re-find failed jobs, since you can't
>> search for them by failure reason at all.
>>
>> I've also seen this fail on the OpenBSD vm build.
>>
>> I've seen the migration-test hang on the s390 private CI runner
>> in such a way that even though the CI job has timed out, the
>> stale QEMU and migration-test processes are still lying around on
>> the host.
>>
>> I've complained about these before, but nobody has either investigated
>> or suggested improvements to the test program that would let us gather
>> more information about what's happening when these fail.
>>   
>> https://lore.kernel.org/qemu-devel/CAFEAcA8x_iM3hN2-P9F+huXnXFXy+D6FzE+Leq4erLdg7zkVGw@mail.gmail.com/
>>
>> So this is the big hammer: disable the test entirely, so that we
>> don't keep getting CI job intermittent failures because of it.
>> When somebody has time to investigate, we can fix the underlying
>> cause and reenable the job.

This isn't a negative patch. Flaky tests happen, and if it is hard
to fix them, disabling them is saner than loosing trust in the CI
and the rest of the tests.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>> ---
>> This is an "if you don't want this, propose something else" patch :-)
> 
> I'm also regularly running into issues with this test, so from my side:
> 
> Acked-by: Thomas Huth <thuth@redhat.com>
Dr. David Alan Gilbert Feb. 21, 2023, 3:21 p.m. UTC | #3
* Peter Maydell (peter.maydell@linaro.org) wrote:
> The migration-test is annoyingly flaky. Examples:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_
> str_equal(status, "failed")) ERROR
>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> I've seen this on other CI jobs as well, but Gitlab's UI makes it
> pretty much impossible to re-find failed jobs, since you can't
> search for them by failure reason at all.
> 
> I've also seen this fail on the OpenBSD vm build.
> 
> I've seen the migration-test hang on the s390 private CI runner
> in such a way that even though the CI job has timed out, the
> stale QEMU and migration-test processes are still lying around on
> the host.
> 
> I've complained about these before, but nobody has either investigated
> or suggested improvements to the test program that would let us gather
> more information about what's happening when these fail.
>  https://lore.kernel.org/qemu-devel/CAFEAcA8x_iM3hN2-P9F+huXnXFXy+D6FzE+Leq4erLdg7zkVGw@mail.gmail.com/

Damn this is really going to impact the stability of migration if we
don't regularly test.
But fundamentally, I've never been able to debug much of the reports
that come from flakyness in gitlab ci; we're not getting the most basic
information like which subtest or where we're upto in the test which
makes it very very hard to debug.

Dave

> So this is the big hammer: disable the test entirely, so that we
> don't keep getting CI job intermittent failures because of it.
> When somebody has time to investigate, we can fix the underlying
> cause and reenable the job.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> This is an "if you don't want this, propose something else" patch :-)
> ---
>  tests/qtest/meson.build | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
> index 29a4efb4c24..0e362fcb1e0 100644
> --- a/tests/qtest/meson.build
> +++ b/tests/qtest/meson.build
> @@ -11,6 +11,12 @@ slow_qtests = {
>    'test-hmp' : 120,
>  }
>  
> +# Temporarily disabled tests can be listed here
> +qtests_disabled = [
> +  # This test is currently flaky and fails intermittently
> +  'migration-test',
> +]
> +
>  qtests_generic = [
>    'cdrom-test',
>    'device-introspect-test',
> @@ -343,6 +349,9 @@ foreach dir : target_dirs
>    endif
>  
>    foreach test : target_qtests
> +    if test in qtests_disabled
> +      continue
> +    endif
>      # Executables are shared across targets, declare them only the first time we
>      # encounter them
>      if not qtest_executables.has_key(test)
> -- 
> 2.34.1
>
Peter Maydell Feb. 21, 2023, 3:29 p.m. UTC | #4
On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> Damn this is really going to impact the stability of migration if we
> don't regularly test.
> But fundamentally, I've never been able to debug much of the reports
> that come from flakyness in gitlab ci; we're not getting the most basic
> information like which subtest or where we're upto in the test which
> makes it very very hard to debug.

Right, but if you want more information you need to change the
tests and/or test harness to provide it.

thanks
-- PMM
Dr. David Alan Gilbert Feb. 21, 2023, 3:36 p.m. UTC | #5
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > Damn this is really going to impact the stability of migration if we
> > don't regularly test.
> > But fundamentally, I've never been able to debug much of the reports
> > that come from flakyness in gitlab ci; we're not getting the most basic
> > information like which subtest or where we're upto in the test which
> > makes it very very hard to debug.
> 
> Right, but if you want more information you need to change the
> tests and/or test harness to provide it.

I don't think the migration test is doing anything odd in that respect;
We've just got a bunch of qtest tests; having a test framework which
doesn't tell you which test failed is very difficult.

Dave

> thanks
> -- PMM
>
Peter Maydell Feb. 21, 2023, 3:47 p.m. UTC | #6
On Tue, 21 Feb 2023 at 15:36, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> * Peter Maydell (peter.maydell@linaro.org) wrote:
> > On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > Damn this is really going to impact the stability of migration if we
> > > don't regularly test.
> > > But fundamentally, I've never been able to debug much of the reports
> > > that come from flakyness in gitlab ci; we're not getting the most basic
> > > information like which subtest or where we're upto in the test which
> > > makes it very very hard to debug.
> >
> > Right, but if you want more information you need to change the
> > tests and/or test harness to provide it.
>
> I don't think the migration test is doing anything odd in that respect;
> We've just got a bunch of qtest tests; having a test framework which
> doesn't tell you which test failed is very difficult.

In the specific case here of the FreeBSD job it does, though
you have to search backwards for 'migration' to find it:
https://gitlab.com/qemu-project/qemu/-/jobs/3806090216

It prints
ok 23 /aarch64/migration/multifd/tcp/plain/none
# starting QEMU: exec ./qemu-system-aarch64 -qtest
unix:/tmp/qtest-32469.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-32469.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg
-machine virt,gic-version=max -name source,debug-threads=on -m 150M
-serial file:/tmp/migration-test-2A1201/src_serial -cpu max -kernel
/tmp/migration-test-2A1201/bootsect 2>/dev/null -accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest
unix:/tmp/qtest-32469.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-32469.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg
-machine virt,gic-version=max -name target,debug-threads=on -m 150M
-serial file:/tmp/migration-test-2A1201/dest_serial -incoming defer
-cpu max -kernel /tmp/migration-test-2A1201/bootsect 2>/dev/null
-accel qtest
# starting QEMU: exec ./qemu-system-aarch64 -qtest
unix:/tmp/qtest-32469.sock -qtest-log /dev/null -chardev
socket,path=/tmp/qtest-32469.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg
-machine virt,gic-version=max -name target,debug-threads=on -m 150M
-serial file:/tmp/migration-test-2A1201/dest_serial -incoming defer
-cpu max -kernel /tmp/migration-test-2A1201/bootsect -accel qtest
Bail out! ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status:
assertion failed: (g_test_timer_elapsed() <
MIGRATION_STATUS_WAIT_TIMEOUT)

so it has successfully run tests up to 23, and then
crashed on test 24 (which will be
migration/multifd/tcp/plain/cancel).

The test framework also collects and logs any information
the test case wants to print to stderr -- the reason
there is no information there other than the assertion
message is because that's the only thing the test emits.

The local macos test log I have here also shows it failing
on test 24.

The macos tests are easy to provoke failures here during
'make check', it happens at least 1 time in 2. So if you
want to write a patch to print out more information to
the test log I'm happy to run it and collect the output.

thanks
-- PMM
Daniel P. Berrangé Feb. 21, 2023, 3:50 p.m. UTC | #7
On Tue, Feb 21, 2023 at 01:27:26PM +0000, Peter Maydell wrote:
> The migration-test is annoyingly flaky. Examples:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_
> str_equal(status, "failed")) ERROR
>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)

THis particular one looks like a recent regresssion. The 'Failed to peek at'
message refers to a patch series that merged last week I believe.

Probably not the fault of that series. Rather the test is now
exposed to yet another failure scenario. It probably would have
failed with a different message before this.

With regards,
Daniel
Daniel P. Berrangé Feb. 21, 2023, 4:09 p.m. UTC | #8
On Tue, Feb 21, 2023 at 03:36:39PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Maydell (peter.maydell@linaro.org) wrote:
> > On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
> > <dgilbert@redhat.com> wrote:
> > > Damn this is really going to impact the stability of migration if we
> > > don't regularly test.
> > > But fundamentally, I've never been able to debug much of the reports
> > > that come from flakyness in gitlab ci; we're not getting the most basic
> > > information like which subtest or where we're upto in the test which
> > > makes it very very hard to debug.
> > 
> > Right, but if you want more information you need to change the
> > tests and/or test harness to provide it.
> 
> I don't think the migration test is doing anything odd in that respect;
> We've just got a bunch of qtest tests; having a test framework which
> doesn't tell you which test failed is very difficult.

Right so the problem here is the use of 'bail out'. From the POV
of the TAP output format that is an immediate termination, so there
is nothing to report about which test was being run.

To make the failing test visible, it needs to NOT trigger a bail
out, but instead report an "not ok" line, which will show the
test case name. AFAIK, this is a limitation of glib's test harness
and its adoption of TAP. You can't get the test case name printed
until the test case is finished. And glib tests fail by calling
assert, so they will inherantly trigger 'bail out' logic from a
TAP POV.

IIRC, the historical non-TAP output format would output the test
case name first, and then once done report ok/not ok.

The workaround would be for glib to use a TAP diagnostic line
to print the test case it is about to run. It already uses the
diagnostic lines to report test groups eg

$ ./build//tests/unit/test-io-channel-command
# random seed: R02S0718b3006d3dcf15099db36b61dff3e8
1..4
# Start of io tests
# Start of channel tests
# Start of command tests
# Start of fifo tests
ok 1 /io/channel/command/fifo/sync
ok 2 /io/channel/command/fifo/async
# End of fifo tests
# Start of echo tests
**
ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
Bail out! ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
Aborted (core dumped)


would have to be changed to report

$ ./build//tests/unit/test-io-channel-command
# random seed: R02S0718b3006d3dcf15099db36b61dff3e8
1..4
# Start of io tests
# Start of channel tests
# Start of command tests
# Start of fifo tests
# Running /io/channel/command/fifo/sync
ok 1 /io/channel/command/fifo/sync
# Running /io/channel/command/fifo/async
ok 2 /io/channel/command/fifo/async
# End of fifo tests
# Start of echo tests
# Running /io/channel/command/echo/sync
**
ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
Bail out! ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
Aborted (core dumped)

so we see exactly what was running.

Without this though, we can still figure out what was running. You
have to look back for the previous 'ok' line and see what it was.
Then locally run '/path/to/test -l' to list the test case names.
The one you want is the one immediately after the last 'ok' (not ok)
line.

Rather tedious for sure, but not impossible.

Worth an RFE for glib, but would be a while before we saw the benefit.

As a workaround, we could print out TAP diagnostic output ourselves
in any qtests that we see as especially unreliable. A diagnostic
output is any line printed on stdout that starts with a '# '

With regards,
Daniel
Peter Xu Feb. 21, 2023, 5:14 p.m. UTC | #9
On Tue, Feb 21, 2023 at 04:09:42PM +0000, Daniel P. Berrangé wrote:
> On Tue, Feb 21, 2023 at 03:36:39PM +0000, Dr. David Alan Gilbert wrote:
> > * Peter Maydell (peter.maydell@linaro.org) wrote:
> > > On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
> > > <dgilbert@redhat.com> wrote:
> > > > Damn this is really going to impact the stability of migration if we
> > > > don't regularly test.
> > > > But fundamentally, I've never been able to debug much of the reports
> > > > that come from flakyness in gitlab ci; we're not getting the most basic
> > > > information like which subtest or where we're upto in the test which
> > > > makes it very very hard to debug.
> > > 
> > > Right, but if you want more information you need to change the
> > > tests and/or test harness to provide it.
> > 
> > I don't think the migration test is doing anything odd in that respect;
> > We've just got a bunch of qtest tests; having a test framework which
> > doesn't tell you which test failed is very difficult.
> 
> Right so the problem here is the use of 'bail out'. From the POV
> of the TAP output format that is an immediate termination, so there
> is nothing to report about which test was being run.
> 
> To make the failing test visible, it needs to NOT trigger a bail
> out, but instead report an "not ok" line, which will show the
> test case name. AFAIK, this is a limitation of glib's test harness
> and its adoption of TAP. You can't get the test case name printed
> until the test case is finished. And glib tests fail by calling
> assert, so they will inherantly trigger 'bail out' logic from a
> TAP POV.
> 
> IIRC, the historical non-TAP output format would output the test
> case name first, and then once done report ok/not ok.
> 
> The workaround would be for glib to use a TAP diagnostic line
> to print the test case it is about to run. It already uses the
> diagnostic lines to report test groups eg
> 
> $ ./build//tests/unit/test-io-channel-command
> # random seed: R02S0718b3006d3dcf15099db36b61dff3e8
> 1..4
> # Start of io tests
> # Start of channel tests
> # Start of command tests
> # Start of fifo tests
> ok 1 /io/channel/command/fifo/sync
> ok 2 /io/channel/command/fifo/async
> # End of fifo tests
> # Start of echo tests
> **
> ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
> Bail out! ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
> Aborted (core dumped)
> 
> 
> would have to be changed to report
> 
> $ ./build//tests/unit/test-io-channel-command
> # random seed: R02S0718b3006d3dcf15099db36b61dff3e8
> 1..4
> # Start of io tests
> # Start of channel tests
> # Start of command tests
> # Start of fifo tests
> # Running /io/channel/command/fifo/sync
> ok 1 /io/channel/command/fifo/sync
> # Running /io/channel/command/fifo/async
> ok 2 /io/channel/command/fifo/async
> # End of fifo tests
> # Start of echo tests
> # Running /io/channel/command/echo/sync
> **
> ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
> Bail out! ERROR:../tests/unit/test-io-channel-command.c:102:test_io_channel_command_echo: assertion failed: (rand() < 0.5)
> Aborted (core dumped)
> 
> so we see exactly what was running.
> 
> Without this though, we can still figure out what was running. You
> have to look back for the previous 'ok' line and see what it was.
> Then locally run '/path/to/test -l' to list the test case names.
> The one you want is the one immediately after the last 'ok' (not ok)
> line.
> 
> Rather tedious for sure, but not impossible.
> 
> Worth an RFE for glib, but would be a while before we saw the benefit.
> 
> As a workaround, we could print out TAP diagnostic output ourselves
> in any qtests that we see as especially unreliable. A diagnostic
> output is any line printed on stdout that starts with a '# '

The other possible way is I'm wondering whether we can try to do a whole
stack dump when an assertion happened.

With a backtrace we can easily see the function pointer so that'll be
another way of knowing which test we're running, slightly more awkward than
dumping the names but probably still good enough for developers.

It also contains more information so sometimes we can already spot the bug
even before trying to reproduce.

Though I'm not sure whether it's easily feasible, e.g., the tests are
mostly using g_assert*() families and I've no idea of any good way to let
it happen if without replacing all of them to inject a dump.

Thanks,
Thomas Huth Feb. 21, 2023, 5:35 p.m. UTC | #10
On 21/02/2023 16.36, Dr. David Alan Gilbert wrote:
> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>>> Damn this is really going to impact the stability of migration if we
>>> don't regularly test.
>>> But fundamentally, I've never been able to debug much of the reports
>>> that come from flakyness in gitlab ci; we're not getting the most basic
>>> information like which subtest or where we're upto in the test which
>>> makes it very very hard to debug.
>>
>> Right, but if you want more information you need to change the
>> tests and/or test harness to provide it.
> 
> I don't think the migration test is doing anything odd in that respect;
> We've just got a bunch of qtest tests; having a test framework which
> doesn't tell you which test failed is very difficult.

FWIW, I just saw another failure, on a s390x host,
last lines of the log are:

ok 10 /s390x/migration/postcopy/recovery/tls/psk
# End of tls tests
# End of recovery tests
# Start of preempt tests
# starting QEMU: exec ./qemu-system-s390x -qtest unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 128M -serial file:/tmp/migration-test-03MJ01/src_serial -bios /tmp/migration-test-03MJ01/bootsect    -accel qtest
# starting QEMU: exec ./qemu-system-s390x -qtest unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 128M -serial file:/tmp/migration-test-03MJ01/dest_serial -incoming unix:/tmp/migration-test-03MJ01/migsocket -bios /tmp/migration-test-03MJ01/bootsect    -accel qtest
ok 11 /s390x/migration/postcopy/preempt/plain
# Start of recovery tests
# starting QEMU: exec ./qemu-system-s390x -qtest unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 128M -serial file:/tmp/migration-test-03MJ01/src_serial -bios /tmp/migration-test-03MJ01/bootsect   2>/dev/null -accel qtest
# starting QEMU: exec ./qemu-system-s390x -qtest unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 128M -serial file:/tmp/migration-test-03MJ01/dest_serial -incoming unix:/tmp/migration-test-03MJ01/migsocket -bios /tmp/migration-test-03MJ01/bootsect   2>/dev/null -accel qtest
----------------------------------- stderr -----------------------------------
Broken pipe
../tests/qtest/libqtest.c:193: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped)

(test program exited with status code -6)

  Thomas
Philippe Mathieu-Daudé Feb. 21, 2023, 10:24 p.m. UTC | #11
On 21/2/23 18:35, Thomas Huth wrote:
> On 21/02/2023 16.36, Dr. David Alan Gilbert wrote:
>> * Peter Maydell (peter.maydell@linaro.org) wrote:
>>> On Tue, 21 Feb 2023 at 15:21, Dr. David Alan Gilbert
>>> <dgilbert@redhat.com> wrote:
>>>> Damn this is really going to impact the stability of migration if we
>>>> don't regularly test.
>>>> But fundamentally, I've never been able to debug much of the reports
>>>> that come from flakyness in gitlab ci; we're not getting the most basic
>>>> information like which subtest or where we're upto in the test which
>>>> makes it very very hard to debug.
>>>
>>> Right, but if you want more information you need to change the
>>> tests and/or test harness to provide it.
>>
>> I don't think the migration test is doing anything odd in that respect;
>> We've just got a bunch of qtest tests; having a test framework which
>> doesn't tell you which test failed is very difficult.
> 
> FWIW, I just saw another failure, on a s390x host,
> last lines of the log are:
> 
> ok 10 /s390x/migration/postcopy/recovery/tls/psk
> # End of tls tests
> # End of recovery tests
> # Start of preempt tests
> # starting QEMU: exec ./qemu-system-s390x -qtest 
> unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -accel kvm -accel tcg -name source,debug-threads=on -m 
> 128M -serial file:/tmp/migration-test-03MJ01/src_serial -bios 
> /tmp/migration-test-03MJ01/bootsect    -accel qtest
> # starting QEMU: exec ./qemu-system-s390x -qtest 
> unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -accel kvm -accel tcg -name target,debug-threads=on -m 
> 128M -serial file:/tmp/migration-test-03MJ01/dest_serial -incoming 
> unix:/tmp/migration-test-03MJ01/migsocket -bios 
> /tmp/migration-test-03MJ01/bootsect    -accel qtest
> ok 11 /s390x/migration/postcopy/preempt/plain
> # Start of recovery tests
> # starting QEMU: exec ./qemu-system-s390x -qtest 
> unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -accel kvm -accel tcg -name source,debug-threads=on -m 
> 128M -serial file:/tmp/migration-test-03MJ01/src_serial -bios 
> /tmp/migration-test-03MJ01/bootsect   2>/dev/null -accel qtest
> # starting QEMU: exec ./qemu-system-s390x -qtest 
> unix:/tmp/qtest-7847.sock -qtest-log /dev/null -chardev 
> socket,path=/tmp/qtest-7847.qmp,id=char0 -mon chardev=char0,mode=control 
> -display none -accel kvm -accel tcg -name target,debug-threads=on -m 
> 128M -serial file:/tmp/migration-test-03MJ01/dest_serial -incoming 
> unix:/tmp/migration-test-03MJ01/migsocket -bios 
> /tmp/migration-test-03MJ01/bootsect   2>/dev/null -accel qtest
> ----------------------------------- stderr 
> -----------------------------------
> Broken pipe
> ../tests/qtest/libqtest.c:193: kill_qemu() detected QEMU death from 
> signal 11 (Segmentation fault) (core dumped)
> 
> (test program exited with status code -6)

I get some of these too, but qtest parent debug info isn't very helpful,
I'd rather get the child info. I tried this some time ago but per Paolo
it wasn't correct: 
https://lore.kernel.org/qemu-devel/20200707031920.17428-1-f4bug@amsat.org/
I still use this patch when running my tests as it helps me to
understand where the child crash, so I didn't really looked at
what is the correct fix.
diff mbox series

Patch

diff --git a/tests/qtest/meson.build b/tests/qtest/meson.build
index 29a4efb4c24..0e362fcb1e0 100644
--- a/tests/qtest/meson.build
+++ b/tests/qtest/meson.build
@@ -11,6 +11,12 @@  slow_qtests = {
   'test-hmp' : 120,
 }
 
+# Temporarily disabled tests can be listed here
+qtests_disabled = [
+  # This test is currently flaky and fails intermittently
+  'migration-test',
+]
+
 qtests_generic = [
   'cdrom-test',
   'device-introspect-test',
@@ -343,6 +349,9 @@  foreach dir : target_dirs
   endif
 
   foreach test : target_qtests
+    if test in qtests_disabled
+      continue
+    endif
     # Executables are shared across targets, declare them only the first time we
     # encounter them
     if not qtest_executables.has_key(test)