diff mbox series

tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel

Message ID 20230302172211.4146376-1-peter.maydell@linaro.org
State Accepted
Commit c15d9e2311a54346e01afb22007d0868724ee8c8
Headers show
Series tests/qtest/migration-test: Disable migration/multifd/tcp/plain/cancel | expand

Commit Message

Peter Maydell March 2, 2023, 5:22 p.m. UTC
migration-test has been flaky for a long time, both in CI and
otherwise:

https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
(a FreeBSD job)
  32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR

on a local macos x86 box:
▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
 34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-i386: Failed to peek at channel
query-migrate shows failed migration: Unable to write to socket: Broken pipe
**
ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
 37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
query-migrate shows failed migration: Unable to write to socket: Broken pipe
**
ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)

In the cases where I've looked at the underlying log, this seems to
be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
specific subtest by default until somebody can track down the
underlying cause. Enthusiasts can opt back in by setting
QEMU_TEST_FLAKY_TESTS=1 in their environment.

We might need to disable more parts of this test if this isn't
sufficient to fix the flakiness.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
This is a slightly more targeted variation on my original
modest proposal.
---
 tests/qtest/migration-test.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Daniel P. Berrangé March 2, 2023, 5:34 p.m. UTC | #1
On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> migration-test has been flaky for a long time, both in CI and
> otherwise:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> 
> In the cases where I've looked at the underlying log, this seems to
> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> specific subtest by default until somebody can track down the
> underlying cause. Enthusiasts can opt back in by setting
> QEMU_TEST_FLAKY_TESTS=1 in their environment.

No objection to disabling the test. Given the many multifd fixes we
have seen, I fear that unlikely many of the flakey tests, this is
not merely a test problem, but rather has a decent chance of being
a real bug in migration code.

> 
> We might need to disable more parts of this test if this isn't
> sufficient to fix the flakiness.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> This is a slightly more targeted variation on my original
> modest proposal.
> ---
>  tests/qtest/migration-test.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 109bc8e7b13..d4ab3934ed2 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2572,8 +2572,14 @@ int main(int argc, char **argv)
>      qtest_add_func("/migration/auto_converge", test_migrate_auto_converge);
>      qtest_add_func("/migration/multifd/tcp/plain/none",
>                     test_multifd_tcp_none);
> -    qtest_add_func("/migration/multifd/tcp/plain/cancel",
> -                   test_multifd_tcp_cancel);
> +    /*
> +     * This test is flaky and sometimes fails in CI and otherwise:
> +     * don't run unless user opts in via environment variable.
> +     */
> +    if (getenv("QEMU_TEST_FLAKY_TESTS")) {
> +        qtest_add_func("/migration/multifd/tcp/plain/cancel",
> +                       test_multifd_tcp_cancel);
> +    }
>      qtest_add_func("/migration/multifd/tcp/plain/zlib",
>                     test_multifd_tcp_zlib);
>  #ifdef CONFIG_ZSTD
> -- 
> 2.34.1
> 
> 

With regards,
Daniel
Dr. David Alan Gilbert March 2, 2023, 5:37 p.m. UTC | #2
* Peter Maydell (peter.maydell@linaro.org) wrote:
> migration-test has been flaky for a long time, both in CI and
> otherwise:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> 
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> 
> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> 
> (test program exited with status code -6)
> 
> In the cases where I've looked at the underlying log, this seems to
> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> specific subtest by default until somebody can track down the
> underlying cause. Enthusiasts can opt back in by setting
> QEMU_TEST_FLAKY_TESTS=1 in their environment.
> 
> We might need to disable more parts of this test if this isn't
> sufficient to fix the flakiness.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> This is a slightly more targeted variation on my original
> modest proposal.
> ---
>  tests/qtest/migration-test.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 109bc8e7b13..d4ab3934ed2 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2572,8 +2572,14 @@ int main(int argc, char **argv)
>      qtest_add_func("/migration/auto_converge", test_migrate_auto_converge);
>      qtest_add_func("/migration/multifd/tcp/plain/none",
>                     test_multifd_tcp_none);
> -    qtest_add_func("/migration/multifd/tcp/plain/cancel",
> -                   test_multifd_tcp_cancel);
> +    /*
> +     * This test is flaky and sometimes fails in CI and otherwise:
> +     * don't run unless user opts in via environment variable.
> +     */
> +    if (getenv("QEMU_TEST_FLAKY_TESTS")) {
> +        qtest_add_func("/migration/multifd/tcp/plain/cancel",
> +                       test_multifd_tcp_cancel);
> +    }

OK, that seems reasonably selective.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>


(Cancel tests are always a pain; they can be racy with the test
completing before you fire the cancel; or 'cancel' itself
can hit lots of races inside the migration code if it's not written
carefully enough to expect a cancel).

Dave


>      qtest_add_func("/migration/multifd/tcp/plain/zlib",
>                     test_multifd_tcp_zlib);
>  #ifdef CONFIG_ZSTD
> -- 
> 2.34.1
>
Philippe Mathieu-Daudé March 2, 2023, 10:25 p.m. UTC | #3
On 2/3/23 18:37, Dr. David Alan Gilbert wrote:
> * Peter Maydell (peter.maydell@linaro.org) wrote:
>> migration-test has been flaky for a long time, both in CI and
>> otherwise:
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
>> (a FreeBSD job)
>>    32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>>
>> on a local macos x86 box:
>> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>>   34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> qemu-system-i386: Failed to peek at channel
>> query-migrate shows failed migration: Unable to write to socket: Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>>
>> (test program exited with status code -6)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>>
>> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>>   37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> query-migrate shows failed migration: Unable to write to socket: Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>>
>> (test program exited with status code -6)
>>
>> In the cases where I've looked at the underlying log, this seems to
>> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
>> specific subtest by default until somebody can track down the
>> underlying cause. Enthusiasts can opt back in by setting
>> QEMU_TEST_FLAKY_TESTS=1 in their environment.
>>
>> We might need to disable more parts of this test if this isn't
>> sufficient to fix the flakiness.
>>
>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>> ---
>> This is a slightly more targeted variation on my original
>> modest proposal.
>> ---
>>   tests/qtest/migration-test.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
>> index 109bc8e7b13..d4ab3934ed2 100644
>> --- a/tests/qtest/migration-test.c
>> +++ b/tests/qtest/migration-test.c
>> @@ -2572,8 +2572,14 @@ int main(int argc, char **argv)
>>       qtest_add_func("/migration/auto_converge", test_migrate_auto_converge);
>>       qtest_add_func("/migration/multifd/tcp/plain/none",
>>                      test_multifd_tcp_none);
>> -    qtest_add_func("/migration/multifd/tcp/plain/cancel",
>> -                   test_multifd_tcp_cancel);
>> +    /*
>> +     * This test is flaky and sometimes fails in CI and otherwise:
>> +     * don't run unless user opts in via environment variable.
>> +     */
>> +    if (getenv("QEMU_TEST_FLAKY_TESTS")) {
>> +        qtest_add_func("/migration/multifd/tcp/plain/cancel",
>> +                       test_multifd_tcp_cancel);
>> +    }
> 
> OK, that seems reasonably selective.
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> 
> (Cancel tests are always a pain; they can be racy with the test
> completing before you fire the cancel; or 'cancel' itself
> can hit lots of races inside the migration code if it's not written
> carefully enough to expect a cancel).

Set 'QEMU_TEST_FLAKY_TESTS' in your gitlab namespace:
https://docs.gitlab.com/ee/ci/variables/#for-a-project

(Also, set it in your local env or in the script you run
to automate your testing).
Thomas Huth March 3, 2023, 7:43 a.m. UTC | #4
On 02/03/2023 18.22, Peter Maydell wrote:
> migration-test has been flaky for a long time, both in CI and
> otherwise

Acked-by: Thomas Huth <thuth@redhat.com>

Do you want to apply it directly as a CI fix, or shall I queue it for my 
next pull request?

  Thomas
Juan Quintela March 3, 2023, 9:08 a.m. UTC | #5
Peter Maydell <peter.maydell@linaro.org> wrote:
> migration-test has been flaky for a long time, both in CI and
> otherwise:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> (a FreeBSD job)
>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>
> on a local macos x86 box:
> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-i386: Failed to peek at channel
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>
> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> **
> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>
> (test program exited with status code -6)
>
> In the cases where I've looked at the underlying log, this seems to
> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> specific subtest by default until somebody can track down the
> underlying cause. Enthusiasts can opt back in by setting
> QEMU_TEST_FLAKY_TESTS=1 in their environment.
>
> We might need to disable more parts of this test if this isn't
> sufficient to fix the flakiness.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Reviewed-by: Juan Quintela <quintela@redhat.com>

Do you want it to go through the migration tree, or get it directly, as
you want.

I lost my access to the ARM server for this week and was not able to
investigate this issue :-(
Juan Quintela March 3, 2023, 9:10 a.m. UTC | #6
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
>> migration-test has been flaky for a long time, both in CI and
>> otherwise:
>> 
>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
>> (a FreeBSD job)
>>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>> 
>> on a local macos x86 box:
>> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> qemu-system-i386: Failed to peek at channel
>> query-migrate shows failed migration: Unable to write to socket: Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>> 
>> (test program exited with status code -6)
>> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>> 
>> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
>>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
>> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
>> stderr:
>> query-migrate shows failed migration: Unable to write to socket: Broken pipe
>> **
>> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
>> 
>> (test program exited with status code -6)
>> 
>> In the cases where I've looked at the underlying log, this seems to
>> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
>> specific subtest by default until somebody can track down the
>> underlying cause. Enthusiasts can opt back in by setting
>> QEMU_TEST_FLAKY_TESTS=1 in their environment.
>
> No objection to disabling the test. Given the many multifd fixes we
> have seen, I fear that unlikely many of the flakey tests, this is
> not merely a test problem, but rather has a decent chance of being
> a real bug in migration code.

What is really weird with this failure is that:
- it only happens on non-x86
- on code that is not arch dependent
- on cancel, what we really do there is close fd's for the multifd
  channel threads to get out of the recv, i.e. again, nothing that
  should be arch dependent.

As said in the other email, I expect to get back access to ARM servers
next week.

Later, Juan.
Daniel P. Berrangé March 3, 2023, 9:12 a.m. UTC | #7
On Fri, Mar 03, 2023 at 10:10:28AM +0100, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> >> migration-test has been flaky for a long time, both in CI and
> >> otherwise:
> >> 
> >> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> >> (a FreeBSD job)
> >>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> >> 
> >> on a local macos x86 box:
> >> ▶  34/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
> >>  34/621 qemu:qtest+qtest-i386 / qtest-i386/migration-test                         ERROR          168.12s   killed by signal 6 SIGABRT
> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> qemu-system-i386: Failed to peek at channel
> >> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> >> **
> >> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> >> 
> >> (test program exited with status code -6)
> >> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> >> 
> >> ▶  37/621 ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed")) ERROR
> >>  37/621 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test                     ERROR          174.37s   killed by signal 6 SIGABRT
> >> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> >> stderr:
> >> query-migrate shows failed migration: Unable to write to socket: Broken pipe
> >> **
> >> ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed: assertion failed: (!g_str_equal(status, "failed"))
> >> 
> >> (test program exited with status code -6)
> >> 
> >> In the cases where I've looked at the underlying log, this seems to
> >> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> >> specific subtest by default until somebody can track down the
> >> underlying cause. Enthusiasts can opt back in by setting
> >> QEMU_TEST_FLAKY_TESTS=1 in their environment.
> >
> > No objection to disabling the test. Given the many multifd fixes we
> > have seen, I fear that unlikely many of the flakey tests, this is
> > not merely a test problem, but rather has a decent chance of being
> > a real bug in migration code.
> 
> What is really weird with this failure is that:
> - it only happens on non-x86

That doesn't seem right as the two reports Peter has above are both x86

> - on code that is not arch dependent
> - on cancel, what we really do there is close fd's for the multifd
>   channel threads to get out of the recv, i.e. again, nothing that
>   should be arch dependent.
> 
> As said in the other email, I expect to get back access to ARM servers
> next week.

With regards,
Daniel
Peter Maydell March 3, 2023, 11:18 a.m. UTC | #8
On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
>
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> >> migration-test has been flaky for a long time, both in CI and
> >> otherwise:
> >>
> >> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> >> (a FreeBSD job)
> >>   32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> >>
> >> on a local macos x86 box:



> What is really weird with this failure is that:
> - it only happens on non-x86

No, I have seen it on x86 macos, and x86 OpenBSD

> - on code that is not arch dependent
> - on cancel, what we really do there is close fd's for the multifd
>   channel threads to get out of the recv, i.e. again, nothing that
>   should be arch dependent.

I'm pretty sure that it tends to happen when the machine that's
running the test is heavily loaded. You probably have a race condition.

thanks
-- PMM
Thomas Huth March 3, 2023, 11:28 a.m. UTC | #9
On 03/03/2023 12.18, Peter Maydell wrote:
> On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
>>
>> Daniel P. Berrangé <berrange@redhat.com> wrote:
>>> On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
>>>> migration-test has been flaky for a long time, both in CI and
>>>> otherwise:
>>>>
>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
>>>> (a FreeBSD job)
>>>>    32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>>>>
>>>> on a local macos x86 box:
> 
> 
> 
>> What is really weird with this failure is that:
>> - it only happens on non-x86
> 
> No, I have seen it on x86 macos, and x86 OpenBSD
> 
>> - on code that is not arch dependent
>> - on cancel, what we really do there is close fd's for the multifd
>>    channel threads to get out of the recv, i.e. again, nothing that
>>    should be arch dependent.
> 
> I'm pretty sure that it tends to happen when the machine that's
> running the test is heavily loaded. You probably have a race condition.

I think I can second that. IIRC I've seen it a couple of times on my x86 
laptop when running "make check -j$(nproc) SPEED=slow" here.

  Thomas
Peter Maydell March 3, 2023, 11:43 a.m. UTC | #10
On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
>
> On 03/03/2023 12.18, Peter Maydell wrote:
> > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> >> What is really weird with this failure is that:
> >> - it only happens on non-x86
> >
> > No, I have seen it on x86 macos, and x86 OpenBSD
> >
> >> - on code that is not arch dependent
> >> - on cancel, what we really do there is close fd's for the multifd
> >>    channel threads to get out of the recv, i.e. again, nothing that
> >>    should be arch dependent.
> >
> > I'm pretty sure that it tends to happen when the machine that's
> > running the test is heavily loaded. You probably have a race condition.
>
> I think I can second that. IIRC I've seen it a couple of times on my x86
> laptop when running "make check -j$(nproc) SPEED=slow" here.

Also, to repeat an offer I've made before, on my x86 macos laptop
'make check -j4' fails on this test about one time in two. If you
write a patch to add whatever logging information you would find
useful for debugging, I'm happy to do test runs with that patch
and send you the logs.

-- PMM
Peter Maydell March 3, 2023, 12:05 p.m. UTC | #11
On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
>
> On 03/03/2023 12.18, Peter Maydell wrote:
> > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> >>
> >> Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>> On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> >>>> migration-test has been flaky for a long time, both in CI and
> >>>> otherwise:
> >>>>
> >>>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> >>>> (a FreeBSD job)
> >>>>    32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> >>>>
> >>>> on a local macos x86 box:
> >
> >
> >
> >> What is really weird with this failure is that:
> >> - it only happens on non-x86
> >
> > No, I have seen it on x86 macos, and x86 OpenBSD
> >
> >> - on code that is not arch dependent
> >> - on cancel, what we really do there is close fd's for the multifd
> >>    channel threads to get out of the recv, i.e. again, nothing that
> >>    should be arch dependent.
> >
> > I'm pretty sure that it tends to happen when the machine that's
> > running the test is heavily loaded. You probably have a race condition.
>
> I think I can second that. IIRC I've seen it a couple of times on my x86
> laptop when running "make check -j$(nproc) SPEED=slow" here.

And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
https://gitlab.com/qemu-project/qemu/-/jobs/3870165180

thanks
-- PMM
Peter Maydell March 4, 2023, 3:39 p.m. UTC | #12
On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> migration-test has been flaky for a long time, both in CI and
> otherwise:
>


> In the cases where I've looked at the underlying log, this seems to
> be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> specific subtest by default until somebody can track down the
> underlying cause. Enthusiasts can opt back in by setting
> QEMU_TEST_FLAKY_TESTS=1 in their environment.

So I'm going to apply this, because hopefully it will improve
the reliability a bit, but it's clearly not all of the
issues with migration-test, because in the course of the
run I was doing to test it before applying it I got this
error from the OpenBSD VM:

 32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
           ERROR          134.73s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
qemu-system-aarch64: Failed to connect to '127.0.0.1:19581': Address
already in use
query-migrate shows failed migration: Failed to connect to
'127.0.0.1:19581': Address already in use
**
ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)

thanks
-- PMM
Thomas Huth March 6, 2023, 1:08 p.m. UTC | #13
On 03/03/2023 13.05, Peter Maydell wrote:
> On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
>>
>> On 03/03/2023 12.18, Peter Maydell wrote:
>>> On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
>>>>
>>>> Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>> On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
>>>>>> migration-test has been flaky for a long time, both in CI and
>>>>>> otherwise:
>>>>>>
>>>>>> https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
>>>>>> (a FreeBSD job)
>>>>>>     32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
>>>>>>
>>>>>> on a local macos x86 box:
>>>
>>>
>>>
>>>> What is really weird with this failure is that:
>>>> - it only happens on non-x86
>>>
>>> No, I have seen it on x86 macos, and x86 OpenBSD
>>>
>>>> - on code that is not arch dependent
>>>> - on cancel, what we really do there is close fd's for the multifd
>>>>     channel threads to get out of the recv, i.e. again, nothing that
>>>>     should be arch dependent.
>>>
>>> I'm pretty sure that it tends to happen when the machine that's
>>> running the test is heavily loaded. You probably have a race condition.
>>
>> I think I can second that. IIRC I've seen it a couple of times on my x86
>> laptop when running "make check -j$(nproc) SPEED=slow" here.
> 
> And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> https://gitlab.com/qemu-project/qemu/-/jobs/3870165180

And FWIW, I just saw this while doing "make vm-build-netbsd J=4":

▶  31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE ERROR
  31/645 qemu:qtest+qtest-i386 / qtest-i386/migration-test                                  ERROR          25.21s   killed by signal 6 SIGABRT
>>> QTEST_QEMU_BINARY=./qemu-system-i386 MALLOC_PERTURB_=35 G_TEST_DBUS_DAEMON=/home/qemu/qemu-test.fYHKFz/src/tests/dbus-vmstate-daemon.sh QTEST_QEMU_IMG=./qemu-img /home/qemu/qemu-test.fYHKFz/build/tests/qtest/migration-test --tap -k
―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
stderr:
qemu: thread naming not supported on this host
qemu: thread naming not supported on this host
qemu: thread naming not supported on this host
qemu: thread naming not supported on this host
qemu: thread naming not supported on this host
qemu: thread naming not supported on this host
**
ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE

(test program exited with status code -6)

  Thomas
Dr. David Alan Gilbert March 6, 2023, 1:44 p.m. UTC | #14
* Thomas Huth (thuth@redhat.com) wrote:
> On 03/03/2023 13.05, Peter Maydell wrote:
> > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
> > > 
> > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> > > > > 
> > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > otherwise:
> > > > > > > 
> > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > (a FreeBSD job)
> > > > > > >     32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > > 
> > > > > > > on a local macos x86 box:
> > > > 
> > > > 
> > > > 
> > > > > What is really weird with this failure is that:
> > > > > - it only happens on non-x86
> > > > 
> > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > > 
> > > > > - on code that is not arch dependent
> > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > >     channel threads to get out of the recv, i.e. again, nothing that
> > > > >     should be arch dependent.
> > > > 
> > > > I'm pretty sure that it tends to happen when the machine that's
> > > > running the test is heavily loaded. You probably have a race condition.
> > > 
> > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> > 
> > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
> 
> And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
> 
> ▶  31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE ERROR

That one is kind of interesting; this is an auto converge test - so it
tries to setup migration so it won't finish, to check that the auto
converge kicks in.  Except in this case the migration *did* finish
without the autoconverge (significantly) kicking in.

So I guess any of:
  a) The CPU thread never got much CPU time so not much dirtying
happened.
  b) The bandwidth calculations might be bad enough/course enough
that it's passing the (very low) bandwidth limit due to bad
approximation at bandwidth needed.
  c) The autoconverge jump happens fast enough for that loop
to hit the got_stop in the loop time of that loop.

I guess we could:
  i) Reduce the usleep in test_migrate_auto_converge
    (So it is more likely to correctly drop out of that loop
    as soon as autoconverge kicks in)
  ii) Reduce inc_pct so that autoconverge kicks in slower
  iii) Reduce max-bandwidth in migrate_ensure_non_converge
     even further.

Dave

>  31/645 qemu:qtest+qtest-i386 / qtest-i386/migration-test                                  ERROR          25.21s   killed by signal 6 SIGABRT
> > > > QTEST_QEMU_BINARY=./qemu-system-i386 MALLOC_PERTURB_=35 G_TEST_DBUS_DAEMON=/home/qemu/qemu-test.fYHKFz/src/tests/dbus-vmstate-daemon.sh QTEST_QEMU_IMG=./qemu-img /home/qemu/qemu-test.fYHKFz/build/tests/qtest/migration-test --tap -k
> ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
> stderr:
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> qemu: thread naming not supported on this host
> **
> ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE
> 
> (test program exited with status code -6)
> 
>  Thomas
>
Daniel P. Berrangé March 6, 2023, 2 p.m. UTC | #15
On Mon, Mar 06, 2023 at 01:44:38PM +0000, Dr. David Alan Gilbert wrote:
> * Thomas Huth (thuth@redhat.com) wrote:
> > On 03/03/2023 13.05, Peter Maydell wrote:
> > > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
> > > > 
> > > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> > > > > > 
> > > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > > otherwise:
> > > > > > > > 
> > > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > > (a FreeBSD job)
> > > > > > > >     32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > > > 
> > > > > > > > on a local macos x86 box:
> > > > > 
> > > > > 
> > > > > 
> > > > > > What is really weird with this failure is that:
> > > > > > - it only happens on non-x86
> > > > > 
> > > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > > > 
> > > > > > - on code that is not arch dependent
> > > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > > >     channel threads to get out of the recv, i.e. again, nothing that
> > > > > >     should be arch dependent.
> > > > > 
> > > > > I'm pretty sure that it tends to happen when the machine that's
> > > > > running the test is heavily loaded. You probably have a race condition.
> > > > 
> > > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> > > 
> > > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
> > 
> > And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
> > 
> > ▶  31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE ERROR
> 
> That one is kind of interesting; this is an auto converge test - so it
> tries to setup migration so it won't finish, to check that the auto
> converge kicks in.  Except in this case the migration *did* finish
> without the autoconverge (significantly) kicking in.
> 
> So I guess any of:
>   a) The CPU thread never got much CPU time so not much dirtying
> happened.
>   b) The bandwidth calculations might be bad enough/course enough
> that it's passing the (very low) bandwidth limit due to bad
> approximation at bandwidth needed.
>   c) The autoconverge jump happens fast enough for that loop
> to hit the got_stop in the loop time of that loop.
> 
> I guess we could:
>   i) Reduce the usleep in test_migrate_auto_converge
>     (So it is more likely to correctly drop out of that loop
>     as soon as autoconverge kicks in)

The CPU time spent by the dirtying guest CPUs should dominate
here, so we can afford to reduce that timeout down a bit to
be more responsive.

>   ii) Reduce inc_pct so that autoconverge kicks in slower
>   iii) Reduce max-bandwidth in migrate_ensure_non_converge
>      even further.

migrate_ensure_non_converge is trying to guarantee non-convergance,
but obviously we're only achieving a probibalistic chance of
non-converage. To get the probably closer to 100% we should make
it massively smaller, say 100kbs instead of 30mbs.

With regards,
Daniel
Dr. David Alan Gilbert March 6, 2023, 2:09 p.m. UTC | #16
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Mar 06, 2023 at 01:44:38PM +0000, Dr. David Alan Gilbert wrote:
> > * Thomas Huth (thuth@redhat.com) wrote:
> > > On 03/03/2023 13.05, Peter Maydell wrote:
> > > > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
> > > > > 
> > > > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> > > > > > > 
> > > > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > > > otherwise:
> > > > > > > > > 
> > > > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > > > (a FreeBSD job)
> > > > > > > > >     32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > > > > 
> > > > > > > > > on a local macos x86 box:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > What is really weird with this failure is that:
> > > > > > > - it only happens on non-x86
> > > > > > 
> > > > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > > > > 
> > > > > > > - on code that is not arch dependent
> > > > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > > > >     channel threads to get out of the recv, i.e. again, nothing that
> > > > > > >     should be arch dependent.
> > > > > > 
> > > > > > I'm pretty sure that it tends to happen when the machine that's
> > > > > > running the test is heavily loaded. You probably have a race condition.
> > > > > 
> > > > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> > > > 
> > > > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
> > > 
> > > And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
> > > 
> > > ▶  31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE ERROR
> > 
> > That one is kind of interesting; this is an auto converge test - so it
> > tries to setup migration so it won't finish, to check that the auto
> > converge kicks in.  Except in this case the migration *did* finish
> > without the autoconverge (significantly) kicking in.
> > 
> > So I guess any of:
> >   a) The CPU thread never got much CPU time so not much dirtying
> > happened.
> >   b) The bandwidth calculations might be bad enough/course enough
> > that it's passing the (very low) bandwidth limit due to bad
> > approximation at bandwidth needed.
> >   c) The autoconverge jump happens fast enough for that loop
> > to hit the got_stop in the loop time of that loop.
> > 
> > I guess we could:
> >   i) Reduce the usleep in test_migrate_auto_converge
> >     (So it is more likely to correctly drop out of that loop
> >     as soon as autoconverge kicks in)
> 
> The CPU time spent by the dirtying guest CPUs should dominate
> here, so we can afford to reduce that timeout down a bit to
> be more responsive.
> 
> >   ii) Reduce inc_pct so that autoconverge kicks in slower
> >   iii) Reduce max-bandwidth in migrate_ensure_non_converge
> >      even further.
> 
> migrate_ensure_non_converge is trying to guarantee non-convergance,
> but obviously we're only achieving a probibalistic chance of
> non-converage. To get the probably closer to 100% we should make
> it massively smaller, say 100kbs instead of 30mbs.

Yeh, I'll cut a patch for this.

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Dr. David Alan Gilbert March 6, 2023, 3:17 p.m. UTC | #17
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Mar 06, 2023 at 01:44:38PM +0000, Dr. David Alan Gilbert wrote:
> > * Thomas Huth (thuth@redhat.com) wrote:
> > > On 03/03/2023 13.05, Peter Maydell wrote:
> > > > On Fri, 3 Mar 2023 at 11:29, Thomas Huth <thuth@redhat.com> wrote:
> > > > > 
> > > > > On 03/03/2023 12.18, Peter Maydell wrote:
> > > > > > On Fri, 3 Mar 2023 at 09:10, Juan Quintela <quintela@redhat.com> wrote:
> > > > > > > 
> > > > > > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > > On Thu, Mar 02, 2023 at 05:22:11PM +0000, Peter Maydell wrote:
> > > > > > > > > migration-test has been flaky for a long time, both in CI and
> > > > > > > > > otherwise:
> > > > > > > > > 
> > > > > > > > > https://gitlab.com/qemu-project/qemu/-/jobs/3806090216
> > > > > > > > > (a FreeBSD job)
> > > > > > > > >     32/648 ERROR:../tests/qtest/migration-helpers.c:205:wait_for_migration_status: assertion failed: (g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT) ERROR
> > > > > > > > > 
> > > > > > > > > on a local macos x86 box:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > What is really weird with this failure is that:
> > > > > > > - it only happens on non-x86
> > > > > > 
> > > > > > No, I have seen it on x86 macos, and x86 OpenBSD
> > > > > > 
> > > > > > > - on code that is not arch dependent
> > > > > > > - on cancel, what we really do there is close fd's for the multifd
> > > > > > >     channel threads to get out of the recv, i.e. again, nothing that
> > > > > > >     should be arch dependent.
> > > > > > 
> > > > > > I'm pretty sure that it tends to happen when the machine that's
> > > > > > running the test is heavily loaded. You probably have a race condition.
> > > > > 
> > > > > I think I can second that. IIRC I've seen it a couple of times on my x86
> > > > > laptop when running "make check -j$(nproc) SPEED=slow" here.
> > > > 
> > > > And another on-x86 failure case, just now, on the FreeBSD x86 CI job:
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/3870165180
> > > 
> > > And FWIW, I just saw this while doing "make vm-build-netbsd J=4":
> > > 
> > > ▶  31/645 ERROR:../src/tests/qtest/migration-test.c:1841:test_migrate_auto_converge: 'got_stop' should be FALSE ERROR
> > 
> > That one is kind of interesting; this is an auto converge test - so it
> > tries to setup migration so it won't finish, to check that the auto
> > converge kicks in.  Except in this case the migration *did* finish
> > without the autoconverge (significantly) kicking in.
> > 
> > So I guess any of:
> >   a) The CPU thread never got much CPU time so not much dirtying
> > happened.
> >   b) The bandwidth calculations might be bad enough/course enough
> > that it's passing the (very low) bandwidth limit due to bad
> > approximation at bandwidth needed.
> >   c) The autoconverge jump happens fast enough for that loop
> > to hit the got_stop in the loop time of that loop.
> > 
> > I guess we could:
> >   i) Reduce the usleep in test_migrate_auto_converge
> >     (So it is more likely to correctly drop out of that loop
> >     as soon as autoconverge kicks in)
> 
> The CPU time spent by the dirtying guest CPUs should dominate
> here, so we can afford to reduce that timeout down a bit to
> be more responsive.
> 
> >   ii) Reduce inc_pct so that autoconverge kicks in slower
> >   iii) Reduce max-bandwidth in migrate_ensure_non_converge
> >      even further.
> 
> migrate_ensure_non_converge is trying to guarantee non-convergance,
> but obviously we're only achieving a probibalistic chance of
> non-converage. To get the probably closer to 100% we should make
> it massively smaller, say 100kbs instead of 30mbs.

Interestingly that doesn't work; I've not quite worked out why yet; I
pushed it to 30kbps and it got stuck in the basic fd test.  My guess is
that it's starving something so it never makes any forward progress.

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
Peter Maydell March 7, 2023, 9:53 a.m. UTC | #18
On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > migration-test has been flaky for a long time, both in CI and
> > otherwise:
> >
>
>
> > In the cases where I've looked at the underlying log, this seems to
> > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > specific subtest by default until somebody can track down the
> > underlying cause. Enthusiasts can opt back in by setting
> > QEMU_TEST_FLAKY_TESTS=1 in their environment.
>
> So I'm going to apply this, because hopefully it will improve
> the reliability a bit, but it's clearly not all of the
> issues with migration-test, because in the course of the
> run I was doing to test it before applying it I got this
> error from the OpenBSD VM:
>
>  32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
>            ERROR          134.73s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
> qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
> qemu-system-aarch64: Failed to connect to '127.0.0.1:19581': Address
> already in use
> query-migrate shows failed migration: Failed to connect to
> '127.0.0.1:19581': Address already in use
> **
> ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> assertion failed: (!g_str_equal(status, "failed"))
>
> (test program exited with status code -6)

Got another repeat of this one today; again, on the OpenBSD VM:

 32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
           ERROR
  131.28s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
qemu-system-aarch64: Failed to connect to '127.0.0.1:30312': Address
already in use
query-migrate shows failed migration: Failed to connect to
'127.0.0.1:30312': Address already i
n use
**
ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
assertion failed: (!
g_str_equal(status, "failed"))

(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

-- PMM
Peter Maydell March 12, 2023, 2:06 p.m. UTC | #19
On Tue, 7 Mar 2023 at 09:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > migration-test has been flaky for a long time, both in CI and
> > > otherwise:
> > >
> >
> >
> > > In the cases where I've looked at the underlying log, this seems to
> > > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > > specific subtest by default until somebody can track down the
> > > underlying cause. Enthusiasts can opt back in by setting
> > > QEMU_TEST_FLAKY_TESTS=1 in their environment.
> >
> > So I'm going to apply this, because hopefully it will improve
> > the reliability a bit, but it's clearly not all of the
> > issues with migration-test

Here's another one, on the s390x CI runner:
https://gitlab.com/qemu-project/qemu/-/jobs/3917587994

31/658 qemu:qtest+qtest-i386 / qtest-i386/migration-test ERROR 34.36s
killed by signal 6 SIGABRT
>>> QTEST_QEMU_STORAGE_DAEMON_BINARY=./storage-daemon/qemu-storage-daemon QTEST_QEMU_IMG=./qemu-img QTEST_QEMU_BINARY=./qemu-system-i386 G_TEST_DBUS_DAEMON=/home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/tests/dbus-vmstate-daemon.sh MALLOC_PERTURB_=154 /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
――――――――――――――――――――――――――――――――――――― ✀ ―――――――――――――――――――――――――――――――――――――
stderr:
Broken pipe
../tests/qtest/libqtest.c:193: kill_qemu() detected QEMU death from
signal 11 (Segmentation fault) (core dumped)
TAP parsing error: Too few tests run (expected 40, got 11)
(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――


Do you folks have a concrete proposal for tracking down and
fixing the flakiness in this test? If not, we'll have to
just disable it entirely, I think.

thanks
-- PMM
Peter Maydell March 12, 2023, 5:46 p.m. UTC | #20
On Sun, 12 Mar 2023 at 14:06, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Tue, 7 Mar 2023 at 09:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > migration-test has been flaky for a long time, both in CI and
> > > > otherwise:
> > > >
> > >
> > >
> > > > In the cases where I've looked at the underlying log, this seems to
> > > > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > > > specific subtest by default until somebody can track down the
> > > > underlying cause. Enthusiasts can opt back in by setting
> > > > QEMU_TEST_FLAKY_TESTS=1 in their environment.
> > >
> > > So I'm going to apply this, because hopefully it will improve
> > > the reliability a bit, but it's clearly not all of the
> > > issues with migration-test
>
> Here's another one, on the s390x CI runner:
> https://gitlab.com/qemu-project/qemu/-/jobs/3917587994

And here's a backtrace from a hung migration-test on the s390x
runner (looks like a deadlock, none of these processes were
using CPU):

Process tree:
migration-test(1464515)─┬─qemu-system-aar(1465117)
                        └─qemu-system-aar(1465125)
===========================================================
PROCESS: 1464515
gitlab-+ 1464515 1463129  0 14:27 ?        00:00:02
/home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test
--tap -k
[New LWP 1464517]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
__libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffc06f9987,
len=len@entry=1, flags=flags@entry=0) at
../sysdeps/unix/sysv/linux/recv.c:30
30      ../sysdeps/unix/sysv/linux/recv.c: No such file or directory.

Thread 2 (Thread 0x3ff9faff900 (LWP 1464517)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1175ae34 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa118d96b8 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:464
#3  0x000002aa117879ea in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa11759e9a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#5  0x000003ffa0487e66 in start_thread (arg=0x3ff9faff900) at
pthread_create.c:477
#6  0x000003ffa037cbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffa0b6d990 (LWP 1464515)):
#0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffc06f9987,
len=len@entry=1, flags=flags@entry=0) at
../sysdeps/unix/sysv/linux/recv.c:30
#1  0x000002aa1172ba22 in recv (__flags=0, __n=1, __buf=0x3ffc06f9987,
__fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
#2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
#3  0x000002aa11729b8e in qtest_qmp_receive_dict (s=0x2aa129b97d0) at
../tests/qtest/libqtest.c:837
#4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>)
at ../tests/qtest/libqtest.c:837
#5  qtest_qmp_eventwait_ref (s=0x2aa129b97d0, event=<optimized out>)
at ../tests/qtest/libqtest.c:828
#6  0x000002aa11729c1a in qtest_qmp_eventwait (s=<optimized out>,
event=<optimized out>) at ../tests/qtest/libqtest.c:850
#7  0x000002aa1171ec56 in test_postcopy_common (args=0x3ffc06f9c88) at
../tests/qtest/migration-test.c:1161
#8  0x000002aa1171edb4 in test_postcopy_preempt () at
../tests/qtest/migration-test.c:1178
#9  0x000003ffa067e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ffa067eada in g_test_run_suite () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ffa067eb10 in g_test_run () from
/lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000002aa1171ab14 in main (argc=<optimized out>, argv=<optimized
out>) at ../tests/qtest/migration-test.c:2615
[Inferior 1 (process 1464515) detached]

===========================================================
PROCESS: 1465117
gitlab-+ 1465117 1464515  0 14:28 ?        00:00:01
./qemu-system-aarch64 -qtest unix:/tmp/qtest-1464515.sock -qtest-log
/dev/null -chardev socket,path=/tmp/qtest-1464515.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg
-machine virt,gic-version=max -name source,debug-threads=on -m 150M
-serial file:/tmp/migration-test-G3X601/src_serial -cpu max -kernel
/tmp/migration-test-G3X601/bootsect -accel qtest
[New LWP 1465122]
[New LWP 1465123]
[New LWP 1465124]
[New LWP 1465137]
[New LWP 1465138]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ffa66f1c8c in __ppoll (fds=0x2aa1f1c96f0, nfds=5,
timeout=<optimized out>, timeout@entry=0x3fff71f9968,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 6 (Thread 0x3ff950fe900 (LWP 1465138)):
#0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff950f9800,
flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
#1  0x000002aa1cb4c81a in qio_channel_socket_writev (ioc=<optimized
out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
nfds=<optimized out>, flags=0, errp=0x3ff950f9b70) at
../io/channel-socket.c:605
#2  0x000002aa1cb51e78 in qio_channel_writev_full
(ioc=ioc@entry=0x2aa1e8ee400, iov=0x2aa1f1bac90, niov=64,
fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff950f9b70) at
../io/channel.c:108
#3  0x000002aa1cb52cf0 in qio_channel_writev_full_all
(ioc=0x2aa1e8ee400, iov=iov@entry=0x2aa1e5fe4d0, niov=<optimized out>,
fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff950f9b70) at
../io/channel.c:263
#4  0x000002aa1cb52d9e in qio_channel_writev_all (ioc=<optimized out>,
iov=iov@entry=0x2aa1e5fe4d0, niov=<optimized out>,
errp=errp@entry=0x3ff950f9b70) at ../io/channel.c:242
#5  0x000002aa1c6de2a4 in qemu_fflush (f=f@entry=0x2aa1e5f6490) at
../migration/qemu-file.c:302
#6  0x000002aa1c6de532 in qemu_fflush (f=0x2aa1e5f6490) at
../migration/qemu-file.c:297
#7  add_to_iovec (f=f@entry=0x2aa1e5f6490, buf=<optimized out>,
size=size@entry=4096, may_free=<optimized out>) at
../migration/qemu-file.c:510
#8  0x000002aa1c6dea16 in qemu_put_buffer_async
(f=f@entry=0x2aa1e5f6490, buf=<error reading variable: value has been
optimized out>, size=size@entry=4096, may_free=<optimized out>) at
../migration/qemu-file.c:535
#9  0x000002aa1cabcad6 in save_normal_page (async=<optimized out>,
buf=<optimized out>, offset=1306624, block=0x2aa1e60cab0,
pss=0x3ff28001440) at ../migration/ram.c:1400
#10 ram_save_page (pss=0x3ff28001440, rs=<optimized out>) at
../migration/ram.c:1449
#11 ram_save_target_page_legacy (rs=<optimized out>,
pss=0x3ff28001440) at ../migration/ram.c:2381
#12 0x000002aa1cab9906 in ram_save_host_page (pss=0x3ff28001440,
rs=0x3ff28001440) at ../migration/ram.c:2539
#13 ram_find_and_save_block (rs=rs@entry=0x3ff28001440) at
../migration/ram.c:2620
#14 0x000002aa1cab9fe4 in ram_save_iterate (f=0x2aa1e5f6490,
opaque=<optimized out>) at ../migration/ram.c:3361
#15 0x000002aa1c6fa934 in qemu_savevm_state_iterate (f=0x2aa1e5f6490,
postcopy=<optimized out>) at ../migration/savevm.c:1345
#16 0x000002aa1c6ede24 in migration_iteration_run (s=0x2aa1e2cdcd0) at
../migration/migration.c:3896
#17 migration_thread (opaque=opaque@entry=0x2aa1e2cdcd0) at
../migration/migration.c:4124
#18 0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#19 0x000003ffa6807e66 in start_thread (arg=0x3ff950fe900) at
pthread_create.c:477
#20 0x000003ffa66fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 5 (Thread 0x3ff958ff900 (LWP 1465137)):
#0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff958fa9f0,
flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
#1  0x000002aa1cb4cad4 in qio_channel_socket_readv (ioc=<optimized
out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
flags=0, errp=0x3ff958faca8) at ../io/channel-socket.c:532
#2  0x000002aa1cb51d60 in qio_channel_readv_full (ioc=0x2aa1e8ee400,
iov=iov@entry=0x3ff958fabf0, niov=niov@entry=1, fds=fds@entry=0x0,
nfds=nfds@entry=0x0, flags=0, errp=0x3ff958faca8) at
../io/channel.c:74
#3  0x000002aa1cb51fe6 in qio_channel_read (ioc=<optimized out>,
buf=buf@entry=0x2aa1e7a3c38 "", buflen=buflen@entry=32768,
errp=errp@entry=0x3ff958faca8) at ../io/channel.c:314
#4  0x000002aa1c6de13e in qemu_fill_buffer (f=0x2aa1e7a3c00) at
../migration/qemu-file.c:415
#5  0x000002aa1c6ded18 in qemu_peek_byte (f=0x2aa1e7a3c00,
offset=<optimized out>) at ../migration/qemu-file.c:707
#6  0x000002aa1c6defba in qemu_get_byte (f=0x2aa1e7a3c00) at
../migration/qemu-file.c:720
#7  qemu_get_be16 (f=f@entry=0x2aa1e7a3c00) at ../migration/qemu-file.c:800
#8  0x000002aa1c6ea7e8 in source_return_path_thread
(opaque=opaque@entry=0x2aa1e2cdcd0) at ../migration/migration.c:2995
#9  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#10 0x000003ffa6807e66 in start_thread (arg=0x3ff958ff900) at
pthread_create.c:477
#11 0x000003ffa66fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff9662e900 (LWP 1465124)):
#0  futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa1e5daeb8) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x2aa1d75bf80 <qemu_global_mutex>, cond=0x2aa1e5dae90) at
pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa1e5dae90,
mutex=mutex@entry=0x2aa1d75bf80 <qemu_global_mutex>) at
pthread_cond_wait.c:647
#3  0x000002aa1ccc500e in qemu_cond_wait_impl (cond=0x2aa1e5dae90,
mutex=0x2aa1d75bf80 <qemu_global_mutex>, file=0x2aa1ce5d9bc
"../softmmu/cpus.c", line=<optimized out>) at
../util/qemu-thread-posix.c:225
#4  0x000002aa1c6bdf0e in qemu_wait_io_event
(cpu=cpu@entry=0x2aa1e70c750) at ../softmmu/cpus.c:424
#5  0x000002aa1cb34412 in mttcg_cpu_thread_fn
(arg=arg@entry=0x2aa1e70c750) at
../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#7  0x000003ffa6807e66 in start_thread (arg=0x3ff9662e900) at
pthread_create.c:477
#8  0x000003ffa66fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff97021900 (LWP 1465123)):
#0  0x000003ffa66f1b32 in __GI___poll (fds=0x3ff480036b0, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffa8ed4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffa8ed4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa1cb74afe in iothread_run
(opaque=opaque@entry=0x2aa1e371400) at ../iothread.c:70
#4  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#5  0x000003ffa6807e66 in start_thread (arg=0x3ff97021900) at
pthread_create.c:477
#6  0x000003ffa66fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff97923900 (LWP 1465122)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1ccc55c4 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa1d78e5d0 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:464
#3  0x000002aa1ccce19a in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#5  0x000003ffa6807e66 in start_thread (arg=0x3ff97923900) at
pthread_create.c:477
#6  0x000003ffa66fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffaa07e4a0 (LWP 1465117)):
#0  0x000003ffa66f1c8c in __ppoll (fds=0x2aa1f1c96f0, nfds=5,
timeout=<optimized out>, timeout@entry=0x3fff71f9968,
sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa1ccda44a in ppoll (__ss=0x0, __timeout=0x3fff71f9968,
__nfds=<optimized out>, __fds=<optimized out>) at
/usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
#3  0x000002aa1ccd7460 in os_host_main_loop_wait (timeout=1000000000)
at ../util/main-loop.c:312
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
#5  0x000002aa1c6c4ae4 in qemu_main_loop () at ../softmmu/runstate.c:731
#6  0x000002aa1c42c5ba in qemu_default_main () at ../softmmu/main.c:37
#7  0x000003ffa662440a in __libc_start_main (main=0x2aa1c428b90
<main>, argc=<optimized out>, argv=0x3fff71f9db8, init=<optimized
out>, fini=<optimized out>, rtld_fini=0x3ffaa090e50 <_dl_fini>,
stack_end=0x3fff71f9d00) at libc-start.c:308
#8  0x000002aa1c42c4e4 in _start () at ../softmmu/main.c:48
[Inferior 1 (process 1465117) detached]

===========================================================
PROCESS: 1465125
gitlab-+ 1465125 1464515  0 14:28 ?        00:00:00
./qemu-system-aarch64 -qtest unix:/tmp/qtest-1464515.sock -qtest-log
/dev/null -chardev socket,path=/tmp/qtest-1464515.qmp,id=char0 -mon
chardev=char0,mode=control -display none -accel kvm -accel tcg
-machine virt,gic-version=max -name target,debug-threads=on -m 150M
-serial file:/tmp/migration-test-G3X601/dest_serial -incoming
unix:/tmp/migration-test-G3X601/migsocket -cpu max -kernel
/tmp/migration-test-G3X601/bootsect -accel qtest
[New LWP 1465133]
[New LWP 1465134]
[New LWP 1465135]
[New LWP 1465140]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa2b37b560) at ../sysdeps/nptl/futex-internal.h:183
183     ../sysdeps/nptl/futex-internal.h: No such file or directory.

Thread 5 (Thread 0x3ff749fe900 (LWP 1465140)):
#0  0x000003ff858f1b32 in __GI___poll (fds=fds@entry=0x3ff10000b60,
nfds=2, nfds@entry=<error reading variable: value has been optimized
out>, timeout=timeout@entry=-1) at
../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000002aa284f520e in poll (__timeout=-1, __nfds=<optimized out>,
__fds=0x3ff10000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
#2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa2b37b3a0) at
../migration/postcopy-ram.c:952
#3  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#4  0x000003ff85a07e66 in start_thread (arg=0x3ff749fe900) at
pthread_create.c:477
#5  0x000003ff858fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff7582e900 (LWP 1465135)):
#0  futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa2b82310c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x2aa2955bf80 <qemu_global_mutex>, cond=0x2aa2b8230e0) at
pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa2b8230e0,
mutex=mutex@entry=0x2aa2955bf80 <qemu_global_mutex>) at
pthread_cond_wait.c:647
#3  0x000002aa28ac500e in qemu_cond_wait_impl (cond=0x2aa2b8230e0,
mutex=0x2aa2955bf80 <qemu_global_mutex>, file=0x2aa28c5d9bc
"../softmmu/cpus.c", line=<optimized out>) at
../util/qemu-thread-posix.c:225
#4  0x000002aa284bdf0e in qemu_wait_io_event
(cpu=cpu@entry=0x2aa2b7b9750) at ../softmmu/cpus.c:424
#5  0x000002aa28934412 in mttcg_cpu_thread_fn
(arg=arg@entry=0x2aa2b7b9750) at
../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#7  0x000003ff85a07e66 in start_thread (arg=0x3ff7582e900) at
pthread_create.c:477
#8  0x000003ff858fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff76221900 (LWP 1465134)):
#0  0x000003ff858f1b32 in __GI___poll (fds=0x3ff280036b0, nfds=3,
timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff880d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff880d4790 in g_main_loop_run () at
/lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa28974afe in iothread_run
(opaque=opaque@entry=0x2aa2b41e400) at ../iothread.c:70
#4  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#5  0x000003ff85a07e66 in start_thread (arg=0x3ff76221900) at
pthread_create.c:477
#6  0x000003ff858fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff76b23900 (LWP 1465133)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa28ac55c4 in qemu_futex_wait (val=<optimized out>,
f=<optimized out>) at
/home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa2958e5d0 <rcu_call_ready_event>)
at ../util/qemu-thread-posix.c:464
#3  0x000002aa28ace19a in call_rcu_thread (opaque=opaque@entry=0x0) at
../util/rcu.c:261
#4  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
../util/qemu-thread-posix.c:541
#5  0x000003ff85a07e66 in start_thread (arg=0x3ff76b23900) at
pthread_create.c:477
#6  0x000003ff858fcbe6 in thread_start () at
../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff8927e4a0 (LWP 1465125)):
#0  futex_wait_cancelable (private=0, expected=0,
futex_word=0x2aa2b37b560) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x2aa2b37b508, cond=0x2aa2b37b538) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa2b37b538,
mutex=mutex@entry=0x2aa2b37b508) at pthread_cond_wait.c:647
#3  0x000002aa28ac500e in qemu_cond_wait_impl (cond=0x2aa2b37b538,
mutex=0x2aa2b37b508, file=0x2aa28d70dee "../util/qemu-thread-posix.c",
line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa28ac542e in qemu_sem_wait (sem=0x2aa2b37b508) at
../util/qemu-thread-posix.c:314
#5  0x000002aa284f485a in postcopy_ram_incoming_setup
(mis=0x2aa2b37b3a0, mis@entry=<error reading variable: value has been
optimized out>) at ../migration/postcopy-ram.c:1204
#6  0x000002aa284fbf70 in loadvm_postcopy_handle_listen
(mis=<optimized out>) at ../migration/savevm.c:2016
#7  loadvm_process_command (f=0x2aa2b85b7a0) at ../migration/savevm.c:2379
#8  qemu_loadvm_state_main (f=f@entry=0x2aa2b85b7a0,
mis=0x2aa2b37b3a0) at ../migration/savevm.c:2724
#9  0x000002aa284fc2fe in loadvm_handle_cmd_packaged (mis=<optimized
out>) at ../migration/savevm.c:2248
#10 loadvm_process_command (f=0x2aa2b6a3490) at ../migration/savevm.c:2373
#11 qemu_loadvm_state_main (f=f@entry=0x2aa2b6a3490,
mis=mis@entry=0x2aa2b37b3a0) at ../migration/savevm.c:2724
#12 0x000002aa284fcf14 in qemu_loadvm_state (f=0x2aa2b6a3490) at
../migration/savevm.c:2790
#13 0x000002aa284e9f64 in process_incoming_migration_co
(opaque=<optimized out>) at ../migration/migration.c:606
#14 0x000002aa28ad92e6 in coroutine_trampoline (i0=<optimized out>,
i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#15 0x000003ff858569c2 in __makecontext_ret () at
../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
[Inferior 1 (process 1465125) detached]

-- PMM
Dr. David Alan Gilbert March 14, 2023, 10:11 a.m. UTC | #21
Copying Peter Xu on this one since it's poscopy, especially the newer
postcopy preempt.

* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Sun, 12 Mar 2023 at 14:06, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Tue, 7 Mar 2023 at 09:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > > >
> > > > > migration-test has been flaky for a long time, both in CI and
> > > > > otherwise:
> > > > >
> > > >
> > > >
> > > > > In the cases where I've looked at the underlying log, this seems to
> > > > > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > > > > specific subtest by default until somebody can track down the
> > > > > underlying cause. Enthusiasts can opt back in by setting
> > > > > QEMU_TEST_FLAKY_TESTS=1 in their environment.
> > > >
> > > > So I'm going to apply this, because hopefully it will improve
> > > > the reliability a bit, but it's clearly not all of the
> > > > issues with migration-test
> >
> > Here's another one, on the s390x CI runner:
> > https://gitlab.com/qemu-project/qemu/-/jobs/3917587994
> 
> And here's a backtrace from a hung migration-test on the s390x
> runner (looks like a deadlock, none of these processes were
> using CPU):

Ah a backtrace!
OK, I think I kind of see what's happening here, one for Peter Xu.
If I'm right it's a race something like:
  a) The test harness tells the source it wants to enter postcopy
  b) The harness then waits for the source to stop
  c) ... and the dest to start 

  It's blocked on one of b&c but can't tell which

  d) The main thread in the dest is waiting for the postcopy recovery fd
    to be opened
  e) But I think the source is still trying to send normal precopy RAM
    and perhaps hasn't got around yet to opening that socket yet????
  f) But I think the dest isn't reading from the main channel at that
    point because of (d)

Dave

> Process tree:
> migration-test(1464515)─┬─qemu-system-aar(1465117)
>                         └─qemu-system-aar(1465125)
> ===========================================================
> PROCESS: 1464515
> gitlab-+ 1464515 1463129  0 14:27 ?        00:00:02
> /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test
> --tap -k
> [New LWP 1464517]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffc06f9987,
> len=len@entry=1, flags=flags@entry=0) at
> ../sysdeps/unix/sysv/linux/recv.c:30
> 30      ../sysdeps/unix/sysv/linux/recv.c: No such file or directory.
> 
> Thread 2 (Thread 0x3ff9faff900 (LWP 1464517)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa1175ae34 in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>) at
> /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa118d96b8 <rcu_call_ready_event>)
> at ../util/qemu-thread-posix.c:464
> #3  0x000002aa117879ea in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../util/rcu.c:261
> #4  0x000002aa11759e9a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #5  0x000003ffa0487e66 in start_thread (arg=0x3ff9faff900) at
> pthread_create.c:477
> #6  0x000003ffa037cbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffa0b6d990 (LWP 1464515)):
> #0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffc06f9987,
> len=len@entry=1, flags=flags@entry=0) at
> ../sysdeps/unix/sysv/linux/recv.c:30
> #1  0x000002aa1172ba22 in recv (__flags=0, __n=1, __buf=0x3ffc06f9987,
> __fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
> #2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
> #3  0x000002aa11729b8e in qtest_qmp_receive_dict (s=0x2aa129b97d0) at
> ../tests/qtest/libqtest.c:837
> #4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>)
> at ../tests/qtest/libqtest.c:837
> #5  qtest_qmp_eventwait_ref (s=0x2aa129b97d0, event=<optimized out>)
> at ../tests/qtest/libqtest.c:828
> #6  0x000002aa11729c1a in qtest_qmp_eventwait (s=<optimized out>,
> event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #7  0x000002aa1171ec56 in test_postcopy_common (args=0x3ffc06f9c88) at
> ../tests/qtest/migration-test.c:1161

so I think that's the call:
    migrate_postcopy_start(from, to);
and migrate_postcopy_start has a pair of qtest_qmp_eventwait's in
but we can't tell which.

So it's waiting for an event after it's been told to enter postcopy
mode.

> #8  0x000002aa1171edb4 in test_postcopy_preempt () at
> ../tests/qtest/migration-test.c:1178
> #9  0x000003ffa067e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #10 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #11 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #12 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #13 0x000003ffa067e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #14 0x000003ffa067eada in g_test_run_suite () from
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #15 0x000003ffa067eb10 in g_test_run () from
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #16 0x000002aa1171ab14 in main (argc=<optimized out>, argv=<optimized
> out>) at ../tests/qtest/migration-test.c:2615
> [Inferior 1 (process 1464515) detached]
> 
> ===========================================================
> PROCESS: 1465117
> gitlab-+ 1465117 1464515  0 14:28 ?        00:00:01
> ./qemu-system-aarch64 -qtest unix:/tmp/qtest-1464515.sock -qtest-log
> /dev/null -chardev socket,path=/tmp/qtest-1464515.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg
> -machine virt,gic-version=max -name source,debug-threads=on -m 150M
> -serial file:/tmp/migration-test-G3X601/src_serial -cpu max -kernel
> /tmp/migration-test-G3X601/bootsect -accel qtest
> [New LWP 1465122]
> [New LWP 1465123]
> [New LWP 1465124]
> [New LWP 1465137]
> [New LWP 1465138]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> 0x000003ffa66f1c8c in __ppoll (fds=0x2aa1f1c96f0, nfds=5,
> timeout=<optimized out>, timeout@entry=0x3fff71f9968,
> sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> 44      ../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
> 
> Thread 6 (Thread 0x3ff950fe900 (LWP 1465138)):
> #0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff950f9800,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
> #1  0x000002aa1cb4c81a in qio_channel_socket_writev (ioc=<optimized
> out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>,
> nfds=<optimized out>, flags=0, errp=0x3ff950f9b70) at
> ../io/channel-socket.c:605
> #2  0x000002aa1cb51e78 in qio_channel_writev_full
> (ioc=ioc@entry=0x2aa1e8ee400, iov=0x2aa1f1bac90, niov=64,
> fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff950f9b70) at
> ../io/channel.c:108
> #3  0x000002aa1cb52cf0 in qio_channel_writev_full_all
> (ioc=0x2aa1e8ee400, iov=iov@entry=0x2aa1e5fe4d0, niov=<optimized out>,
> fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff950f9b70) at
> ../io/channel.c:263
> #4  0x000002aa1cb52d9e in qio_channel_writev_all (ioc=<optimized out>,
> iov=iov@entry=0x2aa1e5fe4d0, niov=<optimized out>,
> errp=errp@entry=0x3ff950f9b70) at ../io/channel.c:242
> #5  0x000002aa1c6de2a4 in qemu_fflush (f=f@entry=0x2aa1e5f6490) at
> ../migration/qemu-file.c:302
> #6  0x000002aa1c6de532 in qemu_fflush (f=0x2aa1e5f6490) at
> ../migration/qemu-file.c:297
> #7  add_to_iovec (f=f@entry=0x2aa1e5f6490, buf=<optimized out>,
> size=size@entry=4096, may_free=<optimized out>) at
> ../migration/qemu-file.c:510
> #8  0x000002aa1c6dea16 in qemu_put_buffer_async
> (f=f@entry=0x2aa1e5f6490, buf=<error reading variable: value has been
> optimized out>, size=size@entry=4096, may_free=<optimized out>) at
> ../migration/qemu-file.c:535
> #9  0x000002aa1cabcad6 in save_normal_page (async=<optimized out>,
> buf=<optimized out>, offset=1306624, block=0x2aa1e60cab0,
> pss=0x3ff28001440) at ../migration/ram.c:1400
> #10 ram_save_page (pss=0x3ff28001440, rs=<optimized out>) at
> ../migration/ram.c:1449
> #11 ram_save_target_page_legacy (rs=<optimized out>,
> pss=0x3ff28001440) at ../migration/ram.c:2381
> #12 0x000002aa1cab9906 in ram_save_host_page (pss=0x3ff28001440,
> rs=0x3ff28001440) at ../migration/ram.c:2539
> #13 ram_find_and_save_block (rs=rs@entry=0x3ff28001440) at
> ../migration/ram.c:2620
> #14 0x000002aa1cab9fe4 in ram_save_iterate (f=0x2aa1e5f6490,
> opaque=<optimized out>) at ../migration/ram.c:3361
> #15 0x000002aa1c6fa934 in qemu_savevm_state_iterate (f=0x2aa1e5f6490,
> postcopy=<optimized out>) at ../migration/savevm.c:1345
> #16 0x000002aa1c6ede24 in migration_iteration_run (s=0x2aa1e2cdcd0) at
> ../migration/migration.c:3896

To me this doesn't look like the postcopy flow; this looks like the
normal migration precopy flow; which is probably why the test thread
hasn't got it's event yet.  If you're saying it's not using CPU, that
suggests this outbound writer thread is blocked on the pipe to the
destination.

> #17 migration_thread (opaque=opaque@entry=0x2aa1e2cdcd0) at
> ../migration/migration.c:4124
> #18 0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #19 0x000003ffa6807e66 in start_thread (arg=0x3ff950fe900) at
> pthread_create.c:477
> #20 0x000003ffa66fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 5 (Thread 0x3ff958ff900 (LWP 1465137)):
> #0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff958fa9f0,
> flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
> #1  0x000002aa1cb4cad4 in qio_channel_socket_readv (ioc=<optimized
> out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0,
> flags=0, errp=0x3ff958faca8) at ../io/channel-socket.c:532
> #2  0x000002aa1cb51d60 in qio_channel_readv_full (ioc=0x2aa1e8ee400,
> iov=iov@entry=0x3ff958fabf0, niov=niov@entry=1, fds=fds@entry=0x0,
> nfds=nfds@entry=0x0, flags=0, errp=0x3ff958faca8) at
> ../io/channel.c:74
> #3  0x000002aa1cb51fe6 in qio_channel_read (ioc=<optimized out>,
> buf=buf@entry=0x2aa1e7a3c38 "", buflen=buflen@entry=32768,
> errp=errp@entry=0x3ff958faca8) at ../io/channel.c:314
> #4  0x000002aa1c6de13e in qemu_fill_buffer (f=0x2aa1e7a3c00) at
> ../migration/qemu-file.c:415
> #5  0x000002aa1c6ded18 in qemu_peek_byte (f=0x2aa1e7a3c00,
> offset=<optimized out>) at ../migration/qemu-file.c:707
> #6  0x000002aa1c6defba in qemu_get_byte (f=0x2aa1e7a3c00) at
> ../migration/qemu-file.c:720
> #7  qemu_get_be16 (f=f@entry=0x2aa1e7a3c00) at ../migration/qemu-file.c:800
> #8  0x000002aa1c6ea7e8 in source_return_path_thread
> (opaque=opaque@entry=0x2aa1e2cdcd0) at ../migration/migration.c:2995
> #9  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #10 0x000003ffa6807e66 in start_thread (arg=0x3ff958ff900) at
> pthread_create.c:477
> #11 0x000003ffa66fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff9662e900 (LWP 1465124)):
> #0  futex_wait_cancelable (private=0, expected=0,
> futex_word=0x2aa1e5daeb8) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
> mutex=0x2aa1d75bf80 <qemu_global_mutex>, cond=0x2aa1e5dae90) at
> pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa1e5dae90,
> mutex=mutex@entry=0x2aa1d75bf80 <qemu_global_mutex>) at
> pthread_cond_wait.c:647
> #3  0x000002aa1ccc500e in qemu_cond_wait_impl (cond=0x2aa1e5dae90,
> mutex=0x2aa1d75bf80 <qemu_global_mutex>, file=0x2aa1ce5d9bc
> "../softmmu/cpus.c", line=<optimized out>) at
> ../util/qemu-thread-posix.c:225
> #4  0x000002aa1c6bdf0e in qemu_wait_io_event
> (cpu=cpu@entry=0x2aa1e70c750) at ../softmmu/cpus.c:424
> #5  0x000002aa1cb34412 in mttcg_cpu_thread_fn
> (arg=arg@entry=0x2aa1e70c750) at
> ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #7  0x000003ffa6807e66 in start_thread (arg=0x3ff9662e900) at
> pthread_create.c:477
> #8  0x000003ffa66fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff97021900 (LWP 1465123)):
> #0  0x000003ffa66f1b32 in __GI___poll (fds=0x3ff480036b0, nfds=3,
> timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ffa8ed4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ffa8ed4790 in g_main_loop_run () at
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa1cb74afe in iothread_run
> (opaque=opaque@entry=0x2aa1e371400) at ../iothread.c:70
> #4  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #5  0x000003ffa6807e66 in start_thread (arg=0x3ff97021900) at
> pthread_create.c:477
> #6  0x000003ffa66fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff97923900 (LWP 1465122)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa1ccc55c4 in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>) at
> /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa1d78e5d0 <rcu_call_ready_event>)
> at ../util/qemu-thread-posix.c:464
> #3  0x000002aa1ccce19a in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../util/rcu.c:261
> #4  0x000002aa1ccc462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #5  0x000003ffa6807e66 in start_thread (arg=0x3ff97923900) at
> pthread_create.c:477
> #6  0x000003ffa66fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffaa07e4a0 (LWP 1465117)):
> #0  0x000003ffa66f1c8c in __ppoll (fds=0x2aa1f1c96f0, nfds=5,
> timeout=<optimized out>, timeout@entry=0x3fff71f9968,
> sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> #1  0x000002aa1ccda44a in ppoll (__ss=0x0, __timeout=0x3fff71f9968,
> __nfds=<optimized out>, __fds=<optimized out>) at
> /usr/include/s390x-linux-gnu/bits/poll2.h:77
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>,
> timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
> #3  0x000002aa1ccd7460 in os_host_main_loop_wait (timeout=1000000000)
> at ../util/main-loop.c:312
> #4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
> #5  0x000002aa1c6c4ae4 in qemu_main_loop () at ../softmmu/runstate.c:731
> #6  0x000002aa1c42c5ba in qemu_default_main () at ../softmmu/main.c:37
> #7  0x000003ffa662440a in __libc_start_main (main=0x2aa1c428b90
> <main>, argc=<optimized out>, argv=0x3fff71f9db8, init=<optimized
> out>, fini=<optimized out>, rtld_fini=0x3ffaa090e50 <_dl_fini>,
> stack_end=0x3fff71f9d00) at libc-start.c:308
> #8  0x000002aa1c42c4e4 in _start () at ../softmmu/main.c:48
> [Inferior 1 (process 1465117) detached]
> 
> ===========================================================
> PROCESS: 1465125
> gitlab-+ 1465125 1464515  0 14:28 ?        00:00:00
> ./qemu-system-aarch64 -qtest unix:/tmp/qtest-1464515.sock -qtest-log
> /dev/null -chardev socket,path=/tmp/qtest-1464515.qmp,id=char0 -mon
> chardev=char0,mode=control -display none -accel kvm -accel tcg
> -machine virt,gic-version=max -name target,debug-threads=on -m 150M
> -serial file:/tmp/migration-test-G3X601/dest_serial -incoming
> unix:/tmp/migration-test-G3X601/migsocket -cpu max -kernel
> /tmp/migration-test-G3X601/bootsect -accel qtest
> [New LWP 1465133]
> [New LWP 1465134]
> [New LWP 1465135]
> [New LWP 1465140]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> futex_wait_cancelable (private=0, expected=0,
> futex_word=0x2aa2b37b560) at ../sysdeps/nptl/futex-internal.h:183
> 183     ../sysdeps/nptl/futex-internal.h: No such file or directory.
> 
> Thread 5 (Thread 0x3ff749fe900 (LWP 1465140)):
> #0  0x000003ff858f1b32 in __GI___poll (fds=fds@entry=0x3ff10000b60,
> nfds=2, nfds@entry=<error reading variable: value has been optimized
> out>, timeout=timeout@entry=-1) at
> ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000002aa284f520e in poll (__timeout=-1, __nfds=<optimized out>,
> __fds=0x3ff10000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
> #2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa2b37b3a0) at
> ../migration/postcopy-ram.c:952
> #3  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #4  0x000003ff85a07e66 in start_thread (arg=0x3ff749fe900) at
> pthread_create.c:477
> #5  0x000003ff858fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Oh, we've got a postcopy_ram_fault_thread - so the dest knows that
postcopy is in progress; so it must have seen a postcopy message from
the source.

> Thread 4 (Thread 0x3ff7582e900 (LWP 1465135)):
> #0  futex_wait_cancelable (private=0, expected=0,
> futex_word=0x2aa2b82310c) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
> mutex=0x2aa2955bf80 <qemu_global_mutex>, cond=0x2aa2b8230e0) at
> pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa2b8230e0,
> mutex=mutex@entry=0x2aa2955bf80 <qemu_global_mutex>) at
> pthread_cond_wait.c:647
> #3  0x000002aa28ac500e in qemu_cond_wait_impl (cond=0x2aa2b8230e0,
> mutex=0x2aa2955bf80 <qemu_global_mutex>, file=0x2aa28c5d9bc
> "../softmmu/cpus.c", line=<optimized out>) at
> ../util/qemu-thread-posix.c:225
> #4  0x000002aa284bdf0e in qemu_wait_io_event
> (cpu=cpu@entry=0x2aa2b7b9750) at ../softmmu/cpus.c:424
> #5  0x000002aa28934412 in mttcg_cpu_thread_fn
> (arg=arg@entry=0x2aa2b7b9750) at
> ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #7  0x000003ff85a07e66 in start_thread (arg=0x3ff7582e900) at
> pthread_create.c:477
> #8  0x000003ff858fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff76221900 (LWP 1465134)):
> #0  0x000003ff858f1b32 in __GI___poll (fds=0x3ff280036b0, nfds=3,
> timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ff880d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ff880d4790 in g_main_loop_run () at
> /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa28974afe in iothread_run
> (opaque=opaque@entry=0x2aa2b41e400) at ../iothread.c:70
> #4  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #5  0x000003ff85a07e66 in start_thread (arg=0x3ff76221900) at
> pthread_create.c:477
> #6  0x000003ff858fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff76b23900 (LWP 1465133)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa28ac55c4 in qemu_futex_wait (val=<optimized out>,
> f=<optimized out>) at
> /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa2958e5d0 <rcu_call_ready_event>)
> at ../util/qemu-thread-posix.c:464
> #3  0x000002aa28ace19a in call_rcu_thread (opaque=opaque@entry=0x0) at
> ../util/rcu.c:261
> #4  0x000002aa28ac462a in qemu_thread_start (args=<optimized out>) at
> ../util/qemu-thread-posix.c:541
> #5  0x000003ff85a07e66 in start_thread (arg=0x3ff76b23900) at
> pthread_create.c:477
> #6  0x000003ff858fcbe6 in thread_start () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff8927e4a0 (LWP 1465125)):
> #0  futex_wait_cancelable (private=0, expected=0,
> futex_word=0x2aa2b37b560) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0,
> mutex=0x2aa2b37b508, cond=0x2aa2b37b538) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa2b37b538,
> mutex=mutex@entry=0x2aa2b37b508) at pthread_cond_wait.c:647
> #3  0x000002aa28ac500e in qemu_cond_wait_impl (cond=0x2aa2b37b538,
> mutex=0x2aa2b37b508, file=0x2aa28d70dee "../util/qemu-thread-posix.c",
> line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa28ac542e in qemu_sem_wait (sem=0x2aa2b37b508) at
> ../util/qemu-thread-posix.c:314
> #5  0x000002aa284f485a in postcopy_ram_incoming_setup
> (mis=0x2aa2b37b3a0, mis@entry=<error reading variable: value has been

Which is:
    if (migrate_postcopy_preempt()) {
        /*
         * The preempt channel is established in asynchronous way.  Wait
         * for its completion.
         */
        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);

> optimized out>) at ../migration/postcopy-ram.c:1204
> #6  0x000002aa284fbf70 in loadvm_postcopy_handle_listen
> (mis=<optimized out>) at ../migration/savevm.c:2016
> #7  loadvm_process_command (f=0x2aa2b85b7a0) at ../migration/savevm.c:2379
> #8  qemu_loadvm_state_main (f=f@entry=0x2aa2b85b7a0,
> mis=0x2aa2b37b3a0) at ../migration/savevm.c:2724
> #9  0x000002aa284fc2fe in loadvm_handle_cmd_packaged (mis=<optimized
> out>) at ../migration/savevm.c:2248
> #10 loadvm_process_command (f=0x2aa2b6a3490) at ../migration/savevm.c:2373
> #11 qemu_loadvm_state_main (f=f@entry=0x2aa2b6a3490,
> mis=mis@entry=0x2aa2b37b3a0) at ../migration/savevm.c:2724
> #12 0x000002aa284fcf14 in qemu_loadvm_state (f=0x2aa2b6a3490) at
> ../migration/savevm.c:2790
> #13 0x000002aa284e9f64 in process_incoming_migration_co
> (opaque=<optimized out>) at ../migration/migration.c:606
> #14 0x000002aa28ad92e6 in coroutine_trampoline (i0=<optimized out>,
> i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #15 0x000003ff858569c2 in __makecontext_ret () at
> ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
> [Inferior 1 (process 1465125) detached]
> 
> -- PMM
>
Peter Maydell March 14, 2023, 12:46 p.m. UTC | #22
On Tue, 14 Mar 2023 at 10:12, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
>
> Copying Peter Xu on this one since it's poscopy, especially the newer
> postcopy preempt.
>
> * Peter Maydell (peter.maydell@linaro.org) wrote:
> > On Sun, 12 Mar 2023 at 14:06, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > Here's another one, on the s390x CI runner:
> > > https://gitlab.com/qemu-project/qemu/-/jobs/3917587994
> >
> > And here's a backtrace from a hung migration-test on the s390x
> > runner (looks like a deadlock, none of these processes were
> > using CPU):
>
> Ah a backtrace!

I've attached another 2 -- migration-test seems to fairly
commonly hang on the s390 CI runner; there are several
stuck tests on it at the moment.

-- PMM
Process tree:
migration-test(1841031)-+-qemu-system-x86(1841381)
                        `-qemu-system-x86(1841387)
===========================================================
PROCESS: 1841031
gitlab-+ 1841031 1838397  0 Mar13 ?        00:00:02 /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
[New LWP 1841033]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
__libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffdac792e7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30

Thread 2 (Thread 0x3ffb01ff900 (LWP 1841033)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa133dae34 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa135596b8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa134079ea in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa133d9e9a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffb0b87e66 in start_thread (arg=0x3ffb01ff900) at pthread_create.c:477
#6  0x000003ffb0a7cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffb126d990 (LWP 1841031)):
#0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffdac792e7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
#1  0x000002aa133aba22 in recv (__flags=0, __n=1, __buf=0x3ffdac792e7, __fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
#2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
#3  0x000002aa133a9b8e in qtest_qmp_receive_dict (s=0x2aa135fb800) at ../tests/qtest/libqtest.c:837
#4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>) at ../tests/qtest/libqtest.c:837
#5  qtest_qmp_eventwait_ref (s=0x2aa135fb800, event=<optimized out>) at ../tests/qtest/libqtest.c:828
#6  0x000002aa133a9c1a in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#7  0x000002aa1339ec56 in test_postcopy_common (args=0x3ffdac795e8) at ../tests/qtest/migration-test.c:1161
#8  0x000002aa1339edb4 in test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178
#9  0x000003ffb0d7e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ffb0d7eada in g_test_run_suite () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ffb0d7eb10 in g_test_run () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000002aa1339ab14 in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2615
[Inferior 1 (process 1841031) detached]

===========================================================
PROCESS: 1841381
gitlab-+ 1841381 1841031  0 Mar13 ?        00:00:06 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/src_serial -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest
[New LWP 1841383]
[New LWP 1841384]
[New LWP 1841385]
[New LWP 1841394]
[New LWP 1841395]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ffb51f1c8c in __ppoll (fds=0x2aa1c0ffbd0, nfds=5, timeout=<optimized out>, timeout@entry=0x3fffbb79bc8, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44

Thread 6 (Thread 0x3ff469f7900 (LWP 1841395)):
#0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff469f2810, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
#1  0x000002aa1a34c52a in qio_channel_socket_writev (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>, nfds=<optimized out>, flags=0, errp=0x3ff469f2b80) at ../io/channel-socket.c:605
#2  0x000002aa1a351b88 in qio_channel_writev_full (ioc=ioc@entry=0x2aa1c0f8400, iov=0x2aa1c0e88e0, niov=64, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:108
#3  0x000002aa1a352a00 in qio_channel_writev_full_all (ioc=0x2aa1c0f8400, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:263
#4  0x000002aa1a352aae in qio_channel_writev_all (ioc=<optimized out>, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, errp=errp@entry=0x3ff469f2b80) at ../io/channel.c:242
#5  0x000002aa1a10de94 in qemu_fflush (f=f@entry=0x2aa1b271260) at ../migration/qemu-file.c:302
#6  0x000002aa1a10e122 in qemu_fflush (f=0x2aa1b271260) at ../migration/qemu-file.c:297
#7  add_to_iovec (f=f@entry=0x2aa1b271260, buf=<optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:510
#8  0x000002aa1a10e606 in qemu_put_buffer_async (f=f@entry=0x2aa1b271260, buf=<error reading variable: value has been optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:535
#9  0x000002aa1a2bd398 in save_normal_page (async=<optimized out>, buf=<optimized out>, offset=1175552, block=0x2aa1b19f340, pss=0x3ff40003000) at ../migration/ram.c:1400
#10 ram_save_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:1449
#11 ram_save_target_page_legacy (rs=0x3ff40003000, pss=0x3ff40003000) at ../migration/ram.c:2381
#12 0x000002aa1a2ba330 in ram_save_host_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:2539
#13 ram_find_and_save_block (rs=rs@entry=0x3ff40003000) at ../migration/ram.c:2620
#14 0x000002aa1a2ba9e4 in ram_save_iterate (f=0x2aa1b271260, opaque=<optimized out>) at ../migration/ram.c:3361
#15 0x000002aa1a12a524 in qemu_savevm_state_iterate (f=0x2aa1b271260, postcopy=<optimized out>) at ../migration/savevm.c:1345
#16 0x000002aa1a11da14 in migration_iteration_run (s=0x2aa1af3a810) at ../migration/migration.c:3896
#17 migration_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:4124
#18 0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#19 0x000003ffb5307e66 in start_thread (arg=0x3ff469f7900) at pthread_create.c:477
#20 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 5 (Thread 0x3ff471f8900 (LWP 1841394)):
#0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff471f39f0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
#1  0x000002aa1a34c7e4 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel-socket.c:532
#2  0x000002aa1a351a70 in qio_channel_readv_full (ioc=0x2aa1c0f8400, iov=iov@entry=0x3ff471f3bf0, niov=niov@entry=1, fds=fds@entry=0x0, nfds=nfds@entry=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel.c:74
#3  0x000002aa1a351cf6 in qio_channel_read (ioc=<optimized out>, buf=buf@entry=0x2aa1b17d9d8 "", buflen=buflen@entry=32768, errp=errp@entry=0x3ff471f3ca8) at ../io/channel.c:314
#4  0x000002aa1a10dd2e in qemu_fill_buffer (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:415
#5  0x000002aa1a10e908 in qemu_peek_byte (f=0x2aa1b17d9a0, offset=<optimized out>) at ../migration/qemu-file.c:707
#6  0x000002aa1a10ebaa in qemu_get_byte (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:720
#7  qemu_get_be16 (f=f@entry=0x2aa1b17d9a0) at ../migration/qemu-file.c:800
#8  0x000002aa1a11a3d8 in source_return_path_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:2995
#9  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#10 0x000003ffb5307e66 in start_thread (arg=0x3ff471f8900) at pthread_create.c:477
#11 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ffa4eab900 (LWP 1841385)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa1b29c30c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa1ad9b798 <qemu_global_mutex>, cond=0x2aa1b29c2e0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa1b29c2e0, mutex=mutex@entry=0x2aa1ad9b798 <qemu_global_mutex>) at pthread_cond_wait.c:647
#3  0x000002aa1a4c4cae in qemu_cond_wait_impl (cond=0x2aa1b29c2e0, mutex=0x2aa1ad9b798 <qemu_global_mutex>, file=0x2aa1a60b014 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa1a0edafe in qemu_wait_io_event (cpu=cpu@entry=0x2aa1b26a160) at ../softmmu/cpus.c:424
#5  0x000002aa1a3315ca in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa1b26a160) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#7  0x000003ffb5307e66 in start_thread (arg=0x3ffa4eab900) at pthread_create.c:477
#8  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ffa579d900 (LWP 1841384)):
#0  0x000003ffb51f1b32 in __GI___poll (fds=0x3ff58003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffb79d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffb79d4790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa1a37480e in iothread_run (opaque=opaque@entry=0x2aa1b0aec00) at ../iothread.c:70
#4  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffb5307e66 in start_thread (arg=0x3ffa579d900) at pthread_create.c:477
#6  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ffa6423900 (LWP 1841383)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa1a4c5264 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa1adccab0 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa1a4cde3a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffb5307e66 in start_thread (arg=0x3ffa6423900) at pthread_create.c:477
#6  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffb8b7e4a0 (LWP 1841381)):
#0  0x000003ffb51f1c8c in __ppoll (fds=0x2aa1c0ffbd0, nfds=5, timeout=<optimized out>, timeout@entry=0x3fffbb79bc8, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa1a4da07a in ppoll (__ss=0x0, __timeout=0x3fffbb79bc8, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
#3  0x000002aa1a4d7090 in os_host_main_loop_wait (timeout=1000000000) at ../util/main-loop.c:312
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
#5  0x000002aa1a0f46d4 in qemu_main_loop () at ../softmmu/runstate.c:731
#6  0x000002aa19f0e312 in qemu_default_main () at ../softmmu/main.c:37
#7  0x000003ffb512440a in __libc_start_main (main=0x2aa19f0b8d8 <main>, argc=<optimized out>, argv=0x3fffbb7a018, init=<optimized out>, fini=<optimized out>, rtld_fini=0x3ffb8b90e50 <_dl_fini>, stack_end=0x3fffbb79f60) at libc-start.c:308
#8  0x000002aa19f0e23c in _start () at ../softmmu/main.c:48
[Inferior 1 (process 1841381) detached]

===========================================================
PROCESS: 1841387
gitlab-+ 1841387 1841031  0 Mar13 ?        00:00:00 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/dest_serial -incoming unix:/tmp/migration-test-RM5901/migsocket -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest
[New LWP 1841389]
[New LWP 1841390]
[New LWP 1841391]
[New LWP 1841397]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa370039c0) at ../sysdeps/nptl/futex-internal.h:183

Thread 5 (Thread 0x3ff3a1fd900 (LWP 1841397)):
#0  0x000003ffa3571b32 in __GI___poll (fds=fds@entry=0x3ff2c000b60, nfds=2, nfds@entry=<error reading variable: value has been optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000002aa348a4dfe in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x3ff2c000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
#2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa37003800) at ../migration/postcopy-ram.c:952
#3  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#4  0x000003ffa3687e66 in start_thread (arg=0x3ff3a1fd900) at pthread_create.c:477
#5  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff9322b900 (LWP 1841391)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3711630c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa3551b798 <qemu_global_mutex>, cond=0x2aa371162e0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa371162e0, mutex=mutex@entry=0x2aa3551b798 <qemu_global_mutex>) at pthread_cond_wait.c:647
#3  0x000002aa34c44cae in qemu_cond_wait_impl (cond=0x2aa371162e0, mutex=0x2aa3551b798 <qemu_global_mutex>, file=0x2aa34d8b014 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa3486dafe in qemu_wait_io_event (cpu=cpu@entry=0x2aa370e4160) at ../softmmu/cpus.c:424
#5  0x000002aa34ab15ca in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa370e4160) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#7  0x000003ffa3687e66 in start_thread (arg=0x3ff9322b900) at pthread_create.c:477
#8  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff93b1d900 (LWP 1841390)):
#0  0x000003ffa3571b32 in __GI___poll (fds=0x3ff44003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffa5d54386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffa5d54790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa34af480e in iothread_run (opaque=opaque@entry=0x2aa36f28c00) at ../iothread.c:70
#4  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffa3687e66 in start_thread (arg=0x3ff93b1d900) at pthread_create.c:477
#6  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff947a3900 (LWP 1841389)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa34c45264 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa3554cab0 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa34c4de3a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffa3687e66 in start_thread (arg=0x3ff947a3900) at pthread_create.c:477
#6  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffa6efe4a0 (LWP 1841387)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa370039c0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa37003968, cond=0x2aa37003998) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa37003998, mutex=mutex@entry=0x2aa37003968) at pthread_cond_wait.c:647
#3  0x000002aa34c44cae in qemu_cond_wait_impl (cond=0x2aa37003998, mutex=0x2aa37003968, file=0x2aa34e65f06 "../util/qemu-thread-posix.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa34c450ce in qemu_sem_wait (sem=0x2aa37003968) at ../util/qemu-thread-posix.c:314
#5  0x000002aa348a444a in postcopy_ram_incoming_setup (mis=0x2aa37003800, mis@entry=<error reading variable: value has been optimized out>) at ../migration/postcopy-ram.c:1204
#6  0x000002aa348abb60 in loadvm_postcopy_handle_listen (mis=<optimized out>) at ../migration/savevm.c:2016
#7  loadvm_process_command (f=0x2aa37201810) at ../migration/savevm.c:2379
#8  qemu_loadvm_state_main (f=f@entry=0x2aa37201810, mis=0x2aa37003800) at ../migration/savevm.c:2724
#9  0x000002aa348abeee in loadvm_handle_cmd_packaged (mis=<optimized out>) at ../migration/savevm.c:2248
#10 loadvm_process_command (f=0x2aa370eb260) at ../migration/savevm.c:2373
#11 qemu_loadvm_state_main (f=f@entry=0x2aa370eb260, mis=mis@entry=0x2aa37003800) at ../migration/savevm.c:2724
#12 0x000002aa348acb04 in qemu_loadvm_state (f=0x2aa370eb260) at ../migration/savevm.c:2790
#13 0x000002aa34899b54 in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:606
#14 0x000002aa34c58f16 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#15 0x000003ffa34d69c2 in __makecontext_ret () at ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
[Inferior 1 (process 1841387) detached]
Process tree:
migration-test(1840499)-+-qemu-system-i38(1841012)
                        `-qemu-system-i38(1841019)
===========================================================
PROCESS: 1840499
gitlab-+ 1840499 1838397  0 Mar13 ?        00:00:02 /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
[New LWP 1840501]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
__libc_recv (fd=fd@entry=5, buf=buf@entry=0x3fff23798b7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
30	../sysdeps/unix/sysv/linux/recv.c: No such file or directory.

Thread 2 (Thread 0x3ffbe3ff900 (LWP 1840501)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa029dae34 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa02b596b8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa02a079ea in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa029d9e9a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffbed87e66 in start_thread (arg=0x3ffbe3ff900) at pthread_create.c:477
#6  0x000003ffbec7cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffbf46d990 (LWP 1840499)):
#0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3fff23798b7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
#1  0x000002aa029aba22 in recv (__flags=0, __n=1, __buf=0x3fff23798b7, __fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
#2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
#3  0x000002aa029a9b8e in qtest_qmp_receive_dict (s=0x2aa0327d120) at ../tests/qtest/libqtest.c:837
#4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>) at ../tests/qtest/libqtest.c:837
#5  qtest_qmp_eventwait_ref (s=0x2aa0327d120, event=<optimized out>) at ../tests/qtest/libqtest.c:828
#6  0x000002aa029a9c1a in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
#7  0x000002aa0299ec56 in test_postcopy_common (args=0x3fff2379bb8) at ../tests/qtest/migration-test.c:1161
#8  0x000002aa0299edb4 in test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178
#9  0x000003ffbef7e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#10 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#11 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#12 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#13 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#14 0x000003ffbef7eada in g_test_run_suite () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#15 0x000003ffbef7eb10 in g_test_run () from /lib/s390x-linux-gnu/libglib-2.0.so.0
#16 0x000002aa0299ab14 in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2615
[Inferior 1 (process 1840499) detached]

===========================================================
PROCESS: 1841012
gitlab-+ 1841012 1840499  0 Mar13 ?        00:00:06 ./qemu-system-i386 -qtest unix:/tmp/qtest-1840499.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1840499.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-8IQB11/src_serial -drive file=/tmp/migration-test-8IQB11/bootsect,format=raw -accel qtest
[New LWP 1841014]
[New LWP 1841015]
[New LWP 1841016]
[New LWP 1841062]
[New LWP 1841063]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
0x000003ff9a771c8c in __ppoll (fds=0x2aa396d66e0, nfds=5, timeout=<optimized out>, timeout@entry=0x3ffd40fa048, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
44	../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.

Thread 6 (Thread 0x3ff2bfff900 (LWP 1841063)):
#0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff2bffa810, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
#1  0x000002aa38343b7a in qio_channel_socket_writev (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>, nfds=<optimized out>, flags=0, errp=0x3ff2bffab80) at ../io/channel-socket.c:605
#2  0x000002aa383491d8 in qio_channel_writev_full (ioc=ioc@entry=0x2aa3a52fa80, iov=0x2aa3a53b300, niov=64, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff2bffab80) at ../io/channel.c:108
#3  0x000002aa3834a050 in qio_channel_writev_full_all (ioc=0x2aa3a52fa80, iov=iov@entry=0x2aa396b26b0, niov=<optimized out>, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff2bffab80) at ../io/channel.c:263
#4  0x000002aa3834a0fe in qio_channel_writev_all (ioc=<optimized out>, iov=iov@entry=0x2aa396b26b0, niov=<optimized out>, errp=errp@entry=0x3ff2bffab80) at ../io/channel.c:242
#5  0x000002aa3810d404 in qemu_fflush (f=f@entry=0x2aa396aa670) at ../migration/qemu-file.c:302
#6  0x000002aa3810d692 in qemu_fflush (f=0x2aa396aa670) at ../migration/qemu-file.c:297
#7  add_to_iovec (f=f@entry=0x2aa396aa670, buf=<optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:510
#8  0x000002aa3810db76 in qemu_put_buffer_async (f=f@entry=0x2aa396aa670, buf=<error reading variable: value has been optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:535
#9  0x000002aa382b4d00 in save_normal_page (async=<optimized out>, buf=<optimized out>, offset=1175552, block=0x2aa395dabb0, pss=0x3ff24002800) at ../migration/ram.c:1400
#10 ram_save_page (pss=0x3ff24002800, rs=0x3ff24002800) at ../migration/ram.c:1449
#11 ram_save_target_page_legacy (rs=0x3ff24002800, pss=0x3ff24002800) at ../migration/ram.c:2381
#12 0x000002aa382b1c98 in ram_save_host_page (pss=0x3ff24002800, rs=0x3ff24002800) at ../migration/ram.c:2539
#13 ram_find_and_save_block (rs=rs@entry=0x3ff24002800) at ../migration/ram.c:2620
#14 0x000002aa382b234c in ram_save_iterate (f=0x2aa396aa670, opaque=<optimized out>) at ../migration/ram.c:3361
#15 0x000002aa38129a94 in qemu_savevm_state_iterate (f=0x2aa396aa670, postcopy=<optimized out>) at ../migration/savevm.c:1345
#16 0x000002aa3811cf84 in migration_iteration_run (s=0x2aa39375710) at ../migration/migration.c:3896
#17 migration_thread (opaque=opaque@entry=0x2aa39375710) at ../migration/migration.c:4124
#18 0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#19 0x000003ff9a887e66 in start_thread (arg=0x3ff2bfff900) at pthread_create.c:477
#20 0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 5 (Thread 0x3ff30ff9900 (LWP 1841062)):
#0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff30ff49f0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
#1  0x000002aa38343e34 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=0, errp=0x3ff30ff4ca8) at ../io/channel-socket.c:532
#2  0x000002aa383490c0 in qio_channel_readv_full (ioc=0x2aa3a52fa80, iov=iov@entry=0x3ff30ff4bf0, niov=niov@entry=1, fds=fds@entry=0x0, nfds=nfds@entry=0x0, flags=0, errp=0x3ff30ff4ca8) at ../io/channel.c:74
#3  0x000002aa38349346 in qio_channel_read (ioc=<optimized out>, buf=buf@entry=0x2aa395b9548 "", buflen=buflen@entry=32768, errp=errp@entry=0x3ff30ff4ca8) at ../io/channel.c:314
#4  0x000002aa3810d29e in qemu_fill_buffer (f=0x2aa395b9510) at ../migration/qemu-file.c:415
#5  0x000002aa3810de78 in qemu_peek_byte (f=0x2aa395b9510, offset=<optimized out>) at ../migration/qemu-file.c:707
#6  0x000002aa3810e11a in qemu_get_byte (f=0x2aa395b9510) at ../migration/qemu-file.c:720
#7  qemu_get_be16 (f=f@entry=0x2aa395b9510) at ../migration/qemu-file.c:800
#8  0x000002aa38119948 in source_return_path_thread (opaque=opaque@entry=0x2aa39375710) at ../migration/migration.c:2995
#9  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#10 0x000003ff9a887e66 in start_thread (arg=0x3ff30ff9900) at pthread_create.c:477
#11 0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff8a42b900 (LWP 1841016)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa396d5718) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa38d8fcf0 <qemu_global_mutex>, cond=0x2aa396d56f0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa396d56f0, mutex=mutex@entry=0x2aa38d8fcf0 <qemu_global_mutex>) at pthread_cond_wait.c:647
#3  0x000002aa384bc2fe in qemu_cond_wait_impl (cond=0x2aa396d56f0, mutex=0x2aa38d8fcf0 <qemu_global_mutex>, file=0x2aa38602664 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa380ed06e in qemu_wait_io_event (cpu=cpu@entry=0x2aa396a5980) at ../softmmu/cpus.c:424
#5  0x000002aa38328c1a in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa396a5980) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#7  0x000003ff9a887e66 in start_thread (arg=0x3ff8a42b900) at pthread_create.c:477
#8  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff8ad1d900 (LWP 1841015)):
#0  0x000003ff9a771b32 in __GI___poll (fds=0x3ff3c003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ff9cf54386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ff9cf54790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa3836be5e in iothread_run (opaque=opaque@entry=0x2aa39477800) at ../iothread.c:70
#4  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ff9a887e66 in start_thread (arg=0x3ff8ad1d900) at pthread_create.c:477
#6  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ff8b9a3900 (LWP 1841014)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa384bc8b4 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa38dbffc8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa384c548a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ff9a887e66 in start_thread (arg=0x3ff8b9a3900) at pthread_create.c:477
#6  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ff9e0fe4a0 (LWP 1841012)):
#0  0x000003ff9a771c8c in __ppoll (fds=0x2aa396d66e0, nfds=5, timeout=<optimized out>, timeout@entry=0x3ffd40fa048, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
#1  0x000002aa384d16ca in ppoll (__ss=0x0, __timeout=0x3ffd40fa048, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/s390x-linux-gnu/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
#3  0x000002aa384ce6e0 in os_host_main_loop_wait (timeout=1000000000) at ../util/main-loop.c:312
#4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
#5  0x000002aa380f3c44 in qemu_main_loop () at ../softmmu/runstate.c:731
#6  0x000002aa37f0d882 in qemu_default_main () at ../softmmu/main.c:37
#7  0x000003ff9a6a440a in __libc_start_main (main=0x2aa37f0ae48 <main>, argc=<optimized out>, argv=0x3ffd40fa498, init=<optimized out>, fini=<optimized out>, rtld_fini=0x3ff9e110e50 <_dl_fini>, stack_end=0x3ffd40fa3e0) at libc-start.c:308
#8  0x000002aa37f0d7ac in _start () at ../softmmu/main.c:48
[Inferior 1 (process 1841012) detached]

===========================================================
PROCESS: 1841019
gitlab-+ 1841019 1840499  0 Mar13 ?        00:00:00 ./qemu-system-i386 -qtest unix:/tmp/qtest-1840499.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1840499.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-8IQB11/dest_serial -incoming unix:/tmp/migration-test-8IQB11/migsocket -drive file=/tmp/migration-test-8IQB11/bootsect,format=raw -accel qtest
[New LWP 1841021]
[New LWP 1841022]
[New LWP 1841023]
[New LWP 1841067]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3a92e1c0) at ../sysdeps/nptl/futex-internal.h:183
183	../sysdeps/nptl/futex-internal.h: No such file or directory.

Thread 5 (Thread 0x3ff4a8ff900 (LWP 1841067)):
#0  0x000003ffb0071b32 in __GI___poll (fds=fds@entry=0x3ff40000b60, nfds=2, nfds@entry=<error reading variable: value has been optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000002aa382a436e in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x3ff40000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
#2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa3a92e000) at ../migration/postcopy-ram.c:952
#3  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#4  0x000003ffb0187e66 in start_thread (arg=0x3ff4a8ff900) at pthread_create.c:477
#5  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 4 (Thread 0x3ff5b7fe900 (LWP 1841023)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3aa3e69c) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa38f0fcf0 <qemu_global_mutex>, cond=0x2aa3aa3e670) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa3aa3e670, mutex=mutex@entry=0x2aa38f0fcf0 <qemu_global_mutex>) at pthread_cond_wait.c:647
#3  0x000002aa3863c2fe in qemu_cond_wait_impl (cond=0x2aa3aa3e670, mutex=0x2aa38f0fcf0 <qemu_global_mutex>, file=0x2aa38782664 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa3826d06e in qemu_wait_io_event (cpu=cpu@entry=0x2aa3aa0e980) at ../softmmu/cpus.c:424
#5  0x000002aa384a8c1a in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa3aa0e980) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
#6  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#7  0x000003ffb0187e66 in start_thread (arg=0x3ff5b7fe900) at pthread_create.c:477
#8  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 3 (Thread 0x3ff5bfff900 (LWP 1841022)):
#0  0x000003ffb0071b32 in __GI___poll (fds=0x3ff54003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000003ffb2854386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#2  0x000003ffb2854790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
#3  0x000002aa384ebe5e in iothread_run (opaque=opaque@entry=0x2aa3a7e0800) at ../iothread.c:70
#4  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffb0187e66 in start_thread (arg=0x3ff5bfff900) at pthread_create.c:477
#6  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 2 (Thread 0x3ffa12a3900 (LWP 1841021)):
#0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
#1  0x000002aa3863c8b4 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x2aa38f3ffc8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
#3  0x000002aa3864548a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
#4  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
#5  0x000003ffb0187e66 in start_thread (arg=0x3ffa12a3900) at pthread_create.c:477
#6  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Thread 1 (Thread 0x3ffb39fe4a0 (LWP 1841019)):
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3a92e1c0) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa3a92e168, cond=0x2aa3a92e198) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0x2aa3a92e198, mutex=mutex@entry=0x2aa3a92e168) at pthread_cond_wait.c:647
#3  0x000002aa3863c2fe in qemu_cond_wait_impl (cond=0x2aa3a92e198, mutex=0x2aa3a92e168, file=0x2aa3885b73e "../util/qemu-thread-posix.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
#4  0x000002aa3863c71e in qemu_sem_wait (sem=0x2aa3a92e168) at ../util/qemu-thread-posix.c:314
#5  0x000002aa382a39ba in postcopy_ram_incoming_setup (mis=0x2aa3a92e000, mis@entry=<error reading variable: value has been optimized out>) at ../migration/postcopy-ram.c:1204
#6  0x000002aa382ab0d0 in loadvm_postcopy_handle_listen (mis=<optimized out>) at ../migration/savevm.c:2016
#7  loadvm_process_command (f=0x2aa3aa5c7d0) at ../migration/savevm.c:2379
#8  qemu_loadvm_state_main (f=f@entry=0x2aa3aa5c7d0, mis=0x2aa3a92e000) at ../migration/savevm.c:2724
#9  0x000002aa382ab45e in loadvm_handle_cmd_packaged (mis=<optimized out>) at ../migration/savevm.c:2248
#10 loadvm_process_command (f=0x2aa3aa135f0) at ../migration/savevm.c:2373
#11 qemu_loadvm_state_main (f=f@entry=0x2aa3aa135f0, mis=mis@entry=0x2aa3a92e000) at ../migration/savevm.c:2724
#12 0x000002aa382ac074 in qemu_loadvm_state (f=0x2aa3aa135f0) at ../migration/savevm.c:2790
#13 0x000002aa382990c4 in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:606
#14 0x000002aa38650566 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
#15 0x000003ffaffd69c2 in __makecontext_ret () at ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
[Inferior 1 (process 1841019) detached]
Daniel P. Berrangé March 14, 2023, 1:05 p.m. UTC | #23
On Tue, Mar 14, 2023 at 12:46:18PM +0000, Peter Maydell wrote:
> On Tue, 14 Mar 2023 at 10:12, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > Copying Peter Xu on this one since it's poscopy, especially the newer
> > postcopy preempt.
> >
> > * Peter Maydell (peter.maydell@linaro.org) wrote:
> > > On Sun, 12 Mar 2023 at 14:06, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > > Here's another one, on the s390x CI runner:
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/3917587994
> > >
> > > And here's a backtrace from a hung migration-test on the s390x
> > > runner (looks like a deadlock, none of these processes were
> > > using CPU):
> >
> > Ah a backtrace!
> 
> I've attached another 2 -- migration-test seems to fairly
> commonly hang on the s390 CI runner; there are several
> stuck tests on it at the moment.
> 
> -- PMM

> Process tree:
> migration-test(1841031)-+-qemu-system-x86(1841381)
>                         `-qemu-system-x86(1841387)
> ===========================================================

> Thread 1 (Thread 0x3ffb126d990 (LWP 1841031)):
> #6  0x000002aa133a9c1a in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #7  0x000002aa1339ec56 in test_postcopy_common (args=0x3ffdac795e8) at ../tests/qtest/migration-test.c:1161
> #8  0x000002aa1339edb4 in test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178

There's a stack frame optimized out between 6 & 7 which means we can't
tell which event it is waiting for - could be a STOP or RESUME depending
on where in migrate_postcopy_start it is.



> ===========================================================
> PROCESS: 1841381
> gitlab-+ 1841381 1841031  0 Mar13 ?        00:00:06 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/src_serial -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest


IIUC, the source QEMU

> 
> Thread 6 (Thread 0x3ff469f7900 (LWP 1841395)):
> #0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff469f2810, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
> #1  0x000002aa1a34c52a in qio_channel_socket_writev (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>, nfds=<optimized out>, flags=0, errp=0x3ff469f2b80) at ../io/channel-socket.c:605
> #2  0x000002aa1a351b88 in qio_channel_writev_full (ioc=ioc@entry=0x2aa1c0f8400, iov=0x2aa1c0e88e0, niov=64, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:108
> #3  0x000002aa1a352a00 in qio_channel_writev_full_all (ioc=0x2aa1c0f8400, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:263
> #4  0x000002aa1a352aae in qio_channel_writev_all (ioc=<optimized out>, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, errp=errp@entry=0x3ff469f2b80) at ../io/channel.c:242
> #5  0x000002aa1a10de94 in qemu_fflush (f=f@entry=0x2aa1b271260) at ../migration/qemu-file.c:302
> #6  0x000002aa1a10e122 in qemu_fflush (f=0x2aa1b271260) at ../migration/qemu-file.c:297
> #7  add_to_iovec (f=f@entry=0x2aa1b271260, buf=<optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:510
> #8  0x000002aa1a10e606 in qemu_put_buffer_async (f=f@entry=0x2aa1b271260, buf=<error reading variable: value has been optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:535
> #9  0x000002aa1a2bd398 in save_normal_page (async=<optimized out>, buf=<optimized out>, offset=1175552, block=0x2aa1b19f340, pss=0x3ff40003000) at ../migration/ram.c:1400
> #10 ram_save_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:1449
> #11 ram_save_target_page_legacy (rs=0x3ff40003000, pss=0x3ff40003000) at ../migration/ram.c:2381
> #12 0x000002aa1a2ba330 in ram_save_host_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:2539
> #13 ram_find_and_save_block (rs=rs@entry=0x3ff40003000) at ../migration/ram.c:2620
> #14 0x000002aa1a2ba9e4 in ram_save_iterate (f=0x2aa1b271260, opaque=<optimized out>) at ../migration/ram.c:3361
> #15 0x000002aa1a12a524 in qemu_savevm_state_iterate (f=0x2aa1b271260, postcopy=<optimized out>) at ../migration/savevm.c:1345
> #16 0x000002aa1a11da14 in migration_iteration_run (s=0x2aa1af3a810) at ../migration/migration.c:3896
> #17 migration_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:4124
> #18 0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #19 0x000003ffb5307e66 in start_thread (arg=0x3ff469f7900) at pthread_create.c:477
> #20 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Blocked sending some RAM plages to the dst, presumably because the socket buffer
is full.


> Thread 5 (Thread 0x3ff471f8900 (LWP 1841394)):
> #0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff471f39f0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
> #1  0x000002aa1a34c7e4 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel-socket.c:532
> #2  0x000002aa1a351a70 in qio_channel_readv_full (ioc=0x2aa1c0f8400, iov=iov@entry=0x3ff471f3bf0, niov=niov@entry=1, fds=fds@entry=0x0, nfds=nfds@entry=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel.c:74
> #3  0x000002aa1a351cf6 in qio_channel_read (ioc=<optimized out>, buf=buf@entry=0x2aa1b17d9d8 "", buflen=buflen@entry=32768, errp=errp@entry=0x3ff471f3ca8) at ../io/channel.c:314
> #4  0x000002aa1a10dd2e in qemu_fill_buffer (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:415
> #5  0x000002aa1a10e908 in qemu_peek_byte (f=0x2aa1b17d9a0, offset=<optimized out>) at ../migration/qemu-file.c:707
> #6  0x000002aa1a10ebaa in qemu_get_byte (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:720
> #7  qemu_get_be16 (f=f@entry=0x2aa1b17d9a0) at ../migration/qemu-file.c:800
> #8  0x000002aa1a11a3d8 in source_return_path_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:2995
> #9  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #10 0x000003ffb5307e66 in start_thread (arg=0x3ff471f8900) at pthread_create.c:477
> #11 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65

Blocked waiting for return data from the dst host. Dst probably
has't sent anything

Overall source QEMU doesn't look too suspicious.

> 
> PROCESS: 1841387
> gitlab-+ 1841387 1841031  0 Mar13 ?        00:00:00 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/dest_serial -incoming unix:/tmp/migration-test-RM5901/migsocket -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest

IIUC the dst target QEMU for the migration


> 
> Thread 1 (Thread 0x3ffa6efe4a0 (LWP 1841387)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa370039c0) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa37003968, cond=0x2aa37003998) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa37003998, mutex=mutex@entry=0x2aa37003968) at pthread_cond_wait.c:647
> #3  0x000002aa34c44cae in qemu_cond_wait_impl (cond=0x2aa37003998, mutex=0x2aa37003968, file=0x2aa34e65f06 "../util/qemu-thread-posix.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa34c450ce in qemu_sem_wait (sem=0x2aa37003968) at ../util/qemu-thread-posix.c:314
> #5  0x000002aa348a444a in postcopy_ram_incoming_setup (mis=0x2aa37003800, mis@entry=<error reading variable: value has been optimized out>) at ../migration/postcopy-ram.c:1204
> #6  0x000002aa348abb60 in loadvm_postcopy_handle_listen (mis=<optimized out>) at ../migration/savevm.c:2016
> #7  loadvm_process_command (f=0x2aa37201810) at ../migration/savevm.c:2379
> #8  qemu_loadvm_state_main (f=f@entry=0x2aa37201810, mis=0x2aa37003800) at ../migration/savevm.c:2724
> #9  0x000002aa348abeee in loadvm_handle_cmd_packaged (mis=<optimized out>) at ../migration/savevm.c:2248
> #10 loadvm_process_command (f=0x2aa370eb260) at ../migration/savevm.c:2373
> #11 qemu_loadvm_state_main (f=f@entry=0x2aa370eb260, mis=mis@entry=0x2aa37003800) at ../migration/savevm.c:2724
> #12 0x000002aa348acb04 in qemu_loadvm_state (f=0x2aa370eb260) at ../migration/savevm.c:2790
> #13 0x000002aa34899b54 in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:606
> #14 0x000002aa34c58f16 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #15 0x000003ffa34d69c2 in __makecontext_ret () at ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
> [Inferior 1 (process 1841387) detached]

The main thread has switched to a coroutine and is waiting on a semaphore
for

    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);

This feels not right to me.  The sem wait is putting the coroutine to
sleep and so control never switches back to the main thread.

I'm not seeing what other running thread is ever going to call

    qemu_sem_post(&mis->postcopy_qemufile_dst_done);

Do we have a race here, where this will only work if the qemu_sem_post runs
before the qemu_sem_wait ?


With regards,
Daniel
Dr. David Alan Gilbert March 14, 2023, 1:13 p.m. UTC | #24
* Peter Maydell (peter.maydell@linaro.org) wrote:
> On Tue, 14 Mar 2023 at 10:12, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> >
> > Copying Peter Xu on this one since it's poscopy, especially the newer
> > postcopy preempt.
> >
> > * Peter Maydell (peter.maydell@linaro.org) wrote:
> > > On Sun, 12 Mar 2023 at 14:06, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > > Here's another one, on the s390x CI runner:
> > > > https://gitlab.com/qemu-project/qemu/-/jobs/3917587994
> > >
> > > And here's a backtrace from a hung migration-test on the s390x
> > > runner (looks like a deadlock, none of these processes were
> > > using CPU):
> >
> > Ah a backtrace!
> 
> I've attached another 2 -- migration-test seems to fairly
> commonly hang on the s390 CI runner; there are several
> stuck tests on it at the moment.

These are the same as the last one I replied to; it's probably best
if we disable /migration/postcopy/preempt/* until peterx has attacked
it.

Dave

> -- PMM

> Process tree:
> migration-test(1841031)-+-qemu-system-x86(1841381)
>                         `-qemu-system-x86(1841387)
> ===========================================================
> PROCESS: 1841031
> gitlab-+ 1841031 1838397  0 Mar13 ?        00:00:02 /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> [New LWP 1841033]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffdac792e7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
> 
> Thread 2 (Thread 0x3ffb01ff900 (LWP 1841033)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa133dae34 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa135596b8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa134079ea in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa133d9e9a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffb0b87e66 in start_thread (arg=0x3ffb01ff900) at pthread_create.c:477
> #6  0x000003ffb0a7cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffb126d990 (LWP 1841031)):
> #0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3ffdac792e7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
> #1  0x000002aa133aba22 in recv (__flags=0, __n=1, __buf=0x3ffdac792e7, __fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
> #2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
> #3  0x000002aa133a9b8e in qtest_qmp_receive_dict (s=0x2aa135fb800) at ../tests/qtest/libqtest.c:837
> #4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>) at ../tests/qtest/libqtest.c:837
> #5  qtest_qmp_eventwait_ref (s=0x2aa135fb800, event=<optimized out>) at ../tests/qtest/libqtest.c:828
> #6  0x000002aa133a9c1a in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #7  0x000002aa1339ec56 in test_postcopy_common (args=0x3ffdac795e8) at ../tests/qtest/migration-test.c:1161
> #8  0x000002aa1339edb4 in test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178
> #9  0x000003ffb0d7e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #10 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #11 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #12 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #13 0x000003ffb0d7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #14 0x000003ffb0d7eada in g_test_run_suite () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #15 0x000003ffb0d7eb10 in g_test_run () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #16 0x000002aa1339ab14 in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2615
> [Inferior 1 (process 1841031) detached]
> 
> ===========================================================
> PROCESS: 1841381
> gitlab-+ 1841381 1841031  0 Mar13 ?        00:00:06 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/src_serial -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest
> [New LWP 1841383]
> [New LWP 1841384]
> [New LWP 1841385]
> [New LWP 1841394]
> [New LWP 1841395]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> 0x000003ffb51f1c8c in __ppoll (fds=0x2aa1c0ffbd0, nfds=5, timeout=<optimized out>, timeout@entry=0x3fffbb79bc8, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> 
> Thread 6 (Thread 0x3ff469f7900 (LWP 1841395)):
> #0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff469f2810, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
> #1  0x000002aa1a34c52a in qio_channel_socket_writev (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>, nfds=<optimized out>, flags=0, errp=0x3ff469f2b80) at ../io/channel-socket.c:605
> #2  0x000002aa1a351b88 in qio_channel_writev_full (ioc=ioc@entry=0x2aa1c0f8400, iov=0x2aa1c0e88e0, niov=64, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:108
> #3  0x000002aa1a352a00 in qio_channel_writev_full_all (ioc=0x2aa1c0f8400, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff469f2b80) at ../io/channel.c:263
> #4  0x000002aa1a352aae in qio_channel_writev_all (ioc=<optimized out>, iov=iov@entry=0x2aa1b2792a0, niov=<optimized out>, errp=errp@entry=0x3ff469f2b80) at ../io/channel.c:242
> #5  0x000002aa1a10de94 in qemu_fflush (f=f@entry=0x2aa1b271260) at ../migration/qemu-file.c:302
> #6  0x000002aa1a10e122 in qemu_fflush (f=0x2aa1b271260) at ../migration/qemu-file.c:297
> #7  add_to_iovec (f=f@entry=0x2aa1b271260, buf=<optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:510
> #8  0x000002aa1a10e606 in qemu_put_buffer_async (f=f@entry=0x2aa1b271260, buf=<error reading variable: value has been optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:535
> #9  0x000002aa1a2bd398 in save_normal_page (async=<optimized out>, buf=<optimized out>, offset=1175552, block=0x2aa1b19f340, pss=0x3ff40003000) at ../migration/ram.c:1400
> #10 ram_save_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:1449
> #11 ram_save_target_page_legacy (rs=0x3ff40003000, pss=0x3ff40003000) at ../migration/ram.c:2381
> #12 0x000002aa1a2ba330 in ram_save_host_page (pss=0x3ff40003000, rs=0x3ff40003000) at ../migration/ram.c:2539
> #13 ram_find_and_save_block (rs=rs@entry=0x3ff40003000) at ../migration/ram.c:2620
> #14 0x000002aa1a2ba9e4 in ram_save_iterate (f=0x2aa1b271260, opaque=<optimized out>) at ../migration/ram.c:3361
> #15 0x000002aa1a12a524 in qemu_savevm_state_iterate (f=0x2aa1b271260, postcopy=<optimized out>) at ../migration/savevm.c:1345
> #16 0x000002aa1a11da14 in migration_iteration_run (s=0x2aa1af3a810) at ../migration/migration.c:3896
> #17 migration_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:4124
> #18 0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #19 0x000003ffb5307e66 in start_thread (arg=0x3ff469f7900) at pthread_create.c:477
> #20 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 5 (Thread 0x3ff471f8900 (LWP 1841394)):
> #0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff471f39f0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
> #1  0x000002aa1a34c7e4 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel-socket.c:532
> #2  0x000002aa1a351a70 in qio_channel_readv_full (ioc=0x2aa1c0f8400, iov=iov@entry=0x3ff471f3bf0, niov=niov@entry=1, fds=fds@entry=0x0, nfds=nfds@entry=0x0, flags=0, errp=0x3ff471f3ca8) at ../io/channel.c:74
> #3  0x000002aa1a351cf6 in qio_channel_read (ioc=<optimized out>, buf=buf@entry=0x2aa1b17d9d8 "", buflen=buflen@entry=32768, errp=errp@entry=0x3ff471f3ca8) at ../io/channel.c:314
> #4  0x000002aa1a10dd2e in qemu_fill_buffer (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:415
> #5  0x000002aa1a10e908 in qemu_peek_byte (f=0x2aa1b17d9a0, offset=<optimized out>) at ../migration/qemu-file.c:707
> #6  0x000002aa1a10ebaa in qemu_get_byte (f=0x2aa1b17d9a0) at ../migration/qemu-file.c:720
> #7  qemu_get_be16 (f=f@entry=0x2aa1b17d9a0) at ../migration/qemu-file.c:800
> #8  0x000002aa1a11a3d8 in source_return_path_thread (opaque=opaque@entry=0x2aa1af3a810) at ../migration/migration.c:2995
> #9  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #10 0x000003ffb5307e66 in start_thread (arg=0x3ff471f8900) at pthread_create.c:477
> #11 0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ffa4eab900 (LWP 1841385)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa1b29c30c) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa1ad9b798 <qemu_global_mutex>, cond=0x2aa1b29c2e0) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa1b29c2e0, mutex=mutex@entry=0x2aa1ad9b798 <qemu_global_mutex>) at pthread_cond_wait.c:647
> #3  0x000002aa1a4c4cae in qemu_cond_wait_impl (cond=0x2aa1b29c2e0, mutex=0x2aa1ad9b798 <qemu_global_mutex>, file=0x2aa1a60b014 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa1a0edafe in qemu_wait_io_event (cpu=cpu@entry=0x2aa1b26a160) at ../softmmu/cpus.c:424
> #5  0x000002aa1a3315ca in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa1b26a160) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #7  0x000003ffb5307e66 in start_thread (arg=0x3ffa4eab900) at pthread_create.c:477
> #8  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ffa579d900 (LWP 1841384)):
> #0  0x000003ffb51f1b32 in __GI___poll (fds=0x3ff58003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ffb79d4386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ffb79d4790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa1a37480e in iothread_run (opaque=opaque@entry=0x2aa1b0aec00) at ../iothread.c:70
> #4  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffb5307e66 in start_thread (arg=0x3ffa579d900) at pthread_create.c:477
> #6  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ffa6423900 (LWP 1841383)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa1a4c5264 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa1adccab0 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa1a4cde3a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa1a4c42ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffb5307e66 in start_thread (arg=0x3ffa6423900) at pthread_create.c:477
> #6  0x000003ffb51fcbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffb8b7e4a0 (LWP 1841381)):
> #0  0x000003ffb51f1c8c in __ppoll (fds=0x2aa1c0ffbd0, nfds=5, timeout=<optimized out>, timeout@entry=0x3fffbb79bc8, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> #1  0x000002aa1a4da07a in ppoll (__ss=0x0, __timeout=0x3fffbb79bc8, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/s390x-linux-gnu/bits/poll2.h:77
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
> #3  0x000002aa1a4d7090 in os_host_main_loop_wait (timeout=1000000000) at ../util/main-loop.c:312
> #4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
> #5  0x000002aa1a0f46d4 in qemu_main_loop () at ../softmmu/runstate.c:731
> #6  0x000002aa19f0e312 in qemu_default_main () at ../softmmu/main.c:37
> #7  0x000003ffb512440a in __libc_start_main (main=0x2aa19f0b8d8 <main>, argc=<optimized out>, argv=0x3fffbb7a018, init=<optimized out>, fini=<optimized out>, rtld_fini=0x3ffb8b90e50 <_dl_fini>, stack_end=0x3fffbb79f60) at libc-start.c:308
> #8  0x000002aa19f0e23c in _start () at ../softmmu/main.c:48
> [Inferior 1 (process 1841381) detached]
> 
> ===========================================================
> PROCESS: 1841387
> gitlab-+ 1841387 1841031  0 Mar13 ?        00:00:00 ./qemu-system-x86_64 -qtest unix:/tmp/qtest-1841031.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1841031.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-RM5901/dest_serial -incoming unix:/tmp/migration-test-RM5901/migsocket -drive file=/tmp/migration-test-RM5901/bootsect,format=raw -accel qtest
> [New LWP 1841389]
> [New LWP 1841390]
> [New LWP 1841391]
> [New LWP 1841397]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa370039c0) at ../sysdeps/nptl/futex-internal.h:183
> 
> Thread 5 (Thread 0x3ff3a1fd900 (LWP 1841397)):
> #0  0x000003ffa3571b32 in __GI___poll (fds=fds@entry=0x3ff2c000b60, nfds=2, nfds@entry=<error reading variable: value has been optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000002aa348a4dfe in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x3ff2c000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
> #2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa37003800) at ../migration/postcopy-ram.c:952
> #3  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #4  0x000003ffa3687e66 in start_thread (arg=0x3ff3a1fd900) at pthread_create.c:477
> #5  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff9322b900 (LWP 1841391)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3711630c) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa3551b798 <qemu_global_mutex>, cond=0x2aa371162e0) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa371162e0, mutex=mutex@entry=0x2aa3551b798 <qemu_global_mutex>) at pthread_cond_wait.c:647
> #3  0x000002aa34c44cae in qemu_cond_wait_impl (cond=0x2aa371162e0, mutex=0x2aa3551b798 <qemu_global_mutex>, file=0x2aa34d8b014 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa3486dafe in qemu_wait_io_event (cpu=cpu@entry=0x2aa370e4160) at ../softmmu/cpus.c:424
> #5  0x000002aa34ab15ca in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa370e4160) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #7  0x000003ffa3687e66 in start_thread (arg=0x3ff9322b900) at pthread_create.c:477
> #8  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff93b1d900 (LWP 1841390)):
> #0  0x000003ffa3571b32 in __GI___poll (fds=0x3ff44003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ffa5d54386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ffa5d54790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa34af480e in iothread_run (opaque=opaque@entry=0x2aa36f28c00) at ../iothread.c:70
> #4  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffa3687e66 in start_thread (arg=0x3ff93b1d900) at pthread_create.c:477
> #6  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff947a3900 (LWP 1841389)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa34c45264 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa3554cab0 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa34c4de3a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa34c442ca in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffa3687e66 in start_thread (arg=0x3ff947a3900) at pthread_create.c:477
> #6  0x000003ffa357cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffa6efe4a0 (LWP 1841387)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa370039c0) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa37003968, cond=0x2aa37003998) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa37003998, mutex=mutex@entry=0x2aa37003968) at pthread_cond_wait.c:647
> #3  0x000002aa34c44cae in qemu_cond_wait_impl (cond=0x2aa37003998, mutex=0x2aa37003968, file=0x2aa34e65f06 "../util/qemu-thread-posix.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa34c450ce in qemu_sem_wait (sem=0x2aa37003968) at ../util/qemu-thread-posix.c:314
> #5  0x000002aa348a444a in postcopy_ram_incoming_setup (mis=0x2aa37003800, mis@entry=<error reading variable: value has been optimized out>) at ../migration/postcopy-ram.c:1204
> #6  0x000002aa348abb60 in loadvm_postcopy_handle_listen (mis=<optimized out>) at ../migration/savevm.c:2016
> #7  loadvm_process_command (f=0x2aa37201810) at ../migration/savevm.c:2379
> #8  qemu_loadvm_state_main (f=f@entry=0x2aa37201810, mis=0x2aa37003800) at ../migration/savevm.c:2724
> #9  0x000002aa348abeee in loadvm_handle_cmd_packaged (mis=<optimized out>) at ../migration/savevm.c:2248
> #10 loadvm_process_command (f=0x2aa370eb260) at ../migration/savevm.c:2373
> #11 qemu_loadvm_state_main (f=f@entry=0x2aa370eb260, mis=mis@entry=0x2aa37003800) at ../migration/savevm.c:2724
> #12 0x000002aa348acb04 in qemu_loadvm_state (f=0x2aa370eb260) at ../migration/savevm.c:2790
> #13 0x000002aa34899b54 in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:606
> #14 0x000002aa34c58f16 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #15 0x000003ffa34d69c2 in __makecontext_ret () at ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
> [Inferior 1 (process 1841387) detached]
> 

> Process tree:
> migration-test(1840499)-+-qemu-system-i38(1841012)
>                         `-qemu-system-i38(1841019)
> ===========================================================
> PROCESS: 1840499
> gitlab-+ 1840499 1838397  0 Mar13 ?        00:00:02 /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/build/tests/qtest/migration-test --tap -k
> [New LWP 1840501]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3fff23798b7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
> 30	../sysdeps/unix/sysv/linux/recv.c: No such file or directory.
> 
> Thread 2 (Thread 0x3ffbe3ff900 (LWP 1840501)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa029dae34 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa02b596b8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa02a079ea in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa029d9e9a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffbed87e66 in start_thread (arg=0x3ffbe3ff900) at pthread_create.c:477
> #6  0x000003ffbec7cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffbf46d990 (LWP 1840499)):
> #0  __libc_recv (fd=fd@entry=5, buf=buf@entry=0x3fff23798b7, len=len@entry=1, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recv.c:30
> #1  0x000002aa029aba22 in recv (__flags=0, __n=1, __buf=0x3fff23798b7, __fd=5) at /usr/include/s390x-linux-gnu/bits/socket2.h:44
> #2  qmp_fd_receive (fd=<optimized out>) at ../tests/qtest/libqmp.c:73
> #3  0x000002aa029a9b8e in qtest_qmp_receive_dict (s=0x2aa0327d120) at ../tests/qtest/libqtest.c:837
> #4  qtest_qmp_eventwait_ref (event=<optimized out>, s=<optimized out>) at ../tests/qtest/libqtest.c:837
> #5  qtest_qmp_eventwait_ref (s=0x2aa0327d120, event=<optimized out>) at ../tests/qtest/libqtest.c:828
> #6  0x000002aa029a9c1a in qtest_qmp_eventwait (s=<optimized out>, event=<optimized out>) at ../tests/qtest/libqtest.c:850
> #7  0x000002aa0299ec56 in test_postcopy_common (args=0x3fff2379bb8) at ../tests/qtest/migration-test.c:1161
> #8  0x000002aa0299edb4 in test_postcopy_preempt () at ../tests/qtest/migration-test.c:1178
> #9  0x000003ffbef7e608 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #10 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #11 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #12 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #13 0x000003ffbef7e392 in ?? () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #14 0x000003ffbef7eada in g_test_run_suite () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #15 0x000003ffbef7eb10 in g_test_run () from /lib/s390x-linux-gnu/libglib-2.0.so.0
> #16 0x000002aa0299ab14 in main (argc=<optimized out>, argv=<optimized out>) at ../tests/qtest/migration-test.c:2615
> [Inferior 1 (process 1840499) detached]
> 
> ===========================================================
> PROCESS: 1841012
> gitlab-+ 1841012 1840499  0 Mar13 ?        00:00:06 ./qemu-system-i386 -qtest unix:/tmp/qtest-1840499.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1840499.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name source,debug-threads=on -m 150M -serial file:/tmp/migration-test-8IQB11/src_serial -drive file=/tmp/migration-test-8IQB11/bootsect,format=raw -accel qtest
> [New LWP 1841014]
> [New LWP 1841015]
> [New LWP 1841016]
> [New LWP 1841062]
> [New LWP 1841063]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> 0x000003ff9a771c8c in __ppoll (fds=0x2aa396d66e0, nfds=5, timeout=<optimized out>, timeout@entry=0x3ffd40fa048, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> 44	../sysdeps/unix/sysv/linux/ppoll.c: No such file or directory.
> 
> Thread 6 (Thread 0x3ff2bfff900 (LWP 1841063)):
> #0  __libc_sendmsg (fd=<optimized out>, msg=msg@entry=0x3ff2bffa810, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/sendmsg.c:30
> #1  0x000002aa38343b7a in qio_channel_socket_writev (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=<optimized out>, nfds=<optimized out>, flags=0, errp=0x3ff2bffab80) at ../io/channel-socket.c:605
> #2  0x000002aa383491d8 in qio_channel_writev_full (ioc=ioc@entry=0x2aa3a52fa80, iov=0x2aa3a53b300, niov=64, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff2bffab80) at ../io/channel.c:108
> #3  0x000002aa3834a050 in qio_channel_writev_full_all (ioc=0x2aa3a52fa80, iov=iov@entry=0x2aa396b26b0, niov=<optimized out>, fds=fds@entry=0x0, nfds=nfds@entry=0, flags=0, errp=0x3ff2bffab80) at ../io/channel.c:263
> #4  0x000002aa3834a0fe in qio_channel_writev_all (ioc=<optimized out>, iov=iov@entry=0x2aa396b26b0, niov=<optimized out>, errp=errp@entry=0x3ff2bffab80) at ../io/channel.c:242
> #5  0x000002aa3810d404 in qemu_fflush (f=f@entry=0x2aa396aa670) at ../migration/qemu-file.c:302
> #6  0x000002aa3810d692 in qemu_fflush (f=0x2aa396aa670) at ../migration/qemu-file.c:297
> #7  add_to_iovec (f=f@entry=0x2aa396aa670, buf=<optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:510
> #8  0x000002aa3810db76 in qemu_put_buffer_async (f=f@entry=0x2aa396aa670, buf=<error reading variable: value has been optimized out>, size=size@entry=4096, may_free=<optimized out>) at ../migration/qemu-file.c:535
> #9  0x000002aa382b4d00 in save_normal_page (async=<optimized out>, buf=<optimized out>, offset=1175552, block=0x2aa395dabb0, pss=0x3ff24002800) at ../migration/ram.c:1400
> #10 ram_save_page (pss=0x3ff24002800, rs=0x3ff24002800) at ../migration/ram.c:1449
> #11 ram_save_target_page_legacy (rs=0x3ff24002800, pss=0x3ff24002800) at ../migration/ram.c:2381
> #12 0x000002aa382b1c98 in ram_save_host_page (pss=0x3ff24002800, rs=0x3ff24002800) at ../migration/ram.c:2539
> #13 ram_find_and_save_block (rs=rs@entry=0x3ff24002800) at ../migration/ram.c:2620
> #14 0x000002aa382b234c in ram_save_iterate (f=0x2aa396aa670, opaque=<optimized out>) at ../migration/ram.c:3361
> #15 0x000002aa38129a94 in qemu_savevm_state_iterate (f=0x2aa396aa670, postcopy=<optimized out>) at ../migration/savevm.c:1345
> #16 0x000002aa3811cf84 in migration_iteration_run (s=0x2aa39375710) at ../migration/migration.c:3896
> #17 migration_thread (opaque=opaque@entry=0x2aa39375710) at ../migration/migration.c:4124
> #18 0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #19 0x000003ff9a887e66 in start_thread (arg=0x3ff2bfff900) at pthread_create.c:477
> #20 0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 5 (Thread 0x3ff30ff9900 (LWP 1841062)):
> #0  __libc_recvmsg (fd=<optimized out>, msg=msg@entry=0x3ff30ff49f0, flags=flags@entry=0) at ../sysdeps/unix/sysv/linux/recvmsg.c:30
> #1  0x000002aa38343e34 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=0, errp=0x3ff30ff4ca8) at ../io/channel-socket.c:532
> #2  0x000002aa383490c0 in qio_channel_readv_full (ioc=0x2aa3a52fa80, iov=iov@entry=0x3ff30ff4bf0, niov=niov@entry=1, fds=fds@entry=0x0, nfds=nfds@entry=0x0, flags=0, errp=0x3ff30ff4ca8) at ../io/channel.c:74
> #3  0x000002aa38349346 in qio_channel_read (ioc=<optimized out>, buf=buf@entry=0x2aa395b9548 "", buflen=buflen@entry=32768, errp=errp@entry=0x3ff30ff4ca8) at ../io/channel.c:314
> #4  0x000002aa3810d29e in qemu_fill_buffer (f=0x2aa395b9510) at ../migration/qemu-file.c:415
> #5  0x000002aa3810de78 in qemu_peek_byte (f=0x2aa395b9510, offset=<optimized out>) at ../migration/qemu-file.c:707
> #6  0x000002aa3810e11a in qemu_get_byte (f=0x2aa395b9510) at ../migration/qemu-file.c:720
> #7  qemu_get_be16 (f=f@entry=0x2aa395b9510) at ../migration/qemu-file.c:800
> #8  0x000002aa38119948 in source_return_path_thread (opaque=opaque@entry=0x2aa39375710) at ../migration/migration.c:2995
> #9  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #10 0x000003ff9a887e66 in start_thread (arg=0x3ff30ff9900) at pthread_create.c:477
> #11 0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff8a42b900 (LWP 1841016)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa396d5718) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa38d8fcf0 <qemu_global_mutex>, cond=0x2aa396d56f0) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa396d56f0, mutex=mutex@entry=0x2aa38d8fcf0 <qemu_global_mutex>) at pthread_cond_wait.c:647
> #3  0x000002aa384bc2fe in qemu_cond_wait_impl (cond=0x2aa396d56f0, mutex=0x2aa38d8fcf0 <qemu_global_mutex>, file=0x2aa38602664 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa380ed06e in qemu_wait_io_event (cpu=cpu@entry=0x2aa396a5980) at ../softmmu/cpus.c:424
> #5  0x000002aa38328c1a in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa396a5980) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #7  0x000003ff9a887e66 in start_thread (arg=0x3ff8a42b900) at pthread_create.c:477
> #8  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff8ad1d900 (LWP 1841015)):
> #0  0x000003ff9a771b32 in __GI___poll (fds=0x3ff3c003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ff9cf54386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ff9cf54790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa3836be5e in iothread_run (opaque=opaque@entry=0x2aa39477800) at ../iothread.c:70
> #4  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ff9a887e66 in start_thread (arg=0x3ff8ad1d900) at pthread_create.c:477
> #6  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ff8b9a3900 (LWP 1841014)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa384bc8b4 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa38dbffc8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa384c548a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa384bb91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ff9a887e66 in start_thread (arg=0x3ff8b9a3900) at pthread_create.c:477
> #6  0x000003ff9a77cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ff9e0fe4a0 (LWP 1841012)):
> #0  0x000003ff9a771c8c in __ppoll (fds=0x2aa396d66e0, nfds=5, timeout=<optimized out>, timeout@entry=0x3ffd40fa048, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:44
> #1  0x000002aa384d16ca in ppoll (__ss=0x0, __timeout=0x3ffd40fa048, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/s390x-linux-gnu/bits/poll2.h:77
> #2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=1000000000) at ../util/qemu-timer.c:351
> #3  0x000002aa384ce6e0 in os_host_main_loop_wait (timeout=1000000000) at ../util/main-loop.c:312
> #4  main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:603
> #5  0x000002aa380f3c44 in qemu_main_loop () at ../softmmu/runstate.c:731
> #6  0x000002aa37f0d882 in qemu_default_main () at ../softmmu/main.c:37
> #7  0x000003ff9a6a440a in __libc_start_main (main=0x2aa37f0ae48 <main>, argc=<optimized out>, argv=0x3ffd40fa498, init=<optimized out>, fini=<optimized out>, rtld_fini=0x3ff9e110e50 <_dl_fini>, stack_end=0x3ffd40fa3e0) at libc-start.c:308
> #8  0x000002aa37f0d7ac in _start () at ../softmmu/main.c:48
> [Inferior 1 (process 1841012) detached]
> 
> ===========================================================
> PROCESS: 1841019
> gitlab-+ 1841019 1840499  0 Mar13 ?        00:00:00 ./qemu-system-i386 -qtest unix:/tmp/qtest-1840499.sock -qtest-log /dev/null -chardev socket,path=/tmp/qtest-1840499.qmp,id=char0 -mon chardev=char0,mode=control -display none -accel kvm -accel tcg -name target,debug-threads=on -m 150M -serial file:/tmp/migration-test-8IQB11/dest_serial -incoming unix:/tmp/migration-test-8IQB11/migsocket -drive file=/tmp/migration-test-8IQB11/bootsect,format=raw -accel qtest
> [New LWP 1841021]
> [New LWP 1841022]
> [New LWP 1841023]
> [New LWP 1841067]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/s390x-linux-gnu/libthread_db.so.1".
> futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3a92e1c0) at ../sysdeps/nptl/futex-internal.h:183
> 183	../sysdeps/nptl/futex-internal.h: No such file or directory.
> 
> Thread 5 (Thread 0x3ff4a8ff900 (LWP 1841067)):
> #0  0x000003ffb0071b32 in __GI___poll (fds=fds@entry=0x3ff40000b60, nfds=2, nfds@entry=<error reading variable: value has been optimized out>, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000002aa382a436e in poll (__timeout=-1, __nfds=<optimized out>, __fds=0x3ff40000b60) at /usr/include/s390x-linux-gnu/bits/poll2.h:46
> #2  postcopy_ram_fault_thread (opaque=opaque@entry=0x2aa3a92e000) at ../migration/postcopy-ram.c:952
> #3  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #4  0x000003ffb0187e66 in start_thread (arg=0x3ff4a8ff900) at pthread_create.c:477
> #5  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 4 (Thread 0x3ff5b7fe900 (LWP 1841023)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3aa3e69c) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa38f0fcf0 <qemu_global_mutex>, cond=0x2aa3aa3e670) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa3aa3e670, mutex=mutex@entry=0x2aa38f0fcf0 <qemu_global_mutex>) at pthread_cond_wait.c:647
> #3  0x000002aa3863c2fe in qemu_cond_wait_impl (cond=0x2aa3aa3e670, mutex=0x2aa38f0fcf0 <qemu_global_mutex>, file=0x2aa38782664 "../softmmu/cpus.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa3826d06e in qemu_wait_io_event (cpu=cpu@entry=0x2aa3aa0e980) at ../softmmu/cpus.c:424
> #5  0x000002aa384a8c1a in mttcg_cpu_thread_fn (arg=arg@entry=0x2aa3aa0e980) at ../accel/tcg/tcg-accel-ops-mttcg.c:123
> #6  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #7  0x000003ffb0187e66 in start_thread (arg=0x3ff5b7fe900) at pthread_create.c:477
> #8  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 3 (Thread 0x3ff5bfff900 (LWP 1841022)):
> #0  0x000003ffb0071b32 in __GI___poll (fds=0x3ff54003280, nfds=3, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:29
> #1  0x000003ffb2854386 in  () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #2  0x000003ffb2854790 in g_main_loop_run () at /lib/s390x-linux-gnu/libglib-2.0.so.0
> #3  0x000002aa384ebe5e in iothread_run (opaque=opaque@entry=0x2aa3a7e0800) at ../iothread.c:70
> #4  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffb0187e66 in start_thread (arg=0x3ff5bfff900) at pthread_create.c:477
> #6  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 2 (Thread 0x3ffa12a3900 (LWP 1841021)):
> #0  syscall () at ../sysdeps/unix/sysv/linux/s390/s390-64/syscall.S:37
> #1  0x000002aa3863c8b4 in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /home/gitlab-runner/builds/FLaZkdt1/0/qemu-project/qemu/include/qemu/futex.h:29
> #2  qemu_event_wait (ev=ev@entry=0x2aa38f3ffc8 <rcu_call_ready_event>) at ../util/qemu-thread-posix.c:464
> #3  0x000002aa3864548a in call_rcu_thread (opaque=opaque@entry=0x0) at ../util/rcu.c:261
> #4  0x000002aa3863b91a in qemu_thread_start (args=<optimized out>) at ../util/qemu-thread-posix.c:541
> #5  0x000003ffb0187e66 in start_thread (arg=0x3ffa12a3900) at pthread_create.c:477
> #6  0x000003ffb007cbe6 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone.S:65
> 
> Thread 1 (Thread 0x3ffb39fe4a0 (LWP 1841019)):
> #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x2aa3a92e1c0) at ../sysdeps/nptl/futex-internal.h:183
> #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x2aa3a92e168, cond=0x2aa3a92e198) at pthread_cond_wait.c:508
> #2  __pthread_cond_wait (cond=cond@entry=0x2aa3a92e198, mutex=mutex@entry=0x2aa3a92e168) at pthread_cond_wait.c:647
> #3  0x000002aa3863c2fe in qemu_cond_wait_impl (cond=0x2aa3a92e198, mutex=0x2aa3a92e168, file=0x2aa3885b73e "../util/qemu-thread-posix.c", line=<optimized out>) at ../util/qemu-thread-posix.c:225
> #4  0x000002aa3863c71e in qemu_sem_wait (sem=0x2aa3a92e168) at ../util/qemu-thread-posix.c:314
> #5  0x000002aa382a39ba in postcopy_ram_incoming_setup (mis=0x2aa3a92e000, mis@entry=<error reading variable: value has been optimized out>) at ../migration/postcopy-ram.c:1204
> #6  0x000002aa382ab0d0 in loadvm_postcopy_handle_listen (mis=<optimized out>) at ../migration/savevm.c:2016
> #7  loadvm_process_command (f=0x2aa3aa5c7d0) at ../migration/savevm.c:2379
> #8  qemu_loadvm_state_main (f=f@entry=0x2aa3aa5c7d0, mis=0x2aa3a92e000) at ../migration/savevm.c:2724
> #9  0x000002aa382ab45e in loadvm_handle_cmd_packaged (mis=<optimized out>) at ../migration/savevm.c:2248
> #10 loadvm_process_command (f=0x2aa3aa135f0) at ../migration/savevm.c:2373
> #11 qemu_loadvm_state_main (f=f@entry=0x2aa3aa135f0, mis=mis@entry=0x2aa3a92e000) at ../migration/savevm.c:2724
> #12 0x000002aa382ac074 in qemu_loadvm_state (f=0x2aa3aa135f0) at ../migration/savevm.c:2790
> #13 0x000002aa382990c4 in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:606
> #14 0x000002aa38650566 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:177
> #15 0x000003ffaffd69c2 in __makecontext_ret () at ../sysdeps/unix/sysv/linux/s390/s390-64/__makecontext_ret.S:23
> [Inferior 1 (process 1841019) detached]
>
Peter Xu March 14, 2023, 4:46 p.m. UTC | #25
On Tue, Mar 14, 2023 at 10:11:53AM +0000, Dr. David Alan Gilbert wrote:
> OK, I think I kind of see what's happening here, one for Peter Xu.
> If I'm right it's a race something like:
>   a) The test harness tells the source it wants to enter postcopy
>   b) The harness then waits for the source to stop
>   c) ... and the dest to start 
> 
>   It's blocked on one of b&c but can't tell which
> 
>   d) The main thread in the dest is waiting for the postcopy recovery fd
>     to be opened
>   e) But I think the source is still trying to send normal precopy RAM
>     and perhaps hasn't got around yet to opening that socket yet????
>   f) But I think the dest isn't reading from the main channel at that
>     point because of (d)

I think this analysis is spot on.  Thanks Dave!

Src qemu does this with below order:

        1. setup preempt channel
        1.1. connect()  --> this is done in another thread
        1.2. sem_wait(postcopy_qemufile_src_sem) --> make sure it's created
        2. prepare postcopy package (LISTEN, non-iterable states, ping-3, RUN)
        3. send the package

So logically the sequence is guaranteed so that when LISTEN cmd is
processed, we should have connect()ed already.

But I think there's one thing missing on dest.. since the accept() on the
dest node should be run in the main thread, meanwhile the LISTEN cmd is
also processed on the main thread, even if the listening socket is trying
to kick the main thread to do the accept() (so the connection has
established) it won't be able to kick the final accept() as main thread is
waiting in the semaphore.  That caused a deadlock.

A simple fix I can think of is moving the wait channel operation outside
the main thread, e.g. to the preempt thread.

I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
sure the reproducability, fine by me too if it's easier to just disable
preempt tests for 8.0 release.

Thanks,
Daniel P. Berrangé March 14, 2023, 5:48 p.m. UTC | #26
On Tue, Mar 14, 2023 at 12:46:34PM -0400, Peter Xu wrote:
> On Tue, Mar 14, 2023 at 10:11:53AM +0000, Dr. David Alan Gilbert wrote:
> > OK, I think I kind of see what's happening here, one for Peter Xu.
> > If I'm right it's a race something like:
> >   a) The test harness tells the source it wants to enter postcopy
> >   b) The harness then waits for the source to stop
> >   c) ... and the dest to start 
> > 
> >   It's blocked on one of b&c but can't tell which
> > 
> >   d) The main thread in the dest is waiting for the postcopy recovery fd
> >     to be opened
> >   e) But I think the source is still trying to send normal precopy RAM
> >     and perhaps hasn't got around yet to opening that socket yet????
> >   f) But I think the dest isn't reading from the main channel at that
> >     point because of (d)
> 
> I think this analysis is spot on.  Thanks Dave!
> 
> Src qemu does this with below order:
> 
>         1. setup preempt channel
>         1.1. connect()  --> this is done in another thread
>         1.2. sem_wait(postcopy_qemufile_src_sem) --> make sure it's created
>         2. prepare postcopy package (LISTEN, non-iterable states, ping-3, RUN)
>         3. send the package
> 
> So logically the sequence is guaranteed so that when LISTEN cmd is
> processed, we should have connect()ed already.
> 
> But I think there's one thing missing on dest.. since the accept() on the
> dest node should be run in the main thread, meanwhile the LISTEN cmd is
> also processed on the main thread, even if the listening socket is trying
> to kick the main thread to do the accept() (so the connection has
> established) it won't be able to kick the final accept() as main thread is
> waiting in the semaphore.  That caused a deadlock.
> 
> A simple fix I can think of is moving the wait channel operation outside
> the main thread, e.g. to the preempt thread.
> 
> I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
> sure the reproducability, fine by me too if it's easier to just disable
> preempt tests for 8.0 release.
> 
> Thanks,
> 
> -- 
> Peter Xu

> From 92f2f90d2eb270ca158479bfd9a5a855ec7ddf4d Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Tue, 14 Mar 2023 12:24:02 -0400
> Subject: [PATCH] migration: Wait on preempt channel in preempt thread
> 
> QEMU main thread will wait until dest preempt channel established during
> processing the LISTEN command (within the whole postcopy PACKAGED data), by
> waiting on the semaphore postcopy_qemufile_dst_done.
> 
> That's racy, because it's possible that the dest QEMU main thread hasn't
> yet accept()ed the new connection when processing the LISTEN event.  The
> sem_wait() will yield the main thread without being able to run anything
> else including the accept() of the new socket, which can cause deadlock
> within the main thread.
> 
> To avoid the race, move the "wait channel" from main thread to the preempt
> thread right at the start.
> 
> Reported-by: Peter Maydell <peter.maydell@linaro.org>
> Fixes: 5655aab079 ("migration: Postpone postcopy preempt channel to be after main")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/postcopy-ram.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


This description of the bug & proposed fixed matches what I could
infer as the flaw from the stack trace's Peter provided.

> 
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index f54f44d899..41c0713650 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -1197,11 +1197,6 @@ int postcopy_ram_incoming_setup(MigrationIncomingState *mis)
>      }
>  
>      if (migrate_postcopy_preempt()) {
> -        /*
> -         * The preempt channel is established in asynchronous way.  Wait
> -         * for its completion.
> -         */
> -        qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
>          /*
>           * This thread needs to be created after the temp pages because
>           * it'll fetch RAM_CHANNEL_POSTCOPY PostcopyTmpPage immediately.
> @@ -1668,6 +1663,12 @@ void *postcopy_preempt_thread(void *opaque)
>  
>      qemu_sem_post(&mis->thread_sync_sem);
>  
> +    /*
> +     * The preempt channel is established in asynchronous way.  Wait
> +     * for its completion.
> +     */
> +    qemu_sem_wait(&mis->postcopy_qemufile_dst_done);
> +
>      /* Sending RAM_SAVE_FLAG_EOS to terminate this thread */
>      qemu_mutex_lock(&mis->postcopy_prio_thread_mutex);
>      while (1) {
> -- 
> 2.39.1
> 


With regards,
Daniel
Peter Maydell March 14, 2023, 7:31 p.m. UTC | #27
On Tue, 14 Mar 2023 at 16:46, Peter Xu <peterx@redhat.com> wrote:
> I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
> sure the reproducability, fine by me too if it's easier to just disable
> preempt tests for 8.0 release.

If you all are happy that the fix is correct, I think the easiest thing
is just to get it into master (via the usual migration tree, or I can
apply it directly if you like), and we'll see if it helps.

thanks
-- PMM
Peter Xu March 14, 2023, 8:51 p.m. UTC | #28
On Tue, Mar 14, 2023 at 07:31:38PM +0000, Peter Maydell wrote:
> On Tue, 14 Mar 2023 at 16:46, Peter Xu <peterx@redhat.com> wrote:
> > I've attached that simple fix.  Peter, is it easy to verify it?  I'm not
> > sure the reproducability, fine by me too if it's easier to just disable
> > preempt tests for 8.0 release.
> 
> If you all are happy that the fix is correct, I think the easiest thing
> is just to get it into master (via the usual migration tree, or I can
> apply it directly if you like), and we'll see if it helps.

I'll leave that to you maintainers to decide.  In all case, let me post a
formal patch then.  Thanks.
Peter Maydell March 22, 2023, 8:15 p.m. UTC | #29
On Tue, 7 Mar 2023 at 09:53, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > migration-test has been flaky for a long time, both in CI and
> > > otherwise:
> > >
> >
> >
> > > In the cases where I've looked at the underlying log, this seems to
> > > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > > specific subtest by default until somebody can track down the
> > > underlying cause. Enthusiasts can opt back in by setting
> > > QEMU_TEST_FLAKY_TESTS=1 in their environment.
> >
> > So I'm going to apply this, because hopefully it will improve
> > the reliability a bit, but it's clearly not all of the
> > issues with migration-test, because in the course of the
> > run I was doing to test it before applying it I got this
> > error from the OpenBSD VM:
> >
> >  32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
> >            ERROR          134.73s   killed by signal 6 SIGABRT
> > ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> > stderr:
> > qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
> > qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
> > qemu-system-aarch64: Failed to connect to '127.0.0.1:19581': Address
> > already in use
> > query-migrate shows failed migration: Failed to connect to
> > '127.0.0.1:19581': Address already in use
> > **
> > ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> > assertion failed: (!g_str_equal(status, "failed"))
> >
> > (test program exited with status code -6)
>
> Got another repeat of this one today; again, on the OpenBSD VM:
>
>  32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
>            ERROR
>   131.28s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
> qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
> qemu-system-aarch64: Failed to connect to '127.0.0.1:30312': Address
> already in use
> query-migrate shows failed migration: Failed to connect to
> '127.0.0.1:30312': Address already i
> n use
> **
> ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> assertion failed: (!
> g_str_equal(status, "failed"))
>
> (test program exited with status code -6)
> ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

This one's still here (openbsd VM again):

 37/774 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
           ERROR          565.42s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-x86_64: multifd_send_sync_main: channel 15 has already quit
qemu-system-x86_64: Failed to connect to '127.0.0.1:7856': Address
already in use
query-migrate shows failed migration: Failed to connect to
'127.0.0.1:7856': Address already in use
**
ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)

-- PMM
Peter Maydell April 3, 2023, 7:16 p.m. UTC | #30
On Wed, 22 Mar 2023 at 20:15, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Tue, 7 Mar 2023 at 09:53, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Sat, 4 Mar 2023 at 15:39, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 2 Mar 2023 at 17:22, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > migration-test has been flaky for a long time, both in CI and
> > > > otherwise:
> > > >
> > >
> > >
> > > > In the cases where I've looked at the underlying log, this seems to
> > > > be in the migration/multifd/tcp/plain/cancel subtest.  Disable that
> > > > specific subtest by default until somebody can track down the
> > > > underlying cause. Enthusiasts can opt back in by setting
> > > > QEMU_TEST_FLAKY_TESTS=1 in their environment.
> > >
> > > So I'm going to apply this, because hopefully it will improve
> > > the reliability a bit, but it's clearly not all of the
> > > issues with migration-test, because in the course of the
> > > run I was doing to test it before applying it I got this
> > > error from the OpenBSD VM:
> > >
> > >  32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
> > >            ERROR          134.73s   killed by signal 6 SIGABRT
> > > ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> > > stderr:
> > > qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
> > > qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
> > > qemu-system-aarch64: Failed to connect to '127.0.0.1:19581': Address
> > > already in use
> > > query-migrate shows failed migration: Failed to connect to
> > > '127.0.0.1:19581': Address already in use
> > > **
> > > ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> > > assertion failed: (!g_str_equal(status, "failed"))
> > >
> > > (test program exited with status code -6)
> >
> > Got another repeat of this one today; again, on the OpenBSD VM:
> >
> >  32/646 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
> >            ERROR
> >   131.28s   killed by signal 6 SIGABRT
> > ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> > stderr:
> > qemu-system-aarch64: multifd_send_sync_main: channel 15 has already quit
> > qemu-system-aarch64: failed to save SaveStateEntry with id(name): 2(ram): -1
> > qemu-system-aarch64: Failed to connect to '127.0.0.1:30312': Address
> > already in use
> > query-migrate shows failed migration: Failed to connect to
> > '127.0.0.1:30312': Address already i
> > n use
> > **
> > ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> > assertion failed: (!
> > g_str_equal(status, "failed"))
> >
> > (test program exited with status code -6)
> > ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
>
> This one's still here (openbsd VM again):
>
>  37/774 qemu:qtest+qtest-x86_64 / qtest-x86_64/migration-test
>            ERROR          565.42s   killed by signal 6 SIGABRT
> ――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
> stderr:
> qemu-system-x86_64: multifd_send_sync_main: channel 15 has already quit
> qemu-system-x86_64: Failed to connect to '127.0.0.1:7856': Address
> already in use
> query-migrate shows failed migration: Failed to connect to
> '127.0.0.1:7856': Address already in use
> **
> ERROR:../src/tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
> assertion failed: (!g_str_equal(status, "failed"))
>
> (test program exited with status code -6)

And again, here on x86 macos:

 32/626 qemu:qtest+qtest-aarch64 / qtest-aarch64/migration-test
           ERROR          413.00s   killed by signal 6 SIGABRT
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stderr:
qemu-system-aarch64: multifd_send_sync_main: channel 14 has already quit
qemu-system-aarch64: Failed to connect to '127.0.0.1:52689':
Connection reset by peer
query-migrate shows failed migration: Failed to connect to
'127.0.0.1:52689': Connection reset by peer
**
ERROR:../../tests/qtest/migration-helpers.c:151:migrate_query_not_failed:
assertion failed: (!g_str_equal(status, "failed"))

(test program exited with status code -6)
――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――


If this isn't sufficient output to be able to figure out the
problem, can we add more diagnostics to the test and/or the
migration code so that the test does produce helpful output?

thanks
-- PMM
diff mbox series

Patch

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 109bc8e7b13..d4ab3934ed2 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2572,8 +2572,14 @@  int main(int argc, char **argv)
     qtest_add_func("/migration/auto_converge", test_migrate_auto_converge);
     qtest_add_func("/migration/multifd/tcp/plain/none",
                    test_multifd_tcp_none);
-    qtest_add_func("/migration/multifd/tcp/plain/cancel",
-                   test_multifd_tcp_cancel);
+    /*
+     * This test is flaky and sometimes fails in CI and otherwise:
+     * don't run unless user opts in via environment variable.
+     */
+    if (getenv("QEMU_TEST_FLAKY_TESTS")) {
+        qtest_add_func("/migration/multifd/tcp/plain/cancel",
+                       test_multifd_tcp_cancel);
+    }
     qtest_add_func("/migration/multifd/tcp/plain/zlib",
                    test_multifd_tcp_zlib);
 #ifdef CONFIG_ZSTD