
[v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci

Message ID 20240912151003.2045031-1-peter.maydell@linaro.org
State Accepted
Commit 1374ed49e1453c30023483e20d705c4321f19cff

Commit Message

Peter Maydell Sept. 12, 2024, 3:10 p.m. UTC
The cross-i686-tci CI job is persistently flaky with various tests
hitting timeouts.  One theory for why this is happening is that we're
running too many tests in parallel and so sometimes a test gets
starved of CPU and isn't able to complete within the timeout.

(The environment this CI job runs in seems to cause us to default
to a parallelism of 9 in the main CI.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
If this works we might be able to wind this up to -j2 or -j3,
and/or consider whether other CI jobs need something similar.
---
 .gitlab-ci.d/crossbuilds.yml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Thomas Huth Sept. 12, 2024, 4:48 p.m. UTC | #1
On 12/09/2024 17.10, Peter Maydell wrote:
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts.  One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
> 
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

As a start, we could also try replacing the

  JOBS=$(expr $(nproc) + 1)

with

  JOBS=$(nproc)

in the buildtest-template.yml file...?

> ---
>   .gitlab-ci.d/crossbuilds.yml | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
> index 459273f9da5..1e21d082aa4 100644
> --- a/.gitlab-ci.d/crossbuilds.yml
> +++ b/.gitlab-ci.d/crossbuilds.yml
> @@ -62,7 +62,11 @@ cross-i686-tci:
>       IMAGE: debian-i686-cross
>       ACCEL: tcg-interpreter
>       EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
> -    MAKE_CHECK_ARGS: check check-tcg
> +    # Force tests to run in series, to see whether this
> +    # reduces the flakiness of this CI job. The CI
> +    # environment by default shows us 8 CPUs and so we
> +    # would otherwise be using a parallelism of 9.
> +    MAKE_CHECK_ARGS: check check-tcg -j1

Reviewed-by: Thomas Huth <thuth@redhat.com>

Peter Maydell Sept. 13, 2024, 12:24 p.m. UTC | #2
On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts.  One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
>
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

I gave this a try, but unfortunately the result seems to be
that the whole job times out:
https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

Maybe we could try a compromise of -j3 or thereabouts...

-- PMM

Peter Maydell Sept. 13, 2024, 1:31 p.m. UTC | #3
On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > The cross-i686-tci CI job is persistently flaky with various tests
> > hitting timeouts.  One theory for why this is happening is that we're
> > running too many tests in parallel and so sometimes a test gets
> > starved of CPU and isn't able to complete within the timeout.
> >
> > (The environment this CI job runs in seems to cause us to default
> > to a parallelism of 9 in the main CI.)
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > If this works we might be able to wind this up to -j2 or -j3,
> > and/or consider whether other CI jobs need something similar.
>
> I gave this a try, but unfortunately the result seems to be
> that the whole job times out:
> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

...but then this simple retry passed with a runtime of 47 mins:

https://gitlab.com/qemu-project/qemu/-/jobs/7819225200

I'm tempted to commit this as-is, and see whether it helps.
If it doesn't I can always back it off to -j2, and if it does
generate a lot of full-job-timeouts it's only me it's annoying.

Looking at the timed-out job, it looks like it just took a lot
longer in the compile phase... (Though it's hard to say, because
the fact that we use "make all check-build" in our GitLab CI config
means GitLab treats this as all one step when it adds time
annotations, and you can't separate time-for-compile from
time-for-tests.)

-- PMM
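
[Editorial sketch, not part of the original message.] One way to tell
compile time from test time apart in a single CI step is to wrap each
phase in a small timing helper. The following is a minimal sketch under
stated assumptions: the function name timed_phase is hypothetical, and
the way the make commands are split in the usage comment is an
illustration, not QEMU's actual CI configuration.

```shell
#!/bin/sh
# Hypothetical helper: run a command, print its wall-clock duration,
# and pass the command's exit status back to the caller.
timed_phase() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?          # keep going so the timing is still printed
    end=$(date +%s)
    echo "phase '$label' took $((end - start))s (exit status $status)"
    return "$status"
}

# Illustrative usage in a CI script (this split is an assumption):
#   timed_phase compile make -j"$JOBS" all check-build
#   timed_phase tests   make $MAKE_CHECK_ARGS
```

With something like this in place, the job log would carry one duration
line per phase, regardless of how GitLab groups its own time annotations.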

Thomas Huth Sept. 13, 2024, 1:55 p.m. UTC | #4
On 13/09/2024 15.31, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>
>>> The cross-i686-tci CI job is persistently flaky with various tests
>>> hitting timeouts.  One theory for why this is happening is that we're
>>> running too many tests in parallel and so sometimes a test gets
>>> starved of CPU and isn't able to complete within the timeout.
>>>
>>> (The environment this CI job runs in seems to cause us to default
>>> to a parallelism of 9 in the main CI.)
>>>
>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>>> ---
>>> If this works we might be able to wind this up to -j2 or -j3,
>>> and/or consider whether other CI jobs need something similar.
>>
>> I gave this a try, but unfortunately the result seems to be
>> that the whole job times out:
>> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> 
> ...but then this simple retry passed with a runtime of 47 mins:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> 
> I'm tempted to commit this as-is, and see whether it helps.

FWIW, I just had a try with your patch, too, and it took 53 minutes:

  https://gitlab.com/thuth/qemu/-/jobs/7818945368

Older jobs without your patch seem to take roughly 25 to 30 minutes, so
the runtime definitely got much worse with -j1.

Considering that we're close to the 60-minute timeout, you might need to
bump the timeout of the job to 70 or 75 minutes now, to be on the safe side?
Or maybe really try -j2 first?

  Thomas

Daniel P. Berrangé Sept. 13, 2024, 2:05 p.m. UTC | #5
On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > The cross-i686-tci CI job is persistently flaky with various tests
> > > hitting timeouts.  One theory for why this is happening is that we're
> > > running too many tests in parallel and so sometimes a test gets
> > > starved of CPU and isn't able to complete within the timeout.
> > >
> > > (The environment this CI job runs in seems to cause us to default
> > > to a parallelism of 9 in the main CI.)
> > >
> > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > ---
> > > If this works we might be able to wind this up to -j2 or -j3,
> > > and/or consider whether other CI jobs need something similar.
> >
> > I gave this a try, but unfortunately the result seems to be
> > that the whole job times out:
> > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> 
> ...but then this simple retry passed with a runtime of 47 mins:
> 
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> 
> I'm tempted to commit this as-is, and see whether it helps.
> If it doesn't I can always back it off to -j2, and if it does
> generate a lot of full-job-timeouts it's only me it's annoying.

Anyone know how many vCPUs our k8s runners have ?

The GitLab runners that contributor forks use will have 2
vCPUs. So our current make -j$(expr $(nproc) + 1) will effectively
be -j3 already in pipelines for forks. IOW, we intentionally
slightly over-commit CPUs right now. Backing off to just
-j$(nproc) may be better than hardcoding -j1/-j2, so that
it takes account of different runner sizes?


With regards,
Daniel
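
[Editorial sketch, not part of the original message.] The options
discussed in this thread (nproc + 1, plain nproc, or a small fixed
value) can be compared with a tiny helper that derives the -j value
from the CPU count. This is a hedged illustration only: pick_jobs and
its optional cap argument are hypothetical and do not exist in QEMU's
CI scripts.

```shell
#!/bin/sh
# Hypothetical helper: choose a "make -j" value from a CPU count.
# Usage: pick_jobs CPUS [EXTRA] [MAX]
#   EXTRA is added to CPUS (1 gives the nproc + 1 over-commit scheme).
#   MAX caps the result; 0 or unset means no cap.
pick_jobs() {
    cpus=$1
    extra=${2:-0}
    max=${3:-0}
    n=$((cpus + extra))
    if [ "$max" -gt 0 ] && [ "$n" -gt "$max" ]; then
        n=$max
    fi
    if [ "$n" -lt 1 ]; then
        n=1                    # never go below one job
    fi
    echo "$n"
}

pick_jobs 8 1      # nproc + 1 on a runner reporting 8 CPUs: prints 9
pick_jobs 2 1      # nproc + 1 on a 2-vCPU fork runner: prints 3
pick_jobs 8 0      # plain nproc on 8 CPUs: prints 8
pick_jobs 8 0 3    # plain nproc with an illustrative cap of 3: prints 3
```

In a real job this would be called as JOBS=$(pick_jobs "$(nproc)" 0),
so the value follows whatever CPU count the runner reports.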

Peter Maydell Sept. 13, 2024, 2:23 p.m. UTC | #6
On Fri, 13 Sept 2024 at 15:05, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> > On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > The cross-i686-tci CI job is persistently flaky with various tests
> > > > hitting timeouts.  One theory for why this is happening is that we're
> > > > running too many tests in parallel and so sometimes a test gets
> > > > starved of CPU and isn't able to complete within the timeout.
> > > >
> > > > (The environment this CI job runs in seems to cause us to default
> > > > to a parallelism of 9 in the main CI.)
> > > >
> > > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > > ---
> > > > If this works we might be able to wind this up to -j2 or -j3,
> > > > and/or consider whether other CI jobs need something similar.
> > >
> > > I gave this a try, but unfortunately the result seems to be
> > > that the whole job times out:
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> >
> > ...but then this simple retry passed with a runtime of 47 mins:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> >
> > I'm tempted to commit this as-is, and see whether it helps.
> > If it doesn't I can always back it off to -j2, and if it does
> > generate a lot of full-job-timeouts it's only me it's annoying.
>
> Anyone know how many vCPUs our k8s runners have ?

They report as 8, I think, given that in the main CI run this
job gets run as -j9. But we clearly aren't actually getting
a reliable 9 CPUs worth.

-- PMM

Patch

diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
index 459273f9da5..1e21d082aa4 100644
--- a/.gitlab-ci.d/crossbuilds.yml
+++ b/.gitlab-ci.d/crossbuilds.yml
@@ -62,7 +62,11 @@  cross-i686-tci:
     IMAGE: debian-i686-cross
     ACCEL: tcg-interpreter
     EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
-    MAKE_CHECK_ARGS: check check-tcg
+    # Force tests to run in series, to see whether this
+    # reduces the flakiness of this CI job. The CI
+    # environment by default shows us 8 CPUs and so we
+    # would otherwise be using a parallelism of 9.
+    MAKE_CHECK_ARGS: check check-tcg -j1
 
 cross-mipsel-system:
   extends: .cross_system_build_job