Message ID | 20240912151003.2045031-1-peter.maydell@linaro.org |
---|---|
State | Accepted |
Commit | 1374ed49e1453c30023483e20d705c4321f19cff |
Series | [v2] .gitlab-ci.d/crossbuilds.yml: Force 'make check' single-threaded for cross-i686-tci |
On 12/09/2024 17.10, Peter Maydell wrote:
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts. One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
>
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

As a start, we could also try replacing the

    JOBS=$(expr $(nproc) + 1)

with

    JOBS=$(nproc)

in the buildtest-template.yml file...?

> ---
>  .gitlab-ci.d/crossbuilds.yml | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
> index 459273f9da5..1e21d082aa4 100644
> --- a/.gitlab-ci.d/crossbuilds.yml
> +++ b/.gitlab-ci.d/crossbuilds.yml
> @@ -62,7 +62,11 @@ cross-i686-tci:
>      IMAGE: debian-i686-cross
>      ACCEL: tcg-interpreter
>      EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
> -    MAKE_CHECK_ARGS: check check-tcg
> +    # Force tests to run in series, to see whether this
> +    # reduces the flakiness of this CI job. The CI
> +    # environment by default shows us 8 CPUs and so we
> +    # would otherwise be using a parallelism of 9.
> +    MAKE_CHECK_ARGS: check check-tcg -j1

Reviewed-by: Thomas Huth <thuth@redhat.com>
On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> The cross-i686-tci CI job is persistently flaky with various tests
> hitting timeouts. One theory for why this is happening is that we're
> running too many tests in parallel and so sometimes a test gets
> starved of CPU and isn't able to complete within the timeout.
>
> (The environment this CI job runs in seems to cause us to default
> to a parallelism of 9 in the main CI.)
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> If this works we might be able to wind this up to -j2 or -j3,
> and/or consider whether other CI jobs need something similar.

I gave this a try, but unfortunately the result seems to be
that the whole job times out:
https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

Maybe we could try a compromise of -j3 or thereabouts...

-- PMM
On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > The cross-i686-tci CI job is persistently flaky with various tests
> > hitting timeouts. One theory for why this is happening is that we're
> > running too many tests in parallel and so sometimes a test gets
> > starved of CPU and isn't able to complete within the timeout.
> >
> > (The environment this CI job runs in seems to cause us to default
> > to a parallelism of 9 in the main CI.)
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > If this works we might be able to wind this up to -j2 or -j3,
> > and/or consider whether other CI jobs need something similar.
>
> I gave this a try, but unfortunately the result seems to be
> that the whole job times out:
> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897

...but then this simple retry passed with a runtime of 47 mins:

https://gitlab.com/qemu-project/qemu/-/jobs/7819225200

I'm tempted to commit this as-is, and see whether it helps.
If it doesn't I can always back it off to -j2, and if it does
generate a lot of full-job-timeouts it's only me it's annoying.

Looking at the timed-out job it looks like it just took a lot
longer on the compile phase...

(Though it's hard to say because the fact we use
"make all check-build" in our gitlab CI config means gitlab
treats this as all one step when it adds time annotations, and
you can't separate time-for-compile from time-for-tests.)

-- PMM
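The difficulty of separating compile time from test time could, in principle, be worked around by timing each phase explicitly in the job script. The following is only a minimal sketch under stated assumptions: the `timed` helper is hypothetical and not part of QEMU's CI configuration, and `true` and `sleep 1` stand in for the real build and test commands.

```shell
#!/bin/sh
# Hypothetical helper (not from the QEMU CI scripts): run a command and
# report its wall-clock duration in seconds along with its exit status.
timed() {
    label=$1
    shift
    start=$(date +%s)
    status=0
    "$@" || status=$?   # capture a failure without aborting under 'set -e'
    end=$(date +%s)
    echo "$label took $((end - start))s (exit status $status)"
    return "$status"
}

# Stand-in commands; a real job would put its build and test invocations here.
timed "compile" true
timed "tests" sleep 1
```

Each phase then gets its own line in the job log, so a slow compile can be told apart from slow tests even when GitLab reports the script as a single step.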
On 13/09/2024 15.31, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
>>>
>>> The cross-i686-tci CI job is persistently flaky with various tests
>>> hitting timeouts. One theory for why this is happening is that we're
>>> running too many tests in parallel and so sometimes a test gets
>>> starved of CPU and isn't able to complete within the timeout.
>>>
>>> (The environment this CI job runs in seems to cause us to default
>>> to a parallelism of 9 in the main CI.)
>>>
>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>>> ---
>>> If this works we might be able to wind this up to -j2 or -j3,
>>> and/or consider whether other CI jobs need something similar.
>>
>> I gave this a try, but unfortunately the result seems to be
>> that the whole job times out:
>> https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
>
> ...but then this simple retry passed with a runtime of 47 mins:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
>
> I'm tempted to commit this as-is, and see whether it helps.

FWIW, I just had a try with your patch, too, and it took 53 minutes:

https://gitlab.com/thuth/qemu/-/jobs/7818945368

Older jobs without your patch seem to take ~ 25 to ~ 30 minutes instead, so
the runtime got definitely much worse by the -j1. Considering that we're
close to the 60 minutes timeout, you might need to bump the timeout of the
job to 70 or 75 minutes now, to be on the safe side? Or maybe really try -j2
first?

Thomas
On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > The cross-i686-tci CI job is persistently flaky with various tests
> > > hitting timeouts. One theory for why this is happening is that we're
> > > running too many tests in parallel and so sometimes a test gets
> > > starved of CPU and isn't able to complete within the timeout.
> > >
> > > (The environment this CI job runs in seems to cause us to default
> > > to a parallelism of 9 in the main CI.)
> > >
> > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > ---
> > > If this works we might be able to wind this up to -j2 or -j3,
> > > and/or consider whether other CI jobs need something similar.
> >
> > I gave this a try, but unfortunately the result seems to be
> > that the whole job times out:
> > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
>
> ...but then this simple retry passed with a runtime of 47 mins:
>
> https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
>
> I'm tempted to commit this as-is, and see whether it helps.
> If it doesn't I can always back it off to -j2, and if it does
> generate a lot of full-job-timeouts it's only me it's annoying.

Anyone know how many vCPUs our k8s runners have ?

The gitlab runners that contributor forks use will have 2 vCPUs.
So our current make -j$(nproc+1) will be effectively -j3 already
in pipelines for forks. IOW, we intentionally slightly over-commit
CPUs right now. Backing off to just -j$(nproc) may be better than
hardcoding -j1/-j2, so that it takes account of different runner
sizes ?

With regards,
Daniel
On Fri, 13 Sept 2024 at 15:05, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Fri, Sep 13, 2024 at 02:31:34PM +0100, Peter Maydell wrote:
> > On Fri, 13 Sept 2024 at 13:24, Peter Maydell <peter.maydell@linaro.org> wrote:
> > >
> > > On Thu, 12 Sept 2024 at 16:10, Peter Maydell <peter.maydell@linaro.org> wrote:
> > > >
> > > > The cross-i686-tci CI job is persistently flaky with various tests
> > > > hitting timeouts. One theory for why this is happening is that we're
> > > > running too many tests in parallel and so sometimes a test gets
> > > > starved of CPU and isn't able to complete within the timeout.
> > > >
> > > > (The environment this CI job runs in seems to cause us to default
> > > > to a parallelism of 9 in the main CI.)
> > > >
> > > > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > > > ---
> > > > If this works we might be able to wind this up to -j2 or -j3,
> > > > and/or consider whether other CI jobs need something similar.
> > >
> > > I gave this a try, but unfortunately the result seems to be
> > > that the whole job times out:
> > > https://gitlab.com/qemu-project/qemu/-/jobs/7818441897
> >
> > ...but then this simple retry passed with a runtime of 47 mins:
> >
> > https://gitlab.com/qemu-project/qemu/-/jobs/7819225200
> >
> > I'm tempted to commit this as-is, and see whether it helps.
> > If it doesn't I can always back it off to -j2, and if it does
> > generate a lot of full-job-timeouts it's only me it's annoying.
>
> Anyone know how many vCPUs our k8s runners have ?

They report as 8, I think, given that in the main CI run
this job gets run as -j9. But we clearly aren't actually
getting a reliable 9 CPUs worth.

-- PMM
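The arithmetic behind the two runner sizes mentioned in the thread can be sketched as follows. This is an illustration only: the function names `jobs_overcommit` and `jobs_exact` are hypothetical, and only the two formulas, `expr $(nproc) + 1` and `$(nproc)`, come from the discussion. The CPU counts are passed in as arguments so the result does not depend on the machine running the sketch.

```shell
#!/bin/sh
# Hypothetical helper names; only the two formulas come from the thread.

# Over-commit by one job, as in JOBS=$(expr $(nproc) + 1).
jobs_overcommit() {
    expr "$1" + 1
}

# Match the reported CPU count, as in the proposed JOBS=$(nproc).
jobs_exact() {
    echo "$1"
}

# 2 vCPUs: runners used by contributor forks; 8: what the main CI reports.
for cpus in 2 8; do
    echo "cpus=$cpus overcommit=-j$(jobs_overcommit "$cpus") exact=-j$(jobs_exact "$cpus")"
done
```

With 2 reported CPUs the over-committed value is -j3 and with 8 it is -j9, matching the figures quoted above; the `$(nproc)` variant would give -j2 and -j8 respectively.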
diff --git a/.gitlab-ci.d/crossbuilds.yml b/.gitlab-ci.d/crossbuilds.yml
index 459273f9da5..1e21d082aa4 100644
--- a/.gitlab-ci.d/crossbuilds.yml
+++ b/.gitlab-ci.d/crossbuilds.yml
@@ -62,7 +62,11 @@ cross-i686-tci:
     IMAGE: debian-i686-cross
     ACCEL: tcg-interpreter
     EXTRA_CONFIGURE_OPTS: --target-list=i386-softmmu,i386-linux-user,aarch64-softmmu,aarch64-linux-user,ppc-softmmu,ppc-linux-user --disable-plugins --disable-kvm
-    MAKE_CHECK_ARGS: check check-tcg
+    # Force tests to run in series, to see whether this
+    # reduces the flakiness of this CI job. The CI
+    # environment by default shows us 8 CPUs and so we
+    # would otherwise be using a parallelism of 9.
+    MAKE_CHECK_ARGS: check check-tcg -j1
 
 cross-mipsel-system:
   extends: .cross_system_build_job
The cross-i686-tci CI job is persistently flaky with various tests
hitting timeouts. One theory for why this is happening is that we're
running too many tests in parallel and so sometimes a test gets
starved of CPU and isn't able to complete within the timeout.

(The environment this CI job runs in seems to cause us to default
to a parallelism of 9 in the main CI.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
If this works we might be able to wind this up to -j2 or -j3,
and/or consider whether other CI jobs need something similar.
---
 .gitlab-ci.d/crossbuilds.yml | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)