Message ID | 20190930192931.20509-1-richard.henderson@linaro.org |
---|---|
State | New |
Headers | show |
Series | user-exec: Do not filter the signal on si_code | expand |
Patchew URL: https://patchew.org/QEMU/20190930192931.20509-1-richard.henderson@linaro.org/ Hi, This series failed the asan build test. Please find the testing commands and their output below. If you have Docker installed, you can probably reproduce it locally. === TEST SCRIPT BEGIN === #!/bin/bash export ARCH=x86_64 make docker-image-fedora V=1 NETWORK=1 time make docker-test-debug@fedora TARGET_LIST=x86_64-softmmu J=14 NETWORK=1 === TEST SCRIPT END === The full log is available at http://patchew.org/logs/20190930192931.20509-1-richard.henderson@linaro.org/testing.asan/?type=message. --- Email generated automatically by Patchew [https://patchew.org/]. Please send your feedback to patchew-devel@redhat.com
On 9/30/19 12:29 PM, Richard Henderson wrote: > This is a workaround for a ppc64le host kernel bug. > > For the test case linux-test, we have an instruction trace > > IN: sig_alarm > ... > > IN: > 0x400080ed28: 380000ac li r0, 0xac > 0x400080ed2c: 44000002 sc > > IN: __libc_nanosleep > 0x1003bb4c: 7c0802a6 mflr r0 > 0x1003bb50: f8010010 std r0, 0x10(r1) > > Our signal return trampoline has, rightly, changed the guest > stack page read-only. Which, rightly, faults on the store of > a return address into a stack frame. > > Checking the host /proc/pid/maps, we see the expected state: > > 4000800000-4000810000 r--p 00000000 00:00 0 > > However, the host kernel has supplied si_code == SEGV_MAPERR, > which is obviously incorrect. > > By dropping this check, we may have an extra walk of the page > tables, but this should be inexpensive. > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org> > --- > > FWIW, filed as > > https://bugzilla.redhat.com/show_bug.cgi?id=1757189 > > out of habit and then > > https://bugs.centos.org/view.php?id=16499 > > when I remembered that the system is running Centos not RHEL. > > --- > accel/tcg/user-exec.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c > index 71c4bf6477..31ef091a70 100644 > --- a/accel/tcg/user-exec.c > +++ b/accel/tcg/user-exec.c > @@ -143,9 +143,12 @@ static inline int handle_cpu_signal(uintptr_t pc, siginfo_t *info, > * for some other kind of fault that should really be passed to the > * guest, we'd end up in an infinite loop of retrying the faulting > * access. > + * > + * XXX: At least one host kernel, ppc64le w/Centos 7 4.14.0-115.6.1, > + * incorrectly reports SEGV_MAPERR for a STDX write to a read-only page. > + * Therefore, do not test info->si_code. > */ > - if (is_write && info->si_signo == SIGSEGV && info->si_code == SEGV_ACCERR && > - h2g_valid(address)) { > + if (is_write && info->si_signo == SIGSEGV && h2g_valid(address)) { Ho hum. This change is in conflict with Peter's long comment; I should have read the context more thoroughly. There is an even longer comment with the patch description: 9c4bbee9e3b83544257e82566342c29e15a88637 The SEGV_ACCERR check here is to prevent a loop by which page_unprotect races with itself and, from Peter's analysis, > * ...but when B gets the mmap lock it finds that the page is already > PAGE_WRITE, and so it exits page_unprotect() via the "not due to > protected translation" code path, and wrongly delivers the signal > to the guest rather than just retrying the access This bug was fixed in the referenced patch. But then continues: > Since this would cause an infinite loop if we ever called > page_unprotect() for some other kind of fault than "write failed due > to bad access permissions", tighten the condition in > handle_cpu_signal() to check the signal number and si_code, and add a > comment so that if somebody does ever find themselves debugging an > infinite loop of faults they have some clue about why. > > (The trick for identifying the correct setting for > current_tb_invalidated for thread B (needed to handle the precise-SMC > case) is due to Richard Henderson. Paolo Bonzini suggested just > relying on si_code rather than trying anything more complicated.) It is disappointing about the kernel bug. But since this affects Centos 7, which is what *all* of the gcc compile farm ppc64 machines use, I think we need to work around it somehow. Should we simply add SEGV_MAPERR to the set of allowed si_code, to directly work around the bug? If we got that code from a kernel without the bug, then page_find should fail to find an entry, and we should then indicate that the signal should be passed to the guest. Thoughts? r~
On Mon, 30 Sep 2019 at 22:01, Richard Henderson <richard.henderson@linaro.org> wrote: > > On 9/30/19 12:29 PM, Richard Henderson wrote: > > This is a workaround for a ppc64le host kernel bug. > > However, the host kernel has supplied si_code == SEGV_MAPERR, > > which is obviously incorrect. > It is disappointing about the kernel bug. But since this affects Centos 7, > which is what *all* of the gcc compile farm ppc64 machines use, I think we need > to work around it somehow. We knew about the ppc kernel bug in 2018: https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg06049.html and the decision at that time was to say "kernel bug, update your kernel" :-) thanks -- PMM
Le 01/10/2019 à 12:34, Peter Maydell a écrit : > On Mon, 30 Sep 2019 at 22:01, Richard Henderson > <richard.henderson@linaro.org> wrote: >> >> On 9/30/19 12:29 PM, Richard Henderson wrote: >>> This is a workaround for a ppc64le host kernel bug. > >>> However, the host kernel has supplied si_code == SEGV_MAPERR, >>> which is obviously incorrect. > >> It is disappointing about the kernel bug. But since this affects Centos 7, >> which is what *all* of the gcc compile farm ppc64 machines use, I think we need >> to work around it somehow. > > We knew about the ppc kernel bug in 2018: > https://lists.gnu.org/archive/html/qemu-devel/2018-03/msg06049.html > and the decision at that time was to say "kernel bug, update your > kernel" :-) Is it possible to update the farm to Centos 8? Or as the kernel involved is specifically for POWER9, is it possible to use only POWER8? Thanks, Laurent
On Tue, 1 Oct 2019 at 12:19, Laurent Vivier <laurent@vivier.eu> wrote: > Is it possible to update the farm to Centos 8? > > Or as the kernel involved is specifically for POWER9, is it possible to > use only POWER8? My experience is that the gcc cfarm admins aren't in principle against the idea of upgrading farm machines, but in practice they tend to have a shortage of effort. If there's a centos-7-kernel-update package that could be installed without doing a full distro upgrade that would probably be pretty easy to ask them to arrange. thanks -- PMM
Le 01/10/2019 à 13:46, Peter Maydell a écrit : > On Tue, 1 Oct 2019 at 12:19, Laurent Vivier <laurent@vivier.eu> wrote: >> Is it possible to update the farm to Centos 8? >> >> Or as the kernel involved is specifically for POWER9, is it possible to >> use only POWER8? > > My experience is that the gcc cfarm admins aren't in > principle against the idea of upgrading farm machines, > but in practice they tend to have a shortage of effort. > If there's a centos-7-kernel-update package that could > be installed without doing a full distro upgrade that > would probably be pretty easy to ask them to arrange. It seems Centos provides a 4.18 kernel for POWER9 on Centos 7: http://mirror.centos.org/altarch/7/os/power9/Packages/kernel-4.18.0-80.7.2.el7.ppc64le.rpm Thanks, Laurent
On 10/1/19 6:15 AM, Laurent Vivier wrote: > Le 01/10/2019 à 13:46, Peter Maydell a écrit : >> On Tue, 1 Oct 2019 at 12:19, Laurent Vivier <laurent@vivier.eu> wrote: >>> Is it possible to update the farm to Centos 8? >>> >>> Or as the kernel involved is specifically for POWER9, is it possible to >>> use only POWER8? >> >> My experience is that the gcc cfarm admins aren't in >> principle against the idea of upgrading farm machines, >> but in practice they tend to have a shortage of effort. >> If there's a centos-7-kernel-update package that could >> be installed without doing a full distro upgrade that >> would probably be pretty easy to ask them to arrange. > > It seems Centos provides a 4.18 kernel for POWER9 on Centos 7: > > http://mirror.centos.org/altarch/7/os/power9/Packages/kernel-4.18.0-80.7.2.el7.ppc64le.rpm Thanks guys. I've sent a message to the admins asking for an update on gcc135. r~
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c index 71c4bf6477..31ef091a70 100644 --- a/accel/tcg/user-exec.c +++ b/accel/tcg/user-exec.c @@ -143,9 +143,12 @@ static inline int handle_cpu_signal(uintptr_t pc, siginfo_t *info, * for some other kind of fault that should really be passed to the * guest, we'd end up in an infinite loop of retrying the faulting * access. + * + * XXX: At least one host kernel, ppc64le w/Centos 7 4.14.0-115.6.1, + * incorrectly reports SEGV_MAPERR for a STDX write to a read-only page. + * Therefore, do not test info->si_code. */ - if (is_write && info->si_signo == SIGSEGV && info->si_code == SEGV_ACCERR && - h2g_valid(address)) { + if (is_write && info->si_signo == SIGSEGV && h2g_valid(address)) { switch (page_unprotect(h2g(address), pc)) { case 0: /* Fault not caused by a page marked unwritable to protect
This is a workaround for a ppc64le host kernel bug. For the test case linux-test, we have an instruction trace IN: sig_alarm ... IN: 0x400080ed28: 380000ac li r0, 0xac 0x400080ed2c: 44000002 sc IN: __libc_nanosleep 0x1003bb4c: 7c0802a6 mflr r0 0x1003bb50: f8010010 std r0, 0x10(r1) Our signal return trampoline has, rightly, changed the guest stack page read-only. Which, rightly, faults on the store of a return address into a stack frame. Checking the host /proc/pid/maps, we see the expected state: 4000800000-4000810000 r--p 00000000 00:00 0 However, the host kernel has supplied si_code == SEGV_MAPERR, which is obviously incorrect. By dropping this check, we may have an extra walk of the page tables, but this should be inexpensive. Signed-off-by: Richard Henderson <richard.henderson@linaro.org> --- FWIW, filed as https://bugzilla.redhat.com/show_bug.cgi?id=1757189 out of habit and then https://bugs.centos.org/view.php?id=16499 when I remembered that the system is running Centos not RHEL. --- accel/tcg/user-exec.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) -- 2.17.1