Message ID | 20190815211843.22799-2-adhemerval.zanella@linaro.org |
---|---|
State | New |
Headers | show |
Series | [1/4] Add RTLD_SINGLE_THREAD_P on generic single-thread.h | expand |
I will push shortly if no one opposes it. On 15/08/2019 18:18, Adhemerval Zanella wrote: > The SIGPIPE can be handled before SIGCANCEL, which makes write fail > and the thread return a non expected result. > > Checked on x86_64-linux-gnu. > > * nptl/tst-cancel2.c (tf): Do not abort with EPIPE. > --- > nptl/tst-cancel2.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c > index 1f0429d343..632ea4e0ae 100644 > --- a/nptl/tst-cancel2.c > +++ b/nptl/tst-cancel2.c > @@ -20,6 +20,7 @@ > #include <signal.h> > #include <stdio.h> > #include <unistd.h> > +#include <errno.h> > > > static int fd[2]; > @@ -32,7 +33,14 @@ tf (void *arg) > write blocks. */ > char buf[100000]; > > - while (write (fd[1], buf, sizeof (buf)) > 0); > + while (1) > + { > + /* Ignore EPIPE errors for case the SIGPIPE is handle before > + SIGCANCEL. */ > + ssize_t ret = write (fd[1], buf, sizeof (buf)); > + if (ret == 0 || (ret == -1 && errno != EPIPE)) > + break; > + } > > return (void *) 42l; > } >
* Adhemerval Zanella: > The SIGPIPE can be handled before SIGCANCEL, which makes write fail > and the thread return a non expected result. > > Checked on x86_64-linux-gnu. > > * nptl/tst-cancel2.c (tf): Do not abort with EPIPE. > --- > nptl/tst-cancel2.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c > index 1f0429d343..632ea4e0ae 100644 > --- a/nptl/tst-cancel2.c > +++ b/nptl/tst-cancel2.c > @@ -20,6 +20,7 @@ > #include <signal.h> > #include <stdio.h> > #include <unistd.h> > +#include <errno.h> > > > static int fd[2]; > @@ -32,7 +33,14 @@ tf (void *arg) > write blocks. */ > char buf[100000]; > > - while (write (fd[1], buf, sizeof (buf)) > 0); > + while (1) > + { > + /* Ignore EPIPE errors for case the SIGPIPE is handle before > + SIGCANCEL. */ > + ssize_t ret = write (fd[1], buf, sizeof (buf)); > + if (ret == 0 || (ret == -1 && errno != EPIPE)) > + break; > + } > > return (void *) 42l; > } I disagree with this change. SIGPIPE does not appear to be a valid error from write in this test because the thread is canceled before the read end of the pipe is closed. POSIX says this: | The cancellation processing in the target thread shall run | asynchronously with respect to the calling thread returning from | pthread_cancel(). But that doesn't mean that the thread can observe actions in its uncanceled state that happen after the cancellation, which is the case when the write fails with SIGPIPE. Thanks, Florian
On 20/08/2019 07:23, Florian Weimer wrote: > * Adhemerval Zanella: > >> The SIGPIPE can be handled before SIGCANCEL, which makes write fail >> and the thread return a non expected result. >> >> Checked on x86_64-linux-gnu. >> >> * nptl/tst-cancel2.c (tf): Do not abort with EPIPE. >> --- >> nptl/tst-cancel2.c | 10 +++++++++- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c >> index 1f0429d343..632ea4e0ae 100644 >> --- a/nptl/tst-cancel2.c >> +++ b/nptl/tst-cancel2.c >> @@ -20,6 +20,7 @@ >> #include <signal.h> >> #include <stdio.h> >> #include <unistd.h> >> +#include <errno.h> >> >> >> static int fd[2]; >> @@ -32,7 +33,14 @@ tf (void *arg) >> write blocks. */ >> char buf[100000]; >> >> - while (write (fd[1], buf, sizeof (buf)) > 0); >> + while (1) >> + { >> + /* Ignore EPIPE errors for case the SIGPIPE is handle before >> + SIGCANCEL. */ >> + ssize_t ret = write (fd[1], buf, sizeof (buf)); >> + if (ret == 0 || (ret == -1 && errno != EPIPE)) >> + break; >> + } >> >> return (void *) 42l; >> } > > I disagree with this change. SIGPIPE does not appear to be a valid > error from write in this test because the thread is canceled before the > read end of the pipe is closed. POSIX says this: > > | The cancellation processing in the target thread shall run > | asynchronously with respect to the calling thread returning from > | pthread_cancel(). > > But that doesn't mean that the thread can observe actions in its > uncanceled state that happen after the cancellation, which is the case > when the write fails with SIGPIPE. My understanding is since thread cancellation is implemented with an asynchronous signal and there is no *extra* synchronization between the pthread_cancel from main thread and write call on the create one, there is no happens before order between them. It might the case where, due scheduling, the cancellation signal is triggered on the test during the write call and it is a side effect that should be handled. This is exactly the kind of race conditions BZ#12683 make more explicit.
* Adhemerval Zanella: > On 20/08/2019 07:23, Florian Weimer wrote: >> * Adhemerval Zanella: >> >>> The SIGPIPE can be handled before SIGCANCEL, which makes write fail >>> and the thread return a non expected result. >>> >>> Checked on x86_64-linux-gnu. >>> >>> * nptl/tst-cancel2.c (tf): Do not abort with EPIPE. >>> --- >>> nptl/tst-cancel2.c | 10 +++++++++- >>> 1 file changed, 9 insertions(+), 1 deletion(-) >>> >>> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c >>> index 1f0429d343..632ea4e0ae 100644 >>> --- a/nptl/tst-cancel2.c >>> +++ b/nptl/tst-cancel2.c >>> @@ -20,6 +20,7 @@ >>> #include <signal.h> >>> #include <stdio.h> >>> #include <unistd.h> >>> +#include <errno.h> >>> >>> >>> static int fd[2]; >>> @@ -32,7 +33,14 @@ tf (void *arg) >>> write blocks. */ >>> char buf[100000]; >>> >>> - while (write (fd[1], buf, sizeof (buf)) > 0); >>> + while (1) >>> + { >>> + /* Ignore EPIPE errors for case the SIGPIPE is handle before >>> + SIGCANCEL. */ >>> + ssize_t ret = write (fd[1], buf, sizeof (buf)); >>> + if (ret == 0 || (ret == -1 && errno != EPIPE)) >>> + break; >>> + } >>> >>> return (void *) 42l; >>> } >> >> I disagree with this change. SIGPIPE does not appear to be a valid >> error from write in this test because the thread is canceled before the >> read end of the pipe is closed. POSIX says this: >> >> | The cancellation processing in the target thread shall run >> | asynchronously with respect to the calling thread returning from >> | pthread_cancel(). >> >> But that doesn't mean that the thread can observe actions in its >> uncanceled state that happen after the cancellation, which is the case >> when the write fails with SIGPIPE. > > My understanding is since thread cancellation is implemented with an > asynchronous signal and there is no *extra* synchronization between the > pthread_cancel from main thread and write call on the create one, there > is no happens before order between them. > > It might the case where, due scheduling, the cancellation signal is > triggered on the test during the write call and it is a side effect that > should be handled. This is exactly the kind of race conditions BZ#12683 > make more explicit. Hmm. What are the write call sequences for this test? I can think of the following scenarios: (A) The write call is canceled immediately and never returns. (B) The write call returns a positive value. The second write call is canceled. (C) The first write call returns a positive value. The second write call fails with with EPIPE. The third write call is canceled. I have serious doubts that that (C) is a valid sequence. Practically speaking, I think it will lead to regressions because threads may fail to act upon cancellation compared to what we have today. This works out for the test because there is an infinite number of write calls. Looking at the actual cancellation change, I see this: + /* Call the arch-specific entry points that contains the globals markers + to be checked by SIGCANCEL handler. */ + result = __syscall_cancel_arch (&pd->cancelhandling, nr, a1, a2, a3, a4, a5, + a6); + + if ((result == -EINTR) + && (pd->cancelhandling & CANCELED_BITMASK) + && !(pd->cancelhandling & CANCELSTATE_BITMASK)) + __do_cancel (); Why the restriction to EINTR? Any signal can result in EINTR. And with SA_RESTART, the SIGCANCEL handler will not even result in EINTR. Thanks, Florian
On 20/08/2019 09:59, Florian Weimer wrote: > * Adhemerval Zanella: > >> On 20/08/2019 07:23, Florian Weimer wrote: >>> * Adhemerval Zanella: >>> >>>> The SIGPIPE can be handled before SIGCANCEL, which makes write fail >>>> and the thread return a non expected result. >>>> >>>> Checked on x86_64-linux-gnu. >>>> >>>> * nptl/tst-cancel2.c (tf): Do not abort with EPIPE. >>>> --- >>>> nptl/tst-cancel2.c | 10 +++++++++- >>>> 1 file changed, 9 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c >>>> index 1f0429d343..632ea4e0ae 100644 >>>> --- a/nptl/tst-cancel2.c >>>> +++ b/nptl/tst-cancel2.c >>>> @@ -20,6 +20,7 @@ >>>> #include <signal.h> >>>> #include <stdio.h> >>>> #include <unistd.h> >>>> +#include <errno.h> >>>> >>>> >>>> static int fd[2]; >>>> @@ -32,7 +33,14 @@ tf (void *arg) >>>> write blocks. */ >>>> char buf[100000]; >>>> >>>> - while (write (fd[1], buf, sizeof (buf)) > 0); >>>> + while (1) >>>> + { >>>> + /* Ignore EPIPE errors for case the SIGPIPE is handle before >>>> + SIGCANCEL. */ >>>> + ssize_t ret = write (fd[1], buf, sizeof (buf)); >>>> + if (ret == 0 || (ret == -1 && errno != EPIPE)) >>>> + break; >>>> + } >>>> >>>> return (void *) 42l; >>>> } >>> >>> I disagree with this change. SIGPIPE does not appear to be a valid >>> error from write in this test because the thread is canceled before the >>> read end of the pipe is closed. POSIX says this: >>> >>> | The cancellation processing in the target thread shall run >>> | asynchronously with respect to the calling thread returning from >>> | pthread_cancel(). >>> >>> But that doesn't mean that the thread can observe actions in its >>> uncanceled state that happen after the cancellation, which is the case >>> when the write fails with SIGPIPE. >> >> My understanding is since thread cancellation is implemented with an >> asynchronous signal and there is no *extra* synchronization between the >> pthread_cancel from main thread and write call on the create one, there >> is no happens before order between them. >> >> It might the case where, due scheduling, the cancellation signal is >> triggered on the test during the write call and it is a side effect that >> should be handled. This is exactly the kind of race conditions BZ#12683 >> make more explicit. > > Hmm. What are the write call sequences for this test? > > I can think of the following scenarios: > > (A) The write call is canceled immediately and never returns. > > (B) The write call returns a positive value. The second write call is > canceled. > > (C) The first write call returns a positive value. The second write > call fails with with EPIPE. The third write call is canceled. > > I have serious doubts that that (C) is a valid sequence. Practically > speaking, I think it will lead to regressions because threads may fail > to act upon cancellation compared to what we have today. This works out > for the test because there is an infinite number of write calls. The idea of using a large buffer is exactly to force the write to block indefinitely. Ideally the test should use F_SETPIPE_SZ to make it sure it would block on first call, but current scheme might issue multiple calls until buffer is full (which usually does not trigger because SIGCANCEL is usually triggered beforehand). What happens for current asynchronous handling is if the syscall is indeed interrupted, it will call __pthread_disable_asynccancel and then it will wait on futex_wait_simple (line 97) until the cancellation handling is done. It waits on the futex because the pthread_cancel sets the CANCELING_BITMASK | CANCELED_BITMASK. The sigcancel_handler then acts upon cancellation and finishes the thread. That's the *very* racy issue BZ#12683 aims to fix, since current cancellation does not give the information back to program that a side-effect of the sycall has happens (in this case EPIPE). For tst-cancel2.c, if I add a sleep (1) between pthread_create and pthread_cancel you can see this issue more clearly (dump with strace): [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 ########### Cancellation start to act here, by loading the libgcc to unwinding [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 [pid 2586] close(5) = 0 [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 [pid 2586] close(5) = 0 [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 [pid 2586] munmap(0x7fffabf00000, 63776) = 0 [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 [pid 2586] close(3) = 0 [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 ########### No side-effects reported back to program [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 [pid 2587] exit(0) = ? With BZ#12683 fix the cancellation is not acted upon and the testcase then fails depending whether the write is interrupted or not by the cancellation signal. > > Looking at the actual cancellation change, I see this: > > + /* Call the arch-specific entry points that contains the globals markers > + to be checked by SIGCANCEL handler. */ > + result = __syscall_cancel_arch (&pd->cancelhandling, nr, a1, a2, a3, a4, a5, > + a6); > + > + if ((result == -EINTR) > + && (pd->cancelhandling & CANCELED_BITMASK) > + && !(pd->cancelhandling & CANCELSTATE_BITMASK)) > + __do_cancel (); > > Why the restriction to EINTR? Any signal can result in EINTR. And with > SA_RESTART, the SIGCANCEL handler will not even result in EINTR. The EINTR was a special case for __NR_close from original Rich suggestion, which I think we will need to discuss further it is indeed make sense for glibc (it is the long-standing issue regarding of close failure and how to handle it). I removed it in the current version.
* Adhemerval Zanella: > For tst-cancel2.c, if I add a sleep (1) between pthread_create and > pthread_cancel you can see this issue more clearly (dump with strace): > > [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 > [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> > [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 > > ########### Cancellation start to act here, by loading the libgcc to unwinding > [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 > [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 > [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 > [pid 2586] close(5) = 0 > [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 > [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 > [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 > [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 > [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 > [pid 2586] close(5) = 0 > [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 > [pid 2586] munmap(0x7fffabf00000, 63776) = 0 > [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 > [pid 2586] close(3) = 0 > [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> > > ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called > [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) > [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- > [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- > [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 > > ########### No side-effects reported back to program > [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 > [pid 2587] exit(0) = ? > > With BZ#12683 fix the cancellation is not acted upon and the testcase then fails > depending whether the write is interrupted or not by the cancellation signal. Hmm. Which cancellation implementation is this? At which point in the trace do we start unwinding? I'm surprised that strace reports the EPIPE before the SIGPIPE, but maybe that's just a kernel race. My expectation is that the current code unwinds after the system call returns with the EPIPE error, never returning it to the application. I think this is the right behavior for the write system call. Thanks, Florian
On 20/08/2019 12:30, Florian Weimer wrote: > * Adhemerval Zanella: > >> For tst-cancel2.c, if I add a sleep (1) between pthread_create and >> pthread_cancel you can see this issue more clearly (dump with strace): >> >> [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> >> [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 >> >> ########### Cancellation start to act here, by loading the libgcc to unwinding >> [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 >> [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 >> [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 >> [pid 2586] close(5) = 0 >> [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 >> [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 >> [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 >> [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 >> [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 >> [pid 2586] close(5) = 0 >> [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 >> [pid 2586] munmap(0x7fffabf00000, 63776) = 0 >> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 >> [pid 2586] close(3) = 0 >> [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> >> >> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called >> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- >> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- >> [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >> >> ########### No side-effects reported back to program >> [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 >> [pid 2587] exit(0) = ? >> >> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails >> depending whether the write is interrupted or not by the cancellation signal. > > Hmm. Which cancellation implementation is this? At which point in the > trace do we start unwinding? I'm surprised that strace reports the > EPIPE before the SIGPIPE, but maybe that's just a kernel race. My > expectation is that the current code unwinds after the system call > returns with the EPIPE error, never returning it to the application. I > think this is the right behavior for the write system call. This is current implement, more specifically glibc 2.17, CentOS 7.6 on powerpc64le. From the trace : >> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 This where pthread_cancel sends the SIGCANCEL signal to thread. >> [pid 2586] close(3) = 0 This is the /* This will cause the write in the child to return. */ close (fd[0]); In tst-cancel2.c. And finally: >> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- SIGPIPE is receives, making the write fail with EPIPE and then >> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- sigcancel_handler is issued. And the implementation *does* unwind after the syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683).
On 20/08/2019 13:46, Adhemerval Zanella wrote: > > > On 20/08/2019 12:30, Florian Weimer wrote: >> * Adhemerval Zanella: >> >>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and >>> pthread_cancel you can see this issue more clearly (dump with strace): >>> >>> [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> >>> [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 >>> >>> ########### Cancellation start to act here, by loading the libgcc to unwinding >>> [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 >>> [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 >>> [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 >>> [pid 2586] close(5) = 0 >>> [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 >>> [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 >>> [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 >>> [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 >>> [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 >>> [pid 2586] close(5) = 0 >>> [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 >>> [pid 2586] munmap(0x7fffabf00000, 63776) = 0 >>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 >>> [pid 2586] close(3) = 0 >>> [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> >>> >>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called >>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- >>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- >>> [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >>> >>> ########### No side-effects reported back to program >>> [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 >>> [pid 2587] exit(0) = ? >>> >>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails >>> depending whether the write is interrupted or not by the cancellation signal. >> >> Hmm. Which cancellation implementation is this? At which point in the >> trace do we start unwinding? I'm surprised that strace reports the >> EPIPE before the SIGPIPE, but maybe that's just a kernel race. My >> expectation is that the current code unwinds after the system call >> returns with the EPIPE error, never returning it to the application. I >> think this is the right behavior for the write system call. > > This is current implement, more specifically glibc 2.17, CentOS 7.6 on > powerpc64le. From the trace : > >>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 > > This where pthread_cancel sends the SIGCANCEL signal to thread. > >>> [pid 2586] close(3) = 0 > > This is the > > /* This will cause the write in the child to return. */ > close (fd[0]); > > In tst-cancel2.c. > > And finally: > >>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- > > SIGPIPE is receives, making the write fail with EPIPE and then > >>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- > > sigcancel_handler is issued. And the implementation *does* unwind after the > syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683). > Florian, do still disagree with this change?
* Adhemerval Zanella:
> Florian, do still disagree with this change?
Sorry, I need to run more experiments, hopefully today.
Thanks,
Florian
* Adhemerval Zanella: > On 20/08/2019 12:30, Florian Weimer wrote: >> * Adhemerval Zanella: >> >>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and >>> pthread_cancel you can see this issue more clearly (dump with strace): >>> >>> [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> >>> [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 >>> >>> ########### Cancellation start to act here, by loading the libgcc to unwinding >>> [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 >>> [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 >>> [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 >>> [pid 2586] close(5) = 0 >>> [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 >>> [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 >>> [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 >>> [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 >>> [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 >>> [pid 2586] close(5) = 0 >>> [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 >>> [pid 2586] munmap(0x7fffabf00000, 63776) = 0 >>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 >>> [pid 2586] close(3) = 0 >>> [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> >>> >>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called >>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- >>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- >>> [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >>> >>> ########### No side-effects reported back to program >>> [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 >>> [pid 2587] exit(0) = ? >>> >>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails >>> depending whether the write is interrupted or not by the cancellation signal. >> >> Hmm. Which cancellation implementation is this? At which point in the >> trace do we start unwinding? I'm surprised that strace reports the >> EPIPE before the SIGPIPE, but maybe that's just a kernel race. My >> expectation is that the current code unwinds after the system call >> returns with the EPIPE error, never returning it to the application. I >> think this is the right behavior for the write system call. > > This is current implement, more specifically glibc 2.17, CentOS 7.6 on > powerpc64le. From the trace : > >>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 > > This where pthread_cancel sends the SIGCANCEL signal to thread. > >>> [pid 2586] close(3) = 0 > > This is the > > /* This will cause the write in the child to return. */ > close (fd[0]); > > In tst-cancel2.c. > > And finally: > >>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- > > SIGPIPE is receives, making the write fail with EPIPE and then > >>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- > > sigcancel_handler is issued. And the implementation *does* unwind after the > syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683). I looked at this some more. It seems that in theory, we could also see EPIPE in the old implementation or at least observe execution of a signal handler, and the asynchronous cancel hides that. In the new implementation, if there is a signal handler and a signal for it is delivered before SIGCANCEL (even if the signal was sent *after* pthread_cancel, from the same thread), I do not think there is any chance whatsoever that we can hide the behavior difference. After all, SIGCANCEL may not trigger an asynchronous cancellation from the signal handler. System calls that are cancellation points appear to fall into these categories: (A) Those that do not perform any resource allocation (write, read). (B) Those that perform resource allocation, but the allocation can be easily reverted (openat, accept4). (C) Those that perform resource allocation, but the allocation is difficult to undo (recvmsg with descriptor passing). (D) close. For (A), I think POSIX allows sufficient wiggle room so that we can exhibit a partial side effect and still act on the cancellation (without reporting a partial write first to the application). For (B), maybe we should undo the resource allocation and then proceed to act on the cancellation. For (C), the complexity may not be worth it. For (A), (B), (C), we can act on the cancellation in the error case, after we observe the cancellation flag in the signal handler trampoline. (I think dropping the EINTR restriction from there achieves that.) If we do that, we do not need to change the test case. (D) is very special. Ideally, we would specify what happens with the descriptor if the close call is canceled. POSIX does not even specify what the state of file descriptor is after an EINTR error, so it doesn't say what happens with cancellation, either. Maybe we have to leave that undefined. Thanks, Florian
On 23/08/2019 09:47, Florian Weimer wrote: > * Adhemerval Zanella: > >> On 20/08/2019 12:30, Florian Weimer wrote: >>> * Adhemerval Zanella: >>> >>>> For tst-cancel2.c, if I add a sleep (1) between pthread_create and >>>> pthread_cancel you can see this issue more clearly (dump with strace): >>>> >>>> [pid 2587] set_robust_list(0x7fffabccf290, 24) = 0 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000) = 100000 >>>> [pid 2587] write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100000 <unfinished ...> >>>> [pid 2586] <... nanosleep resumed>0x7ffff0c9e7f0) = 0 >>>> >>>> ########### Cancellation start to act here, by loading the libgcc to unwinding >>>> [pid 2586] open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 5 >>>> [pid 2586] fstat(5, {st_mode=S_IFREG|0644, st_size=63776, ...}) = 0 >>>> [pid 2586] mmap(NULL, 63776, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fffabf00000 >>>> [pid 2586] close(5) = 0 >>>> [pid 2586] open("/lib64/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 5 >>>> [pid 2586] read(5, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0\25\0\1\0\0\0\340+\0\0\0\0\0\0"..., 832) = 832 >>>> [pid 2586] fstat(5, {st_mode=S_IFREG|0755, st_size=133696, ...}) = 0 >>>> [pid 2586] mmap(NULL, 197688, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fffab480000 >>>> [pid 2586] mmap(0x7fffab4a0000, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x10000) = 0x7fffab4a0000 >>>> [pid 2586] close(5) = 0 >>>> [pid 2586] mprotect(0x7fffab4a0000, 65536, PROT_READ) = 0 >>>> [pid 2586] munmap(0x7fffabf00000, 63776) = 0 >>>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 >>>> [pid 2586] close(3) = 0 >>>> [pid 2586] futex(0x7fffabccf280, FUTEX_WAIT, 2587, NULL <unfinished ...> >>>> >>>> ########### Write returns with broken PIPE and __pthread_disable_asynccancel is called >>>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- >>>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- >>>> [pid 2587] futex(0x7fffab4b0224, FUTEX_WAKE_PRIVATE, 2147483647) = 0 >>>> >>>> ########### No side-effects reported back to program >>>> [pid 2587] madvise(0x7fffab4c0000, 8257536, MADV_DONTNEED) = 0 >>>> [pid 2587] exit(0) = ? >>>> >>>> With BZ#12683 fix the cancellation is not acted upon and the testcase then fails >>>> depending whether the write is interrupted or not by the cancellation signal. >>> >>> Hmm. Which cancellation implementation is this? At which point in the >>> trace do we start unwinding? I'm surprised that strace reports the >>> EPIPE before the SIGPIPE, but maybe that's just a kernel race. My >>> expectation is that the current code unwinds after the system call >>> returns with the EPIPE error, never returning it to the application. I >>> think this is the right behavior for the write system call. >> >> This is current implement, more specifically glibc 2.17, CentOS 7.6 on >> powerpc64le. From the trace : >> >>>> [pid 2586] tgkill(2586, 2587, SIGRTMIN) = 0 >> >> This where pthread_cancel sends the SIGCANCEL signal to thread. >> >>>> [pid 2586] close(3) = 0 >> >> This is the >> >> /* This will cause the write in the child to return. */ >> close (fd[0]); >> >> In tst-cancel2.c. >> >> And finally: >> >>>> [pid 2587] <... write resumed>) = -1 EPIPE (Broken pipe) >>>> [pid 2587] --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=2586, si_uid=61684} --- >> >> SIGPIPE is receives, making the write fail with EPIPE and then >> >>>> [pid 2587] --- SIGRTMIN {si_signo=SIGRTMIN, si_code=SI_TKILL, si_pid=2586, si_uid=61684} --- >> >> sigcancel_handler is issued. And the implementation *does* unwind after the >> syscall is done, the problem is it ignores -1/EPIPE (and it is BZ#12683). > > I looked at this some more. It seems that in theory, we could also see > EPIPE in the old implementation or at least observe execution of a > signal handler, and the asynchronous cancel hides that. > > In the new implementation, if there is a signal handler and a signal for > it is delivered before SIGCANCEL (even if the signal was sent *after* > pthread_cancel, from the same thread), I do not think there is any > chance whatsoever that we can hide the behavior difference. After all, > SIGCANCEL may not trigger an asynchronous cancellation from the signal > handler. Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled by the signal handle by removing it from ucontext_t signal mask. > > System calls that are cancellation points appear to fall into these > categories: > > (A) Those that do not perform any resource allocation (write, read). > > (B) Those that perform resource allocation, but the allocation can be > easily reverted (openat, accept4). > > (C) Those that perform resource allocation, but the allocation is > difficult to undo (recvmsg with descriptor passing). > > (D) close. > > For (A), I think POSIX allows sufficient wiggle room so that we can > exhibit a partial side effect and still act on the cancellation (without > reporting a partial write first to the application). I agree, the side-effects for such syscalls should not lead to possible resources leakage. > > For (B), maybe we should undo the resource allocation and then proceed > to act on the cancellation. Besides this requires a lot of more complexity by mapping what kind of resources each syscall would require to free and when to free based on returned codes, it would hide it from the application. My view we just need to return that the syscall has visible side-effects that may result in system resources leaks and let the application deal with. > > For (C), the complexity may not be worth it. > > For (A), (B), (C), we can act on the cancellation in the error case, > after we observe the cancellation flag in the signal handler trampoline. > (I think dropping the EINTR restriction from there achieves that.) If > we do that, we do not need to change the test case. I don't see a real net gain in adding such complexity. It is not on the error case, but rather when visible side-effects has happened and we should alert the program to act upon that. The tst-cancel2 case is an error case because of the EPIPE semantic, but the new tst-cancel28 (based on the leak example from BZ#12683) is a case where the open call returns a valid file descriptor that should be freed. If glibc starts to add extra semantics for cancellation, we will have a standard extensions since now 'open', for instance, would close the file descriptor if the cancellation signal is delivered after the OS allocates the file descriptor resource. IMHO this is clearly what the program itself should be handling, because it might the case where it is prepared to close itself the file descriptor in a different thread. This would lead to possible another incompatibilities regarding libc system implementation, besides adding more complexity. I would like to make cancellation racy-free with minimal semantic change as possible. > > (D) is very special. Ideally, we would specify what happens with the > descriptor if the close call is canceled. POSIX does not even specify > what the state of file descriptor is after an EINTR error, so it doesn't > say what happens with cancellation, either. Maybe we have to leave that > undefined. Indeed even after the POSIX close language clarification [1], the specification is messy (it really does not help that some system does have interruptible close implementation that might return -1). However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux since it shouldn't return -1/EINTR in any case (and I don't know if/when there are kernels that actually does it) [2]. So assuming the nptl is Linux specific, we can assume that, although close is specified as a cancellable entrypoint, it won't happen in practice. [1] http://austingroupbugs.net/view.php?id=529 [2] https://lwn.net/Articles/576478/
* Adhemerval Zanella: >> In the new implementation, if there is a signal handler and a signal for >> it is delivered before SIGCANCEL (even if the signal was sent *after* >> pthread_cancel, from the same thread), I do not think there is any >> chance whatsoever that we can hide the behavior difference. After all, >> SIGCANCEL may not trigger an asynchronous cancellation from the signal >> handler. > > Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled > by the signal handle by removing it from ucontext_t signal mask. Does this mean a thread would lose its ability to be canceled in blocking system calls if it setjmps out of a signal handler? (Where as sigsetjmp would keep cancellation working.) >> System calls that are cancellation points appear to fall into these >> categories: >> >> (A) Those that do not perform any resource allocation (write, read). >> >> (B) Those that perform resource allocation, but the allocation can be >> easily reverted (openat, accept4). >> >> (C) Those that perform resource allocation, but the allocation is >> difficult to undo (recvmsg with descriptor passing). >> >> (D) close. >> For (B), maybe we should undo the resource allocation and then proceed >> to act on the cancellation. > > Besides this requires a lot of more complexity by mapping what kind of > resources each syscall would require to free and when to free based on > returned codes, it would hide it from the application. My view we just > need to return that the syscall has visible side-effects that may result > in system resources leaks and let the application deal with. My concern is that there might be no further cancellation point, and if we do not cancel in the case of (B), we might fail to act on the cancellation even though the canceled thread was blocked at a cancellation point when pthread_cancel was called. >> For (C), the complexity may not be worth it. >> >> For (A), (B), (C), we can act on the cancellation in the error case, >> after we observe the cancellation flag in the signal handler trampoline. >> (I think dropping the EINTR restriction from there achieves that.) If >> we do that, we do not need to change the test case. > > I don't see a real net gain in adding such complexity. It is not on the > error case, but rather when visible side-effects has happened and we should > alert the program to act upon that. The tst-cancel2 case is an error case > because of the EPIPE semantic, but the new tst-cancel28 (based on the leak > example from BZ#12683) is a case where the open call returns a valid file > descriptor that should be freed. I totally agree that the descriptor must not leak. However, tst-cancel28 is written in such a way that it does not expect an error from open: static void * leaker (void *arg) { int fd = open (arg, O_RDONLY); pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0); close (fd); return NULL; } It would be clearer if it actually used error checking, but we call close (-1) on error, which is invalid. If this test represents how application code is actually written, I think we really need to check for cancellation before returning the file descriptor. > If glibc starts to add extra semantics for cancellation, we will have > a standard extensions since now 'open', for instance, would close the > file descriptor if the cancellation signal is delivered after the OS > allocates the file descriptor resource. IMHO this is clearly what the > program itself should be handling, because it might the case where it > is prepared to close itself the file descriptor in a different thread. open is a cancellation point, so according to POSIX, the thread is canceled if the open call completes after pthread_cancel returns. (Due to integration with the memory model, thread ordering issues are difficult to reason about in POSIX, though.) I do not think we should return a descriptor to the application that can only exist because of something that happened *after* the pthread_cancel call. > This would lead to possible another incompatibilities regarding libc > system implementation, besides adding more complexity. I think we would just implement POSIX behavior more closely because open is required to be a cancellation point. It's unfortunate that we do not have kernel support for this. Other systems can likely implement cancellation semantics correctly more easily. >> (D) is very special. Ideally, we would specify what happens with the >> descriptor if the close call is canceled. POSIX does not even specify >> what the state of file descriptor is after an EINTR error, so it doesn't >> say what happens with cancellation, either. Maybe we have to leave that >> undefined. > > Indeed even after the POSIX close language clarification [1], the specification > is messy (it really does not help that some system does have interruptible > close implementation that might return -1). > > However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux > since it shouldn't return -1/EINTR in any case (and I don't know if/when > there are kernels that actually does it) [2]. So assuming the nptl is Linux > specific, we can assume that, although close is specified as a cancellable > entrypoint, it won't happen in practice. > > [1] http://austingroupbugs.net/view.php?id=529 > [2] https://lwn.net/Articles/576478/ I think this is slightly different from the EINTR case. With SA_RESTART, perhaps the signal handler can run *before* the descriptor is closed, so if we never return from the handler, the descriptor leaks? Thanks, Florian
On 26/08/2019 07:06, Florian Weimer wrote: > * Adhemerval Zanella: > >>> In the new implementation, if there is a signal handler and a signal for >>> it is delivered before SIGCANCEL (even if the signal was sent *after* >>> pthread_cancel, from the same thread), I do not think there is any >>> chance whatsoever that we can hide the behavior difference. After all, >>> SIGCANCEL may not trigger an asynchronous cancellation from the signal >>> handler. >> >> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled >> by the signal handle by removing it from ucontext_t signal mask. > > Does this mean a thread would lose its ability to be canceled in > blocking system calls if it setjmps out of a signal handler? (Where as > sigsetjmp would keep cancellation working.) The setjmp does not really matter here, once the async-cancellation fails to act due side-effects SIGCANCEL is removed from thread signal mask. The idea of making it sticky is always allow the thread to act on such cases regardless whether it calls another cancellation entrypoint. The thread will still be cancelled in the next cancellation entrypoint due it is marked as CANCELED_BIT. The __syscall_cancel_arch will check for the bit and call __do_cancel before issuing the syscall. > >>> System calls that are cancellation points appear to fall into these >>> categories: >>> >>> (A) Those that do not perform any resource allocation (write, read). >>> >>> (B) Those that perform resource allocation, but the allocation can be >>> easily reverted (openat, accept4). >>> >>> (C) Those that perform resource allocation, but the allocation is >>> difficult to undo (recvmsg with descriptor passing). >>> >>> (D) close. > >>> For (B), maybe we should undo the resource allocation and then proceed >>> to act on the cancellation. >> >> Besides this requires a lot of more complexity by mapping what kind of >> resources each syscall would require to free and when to free based on >> returned codes, it would hide it from the application. My view we just >> need to return that the syscall has visible side-effects that may result >> in system resources leaks and let the application deal with. > > My concern is that there might be no further cancellation point, and if > we do not cancel in the case of (B), we might fail to act on the > cancellation even though the canceled thread was blocked at a > cancellation point when pthread_cancel was called. I shared your concern, but my understanding of the whole BZ#12683 issue is we can't really cancel the thread for the (B) case. > >>> For (C), the complexity may not be worth it. >>> >>> For (A), (B), (C), we can act on the cancellation in the error case, >>> after we observe the cancellation flag in the signal handler trampoline. >>> (I think dropping the EINTR restriction from there achieves that.) If >>> we do that, we do not need to change the test case. >> >> I don't see a real net gain in adding such complexity. It is not on the >> error case, but rather when visible side-effects has happened and we should >> alert the program to act upon that. The tst-cancel2 case is an error case >> because of the EPIPE semantic, but the new tst-cancel28 (based on the leak >> example from BZ#12683) is a case where the open call returns a valid file >> descriptor that should be freed. > > I totally agree that the descriptor must not leak. However, > tst-cancel28 is written in such a way that it does not expect an error > from open: > > static void * > leaker (void *arg) > { > int fd = open (arg, O_RDONLY); > pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0); > close (fd); > return NULL; > } > > It would be clearer if it actually used error checking, but we call > close (-1) on error, which is invalid. We can hardness the test to check for open failures (due file descriptor exhaustion or other arcane issue), but the point of the test is fd is always for the *close* call. If the open syscall is interrupted before the file descriptor is actually opened, the thread will be cancelled before close call. Otherwise, the kernel will have to return a valid value and indicate on the PC from the ucontext_t that the syscall has side-effect. > > If this test represents how application code is actually written, I > think we really need to check for cancellation before returning the file > descriptor. > >> If glibc starts to add extra semantics for cancellation, we will have >> a standard extensions since now 'open', for instance, would close the >> file descriptor if the cancellation signal is delivered after the OS >> allocates the file descriptor resource. IMHO this is clearly what the >> program itself should be handling, because it might the case where it >> is prepared to close itself the file descriptor in a different thread. > > open is a cancellation point, so according to POSIX, the thread is > canceled if the open call completes after pthread_cancel returns. (Due > to integration with the memory model, thread ordering issues are > difficult to reason about in POSIX, though.) I do not think we should > return a descriptor to the application that can only exist because of > something that happened *after* the pthread_cancel call. That's not the interpretation I see, the cancellation description for such cases is afaiu: "The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]" And on signal concepts: "[EINTR] Interrupted function call. An asynchronous signal was caught by the process during the execution of an interruptible function. If the signal handler performs a normal return, the interrupted function call may return this condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)." And my understanding is, since glibc implemented cancellation using signals and it requires to use SA_RESTART we can actually return to user if the syscall has side-effects visible to caller. It follows the same interpretation Rich used in explaining the semantic required for the a possible kernel helper for cancellation [1] > >> This would lead to possible another incompatibilities regarding libc >> system implementation, besides adding more complexity. > > I think we would just implement POSIX behavior more closely because open > is required to be a cancellation point. It's unfortunate that we do not > have kernel support for this. Other systems can likely implement > cancellation semantics correctly more easily. I don't think there is an easy solution that kernel can provide to help cancellation to be standard conformant *and* race free. On the same kernel thread [2], Rich discuss some option with kernel developers and although the conclusion is open the ideas are either no conformant (the 'stick signal') or feasible (a new set of syscalls with an extra sigmak mask). > >>> (D) is very special. Ideally, we would specify what happens with the >>> descriptor if the close call is canceled. POSIX does not even specify >>> what the state of file descriptor is after an EINTR error, so it doesn't >>> say what happens with cancellation, either. Maybe we have to leave that >>> undefined. >> >> Indeed even after the POSIX close language clarification [1], the specification >> is messy (it really does not help that some system does have interruptible >> close implementation that might return -1). >> >> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux >> since it shouldn't return -1/EINTR in any case (and I don't know if/when >> there are kernels that actually does it) [2]. So assuming the nptl is Linux >> specific, we can assume that, although close is specified as a cancellable >> entrypoint, it won't happen in practice. >> >> [1] http://austingroupbugs.net/view.php?id=529 >> [2] https://lwn.net/Articles/576478/ > > I think this is slightly different from the EINTR case. With > SA_RESTART, perhaps the signal handler can run *before* the descriptor > is closed, so if we never return from the handler, the descriptor leaks? My understanding is the cancellation handler for close won't interrupt the syscall, regardless SA_RESTART. But indeed it might be the case where the signal is acted just before the syscall is issue and after the test for CANCELED_BITMASK is done. In this case the thread will be cancelled and the descriptor leaked. The musl handler SYS_close differently by not using the syscall bridge, so cancellation can't be acted upon and so close always succeed. On the tst-cancel28.c, pthread cancellation is explicit disable before calling close. [1] https://lkml.org/lkml/2016/3/9/1065 [2] https://lkml.org/lkml/2016/3/10/424
* Adhemerval Zanella: > On 26/08/2019 07:06, Florian Weimer wrote: >> * Adhemerval Zanella: >> >>>> In the new implementation, if there is a signal handler and a signal for >>>> it is delivered before SIGCANCEL (even if the signal was sent *after* >>>> pthread_cancel, from the same thread), I do not think there is any >>>> chance whatsoever that we can hide the behavior difference. After all, >>>> SIGCANCEL may not trigger an asynchronous cancellation from the signal >>>> handler. >>> >>> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled >>> by the signal handle by removing it from ucontext_t signal mask. >> >> Does this mean a thread would lose its ability to be canceled in >> blocking system calls if it setjmps out of a signal handler? (Where as >> sigsetjmp would keep cancellation working.) > The setjmp does not really matter here, once the async-cancellation fails > to act due side-effects SIGCANCEL is removed from thread signal mask. The > idea of making it sticky is always allow the thread to act on such cases > regardless whether it calls another cancellation entrypoint. I see. But why are we doing this? Can't we suppress the cancel signal inside pthread_cancel. Then we don't have to worry about the location of the signal mask in the signal context. > The thread will still be cancelled in the next cancellation entrypoint > due it is marked as CANCELED_BIT. The __syscall_cancel_arch will check > for the bit and call __do_cancel before issuing the syscall. Sure, but we can exit the SIGCANCEL handler if we detect that cancellation is already in progress. >>>> System calls that are cancellation points appear to fall into these >>>> categories: >>>> >>>> (A) Those that do not perform any resource allocation (write, read). >>>> >>>> (B) Those that perform resource allocation, but the allocation can be >>>> easily reverted (openat, accept4). >>>> >>>> (C) Those that perform resource allocation, but the allocation is >>>> difficult to undo (recvmsg with descriptor passing). >>>> >>>> (D) close. >> >>>> For (B), maybe we should undo the resource allocation and then proceed >>>> to act on the cancellation. >>> >>> Besides this requires a lot of more complexity by mapping what kind of >>> resources each syscall would require to free and when to free based on >>> returned codes, it would hide it from the application. My view we just >>> need to return that the syscall has visible side-effects that may result >>> in system resources leaks and let the application deal with. >> >> My concern is that there might be no further cancellation point, and if >> we do not cancel in the case of (B), we might fail to act on the >> cancellation even though the canceled thread was blocked at a >> cancellation point when pthread_cancel was called. > > I shared your concern, but my understanding of the whole BZ#12683 > issue is we can't really cancel the thread for the (B) case. Even if we undo the resource allocation before that? That is, check that a cancellation has occurred on the system call return path, and if it has, close the descriptor and start unwinding. >> I totally agree that the descriptor must not leak. However, >> tst-cancel28 is written in such a way that it does not expect an error >> from open: >> >> static void * >> leaker (void *arg) >> { >> int fd = open (arg, O_RDONLY); >> pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0); >> close (fd); >> return NULL; >> } >> >> It would be clearer if it actually used error checking, but we call >> close (-1) on error, which is invalid. > > We can hardness the test to check for open failures (due file descriptor > exhaustion or other arcane issue), but the point of the test is fd is > always for the *close* call. If the open syscall is interrupted before the > file descriptor is actually opened, the thread will be cancelled before > close call. Otherwise, the kernel will have to return a valid value > and indicate on the PC from the ucontext_t that the syscall has side-effect. My point is that the test should still be valid without the close call because the open is always canceled according to POSIX. >> If this test represents how application code is actually written, I >> think we really need to check for cancellation before returning the file >> descriptor. >> >>> If glibc starts to add extra semantics for cancellation, we will have >>> a standard extensions since now 'open', for instance, would close the >>> file descriptor if the cancellation signal is delivered after the OS >>> allocates the file descriptor resource. IMHO this is clearly what the >>> program itself should be handling, because it might the case where it >>> is prepared to close itself the file descriptor in a different thread. >> >> open is a cancellation point, so according to POSIX, the thread is >> canceled if the open call completes after pthread_cancel returns. (Due >> to integration with the memory model, thread ordering issues are >> difficult to reason about in POSIX, though.) I do not think we should >> return a descriptor to the application that can only exist because of >> something that happened *after* the pthread_cancel call. > > That's not the interpretation I see, the cancellation description for > such cases is afaiu: > > "The side-effects of acting upon a cancellation request while suspended > during a call of a function are the same as the side-effects that may be > seen in a single-threaded program when a call to a function is interrupted > by a signal and the given function returns [EINTR]" > > And on signal concepts: > > "[EINTR] > Interrupted function call. An asynchronous signal was caught by the process > during the execution of an interruptible function. If the signal handler > performs a normal return, the interrupted function call may return this > condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)." > > And my understanding is, since glibc implemented cancellation using signals > and it requires to use SA_RESTART we can actually return to user if the > syscall has side-effects visible to caller. The open function cannot allocate a new file descriptor and fail with EINTR. So I don't see how this case applies here. Sorry. >>>> (D) is very special. Ideally, we would specify what happens with the >>>> descriptor if the close call is canceled. POSIX does not even specify >>>> what the state of file descriptor is after an EINTR error, so it doesn't >>>> say what happens with cancellation, either. Maybe we have to leave that >>>> undefined. >>> >>> Indeed even after the POSIX close language clarification [1], the specification >>> is messy (it really does not help that some system does have interruptible >>> close implementation that might return -1). >>> >>> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux >>> since it shouldn't return -1/EINTR in any case (and I don't know if/when >>> there are kernels that actually does it) [2]. So assuming the nptl is Linux >>> specific, we can assume that, although close is specified as a cancellable >>> entrypoint, it won't happen in practice. >>> >>> [1] http://austingroupbugs.net/view.php?id=529 >>> [2] https://lwn.net/Articles/576478/ >> >> I think this is slightly different from the EINTR case. With >> SA_RESTART, perhaps the signal handler can run *before* the descriptor >> is closed, so if we never return from the handler, the descriptor leaks? > > My understanding is the cancellation handler for close won't interrupt the > syscall, regardless SA_RESTART. What do you mean by “interrupt” here? That close will not return with EINTR? Or that the signal handler does not run? > But indeed it might be the case where the > signal is acted just before the syscall is issue and after the test for > CANCELED_BITMASK is done. In this case the thread will be cancelled and > the descriptor leaked. The musl handler SYS_close differently by not using > the syscall bridge, so cancellation can't be acted upon and so close always > succeed. That seems reasonable. If we check for cancellation after the close system call (successful or not), then I think we have a compliant, leak-free close implementation. Thanks, Florian
On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote: > I looked at this some more. It seems that in theory, we could also see > EPIPE in the old implementation or at least observe execution of a > signal handler, and the asynchronous cancel hides that. > > In the new implementation, if there is a signal handler and a signal for > it is delivered before SIGCANCEL (even if the signal was sent *after* > pthread_cancel, from the same thread), I do not think there is any > chance whatsoever that we can hide the behavior difference. After all, > SIGCANCEL may not trigger an asynchronous cancellation from the signal > handler. > > System calls that are cancellation points appear to fall into these > categories: > > (A) Those that do not perform any resource allocation (write, read). > > (B) Those that perform resource allocation, but the allocation can be > easily reverted (openat, accept4). > > (C) Those that perform resource allocation, but the allocation is > difficult to undo (recvmsg with descriptor passing). > > (D) close. > > For (A), I think POSIX allows sufficient wiggle room so that we can > exhibit a partial side effect and still act on the cancellation (without > reporting a partial write first to the application). No. POSIX is very clear that cancellation is only permitted to be acted on when the side effects are the same as EINTR, which can only happen for read/write when no data has been transferred yet. If data has already been transferred, the only option not ruled out by one or more constraints is to return successfully with a short read or write; pending cancellation will then be handled on a subsequent call to a cancellation point. (If you're doing a retry loop until the total length n is transferred, then in the next iteration of the loop.) Of course this case of non-conformance would be less severe than UAF or resource leak errors in most cases, but it's still not good. > For (B), maybe we should undo the resource allocation and then proceed > to act on the cancellation. No, why would you do that? In general it's not possible (think of things like O_TRUNC), and even if it were it's more work. You just don't act on the cancellation if success has already happened. I don't see how/why you're characterizing "accept4" as "can easily be reverted". Closing the accepted socket would *drop a connection* rather than leaving the connection pending for another thread calling accept to receive. > For (C), the complexity may not be worth it. > > For (A), (B), (C), we can act on the cancellation in the error case, > after we observe the cancellation flag in the signal handler trampoline. > (I think dropping the EINTR restriction from there achieves that.) If > we do that, we do not need to change the test case. > > (D) is very special. Ideally, we would specify what happens with the > descriptor if the close call is canceled. POSIX does not even specify > what the state of file descriptor is after an EINTR error, so it doesn't > say what happens with cancellation, either. Maybe we have to leave that > undefined. Failure to specify EINTR was an error in POSIX that's been fixed for the next issue. Leaving it undefined is atrociously bad and results in UAF type errors. Even if glibc doesn't want to apply the fix for EINTR, cancellation of close absolutely needs to behave *as if* by the newly-specified behavior for close with EINTR. Rich
On 30/08/2019 08:51, Florian Weimer wrote: > * Adhemerval Zanella: > >> On 26/08/2019 07:06, Florian Weimer wrote: >>> * Adhemerval Zanella: >>> >>>>> In the new implementation, if there is a signal handler and a signal for >>>>> it is delivered before SIGCANCEL (even if the signal was sent *after* >>>>> pthread_cancel, from the same thread), I do not think there is any >>>>> chance whatsoever that we can hide the behavior difference. After all, >>>>> SIGCANCEL may not trigger an asynchronous cancellation from the signal >>>>> handler. >>>> >>>> Yes, although SIGCANCEL would be set as SA_RESTART, it is explicit disabled >>>> by the signal handle by removing it from ucontext_t signal mask. >>> >>> Does this mean a thread would lose its ability to be canceled in >>> blocking system calls if it setjmps out of a signal handler? (Where as >>> sigsetjmp would keep cancellation working.) > >> The setjmp does not really matter here, once the async-cancellation fails >> to act due side-effects SIGCANCEL is removed from thread signal mask. The >> idea of making it sticky is always allow the thread to act on such cases >> regardless whether it calls another cancellation entrypoint. > > I see. But why are we doing this? Can't we suppress the cancel signal > inside pthread_cancel. Then we don't have to worry about the location > of the signal mask in the signal context. I guess we can, it would require a sigprocmask to obtain the current mask, and another one to remove the cancellation signal. I still think it cleaner to just sets on the signal handler, since the kernel will restore the signal handler after the userspace returns. > >> The thread will still be cancelled in the next cancellation entrypoint >> due it is marked as CANCELED_BIT. The __syscall_cancel_arch will check >> for the bit and call __do_cancel before issuing the syscall. > > Sure, but we can exit the SIGCANCEL handler if we detect that > cancellation is already in progress. I think it is possible, but I see that enforcing it using the kernel facilities to avoid extra code in userspace is cleaner. > >>>>> System calls that are cancellation points appear to fall into these >>>>> categories: >>>>> >>>>> (A) Those that do not perform any resource allocation (write, read). >>>>> >>>>> (B) Those that perform resource allocation, but the allocation can be >>>>> easily reverted (openat, accept4). >>>>> >>>>> (C) Those that perform resource allocation, but the allocation is >>>>> difficult to undo (recvmsg with descriptor passing). >>>>> >>>>> (D) close. >>> >>>>> For (B), maybe we should undo the resource allocation and then proceed >>>>> to act on the cancellation. >>>> >>>> Besides this requires a lot of more complexity by mapping what kind of >>>> resources each syscall would require to free and when to free based on >>>> returned codes, it would hide it from the application. My view we just >>>> need to return that the syscall has visible side-effects that may result >>>> in system resources leaks and let the application deal with. >>> >>> My concern is that there might be no further cancellation point, and if >>> we do not cancel in the case of (B), we might fail to act on the >>> cancellation even though the canceled thread was blocked at a >>> cancellation point when pthread_cancel was called. >> >> I shared your concern, but my understanding of the whole BZ#12683 >> issue is we can't really cancel the thread for the (B) case. > > Even if we undo the resource allocation before that? That is, check > that a cancellation has occurred on the system call return path, and if > it has, close the descriptor and start unwinding. What if the application requires extra steps for the case the syscall returns a side-effects? What about some specific syscalls, such as passing file descriptor using recvmsg? It would need the libc to map *all* possible side-effects for each syscall, determine which might leak side-effect and acts uppon then. What if the kernel, at some version, added another feature to share resources that might create side-effects? We will need to constantly map such cases, adding even more arch-specific glue code to handle it. I really think we should not go on this path. > >>> I totally agree that the descriptor must not leak. However, >>> tst-cancel28 is written in such a way that it does not expect an error >>> from open: >>> >>> static void * >>> leaker (void *arg) >>> { >>> int fd = open (arg, O_RDONLY); >>> pthread_setcancelstate (PTHREAD_CANCEL_DISABLE, 0); >>> close (fd); >>> return NULL; >>> } >>> >>> It would be clearer if it actually used error checking, but we call >>> close (-1) on error, which is invalid. >> >> We can hardness the test to check for open failures (due file descriptor >> exhaustion or other arcane issue), but the point of the test is fd is >> always for the *close* call. If the open syscall is interrupted before the >> file descriptor is actually opened, the thread will be cancelled before >> close call. Otherwise, the kernel will have to return a valid value >> and indicate on the PC from the ucontext_t that the syscall has side-effect. > > My point is that the test should still be valid without the close call > because the open is always canceled according to POSIX. > >>> If this test represents how application code is actually written, I >>> think we really need to check for cancellation before returning the file >>> descriptor. >>> >>>> If glibc starts to add extra semantics for cancellation, we will have >>>> a standard extensions since now 'open', for instance, would close the >>>> file descriptor if the cancellation signal is delivered after the OS >>>> allocates the file descriptor resource. IMHO this is clearly what the >>>> program itself should be handling, because it might the case where it >>>> is prepared to close itself the file descriptor in a different thread. >>> >>> open is a cancellation point, so according to POSIX, the thread is >>> canceled if the open call completes after pthread_cancel returns. (Due >>> to integration with the memory model, thread ordering issues are >>> difficult to reason about in POSIX, though.) I do not think we should >>> return a descriptor to the application that can only exist because of >>> something that happened *after* the pthread_cancel call. >> >> That's not the interpretation I see, the cancellation description for >> such cases is afaiu: >> >> "The side-effects of acting upon a cancellation request while suspended >> during a call of a function are the same as the side-effects that may be >> seen in a single-threaded program when a call to a function is interrupted >> by a signal and the given function returns [EINTR]" >> >> And on signal concepts: >> >> "[EINTR] >> Interrupted function call. An asynchronous signal was caught by the process >> during the execution of an interruptible function. If the signal handler >> performs a normal return, the interrupted function call may return this >> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)." >> >> And my understanding is, since glibc implemented cancellation using signals >> and it requires to use SA_RESTART we can actually return to user if the >> syscall has side-effects visible to caller. > > The open function cannot allocate a new file descriptor and fail with > EINTR. So I don't see how this case applies here. Sorry. But that is *exactly* the point. The open *does not* fail, but cancellation acts *before* it can return to the callee. We can't really make it fail with EINTR and acts on the cancellation. The idea is "the interrupted function call may return this condition". > >>>>> (D) is very special. Ideally, we would specify what happens with the >>>>> descriptor if the close call is canceled. POSIX does not even specify >>>>> what the state of file descriptor is after an EINTR error, so it doesn't >>>>> say what happens with cancellation, either. Maybe we have to leave that >>>>> undefined. >>>> >>>> Indeed even after the POSIX close language clarification [1], the specification >>>> is messy (it really does not help that some system does have interruptible >>>> close implementation that might return -1). >>>> >>>> However my understanding is we can assume POSIX_CLOSE_RESTART 0 for Linux >>>> since it shouldn't return -1/EINTR in any case (and I don't know if/when >>>> there are kernels that actually does it) [2]. So assuming the nptl is Linux >>>> specific, we can assume that, although close is specified as a cancellable >>>> entrypoint, it won't happen in practice. >>>> >>>> [1] http://austingroupbugs.net/view.php?id=529 >>>> [2] https://lwn.net/Articles/576478/ >>> >>> I think this is slightly different from the EINTR case. With >>> SA_RESTART, perhaps the signal handler can run *before* the descriptor >>> is closed, so if we never return from the handler, the descriptor leaks? >> >> My understanding is the cancellation handler for close won't interrupt the >> syscall, regardless SA_RESTART. > > What do you mean by “interrupt” here? That close will not return with > EINTR? Or that the signal handler does not run? My understanding on the discussions is Linux close implementation does not return EINTR/-1 in any case (and if it did in the past it was a bug). > >> But indeed it might be the case where the >> signal is acted just before the syscall is issue and after the test for >> CANCELED_BITMASK is done. In this case the thread will be cancelled and >> the descriptor leaked. The musl handler SYS_close differently by not using >> the syscall bridge, so cancellation can't be acted upon and so close always >> succeed. > > That seems reasonable. > > If we check for cancellation after the close system call (successful or > not), then I think we have a compliant, leak-free close implementation. In my understanding musl does this to prevent possible problematic cancel syscall implementation that might return -1/EINTR (since for this case syscall will be restart due SA_RESTART and cancellation won't acted upon).
* Rich Felker: > On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote: >> I looked at this some more. It seems that in theory, we could also see >> EPIPE in the old implementation or at least observe execution of a >> signal handler, and the asynchronous cancel hides that. >> >> In the new implementation, if there is a signal handler and a signal for >> it is delivered before SIGCANCEL (even if the signal was sent *after* >> pthread_cancel, from the same thread), I do not think there is any >> chance whatsoever that we can hide the behavior difference. After all, >> SIGCANCEL may not trigger an asynchronous cancellation from the signal >> handler. >> >> System calls that are cancellation points appear to fall into these >> categories: >> >> (A) Those that do not perform any resource allocation (write, read). >> >> (B) Those that perform resource allocation, but the allocation can be >> easily reverted (openat, accept4). >> >> (C) Those that perform resource allocation, but the allocation is >> difficult to undo (recvmsg with descriptor passing). >> >> (D) close. >> >> For (A), I think POSIX allows sufficient wiggle room so that we can >> exhibit a partial side effect and still act on the cancellation (without >> reporting a partial write first to the application). > > No. POSIX is very clear that cancellation is only permitted to be > acted on when the side effects are the same as EINTR, which can only > happen for read/write when no data has been transferred yet. If data > has already been transferred, the only option not ruled out by one or > more constraints is to return successfully with a short read or write; > pending cancellation will then be handled on a subsequent call to a > cancellation point. (If you're doing a retry loop until the total > length n is transferred, then in the next iteration of the loop.) This seems to be a backwards interpretation to me. I think POSIX wants to describe what can happen during cancellation, and not say that any cancellation point can be skipped if some side effect happens that cannot happen before EINTR. > Of course this case of non-conformance would be less severe than UAF > or resource leak errors in most cases, but it's still not good. Agreed. And I think not acting on the cancellation and returning success is also not conforming. The main problem is that POSIX would need to specify a memory model in order to make the current language (“side-effects occur before any cancellation cleanup handlers are called”, “a thread is canceled while executing the function”, “a cancellation request has been made”, etc.). >> For (B), maybe we should undo the resource allocation and then proceed >> to act on the cancellation. > > No, why would you do that? In general it's not possible (think of > things like O_TRUNC), and even if it were it's more work. You just > don't act on the cancellation if success has already happened. The problem is that the implementation reorders cancellation requests fairly arbitrarily. I don't think that's what the authors of POSIX had in mind when they said that the cancellation itself progresses asynchronously. I don't think that can be fixed without kernel support, though. We'll see eventually how this impacts existing applications. > I don't see how/why you're characterizing "accept4" as "can easily be > reverted". Closing the accepted socket would *drop a connection* > rather than leaving the connection pending for another thread calling > accept to receive. And this is true for open as well. For FIFOs, there is an observable side effect, likewise for rewinding tape devices, and even for regular files, the close call affects POSIX advisory locks on the file. Oh well. So it's just not possible to unwind directly after file descriptor allocation. >> For (C), the complexity may not be worth it. >> >> For (A), (B), (C), we can act on the cancellation in the error case, >> after we observe the cancellation flag in the signal handler trampoline. >> (I think dropping the EINTR restriction from there achieves that.) If >> we do that, we do not need to change the test case. >> >> (D) is very special. Ideally, we would specify what happens with the >> descriptor if the close call is canceled. POSIX does not even specify >> what the state of file descriptor is after an EINTR error, so it doesn't >> say what happens with cancellation, either. Maybe we have to leave that >> undefined. > > Failure to specify EINTR was an error in POSIX that's been fixed for > the next issue. Leaving it undefined is atrociously bad and results in > UAF type errors. Even if glibc doesn't want to apply the fix for > EINTR, cancellation of close absolutely needs to behave *as if* by the > newly-specified behavior for close with EINTR. Can we even make sure we act on the cancellation only if the descriptor has been closed? Thanks, Florian
* Adhemerval Zanella: > I think it is possible, but I see that enforcing it using the kernel > facilities to avoid extra code in userspace is cleaner. Okay. >> Even if we undo the resource allocation before that? That is, check >> that a cancellation has occurred on the system call return path, and if >> it has, close the descriptor and start unwinding. > > What if the application requires extra steps for the case the syscall > returns a side-effects? What about some specific syscalls, such as > passing file descriptor using recvmsg? It would need the libc to map > *all* possible side-effects for each syscall, determine which might > leak side-effect and acts uppon then. What if the kernel, at some version, > added another feature to share resources that might create side-effects? > We will need to constantly map such cases, adding even more arch-specific > glue code to handle it. > > I really think we should not go on this path. I agree now, given what we've got from the kernel. See my response to Rich. >>> That's not the interpretation I see, the cancellation description for >>> such cases is afaiu: >>> >>> "The side-effects of acting upon a cancellation request while suspended >>> during a call of a function are the same as the side-effects that may be >>> seen in a single-threaded program when a call to a function is interrupted >>> by a signal and the given function returns [EINTR]" >>> >>> And on signal concepts: >>> >>> "[EINTR] >>> Interrupted function call. An asynchronous signal was caught by the process >>> during the execution of an interruptible function. If the signal handler >>> performs a normal return, the interrupted function call may return this >>> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)." >>> >>> And my understanding is, since glibc implemented cancellation using signals >>> and it requires to use SA_RESTART we can actually return to user if the >>> syscall has side-effects visible to caller. >> >> The open function cannot allocate a new file descriptor and fail with >> EINTR. So I don't see how this case applies here. Sorry. > > But that is *exactly* the point. The open *does not* fail, but > cancellation acts *before* it can return to the callee. We can't > really make it fail with EINTR and acts on the cancellation. The idea > is "the interrupted function call may return this condition". My uneasiness stems from the fact that we're reading POSIX backwards, i.e., our interpretation (necessarily, given the kernel interfaces) behaves in a way that is obviously to complying with a direct reading of POSIX (a to-be-canceled thread synchronizing with the canceling thread, when the pthread_cancel call happened before the synchronization in program order in the canceling thread), and we then proceed to find reasons why this not totally broken. So, back to the start of the thread: Can we avoid the EPIPE error in an other way? Can we remove this line? /* This will cause the write in the child to return. */ close (fd[0]); Cancellation should still happen if the write is blocked, right? It's not really clear to me what the intent of the test was. Thanks, Florian
On 02/09/2019 13:37, Florian Weimer wrote: > * Rich Felker: >> No. POSIX is very clear that cancellation is only permitted to be >> acted on when the side effects are the same as EINTR, which can only >> happen for read/write when no data has been transferred yet. If data >> has already been transferred, the only option not ruled out by one or >> more constraints is to return successfully with a short read or write; >> pending cancellation will then be handled on a subsequent call to a >> cancellation point. (If you're doing a retry loop until the total >> length n is transferred, then in the next iteration of the loop.) > > This seems to be a backwards interpretation to me. I think POSIX wants > to describe what can happen during cancellation, and not say that any > cancellation point can be skipped if some side effect happens that > cannot happen before EINTR. > >> Of course this case of non-conformance would be less severe than UAF >> or resource leak errors in most cases, but it's still not good. > > Agreed. And I think not acting on the cancellation and returning > success is also not conforming. how is that non-conformance observable? the user cannot know if the cancellation request arrived in time or not if io error (short write) can happen for other reasons than the cancellation signal.
On Mon, Sep 02, 2019 at 02:37:17PM +0200, Florian Weimer wrote: > * Rich Felker: > > > On Fri, Aug 23, 2019 at 02:47:39PM +0200, Florian Weimer wrote: > >> I looked at this some more. It seems that in theory, we could also see > >> EPIPE in the old implementation or at least observe execution of a > >> signal handler, and the asynchronous cancel hides that. > >> > >> In the new implementation, if there is a signal handler and a signal for > >> it is delivered before SIGCANCEL (even if the signal was sent *after* > >> pthread_cancel, from the same thread), I do not think there is any > >> chance whatsoever that we can hide the behavior difference. After all, > >> SIGCANCEL may not trigger an asynchronous cancellation from the signal > >> handler. > >> > >> System calls that are cancellation points appear to fall into these > >> categories: > >> > >> (A) Those that do not perform any resource allocation (write, read). > >> > >> (B) Those that perform resource allocation, but the allocation can be > >> easily reverted (openat, accept4). > >> > >> (C) Those that perform resource allocation, but the allocation is > >> difficult to undo (recvmsg with descriptor passing). > >> > >> (D) close. > >> > >> For (A), I think POSIX allows sufficient wiggle room so that we can > >> exhibit a partial side effect and still act on the cancellation (without > >> reporting a partial write first to the application). > > > > No. POSIX is very clear that cancellation is only permitted to be > > acted on when the side effects are the same as EINTR, which can only > > happen for read/write when no data has been transferred yet. If data > > has already been transferred, the only option not ruled out by one or > > more constraints is to return successfully with a short read or write; > > pending cancellation will then be handled on a subsequent call to a > > cancellation point. (If you're doing a retry loop until the total > > length n is transferred, then in the next iteration of the loop.) > > This seems to be a backwards interpretation to me. I think POSIX wants > to describe what can happen during cancellation, and not say that any > cancellation point can be skipped if some side effect happens that > cannot happen before EINTR. The exact text is: "The side-effects of acting upon a cancellation request while suspended during a call of a function are the same as the side-effects that may be seen in a single-threaded program when a call to a function is interrupted by a signal and the given function returns [EINTR]." Admittedly it took me a while to understand this back when I first started investigating the problem in 2010, but based on both careful reading of this text, and on reasoning about what conditions are *absolutely necessary* to make the cancellation functionality usable in any meaningful way (without race conditions), yes it means that if you act on cancellation, the [only] side effects the cancelled function can have are side effects that would be allowed if the function had failed with EINTR. In hindsight, the main obstacle to my understanding this was bias based on how badly wrong all existing implementations I'd seen were, and resistance to believing they were all so badly wrong. But they were/are. Moreover, the specification gives the implementation liberty *not* to act on cancellation in certain conditions (very end of XSH 2.9.5 Thread Cancellation, subheading Cancellation points): "It is unspecified whether the cancellation request is acted upon or whether the cancellation request remains pending and the thread resumes normal execution if: - The thread is suspended at a cancellation point and the event for which it is waiting occurs - A specified timeout expired before the cancellation request is acted upon." The way I read this is that, if you start processing a cancellation request, but see that the timeout has already expired or that whatever the function was blocked waiting for has already happened, you can abort cancellation and leave it pending and return the result. In an implementation where libc/libpthread cancellation and the kernel are tightly coupled, this is really a choice -- for example the kernel could check at the last point before success during SYS_read to see that cancellation is pending, and *not actually transfer the data*, but instead initiate cancellation with no side effects on the open file. But on an implementation like Linux where libc/libpthread and the cancellation implementation is completely decoupled from the kernel and implemented via signals (which is a hack!), you don't actually have this freedom. By the time you see the signal, it's possible that the syscall *already succeeded* with (potentially serious) side effects, and in that case you have to use the option POSIX gives you here to defer cancellation until the next cancellation point. (If POSIX did not give you this option, then it may not even be possible to do a conforming implementation based on the signal hack). > > Of course this case of non-conformance would be less severe than UAF > > or resource leak errors in most cases, but it's still not good. > > Agreed. And I think not acting on the cancellation and returning > success is also not conforming. It is conforming, by the above-quoted passage. > >> For (C), the complexity may not be worth it. > >> > >> For (A), (B), (C), we can act on the cancellation in the error case, > >> after we observe the cancellation flag in the signal handler trampoline. > >> (I think dropping the EINTR restriction from there achieves that.) If > >> we do that, we do not need to change the test case. > >> > >> (D) is very special. Ideally, we would specify what happens with the > >> descriptor if the close call is canceled. POSIX does not even specify > >> what the state of file descriptor is after an EINTR error, so it doesn't > >> say what happens with cancellation, either. Maybe we have to leave that > >> undefined. > > > > Failure to specify EINTR was an error in POSIX that's been fixed for > > the next issue. Leaving it undefined is atrociously bad and results in > > UAF type errors. Even if glibc doesn't want to apply the fix for > > EINTR, cancellation of close absolutely needs to behave *as if* by the > > newly-specified behavior for close with EINTR. > > Can we even make sure we act on the cancellation only if the descriptor > has been closed? If the descriptor has been closed, you *can't* act on cancellation. On Linux, SYS_close *never* returns to userspace without the fd having been closed. It can return EINTR, indicating that the fd has been closed but the underlying open file description (if this was the last reference to it) has not, but this EINTR is not POSIX conforming with the new changes to POSIX, and it was always inconsistent with the semantics implied by EINTR ("no nontrivial side effects have happened yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627 In any case, this is easy to deal with for cancellation, even if fixing the EINTR issue is controversial. If PC is before or at the syscall instruction, nothing has happened yet (fd is still open) and you can act on cancellation. If PC is after the syscall instruction, you can't act on it. This is actually the same as all other syscalls, in some sense, except that for (at least a few) other syscalls you want to also act on cancellation if EINTR is returned, because a few of them EINTR on receipt of any signal, including SA_RESTART ones like SIGCANCEL. Historically there were lots like this, but most of them were conformance bugs and fixed. I think select and/or poll is still like that intentionally, but I may be misremembering. Rich
On 02/09/2019 20:26, Rich Felker wrote: > If the descriptor has been closed, you *can't* act on cancellation. On > Linux, SYS_close *never* returns to userspace without the fd having > been closed. It can return EINTR, indicating that the fd has been > closed but the underlying open file description (if this was the last > reference to it) has not, but this EINTR is not POSIX conforming with > the new changes to POSIX, and it was always inconsistent with the > semantics implied by EINTR ("no nontrivial side effects have happened > yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627 So is it true even for newer version? I had the impression from the lwn article [1] that this is driver dependent and this behaviour will eventually be removed from kernel to not allow close to return EINTR. About BZ#14627, does it still make sense for *Linux* to map EINTR to EINPROGRESS? [1] https://lwn.net/Articles/576478/
On Tue, Sep 03, 2019 at 09:15:58AM -0300, Adhemerval Zanella wrote: > > > On 02/09/2019 20:26, Rich Felker wrote: > > If the descriptor has been closed, you *can't* act on cancellation. On > > Linux, SYS_close *never* returns to userspace without the fd having > > been closed. It can return EINTR, indicating that the fd has been > > closed but the underlying open file description (if this was the last > > reference to it) has not, but this EINTR is not POSIX conforming with > > the new changes to POSIX, and it was always inconsistent with the > > semantics implied by EINTR ("no nontrivial side effects have happened > > yet"). See https://sourceware.org/bugzilla/show_bug.cgi?id=14627 > > So is it true even for newer version? I had the impression from the lwn > article [1] that this is driver dependent and this behaviour will eventually > be removed from kernel to not allow close to return EINTR. > > About BZ#14627, does it still make sense for *Linux* to map EINTR to > EINPROGRESS? > > [1] https://lwn.net/Articles/576478/ The kernel folks are adamantly opposed to changing SYS_close to match the POSIX requirement here, which is the right thing to do but for the wrong reasons -- if it were changed, libc could not safely handle the situation because it couldn't know what meaning the kernel was assigning to EINTR. Hopefully they will just stop returning EINTR entirely. That would not break any of the behavior discussed above. Returning EINPROGRESS may be a problem though because applications may not know to expect it and may treat it as a hard error rather than a reasonable condition (discussed in https://sourceware.org/bugzilla/show_bug.cgi?id=14627). The ideal behavior is probably just returning 0. Rich
On 02/09/2019 09:51, Florian Weimer wrote: > * Adhemerval Zanella: > >> I think it is possible, but I see that enforcing it using the kernel >> facilities to avoid extra code in userspace is cleaner. > > Okay. > >>> Even if we undo the resource allocation before that? That is, check >>> that a cancellation has occurred on the system call return path, and if >>> it has, close the descriptor and start unwinding. >> >> What if the application requires extra steps for the case the syscall >> returns a side-effects? What about some specific syscalls, such as >> passing file descriptor using recvmsg? It would need the libc to map >> *all* possible side-effects for each syscall, determine which might >> leak side-effect and acts uppon then. What if the kernel, at some version, >> added another feature to share resources that might create side-effects? >> We will need to constantly map such cases, adding even more arch-specific >> glue code to handle it. >> >> I really think we should not go on this path. > > I agree now, given what we've got from the kernel. See my response to > Rich. > >>>> That's not the interpretation I see, the cancellation description for >>>> such cases is afaiu: >>>> >>>> "The side-effects of acting upon a cancellation request while suspended >>>> during a call of a function are the same as the side-effects that may be >>>> seen in a single-threaded program when a call to a function is interrupted >>>> by a signal and the given function returns [EINTR]" >>>> >>>> And on signal concepts: >>>> >>>> "[EINTR] >>>> Interrupted function call. An asynchronous signal was caught by the process >>>> during the execution of an interruptible function. If the signal handler >>>> performs a normal return, the interrupted function call may return this >>>> condition (see the Base Definitions volume of POSIX.1-2017, <signal.h>)." >>>> >>>> And my understanding is, since glibc implemented cancellation using signals >>>> and it requires to use SA_RESTART we can actually return to user if the >>>> syscall has side-effects visible to caller. >>> >>> The open function cannot allocate a new file descriptor and fail with >>> EINTR. So I don't see how this case applies here. Sorry. >> >> But that is *exactly* the point. The open *does not* fail, but >> cancellation acts *before* it can return to the callee. We can't >> really make it fail with EINTR and acts on the cancellation. The idea >> is "the interrupted function call may return this condition". > > My uneasiness stems from the fact that we're reading POSIX backwards, > i.e., our interpretation (necessarily, given the kernel interfaces) > behaves in a way that is obviously to complying with a direct reading of > POSIX (a to-be-canceled thread synchronizing with the canceling thread, > when the pthread_cancel call happened before the synchronization in > program order in the canceling thread), and we then proceed to find > reasons why this not totally broken. > > So, back to the start of the thread: Can we avoid the EPIPE error in an > other way? > > Can we remove this line? > > /* This will cause the write in the child to return. */ > close (fd[0]); > > Cancellation should still happen if the write is blocked, right? It's > not really clear to me what the intent of the test was. I think Rich has answered it better than me that using signals to implement thread cancellation is an implementation detail and POSIX allows us for this case to not act on some cancellation signals if it leads to visible side-effects that might lead to resource leakage. Said that, it would be possible to just remove the close call to avoid the EPIPE error on write. The cancellation should still happen because the write is called on a loop, so since it might return with a visible side-effect (partial write), it would be cancelled in the next write call due the thread being marked as cancelled. I really don't have a strong opinion which change is the better, my point is the current code is racy and it will start to fail more often once we fix BZ#12683.
diff --git a/nptl/tst-cancel2.c b/nptl/tst-cancel2.c index 1f0429d343..632ea4e0ae 100644 --- a/nptl/tst-cancel2.c +++ b/nptl/tst-cancel2.c @@ -20,6 +20,7 @@ #include <signal.h> #include <stdio.h> #include <unistd.h> +#include <errno.h> static int fd[2]; @@ -32,7 +33,14 @@ tf (void *arg) write blocks. */ char buf[100000]; - while (write (fd[1], buf, sizeof (buf)) > 0); + while (1) + { + /* Ignore EPIPE errors for case the SIGPIPE is handle before + SIGCANCEL. */ + ssize_t ret = write (fd[1], buf, sizeof (buf)); + if (ret == 0 || (ret == -1 && errno != EPIPE)) + break; + } return (void *) 42l; }