Message ID | 20200220143406.4768-1-adhemerval.zanella@linaro.org |
---|---|
State | New |
Headers | show |
Series | support: Kill process group for test failure | expand |
On 2/20/20 9:34 AM, Adhemerval Zanella wrote: > Some testcases that create multiple subprocesses might abort or exit > prior waiting for their children. In such case, support_test_main > does not try to kill the spawned test process group (as in the > test timeout case). > > On example that we are observing in internal tests is when > malloc/tst-mallocfork2 fails to fork in the signal handling (due > either maximum number of process or other non expected failure). > > This patch kill the process group in the case of failed execution, > similar on how it is done on timeout. LGTM. Reviewed-by: Carlos O'Donell <carlos@redhat.com> > Checked on x86_64-linux-gnu. > --- > support/support_test_main.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/support/support_test_main.c b/support/support_test_main.c > index e3f0bf15f2..ac9f710fb7 100644 > --- a/support/support_test_main.c > +++ b/support/support_test_main.c > @@ -459,6 +459,9 @@ support_test_main (int argc, char **argv, const struct test_config *config) > /* Process terminated normaly without timeout etc. */ > if (WIFEXITED (status)) > { > + /* It is expected that a successful test execution handles all its > + children. */ OK. > + > if (config->expected_status == 0) > { > if (config->expected_signal == 0) > @@ -486,6 +489,10 @@ support_test_main (int argc, char **argv, const struct test_config *config) > /* Process was killed by timer or other signal. */ > else > { > + /* Kill the whole process group if test process aborts or exits prior > + waiting for them. */ > + kill (-test_pid, SIGKILL); OK. Send negation of process group id to kill the whole process group. The process group id is the same as the id of the process that created the group so test_pid is the right value. Notes: - Should we be using test_pgid to make this clear? - Should this cleanup be refactored a bit to avoid duplication from signal_handler() and support_test_main() e.g. kill_process_group () which runs kill looks for errors prints diagnostic etc. > + > if (config->expected_signal == 0) > { > printf ("Didn't expect signal from child: got `%s'\n", > -- Cheers, Carlos.
* Adhemerval Zanella: > Some testcases that create multiple subprocesses might abort or exit > prior waiting for their children. In such case, support_test_main > does not try to kill the spawned test process group (as in the > test timeout case). Does this actually work? Is the process group preserved if a process is reparented to init? Thanks, Florian
On Thu, Feb 20, 2020 at 10:05 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Adhemerval Zanella: > > > Some testcases that create multiple subprocesses might abort or exit > > prior waiting for their children. In such case, support_test_main > > does not try to kill the spawned test process group (as in the > > test timeout case). > > Does this actually work? Is the process group preserved if a process is > reparented to init? No, you are right. There is a race too which I didn't see. Once you waitpid the pid and pgid might be free for reuse and we can't guarantee this will work. Cheers, Carlos.
On Thu, Feb 20, 2020 at 9:53 AM Carlos O'Donell <codonell@redhat.com> wrote: > > On 2/20/20 9:34 AM, Adhemerval Zanella wrote: > > Some testcases that create multiple subprocesses might abort or exit > > prior waiting for their children. In such case, support_test_main > > does not try to kill the spawned test process group (as in the > > test timeout case). > > > > On example that we are observing in internal tests is when > > malloc/tst-mallocfork2 fails to fork in the signal handling (due > > either maximum number of process or other non expected failure). > > > > This patch kill the process group in the case of failed execution, > > similar on how it is done on timeout. > > LGTM. > > Reviewed-by: Carlos O'Donell <carlos@redhat.com> I'm withdrawing my reviewed-by here, since there is a race. Florian highlighted that the children are all going to be reparented to init and that therefore we can't catch them anymore. The only plausible solution here is to use the controlling terminal to kill the orphan children. > > Checked on x86_64-linux-gnu. > > --- > > support/support_test_main.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/support/support_test_main.c b/support/support_test_main.c > > index e3f0bf15f2..ac9f710fb7 100644 > > --- a/support/support_test_main.c > > +++ b/support/support_test_main.c > > @@ -459,6 +459,9 @@ support_test_main (int argc, char **argv, const struct test_config *config) > > /* Process terminated normaly without timeout etc. */ > > if (WIFEXITED (status)) > > { > > + /* It is expected that a successful test execution handles all its > > + children. */ > > OK. > > > + > > if (config->expected_status == 0) > > { > > if (config->expected_signal == 0) > > @@ -486,6 +489,10 @@ support_test_main (int argc, char **argv, const struct test_config *config) > > /* Process was killed by timer or other signal. */ > > else > > { > > + /* Kill the whole process group if test process aborts or exits prior > > + waiting for them. */ > > + kill (-test_pid, SIGKILL); > > OK. Send negation of process group id to kill the whole process group. The > process group id is the same as the id of the process that created the group > so test_pid is the right value. > > Notes: > - Should we be using test_pgid to make this clear? > - Should this cleanup be refactored a bit to avoid duplication from > signal_handler() and support_test_main() e.g. kill_process_group () > which runs kill looks for errors prints diagnostic etc. > > > + > > if (config->expected_signal == 0) > > { > > printf ("Didn't expect signal from child: got `%s'\n", > > > > > -- > Cheers, > Carlos.
On 20/02/2020 12:12, Carlos O'Donell wrote: > On Thu, Feb 20, 2020 at 10:05 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Adhemerval Zanella: >> >>> Some testcases that create multiple subprocesses might abort or exit >>> prior waiting for their children. In such case, support_test_main >>> does not try to kill the spawned test process group (as in the >>> test timeout case). >> >> Does this actually work? Is the process group preserved if a process is >> reparented to init? > > No, you are right. > > There is a race too which I didn't see. > > Once you waitpid the pid and pgid might be free for reuse and we can't > guarantee this will work. > I tested by explicit injecting faulty fork calls with: diff --git a/malloc/tst-mallocfork2.c b/malloc/tst-mallocfork2.c index 0602a94895..c14cf9ef41 100644 --- a/malloc/tst-mallocfork2.c +++ b/malloc/tst-mallocfork2.c @@ -65,10 +65,13 @@ static volatile sig_atomic_t progress_indicator = 1; static void sigusr1_handler (int signo) { + static int count = 0; sigusr1_received = 1; /* Perform a fork with a trivial subprocess. */ pid_t pid = fork (); + if (++count == 100) + pid = -1; if (pid == -1) { write_message ("error: fork\n"); And without killing the process groups I see: azanella 18876 6236 18874 7869 0 13:41 pts/0 00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2 azanella 18878 6236 18874 7869 0 13:41 pts/0 00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2 azanella 18879 6236 18874 7869 0 13:41 pts/0 00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2 azanella 18880 6236 18874 7869 0 13:41 pts/0 00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2 azanella 18881 6236 18874 7869 0 13:41 pts/0 00:00:00 ./elf/ld-linux-x86-64.so.2 --library-path .:./math:./elf:./dlfcn:./nss:./nis:./rt:./resolv:./mathvec:./support:./crypt:./nptl malloc/tst-mallocfork2 So the process group is still preserved even when it is reparented to init (6236 is systemd in my case). In any case, as Carlos pointed out the issue is the possible race when the process fails, but it has not spawn any process and thus its group id might be reused. I would aiming to add a more generic solution, but it seems that the testcase itself would need to handle it on the abort situations.
diff --git a/support/support_test_main.c b/support/support_test_main.c index e3f0bf15f2..ac9f710fb7 100644 --- a/support/support_test_main.c +++ b/support/support_test_main.c @@ -459,6 +459,9 @@ support_test_main (int argc, char **argv, const struct test_config *config) /* Process terminated normaly without timeout etc. */ if (WIFEXITED (status)) { + /* It is expected that a successful test execution handles all its + children. */ + if (config->expected_status == 0) { if (config->expected_signal == 0) @@ -486,6 +489,10 @@ support_test_main (int argc, char **argv, const struct test_config *config) /* Process was killed by timer or other signal. */ else { + /* Kill the whole process group if test process aborts or exits prior + waiting for them. */ + kill (-test_pid, SIGKILL); + if (config->expected_signal == 0) { printf ("Didn't expect signal from child: got `%s'\n",