| Message ID | 20220421150654.817117821@infradead.org |
|---|---|
| State | New |
| Series | ptrace-vs-PREEMPT_RT and freezer rewrite |
On 04/21, Peter Zijlstra wrote:
>
> +static void clear_traced_quiesce(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));

This WARN_ON_ONCE() doesn't look right, the task can be killed right
after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
drops siglock.

> @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
> 		/*
> 		 * Don't want to allow preemption here, because
> 		 * sys_ptrace() needs this task to be inactive.
> -		 *
> -		 * XXX: implement read_unlock_no_resched().
> 		 */
> 		preempt_disable();
> 		read_unlock(&tasklist_lock);
> -		cgroup_enter_frozen();
> +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> +
> +		/*
> +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> +		 * set_current_state(TASK_TRACED) above and schedule() below.
> +		 * There must not be any blocking (specifically anything that
> +		 * touched ->saved_state on PREEMPT_RT) between here and
> +		 * schedule().
> +		 *
> +		 * ptrace_check_attach() relies on this with its
> +		 * wait_task_inactive() usage.
> +		 */
> +		clear_traced_quiesce();

Well, I think it should be called earlier under tasklist_lock,
before preempt_disable() above.

We need tasklist_lock to protect ->parent, debugger can be killed
and go away right after read_unlock(&tasklist_lock).

Still trying to convince myself everything is right with
JOBCTL_STOPPED/TRACED ...

Oleg.
On 04/21, Peter Zijlstra wrote:
>
> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
> 	 * schedule() will not sleep if there is a pending signal that
> 	 * can awaken the task.
> 	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
> 	set_special_state(TASK_TRACED);

OK, this looks wrong. I actually mean the previous patch which sets
JOBCTL_TRACED.

The problem is that the tracee can be already killed, so that
fatal_signal_pending(current) is true. In this case we can't rely on
signal_wake_up_state() which should clear JOBCTL_TRACED, or the
callers of ptrace_signal_wake_up/etc which clear this flag by hand.

In this case schedule() won't block and ptrace_stop() will leak
JOBCTL_TRACED. Unless I missed something.

We could check fatal_signal_pending() and damn! this is what I think
ptrace_stop() should have done from the very beginning. But for now
I'd suggest to simply clear this flag before return, along with
DELAY_WAKEKILL and LISTENING.

> 	current->jobctl &= ~JOBCTL_LISTENING;
> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

	current->jobctl &=
		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

Oleg.
On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > +static void clear_traced_quiesce(void)
> > +{
> > +	spin_lock_irq(&current->sighand->siglock);
> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
>
> This WARN_ON_ONCE() doesn't look right, the task can be killed right
> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
> drops siglock.

OK, will look at that.

> > @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
> > 		/*
> > 		 * Don't want to allow preemption here, because
> > 		 * sys_ptrace() needs this task to be inactive.
> > -		 *
> > -		 * XXX: implement read_unlock_no_resched().
> > 		 */
> > 		preempt_disable();
> > 		read_unlock(&tasklist_lock);
> > -		cgroup_enter_frozen();
> > +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> > +
> > +		/*
> > +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> > +		 * set_current_state(TASK_TRACED) above and schedule() below.
> > +		 * There must not be any blocking (specifically anything that
> > +		 * touched ->saved_state on PREEMPT_RT) between here and
> > +		 * schedule().
> > +		 *
> > +		 * ptrace_check_attach() relies on this with its
> > +		 * wait_task_inactive() usage.
> > +		 */
> > +		clear_traced_quiesce();
>
> Well, I think it should be called earlier under tasklist_lock,
> before preempt_disable() above.
>
> We need tasklist_lock to protect ->parent, debugger can be killed
> and go away right after read_unlock(&tasklist_lock).
>
> Still trying to convince myself everything is right with
> JOBCTL_STOPPED/TRACED ...

Can't do it earlier, since cgroup_enter_frozen() can do spinlock (eg.
use ->saved_state).
Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
>> On 04/21, Peter Zijlstra wrote:
>> >
>> > +static void clear_traced_quiesce(void)
>> > +{
>> > +	spin_lock_irq(&current->sighand->siglock);
>> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
>>
>> This WARN_ON_ONCE() doesn't look right, the task can be killed right
>> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
>> drops siglock.
>
> OK, will look at that.
>
>> Well, I think it should be called earlier under tasklist_lock,
>> before preempt_disable() above.
>>
>> We need tasklist_lock to protect ->parent, debugger can be killed
>> and go away right after read_unlock(&tasklist_lock).
>>
>> Still trying to convince myself everything is right with
>> JOBCTL_STOPPED/TRACED ...
>
> Can't do it earlier, since cgroup_enter_frozen() can do spinlock (eg.
> use ->saved_state).

There are some other issues in this part of ptrace_stop().

I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".
Currently in ptrace_check_attach a parameter of __TASK_TRACED is passed
so that wait_task_inactive can fail if the "!current->ptrace" branch of
ptrace_stop is taken and ptrace_stop does not stop.

With the TASK_FROZEN state it appears the "!current->ptrace" branch can
continue and freeze somewhere else, and wait_task_inactive could decide
it was fine.

I have to run, but hopefully tomorrow I will post the patches that
remove the "!current->ptrace" case altogether and basically remove the
need for quiesce and for wait_task_inactive detecting which branch is
taken.

The spinlock in cgroup_enter_frozen remains an issue for PREEMPT_RT.
But the rest of the issues are cleared up by using siglock instead of
tasklist_lock. Plus the code is just easier to read and understand.

Eric
On 04/25, Eric W. Biederman wrote:
>
> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".

As Peter explained, in this case we can rely on __ptrace_unlink() which
should clear this flag.

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/25, Eric W. Biederman wrote:
>>
>> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".
>
> As Peter explained, in this case we can rely on __ptrace_unlink() which
> should clear this flag.

I had missed that signal_wake_up_state was clearing
JOBCTL_TRACED_QUIESCE.

Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
siglock is taken before calling ptrace_stop.  Especially with the
ptrace_notify in signal_delivered that does not look guaranteed.

The __ptrace_unlink could also happen during arch_ptrace_stop.

Relying on siglock is sufficient because __ptrace_unlink holds siglock
over clearing task->ptrace.  Which means that the simple fix for this
is to just test task->ptrace before we set JOBCTL_TRACED_QUIESCE.

Eric
On 04/26, Eric W. Biederman wrote:
>
> Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
> siglock is taken before calling ptrace_stop.  Especially with the
> ptrace_notify in signal_delivered that does not look guaranteed.
>
> The __ptrace_unlink could also happen during arch_ptrace_stop.
>
> Relying on siglock is sufficient because __ptrace_unlink holds siglock
> over clearing task->ptrace.  Which means that the simple fix for this
> is to just test task->ptrace before we set JOBCTL_TRACED_QUIESCE.

Or simply clear _QUIESCE along with _TRACED/DELAY_WAKEKILL before return?

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
>> 	 * schedule() will not sleep if there is a pending signal that
>> 	 * can awaken the task.
>> 	 */
>> -	current->jobctl |= JOBCTL_TRACED;
>> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
>> 	set_special_state(TASK_TRACED);
>
> OK, this looks wrong. I actually mean the previous patch which sets
> JOBCTL_TRACED.
>
> The problem is that the tracee can be already killed, so that
> fatal_signal_pending(current) is true. In this case we can't rely on
> signal_wake_up_state() which should clear JOBCTL_TRACED, or the
> callers of ptrace_signal_wake_up/etc which clear this flag by hand.
>
> In this case schedule() won't block and ptrace_stop() will leak
> JOBCTL_TRACED. Unless I missed something.
>
> We could check fatal_signal_pending() and damn! this is what I think
> ptrace_stop() should have done from the very beginning. But for now
> I'd suggest to simply clear this flag before return, along with
> DELAY_WAKEKILL and LISTENING.

Oh.  That is an interesting case for JOBCTL_TRACED.

The scheduler refuses to stop if signal_pending_state(TASK_TRACED, p)
returns true.  The ptrace_stop code used to handle this explicitly, and
in commit 7d613f9f72ec ("signal: Remove the bogus sigkill_pending in
ptrace_stop") I actually removed the test, as the test was somewhat
wrong and redundant, and in slightly the wrong location.

But doing:

	/* Don't stop if the task is dying */
	if (unlikely(__fatal_signal_pending(current)))
		return exit_code;

should work.

>> 	current->jobctl &= ~JOBCTL_LISTENING;
>> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>
> 	current->jobctl &=
> 		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

I presume you meant:

	current->jobctl &=
		~(JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

I don't think we want to do that.  For the case you are worried about
it is a valid fix.
In general this is the wrong approach, as we want the waker to clear
JOBCTL_TRACED.  If the waker does not, it is possible that
ptrace_freeze_traced might attempt to freeze a process whose state is
not appropriate for attach, because the code is past the call to
schedule().

In fact I think clearing JOBCTL_TRACED at the end of ptrace_stop will
allow ptrace_freeze_traced to come in while siglock is dropped, expect
the process to stop, and have the process not stop.  Of course with
wait_task_inactive coming first that might not be a problem.

This is a minor problem with the patchset I just posted.  I thought the
only reason wait_task_inactive could fail was if ptrace_stop() hit the
!current->ptrace case.  Thinking about it, any SIGKILL coming in before
the tracee stops in schedule will trigger this, so it is not as safe as
I thought to not pass a state into wait_task_inactive.

It is time for me to shut down today.  I will sleep on it and see what
I can see tomorrow.

Eric
On 04/21, Peter Zijlstra wrote:
>
> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
> 		goto out_put_task_struct;
>
> 	ret = arch_ptrace(child, request, addr, data);
> -	if (ret || request != PTRACE_DETACH)
> -		ptrace_unfreeze_traced(child);
> +	ptrace_unfreeze_traced(child);

Forgot to mention... whatever we do this doesn't look right.

ptrace_unfreeze_traced() must not be called if the tracee was untraced,
another debugger can come after that. I agree, the current code looks
a bit confusing, perhaps it makes sense to rewrite it:

	if (request == PTRACE_DETACH && ret == 0)
		; /* nothing to do, no longer traced by us */
	else
		ptrace_unfreeze_traced(child);

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
>> 		goto out_put_task_struct;
>>
>> 	ret = arch_ptrace(child, request, addr, data);
>> -	if (ret || request != PTRACE_DETACH)
>> -		ptrace_unfreeze_traced(child);
>> +	ptrace_unfreeze_traced(child);
>
> Forgot to mention... whatever we do this doesn't look right.
>
> ptrace_unfreeze_traced() must not be called if the tracee was untraced,
> another debugger can come after that. I agree, the current code looks
> a bit confusing, perhaps it makes sense to rewrite it:
>
> 	if (request == PTRACE_DETACH && ret == 0)
> 		; /* nothing to do, no longer traced by us */
> 	else
> 		ptrace_unfreeze_traced(child);

This was a bug in my original JOBCTL_DELAY_WAKEKILL patch and it was
just cut and pasted here.  I thought it made sense when I was throwing
things together, but when I looked more closely I realized that it is
not safe.

Eric
On Tue, Apr 26, 2022 at 07:24:03PM -0500, Eric W. Biederman wrote:
> But doing:
>
> 	/* Don't stop if the task is dying */
> 	if (unlikely(__fatal_signal_pending(current)))
> 		return exit_code;
>
> Should work.

Something like so then...

---
Subject: signal,ptrace: Don't stop dying tasks
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Apr 28 22:17:56 CEST 2022

Oleg pointed out that the tracee can already be killed such that
fatal_signal_pending() is true. In that case signal_wake_up_state()
cannot be relied upon to be responsible for the wakeup -- something
we're going to want to rely on.

As such, explicitly handle this case.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/signal.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
+	/* Don't stop if the task is dying. */
+	if (unlikely(__fatal_signal_pending(current)))
+		return exit_code;
+
 	/*
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
On 04/28, Peter Zijlstra wrote:
>
> Oleg pointed out that the tracee can already be killed such that
> fatal_signal_pending() is true. In that case signal_wake_up_state()
> cannot be relied upon to be responsible for the wakeup -- something
> we're going to want to rely on.

Peter, I am all confused...

If this patch is against the current tree, we don't need it.

If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
then it can't help - SIGKILL can come right after the tracee drops siglock
and calls schedule().

Perhaps I missed something, but let me repeat the 3rd time: I'd suggest
to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL before
return to close this race.

Oleg.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
> 		spin_lock_irq(&current->sighand->siglock);
> 	}
> 
> +	/* Don't stop if the task is dying. */
> +	if (unlikely(__fatal_signal_pending(current)))
> +		return exit_code;
> +
> 	/*
> 	 * schedule() will not sleep if there is a pending signal that
> 	 * can awaken the task.
>
On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> On 04/28, Peter Zijlstra wrote:
> >
> > Oleg pointed out that the tracee can already be killed such that
> > fatal_signal_pending() is true. In that case signal_wake_up_state()
> > cannot be relied upon to be responsible for the wakeup -- something
> > we're going to want to rely on.
>
> Peter, I am all confused...
>
> If this patch is against the current tree, we don't need it.
>
> If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> then it can't help - SIGKILL can come right after the tracee drops siglock
> and calls schedule().

But by that time it will already have set TRACED and signal_wake_up()
will clear it, no?

> Perhaps I missed something, but let me repeat the 3rd time: I'd suggest
> to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL before
> return to close this race.

I think Eric convinced me there was a problem with that, but I'll go
over it all again in the morning, perhaps I'll reach a different
conclusion :-)
Peter, you know, it is very difficult for me to discuss the changes in
the 2 unfinished series and not lose the context ;) Plus I am already
sleeping. But I'll try to reply anyway.

On 04/29, Peter Zijlstra wrote:
>
> On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> > If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> > then it can't help - SIGKILL can come right after the tracee drops siglock
> > and calls schedule().
>
> But by that time it will already have set TRACED and signal_wake_up()
> wil clear it, no?

No. JOBCTL_DELAY_WAKEKILL is already set, this means that signal_wake_up()
will remove TASK_WAKEKILL from the "state" passed to signal_wake_up_state().
This is fine and correct, it means that ttwu() won't change ->__state.

But this also means that wake_up_state() will return false, and in this
case

	signal_wake_up_state:
		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);

won't clear these flags. And this is nice too.

But. fatal_signal_pending() is true! And once we change freeze_traced()
to not abuse p->__state, schedule() won't block, because it will check
signal_pending_state(TASK_TRACED == TASK_WAKEKILL | __TASK_TRACED) and
__fatal_signal_pending() == T.

In this case ptrace_stop() will leak JOBCTL_TRACED, so we simply need
to clear it before return along with LISTENING | DELAY_WAKEKILL.

> I'll go
> over it all again in the morning, perhaps I'll reach a different
> conclusion :-)

Same here ;)

Oleg.
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24	/* do_signal_stop() */
-#define JOBCTL_TRACED_BIT	25	/* ptrace_stop() */
+#define JOBCTL_STOPPED_BIT	25	/* do_signal_stop() */
+#define JOBCTL_TRACED_BIT	26	/* ptrace_stop() */
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,41 +193,44 @@ static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (task_is_traced(task) &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(READ_ONCE(task->__state) != TASK_TRACED);
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
 
-	/*
-	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
-	 * Recheck state under the lock to close this race.
-	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
+	if (task_is_traced(task)) {
+//		WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+		task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 		if (__fatal_signal_pending(task)) {
 			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+			wake_up_state(task, TASK_WAKEKILL);
+		}
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +254,45 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks. After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	/* Wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop(). */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
 out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1479,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
 out_put_task_struct:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@ static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 * - we form a control dependency vs deactivate_task() below.
-	 * - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACE_QUIESCE bridges the gap between
+		 * set_current_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touched ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
+		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
task->__state as much.

Due to how PREEMPT_RT is changing the rules vs task->__state with the
introduction of task->saved_state while TASK_RTLOCK_WAIT (the whole
blocking spinlock thing), the way ptrace freeze tries to do things no
longer works.

Specifically there are two problems:

 - due to ->saved_state, the ->__state modification removing
   TASK_WAKEKILL no longer works reliably.

 - due to ->saved_state, wait_task_inactive() also no longer works
   reliably.

The first problem is solved by a suggestion from Eric that instead of
changing __state, TASK_WAKEKILL be delayed.

The second problem is solved by a suggestion from Oleg; add
JOBCTL_TRACED_QUIESCE to cover the chunk of code between
set_current_state(TASK_TRACED) and schedule(), such that
ptrace_check_attach() can first wait for JOBCTL_TRACED_QUIESCE to get
cleared, and then use wait_task_inactive().

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/jobctl.h |    8 ++-
 kernel/ptrace.c              |   90 ++++++++++++++++++++-----------------------
 kernel/sched/core.c          |    5 --
 kernel/signal.c              |   36 +++++++++++++----
 4 files changed, 86 insertions(+), 53 deletions(-)