Message ID: 878rrrh32q.fsf_-_@email.froward.int.ebiederm.org
Series: ptrace: cleaning up ptrace_stop
On Tue, Apr 26, 2022 at 3:52 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > > xtensa is the last user of the PT_SINGLESTEP flag. Changing tsk->ptrace in > user_enable_single_step and user_disable_single_step without locking could > potentiallly cause problems. > > So use a thread info flag instead of a flag in tsk->ptrace. Use TIF_SINGLESTEP > that xtensa already had defined but unused. > > Remove the definitions of PT_SINGLESTEP and PT_BLOCKSTEP as they have no more users. > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > arch/xtensa/kernel/ptrace.c | 4 ++-- > arch/xtensa/kernel/signal.c | 4 ++-- > include/linux/ptrace.h | 6 ------ > 3 files changed, 4 insertions(+), 10 deletions(-) Acked-by: Max Filippov <jcmvbkbc@gmail.com>
On 2022-04-26 17:52:07 [-0500], Eric W. Biederman wrote: > The functions ptrace_stop and do_signal_stop have to drop siglock > and grab tasklist_lock because the parent/child relation ship > is guarded by siglock and not siglock. "is guarded by tasklist_lock and not siglock." ? > Simplify things by guarding the parent/child relationship > with siglock. For the most part this just requires a little bit > of code motion. In a couple of places more locking was needed. > > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Sebastian
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes: > On 2022-04-26 17:52:07 [-0500], Eric W. Biederman wrote: >> The functions ptrace_stop and do_signal_stop have to drop siglock >> and grab tasklist_lock because the parent/child relation ship >> is guarded by siglock and not siglock. > > "is guarded by tasklist_lock and not siglock." ? Yes. Thank you. I will fix that. >> Simplify things by guarding the parent/child relationship >> with siglock. For the most part this just requires a little bit >> of code motion. In a couple of places more locking was needed. >> >> >> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > > Sebastian Eric
On 04/26, Eric W. Biederman wrote: > > @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, > } > > sighand = parent->sighand; > - spin_lock_irqsave(&sighand->siglock, flags); > + lock = tsk->sighand != sighand; > + if (lock) > + spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING); But why is it safe? Suppose we have two tasks, they both trace each other, both call ptrace_stop() at the same time. Of course this is ugly, they both will block. But with this patch in this case we have the trivial ABBA deadlock, no? Oleg.
Oleg Nesterov <oleg@redhat.com> writes: > On 04/26, Eric W. Biederman wrote: >> >> @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, >> } >> >> sighand = parent->sighand; >> - spin_lock_irqsave(&sighand->siglock, flags); >> + lock = tsk->sighand != sighand; >> + if (lock) >> + spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING); > > But why is it safe? > > Suppose we have two tasks, they both trace each other, both call > ptrace_stop() at the same time. Of course this is ugly, they both > will block. > > But with this patch in this case we have the trivial ABBA deadlock, > no? I was thinking in terms of the process tree (which is fine). The ptrace parental relationship definitely has the potential to be a graph with cycles. Which as you point out is not fine. The result is very nice and I don't want to give it up. I suspect ptrace cycles are always a problem and can simply be forbidden. That is going to take some analysis and some additional code in ptrace_attach. I will go look at that. Eric
On 04/27, Eric W. Biederman wrote: > > The ptrace parental relationship definitely has the potential to be a > graph with cycles. Which as you point out is not fine. > > The result is very nice and I don't want to give it up. I suspect > something ptrace cycles are always a problem and can simply be > forbidden. OK, please consider another case. We have a parent P and its child C. C traces P. This is not that unusual, I don't think we can forbid this case. P reports an event and calls do_notify_parent_cldstop(). C receives SIGSTOP and calls do_notify_parent_cldstop() too. Oleg.
"Eric W. Biederman" <ebiederm@xmission.com> writes: > Oleg Nesterov <oleg@redhat.com> writes: > >> On 04/26, Eric W. Biederman wrote: >>> >>> @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, >>> } >>> >>> sighand = parent->sighand; >>> - spin_lock_irqsave(&sighand->siglock, flags); >>> + lock = tsk->sighand != sighand; >>> + if (lock) >>> + spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING); >> >> But why is it safe? >> >> Suppose we have two tasks, they both trace each other, both call >> ptrace_stop() at the same time. Of course this is ugly, they both >> will block. >> >> But with this patch in this case we have the trivial ABBA deadlock, >> no? > > I was thinking in terms of the process tree (which is fine). > > The ptrace parental relationship definitely has the potential to be a > graph with cycles. Which as you point out is not fine. > > > The result is very nice and I don't want to give it up. I suspect > something ptrace cycles are always a problem and can simply be > forbidden. That is going to take some analsysis and some additional > code in ptrace_attach. > > I will go look at that. Hmm. If we have the following process tree. A \ B \ C Process A, B, and C are all in the same process group. Process A and B are setup to receive SIGCHILD when their process stops. Process C traces process A. When a sigstop is delivered to the group we can have: Process B takes siglock(B) siglock(A) to notify the real_parent Process C takes siglock(C) siglock(B) to notify the real_parent Process A takes siglock(A) siglock(C) to notify the tracer If they all take their local lock at the same time there is a deadlock. I don't think the restriction that you can never ptrace anyone up the process tree is going to fly. So it looks like I am back to the drawing board for this one. Eric
On 04/26, Eric W. Biederman wrote: > > @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code, > spin_lock_irq(&current->sighand->siglock); > } > > + /* Don't stop if current is not ptraced */ > + if (unlikely(!current->ptrace)) > + return (clear_code) ? 0 : exit_code; > + > + /* > + * If @why is CLD_STOPPED, we're trapping to participate in a group > + * stop. Do the bookkeeping. Note that if SIGCONT was delievered > + * across siglock relocks since INTERRUPT was scheduled, PENDING > + * could be clear now. We act as if SIGCONT is received after > + * TASK_TRACED is entered - ignore it. > + */ > + if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING)) > + gstop_done = task_participate_group_stop(current); > + > + /* > + * Notify parents of the stop. > + * > + * While ptraced, there are two parents - the ptracer and > + * the real_parent of the group_leader. The ptracer should > + * know about every stop while the real parent is only > + * interested in the completion of group stop. The states > + * for the two don't interact with each other. Notify > + * separately unless they're gonna be duplicates. > + */ > + do_notify_parent_cldstop(current, true, why); > + if (gstop_done && ptrace_reparented(current)) > + do_notify_parent_cldstop(current, false, why); This doesn't look right either. The parent should be notified only after we set __state = TASK_TRACED and ->exit_code. Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop() wakes it up, debugger calls wait_task_stopped() and then it will sleep again, task_stopped_code() returns 0. This can probably be fixed if you remove the lockless (fast path) task_stopped_code() check in wait_task_stopped(), but this is not nice performance-wise... Oleg.
On 04/27, Oleg Nesterov wrote: > > On 04/26, Eric W. Biederman wrote: > > > > @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code, > > spin_lock_irq(&current->sighand->siglock); > > } > > > > + /* Don't stop if current is not ptraced */ > > + if (unlikely(!current->ptrace)) > > + return (clear_code) ? 0 : exit_code; > > + > > + /* > > + * If @why is CLD_STOPPED, we're trapping to participate in a group > > + * stop. Do the bookkeeping. Note that if SIGCONT was delievered > > + * across siglock relocks since INTERRUPT was scheduled, PENDING > > + * could be clear now. We act as if SIGCONT is received after > > + * TASK_TRACED is entered - ignore it. > > + */ > > + if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING)) > > + gstop_done = task_participate_group_stop(current); > > + > > + /* > > + * Notify parents of the stop. > > + * > > + * While ptraced, there are two parents - the ptracer and > > + * the real_parent of the group_leader. The ptracer should > > + * know about every stop while the real parent is only > > + * interested in the completion of group stop. The states > > + * for the two don't interact with each other. Notify > > + * separately unless they're gonna be duplicates. > > + */ > > + do_notify_parent_cldstop(current, true, why); > > + if (gstop_done && ptrace_reparented(current)) > > + do_notify_parent_cldstop(current, false, why); > > This doesn't look right too. The parent should be notified only after > we set __state = TASK_TRACED and ->exit code. > > Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop() > wakes it up, debugger calls wait_task_stopped() and then it will sleep > again, task_stopped_code() returns 0. > > This can be probably fixed if you remove the lockless (fast path) > task_stopped_code() check in wait_task_stopped(), but this is not > nice performance-wise... On the other hand, I don't understand why you moved the callsite of do_notify_parent_cldstop() up... 
just don't do this? Oleg.
On 04/26, Eric W. Biederman wrote: > > static void ptrace_unfreeze_traced(struct task_struct *task) > { > - if (READ_ONCE(task->__state) != __TASK_TRACED) > + if (!(READ_ONCE(task->jobctl) & JOBCTL_DELAY_WAKEKILL)) > return; > > WARN_ON(!task->ptrace || task->parent != current); > @@ -213,11 +213,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task) > * Recheck state under the lock to close this race. > */ > spin_lock_irq(&task->sighand->siglock); Now that we do not check __state = __TASK_TRACED, we need lock_task_sighand(). The tracee can be already woken up by ptrace_resume(), but it is possible that it didn't clear DELAY_WAKEKILL yet. Now, before we take ->siglock, the tracee can exit and another thread can do wait() and reap this task. Also, I think the comment above should be updated. I agree, it makes sense to re-check JOBCTL_DELAY_WAKEKILL under siglock just for clarity, but we no longer need to do this to close the race; jobctl &= ~JOBCTL_DELAY_WAKEKILL and wake_up_state() are safe even if JOBCTL_DELAY_WAKEKILL was already cleared. > @@ -2307,6 +2307,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code, > > /* LISTENING can be set only during STOP traps, clear it */ > current->jobctl &= ~JOBCTL_LISTENING; > + current->jobctl &= ~JOBCTL_DELAY_WAKEKILL; minor, but current->jobctl &= ~(JOBCTL_LISTENING | JOBCTL_DELAY_WAKEKILL); looks better. Oleg.
On 04/26, Eric W. Biederman wrote: > > @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) > */ > if (lock_task_sighand(child, &flags)) { > if (child->ptrace && child->parent == current) { > - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); > + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); This WARN_ON() doesn't look right. It is possible that this child was traced by another task and PTRACE_DETACH'ed, but it didn't clear DELAY_WAKEKILL. If the new debugger attaches and calls ptrace() before the child takes siglock ptrace_freeze_traced() will fail, but we can hit this WARN_ON(). Oleg.
Oleg Nesterov <oleg@redhat.com> writes: > On 04/26, Eric W. Biederman wrote: >> >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) >> */ >> if (lock_task_sighand(child, &flags)) { >> if (child->ptrace && child->parent == current) { >> - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); >> + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); > > This WARN_ON() doesn't look right. > > It is possible that this child was traced by another task and PTRACE_DETACH'ed, > but it didn't clear DELAY_WAKEKILL. That would be a bug. That would mean that a PTRACE_DETACHED process cannot be SIGKILL'd. > If the new debugger attaches and calls ptrace() before the child takes siglock > ptrace_freeze_traced() will fail, but we can hit this WARN_ON(). Eric
On 04/27, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > On 04/26, Eric W. Biederman wrote: > >> > >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) > >> */ > >> if (lock_task_sighand(child, &flags)) { > >> if (child->ptrace && child->parent == current) { > >> - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); > >> + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); > > > > This WARN_ON() doesn't look right. > > > > It is possible that this child was traced by another task and PTRACE_DETACH'ed, > > but it didn't clear DELAY_WAKEKILL. > > That would be a bug. That would mean that PTRACE_DETACHED process can > not be SIGKILL'd. Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice SIGKILL after that. Oleg. > > If the new debugger attaches and calls ptrace() before the child takes siglock > > ptrace_freeze_traced() will fail, but we can hit this WARN_ON(). > > Eric >
On 04/27, Oleg Nesterov wrote: > > On 04/27, Eric W. Biederman wrote: > > > > Oleg Nesterov <oleg@redhat.com> writes: > > > > > On 04/26, Eric W. Biederman wrote: > > >> > > >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) > > >> */ > > >> if (lock_task_sighand(child, &flags)) { > > >> if (child->ptrace && child->parent == current) { > > >> - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); > > >> + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); > > > > > > This WARN_ON() doesn't look right. > > > > > > It is possible that this child was traced by another task and PTRACE_DETACH'ed, > > > but it didn't clear DELAY_WAKEKILL. > > > > That would be a bug. That would mean that PTRACE_DETACHED process can > > not be SIGKILL'd. > > Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice > SIGKILL after that. Not to mention that the tracee is TASK_RUNNING after PTRACE_DETACH wakes it up, so the pending JOBCTL_DELAY_WAKEKILL simply has no effect. Oleg. > > > If the new debugger attaches and calls ptrace() before the child takes siglock > > > ptrace_freeze_traced() will fail, but we can hit this WARN_ON(). > > > > Eric > >
Oleg Nesterov <oleg@redhat.com> writes: > On 04/27, Oleg Nesterov wrote: >> >> On 04/27, Eric W. Biederman wrote: >> > >> > Oleg Nesterov <oleg@redhat.com> writes: >> > >> > > On 04/26, Eric W. Biederman wrote: >> > >> >> > >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state) >> > >> */ >> > >> if (lock_task_sighand(child, &flags)) { >> > >> if (child->ptrace && child->parent == current) { >> > >> - WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED); >> > >> + WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL); >> > > >> > > This WARN_ON() doesn't look right. >> > > >> > > It is possible that this child was traced by another task and PTRACE_DETACH'ed, >> > > but it didn't clear DELAY_WAKEKILL. >> > >> > That would be a bug. That would mean that PTRACE_DETACHED process can >> > not be SIGKILL'd. >> >> Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice >> SIGKILL after that. > > Not to mention that the tracee is TASK_RUNNING after PTRACE_DETACH wakes it > up, so the pending JOBCTL_DELAY_WAKEKILL simply has no effect. Oh. You are talking about the window between when the traced state is cleared and when the tracee resumes executing and clears JOBCTL_DELAY_WAKEKILL. I thought you were thinking about JOBCTL_DELAY_WAKEKILL being leaked. That requires both ptrace_attach and ptrace_check_attach for the new tracer to happen before the tracee is scheduled to run. I agree. I think the WARN_ON could reasonably be moved a bit later, but I don't know that the WARN_ON is important. I simply kept it because it seemed to make sense. Eric
Oleg Nesterov <oleg@redhat.com> writes: > On 04/27, Oleg Nesterov wrote: >> >> On 04/26, Eric W. Biederman wrote: >> > >> > @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code, >> > spin_lock_irq(&current->sighand->siglock); >> > } >> > >> > + /* Don't stop if current is not ptraced */ >> > + if (unlikely(!current->ptrace)) >> > + return (clear_code) ? 0 : exit_code; >> > + >> > + /* >> > + * If @why is CLD_STOPPED, we're trapping to participate in a group >> > + * stop. Do the bookkeeping. Note that if SIGCONT was delievered >> > + * across siglock relocks since INTERRUPT was scheduled, PENDING >> > + * could be clear now. We act as if SIGCONT is received after >> > + * TASK_TRACED is entered - ignore it. >> > + */ >> > + if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING)) >> > + gstop_done = task_participate_group_stop(current); >> > + >> > + /* >> > + * Notify parents of the stop. >> > + * >> > + * While ptraced, there are two parents - the ptracer and >> > + * the real_parent of the group_leader. The ptracer should >> > + * know about every stop while the real parent is only >> > + * interested in the completion of group stop. The states >> > + * for the two don't interact with each other. Notify >> > + * separately unless they're gonna be duplicates. >> > + */ >> > + do_notify_parent_cldstop(current, true, why); >> > + if (gstop_done && ptrace_reparented(current)) >> > + do_notify_parent_cldstop(current, false, why); >> >> This doesn't look right too. The parent should be notified only after >> we set __state = TASK_TRACED and ->exit code. >> >> Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop() >> wakes it up, debugger calls wait_task_stopped() and then it will sleep >> again, task_stopped_code() returns 0. >> >> This can be probably fixed if you remove the lockless (fast path) >> task_stopped_code() check in wait_task_stopped(), but this is not >> nice performance-wise... 
Another detail I have overlooked. Thank you. Or we can change task_stopped_code to look something like: static int *task_stopped_code(struct task_struct *p, bool ptrace) { if (ptrace) { - if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING)) + if (p->ptrace && !(p->jobctl & JOBCTL_LISTENING)) return &p->exit_code; } else { if (p->signal->flags & SIGNAL_STOP_STOPPED) return &p->signal->group_exit_code; } return NULL; } I probably need to do a little bit more to ensure that it isn't an actual process exit_code in p->exit_code. But then we don't have to limit ourselves to being precisely in the task_is_traced stopped place for the fast path. > On the other hand, I don't understand why did you move the callsite > of do_notify_parent_cldstop() up... just don't do this? My goal (and I still think it makes sense, if not my implementation) is to move set_special_state as close as possible to schedule(). That way we can avoid sleeping spin_locks clobbering it and making our life difficult. My hope is we can just clean up ptrace_stop instead of making it more complicated and harder to follow. Not that I am fundamentally opposed to the quiesce bit but the code is already very hard to follow because of all its nuance and complexity, and I would really like to reduce that complexity if we can possibly figure out how. Eric
Oleg Nesterov <oleg@redhat.com> writes: > On 04/26, Eric W. Biederman wrote: >> >> static void ptrace_unfreeze_traced(struct task_struct *task) >> { >> - if (READ_ONCE(task->__state) != __TASK_TRACED) >> + if (!(READ_ONCE(task->jobctl) & JOBCTL_DELAY_WAKEKILL)) >> return; >> >> WARN_ON(!task->ptrace || task->parent != current); >> @@ -213,11 +213,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task) >> * Recheck state under the lock to close this race. >> */ >> spin_lock_irq(&task->sighand->siglock); > > Now that we do not check __state = __TASK_TRACED, we need lock_task_sighand(). > The tracee can be already woken up by ptrace_resume(), but it is possible that > it didn't clear DELAY_WAKEKILL yet. Yes. The subtle differences in when __TASK_TRACED and JOBCTL_DELAY_WAKEKILL are cleared are causing me some minor issues. This "WARN_ON(!task->ptrace || task->parent != current);" also now needs to be inside siglock, because the __TASK_TRACED check is insufficient. > Now, before we take ->siglock, the tracee can exit and another thread can do > wait() and reap this task. > > Also, I think the comment above should be updated. I agree, it makes sense to > re-check JOBCTL_DELAY_WAKEKILL under siglock just for clarity, but we no longer > need to do this to close the race; jobctl &= ~JOBCTL_DELAY_WAKEKILL and > wake_up_state() are safe even if JOBCTL_DELAY_WAKEKILL was already > cleared. I think you are right about it being safe, but I am having a hard time convincing myself that is true. I want to be very careful sending __TASK_TRACED wake_ups as ptrace_stop fundamentally can't handle spurious wake_ups. So I think we should add task_is_traced to the test to verify the task is still frozen: static void ptrace_unfreeze_traced(struct task_struct *task) { unsigned long flags; /* * Verify the task is still frozen before unfreezing it, * ptrace_resume could have unfrozen us. 
*/ if (lock_task_sighand(task, &flags)) { if ((task->jobctl & JOBCTL_DELAY_WAKEKILL) && task_is_traced(task)) { task->jobctl &= ~JOBCTL_DELAY_WAKEKILL; if (__fatal_signal_pending(task)) wake_up_state(task, __TASK_TRACED); } unlock_task_sighand(task, &flags); } } >> @@ -2307,6 +2307,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code, >> >> /* LISTENING can be set only during STOP traps, clear it */ >> current->jobctl &= ~JOBCTL_LISTENING; >> + current->jobctl &= ~JOBCTL_DELAY_WAKEKILL; > > minor, but > > current->jobctl &= ~(JOBCTL_LISTENING | JOBCTL_DELAY_WAKEKILL); > > looks better. Yes. Eric
"Eric W. Biederman" <ebiederm@xmission.com> writes: > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > index 3c8b34876744..1947c85aa9d9 100644 > --- a/include/linux/sched/signal.h > +++ b/include/linux/sched/signal.h > @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state); > > static inline void signal_wake_up(struct task_struct *t, bool resume) > { > - signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0); > + bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL); > + signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0); > } > static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) > { Grrr. While looking through everything today I have realized that there is a bug. Suppose we have 3 processes: TRACER, TRACEE, KILLER. Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has been dropped. The TRACER process has performed ptrace_attach on TRACEE and is in the middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL. Then comes in the KILLER process and sends the TRACEE a SIGKILL. The TRACEE __state remains TASK_TRACED, as designed. The bug appears when the TRACEE makes it to schedule(). Inside schedule there is a call to signal_pending_state() which notices a SIGKILL is pending and refuses to sleep. I could avoid setting TIF_SIGPENDING in signal_wake_up but that is insufficient as another signal may be pending. I could avoid marking the task as __fatal_signal_pending but then where would the information that the task needs to become __fatal_signal_pending go. Hmm. This looks like I need my other pending cleanup which introduces a helper to get this idea to work. Eric
On Tue, Apr 26, 2022 at 05:50:21PM -0500, Eric W. Biederman wrote: > .... Peter Zijlstra has > been rewriting the classic freezer and in earlier parts of this > discussion so I presume it is also a problem for PREEMPT_RT. Ah, the freezer thing is in fact a sched/arm64 issue, the common issue between these two issues is ptrace though. Specifically, on recent arm64 chips only a subset of CPUs can execute arm32 code and 32bit processes are restricted to that subset. If by some mishap you try and execute a 32bit task on a non-capable CPU it gets terminated without prejudice. Now, the current freezer has this problem that tasks can spuriously thaw too soon (where too soon is before SMP is restored) which leads to these 32bit tasks being killed dead. That, and it was a good excuse to fix up the current freezer :-)
On Tue, Apr 26, 2022 at 05:52:03PM -0500, Eric W. Biederman wrote: > Rename send_signal send_signal_locked and make to make > it usable outside of signal.c. > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > include/linux/signal.h | 2 ++ > kernel/signal.c | 24 ++++++++++++------------ > 2 files changed, 14 insertions(+), 12 deletions(-) > > diff --git a/include/linux/signal.h b/include/linux/signal.h > index a6db6f2ae113..55605bdf5ce9 100644 > --- a/include/linux/signal.h > +++ b/include/linux/signal.h > @@ -283,6 +283,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info, > extern int group_send_sig_info(int sig, struct kernel_siginfo *info, > struct task_struct *p, enum pid_type type); > extern int __group_send_sig_info(int, struct kernel_siginfo *, struct task_struct *); > +extern int send_signal_locked(int sig, struct kernel_siginfo *info, > + struct task_struct *p, enum pid_type type); > extern int sigprocmask(int, sigset_t *, sigset_t *); > extern void set_current_blocked(sigset_t *); > extern void __set_current_blocked(const sigset_t *); > diff --git a/kernel/signal.c b/kernel/signal.c > index 30cd1ca43bcd..b0403197b0ad 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -1071,8 +1071,8 @@ static inline bool legacy_queue(struct sigpending *signals, int sig) > return (sig < SIGRTMIN) && sigismember(&signals->signal, sig); > } > > -static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t, > - enum pid_type type, bool force) > +static int __send_signal_locked(int sig, struct kernel_siginfo *info, > + struct task_struct *t, enum pid_type type, bool force) > { > struct sigpending *pending; > struct sigqueue *q; While there, could you please replace that assert_spin_locked() with lockdep_assert_held(&t->sighand->siglock) ? The distinction being that assert_spin_locked() checks if the lock is held *by*anyone* whereas lockdep_assert_held() asserts the current context holds the lock. 
Also, the check goes away if you build without lockdep.
On Tue, Apr 26, 2022 at 05:52:08PM -0500, Eric W. Biederman wrote: > Now that siglock keeps tsk->parent and tsk->real_parent constant > require that do_notify_parent_cldstop is called with tsk->siglock held > instead of the tasklist_lock. > > As all of the callers of do_notify_parent_cldstop had to drop the > siglock and take tasklist_lock this simplifies all of it's callers. > > Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> > --- > kernel/signal.c | 156 +++++++++++++++++------------------------------- > 1 file changed, 55 insertions(+), 101 deletions(-) > > diff --git a/kernel/signal.c b/kernel/signal.c > index 72d96614effc..584d67deb3cb 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2121,11 +2121,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk, > bool for_ptracer, int why) > { > struct kernel_siginfo info; > - unsigned long flags; > struct task_struct *parent; > struct sighand_struct *sighand; > + bool lock; > u64 utime, stime; > > + assert_spin_locked(&tsk->sighand->siglock); lockdep_assert_held() please...
On 04/27, Eric W. Biederman wrote: > > "Eric W. Biederman" <ebiederm@xmission.com> writes: > > > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h > > index 3c8b34876744..1947c85aa9d9 100644 > > --- a/include/linux/sched/signal.h > > +++ b/include/linux/sched/signal.h > > @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state); > > > > static inline void signal_wake_up(struct task_struct *t, bool resume) > > { > > - signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0); > > + bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL); > > + signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0); > > } > > static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) > > { > > Grrr. While looking through everything today I have realized that there > is a bug. > > Suppose we have 3 processes: TRACER, TRACEE, KILLER. > > Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has > been dropped. > > The TRACER process has performed ptrace_attach on TRACEE and is in the > middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL. > > Then comes in the KILLER process and sends the TRACEE a SIGKILL. > The TRACEE __state remains TASK_TRACED, as designed. > > The bug appears when the TRACEE makes it to schedule(). Inside > schedule there is a call to signal_pending_state() which notices > a SIGKILL is pending and refuses to sleep. And I think this is fine. This doesn't really differ from the case when the tracee was killed before it takes siglock. The only problem (afaics) is that, once we introduce JOBCTL_TRACED, ptrace_stop() can leak this flag. That is why I suggested clearing it along with LISTENING/DELAY_WAKEKILL before return, exactly because schedule() won't block if fatal_signal_pending() is true. But maybe I misunderstood your concern? Oleg.
Oleg Nesterov <oleg@redhat.com> writes: > On 04/27, Eric W. Biederman wrote: >> >> "Eric W. Biederman" <ebiederm@xmission.com> writes: >> >> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h >> > index 3c8b34876744..1947c85aa9d9 100644 >> > --- a/include/linux/sched/signal.h >> > +++ b/include/linux/sched/signal.h >> > @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state); >> > >> > static inline void signal_wake_up(struct task_struct *t, bool resume) >> > { >> > - signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0); >> > + bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL); >> > + signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0); >> > } >> > static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume) >> > { >> >> Grrr. While looking through everything today I have realized that there >> is a bug. >> >> Suppose we have 3 processes: TRACER, TRACEE, KILLER. >> >> Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has >> been dropped. >> >> The TRACER process has performed ptrace_attach on TRACEE and is in the >> middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL. >> >> Then comes in the KILLER process and sends the TRACEE a SIGKILL. >> The TRACEE __state remains TASK_TRACED, as designed. >> >> The bug appears when the TRACEE makes it to schedule(). Inside >> schedule there is a call to signal_pending_state() which notices >> a SIGKILL is pending and refuses to sleep. > > And I think this is fine. This doesn't really differ from the case > when the tracee was killed before it takes siglock. Hmm. Maybe. > The only problem (afaics) is that, once we introduce JOBCTL_TRACED, > ptrace_stop() can leak this flag. That is why I suggested to clear > it along with LISTENING/DELAY_WAKEKILL before return, exactly because > schedule() won't block if fatal_signal_pending() is true. > > But may be I misunderstood you concern? 
Prior to JOBCTL_DELAY_WAKEKILL, once __state was set to __TASK_TRACED we were guaranteed that schedule() would stop if a SIGKILL was received after that point, as well as being immune from wake-ups from SIGKILL. I guess we are immune from wake-ups with JOBCTL_DELAY_WAKEKILL as I have implemented it. The practical concern then seems to be that we are not guaranteed wait_task_inactive will succeed. Which means that it must continue to include the TASK_TRACED bit. Previously in ptrace_check_attach we were actually guaranteed, once ptrace_freeze_traced succeeded, that no fatal signal was pending, as any pending fatal signal would cause ptrace_freeze_traced to fail. Any incoming fatal signal would not stop schedule from sleeping. The ptraced task would continue to be ptraced, as all other ptrace operations are blocked by virtue of ptrace being single threaded. I think in my tired mind yesterday I thought it would mess things up after schedule decided to sleep. Still I would like to be able to let wait_task_inactive not care about the state of the process it is going to sleep for. Eric
On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:

> Hmm.  If we have the following process tree.
>
>     A
>      \
>       B
>        \
>         C
>
> Process A, B, and C are all in the same process group.
> Process A and B are setup to receive SIGCHLD when
> their process stops.
>
> Process C traces process A.
>
> When a sigstop is delivered to the group we can have:
>
> Process B takes siglock(B) siglock(A) to notify the real_parent
> Process C takes siglock(C) siglock(B) to notify the real_parent
> Process A takes siglock(A) siglock(C) to notify the tracer
>
> If they all take their local lock at the same time there is
> a deadlock.
>
> I don't think the restriction that you can never ptrace anyone
> up the process tree is going to fly.  So it looks like I am back to the
> drawing board for this one.

I've not had time to fully appreciate the nested locking here, but if it
is possible to rework things to always take both locks at the same time,
then it would be possible to impose an arbitrary lock order on things
and break the cycle that way.

That is, simply order the locks by their heap address or something:

static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
{
	if (sh1 > sh2)
		swap(sh1, sh2);

	spin_lock_irq(&sh1->siglock);
	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
}
On 04/28, Peter Zijlstra wrote:
>
> I've not had time to fully appreciate the nested locking here, but if it
> is possible to rework things to always take both locks at the same time,
> then it would be possible to impose an arbitrary lock order on things
> and break the cycle that way.

This is clear, but this is not that simple. For example (with this
series at least), ptrace_stop() already holds current->sighand->siglock
which (in particular) we need to protect current->parent, but then we
need current->parent->sighand->siglock in do_notify_parent_cldstop().

Oleg.
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:
>
>> Hmm.  If we have the following process tree.
>>
>>     A
>>      \
>>       B
>>        \
>>         C
>>
>> Process A, B, and C are all in the same process group.
>> Process A and B are setup to receive SIGCHLD when
>> their process stops.
>>
>> Process C traces process A.
>>
>> When a sigstop is delivered to the group we can have:
>>
>> Process B takes siglock(B) siglock(A) to notify the real_parent
>> Process C takes siglock(C) siglock(B) to notify the real_parent
>> Process A takes siglock(A) siglock(C) to notify the tracer
>>
>> If they all take their local lock at the same time there is
>> a deadlock.
>>
>> I don't think the restriction that you can never ptrace anyone
>> up the process tree is going to fly.  So it looks like I am back to the
>> drawing board for this one.
>
> I've not had time to fully appreciate the nested locking here, but if it
> is possible to rework things to always take both locks at the same time,
> then it would be possible to impose an arbitrary lock order on things
> and break the cycle that way.
>
> That is, simply order the locks by their heap address or something:
>
> static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
> {
> 	if (sh1 > sh2)
> 		swap(sh1, sh2);
>
> 	spin_lock_irq(&sh1->siglock);
> 	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
> }

You know it might be.  Especially given that the existing code is
already dropping siglock and grabbing tasklist_lock.

It would take a potentially triple lock function to lock
the task, its real_parent, and its tracer (aka parent).

What makes this possible to consider is that notifying the ``parents''
is a fundamental part of the operation, so we know we are going to
need the lock so we can move it up.

Throw in a pinch of lock_task_sighand and the triple lock function
gets quite interesting.

It is certainly worth trying, and I will.

Eric
On 04/28, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
>>> The bug appears when the TRACEE makes it to schedule().  Inside
>>> schedule there is a call to signal_pending_state() which notices
>>> a SIGKILL is pending and refuses to sleep.
>>
>> And I think this is fine. This doesn't really differ from the case
>> when the tracee was killed before it takes siglock.
>
> Hmm.  Maybe.

I hope ;)

> Previously we were actually guaranteed in ptrace_check_attach that
> ptrace_freeze_traced would succeed, as any pending fatal signal would
> cause ptrace_freeze_traced to fail.  Any incoming fatal signal would not
> stop schedule from sleeping.

Yes. So let me repeat, 7/9 "ptrace: Simplify the wait_task_inactive call
in ptrace_check_attach" looks good to me (except it should use
wait_task_inactive(__TASK_TRACED)), but it should come before other
meaningful changes and the changelog should be updated.

And then we will probably need to reconsider this wait_task_inactive()
and the WARN_ON() around it, but that depends on what we will finally do.

> I think in my tired mind yesterday

I got lost too ;)

> Still I would like to be able to
> let wait_task_inactive not care about the state of the process it is
> going to sleep for.

Not sure... but to be honest I didn't really pay attention to the
wait_task_inactive(match_state => 0) part...

Oleg.
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:
>>
>>> Hmm.  If we have the following process tree.
>>>
>>>     A
>>>      \
>>>       B
>>>        \
>>>         C
>>>
>>> Process A, B, and C are all in the same process group.
>>> Process A and B are setup to receive SIGCHLD when
>>> their process stops.
>>>
>>> Process C traces process A.
>>>
>>> When a sigstop is delivered to the group we can have:
>>>
>>> Process B takes siglock(B) siglock(A) to notify the real_parent
>>> Process C takes siglock(C) siglock(B) to notify the real_parent
>>> Process A takes siglock(A) siglock(C) to notify the tracer
>>>
>>> If they all take their local lock at the same time there is
>>> a deadlock.
>>>
>>> I don't think the restriction that you can never ptrace anyone
>>> up the process tree is going to fly.  So it looks like I am back to the
>>> drawing board for this one.
>>
>> I've not had time to fully appreciate the nested locking here, but if it
>> is possible to rework things to always take both locks at the same time,
>> then it would be possible to impose an arbitrary lock order on things
>> and break the cycle that way.
>>
>> That is, simply order the locks by their heap address or something:
>>
>> static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
>> {
>> 	if (sh1 > sh2)
>> 		swap(sh1, sh2);
>>
>> 	spin_lock_irq(&sh1->siglock);
>> 	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
>> }
>
> You know it might be.  Especially given that the existing code is
> already dropping siglock and grabbing tasklist_lock.
>
> It would take a potentially triple lock function to lock
> the task, its real_parent, and its tracer (aka parent).
>
> What makes this possible to consider is that notifying the ``parents''
> is a fundamental part of the operation, so we know we are going to
> need the lock so we can move it up.
>
> Throw in a pinch of lock_task_sighand and the triple lock function
> gets quite interesting.
>
> It is certainly worth trying, and I will.

To my surprise it doesn't look too bad.  The locking simplifications and
not using a lock as big as tasklist_lock probably make it even worth
doing.

I need to sleep on it and look at everything again.  In the meantime
here is my function that comes in with siglock held, possibly drops it,
and grabs the other two locks all in order.

static void lock_parents_siglocks(bool lock_tracer)
	__releases(&current->sighand->siglock)
	__acquires(&current->sighand->siglock)
	__acquires(&current->real_parent->sighand->siglock)
	__acquires(&current->parent->sighand->siglock)
{
	struct task_struct *me = current;
	struct sighand_struct *m_sighand = me->sighand;

	lockdep_assert_held(&m_sighand->siglock);

	rcu_read_lock();
	for (;;) {
		struct task_struct *parent, *tracer;
		struct sighand_struct *p_sighand, *t_sighand, *s1, *s2, *s3;

		parent = me->real_parent;
		tracer = lock_tracer ? me->parent : parent;

		p_sighand = rcu_dereference(parent->sighand);
		t_sighand = rcu_dereference(tracer->sighand);

		/* Sort the sighands so that s1 <= s2 <= s3 */
		s1 = m_sighand;
		s2 = p_sighand;
		s3 = t_sighand;
		if (s1 > s2)
			swap(s1, s2);
		if (s1 > s3)
			swap(s1, s3);
		if (s2 > s3)
			swap(s2, s3);

		if (s1 != m_sighand) {
			spin_unlock(&m_sighand->siglock);
			spin_lock(&s1->siglock);
		}

		if (s1 != s2)
			spin_lock_nested(&s2->siglock, SIGLOCK_LOCK_SECOND);
		if (s2 != s3)
			spin_lock_nested(&s3->siglock, SIGLOCK_LOCK_THIRD);

		if (likely((me->real_parent == parent) &&
			   (me->parent == tracer) &&
			   (parent->sighand == p_sighand) &&
			   (tracer->sighand == t_sighand))) {
			break;
		}
		spin_unlock(&p_sighand->siglock);
		if (t_sighand != p_sighand)
			spin_unlock(&t_sighand->siglock);
		continue;
	}
	rcu_read_unlock();
}

Eric
On Thu, Apr 28, 2022 at 03:49:11PM -0500, Eric W. Biederman wrote:
> static void lock_parents_siglocks(bool lock_tracer)
> 	__releases(&current->sighand->siglock)
> 	__acquires(&current->sighand->siglock)
> 	__acquires(&current->real_parent->sighand->siglock)
> 	__acquires(&current->parent->sighand->siglock)
> {
> 	struct task_struct *me = current;
> 	struct sighand_struct *m_sighand = me->sighand;
>
> 	lockdep_assert_held(&m_sighand->siglock);
>
> 	rcu_read_lock();
> 	for (;;) {
> 		struct task_struct *parent, *tracer;
> 		struct sighand_struct *p_sighand, *t_sighand, *s1, *s2, *s3;
>
> 		parent = me->real_parent;
> 		tracer = lock_tracer ? me->parent : parent;
>
> 		p_sighand = rcu_dereference(parent->sighand);
> 		t_sighand = rcu_dereference(tracer->sighand);
>
> 		/* Sort the sighands so that s1 <= s2 <= s3 */
> 		s1 = m_sighand;
> 		s2 = p_sighand;
> 		s3 = t_sighand;
> 		if (s1 > s2)
> 			swap(s1, s2);
> 		if (s1 > s3)
> 			swap(s1, s3);
> 		if (s2 > s3)
> 			swap(s2, s3);
>
> 		if (s1 != m_sighand) {
> 			spin_unlock(&m_sighand->siglock);
> 			spin_lock(&s1->siglock);
> 		}
>
> 		if (s1 != s2)
> 			spin_lock_nested(&s2->siglock, SIGLOCK_LOCK_SECOND);
> 		if (s2 != s3)
> 			spin_lock_nested(&s3->siglock, SIGLOCK_LOCK_THIRD);
>

Might as well just use 1 and 2 for subclass at this point, or use
SIGLOCK_LOCK_FIRST below.

> 		if (likely((me->real_parent == parent) &&
> 			   (me->parent == tracer) &&
> 			   (parent->sighand == p_sighand) &&
> 			   (tracer->sighand == t_sighand))) {
> 			break;
> 		}
> 		spin_unlock(&p_sighand->siglock);
> 		if (t_sighand != p_sighand)
> 			spin_unlock(&t_sighand->siglock);

Indent fail above ^, also you likely need this:

	/*
	 * Since [pt]_sighand will likely change if we go
	 * around, and m_sighand is the only one held, make sure
	 * it is subclass-0, since the above 's1 != m_sighand'
	 * clause very much relies on that.
	 */
	lock_set_subclass(&m_sighand->siglock, 0, _RET_IP_);

> 		continue;
> 	}
> 	rcu_read_unlock();
> }
>
> Eric