
[0/9] ptrace: cleaning up ptrace_stop

Message ID 878rrrh32q.fsf_-_@email.froward.int.ebiederm.org

Message

Eric W. Biederman April 26, 2022, 10:50 p.m. UTC
While looking at how ptrace is broken on PREEMPT_RT I realized
that ptrace_stop would be much simpler and more maintainable
if tsk->ptrace, tsk->parent, and tsk->real_parent were protected
by siglock.  Most of the changes are general cleanups in support
of this locking change.

While making the necessary changes to protect tsk->ptrace with
siglock I discovered we have two architectures, xtensa and um,
that were using tsk->ptrace for what most other architectures
use TIF_SINGLESTEP for, and not protecting tsk->ptrace with any lock.
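
The shape of the fix is the same on both architectures.  As a rough
sketch (not the literal patch text), the single-step bookkeeping moves
from an unlocked read-modify-write of tsk->ptrace to an atomic
thread_info flag:

/* Before (sketch): a non-atomic RMW on tsk->ptrace with no lock held,
 * which can race with any other update of tsk->ptrace.
 */
void user_enable_single_step(struct task_struct *child)
{
	child->ptrace |= PT_SINGLESTEP;
}

/* After (sketch): an atomic bitop on the thread_info flags, so this
 * update needs no lock at all and tsk->ptrace is left to siglock.
 */
void user_enable_single_step(struct task_struct *child)
{
	set_tsk_thread_flag(child, TIF_SINGLESTEP);
}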

By the end of this series ptrace should work on PREEMPT_RT with
CONFIG_FREEZER and CONFIG_CGROUPS disabled, by the simple fact that the
ptrace_stop code becomes less special.  The function cgroup_enter_frozen
definitely remains a problem, because it takes a lock that is a sleeping
lock on PREEMPT_RT while preemption is disabled.  Peter Zijlstra has
been rewriting the classic freezer, and from earlier parts of this
discussion I presume it is also a problem for PREEMPT_RT.

Peter's series rewriting the freezer[1] should work on top of this
series with minimal changes and patch 2/5 removed.

Eric W. Biederman (9):
      signal: Rename send_signal send_signal_locked
      signal: Replace __group_send_sig_info with send_signal_locked
      ptrace/um: Replace PT_DTRACE with TIF_SINGLESTEP
      ptrace/xtensa: Replace PT_SINGLESTEP with TIF_SINGLESTEP
      signal: Protect parent child relationships by childs siglock
      signal: Always call do_notify_parent_cldstop with siglock held
      ptrace: Simplify the wait_task_inactive call in ptrace_check_attach
      ptrace: Use siglock instead of tasklist_lock in ptrace_check_attach
      ptrace: Don't change __state

 arch/um/include/asm/thread_info.h |   2 +
 arch/um/kernel/exec.c             |   2 +-
 arch/um/kernel/process.c          |   2 +-
 arch/um/kernel/ptrace.c           |   8 +-
 arch/um/kernel/signal.c           |   4 +-
 arch/xtensa/kernel/ptrace.c       |   4 +-
 arch/xtensa/kernel/signal.c       |   4 +-
 drivers/tty/tty_jobctrl.c         |   4 +-
 include/linux/ptrace.h            |   7 --
 include/linux/sched/jobctl.h      |   2 +
 include/linux/sched/signal.h      |   3 +-
 include/linux/signal.h            |   3 +-
 kernel/exit.c                     |   4 +
 kernel/fork.c                     |  12 +--
 kernel/ptrace.c                   |  61 ++++++-------
 kernel/signal.c                   | 187 ++++++++++++++------------------------
 kernel/time/posix-cpu-timers.c    |   6 +-
 17 files changed, 131 insertions(+), 184 deletions(-)

[1] https://lkml.kernel.org/r/20220421150248.667412396@infradead.org

Eric

Comments

Max Filippov April 26, 2022, 11:33 p.m. UTC | #1
On Tue, Apr 26, 2022 at 3:52 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> xtensa is the last user of the PT_SINGLESTEP flag.  Changing tsk->ptrace in
> user_enable_single_step and user_disable_single_step without locking could
> potentially cause problems.
>
> So use a thread info flag instead of a flag in tsk->ptrace.  Use TIF_SINGLESTEP
> that xtensa already had defined but unused.
>
> Remove the definitions of PT_SINGLESTEP and PT_BLOCKSTEP as they have no more users.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  arch/xtensa/kernel/ptrace.c | 4 ++--
>  arch/xtensa/kernel/signal.c | 4 ++--
>  include/linux/ptrace.h      | 6 ------
>  3 files changed, 4 insertions(+), 10 deletions(-)

Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Sebastian Andrzej Siewior April 27, 2022, 6:40 a.m. UTC | #2
On 2022-04-26 17:52:07 [-0500], Eric W. Biederman wrote:
> The functions ptrace_stop and do_signal_stop have to drop siglock
> and grab tasklist_lock because the parent/child relationship
> is guarded by siglock and not siglock.

 "is guarded by tasklist_lock and not siglock." ?

> Simplify things by guarding the parent/child relationship
> with siglock.  For the most part this just requires a little bit
> of code motion.  In a couple of places more locking was needed.
> 
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Sebastian
Eric W. Biederman April 27, 2022, 1:35 p.m. UTC | #3
Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> On 2022-04-26 17:52:07 [-0500], Eric W. Biederman wrote:
>> The functions ptrace_stop and do_signal_stop have to drop siglock
>> and grab tasklist_lock because the parent/child relationship
>> is guarded by siglock and not siglock.
>
>  "is guarded by tasklist_lock and not siglock." ?

Yes.   Thank you.  I will fix that.

>> Simplify things by guarding the parent/child relationship
>> with siglock.  For the most part this just requires a little bit
>> of code motion.  In a couple of places more locking was needed.
>> 
>> 
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Sebastian

Eric
Oleg Nesterov April 27, 2022, 2:10 p.m. UTC | #4
On 04/26, Eric W. Biederman wrote:
>
> @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>   	}
>
>  	sighand = parent->sighand;
> -	spin_lock_irqsave(&sighand->siglock, flags);
> +	lock = tsk->sighand != sighand;
> +	if (lock)
> +		spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING);

But why is it safe?

Suppose we have two tasks, they both trace each other, both call
ptrace_stop() at the same time. Of course this is ugly, they both
will block.

But with this patch in this case we have the trivial ABBA deadlock,
no?

Oleg.
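
Spelled out, the ABBA interleaving Oleg describes looks like this
(illustrative only; T1 and T2 trace each other):

/*
 *  T1: ptrace_stop()                  T2: ptrace_stop()
 *  spin_lock(T1->sighand->siglock)    spin_lock(T2->sighand->siglock)
 *  do_notify_parent_cldstop()         do_notify_parent_cldstop()
 *    spin_lock_nested(                  spin_lock_nested(
 *        T2->sighand->siglock)              T1->sighand->siglock)
 *    ... spins on T2's lock ...         ... spins on T1's lock ...
 *
 * Each task holds the lock the other wants; spin_lock_nested() only
 * placates lockdep, it does not prevent the deadlock.
 */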
Eric W. Biederman April 27, 2022, 2:20 p.m. UTC | #5
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/26, Eric W. Biederman wrote:
>>
>> @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>>   	}
>>
>>  	sighand = parent->sighand;
>> -	spin_lock_irqsave(&sighand->siglock, flags);
>> +	lock = tsk->sighand != sighand;
>> +	if (lock)
>> +		spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING);
>
> But why is it safe?
>
> Suppose we have two tasks, they both trace each other, both call
> ptrace_stop() at the same time. Of course this is ugly, they both
> will block.
>
> But with this patch in this case we have the trivial ABBA deadlock,
> no?

I was thinking in terms of the process tree (which is fine).

The ptrace parental relationship definitely has the potential to be a
graph with cycles.  Which as you point out is not fine.


The result is very nice and I don't want to give it up.  I suspect
ptrace cycles are always a problem and can simply be
forbidden.  That is going to take some analysis and some additional
code in ptrace_attach.

I will go look at that.

Eric
Oleg Nesterov April 27, 2022, 2:43 p.m. UTC | #6
On 04/27, Eric W. Biederman wrote:
>
> The ptrace parental relationship definitely has the potential to be a
> graph with cycles.  Which as you point out is not fine.
>
> The result is very nice and I don't want to give it up.  I suspect
> something ptrace cycles are always a problem and can simply be
> forbidden.

OK, please consider another case.

We have a parent P and its child C. C traces P.

This is not that unusual, I don't think we can forbid this case.

P reports an event and calls do_notify_parent_cldstop().

C receives SIGSTOP and calls do_notify_parent_cldstop() too.

Oleg.
Eric W. Biederman April 27, 2022, 2:47 p.m. UTC | #7
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Oleg Nesterov <oleg@redhat.com> writes:
>
>> On 04/26, Eric W. Biederman wrote:
>>>
>>> @@ -2164,7 +2166,9 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>>>   	}
>>>
>>>  	sighand = parent->sighand;
>>> -	spin_lock_irqsave(&sighand->siglock, flags);
>>> +	lock = tsk->sighand != sighand;
>>> +	if (lock)
>>> +		spin_lock_nested(&sighand->siglock, SINGLE_DEPTH_NESTING);
>>
>> But why is it safe?
>>
>> Suppose we have two tasks, they both trace each other, both call
>> ptrace_stop() at the same time. Of course this is ugly, they both
>> will block.
>>
>> But with this patch in this case we have the trivial ABBA deadlock,
>> no?
>
> I was thinking in terms of the process tree (which is fine).
>
> The ptrace parental relationship definitely has the potential to be a
> graph with cycles.  Which as you point out is not fine.
>
>
> The result is very nice and I don't want to give it up.  I suspect
> ptrace cycles are always a problem and can simply be
> forbidden.  That is going to take some analysis and some additional
> code in ptrace_attach.
>
> I will go look at that.


Hmm.  If we have the following process tree.

    A
     \
      B
       \
        C

Process A, B, and C are all in the same process group.
Process A and B are setup to receive SIGCHILD when
their process stops.

Process C traces process A.

When a sigstop is delivered to the group we can have:

Process B takes siglock(B) siglock(A) to notify the real_parent
Process C takes siglock(C) siglock(B) to notify the real_parent
Process A takes siglock(A) siglock(C) to notify the tracer

If they all take their local lock at the same time there is
a deadlock.

I don't think the restriction that you can never ptrace anyone
up the process tree is going to fly.  So it looks like I am back to the
drawing board for this one.

Eric
Oleg Nesterov April 27, 2022, 2:56 p.m. UTC | #8
On 04/26, Eric W. Biederman wrote:
>
> @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>  		spin_lock_irq(&current->sighand->siglock);
>  	}
>  
> +	/* Don't stop if current is not ptraced */
> +	if (unlikely(!current->ptrace))
> +		return (clear_code) ? 0 : exit_code;
> +
> +	/*
> +	 * If @why is CLD_STOPPED, we're trapping to participate in a group
> +	 * stop.  Do the bookkeeping.  Note that if SIGCONT was delivered
> +	 * across siglock relocks since INTERRUPT was scheduled, PENDING
> +	 * could be clear now.  We act as if SIGCONT is received after
> +	 * TASK_TRACED is entered - ignore it.
> +	 */
> +	if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING))
> +		gstop_done = task_participate_group_stop(current);
> +
> +	/*
> +	 * Notify parents of the stop.
> +	 *
> +	 * While ptraced, there are two parents - the ptracer and
> +	 * the real_parent of the group_leader.  The ptracer should
> +	 * know about every stop while the real parent is only
> +	 * interested in the completion of group stop.  The states
> +	 * for the two don't interact with each other.  Notify
> +	 * separately unless they're gonna be duplicates.
> +	 */
> +	do_notify_parent_cldstop(current, true, why);
> +	if (gstop_done && ptrace_reparented(current))
> +		do_notify_parent_cldstop(current, false, why);

This doesn't look right either. The parent should be notified only after
we set __state = TASK_TRACED and ->exit_code.

Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop()
wakes it up, debugger calls wait_task_stopped() and then it will sleep
again, task_stopped_code() returns 0.

This can be probably fixed if you remove the lockless (fast path)
task_stopped_code() check in wait_task_stopped(), but this is not
nice performance-wise...

Oleg.
Oleg Nesterov April 27, 2022, 3 p.m. UTC | #9
On 04/27, Oleg Nesterov wrote:
>
> On 04/26, Eric W. Biederman wrote:
> >
> > @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
> >  		spin_lock_irq(&current->sighand->siglock);
> >  	}
> >
> > +	/* Don't stop if current is not ptraced */
> > +	if (unlikely(!current->ptrace))
> > +		return (clear_code) ? 0 : exit_code;
> > +
> > +	/*
> > +	 * If @why is CLD_STOPPED, we're trapping to participate in a group
> > +	 * stop.  Do the bookkeeping.  Note that if SIGCONT was delivered
> > +	 * across siglock relocks since INTERRUPT was scheduled, PENDING
> > +	 * could be clear now.  We act as if SIGCONT is received after
> > +	 * TASK_TRACED is entered - ignore it.
> > +	 */
> > +	if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING))
> > +		gstop_done = task_participate_group_stop(current);
> > +
> > +	/*
> > +	 * Notify parents of the stop.
> > +	 *
> > +	 * While ptraced, there are two parents - the ptracer and
> > +	 * the real_parent of the group_leader.  The ptracer should
> > +	 * know about every stop while the real parent is only
> > +	 * interested in the completion of group stop.  The states
> > +	 * for the two don't interact with each other.  Notify
> > +	 * separately unless they're gonna be duplicates.
> > +	 */
> > +	do_notify_parent_cldstop(current, true, why);
> > +	if (gstop_done && ptrace_reparented(current))
> > +		do_notify_parent_cldstop(current, false, why);
>
> This doesn't look right either. The parent should be notified only after
> we set __state = TASK_TRACED and ->exit_code.
>
> Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop()
> wakes it up, debugger calls wait_task_stopped() and then it will sleep
> again, task_stopped_code() returns 0.
>
> This can be probably fixed if you remove the lockless (fast path)
> task_stopped_code() check in wait_task_stopped(), but this is not
> nice performance-wise...

On the other hand, I don't understand why you moved the callsite
of do_notify_parent_cldstop() up... just don't do this?

Oleg.
Oleg Nesterov April 27, 2022, 3:41 p.m. UTC | #10
On 04/26, Eric W. Biederman wrote:
>
>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> +	if (!(READ_ONCE(task->jobctl) & JOBCTL_DELAY_WAKEKILL))
>  		return;
>
>  	WARN_ON(!task->ptrace || task->parent != current);
> @@ -213,11 +213,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * Recheck state under the lock to close this race.
>  	 */
>  	spin_lock_irq(&task->sighand->siglock);

Now that we do not check __state = __TASK_TRACED, we need lock_task_sighand().
The tracee can be already woken up by ptrace_resume(), but it is possible that
it didn't clear DELAY_WAKEKILL yet.

Now, before we take ->siglock, the tracee can exit and another thread can do
wait() and reap this task.

Also, I think the comment above should be updated. I agree, it makes sense to
re-check JOBCTL_DELAY_WAKEKILL under siglock just for clarity, but we no longer
need to do this to close the race; jobctl &= ~JOBCTL_DELAY_WAKEKILL and
wake_up_state() are safe even if JOBCTL_DELAY_WAKEKILL was already cleared.

> @@ -2307,6 +2307,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>
>  	/* LISTENING can be set only during STOP traps, clear it */
>  	current->jobctl &= ~JOBCTL_LISTENING;
> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

minor, but

	current->jobctl &= ~(JOBCTL_LISTENING | JOBCTL_DELAY_WAKEKILL);

looks better.

Oleg.
Oleg Nesterov April 27, 2022, 4:09 p.m. UTC | #11
On 04/26, Eric W. Biederman wrote:
>
> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	 */
>  	if (lock_task_sighand(child, &flags)) {
>  		if (child->ptrace && child->parent == current) {
> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +			WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);

This WARN_ON() doesn't look right.

It is possible that this child was traced by another task and PTRACE_DETACH'ed,
but it didn't clear DELAY_WAKEKILL.

If the new debugger attaches and calls ptrace() before the child takes siglock
ptrace_freeze_traced() will fail, but we can hit this WARN_ON().

Oleg.
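
As a timeline, a hypothetical interleaving of the scenario Oleg
describes:

/*
 *  old tracer               new tracer                 tracee
 *  PTRACE_DETACH
 *  (tracee is runnable,
 *   JOBCTL_DELAY_WAKEKILL
 *   still set)
 *                           ptrace_attach()
 *                           ptrace_check_attach()
 *                             sees DELAY_WAKEKILL
 *                             -> WARN_ON() fires
 *                                                      only now takes siglock
 *                                                      and clears DELAY_WAKEKILL
 */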
Eric W. Biederman April 27, 2022, 4:33 p.m. UTC | #12
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/26, Eric W. Biederman wrote:
>>
>> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	 */
>>  	if (lock_task_sighand(child, &flags)) {
>>  		if (child->ptrace && child->parent == current) {
>> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
>> +			WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>
> This WARN_ON() doesn't look right.
>
> It is possible that this child was traced by another task and PTRACE_DETACH'ed,
> but it didn't clear DELAY_WAKEKILL.

That would be a bug.  That would mean that PTRACE_DETACHED process can
not be SIGKILL'd.

> If the new debugger attaches and calls ptrace() before the child takes siglock
> ptrace_freeze_traced() will fail, but we can hit this WARN_ON().

Eric
Oleg Nesterov April 27, 2022, 5:18 p.m. UTC | #13
On 04/27, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 04/26, Eric W. Biederman wrote:
> >>
> >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> >>  	 */
> >>  	if (lock_task_sighand(child, &flags)) {
> >>  		if (child->ptrace && child->parent == current) {
> >> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> >> +			WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
> >
> > This WARN_ON() doesn't look right.
> >
> > It is possible that this child was traced by another task and PTRACE_DETACH'ed,
> > but it didn't clear DELAY_WAKEKILL.
>
> That would be a bug.  That would mean that PTRACE_DETACHED process can
> not be SIGKILL'd.

Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice
SIGKILL after that.

Oleg.

> > If the new debugger attaches and calls ptrace() before the child takes siglock
> > ptrace_freeze_traced() will fail, but we can hit this WARN_ON().
> 
> Eric
>
Oleg Nesterov April 27, 2022, 5:21 p.m. UTC | #14
On 04/27, Oleg Nesterov wrote:
>
> On 04/27, Eric W. Biederman wrote:
> >
> > Oleg Nesterov <oleg@redhat.com> writes:
> >
> > > On 04/26, Eric W. Biederman wrote:
> > >>
> > >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> > >>  	 */
> > >>  	if (lock_task_sighand(child, &flags)) {
> > >>  		if (child->ptrace && child->parent == current) {
> > >> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> > >> +			WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
> > >
> > > This WARN_ON() doesn't look right.
> > >
> > > It is possible that this child was traced by another task and PTRACE_DETACH'ed,
> > > but it didn't clear DELAY_WAKEKILL.
> >
> > That would be a bug.  That would mean that PTRACE_DETACHED process can
> > not be SIGKILL'd.
>
> Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice
> SIGKILL after that.

Not to mention that the tracee is TASK_RUNNING after PTRACE_DETACH wakes it
up, so the pending JOBCTL_DELAY_WAKEKILL simply has no effect.

Oleg.

> > > If the new debugger attaches and calls ptrace() before the child takes siglock
> > > ptrace_freeze_traced() will fail, but we can hit this WARN_ON().
> > 
> > Eric
> >
Eric W. Biederman April 27, 2022, 5:31 p.m. UTC | #15
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/27, Oleg Nesterov wrote:
>>
>> On 04/27, Eric W. Biederman wrote:
>> >
>> > Oleg Nesterov <oleg@redhat.com> writes:
>> >
>> > > On 04/26, Eric W. Biederman wrote:
>> > >>
>> > >> @@ -253,7 +252,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>> > >>  	 */
>> > >>  	if (lock_task_sighand(child, &flags)) {
>> > >>  		if (child->ptrace && child->parent == current) {
>> > >> -			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
>> > >> +			WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>> > >
>> > > This WARN_ON() doesn't look right.
>> > >
>> > > It is possible that this child was traced by another task and PTRACE_DETACH'ed,
>> > > but it didn't clear DELAY_WAKEKILL.
>> >
>> > That would be a bug.  That would mean that PTRACE_DETACHED process can
>> > not be SIGKILL'd.
>>
>> Why? The tracee will take siglock, clear JOBCTL_DELAY_WAKEKILL and notice
>> SIGKILL after that.
>
> Not to mention that the tracee is TASK_RUNNING after PTRACE_DETACH wakes it
> up, so the pending JOBCTL_DELAY_WAKEKILL simply has no effect.

Oh.  You are talking about the window between clearing the
traced state and when the tracee resumes executing and clears
JOBCTL_DELAY_WAKEKILL.

I thought you were thinking about JOBCTL_DELAY_WAKEKILL being leaked.

That requires both ptrace_attach and ptrace_check_attach for the new
tracer to happen before the tracee is scheduled to run.

I agree.  I think the WARN_ON could reasonably be moved a bit later,
but I don't know that the WARN_ON is important. I simply kept it because
it seemed to make sense.

Eric
Eric W. Biederman April 27, 2022, 9:52 p.m. UTC | #16
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/27, Oleg Nesterov wrote:
>>
>> On 04/26, Eric W. Biederman wrote:
>> >
>> > @@ -2209,6 +2213,34 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>> >  		spin_lock_irq(&current->sighand->siglock);
>> >  	}
>> >
>> > +	/* Don't stop if current is not ptraced */
>> > +	if (unlikely(!current->ptrace))
>> > +		return (clear_code) ? 0 : exit_code;
>> > +
>> > +	/*
>> > +	 * If @why is CLD_STOPPED, we're trapping to participate in a group
>> > +	 * stop.  Do the bookkeeping.  Note that if SIGCONT was delivered
>> > +	 * across siglock relocks since INTERRUPT was scheduled, PENDING
>> > +	 * could be clear now.  We act as if SIGCONT is received after
>> > +	 * TASK_TRACED is entered - ignore it.
>> > +	 */
>> > +	if (why == CLD_STOPPED && (current->jobctl & JOBCTL_STOP_PENDING))
>> > +		gstop_done = task_participate_group_stop(current);
>> > +
>> > +	/*
>> > +	 * Notify parents of the stop.
>> > +	 *
>> > +	 * While ptraced, there are two parents - the ptracer and
>> > +	 * the real_parent of the group_leader.  The ptracer should
>> > +	 * know about every stop while the real parent is only
>> > +	 * interested in the completion of group stop.  The states
>> > +	 * for the two don't interact with each other.  Notify
>> > +	 * separately unless they're gonna be duplicates.
>> > +	 */
>> > +	do_notify_parent_cldstop(current, true, why);
>> > +	if (gstop_done && ptrace_reparented(current))
>> > +		do_notify_parent_cldstop(current, false, why);
>>
>> This doesn't look right either. The parent should be notified only after
>> we set __state = TASK_TRACED and ->exit_code.
>>
>> Suppose that debugger sleeps in do_wait(). do_notify_parent_cldstop()
>> wakes it up, debugger calls wait_task_stopped() and then it will sleep
>> again, task_stopped_code() returns 0.
>>
>> This can be probably fixed if you remove the lockless (fast path)
>> task_stopped_code() check in wait_task_stopped(), but this is not
>> nice performance-wise...

Another detail I have overlooked.  Thank you.

Or we can change task_stopped_code to look something like:

static int *task_stopped_code(struct task_struct *p, bool ptrace)
{
	if (ptrace) {
-		if (task_is_traced(p) && !(p->jobctl & JOBCTL_LISTENING))
+		if (p->ptrace && !(p->jobctl & JOBCTL_LISTENING))
			return &p->exit_code;
	} else {
		if (p->signal->flags & SIGNAL_STOP_STOPPED)
			return &p->signal->group_exit_code;
	}
	return NULL;
}

I probably need to do a little bit more to ensure that it isn't an
actual process exit_code in p->exit_code.  But then we don't have to
limit ourselves to being precisely in the task_is_traced stopped place
for the fast path.


> On the other hand, I don't understand why you moved the callsite
> of do_notify_parent_cldstop() up... just don't do this?

My goal, and I still think it makes sense (if not my implementation),
is to move set_special_state as close as possible to schedule().

That way we can avoid sleeping spin_locks clobbering it and making
our life difficult.

My hope is we can just clean up ptrace_stop instead of making it more
complicated and harder to follow.  Not that I am fundamentally opposed
to the quiesce bit but the code is already very hard to follow because
of all its nuance and complexity, and I would really like to reduce
that complexity if we can possibly figure out how.

Eric
Eric W. Biederman April 27, 2022, 10:35 p.m. UTC | #17
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/26, Eric W. Biederman wrote:
>>
>>  static void ptrace_unfreeze_traced(struct task_struct *task)
>>  {
>> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
>> +	if (!(READ_ONCE(task->jobctl) & JOBCTL_DELAY_WAKEKILL))
>>  		return;
>>
>>  	WARN_ON(!task->ptrace || task->parent != current);
>> @@ -213,11 +213,10 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>>  	 * Recheck state under the lock to close this race.
>>  	 */
>>  	spin_lock_irq(&task->sighand->siglock);
>
> Now that we do not check __state = __TASK_TRACED, we need lock_task_sighand().
> The tracee can be already woken up by ptrace_resume(), but it is possible that
> it didn't clear DELAY_WAKEKILL yet.

Yes.  The subtle differences in when __TASK_TRACED and
JOBCTL_DELAY_WAKEKILL are cleared are causing me some minor issues.

This "WARN_ON(!task->ptrace || task->parent != current);" also now
needs to be inside siglock, because the __TASK_TRACED check is insufficient.


> Now, before we take ->siglock, the tracee can exit and another thread can do
> wait() and reap this task.
>
> Also, I think the comment above should be updated. I agree, it makes sense to
> re-check JOBCTL_DELAY_WAKEKILL under siglock just for clarity, but we no longer
> need to do this to close the race; jobctl &= ~JOBCTL_DELAY_WAKEKILL and
> wake_up_state() are safe even if JOBCTL_DELAY_WAKEKILL was already
> cleared.

I think you are right about it being safe, but I am having a hard time
convincing myself that is true.  I want to be very careful sending
__TASK_TRACED wake_ups as ptrace_stop fundamentally can't handle
spurious wake_ups.

So I think it is worth adding task_is_traced to the test to verify the
task is still frozen:

static void ptrace_unfreeze_traced(struct task_struct *task)
{
	unsigned long flags;

	/*
	 * Verify the task is still frozen before unfreezing it,
	 * ptrace_resume could have unfrozen us.
	 */
	if (lock_task_sighand(task, &flags)) {
		if ((task->jobctl & JOBCTL_DELAY_WAKEKILL) &&
		    task_is_traced(task)) {
			task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
			if (__fatal_signal_pending(task))
				wake_up_state(task, __TASK_TRACED);
		}
		unlock_task_sighand(task, &flags);
	}
}

>> @@ -2307,6 +2307,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
>>
>>  	/* LISTENING can be set only during STOP traps, clear it */
>>  	current->jobctl &= ~JOBCTL_LISTENING;
>> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>
> minor, but
>
> 	current->jobctl &= ~(JOBCTL_LISTENING | JOBCTL_DELAY_WAKEKILL);
>
> looks better.

Yes.


Eric
Eric W. Biederman April 27, 2022, 11:05 p.m. UTC | #18
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 3c8b34876744..1947c85aa9d9 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
>  
>  static inline void signal_wake_up(struct task_struct *t, bool resume)
>  {
> -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> +	bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
> +	signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
>  }
>  static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>  {

Grrr.  While looking through everything today I have realized that there
is a bug.

Suppose we have 3 processes: TRACER, TRACEE, KILLER.

Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has
been dropped.

The TRACER process has performed ptrace_attach on TRACEE and is in the
middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL.

Then comes in the KILLER process and sends the TRACEE a SIGKILL.
The TRACEE __state remains TASK_TRACED, as designed.

The bug appears when the TRACEE makes it to schedule().  Inside
schedule there is a call to signal_pending_state() which notices
a SIGKILL is pending and refuses to sleep.
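
For reference, the check in question looks roughly like this in kernels
of this era (paraphrased here; it is not part of the patch):

static inline int signal_pending_state(unsigned int state,
				       struct task_struct *p)
{
	if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
		return 0;
	if (!signal_pending(p))
		return 0;

	/* TASK_TRACED includes TASK_WAKEKILL, so a pending SIGKILL
	 * makes this nonzero and schedule() refuses to block.
	 */
	return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
}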

I could avoid setting TIF_SIGPENDING in signal_wake_up but that
is insufficient as another signal may be pending.

I could avoid marking the task as __fatal_signal_pending but then
where would the information that the task needs to become
__fatal_signal_pending go?

Hmm.

This looks like I need my other pending cleanup which introduces a
helper to get this idea to work.

Eric
Peter Zijlstra April 28, 2022, 10:07 a.m. UTC | #19
On Tue, Apr 26, 2022 at 05:50:21PM -0500, Eric W. Biederman wrote:
> ....  Peter Zijlstra has
> been rewriting the classic freezer and in earlier parts of this
> discussion so I presume it is also a problem for PREEMPT_RT.

Ah, the freezer thing is in fact a sched/arm64 issue; the common
element between the two is ptrace though.

Specifically, on recent arm64 chips only a subset of CPUs can execute
arm32 code and 32bit processes are restricted to that subset. If by some
mishap you try and execute a 32bit task on a non-capable CPU it gets
terminated without prejudice.

Now, the current freezer has this problem that tasks can spuriously thaw
too soon (where too soon is before SMP is restored) which leads to these
32bit tasks being killed dead.

That, and it was a good excuse to fix up the current freezer :-)
Peter Zijlstra April 28, 2022, 10:27 a.m. UTC | #20
On Tue, Apr 26, 2022 at 05:52:03PM -0500, Eric W. Biederman wrote:
> Rename send_signal to send_signal_locked to make
> it usable outside of signal.c.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  include/linux/signal.h |  2 ++
>  kernel/signal.c        | 24 ++++++++++++------------
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index a6db6f2ae113..55605bdf5ce9 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -283,6 +283,8 @@ extern int do_send_sig_info(int sig, struct kernel_siginfo *info,
>  extern int group_send_sig_info(int sig, struct kernel_siginfo *info,
>  			       struct task_struct *p, enum pid_type type);
>  extern int __group_send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
> +extern int send_signal_locked(int sig, struct kernel_siginfo *info,
> +			      struct task_struct *p, enum pid_type type);
>  extern int sigprocmask(int, sigset_t *, sigset_t *);
>  extern void set_current_blocked(sigset_t *);
>  extern void __set_current_blocked(const sigset_t *);
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 30cd1ca43bcd..b0403197b0ad 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1071,8 +1071,8 @@ static inline bool legacy_queue(struct sigpending *signals, int sig)
>  	return (sig < SIGRTMIN) && sigismember(&signals->signal, sig);
>  }
>  
> -static int __send_signal(int sig, struct kernel_siginfo *info, struct task_struct *t,
> -			enum pid_type type, bool force)
> +static int __send_signal_locked(int sig, struct kernel_siginfo *info,
> +				struct task_struct *t, enum pid_type type, bool force)
>  {
>  	struct sigpending *pending;
>  	struct sigqueue *q;

While there, could you please replace that assert_spin_locked() with
lockdep_assert_held(&t->sighand->siglock) ?

The distinction being that assert_spin_locked() checks if the lock is
held *by*anyone* whereas lockdep_assert_held() asserts the current
context holds the lock.  Also, the check goes away if you build without
lockdep.
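
A toy contrast of the two assertions (illustrative only, not from the
series):

static DEFINE_SPINLOCK(demo_lock);

static void demo(void)
{
	/* Passes if *anyone* holds demo_lock, even another CPU; it is
	 * a real runtime check on the lock word.
	 */
	assert_spin_locked(&demo_lock);

	/* Passes only if the current context holds demo_lock, and it
	 * compiles away entirely when lockdep is not configured.
	 */
	lockdep_assert_held(&demo_lock);
}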
Peter Zijlstra April 28, 2022, 10:38 a.m. UTC | #21
On Tue, Apr 26, 2022 at 05:52:08PM -0500, Eric W. Biederman wrote:
> Now that siglock keeps tsk->parent and tsk->real_parent constant,
> require that do_notify_parent_cldstop is called with tsk->siglock held
> instead of the tasklist_lock.
> 
> As all of the callers of do_notify_parent_cldstop had to drop the
> siglock and take tasklist_lock, this simplifies all of its callers.
> 
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/signal.c | 156 +++++++++++++++++-------------------------------
>  1 file changed, 55 insertions(+), 101 deletions(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 72d96614effc..584d67deb3cb 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2121,11 +2121,13 @@ static void do_notify_parent_cldstop(struct task_struct *tsk,
>  				     bool for_ptracer, int why)
>  {
>  	struct kernel_siginfo info;
> -	unsigned long flags;
>  	struct task_struct *parent;
>  	struct sighand_struct *sighand;
> +	bool lock;
>  	u64 utime, stime;
>  
> +	assert_spin_locked(&tsk->sighand->siglock);

lockdep_assert_held() please...
Oleg Nesterov April 28, 2022, 3:11 p.m. UTC | #22
On 04/27, Eric W. Biederman wrote:
>
> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>
> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> > index 3c8b34876744..1947c85aa9d9 100644
> > --- a/include/linux/sched/signal.h
> > +++ b/include/linux/sched/signal.h
> > @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
> >
> >  static inline void signal_wake_up(struct task_struct *t, bool resume)
> >  {
> > -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
> > +	bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
> > +	signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
> >  }
> >  static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
> >  {
>
> Grrr.  While looking through everything today I have realized that there
> is a bug.
>
> Suppose we have 3 processes: TRACER, TRACEE, KILLER.
>
> Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has
> been dropped.
>
> The TRACER process has performed ptrace_attach on TRACEE and is in the
> middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL.
>
> Then comes in the KILLER process and sends the TRACEE a SIGKILL.
> The TRACEE __state remains TASK_TRACED, as designed.
>
> The bug appears when the TRACEE makes it to schedule().  Inside
> schedule there is a call to signal_pending_state() which notices
> a SIGKILL is pending and refuses to sleep.

And I think this is fine. This doesn't really differ from the case
when the tracee was killed before it takes siglock.

The only problem (afaics) is that, once we introduce JOBCTL_TRACED,
ptrace_stop() can leak this flag. That is why I suggested to clear
it along with LISTENING/DELAY_WAKEKILL before return, exactly because
schedule() won't block if fatal_signal_pending() is true.

But maybe I misunderstood your concern?

Oleg.
Eric W. Biederman April 28, 2022, 4:50 p.m. UTC | #23
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/27, Eric W. Biederman wrote:
>>
>> "Eric W. Biederman" <ebiederm@xmission.com> writes:
>>
>> > diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> > index 3c8b34876744..1947c85aa9d9 100644
>> > --- a/include/linux/sched/signal.h
>> > +++ b/include/linux/sched/signal.h
>> > @@ -437,7 +437,8 @@ extern void signal_wake_up_state(struct task_struct *t, unsigned int state);
>> >
>> >  static inline void signal_wake_up(struct task_struct *t, bool resume)
>> >  {
>> > -	signal_wake_up_state(t, resume ? TASK_WAKEKILL : 0);
>> > +	bool wakekill = resume && !(t->jobctl & JOBCTL_DELAY_WAKEKILL);
>> > +	signal_wake_up_state(t, wakekill ? TASK_WAKEKILL : 0);
>> >  }
>> >  static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>> >  {
>>
>> Grrr.  While looking through everything today I have realized that there
>> is a bug.
>>
>> Suppose we have 3 processes: TRACER, TRACEE, KILLER.
>>
>> Meanwhile TRACEE is in the middle of ptrace_stop, just after siglock has
>> been dropped.
>>
>> The TRACER process has performed ptrace_attach on TRACEE and is in the
>> middle of a ptrace operation and has just set JOBCTL_DELAY_WAKEKILL.
>>
>> Then comes in the KILLER process and sends the TRACEE a SIGKILL.
>> The TRACEE __state remains TASK_TRACED, as designed.
>>
>> The bug appears when the TRACEE makes it to schedule().  Inside
>> schedule there is a call to signal_pending_state() which notices
>> a SIGKILL is pending and refuses to sleep.
>
> And I think this is fine. This doesn't really differ from the case
> when the tracee was killed before it takes siglock.

Hmm.  Maybe.

> The only problem (afaics) is that, once we introduce JOBCTL_TRACED,
> ptrace_stop() can leak this flag. That is why I suggested to clear
> it along with LISTENING/DELAY_WAKEKILL before return, exactly because
> schedule() won't block if fatal_signal_pending() is true.
>
> But may be I misunderstood you concern?

Prior to JOBCTL_DELAY_WAKEKILL, once __state was set to __TASK_TRACED
we were guaranteed that schedule() would sleep even if a SIGKILL was
received after that point, as well as being immune from wake-ups
from SIGKILL.

I guess we are immune from wake-ups with JOBCTL_DELAY_WAKEKILL as I have
implemented it.

The practical concern then seems to be that we are not guaranteed
wait_task_inactive will succeed.  Which means that it must continue
to include the TASK_TRACED bit.

Previously we were actually guaranteed in ptrace_check_attach that
after ptrace_freeze_traced succeeded the tracee stayed frozen: any
pending fatal signal would cause ptrace_freeze_traced to fail, and any
incoming fatal signal would not stop schedule from sleeping.  The
ptraced task would continue to be
ptraced, as all other ptrace operations are blocked by virtue of ptrace
being single threaded.

I think in my tired mind yesterday I thought it would mess things
up after schedule decided to sleep.  Still I would like to be able to
let wait_task_inactive not care about the state of the process it is
going to sleep for.

Eric
Peter Zijlstra April 28, 2022, 5:44 p.m. UTC | #24
On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:

> Hmm.  If we have the following process tree.
> 
>     A
>      \
>       B
>        \
>         C
> 
> Process A, B, and C are all in the same process group.
> Process A and B are setup to receive SIGCHILD when
> their process stops.
> 
> Process C traces process A.
> 
> When a sigstop is delivered to the group we can have:
> 
> Process B takes siglock(B) siglock(A) to notify the real_parent
> Process C takes siglock(C) siglock(B) to notify the real_parent
> Process A takes siglock(A) siglock(C) to notify the tracer
> 
> If they all take their local lock at the same time there is
> a deadlock.
> 
> I don't think the restriction that you can never ptrace anyone
> up the process tree is going to fly.  So it looks like I am back to the
> drawing board for this one.

I've not had time to fully appreciate the nested locking here, but if it
is possible to rework things to always take both locks at the same time,
then it would be possible to impose an arbitrary lock order on things
and break the cycle that way.

That is, simply order the locks by their heap address or something:

static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
{
	if (sh1 > sh2)
		swap(sh1, sh2);

	spin_lock_irq(&sh1->siglock);
	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
}
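
A hypothetical caller, to show the effect of ordering by address: every
path takes the same pair of locks in the same order, so concurrent
callers cannot form an ABBA cycle.  This assumes the two sighands are
distinct, and "parent" is just a stand-in variable for illustration:

	double_siglock_irq(current->sighand, parent->sighand);
	/* ... notify, update state under both locks ... */
	spin_unlock(&parent->sighand->siglock);
	spin_unlock_irq(&current->sighand->siglock);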
Oleg Nesterov April 28, 2022, 6:22 p.m. UTC | #25
On 04/28, Peter Zijlstra wrote:
>
> I've not had time to fully appreciate the nested locking here, but if it
> is possible to rework things to always take both locks at the same time,
> then it would be possible to impose an arbitrary lock order on things
> and break the cycle that way.

This is clear, but this is not that simple.

For example (with this series at least), ptrace_stop() already holds
current->sighand->siglock which (in particular) we need to protect
current->parent, but then we need current->parent->sighand->siglock
in do_notify_parent_cldstop().

Oleg.
Eric W. Biederman April 28, 2022, 6:37 p.m. UTC | #26
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:
>
>> Hmm.  If we have the following process tree.
>> 
>>     A
>>      \
>>       B
>>        \
>>         C
>> 
>> Process A, B, and C are all in the same process group.
>> Process A and B are setup to receive SIGCHILD when
>> their process stops.
>> 
>> Process C traces process A.
>> 
>> When a sigstop is delivered to the group we can have:
>> 
>> Process B takes siglock(B) siglock(A) to notify the real_parent
>> Process C takes siglock(C) siglock(B) to notify the real_parent
>> Process A takes siglock(A) siglock(C) to notify the tracer
>> 
>> If they all take their local lock at the same time there is
>> a deadlock.
>> 
>> I don't think the restriction that you can never ptrace anyone
>> up the process tree is going to fly.  So it looks like I am back to the
>> drawing board for this one.
>
> I've not had time to fully appreciate the nested locking here, but if it
> is possible to rework things to always take both locks at the same time,
> then it would be possible to impose an arbitrary lock order on things
> and break the cycle that way.
>
> That is, simply order the locks by their heap address or something:
>
> static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
> {
> 	if (sh1 > sh2)
> 		swap(sh1, sh2);
>
> 	spin_lock_irq(&sh1->siglock);
> 	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
> }

You know it might be.  Especially given that the existing code is
already dropping siglock and grabbing tasklist_lock.

It would take a potentially triple lock function to lock
the task, its real_parent, and its tracer (aka parent).

What makes this possible to consider is that notifying the ``parents''
is a fundamental part of the operation so we know we are going to
need the lock so we can move it up.

Throw in a pinch of lock_task_sighand and the triple lock function
gets quite interesting.

It is certainly worth trying, and I will.

Eric
Oleg Nesterov April 28, 2022, 6:53 p.m. UTC | #27
On 04/28, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> >> The bug appears when the TRACEE makes it to schedule().  Inside
> >> schedule there is a call to signal_pending_state() which notices
> >> a SIGKILL is pending and refuses to sleep.
> >
> > And I think this is fine. This doesn't really differ from the case
> > when the tracee was killed before it takes siglock.
>
> Hmm.  Maybe.

I hope ;)

> Previously we were actually guaranteed in ptrace_check_attach that after
> ptrace_freeze_traced would succeed as any pending fatal signal would
> cause ptrace_freeze_traced to fail.  Any incoming fatal signal would not
> stop schedule from sleeping.

Yes.

So let me repeat, 7/9 "ptrace: Simplify the wait_task_inactive call in
ptrace_check_attach" looks good to me (except it should use
wait_task_inactive(__TASK_TRACED)), but it should come before other
meaningful changes and the changelog should be updated.

And then we will probably need to reconsider this wait_task_inactive()
and WARN_ON() around it, but that depends on what we finally do.

> I think in my tired mind yesterday

I got lost too ;)

> Still I would like to be able to
> let wait_task_inactive not care about the state of the process it is
> going to sleep for.

Not sure... but to be honest I didn't really pay attention to the
wait_task_inactive(match_state => 0) part...

Oleg.
Eric W. Biederman April 28, 2022, 8:49 p.m. UTC | #28
"Eric W. Biederman" <ebiederm@xmission.com> writes:

> Peter Zijlstra <peterz@infradead.org> writes:
>
>> On Wed, Apr 27, 2022 at 09:47:10AM -0500, Eric W. Biederman wrote:
>>
>>> Hmm.  If we have the following process tree.
>>> 
>>>     A
>>>      \
>>>       B
>>>        \
>>>         C
>>> 
>>> Process A, B, and C are all in the same process group.
>>> Process A and B are setup to receive SIGCHILD when
>>> their process stops.
>>> 
>>> Process C traces process A.
>>> 
>>> When a sigstop is delivered to the group we can have:
>>> 
>>> Process B takes siglock(B) siglock(A) to notify the real_parent
>>> Process C takes siglock(C) siglock(B) to notify the real_parent
>>> Process A takes siglock(A) siglock(C) to notify the tracer
>>> 
>>> If they all take their local lock at the same time there is
>>> a deadlock.
>>> 
>>> I don't think the restriction that you can never ptrace anyone
>>> up the process tree is going to fly.  So it looks like I am back to the
>>> drawing board for this one.
>>
>> I've not had time to fully appreciate the nested locking here, but if it
>> is possible to rework things to always take both locks at the same time,
>> then it would be possible to impose an arbitrary lock order on things
>> and break the cycle that way.
>>
>> That is, simply order the locks by their heap address or something:
>>
>> static void double_siglock_irq(struct sighand *sh1, struct sighand *sh2)
>> {
>> 	if (sh1 > sh2)
>> 		swap(sh1, sh2);
>>
>> 	spin_lock_irq(&sh1->siglock);
>> 	spin_lock_nested(&sh2->siglock, SINGLE_DEPTH_NESTING);
>> }
>
> You know it might be.  Especially given that the existing code is
> already dropping siglock and grabbing tasklist_lock.
>
> It would take a potentially triple lock function to lock
> the task, its real_parent, and its tracer (aka parent).
>
> What makes this possible to consider is that notifying the ``parents''
> is a fundamental part of the operation so we know we are going to
> need the lock so we can move it up.
>
> Throw in a pinch of lock_task_sighand and the triple lock function
> gets quite interesting.
>
> It is certainly worth trying, and I will.

To my surprise it doesn't look too bad.  The locking simplifications and
not using a lock as big as tasklist_lock probably make it even worth
doing.

I need to sleep on it and look at everything again.  In the
meantime here is my function that comes in with siglock held,
possibly drops it, and grabs the other two locks all in
order.

static void lock_parents_siglocks(bool lock_tracer)
	__releases(&current->sighand->siglock)
	__acquires(&current->sighand->siglock)
	__acquires(&current->real_parent->sighand->siglock)
	__acquires(&current->parent->sighand->siglock)
{
	struct task_struct *me = current;
	struct sighand_struct *m_sighand = me->sighand;

	lockdep_assert_held(&m_sighand->siglock);

	rcu_read_lock();
	for (;;) {
		struct task_struct *parent, *tracer;
		struct sighand_struct *p_sighand, *t_sighand, *s1, *s2, *s3;

		parent = me->real_parent;
		tracer = lock_tracer? me->parent : parent;

		p_sighand = rcu_dereference(parent->sighand);
		t_sighand = rcu_dereference(tracer->sighand);

		/* Sort the sighands so that s1 <= s2 <= s3 */
		s1 = m_sighand;
		s2 = p_sighand;
		s3 = t_sighand;
		if (s1 > s2)
			swap(s1, s2);
		if (s1 > s3)
			swap(s1, s3);
		if (s2 > s3)
			swap(s2, s3);

		if (s1 != m_sighand) {
			spin_unlock(&m_sighand->siglock);
			spin_lock(&s1->siglock);
		}

		if (s1 != s2)
			spin_lock_nested(&s2->siglock, SIGLOCK_LOCK_SECOND);
		if (s2 != s3)
			spin_lock_nested(&s3->siglock, SIGLOCK_LOCK_THIRD);

		if (likely((me->real_parent == parent) &&
			   (me->parent == tracer) &&
			   (parent->sighand == p_sighand) &&
			   (tracer->sighand == t_sighand))) {
			break;
		}
		spin_unlock(&p_sighand->siglock);
                if (t_sighand != p_sighand)
			spin_unlock(&t_sighand->siglock);
		continue;
	}
	rcu_read_unlock();
}

Eric
Peter Zijlstra April 28, 2022, 10:19 p.m. UTC | #29
On Thu, Apr 28, 2022 at 03:49:11PM -0500, Eric W. Biederman wrote:

> static void lock_parents_siglocks(bool lock_tracer)
> 	__releases(&current->sighand->siglock)
> 	__acquires(&current->sighand->siglock)
> 	__acquires(&current->real_parent->sighand->siglock)
> 	__acquires(&current->parent->sighand->siglock)
> {
> 	struct task_struct *me = current;
> 	struct sighand_struct *m_sighand = me->sighand;
> 
> 	lockdep_assert_held(&m_sighand->siglock);
> 
> 	rcu_read_lock();
> 	for (;;) {
> 		struct task_struct *parent, *tracer;
> 		struct sighand_struct *p_sighand, *t_sighand, *s1, *s2, *s3;
> 
> 		parent = me->real_parent;
> 		tracer = lock_tracer? me->parent : parent;
> 
> 		p_sighand = rcu_dereference(parent->sighand);
> 		t_sighand = rcu_dereference(tracer->sighand);
> 
> 		/* Sort the sighands so that s1 <= s2 <= s3 */
> 		s1 = m_sighand;
> 		s2 = p_sighand;
> 		s3 = t_sighand;
> 		if (s1 > s2)
> 			swap(s1, s2);
> 		if (s1 > s3)
> 			swap(s1, s3);
> 		if (s2 > s3)
> 			swap(s2, s3);
> 
> 		if (s1 != m_sighand) {
> 			spin_unlock(&m_sighand->siglock);
> 			spin_lock(&s1->siglock);
> 		}
> 
> 		if (s1 != s2)
> 			spin_lock_nested(&s2->siglock, SIGLOCK_LOCK_SECOND);
> 		if (s2 != s3)
> 			spin_lock_nested(&s3->siglock, SIGLOCK_LOCK_THIRD);
> 

Might as well just use 1 and 2 for subclass at this point, or use
SIGLOCK_LOCK_FIRST below.

> 		if (likely((me->real_parent == parent) &&
> 			   (me->parent == tracer) &&
> 			   (parent->sighand == p_sighand) &&
> 			   (tracer->sighand == t_sighand))) {
> 			break;
> 		}
> 		spin_unlock(&p_sighand->siglock);
>                 if (t_sighand != p_sighand)
> 			spin_unlock(&t_sighand->siglock);

Indent fail above ^, also you likely need this:

		/*
		 * Since [pt]_sighand will likely change if we go
		 * around, and m_sighand is the only one held, make sure
		 * it is subclass-0, since the above 's1 != m_sighand'
		 * clause very much relies on that.
		 */
		lock_set_subclass(&m_sighand->siglock, 0, _RET_IP_);

> 		continue;
> 	}
> 	rcu_read_unlock();
> }
> 
> Eric