diff mbox series

[v2,2/5] sched,ptrace: Fix ptrace_check_attach() vs PREEMPT_RT

Message ID 20220421150654.817117821@infradead.org
State New
Headers show
Series ptrace-vs-PREEMPT_RT and freezer rewrite | expand

Commit Message

Peter Zijlstra April 21, 2022, 3:02 p.m. UTC
Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
task->__state as much.

Due to how PREEMPT_RT is changing the rules vs task->__state with the
introduction of task->saved_state while TASK_RTLOCK_WAIT (the whole
blocking spinlock thing), the way ptrace freeze tries to do things no
longer works.

Specifically there are two problems:

 - due to ->saved_state, the ->__state modification removing
   TASK_WAKEKILL no longer works reliably.

 - due to ->saved_state, wait_task_inactive() also no longer works
   reliably.

The first problem is solved by a suggestion from Eric that instead
of changing __state, TASK_WAKEKILL be delayed.

The second problem is solved by a suggestion from Oleg; add
JOBCTL_TRACED_QUIESCE to cover the chunk of code between
set_current_state(TASK_TRACED) and schedule(), such that
ptrace_check_attach() can first wait for JOBCTL_TRACED_QUIESCE to get
cleared, and then use wait_task_inactive().

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/jobctl.h |    8 ++-
 kernel/ptrace.c              |   90 ++++++++++++++++++++++---------------------
 kernel/sched/core.c          |    5 --
 kernel/signal.c              |   36 ++++++++++++++---
 4 files changed, 86 insertions(+), 53 deletions(-)

Comments

Oleg Nesterov April 25, 2022, 2:35 p.m. UTC | #1
On 04/21, Peter Zijlstra wrote:
>
> +static void clear_traced_quiesce(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));

This WARN_ON_ONCE() doesn't look right, the task can be killed right
after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
drops siglock.

> @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
>  		/*
>  		 * Don't want to allow preemption here, because
>  		 * sys_ptrace() needs this task to be inactive.
> -		 *
> -		 * XXX: implement read_unlock_no_resched().
>  		 */
>  		preempt_disable();
>  		read_unlock(&tasklist_lock);
> -		cgroup_enter_frozen();
> +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> +
> +		/*
> +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> +		 * set_current_state(TASK_TRACED) above and schedule() below.
> +		 * There must not be any blocking (specifically anything that
> +		 * touched ->saved_state on PREEMPT_RT) between here and
> +		 * schedule().
> +		 *
> +		 * ptrace_check_attach() relies on this with its
> +		 * wait_task_inactive() usage.
> +		 */
> +		clear_traced_quiesce();

Well, I think it should be called earlier under tasklist_lock,
before preempt_disable() above.

We need tasklist_lock to protect ->parent, debugger can be killed
and go away right after read_unlock(&tasklist_lock).

Still trying to convince myself everything is right with
JOBCTL_STOPPED/TRACED ...

Oleg.
Oleg Nesterov April 25, 2022, 5:47 p.m. UTC | #2
On 04/21, Peter Zijlstra wrote:
>
> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>  	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
>  	set_special_state(TASK_TRACED);

OK, this looks wrong. I actually mean the previous patch which sets
JOBCTL_TRACED.

The problem is that the tracee can be already killed, so that
fatal_signal_pending(current) is true. In this case we can't rely on
signal_wake_up_state() which should clear JOBCTL_TRACED, or the
callers of ptrace_signal_wake_up/etc which clear this flag by hand.

In this case schedule() won't block and ptrace_stop() will leak
JOBCTL_TRACED. Unless I missed something.

We could check fatal_signal_pending() and damn! this is what I think
ptrace_stop() should have done from the very beginning. But for now
I'd suggest to simply clear this flag before return, along with
DELAY_WAKEKILL and LISTENING.

>  	current->jobctl &= ~JOBCTL_LISTENING;
> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

	current->jobctl &=
		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

Oleg.
Peter Zijlstra April 25, 2022, 6:33 p.m. UTC | #3
On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > +static void clear_traced_quiesce(void)
> > +{
> > +	spin_lock_irq(&current->sighand->siglock);
> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
> 
> This WARN_ON_ONCE() doesn't look right, the task can be killed right
> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
> drops siglock.

OK, will look at that.

> > @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
> >  		/*
> >  		 * Don't want to allow preemption here, because
> >  		 * sys_ptrace() needs this task to be inactive.
> > -		 *
> > -		 * XXX: implement read_unlock_no_resched().
> >  		 */
> >  		preempt_disable();
> >  		read_unlock(&tasklist_lock);
> > -		cgroup_enter_frozen();
> > +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> > +
> > +		/*
> > +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> > +		 * set_current_state(TASK_TRACED) above and schedule() below.
> > +		 * There must not be any blocking (specifically anything that
> > +		 * touched ->saved_state on PREEMPT_RT) between here and
> > +		 * schedule().
> > +		 *
> > +		 * ptrace_check_attach() relies on this with its
> > +		 * wait_task_inactive() usage.
> > +		 */
> > +		clear_traced_quiesce();
> 
> Well, I think it should be called earlier under tasklist_lock,
> before preempt_disable() above.
> 
> We need tasklist_lock to protect ->parent, debugger can be killed
> and go away right after read_unlock(&tasklist_lock).
> 
> Still trying to convince myself everything is right with
> JOBCTL_STOPPED/TRACED ...

Can't do it earlier, since cgroup_enter_frozen() can do spinlock (eg.
use ->saved_state).
Eric W. Biederman April 26, 2022, 12:38 a.m. UTC | #4
Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
>> On 04/21, Peter Zijlstra wrote:
>> >
>> > +static void clear_traced_quiesce(void)
>> > +{
>> > +	spin_lock_irq(&current->sighand->siglock);
>> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
>> 
>> This WARN_ON_ONCE() doesn't look right, the task can be killed right
>> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
>> drops siglock.
>
> OK, will look at that.
>
>> > @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
>> >  		/*
>> >  		 * Don't want to allow preemption here, because
>> >  		 * sys_ptrace() needs this task to be inactive.
>> > -		 *
>> > -		 * XXX: implement read_unlock_no_resched().
>> >  		 */
>> >  		preempt_disable();
>> >  		read_unlock(&tasklist_lock);
>> > -		cgroup_enter_frozen();
>> > +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
>> > +
>> > +		/*
>> > +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
>> > +		 * set_current_state(TASK_TRACED) above and schedule() below.
>> > +		 * There must not be any blocking (specifically anything that
>> > +		 * touched ->saved_state on PREEMPT_RT) between here and
>> > +		 * schedule().
>> > +		 *
>> > +		 * ptrace_check_attach() relies on this with its
>> > +		 * wait_task_inactive() usage.
>> > +		 */
>> > +		clear_traced_quiesce();
>> 
>> Well, I think it should be called earlier under tasklist_lock,
>> before preempt_disable() above.
>> 
>> We need tasklist_lock to protect ->parent, debugger can be killed
>> and go away right after read_unlock(&tasklist_lock).
>> 
>> Still trying to convince myself everything is right with
>> JOBCTL_STOPPED/TRACED ...
>
> Can't do it earlier, since cgroup_enter_frozen() can do spinlock (eg.
> use ->saved_state).

There are some other issues in this part of ptrace_stop().


I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".


Currently in ptrace_check_attach a parameter of __TASK_TRACED is passed
so that wait_task_inactive cane fail if the "!current->ptrace" branch
of ptrace_stop is take and ptrace_stop does not stop.  With the
TASK_FROZEN state it appears that "!current->ptrace" branch can continue
and freeze somewhere else and wait_task_inactive could decided it was
fine.


I have to run, but hopefully tommorrow I will post the patches that
remove the "!current->ptrace" case altogether and basically
remove the need for quiesce and wait_task_inactive detecting
which branch is taken.

The spinlock in cgroup_enter_frozen remains an issue for PREEMPT_RT.
But the rest of the issues are cleared up by using siglock instead
of tasklist_lock.  Plus the code is just easier to read and understand.

Eric
Oleg Nesterov April 26, 2022, 5:51 a.m. UTC | #5
On 04/25, Eric W. Biederman wrote:
>
> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".

As Peter explained, in this case we can rely on __ptrace_unlink() which
should clear this flag.

Oleg.
Eric W. Biederman April 26, 2022, 5:19 p.m. UTC | #6
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/25, Eric W. Biederman wrote:
>>
>> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".
>
> As Peter explained, in this case we can rely on __ptrace_unlink() which
> should clear this flag.

I had missed that that signal_wake_up_state was clearing
JOBCTL_TRACED_QUIESCE.

Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
siglock is taken before calling ptrace_stop.  Especially with the
ptrace_notify in signal_delivered that does not look guaranteed.

The __ptrace_unlink could also happen during arch_ptrace_stop.

Relying on siglock is sufficient because __ptrace_unlink holds siglock
over clearing task->ptrace.  Which means that the simple fix for this is
to just test task->ptrace before we set JOBCTL_TRACED_QUEIESCE.

Eric
Oleg Nesterov April 26, 2022, 6:11 p.m. UTC | #7
On 04/26, Eric W. Biederman wrote:
>
> Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
> siglock is taken before calling ptrace_stop.  Especially with the
> ptrace_notify in signal_delivered that does not look guaranteed.
>
> The __ptrace_unlink could also happen during arch_ptrace_stop.
>
> Relying on siglock is sufficient because __ptrace_unlink holds siglock
> over clearing task->ptrace.  Which means that the simple fix for this is
> to just test task->ptrace before we set JOBCTL_TRACED_QUEIESCE.

Or simply clear _QUEIESCE along with _TRACED/DELAY_WAKEKILL before return?

Oleg.
Eric W. Biederman April 27, 2022, 12:24 a.m. UTC | #8
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
>>  	 * schedule() will not sleep if there is a pending signal that
>>  	 * can awaken the task.
>>  	 */
>> -	current->jobctl |= JOBCTL_TRACED;
>> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
>>  	set_special_state(TASK_TRACED);
>
> OK, this looks wrong. I actually mean the previous patch which sets
> JOBCTL_TRACED.
>
> The problem is that the tracee can be already killed, so that
> fatal_signal_pending(current) is true. In this case we can't rely on
> signal_wake_up_state() which should clear JOBCTL_TRACED, or the
> callers of ptrace_signal_wake_up/etc which clear this flag by hand.
>
> In this case schedule() won't block and ptrace_stop() will leak
> JOBCTL_TRACED. Unless I missed something.
>
> We could check fatal_signal_pending() and damn! this is what I think
> ptrace_stop() should have done from the very beginning. But for now
> I'd suggest to simply clear this flag before return, along with
> DELAY_WAKEKILL and LISTENING.

Oh.  That is an interesting case for JOBCTL_TRACED.  The
scheduler refuses to stop if signal_pending_state(TASK_TRACED, p)
returns true.

The ptrace_stop code used to handle this explicitly and in commit
7d613f9f72ec ("signal: Remove the bogus sigkill_pending in ptrace_stop")
I actually removed the test.  As the test was somewhat wrong and
redundant, and in slightly the wrong location.

But doing:

	/* Don't stop if the task is dying */
	if (unlikely(__fatal_signal_pending(current)))
		return exit_code;

Should work.

>
>>  	current->jobctl &= ~JOBCTL_LISTENING;
>> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>
> 	current->jobctl &=
> 		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);


I presume you meant:

	current->jobctl &=
 		~(JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

I don't think we want to do that.  For the case you are worried about it
is a valid fix.

In general this is the wrong approach as we want the waker to clear
JOBCTL_TRACED.  If the waker does not it is possible that
ptrace_freeze_traced might attempt to freeze a process whose state
is not appropriate for attach, because the code is past the call
to schedule().

In fact I think clearing JOBCTL_TRACED at the end of ptrace_stop
will allow ptrace_freeze_traced to come in while siglock is dropped,
expect the process to stop, and have the process not stop.  Of
course wait_task_inactive coming first that might not be a problem.



This is a minor problem with the patchset I just posted.  I thought the
only reason wait_task_inactive could fail was if ptrace_stop() hit the
!current->ptrace case.  Thinking about any it any SIGKILL coming in
before tracee stops in schedule will trigger this, so it is not as
safe as I thought to not pass a state into wait_task_inactive.

It is time for me to shut down today.  I will sleep on that and
see what I can see tomorrow.

Eric
Oleg Nesterov April 27, 2022, 3:53 p.m. UTC | #9
On 04/21, Peter Zijlstra wrote:
>
> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
>  		goto out_put_task_struct;
>  
>  	ret = arch_ptrace(child, request, addr, data);
> -	if (ret || request != PTRACE_DETACH)
> -		ptrace_unfreeze_traced(child);
> +	ptrace_unfreeze_traced(child);

Forgot to mention... whatever we do this doesn't look right.

ptrace_unfreeze_traced() must not be called if the tracee was untraced,
anothet debugger can come after that. I agree, the current code looks
a bit confusing, perhaps it makes sense to re-write it:

	if (request == PTRACE_DETACH && ret == 0)
		; /* nothing to do, no longer traced by us */
	else
		ptrace_unfreeze_traced(child);

Oleg.
Eric W. Biederman April 27, 2022, 9:57 p.m. UTC | #10
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
>>  		goto out_put_task_struct;
>>  
>>  	ret = arch_ptrace(child, request, addr, data);
>> -	if (ret || request != PTRACE_DETACH)
>> -		ptrace_unfreeze_traced(child);
>> +	ptrace_unfreeze_traced(child);
>
> Forgot to mention... whatever we do this doesn't look right.
>
> ptrace_unfreeze_traced() must not be called if the tracee was untraced,
> anothet debugger can come after that. I agree, the current code looks
> a bit confusing, perhaps it makes sense to re-write it:
>
> 	if (request == PTRACE_DETACH && ret == 0)
> 		; /* nothing to do, no longer traced by us */
> 	else
> 		ptrace_unfreeze_traced(child);

This was a bug in my original JOBCTL_DELAY_WAITKILL patch and it was
just cut and pasted here.  I thought it made sense when I was throwing
things together but when I looked more closely I realized that it is
not safe.

Eric
Peter Zijlstra April 28, 2022, 8:29 p.m. UTC | #11
On Tue, Apr 26, 2022 at 07:24:03PM -0500, Eric W. Biederman wrote:
> But doing:
> 
> 	/* Don't stop if the task is dying */
> 	if (unlikely(__fatal_signal_pending(current)))
> 		return exit_code;
> 
> Should work.

Something like so then...

---
Subject: signal,ptrace: Don't stop dying tasks
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Apr 28 22:17:56 CEST 2022

Oleg pointed out that the tracee can already be killed such that
fatal_signal_pending() is true. In that case signal_wake_up_state()
cannot be relied upon to be responsible for the wakeup -- something
we're going to want to rely on.

As such, explicitly handle this case.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/signal.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
+	/* Don't stop if the task is dying. */
+	if (unlikely(__fatal_signal_pending(current)))
+		return exit_code;
+
 	/*
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
Oleg Nesterov April 28, 2022, 8:59 p.m. UTC | #12
On 04/28, Peter Zijlstra wrote:
>
> Oleg pointed out that the tracee can already be killed such that
> fatal_signal_pending() is true. In that case signal_wake_up_state()
> cannot be relied upon to be responsible for the wakeup -- something
> we're going to want to rely on.

Peter, I am all confused...

If this patch is against the current tree, we don't need it.

If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
then it can't help - SIGKILL can come right after the tracee drops siglock
and calls schedule().

Perhaps I missed something, but let me repeat the 3rd time: I'd suggest
to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL before
return to close this race.

Oleg.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
>  		spin_lock_irq(&current->sighand->siglock);
>  	}
>  
> +	/* Don't stop if the task is dying. */
> +	if (unlikely(__fatal_signal_pending(current)))
> +		return exit_code;
> +
>  	/*
>  	 * schedule() will not sleep if there is a pending signal that
>  	 * can awaken the task.
>
Peter Zijlstra April 28, 2022, 10:21 p.m. UTC | #13
On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> On 04/28, Peter Zijlstra wrote:
> >
> > Oleg pointed out that the tracee can already be killed such that
> > fatal_signal_pending() is true. In that case signal_wake_up_state()
> > cannot be relied upon to be responsible for the wakeup -- something
> > we're going to want to rely on.
> 
> Peter, I am all confused...
> 
> If this patch is against the current tree, we don't need it.
> 
> If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> then it can't help - SIGKILL can come right after the tracee drops siglock
> and calls schedule().

But by that time it will already have set TRACED and signal_wake_up()
wil clear it, no?

> Perhaps I missed something, but let me repeat the 3rd time: I'd suggest
> to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL before
> return to close this race.

I think Eric convinced me there was a problem with that, but I'll go
over it all again in the morning, perhaps I'll reach a different
conclusion :-)
Oleg Nesterov April 28, 2022, 10:50 p.m. UTC | #14
Peter, you know, it is very difficult to me to discuss the changes
in the 2 unfinished series and not loose the context ;) Plus I am
already sleeping. But I'll try to reply anyway.

On 04/29, Peter Zijlstra wrote:
>
> On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> > If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> > then it can't help - SIGKILL can come right after the tracee drops siglock
> > and calls schedule().
>
> But by that time it will already have set TRACED and signal_wake_up()
> wil clear it, no?

No. JOBCTL_DELAY_WAKEKILL is already set, this means that signal_wake_up()
will remove TASK_WAKEKILL from the "state" passed to signal_wake_up_state()
and this is fine and correct, this mean thats ttwu() won't change ->__state.

But this also mean that wake_up_state() will return false, and in this case

	signal_wake_up_state:

		if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
			t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);

won't clear these flags. And this is nice too.

But. fatal_signal_pending() is true! And once we change freeze_traced()
to not abuse p->__state, schedule() won't block because it will check
signal_pending_state(TASK_TRACED == TASK_WAKEKILL | __TASK_TRACED) and
__fatal_signal_pending() == T.

In this case ptrace_stop() will leak JOBCTL_TRACED, so we simply need
to clear it before return along with LISTENING | DELAY_WAKEKILL.

> I'll go
> over it all again in the morning, perhaps I'll reach a different
> conclusion :-)

Same here ;)

Oleg.
diff mbox series

Patch

--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@  struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24	/* do_signal_stop() */
-#define JOBCTL_TRACED_BIT	25	/* ptrace_stop() */
+#define JOBCTL_STOPPED_BIT	25	/* do_signal_stop() */
+#define JOBCTL_TRACED_BIT	26	/* ptrace_stop() */
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@  struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,41 +193,44 @@  static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (task_is_traced(task) &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(READ_ONCE(task->__state) != TASK_TRACED);
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
 
-	/*
-	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
-	 * Recheck state under the lock to close this race.
-	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
+	if (task_is_traced(task)) {
+//		WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+		task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 		if (__fatal_signal_pending(task)) {
 			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+			wake_up_state(task, TASK_WAKEKILL);
+		}
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +254,45 @@  static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks.  After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	/* Wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop(). */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1337,7 @@  SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
  out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1479,7 @@  COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
  out_put_task_struct:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@  static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 *  - we form a control dependency vs deactivate_task() below.
-	 *  - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@  void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@  void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@  static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@  static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@  static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACE_QUIESCE bridges the gap between
+		 * set_current_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touched ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
+		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+
 		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2335,6 +2360,7 @@  static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.