[v3,00/11] Add support for synchronous signals on perf events

Message ID 20210324112503.623833-1-elver@google.com

Message

Marco Elver March 24, 2021, 11:24 a.m. UTC
The perf subsystem today unifies various tracing and monitoring
features, from both software and hardware. One benefit of the perf
subsystem is automatically inheriting events to child tasks, which
enables process-wide events monitoring with low overheads. By default
perf events are non-intrusive, not affecting behaviour of the tasks
being monitored.

For certain use-cases, however, it makes sense to leverage the
generality of the perf events subsystem and optionally allow the tasks
being monitored to receive signals on events they are interested in.
This patch series adds the option to synchronously signal user space on
events.

To better support process-wide synchronous self-monitoring, without
events propagating to children that do not share the current process's
shared environment, two pre-requisite patches are added to optionally
restrict inheritance to CLONE_THREAD, and remove events on exec (without
affecting the parent).

Examples of how to use these features can be found in the tests added at
the end of the series. In addition to the tests added, the series has
also been subjected to syzkaller fuzzing (focus on 'kernel/events/'
coverage).
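
As a rough illustration of the user-space side described above, the
following is a minimal sketch (not taken from the series' tests) of filling
a perf_event_attr for a process-wide watchpoint. It assumes the attribute
bits are named sigtrap, inherit_thread and remove_on_exec as in this series,
and that the installed UAPI headers already carry them. The helper names
watchpoint_attr_init() and watchpoint_open() are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe a 4-byte read/write watchpoint on `addr`
 * that signals the accessing thread. Field names follow this series.
 */
void watchpoint_attr_init(struct perf_event_attr *attr, const volatile void *addr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_BREAKPOINT;
	attr->size = sizeof(*attr);
	attr->sample_period = 1;	/* overflow on every access */
	attr->disabled = 1;		/* enable once a SIGTRAP handler is installed */
	attr->bp_addr = (uintptr_t)addr;
	attr->bp_type = HW_BREAKPOINT_RW;
	attr->bp_len = HW_BREAKPOINT_LEN_4;
	attr->exclude_kernel = 1;
	attr->exclude_hv = 1;
	attr->inherit = 1;		/* propagate to new tasks ... */
	attr->inherit_thread = 1;	/* ... but only if cloned with CLONE_THREAD */
	attr->remove_on_exec = 1;	/* drop the event on exec */
	attr->sigtrap = 1;		/* synchronous SIGTRAP on overflow */
}

/* Hypothetical helper: open the event on the calling task, any CPU. */
int watchpoint_open(struct perf_event_attr *attr)
{
	return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1,
			    PERF_FLAG_FD_CLOEXEC);
}
```

The event is created disabled so that the caller can install its SIGTRAP
handler first and only then enable the event with PERF_EVENT_IOC_ENABLE.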

Motivation and Example Uses
---------------------------

1. 	Our immediate motivation is low-overhead sampling-based race
	detection for user space [1]. By using perf_event_open() at
	process initialization, we can create hardware
	breakpoint/watchpoint events that are propagated automatically
	to all threads in a process. As far as we are aware, today no
	existing kernel facility (such as ptrace) allows us to set up
	process-wide watchpoints with minimal overheads (that are
	comparable to mprotect() of whole pages).

2.	Other low-overhead error detectors that rely on detecting
	accesses to certain memory locations or code, either process-wide
	or only in a specific set of subtasks or threads.

[1] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf

Other use-cases we found interesting are listed below; they are only
meant to illustrate the range of possibilities and further motivate the
feature (we're sure there are more):

3.	Code hot patching without full stop-the-world. Specifically, set
	a code breakpoint at the entry of the patched routine, then
	send signals to threads and check that they are not in the
	routine, but without stopping them further. If any of the
	threads enters the routine, it will receive SIGTRAP and
	pause.

4.	Safepoints without mprotect(). Some Java implementations use
	"load from a known memory location" as a safepoint. When threads
	need to be stopped, the page containing the location is
	mprotect()ed and threads get a signal. This could be replaced with
	a watchpoint, which does not require a whole page nor DTLB
	shootdowns.

5.	Threads receiving signals on performance events to
	throttle/unthrottle themselves.

6.	Tracking data flow globally.

Changelog
---------

v3:
* Add patch "perf: Rework perf_event_exit_event()" to beginning of
  series, courtesy of Peter Zijlstra.
* Rework "perf: Add support for event removal on exec" based on
  the added "perf: Rework perf_event_exit_event()".
* Fix kselftests to work with more recent libc, due to the way it forces
  using the kernel's own siginfo_t.
* Add basic perf-tool built-in test.

v2/RFC: https://lkml.kernel.org/r/20210310104139.679618-1-elver@google.com
* Patch "Support only inheriting events if cloned with CLONE_THREAD"
  added to series.
* Patch "Add support for event removal on exec" added to series.
* Patch "Add kselftest for process-wide sigtrap handling" added to
  series.
* Patch "Add kselftest for remove_on_exec" added to series.
* Implicitly restrict inheriting events if sigtrap, but the child was
  cloned with CLONE_CLEAR_SIGHAND, because it is not generally safe if
  the child cleared all signal handlers to continue sending SIGTRAP.
* Various minor fixes (see details in patches).

v1/RFC: https://lkml.kernel.org/r/20210223143426.2412737-1-elver@google.com

Pre-series: The discussion at [2] led to the changes in this series. The
approach taken in "Add support for SIGTRAP on perf events" to trigger
the signal was suggested by Peter Zijlstra in [3].

[2] https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-XQ@mail.gmail.com/

[3] https://lore.kernel.org/lkml/YBv3rAT566k+6zjg@hirez.programming.kicks-ass.net/


Marco Elver (10):
  perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
  perf: Support only inheriting events if cloned with CLONE_THREAD
  perf: Add support for event removal on exec
  signal: Introduce TRAP_PERF si_code and si_perf to siginfo
  perf: Add support for SIGTRAP on perf events
  perf: Add breakpoint information to siginfo on SIGTRAP
  selftests/perf_events: Add kselftest for process-wide sigtrap handling
  selftests/perf_events: Add kselftest for remove_on_exec
  tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
  perf test: Add basic stress test for sigtrap handling

Peter Zijlstra (1):
  perf: Rework perf_event_exit_event()

 arch/m68k/kernel/signal.c                     |   3 +
 arch/x86/kernel/signal_compat.c               |   5 +-
 fs/signalfd.c                                 |   4 +
 include/linux/compat.h                        |   2 +
 include/linux/perf_event.h                    |   6 +-
 include/linux/signal.h                        |   1 +
 include/uapi/asm-generic/siginfo.h            |   6 +-
 include/uapi/linux/perf_event.h               |   5 +-
 include/uapi/linux/signalfd.h                 |   4 +-
 kernel/events/core.c                          | 297 +++++++++++++-----
 kernel/fork.c                                 |   2 +-
 kernel/signal.c                               |  11 +
 tools/include/uapi/linux/perf_event.h         |   5 +-
 tools/perf/tests/Build                        |   1 +
 tools/perf/tests/builtin-test.c               |   5 +
 tools/perf/tests/sigtrap.c                    | 148 +++++++++
 tools/perf/tests/tests.h                      |   1 +
 .../testing/selftests/perf_events/.gitignore  |   3 +
 tools/testing/selftests/perf_events/Makefile  |   6 +
 tools/testing/selftests/perf_events/config    |   1 +
 .../selftests/perf_events/remove_on_exec.c    | 260 +++++++++++++++
 tools/testing/selftests/perf_events/settings  |   1 +
 .../selftests/perf_events/sigtrap_threads.c   | 206 ++++++++++++
 23 files changed, 896 insertions(+), 87 deletions(-)
 create mode 100644 tools/perf/tests/sigtrap.c
 create mode 100644 tools/testing/selftests/perf_events/.gitignore
 create mode 100644 tools/testing/selftests/perf_events/Makefile
 create mode 100644 tools/testing/selftests/perf_events/config
 create mode 100644 tools/testing/selftests/perf_events/remove_on_exec.c
 create mode 100644 tools/testing/selftests/perf_events/settings
 create mode 100644 tools/testing/selftests/perf_events/sigtrap_threads.c

Comments

Peter Zijlstra March 24, 2021, 12:53 p.m. UTC | #1
On Wed, Mar 24, 2021 at 12:24:59PM +0100, Marco Elver wrote:
> Encode information from breakpoint attributes into siginfo_t, which
> helps disambiguate which breakpoint fired.
> 
> Note, providing the event fd may be unreliable, since the event may have
> been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
> triggering and the signal being delivered to user space.
> 
> Signed-off-by: Marco Elver <elver@google.com>
> ---
> v2:
> * Add comment about si_perf==0.
> ---
>  kernel/events/core.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 1e4c949bf75f..0316d39e8c8f 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6399,6 +6399,22 @@ static void perf_sigtrap(struct perf_event *event)
>  	info.si_signo = SIGTRAP;
>  	info.si_code = TRAP_PERF;
>  	info.si_errno = event->attr.type;
> +
> +	switch (event->attr.type) {
> +	case PERF_TYPE_BREAKPOINT:
> +		info.si_addr = (void *)(unsigned long)event->attr.bp_addr;
> +		info.si_perf = (event->attr.bp_len << 16) | (u64)event->attr.bp_type;

Ahh, here's the si_perf user. It wasn't really clear to me what was
supposed to be in that field at patch #5 where it was introduced.

Would it perhaps make sense to put the user address of struct
perf_event_attr in there instead? (Obviously we'd have to carry it from
the syscall to here, but it might be more useful than a random encoding
of some bits therefrom).

Then we can also clearly document what's in that field, and it might be
more useful for possible other uses.
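
For reference, the encoding in the quoted hunk packs bp_len above bit 16 and
bp_type below it. A minimal user-space sketch of packing and unpacking that
value could look as follows. The function names are hypothetical, and the
decoder assumes bp_type always fits in the low 16 bits, which holds for the
HW_BREAKPOINT_* type constants.

```c
#include <linux/hw_breakpoint.h>
#include <stdint.h>

/* Mirrors the quoted hunk: (bp_len << 16) | bp_type. */
uint64_t si_perf_pack(uint64_t bp_len, uint32_t bp_type)
{
	return (bp_len << 16) | (uint64_t)bp_type;
}

/* Low 16 bits: breakpoint type (assumes the type never exceeds 16 bits). */
uint32_t si_perf_bp_type(uint64_t si_perf)
{
	return (uint32_t)(si_perf & 0xffff);
}

/* Remaining upper bits: breakpoint length. */
uint64_t si_perf_bp_len(uint64_t si_perf)
{
	return si_perf >> 16;
}
```

This only illustrates the v3 encoding under discussion here; it says nothing
about what the field ends up carrying after this review.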
Peter Zijlstra March 24, 2021, 1:21 p.m. UTC | #2
On Wed, Mar 24, 2021 at 02:01:56PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 24, 2021 at 01:53:48PM +0100, Peter Zijlstra wrote:
> > On Wed, Mar 24, 2021 at 12:24:59PM +0100, Marco Elver wrote:
> > > Encode information from breakpoint attributes into siginfo_t, which
> > > helps disambiguate which breakpoint fired.
> > > 
> > > Note, providing the event fd may be unreliable, since the event may have
> > > been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
> > > triggering and the signal being delivered to user space.
> > > 
> > > Signed-off-by: Marco Elver <elver@google.com>
> > > ---
> > > v2:
> > > * Add comment about si_perf==0.
> > > ---
> > >  kernel/events/core.c | 16 ++++++++++++++++
> > >  1 file changed, 16 insertions(+)
> > > 
> > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > index 1e4c949bf75f..0316d39e8c8f 100644
> > > --- a/kernel/events/core.c
> > > +++ b/kernel/events/core.c
> > > @@ -6399,6 +6399,22 @@ static void perf_sigtrap(struct perf_event *event)
> > >  	info.si_signo = SIGTRAP;
> > >  	info.si_code = TRAP_PERF;
> > >  	info.si_errno = event->attr.type;
> > > +
> > > +	switch (event->attr.type) {
> > > +	case PERF_TYPE_BREAKPOINT:
> > > +		info.si_addr = (void *)(unsigned long)event->attr.bp_addr;
> > > +		info.si_perf = (event->attr.bp_len << 16) | (u64)event->attr.bp_type;
> > 
> > Ahh, here's the si_perf user. I wasn't really clear to me what was
> > supposed to be in that field at patch #5 where it was introduced.
> > 
> > Would it perhaps make sense to put the user address of struct
> > perf_event_attr in there instead? (Obviously we'd have to carry it from
> > the syscall to here, but it might be more useful than a random encoding
> > of some bits therefrom).
> > 
> > Then we can also clearly document that's in that field, and it might be
> > more useful for possible other uses.
> 
> Something like so...

Ok possibly something like so, which also gets the data address right
for more cases.

---
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -778,6 +778,8 @@ struct perf_event {
 	void *security;
 #endif
 	struct list_head		sb_list;
+
+	struct kernel_siginfo 		siginfo;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5652,13 +5652,17 @@ static long _perf_ioctl(struct perf_even
 		return perf_event_query_prog_array(event, (void __user *)arg);
 
 	case PERF_EVENT_IOC_MODIFY_ATTRIBUTES: {
+		struct perf_event_attr __user *uattr;
 		struct perf_event_attr new_attr;
-		int err = perf_copy_attr((struct perf_event_attr __user *)arg,
-					 &new_attr);
+		int err;
 
+		uattr = (struct perf_event_attr __user *)arg;
+		err = perf_copy_attr(uattr, &new_attr);
 		if (err)
 			return err;
 
+		event->siginfo.si_perf = (unsigned long)uattr;
+
 		return perf_event_modify_attr(event,  &new_attr);
 	}
 	default:
@@ -6394,13 +6398,7 @@ void perf_event_wakeup(struct perf_event
 
 static void perf_sigtrap(struct perf_event *event)
 {
-	struct kernel_siginfo info;
-
-	clear_siginfo(&info);
-	info.si_signo = SIGTRAP;
-	info.si_code = TRAP_PERF;
-	info.si_errno = event->attr.type;
-	force_sig_info(&info);
+	force_sig_info(&event->siginfo);
 }
 
 static void perf_pending_event_disable(struct perf_event *event)
@@ -6414,8 +6412,8 @@ static void perf_pending_event_disable(s
 		WRITE_ONCE(event->pending_disable, -1);
 
 		if (event->attr.sigtrap) {
-			atomic_set(&event->event_limit, 1); /* rearm event */
 			perf_sigtrap(event);
+			atomic_set_release(&event->event_limit, 1); /* rearm event */
 			return;
 		}
 
@@ -9121,6 +9119,7 @@ static int __perf_event_overflow(struct
 	if (events && atomic_dec_and_test(&event->event_limit)) {
 		ret = 1;
 		event->pending_kill = POLL_HUP;
+		event->siginfo.si_addr = (void *)data->addr;
 
 		perf_event_disable_inatomic(event);
 	}
@@ -12011,6 +12010,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_task;
 	}
 
+	clear_siginfo(&event->siginfo);
+	event->siginfo.si_signo = SIGTRAP;
+	event->siginfo.si_code = TRAP_PERF;
+	event->siginfo.si_perf = (unsigned long)attr_uptr;
+
 	if (is_sampling_event(event)) {
 		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
 			err = -EOPNOTSUPP;
Marco Elver March 24, 2021, 1:47 p.m. UTC | #3
On Wed, 24 Mar 2021 at 14:21, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Mar 24, 2021 at 02:01:56PM +0100, Peter Zijlstra wrote:
> > On Wed, Mar 24, 2021 at 01:53:48PM +0100, Peter Zijlstra wrote:
> > > On Wed, Mar 24, 2021 at 12:24:59PM +0100, Marco Elver wrote:
> > > > Encode information from breakpoint attributes into siginfo_t, which
> > > > helps disambiguate which breakpoint fired.
> > > >
> > > > Note, providing the event fd may be unreliable, since the event may have
> > > > been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
> > > > triggering and the signal being delivered to user space.
> > > >
> > > > Signed-off-by: Marco Elver <elver@google.com>
> > > > ---
> > > > v2:
> > > > * Add comment about si_perf==0.
> > > > ---
> > > >  kernel/events/core.c | 16 ++++++++++++++++
> > > >  1 file changed, 16 insertions(+)
> > > >
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 1e4c949bf75f..0316d39e8c8f 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -6399,6 +6399,22 @@ static void perf_sigtrap(struct perf_event *event)
> > > >   info.si_signo = SIGTRAP;
> > > >   info.si_code = TRAP_PERF;
> > > >   info.si_errno = event->attr.type;
> > > > +
> > > > + switch (event->attr.type) {
> > > > + case PERF_TYPE_BREAKPOINT:
> > > > +         info.si_addr = (void *)(unsigned long)event->attr.bp_addr;
> > > > +         info.si_perf = (event->attr.bp_len << 16) | (u64)event->attr.bp_type;
> > >
> > > Ahh, here's the si_perf user. I wasn't really clear to me what was
> > > supposed to be in that field at patch #5 where it was introduced.
> > >
> > > Would it perhaps make sense to put the user address of struct
> > > perf_event_attr in there instead? (Obviously we'd have to carry it from
> > > the syscall to here, but it might be more useful than a random encoding
> > > of some bits therefrom).
> > >
> > > Then we can also clearly document that's in that field, and it might be
> > > more useful for possible other uses.
> >
> > Something like so...
>
> Ok possibly something like so, which also gets the data address right
> for more cases.

It'd be nice if this could work. Though I think there's an inherent
problem (same as with fd) with trying to pass a reference back to the
user, while the user can concurrently modify that reference.

Let's assume that user space creates new copies of perf_event_attr for
every version they want, there's still a race where the user modifies
an event, and concurrently in another thread a signal arrives. I
currently don't see a way to determine when it's safe to free or reuse
a perf_event_attr, without there still being a chance that a
signal arrives due to some old perf_event_attr. And for our usecase,
we really need to know a precise subset out of attr that triggered the
event.

So the safest thing I can see is to stash a copy of the relevant
information in siginfo, which is how we ended up with encoding bits
from perf_event_attr into si_perf.

One way around this I could see is that we know that there's a limited
number of combinations of attrs, and the user just creates an instance
for every version they want (and hope it doesn't exceed some large
number). Of course, for breakpoints, we have bp_addr, but let's assume
that si_addr has the right version, so we won't need to access
perf_event_attr::bp_addr.

But given the additional complexities, I'm not sure it's worth it. Is
there a way to solve the modify-signal-race problem in a nicer way?

Thanks,
-- Marco
Marco Elver March 24, 2021, 2:05 p.m. UTC | #4
On Wed, 24 Mar 2021 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
>
> One last try, I'll leave it alone now, I promise :-)

This looks like it does what you suggested, thanks! :-)

I'll still need to think about it, because of the potential problem
with modify-signal-races and what the user's synchronization story
would look like then.

> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -778,6 +778,9 @@ struct perf_event {
>         void *security;
>  #endif
>         struct list_head                sb_list;
> +
> +       unsigned long                   si_uattr;
> +       unsigned long                   si_data;
>  #endif /* CONFIG_PERF_EVENTS */
>  };
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -5652,13 +5652,17 @@ static long _perf_ioctl(struct perf_even
>                 return perf_event_query_prog_array(event, (void __user *)arg);
>
>         case PERF_EVENT_IOC_MODIFY_ATTRIBUTES: {
> +               struct perf_event_attr __user *uattr;
>                 struct perf_event_attr new_attr;
> -               int err = perf_copy_attr((struct perf_event_attr __user *)arg,
> -                                        &new_attr);
> +               int err;
>
> +               uattr = (struct perf_event_attr __user *)arg;
> +               err = perf_copy_attr(uattr, &new_attr);
>                 if (err)
>                         return err;
>
> +               event->si_uattr = (unsigned long)uattr;
> +
>                 return perf_event_modify_attr(event,  &new_attr);
>         }
>         default:
> @@ -6399,7 +6403,12 @@ static void perf_sigtrap(struct perf_eve
>         clear_siginfo(&info);
>         info.si_signo = SIGTRAP;
>         info.si_code = TRAP_PERF;
> -       info.si_errno = event->attr.type;
> +       info.si_addr = (void *)event->si_data;
> +
> +       info.si_perf = event->si_uattr;
> +       if (event->parent)
> +               info.si_perf = event->parent->si_uattr;
> +
>         force_sig_info(&info);
>  }
>
> @@ -6414,8 +6423,8 @@ static void perf_pending_event_disable(s
>                 WRITE_ONCE(event->pending_disable, -1);
>
>                 if (event->attr.sigtrap) {
> -                       atomic_set(&event->event_limit, 1); /* rearm event */
>                         perf_sigtrap(event);
> +                       atomic_set_release(&event->event_limit, 1); /* rearm event */
>                         return;
>                 }
>
> @@ -9121,6 +9130,7 @@ static int __perf_event_overflow(struct
>         if (events && atomic_dec_and_test(&event->event_limit)) {
>                 ret = 1;
>                 event->pending_kill = POLL_HUP;
> +               event->si_data = data->addr;
>
>                 perf_event_disable_inatomic(event);
>         }
> @@ -12011,6 +12021,8 @@ SYSCALL_DEFINE5(perf_event_open,
>                 goto err_task;
>         }
>
> +       event->si_uattr = (unsigned long)attr_uptr;
> +
>         if (is_sampling_event(event)) {
>                 if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>                         err = -EOPNOTSUPP;
Dmitry Vyukov March 24, 2021, 2:15 p.m. UTC | #5
On Wed, Mar 24, 2021 at 3:12 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > On Wed, 24 Mar 2021 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > One last try, I'll leave it alone now, I promise :-)
> >
> > This looks like it does what you suggested, thanks! :-)
> >
> > I'll still need to think about it, because of the potential problem
> > with modify-signal-races and what the user's synchronization story
> > would look like then.
>
> I agree that this looks inherently racy. The attr can't be allocated
> on stack, user synchronization may be tricky and expensive. The API
> may provoke bugs and some users may not even realize the race problem.
>
> One potential alternative is use of an opaque u64 context (if we could
> shove it into the attr). A user can pass a pointer to the attr in
> there (makes it equivalent to this proposal), or bit-pack size/type
> (as we want), pass some sequence number or whatever.

Just to clarify what I was thinking about, but did not really state:
perf_event_attr_t includes u64 ctx, and we return it back to the user
in siginfo_t. Kernel does not treat it in any way. This is a pretty
common API pattern in general.


Marco Elver March 25, 2021, 7 a.m. UTC | #6
On Wed, 24 Mar 2021 at 15:15, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Wed, Mar 24, 2021 at 3:12 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > On Wed, 24 Mar 2021 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > One last try, I'll leave it alone now, I promise :-)
> > >
> > > This looks like it does what you suggested, thanks! :-)
> > >
> > > I'll still need to think about it, because of the potential problem
> > > with modify-signal-races and what the user's synchronization story
> > > would look like then.
> >
> > I agree that this looks inherently racy. The attr can't be allocated
> > on stack, user synchronization may be tricky and expensive. The API
> > may provoke bugs and some users may not even realize the race problem.
> >
> > One potential alternative is use of an opaque u64 context (if we could
> > shove it into the attr). A user can pass a pointer to the attr in
> > there (makes it equivalent to this proposal), or bit-pack size/type
> > (as we want), pass some sequence number or whatever.
>
> Just to clarify what I was thinking about, but did not really state:
> perf_event_attr_t includes u64 ctx, and we return it back to the user
> in siginfo_t. Kernel does not treat it in any way. This is a pretty
> common API pattern in general.

Ok, let's go for a new field in perf_event_attr which is copied to
si_perf. This gives user space full flexibility to decide what to
stick in it, and the kernel does not prescribe some weird encoding or
synchronization that user space would have to live with. I'll probably
call it perf_event_attr::sig_data, because all si_* things are macros.

Thanks,
-- Marco
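
To make the "user space decides what to stick in it" point concrete, here
is one possible sketch of an opaque value that addresses the
modify-signal race discussed above: the value carries a table slot and a
generation number, so a handler can recognise a signal that belongs to an
old configuration. All names (wp_table, wp_cookie(), wp_reprogram(),
wp_cookie_is_current()) are hypothetical, and the code is deliberately
single-threaded; a real handler would need atomic accesses to the table.

```c
#include <stdbool.h>
#include <stdint.h>

#define WP_SLOTS 16

/* Hypothetical user-space bookkeeping, one entry per watchpoint. */
struct wp_slot {
	uint32_t generation;	/* bumped whenever the slot is re-programmed */
	uintptr_t addr;		/* address currently being watched */
};

static struct wp_slot wp_table[WP_SLOTS];

/* Opaque value for the attr: generation in the high half, slot in the low. */
uint64_t wp_cookie(uint32_t slot)
{
	return ((uint64_t)wp_table[slot].generation << 32) | slot;
}

/* Re-program a slot and return the value to pass along with the new attr. */
uint64_t wp_reprogram(uint32_t slot, uintptr_t addr)
{
	wp_table[slot].generation++;
	wp_table[slot].addr = addr;
	return wp_cookie(slot);
}

/* Handler side: does the value match the slot's current configuration? */
bool wp_cookie_is_current(uint64_t cookie)
{
	uint32_t slot = (uint32_t)cookie;
	uint32_t gen = (uint32_t)(cookie >> 32);

	return slot < WP_SLOTS && wp_table[slot].generation == gen;
}
```

With this scheme, a signal generated before a modification carries the old
generation and can simply be ignored, without the kernel having to know
anything about the encoding.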
Marco Elver March 25, 2021, 10:17 a.m. UTC | #7
On Wed, Mar 24, 2021 at 12:24PM +0100, Marco Elver wrote:
> From: Peter Zijlstra <peterz@infradead.org>
>
> Make perf_event_exit_event() more robust, such that we can use it from
> other contexts. Specifically the up and coming remove_on_exec.
>
> For this to work we need to address a few issues. Remove_on_exec will
> not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to
> disable event_function_call() and we thus have to use
> perf_remove_from_context().
>
> When using perf_remove_from_context(), there's two races to consider.
> The first is against close(), where we can have concurrent tear-down
> of the event. The second is against child_list iteration, which should
> not find a half baked event.
>
> To address this, teach perf_remove_from_context() to special case
> !ctx->is_active and about DETACH_CHILD.
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Marco Elver <elver@google.com>
> ---
> v3:
> * New dependency for series:
>   https://lkml.kernel.org/r/YFn/I3aKF+TOjGcl@hirez.programming.kicks-ass.net
> ---

syzkaller found a crash with stack trace pointing at changes in this
patch. Can't tell if this is an old issue or introduced in this series.

It looks like task_pid_ptr() wants to access task_struct::signal, but
the task_struct pointer is NULL.

Any ideas?

general protection fault, probably for non-canonical address 0xdffffc0000000103: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000818-0x000000000000081f]
CPU: 2 PID: 15084 Comm: syz-executor.1 Not tainted 5.12.0-rc4+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
RIP: 0010:task_pid_ptr kernel/pid.c:325 [inline]
RIP: 0010:__task_pid_nr_ns+0x137/0x3e0 kernel/pid.c:500
Code: 8b 75 00 eb 08 e8 59 28 29 00 45 31 f6 31 ff 44 89 fe e8 5c 2c 29 00 45 85 ff 74 49 48 81 c3 20 08 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 aa 03 6d 00 48 8b 2b 44 89 fb bf
RSP: 0018:ffffc9000c76f6d0 EFLAGS: 00010007
RAX: 0000000000000103 RBX: 000000000000081f RCX: ffff8880717d8000
RDX: ffff8880717d8000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000001 R08: ffffffff814fe814 R09: fffffbfff1f296b1
R10: fffffbfff1f296b1 R11: 0000000000000000 R12: dffffc0000000000
R13: 1ffff1100e6dfc5c R14: ffff888057fba108 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff88802cf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffcc3b05bc0 CR3: 0000000040ac0000 CR4: 0000000000750ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 perf_event_pid_type kernel/events/core.c:1412 [inline]
 perf_event_pid kernel/events/core.c:1421 [inline]
 perf_event_read_event kernel/events/core.c:7511 [inline]
 sync_child_event kernel/events/core.c:12521 [inline]
 perf_child_detach kernel/events/core.c:2223 [inline]
 __perf_remove_from_context+0x569/0xd30 kernel/events/core.c:2359
 perf_remove_from_context+0x19d/0x220 kernel/events/core.c:2395
 perf_event_exit_event+0x76/0x950 kernel/events/core.c:12559
 perf_event_exit_task_context kernel/events/core.c:12640 [inline]
 perf_event_exit_task+0x715/0xa40 kernel/events/core.c:12673
 do_exit+0x6c2/0x2290 kernel/exit.c:834
 do_group_exit+0x168/0x2d0 kernel/exit.c:922
 get_signal+0x1734/0x1ef0 kernel/signal.c:2779
 arch_do_signal_or_restart+0x41/0x620 arch/x86/kernel/signal.c:789
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208
 irqentry_exit_to_user_mode+0x6/0x40 kernel/entry/common.c:314
 exc_general_protection+0x222/0x370 arch/x86/kernel/traps.c:530
 asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571
Ingo Molnar March 25, 2021, 2:18 p.m. UTC | #8
* Dmitry Vyukov <dvyukov@google.com> wrote:

> On Wed, Mar 24, 2021 at 3:05 PM Marco Elver <elver@google.com> wrote:
> >
> > On Wed, 24 Mar 2021 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > One last try, I'll leave it alone now, I promise :-)
> >
> > This looks like it does what you suggested, thanks! :-)
> >
> > I'll still need to think about it, because of the potential problem
> > with modify-signal-races and what the user's synchronization story
> > would look like then.
>
> I agree that this looks inherently racy. The attr can't be allocated
> on stack, user synchronization may be tricky and expensive. The API
> may provoke bugs and some users may not even realize the race problem.

Yeah, so why cannot we allocate enough space from the signal handler 
user-space stack and put the attr there, and point to it from 
sig_info?

The idea would be to create a stable, per-signal snapshot of whatever 
the perf_attr state is at the moment the event happens and the signal 
is generated - which is roughly what user-space wants, right?

Thanks,

	Ingo
Marco Elver March 25, 2021, 3:17 p.m. UTC | #9
On Thu, 25 Mar 2021 at 15:18, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Dmitry Vyukov <dvyukov@google.com> wrote:
>
> > On Wed, Mar 24, 2021 at 3:05 PM Marco Elver <elver@google.com> wrote:
> > >
> > > On Wed, 24 Mar 2021 at 15:01, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > One last try, I'll leave it alone now, I promise :-)
> > >
> > > This looks like it does what you suggested, thanks! :-)
> > >
> > > I'll still need to think about it, because of the potential problem
> > > with modify-signal-races and what the user's synchronization story
> > > would look like then.
> >
> > I agree that this looks inherently racy. The attr can't be allocated
> > on stack, user synchronization may be tricky and expensive. The API
> > may provoke bugs and some users may not even realize the race problem.
>
> Yeah, so why cannot we allocate enough space from the signal handler
> user-space stack and put the attr there, and point to it from
> sig_info?
>
> The idea would be to create a stable, per-signal snapshot of whatever
> the perf_attr state is at the moment the event happens and the signal
> is generated - which is roughly what user-space wants, right?

I certainly couldn't say how feasible this is. Is there infrastructure
in place to do this? Or do we have to introduce support for stashing
things on the signal stack?

From what we can tell, the most flexible option though appears to be
just some user settable opaque data in perf_event_attr, that is copied
to siginfo. It'd allow user space to store a pointer or a hash/key, or
just encode the relevant information it wants; but could also go
further, and add information beyond perf_event_attr, such as things
like a signal receiver filter (e.g. task ID or set of threads which
should process the signal etc.).

So if there's no strong objection to the additional field in
perf_event_attr, I think it'll give us the simplest and most flexible
option.

Thanks,
-- Marco

> Thanks,
>
>         Ingo
Ingo Molnar March 25, 2021, 3:35 p.m. UTC | #10
* Marco Elver <elver@google.com> wrote:

> > Yeah, so why cannot we allocate enough space from the signal
> > handler user-space stack and put the attr there, and point to it
> > from sig_info?
> >
> > The idea would be to create a stable, per-signal snapshot of
> > whatever the perf_attr state is at the moment the event happens
> > and the signal is generated - which is roughly what user-space
> > wants, right?
>
> I certainly couldn't say how feasible this is. Is there
> infrastructure in place to do this? Or do we have to introduce
> support for stashing things on the signal stack?
>
> From what we can tell, the most flexible option though appears to be
> just some user settable opaque data in perf_event_attr, that is
> copied to siginfo. It'd allow user space to store a pointer or a
> hash/key, or just encode the relevant information it wants; but
> could also go further, and add information beyond perf_event_attr,
> such as things like a signal receiver filter (e.g. task ID or set of
> threads which should process the signal etc.).
>
> So if there's no strong objection to the additional field in
> perf_event_attr, I think it'll give us the simplest and most
> flexible option.

Sounds good to me - it's also probably measurably faster than copying 
the not-so-small-anymore perf_attr structure.

Thanks,

	Ingo
Marco Elver March 25, 2021, 4:17 p.m. UTC | #11
On Thu, Mar 25, 2021 at 11:17AM +0100, Marco Elver wrote:
> On Wed, Mar 24, 2021 at 12:24PM +0100, Marco Elver wrote:
> > From: Peter Zijlstra <peterz@infradead.org>
> >
> > Make perf_event_exit_event() more robust, such that we can use it from
> > other contexts. Specifically the up and coming remove_on_exec.
> >
> > For this to work we need to address a few issues. Remove_on_exec will
> > not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to
> > disable event_function_call() and we thus have to use
> > perf_remove_from_context().
> >
> > When using perf_remove_from_context(), there's two races to consider.
> > The first is against close(), where we can have concurrent tear-down
> > of the event. The second is against child_list iteration, which should
> > not find a half baked event.
> >
> > To address this, teach perf_remove_from_context() to special case
> > !ctx->is_active and about DETACH_CHILD.
> >
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > Signed-off-by: Marco Elver <elver@google.com>
> > ---
> > v3:
> > * New dependency for series:
> >   https://lkml.kernel.org/r/YFn/I3aKF+TOjGcl@hirez.programming.kicks-ass.net
> > ---
>
> syzkaller found a crash with stack trace pointing at changes in this
> patch. Can't tell if this is an old issue or introduced in this series.

Yay, I found a reproducer. v5.12-rc4 is good, and sadly with this patch only we
crash. :-/

Here's a stacktrace with just this patch applied:

| BUG: kernel NULL pointer dereference, address: 00000000000007af
| #PF: supervisor read access in kernel mode
| #PF: error_code(0x0000) - not-present page
| PGD 0 P4D 0
| Oops: 0000 [#1] PREEMPT SMP PTI
| CPU: 7 PID: 465 Comm: a.out Not tainted 5.12.0-rc4+ #25
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
| RIP: 0010:task_pid_ptr kernel/pid.c:324 [inline]
| RIP: 0010:__task_pid_nr_ns+0x112/0x240 kernel/pid.c:500
| Code: e8 13 55 07 00 e8 1e a6 0e 00 48 c7 c6 83 1e 0b 81 48 c7 c7 a0 2e d5 82 e8 4b 08 04 00 44 89 e0 5b 5d 41 5c c3 e8 fe a5 0e 00 <48> 8b 85 b0 07 00 00 4a 8d ac e0 98 01 00 00 e9 5a ff ff ff e8 e5
| RSP: 0000:ffffc90001b73a60 EFLAGS: 00010093
| RAX: 0000000000000000 RBX: ffffffff82c69820 RCX: ffffffff810b1eb2
| RDX: ffff888108d143c0 RSI: 0000000000000000 RDI: ffffffff8299ccc6
| RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000000
| R10: ffff888108d14db8 R11: 0000000000000000 R12: 0000000000000001
| R13: ffffffffffffffff R14: ffffffffffffffff R15: ffff888108e05240
| FS:  0000000000000000(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
| CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
| CR2: 00000000000007af CR3: 0000000002c22002 CR4: 0000000000770ee0
| DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
| DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
| PKRU: 55555554
| Call Trace:
|  perf_event_pid_type kernel/events/core.c:1412 [inline]
|  perf_event_pid kernel/events/core.c:1421 [inline]
|  perf_event_read_event+0x78/0x1d0 kernel/events/core.c:7406
|  sync_child_event kernel/events/core.c:12404 [inline]
|  perf_child_detach kernel/events/core.c:2223 [inline]
|  __perf_remove_from_context+0x14d/0x280 kernel/events/core.c:2359
|  perf_remove_from_context+0x9f/0xf0 kernel/events/core.c:2395
|  perf_event_exit_event kernel/events/core.c:12442 [inline]
|  perf_event_exit_task_context kernel/events/core.c:12523 [inline]
|  perf_event_exit_task+0x276/0x4c0 kernel/events/core.c:12556
|  do_exit+0x4cd/0xed0 kernel/exit.c:834
|  do_group_exit+0x4d/0xf0 kernel/exit.c:922
|  get_signal+0x1d2/0xf30 kernel/signal.c:2777
|  arch_do_signal_or_restart+0xf7/0x750 arch/x86/kernel/signal.c:789
|  handle_signal_work kernel/entry/common.c:147 [inline]
|  exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
|  exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:208
|  irqentry_exit_to_user_mode+0x6/0x30 kernel/entry/common.c:314
|  asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571

Attached is a C reproducer of the syzkaller program that crashes us.

Thanks,
-- Marco
// autogenerated by syzkaller (https://github.com/google/syzkaller)
/*
Generated from this syzkaller program:

clone(0x88004400, 0x0, 0x0, 0x0, 0x0)
perf_event_open(&(0x7f00000003c0)={0x4, 0x70, 0x40, 0x1, 0x3, 0x1, 0x0, 0x6, 0x10001, 0x0, 0x0, 0x1, 0x0, 0x1, 0x1, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x0, 0x1, 0x1, 0x1, 0x0, 0x1, 0x1, 0x0, 0x1, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x80000001, 0x2, @perf_bp={&(0x7f0000000380), 0xd}, 0x1000, 0x6, 0x0, 0x4, 0x1, 0x4, 0x8}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x1)
clone(0x8000, &(0x7f0000000200)="3017248985480229c715f01f2776139977f49770d8181077dce816423a929ed5e59bf26ca77f2ba311b783dda29870d621ff2394424d9c799be5fa29f1ee42102645b56fd9727401d2fe52073c20023d4623dd48522d13dff56af96e4d73f53d62f3de841a58436c591733b58072f04a49bd5cf0473e3f568b604959c06365a82e0e1350550271c25298", &(0x7f0000000100), &(0x7f0000000140), &(0x7f00000002c0)="8c0e32ae8f2716cdf998f341eb4ff0b404c7dca07d9e895c109603d3552c42f07c0190860e4c880d03ba867e8d5d738172839bdbe974d38580e5bc8a91713bee4b859c1a4500f61f197d3610ef2f515474d0b302af29f64053899418054cdf0afe2e75f313f92daf84b3f77cdb10d9d002c44bf43d0cb532cce29b249aab4d6e8218e2528c95453d255e31715422b9d3014c35603fa361ec70136322a7366868f53b78b7c369496dc39cf8ea248b7345e378")
*/

#define _GNU_SOURCE

#include <endian.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define BITMASK(bf_off, bf_len) (((1ull << (bf_len)) - 1) << (bf_off))
#define STORE_BY_BITMASK(type, htobe, addr, val, bf_off, bf_len)               \
  *(type*)(addr) =                                                             \
      htobe((htobe(*(type*)(addr)) & ~BITMASK((bf_off), (bf_len))) |           \
            (((type)(val) << (bf_off)) & BITMASK((bf_off), (bf_len))))

int main(void)
{
  syscall(__NR_mmap, 0x1ffff000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
  syscall(__NR_mmap, 0x21000000ul, 0x1000ul, 0ul, 0x32ul, -1, 0ul);
  syscall(__NR_clone, 0x88004400ul, 0ul, 0ul, 0ul, 0ul);
  *(uint32_t*)0x200003c0 = 4;
  *(uint32_t*)0x200003c4 = 0x70;
  *(uint8_t*)0x200003c8 = 0x40;
  *(uint8_t*)0x200003c9 = 1;
  *(uint8_t*)0x200003ca = 3;
  *(uint8_t*)0x200003cb = 1;
  *(uint32_t*)0x200003cc = 0;
  *(uint64_t*)0x200003d0 = 6;
  *(uint64_t*)0x200003d8 = 0x10001;
  *(uint64_t*)0x200003e0 = 0;
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 0, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 1, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 2, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 3, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 4, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 5, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 6, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 7, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 8, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 9, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 10, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 11, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 12, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 13, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 14, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 15, 2);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 17, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 18, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 19, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 20, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 21, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 22, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 23, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 24, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 25, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 26, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 27, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 28, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 29, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 30, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 31, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 32, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 33, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 1, 34, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 35, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 36, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 37, 1);
  STORE_BY_BITMASK(uint64_t, , 0x200003e8, 0, 38, 26);
  *(uint32_t*)0x200003f0 = 0x80000001;
  *(uint32_t*)0x200003f4 = 2;
  *(uint64_t*)0x200003f8 = 0x20000380;
  *(uint64_t*)0x20000400 = 0xd;
  *(uint64_t*)0x20000408 = 0x1000;
  *(uint64_t*)0x20000410 = 6;
  *(uint32_t*)0x20000418 = 0;
  *(uint32_t*)0x2000041c = 4;
  *(uint64_t*)0x20000420 = 1;
  *(uint32_t*)0x20000428 = 4;
  *(uint16_t*)0x2000042c = 8;
  *(uint16_t*)0x2000042e = 0;
  syscall(__NR_perf_event_open, 0x200003c0ul, 0, -1ul, -1, 1ul);
  memcpy(
      (void*)0x20000200,
      "\x30\x17\x24\x89\x85\x48\x02\x29\xc7\x15\xf0\x1f\x27\x76\x13\x99\x77\xf4"
      "\x97\x70\xd8\x18\x10\x77\xdc\xe8\x16\x42\x3a\x92\x9e\xd5\xe5\x9b\xf2\x6c"
      "\xa7\x7f\x2b\xa3\x11\xb7\x83\xdd\xa2\x98\x70\xd6\x21\xff\x23\x94\x42\x4d"
      "\x9c\x79\x9b\xe5\xfa\x29\xf1\xee\x42\x10\x26\x45\xb5\x6f\xd9\x72\x74\x01"
      "\xd2\xfe\x52\x07\x3c\x20\x02\x3d\x46\x23\xdd\x48\x52\x2d\x13\xdf\xf5\x6a"
      "\xf9\x6e\x4d\x73\xf5\x3d\x62\xf3\xde\x84\x1a\x58\x43\x6c\x59\x17\x33\xb5"
      "\x80\x72\xf0\x4a\x49\xbd\x5c\xf0\x47\x3e\x3f\x56\x8b\x60\x49\x59\xc0\x63"
      "\x65\xa8\x2e\x0e\x13\x50\x55\x02\x71\xc2\x52\x98",
      138);
  memcpy(
      (void*)0x200002c0,
      "\x8c\x0e\x32\xae\x8f\x27\x16\xcd\xf9\x98\xf3\x41\xeb\x4f\xf0\xb4\x04\xc7"
      "\xdc\xa0\x7d\x9e\x89\x5c\x10\x96\x03\xd3\x55\x2c\x42\xf0\x7c\x01\x90\x86"
      "\x0e\x4c\x88\x0d\x03\xba\x86\x7e\x8d\x5d\x73\x81\x72\x83\x9b\xdb\xe9\x74"
      "\xd3\x85\x80\xe5\xbc\x8a\x91\x71\x3b\xee\x4b\x85\x9c\x1a\x45\x00\xf6\x1f"
      "\x19\x7d\x36\x10\xef\x2f\x51\x54\x74\xd0\xb3\x02\xaf\x29\xf6\x40\x53\x89"
      "\x94\x18\x05\x4c\xdf\x0a\xfe\x2e\x75\xf3\x13\xf9\x2d\xaf\x84\xb3\xf7\x7c"
      "\xdb\x10\xd9\xd0\x02\xc4\x4b\xf4\x3d\x0c\xb5\x32\xcc\xe2\x9b\x24\x9a\xab"
      "\x4d\x6e\x82\x18\xe2\x52\x8c\x95\x45\x3d\x25\x5e\x31\x71\x54\x22\xb9\xd3"
      "\x01\x4c\x35\x60\x3f\xa3\x61\xec\x70\x13\x63\x22\xa7\x36\x68\x68\xf5\x3b"
      "\x78\xb7\xc3\x69\x49\x6d\xc3\x9c\xf8\xea\x24\x8b\x73\x45\xe3\x78",
      178);
  syscall(__NR_clone, 0x8000ul, 0x20000200ul, 0x20000100ul, 0x20000140ul,
          0x200002c0ul);
  return 0;
}
Marco Elver March 25, 2021, 7:10 p.m. UTC | #12
On Thu, Mar 25, 2021 at 05:17PM +0100, Marco Elver wrote:
[...]
> > syzkaller found a crash with stack trace pointing at changes in this
> > patch. Can't tell if this is an old issue or introduced in this series.
>
> Yay, I found a reproducer. v5.12-rc4 is good, and sadly with this patch only we
> crash. :-/
>
> Here's a stacktrace with just this patch applied:
>
> | BUG: kernel NULL pointer dereference, address: 00000000000007af
[...]
> | RIP: 0010:task_pid_ptr kernel/pid.c:324 [inline]
> | RIP: 0010:__task_pid_nr_ns+0x112/0x240 kernel/pid.c:500
[...]
> | Call Trace:
> |  perf_event_pid_type kernel/events/core.c:1412 [inline]
> |  perf_event_pid kernel/events/core.c:1421 [inline]
> |  perf_event_read_event+0x78/0x1d0 kernel/events/core.c:7406
> |  sync_child_event kernel/events/core.c:12404 [inline]
> |  perf_child_detach kernel/events/core.c:2223 [inline]
> |  __perf_remove_from_context+0x14d/0x280 kernel/events/core.c:2359
> |  perf_remove_from_context+0x9f/0xf0 kernel/events/core.c:2395
> |  perf_event_exit_event kernel/events/core.c:12442 [inline]
> |  perf_event_exit_task_context kernel/events/core.c:12523 [inline]
> |  perf_event_exit_task+0x276/0x4c0 kernel/events/core.c:12556
> |  do_exit+0x4cd/0xed0 kernel/exit.c:834
> |  do_group_exit+0x4d/0xf0 kernel/exit.c:922
> |  get_signal+0x1d2/0xf30 kernel/signal.c:2777
> |  arch_do_signal_or_restart+0xf7/0x750 arch/x86/kernel/signal.c:789
> |  handle_signal_work kernel/entry/common.c:147 [inline]
> |  exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
> |  exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:208
> |  irqentry_exit_to_user_mode+0x6/0x30 kernel/entry/common.c:314
> |  asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571

I spun up gdb, and it showed me this:

| #0  perf_event_read_event (event=event@entry=0xffff888107cd5000, task=task@entry=0xffffffffffffffff)
|     at kernel/events/core.c:7397
									^^^ TASK_TOMBSTONE
| #1  0xffffffff811fc9cd in sync_child_event (child_event=0xffff888107cd5000) at kernel/events/core.c:12404
| #2  perf_child_detach (event=0xffff888107cd5000) at kernel/events/core.c:2223
| #3  __perf_remove_from_context (event=event@entry=0xffff888107cd5000, cpuctx=cpuctx@entry=0xffff88842fdf0c00,
|     ctx=ctx@entry=0xffff8881073cb800, info=info@entry=0x3 <fixed_percpu_data+3>) at kernel/events/core.c:2359
| #4  0xffffffff811fcb9f in perf_remove_from_context (event=event@entry=0xffff888107cd5000, flags=flags@entry=3)
|     at kernel/events/core.c:2395
| #5  0xffffffff81204526 in perf_event_exit_event (ctx=0xffff8881073cb800, event=0xffff888107cd5000)
|     at kernel/events/core.c:12442
| #6  perf_event_exit_task_context (ctxn=0, child=0xffff88810531a200) at kernel/events/core.c:12523
| #7  perf_event_exit_task (child=0xffff88810531a200) at kernel/events/core.c:12556
| #8  0xffffffff8108838d in do_exit (code=code@entry=11) at kernel/exit.c:834
| #9  0xffffffff81088e4d in do_group_exit (exit_code=11) at kernel/exit.c:922

and therefore synthesized this fix on top:

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 57de8d436efd..e77294c7e654 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12400,7 +12400,7 @@ static void sync_child_event(struct perf_event *child_event)
 	if (child_event->attr.inherit_stat) {
 		struct task_struct *task = child_event->ctx->task;
 
-		if (task)
+		if (task && task != TASK_TOMBSTONE)
 			perf_event_read_event(child_event, task);
 	}
 
which fixes the problem. My guess is that the parent and child are both
racing to exit?

Does that make any sense?

Thanks,
-- Marco
Peter Zijlstra March 29, 2021, 11:50 a.m. UTC | #13
On Thu, Mar 25, 2021 at 08:10:51PM +0100, Marco Elver wrote:

> and therefore synthesized this fix on top:
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 57de8d436efd..e77294c7e654 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -12400,7 +12400,7 @@ static void sync_child_event(struct perf_event *child_event)
>  	if (child_event->attr.inherit_stat) {
>  		struct task_struct *task = child_event->ctx->task;
>  
> -		if (task)
> +		if (task && task != TASK_TOMBSTONE)
>  			perf_event_read_event(child_event, task);
>  	}
>  
> which fixes the problem. My guess is that the parent and child are both
> racing to exit?
>
> Does that make any sense?

Yes, I think it does. ACK