diff mbox series

tracing: Correct the length check which causes memory corruption

Message ID 20210607125734.1770447-1-liangyan.peng@linux.alibaba.com
State Accepted
Commit 3e08a9f9760f4a70d633c328a76408e62d6f80a3
Headers show
Series tracing: Correct the length check which causes memory corruption | expand

Commit Message

Liangyan June 7, 2021, 12:57 p.m. UTC
We've suffered from severe kernel crashes due to memory corruption on
our production environment, like,

Call Trace:
[1640542.554277] general protection fault: 0000 [#1] SMP PTI
[1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G
[1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190
[1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286
[1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX:
0000000006e931bf
[1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI:
ffff9a45ff004300
[1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09:
0000000000000000
[1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12:
ffffffff9a20608d
[1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15:
696c662f65636976
[1640542.563128] FS:  00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000)
knlGS:0000000000000000
[1640542.563937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4:
00000000003606e0
[1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1640542.566742] Call Trace:
[1640542.567009]  anon_vma_clone+0x5d/0x170
[1640542.567417]  __split_vma+0x91/0x1a0
[1640542.567777]  do_munmap+0x2c6/0x320
[1640542.568128]  vm_munmap+0x54/0x70
[1640542.569990]  __x64_sys_munmap+0x22/0x30
[1640542.572005]  do_syscall_64+0x5b/0x1b0
[1640542.573724]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[1640542.575642] RIP: 0033:0x7f45d6e61e27

James Wang has reproduced it stably on the latest 4.19 LTS.
After some debugging, we finally proved that it's due to ftrace
buffer out-of-bound access using a debug tool as follows:
[   86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000
[   86.780806]  no_context+0xdf/0x3c0
[   86.784327]  __do_page_fault+0x252/0x470
[   86.788367]  do_page_fault+0x32/0x140
[   86.792145]  page_fault+0x1e/0x30
[   86.795576]  strncpy_from_unsafe+0x66/0xb0
[   86.799789]  fetch_memory_string+0x25/0x40
[   86.804002]  fetch_deref_string+0x51/0x60
[   86.808134]  kprobe_trace_func+0x32d/0x3a0
[   86.812347]  kprobe_dispatcher+0x45/0x50
[   86.816385]  kprobe_ftrace_handler+0x90/0xf0
[   86.820779]  ftrace_ops_assist_func+0xa1/0x140
[   86.825340]  0xffffffffc00750bf
[   86.828603]  do_sys_open+0x5/0x1f0
[   86.832124]  do_syscall_64+0x5b/0x1b0
[   86.835900]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

commit b220c049d519 ("tracing: Check length before giving out
the filter buffer") adds length check to protect trace data
overflow introduced in 0fc1b09ff1ff, seems that this fix can't prevent
overflow entirely, the length check should also take the sizeof
entry->array[0] into account, since this array[0] is filled the
length of trace data and occupy addtional space and risk overflow.

Cc: stable@vger.kernel.org
Fixes: b220c049d519 ("tracing: Check length before giving out the filter buffer")
Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: yinbinbin <yinbinbin@alibabacloud.com>
Reviewed-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Tested-by: James Wang <jnwang@linux.alibaba.com>
Cc: Xunlei Pang <xlpang@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Linus Torvalds June 9, 2021, 8:08 p.m. UTC | #1
Steven?

On Mon, Jun 7, 2021 at 6:46 AM James Wang <jnwang@linux.alibaba.com> wrote:
>

> >

> > James Wang has reproduced it stably on the latest 4.19 LTS.

> > After some debugging, we finally proved that it's due to ftrace

> > buffer out-of-bound access using a debug tool as follows:

[..]

Looks about right:

> > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c

> > index a21ef9cd2aae..9299057feb56 100644

> > --- a/kernel/trace/trace.c

> > +++ b/kernel/trace/trace.c

> > @@ -2736,7 +2736,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,

> >           (entry = this_cpu_read(trace_buffered_event))) {

> >               /* Try to use the per cpu buffer first */

> >               val = this_cpu_inc_return(trace_buffered_event_cnt);

> > -             if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) {

> > +             if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) {

> >                       trace_event_setup(entry, type, trace_ctx);

> >                       entry->array[0] = len;

> >                       return entry;


I have to say that I don't love that code. Not before, and not with the fix.

That "sizeof(*entry)" is clearly wrong, because it doesn't take the
unsized array into account.

But adding the sizeof() for a single array entry doesn't make that
already unreadable and buggy code much more readable.

It would probably be better to use "struct_size(entry, buffer, 1)"
instead, and I think it would be good to just split things up a bit to
be more legibe:

        unsigned long max_len = PAGE_SIZE - struct_size(entry, array, 1);

        if (val == 1 && len < max_len && val == 1) {
                trace_event_setup(entry, type, trace_ctx);
                ..

instead.

However, I have a few questions:

 - why "len < max_offset" rather than "<="?

 - why don't we check the length before we even try to reserve that
percpu buffer with the expensive atomic this_cpu_inc_return()?

 - is the size of that array guaranteed to always be 1? If so, why is
it unsized? Why is it an array at all?

 - clearly the array{} size must not be guaranteed to be 1, but why a
size of 1 then always sufficient here? Clearly a size of 1 is the
minimum required since we do that

        entry->array[0] = len;

   and thus use one entry, but what is it that makes it ok that it
really is just one entry?

Steven, please excuse the above stupid questions of mine, but that
code looks really odd.

               Linus
Steven Rostedt June 9, 2021, 8:51 p.m. UTC | #2
On Wed, 9 Jun 2021 13:08:34 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > --- a/kernel/trace/trace.c

> > > +++ b/kernel/trace/trace.c

> > > @@ -2736,7 +2736,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,

> > >           (entry = this_cpu_read(trace_buffered_event))) {

> > >               /* Try to use the per cpu buffer first */

> > >               val = this_cpu_inc_return(trace_buffered_event_cnt);

> > > -             if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) {

> > > +             if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) {

> > >                       trace_event_setup(entry, type, trace_ctx);

> > >                       entry->array[0] = len;

> > >                       return entry;  

> 

> I have to say that I don't love that code. Not before, and not with the fix.


That's OK, neither do I.

> 

> That "sizeof(*entry)" is clearly wrong, because it doesn't take the

> unsized array into account.


Correct. That's because I forgot that the structure has that empty array :-(

> 

> But adding the sizeof() for a single array entry doesn't make that

> already unreadable and buggy code much more readable.


I wont argue that.

> 

> It would probably be better to use "struct_size(entry, buffer, 1)"

> instead, and I think it would be good to just split things up a bit to

> be more legibe:


I keep forgetting about that struct_size() macro. It would definitely
be an improvement.

> 

>         unsigned long max_len = PAGE_SIZE - struct_size(entry, array, 1);

> 

>         if (val == 1 && len < max_len && val == 1) {

>                 trace_event_setup(entry, type, trace_ctx);

>                 ..

> 

> instead.


Again, no arguments from me.

> 

> However, I have a few questions:

> 

>  - why "len < max_offset" rather than "<="?


Probably because I sided on the error of being off by one with extra
remaining.

> 

>  - why don't we check the length before we even try to reserve that

> percpu buffer with the expensive atomic this_cpu_inc_return()?


because it seldom hits that length (if ever), and the val != 1 is more
likely to be true than the overflow. I thought about doing the len
compare first, but that failing basically only happens on extreme
anomalies (you have to try hard to make the event bigger than a page).
Not to mention, if it is that big, it will fail to record anyway,
because the ring buffer currently only allows events less than a page
in size. The reason this doesn't fail outright when it is that big, is
because there's nothing in the design of the ring buffer that doesn't
let you make the chunks be bigger than a page, and I didn't want to
couple this code with that current requirement. That requirement is
only because the chunks of the ring buffer are currently defined as
PAGE_SIZE and that also means it wont accept anything bigger than a
page.

> 

>  - is the size of that array guaranteed to always be 1? If so, why is

> it unsized? Why is it an array at all?


It's the way the ring buffer event works:

(documented at the top of include/linux/ring_buffer.h)

That structure has a type_len field that is 5 bits (for compaction).

If that is 29, 30, or 31, then it is a special event (padding, time
extend or time stamp respectively). But if it has 1-28 in it, it
represents the size (in 4 byte increments). If the data on the event is
112 bytes (4*28 = 112) or less (which most events are). then the size
of the data load starts right at the array field.

struct ring_buffer_event {
	u32 type_len:5 time_delta:27;
	u32 array[0] <- data load starts here.
};

But if the data load is bigger than 112 bytes, then type_len is zero
and the array[0] holds that length, and the data starts after it:

struct ring_buffer_event {
	u32 type_len:5 time_delta:27;
	u32 array[0] <- holds the length of data
	data[]; < data starts here.
};


> 

>  - clearly the array{} size must not be guaranteed to be 1, but why a

> size of 1 then always sufficient here? Clearly a size of 1 is the

> minimum required since we do that

> 

>         entry->array[0] = len;

> 

>    and thus use one entry, but what is it that makes it ok that it

> really is just one entry?


This is called when filtering is on, and that we are likely to discard
events instead of keeping them. In this case, it is more expensive to
allocate on the ring buffer directly to only discard it from the ring
buffer, as discarding requires several atomic operations and creates a
visible overhead. Since filtering events is suppose to make it faster,
this discard actually slows down the trace.

To counter that, I create a per cpu PAGE_SIZE temp buffer to write the
event to, such that the event from the trace_event code will write to
this instead of directly into the buffer. On the commit of the event,
the filtering takes place to test the fields that were written to, and
if they don't match, then the buffer is simply discarded (no overhead).
If the filter does match, then it will then copy this into the ring
buffer properly.

This was added years after the tracing code was written, and to make it
work, the temp buffer had to simulate a ring_buffer_event. To do so,
the temp buffer wrote zero into the type_len field, and used array[0]
for the length, even if the size of the event is 112 bytes or less.
That's because the logic to read the size only checked that type_len
field to know if it should read the array[] field or not.

Thus, for this temp buffer, the array[0] is always going to hold the
length of the event.

Note, if the len is too big, or an interrupt came in while another
event was using the per cpu temp buffer, it just goes back to using the
ring buffer directly, and discarding if it fails the filter. Yes, it is
slower, but that case doesn't happen often.

> 

> Steven, please excuse the above stupid questions of mine, but that

> code looks really odd.


Not stupid questions at all. It is odd due to how it grew.

But I hope I explained it better.

That said, even though I'm almost done with my tests on this patch, and
was going to send you a pull request later today, I can hold off and
have Liangyan send a v2 using struct_size and with the clean ups you
suggested.

-- Steve
Linus Torvalds June 9, 2021, 8:55 p.m. UTC | #3
On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>

> That said, even though I'm almost done with my tests on this patch, and

> was going to send you a pull request later today, I can hold off and

> have Liangyan send a v2 using struct_size and with the clean ups you

> suggested.


I think the cleanups could be separate. I in no way _hate_ Liangyan's
patch, it's just that when looking at it I just went "that code is
very confusing".

So I'm certainly also ok with just getting that ugly fix, and then
hope for a cleanup later. Maybe with a comment or two.

Thanks,
            Linus
Steven Rostedt June 9, 2021, 9:11 p.m. UTC | #4
On Wed, 9 Jun 2021 13:55:04 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I think the cleanups could be separate. I in no way _hate_ Liangyan's

> patch, it's just that when looking at it I just went "that code is

> very confusing".


OK, I'll do the clean up now for the next merge window.

> 

> So I'm certainly also ok with just getting that ugly fix, and then

> hope for a cleanup later. Maybe with a comment or two.


My tests are currently at the "allyesconfig" portion, and has other
tests to finish up after that. So I expect it will be done in a few
hours, and if there's no issues, I'll send it to you after it is
completed.

Thanks,

-- Steve
Linus Torvalds June 9, 2021, 9:43 p.m. UTC | #5
On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >

> > That "sizeof(*entry)" is clearly wrong, because it doesn't take the

> > unsized array into account.

>

> Correct. That's because I forgot that the structure has that empty array :-(


Note that 'sparse' does have the option to warn about odd flexible
array uses. Including 'sizeof()'.

You can do something like

    CF='-Wflexible-array-sizeof' make C=2 kernel/trace/trace.o

and you'll see

  kernel/trace/trace.c:1022:17: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:2645:17: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:2739:41: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:3290:16: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:3350:16: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:6989:16: warning: using sizeof on a flexible structure
  kernel/trace/trace.c:7070:16: warning: using sizeof on a flexible structure

and I suspect every single one of those should be using
'struct_size()' instead for a sizeof() on the base structure plus some
manual arithmetic (or, as in the case of this bug, _without_ the extra
arithmetic).

And yeah, it isn't just the tracing code that does this. We have it
all over, so that sparse check isn't on by default. Sparse is pretty
darn noisy even without it, but it can be worth using that
CF='-Wflexible-array-sizeof' on individual files that you want to
check.

               Linus
Steven Rostedt June 9, 2021, 10:18 p.m. UTC | #6
On Wed, 9 Jun 2021 14:43:51 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote:

> > >

> > > That "sizeof(*entry)" is clearly wrong, because it doesn't take the

> > > unsized array into account.  

> >

> > Correct. That's because I forgot that the structure has that empty array :-(  

> 

> Note that 'sparse' does have the option to warn about odd flexible

> array uses. Including 'sizeof()'.

> 

> You can do something like

> 

>     CF='-Wflexible-array-sizeof' make C=2 kernel/trace/trace.o

> 

> and you'll see

> 

>   kernel/trace/trace.c:1022:17: warning: using sizeof on a flexible structure


	alloc = sizeof(*entry) + size + 2; /* possible \n added */

Entry is created via a macro. But I guess that too could use a struct_size().

>   kernel/trace/trace.c:2645:17: warning: using sizeof on a flexible structure


		memset(event, 0, sizeof(*event));

Hah, this is the part that is clearing the first part of the
ring_buffer_event structure of that temp buffer. Where it's setting
"type_len" to zero. It doesn't care about the extended portion here.

>   kernel/trace/trace.c:2739:41: warning: using sizeof on a flexible structure


This is the code we are talking about now.

>   kernel/trace/trace.c:3290:16: warning: using sizeof on a flexible structure


This is like the first one. A macro created the structure, but probably
could be switched over.

>   kernel/trace/trace.c:3350:16: warning: using sizeof on a flexible structure


Same as the first occurrence.

>   kernel/trace/trace.c:6989:16: warning: using sizeof on a flexible structure


Also the same as the first.

>   kernel/trace/trace.c:7070:16: warning: using sizeof on a flexible structure


And same as the first.

> 

> and I suspect every single one of those should be using

> 'struct_size()' instead for a sizeof() on the base structure plus some

> manual arithmetic (or, as in the case of this bug, _without_ the extra

> arithmetic).


Most of the above are for structures that hold dynamic size strings.
But I have no problem converting them to struct_size. But since they
are created by FTRACE_ENTRY*() macros in kernel/trace/trace_entries.h,
we need to be careful in picking the field to use.

> 

> And yeah, it isn't just the tracing code that does this. We have it

> all over, so that sparse check isn't on by default. Sparse is pretty

> darn noisy even without it, but it can be worth using that

> CF='-Wflexible-array-sizeof' on individual files that you want to

> check.


Could be a task for an intern to do?

-- Steve
diff mbox series

Patch

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index a21ef9cd2aae..9299057feb56 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2736,7 +2736,7 @@  trace_event_buffer_lock_reserve(struct trace_buffer **current_rb,
 	    (entry = this_cpu_read(trace_buffered_event))) {
 		/* Try to use the per cpu buffer first */
 		val = this_cpu_inc_return(trace_buffered_event_cnt);
-		if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) {
+		if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) {
 			trace_event_setup(entry, type, trace_ctx);
 			entry->array[0] = len;
 			return entry;