Message ID | 20210607125734.1770447-1-liangyan.peng@linux.alibaba.com |
---|---|
State | Accepted |
Commit | 3e08a9f9760f4a70d633c328a76408e62d6f80a3 |
Headers | show |
Series | tracing: Correct the length check which causes memory corruption | expand |
Steven? On Mon, Jun 7, 2021 at 6:46 AM James Wang <jnwang@linux.alibaba.com> wrote: > > > > > James Wang has reproduced it stably on the latest 4.19 LTS. > > After some debugging, we finally proved that it's due to ftrace > > buffer out-of-bound access using a debug tool as follows: [..] Looks about right: > > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > > index a21ef9cd2aae..9299057feb56 100644 > > --- a/kernel/trace/trace.c > > +++ b/kernel/trace/trace.c > > @@ -2736,7 +2736,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, > > (entry = this_cpu_read(trace_buffered_event))) { > > /* Try to use the per cpu buffer first */ > > val = this_cpu_inc_return(trace_buffered_event_cnt); > > - if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) { > > + if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) { > > trace_event_setup(entry, type, trace_ctx); > > entry->array[0] = len; > > return entry; I have to say that I don't love that code. Not before, and not with the fix. That "sizeof(*entry)" is clearly wrong, because it doesn't take the unsized array into account. But adding the sizeof() for a single array entry doesn't make that already unreadable and buggy code much more readable. It would probably be better to use "struct_size(entry, buffer, 1)" instead, and I think it would be good to just split things up a bit to be more legibe: unsigned long max_len = PAGE_SIZE - struct_size(entry, array, 1); if (val == 1 && len < max_len && val == 1) { trace_event_setup(entry, type, trace_ctx); .. instead. However, I have a few questions: - why "len < max_offset" rather than "<="? - why don't we check the length before we even try to reserve that percpu buffer with the expensive atomic this_cpu_inc_return()? - is the size of that array guaranteed to always be 1? If so, why is it unsized? Why is it an array at all? - clearly the array{} size must not be guaranteed to be 1, but why a size of 1 then always sufficient here? Clearly a size of 1 is the minimum required since we do that entry->array[0] = len; and thus use one entry, but what is it that makes it ok that it really is just one entry? Steven, please excuse the above stupid questions of mine, but that code looks really odd. Linus
On Wed, 9 Jun 2021 13:08:34 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > --- a/kernel/trace/trace.c > > > +++ b/kernel/trace/trace.c > > > @@ -2736,7 +2736,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, > > > (entry = this_cpu_read(trace_buffered_event))) { > > > /* Try to use the per cpu buffer first */ > > > val = this_cpu_inc_return(trace_buffered_event_cnt); > > > - if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) { > > > + if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) { > > > trace_event_setup(entry, type, trace_ctx); > > > entry->array[0] = len; > > > return entry; > > I have to say that I don't love that code. Not before, and not with the fix. That's OK, neither do I. > > That "sizeof(*entry)" is clearly wrong, because it doesn't take the > unsized array into account. Correct. That's because I forgot that the structure has that empty array :-( > > But adding the sizeof() for a single array entry doesn't make that > already unreadable and buggy code much more readable. I wont argue that. > > It would probably be better to use "struct_size(entry, buffer, 1)" > instead, and I think it would be good to just split things up a bit to > be more legibe: I keep forgetting about that struct_size() macro. It would definitely be an improvement. > > unsigned long max_len = PAGE_SIZE - struct_size(entry, array, 1); > > if (val == 1 && len < max_len && val == 1) { > trace_event_setup(entry, type, trace_ctx); > .. > > instead. Again, no arguments from me. > > However, I have a few questions: > > - why "len < max_offset" rather than "<="? Probably because I sided on the error of being off by one with extra remaining. > > - why don't we check the length before we even try to reserve that > percpu buffer with the expensive atomic this_cpu_inc_return()? because it seldom hits that length (if ever), and the val != 1 is more likely to be true than the overflow. I thought about doing the len compare first, but that failing basically only happens on extreme anomalies (you have to try hard to make the event bigger than a page). Not to mention, if it is that big, it will fail to record anyway, because the ring buffer currently only allows events less than a page in size. The reason this doesn't fail outright when it is that big, is because there's nothing in the design of the ring buffer that doesn't let you make the chunks be bigger than a page, and I didn't want to couple this code with that current requirement. That requirement is only because the chunks of the ring buffer are currently defined as PAGE_SIZE and that also means it wont accept anything bigger than a page. > > - is the size of that array guaranteed to always be 1? If so, why is > it unsized? Why is it an array at all? It's the way the ring buffer event works: (documented at the top of include/linux/ring_buffer.h) That structure has a type_len field that is 5 bits (for compaction). If that is 29, 30, or 31, then it is a special event (padding, time extend or time stamp respectively). But if it has 1-28 in it, it represents the size (in 4 byte increments). If the data on the event is 112 bytes (4*28 = 112) or less (which most events are). then the size of the data load starts right at the array field. struct ring_buffer_event { u32 type_len:5 time_delta:27; u32 array[0] <- data load starts here. }; But if the data load is bigger than 112 bytes, then type_len is zero and the array[0] holds that length, and the data starts after it: struct ring_buffer_event { u32 type_len:5 time_delta:27; u32 array[0] <- holds the length of data data[]; < data starts here. }; > > - clearly the array{} size must not be guaranteed to be 1, but why a > size of 1 then always sufficient here? Clearly a size of 1 is the > minimum required since we do that > > entry->array[0] = len; > > and thus use one entry, but what is it that makes it ok that it > really is just one entry? This is called when filtering is on, and that we are likely to discard events instead of keeping them. In this case, it is more expensive to allocate on the ring buffer directly to only discard it from the ring buffer, as discarding requires several atomic operations and creates a visible overhead. Since filtering events is suppose to make it faster, this discard actually slows down the trace. To counter that, I create a per cpu PAGE_SIZE temp buffer to write the event to, such that the event from the trace_event code will write to this instead of directly into the buffer. On the commit of the event, the filtering takes place to test the fields that were written to, and if they don't match, then the buffer is simply discarded (no overhead). If the filter does match, then it will then copy this into the ring buffer properly. This was added years after the tracing code was written, and to make it work, the temp buffer had to simulate a ring_buffer_event. To do so, the temp buffer wrote zero into the type_len field, and used array[0] for the length, even if the size of the event is 112 bytes or less. That's because the logic to read the size only checked that type_len field to know if it should read the array[] field or not. Thus, for this temp buffer, the array[0] is always going to hold the length of the event. Note, if the len is too big, or an interrupt came in while another event was using the per cpu temp buffer, it just goes back to using the ring buffer directly, and discarding if it fails the filter. Yes, it is slower, but that case doesn't happen often. > > Steven, please excuse the above stupid questions of mine, but that > code looks really odd. Not stupid questions at all. It is odd due to how it grew. But I hope I explained it better. That said, even though I'm almost done with my tests on this patch, and was going to send you a pull request later today, I can hold off and have Liangyan send a v2 using struct_size and with the clean ups you suggested. -- Steve
On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > That said, even though I'm almost done with my tests on this patch, and > was going to send you a pull request later today, I can hold off and > have Liangyan send a v2 using struct_size and with the clean ups you > suggested. I think the cleanups could be separate. I in no way _hate_ Liangyan's patch, it's just that when looking at it I just went "that code is very confusing". So I'm certainly also ok with just getting that ugly fix, and then hope for a cleanup later. Maybe with a comment or two. Thanks, Linus
On Wed, 9 Jun 2021 13:55:04 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > I think the cleanups could be separate. I in no way _hate_ Liangyan's > patch, it's just that when looking at it I just went "that code is > very confusing". OK, I'll do the clean up now for the next merge window. > > So I'm certainly also ok with just getting that ugly fix, and then > hope for a cleanup later. Maybe with a comment or two. My tests are currently at the "allyesconfig" portion, and has other tests to finish up after that. So I expect it will be done in a few hours, and if there's no issues, I'll send it to you after it is completed. Thanks, -- Steve
On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > That "sizeof(*entry)" is clearly wrong, because it doesn't take the > > unsized array into account. > > Correct. That's because I forgot that the structure has that empty array :-( Note that 'sparse' does have the option to warn about odd flexible array uses. Including 'sizeof()'. You can do something like CF='-Wflexible-array-sizeof' make C=2 kernel/trace/trace.o and you'll see kernel/trace/trace.c:1022:17: warning: using sizeof on a flexible structure kernel/trace/trace.c:2645:17: warning: using sizeof on a flexible structure kernel/trace/trace.c:2739:41: warning: using sizeof on a flexible structure kernel/trace/trace.c:3290:16: warning: using sizeof on a flexible structure kernel/trace/trace.c:3350:16: warning: using sizeof on a flexible structure kernel/trace/trace.c:6989:16: warning: using sizeof on a flexible structure kernel/trace/trace.c:7070:16: warning: using sizeof on a flexible structure and I suspect every single one of those should be using 'struct_size()' instead for a sizeof() on the base structure plus some manual arithmetic (or, as in the case of this bug, _without_ the extra arithmetic). And yeah, it isn't just the tracing code that does this. We have it all over, so that sparse check isn't on by default. Sparse is pretty darn noisy even without it, but it can be worth using that CF='-Wflexible-array-sizeof' on individual files that you want to check. Linus
On Wed, 9 Jun 2021 14:43:51 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, Jun 9, 2021 at 1:52 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > That "sizeof(*entry)" is clearly wrong, because it doesn't take the > > > unsized array into account. > > > > Correct. That's because I forgot that the structure has that empty array :-( > > Note that 'sparse' does have the option to warn about odd flexible > array uses. Including 'sizeof()'. > > You can do something like > > CF='-Wflexible-array-sizeof' make C=2 kernel/trace/trace.o > > and you'll see > > kernel/trace/trace.c:1022:17: warning: using sizeof on a flexible structure alloc = sizeof(*entry) + size + 2; /* possible \n added */ Entry is created via a macro. But I guess that too could use a struct_size(). > kernel/trace/trace.c:2645:17: warning: using sizeof on a flexible structure memset(event, 0, sizeof(*event)); Hah, this is the part that is clearing the first part of the ring_buffer_event structure of that temp buffer. Where it's setting "type_len" to zero. It doesn't care about the extended portion here. > kernel/trace/trace.c:2739:41: warning: using sizeof on a flexible structure This is the code we are talking about now. > kernel/trace/trace.c:3290:16: warning: using sizeof on a flexible structure This is like the first one. A macro created the structure, but probably could be switched over. > kernel/trace/trace.c:3350:16: warning: using sizeof on a flexible structure Same as the first occurrence. > kernel/trace/trace.c:6989:16: warning: using sizeof on a flexible structure Also the same as the first. > kernel/trace/trace.c:7070:16: warning: using sizeof on a flexible structure And same as the first. > > and I suspect every single one of those should be using > 'struct_size()' instead for a sizeof() on the base structure plus some > manual arithmetic (or, as in the case of this bug, _without_ the extra > arithmetic). Most of the above are for structures that hold dynamic size strings. But I have no problem converting them to struct_size. But since they are created by FTRACE_ENTRY*() macros in kernel/trace/trace_entries.h, we need to be careful in picking the field to use. > > And yeah, it isn't just the tracing code that does this. We have it > all over, so that sparse check isn't on by default. Sparse is pretty > darn noisy even without it, but it can be worth using that > CF='-Wflexible-array-sizeof' on individual files that you want to > check. Could be a task for an intern to do? -- Steve
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index a21ef9cd2aae..9299057feb56 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2736,7 +2736,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, (entry = this_cpu_read(trace_buffered_event))) { /* Try to use the per cpu buffer first */ val = this_cpu_inc_return(trace_buffered_event_cnt); - if ((len < (PAGE_SIZE - sizeof(*entry))) && val == 1) { + if ((len < (PAGE_SIZE - sizeof(*entry) - sizeof(entry->array[0]))) && val == 1) { trace_event_setup(entry, type, trace_ctx); entry->array[0] = len; return entry;