Message ID | 20221105060024.598488967@goodmis.org |
---|---|
Headers | show |
Series | timers: Use timer_shutdown*() before freeing timers | expand |
On Sat, Nov 05, 2022 at 02:00:24AM -0400, Steven Rostedt wrote: > > Back in April, I posted an RFC patch set to help mitigate a common issue > where a timer gets armed just before it is freed, and when the timer > goes off, it crashes in the timer code without any evidence of who the > culprit was. I got side tracked and never finished up on that patch set. > Since this type of crash is still our #1 crash we are seeing in the field, > it has become a priority again to finish it. > > The last version of that patch set is here: > > https://lore.kernel.org/all/20221104054053.431922658@goodmis.org/ > > I'm calling this version 4a as it only has obvious changes were the timer that > is being shutdown is in the same function where it will be freed or released, > as this series should be "safe" for adding. I'll be calling the other patches > 4b for the next merge window. > Just in case you didn't notice: Looking through the resulting code, I think some of the remaining calls to del_singleshot_timer_sync() can be converted as well. The calls in drivers/staging/wlan-ng/prism2usb.c:prism2sta_disconnect_usb() are obvious (the containing data structure is freed in the same function). For drivers/char/tpm/tpm-dev-common.c:tpm_common_release(), the containing data structure is freed in the calling code. Thanks, Guenter
On Fri, Nov 4, 2022 at 11:01 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > Patch 1 fixes an issue with sunrpc/xprt where it incorrectly uses > del_singleshot_timer_sync() for something that is not a oneshot timer. As this > will be converted to shutdown, this needs to be fixed first. So this is the kind of thing that I would *not* want to get eartly. I really would want to get just the infrastructure in to let people start doing conversions. And then the "mindlessly obvious patches that are done by scripting and can not possibly matter". The kinds that do not *need* review, because they are mechanical, and that just cause pointless noise for the rest of the patches that *do* want review. Not this kind of thing that is so subtle that you have to explain it. That's not a "scripted patch for no semantic change". So leave the del_singleshot_timer_sync() cases alone, they are irrelevant for the new infrastructure and for the "mindless scripted conversion" patches. > Patches 2-4 changes existing timer_shutdown() functions used locally in ARM and > some drivers to better namespace names. Ok, these are relevant. > Patch 5 implements the new timer_shutdown() and timer_shutdown_sync() functions > that disable re-arming the timer after they are called. This is obviously what I'd want early so that people can start doign this in their trees. > Patches 6-28 change all the locations where there's a kfree(), kfree_rcu(), > kmem_cache_free() and one call_rcu() call where the RCU function frees the > timer (the workqueue patch) in the same function as the del_timer{,_sync}() is > called on that timer, and there's no extra exit path between the del_timer and > freeing of the timer. So honestly, I was literally hoping for a "this is the coccinelle script" kind of patch. Now there seems to be a number of patches here that are actualyl really hard to see that they are "obviously correct" and I can't tell if they are actually scripted or not. They don't *look* scripted, but I can't really tell. I looked at the patches with ten lines of context, and I didn't see the immediately following kfree() even in that expanded patch context, so it's fairly far away. Others in the series were *definitely* not scripted, doing clearly manual cleanups: - if (dch->timer.function) { - del_timer(&dch->timer); - dch->timer.function = NULL; - } + timer_shutdown(&dch->timer); so no, this does *not* make me feel "ok, this is all trivial". IOW, I'd really want *just* the infrastructure and *just* the provably trivial stuff. If it wasn't some scripted really obvious thing that cannot possibly change anything and that wasn't then edited manually for some reason, I really don't want it early. IOW, any early conversions I'd take are literally about removing pure mindless noise. Not about doing conversions. And I wouldn't mind it as a single conversion patch that has the coccinelle script as the explanation. Really just THAT kind of "100% mindless conversion". Linus
On Sat, 5 Nov 2022 08:59:36 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Nov 4, 2022 at 11:01 PM Steven Rostedt <rostedt@goodmis.org> wrote: > > > > Patch 1 fixes an issue with sunrpc/xprt where it incorrectly uses > > del_singleshot_timer_sync() for something that is not a oneshot timer. As this > > will be converted to shutdown, this needs to be fixed first. > > So this is the kind of thing that I would *not* want to get eartly. So I'll have to break up patch 5 to not update the del_singleshot_timer_sync() to a timer_shutdown_sync(), because that breaks this code. Hmm, since that is a functional change, it probably should wait till the merge window. I'll move this patch and that part of patch 5 to the second part of the series for the merge window. > > I really would want to get just the infrastructure in to let people > start doing conversions. > > And then the "mindlessly obvious patches that are done by scripting > and can not possibly matter". > > The kinds that do not *need* review, because they are mechanical, and > that just cause pointless noise for the rest of the patches that *do* > want review. > > Not this kind of thing that is so subtle that you have to explain it. > That's not a "scripted patch for no semantic change". > > So leave the del_singleshot_timer_sync() cases alone, they are > irrelevant for the new infrastructure and for the "mindless scripted > conversion" patches. > > > Patches 2-4 changes existing timer_shutdown() functions used locally in ARM and > > some drivers to better namespace names. > > Ok, these are relevant. > > > Patch 5 implements the new timer_shutdown() and timer_shutdown_sync() functions > > that disable re-arming the timer after they are called. > > This is obviously what I'd want early so that people can start doign > this in their trees. But will need to remove the part that it changes del_singleshot_timer_sync(). > > > Patches 6-28 change all the locations where there's a kfree(), kfree_rcu(), > > kmem_cache_free() and one call_rcu() call where the RCU function frees the > > timer (the workqueue patch) in the same function as the del_timer{,_sync}() is > > called on that timer, and there's no extra exit path between the del_timer and > > freeing of the timer. > > So honestly, I was literally hoping for a "this is the coccinelle > script" kind of patch. The above actual was, but I walked through them manually too, because I don't trust my conccinelle skills. All but the call_rcu() one was caught by conccinelle. That's why I pointed out the worqueue one. I'll remove that from this series. > > Now there seems to be a number of patches here that are actualyl > really hard to see that they are "obviously correct" and I can't tell > if they are actually scripted or not. Yes they are. The script that found these were: ----------------------8<------------------------ @@ identifier ptr, timer, rfield, slab; @@ ( - del_timer(&ptr->timer); + timer_shutdown(&ptr->timer); | - del_timer_sync(&ptr->timer); + timer_shutdown_sync(&ptr->timer); ) ... ( kfree_rcu(ptr, rfield); | kmem_cache_free(slab, ptr); | kfree(ptr); ) ---------------------->8------------------------ So any function that had a del_timer*(&obj->timer) and then that obj was freed with kfree(), kfree_rcu() or kmem_cache_free() was updated. What I did manually was to make sure there was no exit of the routine between those two calls. I'm sure coccinelle could do that too, but I'm not good enough at it to add that feature. The reason the patches don't look obvious is because the distance between the del_timer() and the free may be quite far. I walked through these patches at least 3 times manually to make sure they are all OK. > > They don't *look* scripted, but I can't really tell. I looked at the > patches with ten lines of context, and I didn't see the immediately > following kfree() even in that expanded patch context, so it's fairly > far away. Yes, some are like a 100 lines away. > > Others in the series were *definitely* not scripted, doing clearly > manual cleanups: > > - if (dch->timer.function) { > - del_timer(&dch->timer); > - dch->timer.function = NULL; > - } > + timer_shutdown(&dch->timer); > > so no, this does *not* make me feel "ok, this is all trivial". Sorry, I'll remove that. It's basically open-coding the timer_shutdown() as the way it shuts down the timer is simply by setting the timer.function to NULL. > > IOW, I'd really want *just* the infrastructure and *just* the provably > trivial stuff. If it wasn't some scripted really obvious thing that > cannot possibly change anything and that wasn't then edited manually > for some reason, I really don't want it early. > > IOW, any early conversions I'd take are literally about removing pure > mindless noise. Not about doing conversions. > > And I wouldn't mind it as a single conversion patch that has the > coccinelle script as the explanation. I'll need to update the coccinelle script (or ask someone to give me a fix) that catches the case of: del_timer(&obj->timer); if (x) goto out; kfree(obj); out: return; I'm sure it's a trivial change. I'll look into it some more. I'm guessing you don't care about the case of: del_timer(&obj->timer); if (x) goto label; label: kfree(obj); As that's a bit more complex if we avoid the first goto case? Even though the second case is obviously correct. I believe both of these cases exist in the kernel. I manually removed the places that my script found for the first case. > > Really just THAT kind of "100% mindless conversion". I'll look at making the most obviously correct case, where del_timer and kfree have no goto or returns between them. We can always add the rest in the merge window. -- Steve
On Sat, 5 Nov 2022 08:59:36 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Others in the series were *definitely* not scripted, doing clearly > manual cleanups: > > - if (dch->timer.function) { > - del_timer(&dch->timer); > - dch->timer.function = NULL; > - } > + timer_shutdown(&dch->timer); > > so no, this does *not* make me feel "ok, this is all trivial". I just ran the script and the above code turned to: diff --git a/drivers/isdn/hardware/mISDN/hfcmulti.c b/drivers/isdn/hardware/mISDN/hfcmulti.c index 4f7eaa17fb27..2695bbde52db 100644 --- a/drivers/isdn/hardware/mISDN/hfcmulti.c +++ b/drivers/isdn/hardware/mISDN/hfcmulti.c @@ -4544,7 +4544,7 @@ release_port(struct hfc_multi *hc, struct dchannel *dch) spin_lock_irqsave(&hc->lock, flags); if (dch->timer.function) { - del_timer(&dch->timer); + timer_shutdown(&dch->timer); dch->timer.function = NULL; } Which is silly. Because timer_shutdown() makes timer.function = NULL. That's why I changed it. And it really shouldn't be touching timer.function anyway. -- Steve
On Sat, 5 Nov 2022 14:03:56 -0400 Steven Rostedt <rostedt@goodmis.org> wrote: > --- a/drivers/isdn/hardware/mISDN/hfcmulti.c > +++ b/drivers/isdn/hardware/mISDN/hfcmulti.c > @@ -4544,7 +4544,7 @@ release_port(struct hfc_multi *hc, struct dchannel *dch) > spin_lock_irqsave(&hc->lock, flags); > > if (dch->timer.function) { > - del_timer(&dch->timer); > + timer_shutdown(&dch->timer); > dch->timer.function = NULL; > } > I still hate the above. -- Steve
On Sat, Nov 05, 2022 at 02:00:24AM -0400, Steven Rostedt wrote: > > Back in April, I posted an RFC patch set to help mitigate a common issue > where a timer gets armed just before it is freed, and when the timer > goes off, it crashes in the timer code without any evidence of who the > culprit was. I got side tracked and never finished up on that patch set. > Since this type of crash is still our #1 crash we are seeing in the field, > it has become a priority again to finish it. > > The last version of that patch set is here: > > https://lore.kernel.org/all/20221104054053.431922658@goodmis.org/ > > I'm calling this version 4a as it only has obvious changes were the timer that > is being shutdown is in the same function where it will be freed or released, > as this series should be "safe" for adding. I'll be calling the other patches > 4b for the next merge window. > For the series, as far as my testbed goes: Build results: total: 152 pass: 152 fail: 0 Qemu test results: total: 500 pass: 500 fail: 0 No runtime crashes or warnings observed. Tested-by: Guenter Roeck <linux@roeck-us.net> Guenter
On Sat, Nov 5, 2022 at 2:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Something that might help here is changing the `...` into > `... when exists` or into `... when != ptr` or similar. I actually tried that. You don't want "when exists", you'd want "when forall", but that seems to be the default. And trying "when != ptr->timer" actually does the right thing in that it gets rid of the case where the timer is modified outside of the del_timer() case, *but* it also causes odd other changes to the output. Look at what it generates for that drivers/media/usb/pvrusb2/pvrusb2-hdw.c file, which finds a lot of triggers with the "when != ptr->timer", but only does one without it. So I gave up, just because I clearly don't understand the rules. (Comparing output is also fun because the ordering of the patches is random, so consecutive runs with the same rule will give different patches. I assume that it's just because it's done in parallel, but it doesn't help the "try to see what changes when you change the script" ;) Linus
On Sat, 5 Nov 2022 14:13:14 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > (Comparing output is also fun because the ordering of the patches is > random, so consecutive runs with the same rule will give different > patches. I assume that it's just because it's done in parallel, but it > doesn't help the "try to see what changes when you change the script" > ;) What I do to compare is: patch -p1 < cocci1.patch git commit -a git show | patch -p1 -R patch -p1 < cocci2.patch git diff Then I see how things changed. This is how I was able to show you the tweaks I made. -- Steve