Message ID | 20220707062112.308239-1-windhl@126.com |
---|---|
State | New |
Headers | show |
Series | thermal/core: Fix refcount bugs in __thermal_cooling_device_register() | expand |
On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote: > > For each new reference of 'device_node', we should increase its > refcount. Otherwise, there will be premature free. > > For example, in drivers\thermal\tegra\soctherm.c, the function > soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to > iterate its child device_node which will be then passed into > __thermal_cooling_device_register(). As for_each_xxx OF APIs will > automatically increase and decrease the refcount of 'device_node', > we should use additional of_node_get() to record the new refernece. reference > > NOTE, we should also call the corresponding of_node_put() in fail path > or when the *_unregister() function is called. The NOTE in capitals above is somewhat confusing. I would just say "Accordingly, the corresponding of_node_put() needs to be run in the error code path and on cooling device unregistration." > > Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register") > Signed-off-by: Liang He <windhl@126.com> > --- > I cannot confirm, in *_unregister, we should put of_node_put() in or > out of the *_lock/*_unlock functions. Please check it carefully. This doesn't matter too much AFAICS. Please note that the of_node_put() can still "leak" into the critical section through the "unlock" operation, because the latter is not a full memory barrier. Moreover, dropping the reference means that the object in question won't be used any more by the holder of that reference and it is no reason I can see why it would be necessary to hold the lock while doing that. > drivers/thermal/thermal_core.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c > index cdc0552e8c42..c459e2958b7b 100644 > --- a/drivers/thermal/thermal_core.c > +++ b/drivers/thermal/thermal_core.c > @@ -919,7 +919,7 @@ __thermal_cooling_device_register(struct device_node *np, > > mutex_init(&cdev->lock); > INIT_LIST_HEAD(&cdev->thermal_instances); > - cdev->np = np; > + cdev->np = of_node_get(np); > cdev->ops = ops; > cdev->updated = false; > cdev->device.class = &thermal_class; > @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np, > return cdev; > > out_kfree_type: > + of_node_put(cdev->np); > thermal_cooling_device_destroy_sysfs(cdev); > kfree(cdev->type); > put_device(&cdev->device); > @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) > > mutex_unlock(&thermal_list_lock); > > + of_node_put(cdev->np); Could this be done right before the thermal_cooling_device_destroy_sysfs() below? Then the sequence would be completely analogous to the error code path above. > ida_simple_remove(&thermal_cdev_ida, cdev->id); > device_del(&cdev->device); > thermal_cooling_device_destroy_sysfs(cdev); > -- Overall, this looks like a genuine fix to me. Daniel, what do you think?
At 2022-07-16 01:14:31, "Rafael J. Wysocki" <rafael@kernel.org> wrote: >On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote: >> >> For each new reference of 'device_node', we should increase its >> refcount. Otherwise, there will be premature free. >> >> For example, in drivers\thermal\tegra\soctherm.c, the function >> soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to >> iterate its child device_node which will be then passed into >> __thermal_cooling_device_register(). As for_each_xxx OF APIs will >> automatically increase and decrease the refcount of 'device_node', >> we should use additional of_node_get() to record the new refernece. > >reference Thanks! > >> >> NOTE, we should also call the corresponding of_node_put() in fail path >> or when the *_unregister() function is called. > >The NOTE in capitals above is somewhat confusing. I would just say >"Accordingly, the corresponding of_node_put() needs to be run in the >error code path and on cooling device unregistration." > Thanks, I will change that in new version. >> >> Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register") >> Signed-off-by: Liang He <windhl@126.com> >> --- >> I cannot confirm, in *_unregister, we should put of_node_put() in or >> out of the *_lock/*_unlock functions. Please check it carefully. > >This doesn't matter too much AFAICS. > >Please note that the of_node_put() can still "leak" into the critical >section through the "unlock" operation, because the latter is not a >full memory barrier. > >Moreover, dropping the reference means that the object in question >won't be used any more by the holder of that reference and it is no >reason I can see why it would be necessary to hold the lock while >doing that. > >> drivers/thermal/thermal_core.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c >> index cdc0552e8c42..c459e2958b7b 100644 >> --- a/drivers/thermal/thermal_core.c >> +++ b/drivers/thermal/thermal_core.c >> @@ -919,7 +919,7 @@ __thermal_cooling_device_register(struct device_node *np, >> >> mutex_init(&cdev->lock); >> INIT_LIST_HEAD(&cdev->thermal_instances); >> - cdev->np = np; >> + cdev->np = of_node_get(np); >> cdev->ops = ops; >> cdev->updated = false; >> cdev->device.class = &thermal_class; >> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np, >> return cdev; >> >> out_kfree_type: >> + of_node_put(cdev->np); >> thermal_cooling_device_destroy_sysfs(cdev); >> kfree(cdev->type); >> put_device(&cdev->device); >> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) >> >> mutex_unlock(&thermal_list_lock); >> >> + of_node_put(cdev->np); > >Could this be done right before the >thermal_cooling_device_destroy_sysfs() below? Then the sequence would >be completely analogous to the error code path above. > That souds good, I will change it in new version. >> ida_simple_remove(&thermal_cdev_ida, cdev->id); >> device_del(&cdev->device); >> thermal_cooling_device_destroy_sysfs(cdev); >> -- > >Overall, this looks like a genuine fix to me. > >Daniel, what do you think? I will also wait Daniel's reponse before I send new version. Thanks again, Liang
On 15/07/2022 19:14, Rafael J. Wysocki wrote: > On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote: [ ... ] >> - cdev->np = np; >> + cdev->np = of_node_get(np); >> cdev->ops = ops; >> cdev->updated = false; >> cdev->device.class = &thermal_class; >> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np, >> return cdev; >> >> out_kfree_type: >> + of_node_put(cdev->np); >> thermal_cooling_device_destroy_sysfs(cdev); >> kfree(cdev->type); >> put_device(&cdev->device); >> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) >> >> mutex_unlock(&thermal_list_lock); >> >> + of_node_put(cdev->np); > > Could this be done right before the > thermal_cooling_device_destroy_sysfs() below? Then the sequence would > be completely analogous to the error code path above. > >> ida_simple_remove(&thermal_cdev_ida, cdev->id); >> device_del(&cdev->device); >> thermal_cooling_device_destroy_sysfs(cdev); >> -- > > Overall, this looks like a genuine fix to me. > > Daniel, what do you think? Yes, the of_node_put() is often missing when there is the for_each_xxx OF API. But here the cdev->np is only used to compare pointers so used as an identifier, not de-referenced just comparing the addresses. It is used to bind a thermal zone with a cooling device: thermal_of.c: if (tcbp->cooling_device == cdev->np) { thermal_of.c: if (tcbp->cooling_device == cdev->np) { That is probably why no refcount was taken on this device node. Moreover, this will go away with the series reworking the thermal-of I sent last week
At 2022-07-17 06:03:46, "Daniel Lezcano" <daniel.lezcano@linaro.org> wrote: >On 15/07/2022 19:14, Rafael J. Wysocki wrote: >> On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote: > >[ ... ] > >>> - cdev->np = np; >>> + cdev->np = of_node_get(np); >>> cdev->ops = ops; >>> cdev->updated = false; >>> cdev->device.class = &thermal_class; >>> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np, >>> return cdev; >>> >>> out_kfree_type: >>> + of_node_put(cdev->np); >>> thermal_cooling_device_destroy_sysfs(cdev); >>> kfree(cdev->type); >>> put_device(&cdev->device); >>> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) >>> >>> mutex_unlock(&thermal_list_lock); >>> >>> + of_node_put(cdev->np); >> >> Could this be done right before the >> thermal_cooling_device_destroy_sysfs() below? Then the sequence would >> be completely analogous to the error code path above. >> >>> ida_simple_remove(&thermal_cdev_ida, cdev->id); >>> device_del(&cdev->device); >>> thermal_cooling_device_destroy_sysfs(cdev); >>> -- >> >> Overall, this looks like a genuine fix to me. >> >> Daniel, what do you think? > >Yes, the of_node_put() is often missing when there is the for_each_xxx >OF API. But here the cdev->np is only used to compare pointers so used >as an identifier, not de-referenced just comparing the addresses. > Thanks, this is a good lesson that explains when there is no need to refcount new reference. So I think there is also no need to patch anything, right? Thanks again, Liang >It is used to bind a thermal zone with a cooling device: > >thermal_of.c: if (tcbp->cooling_device == cdev->np) { >thermal_of.c: if (tcbp->cooling_device == cdev->np) { > >That is probably why no refcount was taken on this device node. > >Moreover, this will go away with the series reworking the thermal-of I >sent last week > >> >-- ><http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs > >Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | ><http://twitter.com/#!/linaroorg> Twitter | ><http://www.linaro.org/linaro-blog/> Blog
On 17/07/2022 04:57, Liang He wrote: [ ... ] >> Yes, the of_node_put() is often missing when there is the for_each_xxx >> OF API. But here the cdev->np is only used to compare pointers so used >> as an identifier, not de-referenced just comparing the addresses. > >> > > Thanks, this is a good lesson that explains when there is no need > to refcount new reference. > So I think there is also no need to patch anything, right? Right, no need a change for this. Thanks anyway for tracking down the refcount in the code [ ... ]
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index cdc0552e8c42..c459e2958b7b 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -919,7 +919,7 @@ __thermal_cooling_device_register(struct device_node *np, mutex_init(&cdev->lock); INIT_LIST_HEAD(&cdev->thermal_instances); - cdev->np = np; + cdev->np = of_node_get(np); cdev->ops = ops; cdev->updated = false; cdev->device.class = &thermal_class; @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np, return cdev; out_kfree_type: + of_node_put(cdev->np); thermal_cooling_device_destroy_sysfs(cdev); kfree(cdev->type); put_device(&cdev->device); @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) mutex_unlock(&thermal_list_lock); + of_node_put(cdev->np); ida_simple_remove(&thermal_cdev_ida, cdev->id); device_del(&cdev->device); thermal_cooling_device_destroy_sysfs(cdev);
For each new reference of 'device_node', we should increase its refcount. Otherwise, there will be premature free. For example, in drivers\thermal\tegra\soctherm.c, the function soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to iterate its child device_node which will be then passed into __thermal_cooling_device_register(). As for_each_xxx OF APIs will automatically increase and decrease the refcount of 'device_node', we should use additional of_node_get() to record the new refernece. NOTE, we should also call the corresponding of_node_put() in fail path or when the *_unregister() function is called. Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register") Signed-off-by: Liang He <windhl@126.com> --- I cannot confirm, in *_unregister, we should put of_node_put() in or out of the *_lock/*_unlock functions. Please check it carefully. drivers/thermal/thermal_core.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)