diff mbox series

thermal/core: Fix refcount bugs in __thermal_cooling_device_register()

Message ID 20220707062112.308239-1-windhl@126.com
State New
Headers show
Series thermal/core: Fix refcount bugs in __thermal_cooling_device_register() | expand

Commit Message

Liang He July 7, 2022, 6:21 a.m. UTC
For each new reference of 'device_node', we should increase its
refcount. Otherwise, there will be premature free.

For example, in drivers\thermal\tegra\soctherm.c, the function
soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to
iterate its child device_node which will be then passed into
 __thermal_cooling_device_register(). As for_each_xxx OF APIs will
automatically increase and decrease the refcount of 'device_node',
we should use additional of_node_get() to record the new refernece.

NOTE, we should also call the corresponding of_node_put() in fail path
or when the *_unregister() function is called.

Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register")
Signed-off-by: Liang He <windhl@126.com>
---
 I cannot confirm, in *_unregister, we should put of_node_put() in or
out of the *_lock/*_unlock functions. Please check it carefully.

 drivers/thermal/thermal_core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki July 15, 2022, 5:14 p.m. UTC | #1
On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote:
>
> For each new reference of 'device_node', we should increase its
> refcount. Otherwise, there will be premature free.
>
> For example, in drivers\thermal\tegra\soctherm.c, the function
> soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to
> iterate its child device_node which will be then passed into
>  __thermal_cooling_device_register(). As for_each_xxx OF APIs will
> automatically increase and decrease the refcount of 'device_node',
> we should use additional of_node_get() to record the new refernece.

reference

>
> NOTE, we should also call the corresponding of_node_put() in fail path
> or when the *_unregister() function is called.

The NOTE in capitals above is somewhat confusing.  I would just say
"Accordingly, the corresponding of_node_put() needs to be run in the
error code path and on cooling device unregistration."

>
> Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register")
> Signed-off-by: Liang He <windhl@126.com>
> ---
>  I cannot confirm, in *_unregister, we should put of_node_put() in or
> out of the *_lock/*_unlock functions. Please check it carefully.

This doesn't matter too much AFAICS.

Please note that the of_node_put() can still "leak" into the critical
section through the "unlock" operation, because the latter is not a
full memory barrier.

Moreover, dropping the reference means that the object in question
won't be used any more by the holder of that reference and it is no
reason I can see why it would be necessary to hold the lock while
doing that.

>  drivers/thermal/thermal_core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index cdc0552e8c42..c459e2958b7b 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -919,7 +919,7 @@ __thermal_cooling_device_register(struct device_node *np,
>
>         mutex_init(&cdev->lock);
>         INIT_LIST_HEAD(&cdev->thermal_instances);
> -       cdev->np = np;
> +       cdev->np = of_node_get(np);
>         cdev->ops = ops;
>         cdev->updated = false;
>         cdev->device.class = &thermal_class;
> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np,
>         return cdev;
>
>  out_kfree_type:
> +       of_node_put(cdev->np);
>         thermal_cooling_device_destroy_sysfs(cdev);
>         kfree(cdev->type);
>         put_device(&cdev->device);
> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>
>         mutex_unlock(&thermal_list_lock);
>
> +       of_node_put(cdev->np);

Could this be done right before the
thermal_cooling_device_destroy_sysfs() below?  Then the sequence would
be completely analogous to the error code path above.

>         ida_simple_remove(&thermal_cdev_ida, cdev->id);
>         device_del(&cdev->device);
>         thermal_cooling_device_destroy_sysfs(cdev);
> --

Overall, this looks like a genuine fix to me.

Daniel, what do you think?
Liang He July 16, 2022, 2 a.m. UTC | #2
At 2022-07-16 01:14:31, "Rafael J. Wysocki" <rafael@kernel.org> wrote:
>On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote:
>>
>> For each new reference of 'device_node', we should increase its
>> refcount. Otherwise, there will be premature free.
>>
>> For example, in drivers\thermal\tegra\soctherm.c, the function
>> soctherm_init_hw_throt_cdev() will use for_each_child_of_node() to
>> iterate its child device_node which will be then passed into
>>  __thermal_cooling_device_register(). As for_each_xxx OF APIs will
>> automatically increase and decrease the refcount of 'device_node',
>> we should use additional of_node_get() to record the new refernece.
>
>reference

Thanks!

>
>>
>> NOTE, we should also call the corresponding of_node_put() in fail path
>> or when the *_unregister() function is called.
>
>The NOTE in capitals above is somewhat confusing.  I would just say
>"Accordingly, the corresponding of_node_put() needs to be run in the
>error code path and on cooling device unregistration."
>
Thanks, I will change that in new version.

>>
>> Fixes: a116b5d44f14 ("thermal: core: introduce thermal_of_cooling_device_register")
>> Signed-off-by: Liang He <windhl@126.com>
>> ---
>>  I cannot confirm, in *_unregister, we should put of_node_put() in or
>> out of the *_lock/*_unlock functions. Please check it carefully.
>
>This doesn't matter too much AFAICS.
>
>Please note that the of_node_put() can still "leak" into the critical
>section through the "unlock" operation, because the latter is not a
>full memory barrier.
>
>Moreover, dropping the reference means that the object in question
>won't be used any more by the holder of that reference and it is no
>reason I can see why it would be necessary to hold the lock while
>doing that.
>
>>  drivers/thermal/thermal_core.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>> index cdc0552e8c42..c459e2958b7b 100644
>> --- a/drivers/thermal/thermal_core.c
>> +++ b/drivers/thermal/thermal_core.c
>> @@ -919,7 +919,7 @@ __thermal_cooling_device_register(struct device_node *np,
>>
>>         mutex_init(&cdev->lock);
>>         INIT_LIST_HEAD(&cdev->thermal_instances);
>> -       cdev->np = np;
>> +       cdev->np = of_node_get(np);
>>         cdev->ops = ops;
>>         cdev->updated = false;
>>         cdev->device.class = &thermal_class;
>> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np,
>>         return cdev;
>>
>>  out_kfree_type:
>> +       of_node_put(cdev->np);
>>         thermal_cooling_device_destroy_sysfs(cdev);
>>         kfree(cdev->type);
>>         put_device(&cdev->device);
>> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>>
>>         mutex_unlock(&thermal_list_lock);
>>
>> +       of_node_put(cdev->np);
>
>Could this be done right before the
>thermal_cooling_device_destroy_sysfs() below?  Then the sequence would
>be completely analogous to the error code path above.
>

That souds good, I will change it in new version.

>>         ida_simple_remove(&thermal_cdev_ida, cdev->id);
>>         device_del(&cdev->device);
>>         thermal_cooling_device_destroy_sysfs(cdev);
>> --
>
>Overall, this looks like a genuine fix to me.
>
>Daniel, what do you think?

I will also wait Daniel's reponse before I send new version.

Thanks again,

Liang
Daniel Lezcano July 16, 2022, 10:03 p.m. UTC | #3
On 15/07/2022 19:14, Rafael J. Wysocki wrote:
> On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote:

[ ... ]

>> -       cdev->np = np;
>> +       cdev->np = of_node_get(np);
>>          cdev->ops = ops;
>>          cdev->updated = false;
>>          cdev->device.class = &thermal_class;
>> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np,
>>          return cdev;
>>
>>   out_kfree_type:
>> +       of_node_put(cdev->np);
>>          thermal_cooling_device_destroy_sysfs(cdev);
>>          kfree(cdev->type);
>>          put_device(&cdev->device);
>> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>>
>>          mutex_unlock(&thermal_list_lock);
>>
>> +       of_node_put(cdev->np);
> 
> Could this be done right before the
> thermal_cooling_device_destroy_sysfs() below?  Then the sequence would
> be completely analogous to the error code path above.
> 
>>          ida_simple_remove(&thermal_cdev_ida, cdev->id);
>>          device_del(&cdev->device);
>>          thermal_cooling_device_destroy_sysfs(cdev);
>> --
> 
> Overall, this looks like a genuine fix to me.
> 
> Daniel, what do you think?

Yes, the of_node_put() is often missing when there is the for_each_xxx 
OF API. But here the cdev->np is only used to compare pointers so used 
as an identifier, not de-referenced just comparing the addresses.

It is used to bind a thermal zone with a cooling device:

thermal_of.c:                   if (tcbp->cooling_device == cdev->np) {
thermal_of.c:                   if (tcbp->cooling_device == cdev->np) {

That is probably why no refcount was taken on this device node.

Moreover, this will go away with the series reworking the thermal-of I 
sent last week
Liang He July 17, 2022, 2:57 a.m. UTC | #4
At 2022-07-17 06:03:46, "Daniel Lezcano" <daniel.lezcano@linaro.org> wrote:
>On 15/07/2022 19:14, Rafael J. Wysocki wrote:
>> On Thu, Jul 7, 2022 at 8:21 AM Liang He <windhl@126.com> wrote:
>
>[ ... ]
>
>>> -       cdev->np = np;
>>> +       cdev->np = of_node_get(np);
>>>          cdev->ops = ops;
>>>          cdev->updated = false;
>>>          cdev->device.class = &thermal_class;
>>> @@ -947,6 +947,7 @@ __thermal_cooling_device_register(struct device_node *np,
>>>          return cdev;
>>>
>>>   out_kfree_type:
>>> +       of_node_put(cdev->np);
>>>          thermal_cooling_device_destroy_sysfs(cdev);
>>>          kfree(cdev->type);
>>>          put_device(&cdev->device);
>>> @@ -1111,6 +1112,7 @@ void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
>>>
>>>          mutex_unlock(&thermal_list_lock);
>>>
>>> +       of_node_put(cdev->np);
>> 
>> Could this be done right before the
>> thermal_cooling_device_destroy_sysfs() below?  Then the sequence would
>> be completely analogous to the error code path above.
>> 
>>>          ida_simple_remove(&thermal_cdev_ida, cdev->id);
>>>          device_del(&cdev->device);
>>>          thermal_cooling_device_destroy_sysfs(cdev);
>>> --
>> 
>> Overall, this looks like a genuine fix to me.
>> 
>> Daniel, what do you think?
>
>Yes, the of_node_put() is often missing when there is the for_each_xxx 
>OF API. But here the cdev->np is only used to compare pointers so used 
>as an identifier, not de-referenced just comparing the addresses.

>

Thanks, this is a good lesson that explains when there is no need
to refcount new reference.
So I think there is also no need to patch anything, right?

Thanks again,

Liang

>It is used to bind a thermal zone with a cooling device:
>
>thermal_of.c:                   if (tcbp->cooling_device == cdev->np) {
>thermal_of.c:                   if (tcbp->cooling_device == cdev->np) {
>
>That is probably why no refcount was taken on this device node.
>
>Moreover, this will go away with the series reworking the thermal-of I 
>sent last week

>
>>
>-- 
><http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
>Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
><http://twitter.com/#!/linaroorg> Twitter |
><http://www.linaro.org/linaro-blog/> Blog
Daniel Lezcano July 17, 2022, 7:06 a.m. UTC | #5
On 17/07/2022 04:57, Liang He wrote:

[ ... ]

>> Yes, the of_node_put() is often missing when there is the for_each_xxx
>> OF API. But here the cdev->np is only used to compare pointers so used
>> as an identifier, not de-referenced just comparing the addresses.
> 
>>
> 
> Thanks, this is a good lesson that explains when there is no need
> to refcount new reference.
> So I think there is also no need to patch anything, right?


Right, no need a change for this.

Thanks anyway for tracking down the refcount in the code


[ ... ]
diff mbox series

Patch

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index cdc0552e8c42..c459e2958b7b 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -919,7 +919,7 @@  __thermal_cooling_device_register(struct device_node *np,
 
 	mutex_init(&cdev->lock);
 	INIT_LIST_HEAD(&cdev->thermal_instances);
-	cdev->np = np;
+	cdev->np = of_node_get(np);
 	cdev->ops = ops;
 	cdev->updated = false;
 	cdev->device.class = &thermal_class;
@@ -947,6 +947,7 @@  __thermal_cooling_device_register(struct device_node *np,
 	return cdev;
 
 out_kfree_type:
+	of_node_put(cdev->np);
 	thermal_cooling_device_destroy_sysfs(cdev);
 	kfree(cdev->type);
 	put_device(&cdev->device);
@@ -1111,6 +1112,7 @@  void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev)
 
 	mutex_unlock(&thermal_list_lock);
 
+	of_node_put(cdev->np);
 	ida_simple_remove(&thermal_cdev_ida, cdev->id);
 	device_del(&cdev->device);
 	thermal_cooling_device_destroy_sysfs(cdev);