[v1,0/2] thermal: core: Handle failed temperature checks more carefully

Message ID 2348857.ElGaqSPkdT@rjwysocki.net

Rafael J. Wysocki July 18, 2024, 6:57 p.m. UTC
Hi Everyone,

This series kind of augments

https://lore.kernel.org/linux-pm/4950004.31r3eYUQgx@rjwysocki.net/

so I'm considering adding it to 6.11.

The problem with handling temperature check errors in __thermal_zone_device_update()
after the above is that if someone has a dead thermal zone lurking somewhere in
their system that returns such errors continuously, they will get a flood of
"temperature check failed" messages in the log, which will be reported as a
regression.  Rightfully so, because these messages render the kernel log
practically unusable and the continuous and useless polling of such a thermal
zone may even prevent the system from entering deep idle states.  Clearly,
something needs to be done about this.

One possible approach might be to simply disable the thermal zone in question
after the first error (that is not -EAGAIN) returned by its .get_temp()
callback, but that cannot be done because there are thermal zones in which
.get_temp() returns errors to start with, but they recover later, and they
need to be taken into account.

So the only other alternative that is not overly complicated is to add a
back-off mechanism to the polling, so the thermal zone has a chance to recover,
but the core will not wait for that forever.  At some point it will just disable
the thermal zone and let user space re-enable it if that's regarded as a good
idea.  This is done in the second patch and the first patch is preparatory.
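
For illustration only, the back-off idea can be sketched in plain user-space C
as below.  This is not the code from the patches: the struct, the function
names, the 1 s starting delay, the 64 s cap and the doubling step are all
assumptions made up for this sketch, and treating -EAGAIN as a retry that
does not escalate the delay is likewise just one possible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits, chosen for illustration only. */
#define SKETCH_BASE_DELAY_MS	1000u
#define SKETCH_MAX_DELAY_MS	64000u

/* Hypothetical stand-in for the per-zone state the core would keep. */
struct sketch_zone {
	unsigned int recheck_delay_ms;	/* wait before the next retry */
	bool enabled;
};

static void sketch_zone_init(struct sketch_zone *z)
{
	assert(z != NULL);
	z->recheck_delay_ms = SKETCH_BASE_DELAY_MS;
	z->enabled = true;
}

/*
 * Takes the return value of a temperature read and returns the delay (ms)
 * before the next poll, or 0 if the zone is (or has just been) disabled.
 */
static unsigned int sketch_handle_temp_result(struct sketch_zone *z, int ret)
{
	unsigned int delay;

	assert(z != NULL);

	if (!z->enabled)
		return 0;

	if (ret == 0) {
		/* The zone works (again): forget any earlier back-off. */
		z->recheck_delay_ms = SKETCH_BASE_DELAY_MS;
		return SKETCH_BASE_DELAY_MS;
	}

	if (ret == -EAGAIN)
		/* Assumed transient: retry without escalating the delay. */
		return z->recheck_delay_ms;

	if (z->recheck_delay_ms > SKETCH_MAX_DELAY_MS) {
		/* Waited long enough: give up, user space may re-enable. */
		z->enabled = false;
		return 0;
	}

	/* Use the current delay now and double it for the next failure. */
	delay = z->recheck_delay_ms;
	z->recheck_delay_ms *= 2;
	return delay;
}
```

With these made-up constants, a zone that keeps failing is polled less and
less often until the delay would exceed the cap, at which point it is
disabled.  A zone that recovers in the meantime goes back to the base delay.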

Thanks!