Message ID | 20230607003721.834038-1-evalenti@kernel.org |
---|---|
State | New |
Headers | show |
Series | [1/1] thermal: sysfs: avoid actual readings from sysfs | expand |
On 6/7/23 11:38, Eduardo Valentin wrote: >> Can you elaborate 'the timing requirement for the governors' ? I'm >> missing the point > > > The point is to avoid contention on the device update path. > Governor that use differential equations on temperature over time > will be very time sensitive. Step wise, power allocator, or any > PID will be very sensitive to time. So, If userspace is hitting > this API too often we can see cases where the updates needed to > service userspace may defer/delay the execution of the governor > logic. > > Despite that, there is really no point to have more updates than > what was configured for the thermal zone to support. Say that > we configure a thermal zone to update itself every 500ms, yet > userspace keeps sending reads every 100ms, we do not need necessarily > to do a trip to the device every single time to update the temperature, > as per the design for the thermal zone. A userspace governor might *also* use PID or filter multiple samples taken at high rate. I specifically switched my python fan control script from the Intel coretemp hwmon to the x86_pkg_tmp thermal zone because of the coretemp driver's annoying 1-second caching behavior.
Hi Eduardo, On 08/06/2023 19:44, Eduardo Valentin wrote: [ ... ] >> Do you have a use case with some measurements to spot an issue or is it >> a potential issue you identified ? > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) > and needs to update the zone every 100ms. Each read in this bus, if done alone > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, > well technically 9, but rounding for the sake of the example, which gets you > 50 / 100KHz = 500 us. That is for a single read. You add one single extra > userspace read triggering an unused device update, that is already a 1ms drift. > Basically you looking at 0.5% for each extra userspace read competing in this > sysfs node. You add extra devices in the same I2C bus, your governor is looking > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. > I did not even include the lock overhead considered for this CPU ;-) > > Again, this is not about controlling the DIE temperature of the CPU you > are running the thermal subsystem. This is about controlling > a target device. Ok. The target device is on a bus which is slow and prone to contention. This hardware is not designed to be monitored with a high precision, so reading the temperature at a high rate does not really make sense. Moreover (putting apart a potential contention), the delayed read does not change the time interval, which remains the same from the governor point of view. In addition, i2c sensors are usually handled in the hwmon subsystem which are registered in the thermal framework from there. Those have most of their 'read' callback with a cached value in a jiffies based way eg. [1]. So the feature already exists for slow devices and are handled in the drivers directly via the hwmon subsystem. From my POV, the feature is not needed in the thermal framework. [1] https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163
On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote: > > > > Hi Eduardo, > > On 08/06/2023 19:44, Eduardo Valentin wrote: > > [ ... ] > > > > Do you have a use case with some measurements to spot an issue or is it > > > a potential issue you identified ? > > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) > > and needs to update the zone every 100ms. Each read in this bus, if done alone > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, > > well technically 9, but rounding for the sake of the example, which gets you > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra > > userspace read triggering an unused device update, that is already a 1ms drift. > > Basically you looking at 0.5% for each extra userspace read competing in this > > sysfs node. You add extra devices in the same I2C bus, your governor is looking > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. > > I did not even include the lock overhead considered for this CPU ;-) > > > > Again, this is not about controlling the DIE temperature of the CPU you > > are running the thermal subsystem. This is about controlling > > a target device. > > Ok. The target device is on a bus which is slow and prone to contention. > > This hardware is not designed to be monitored with a high precision, so > reading the temperature at a high rate does not really make sense. On the contrary, it needs even more precision and any extra delay adds to loss on accuracy :-) > > Moreover (putting apart a potential contention), the delayed read does > not change the time interval, which remains the same from the governor > point of view. It does not change the governor update interval and that is a property of the thermal zone. Correct. And that is the intention of the change. The actual temperature updates driven by the governor will always result in a driver call. While a userspace call will not be in the way of the governor update. Sysfs reads, However, with the current code as is, it may cause jittering on the actual execution of the governor throttle function. causing the computation of the desired outcome cooling device being skewed. > > In addition, i2c sensors are usually handled in the hwmon subsystem > which are registered in the thermal framework from there. Those have > most of their 'read' callback with a cached value in a jiffies based way > eg. [1]. I guess what you are really saying is: go read the hwmon sysfs node, or, hwmon solves this for us, which unfortunately is not true for all devices. > > So the feature already exists for slow devices and are handled in the > drivers directly via the hwmon subsystem. > > From my POV, the feature is not needed in the thermal framework. The fact that hwmon does it in some way is another evidence of the actual problem. Telling that this has to be solved by another subsystem for a sysfs node that is part of thermal subsystem does not really solve the problem. Also as I mentioned, this is not common on all hwmon devices, and not all I2C devices are hwmon devices. In fact, I2C was just one example of a slow device. There are more I can quote that are not necessarily under the hwmon case. Not sure if you missed, but an alternative for the difference of opinion on how this should behave is to have caching for response of sysfs read of tz/temp as an option/configuration. Then we let userspace to choose which behavior it wants. > > > > [1] > https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/tree/drivers/hwmon/lm95234.c#n163 > > > > -- > <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs > > Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | > <http://twitter.com/#!/linaroorg> Twitter | > <http://www.linaro.org/linaro-blog/> Blog >
On 21/06/2023 07:06, Eduardo Valentin wrote: > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote: >> >> >> >> Hi Eduardo, >> >> On 08/06/2023 19:44, Eduardo Valentin wrote: >> >> [ ... ] >> >>>> Do you have a use case with some measurements to spot an issue or is it >>>> a potential issue you identified ? >>> >>> >>> yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) >>> and needs to update the zone every 100ms. Each read in this bus, if done alone >>> would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, >>> well technically 9, but rounding for the sake of the example, which gets you >>> 50 / 100KHz = 500 us. That is for a single read. You add one single extra >>> userspace read triggering an unused device update, that is already a 1ms drift. >>> Basically you looking at 0.5% for each extra userspace read competing in this >>> sysfs node. You add extra devices in the same I2C bus, your governor is looking >>> at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. >>> I did not even include the lock overhead considered for this CPU ;-) >>> >>> Again, this is not about controlling the DIE temperature of the CPU you >>> are running the thermal subsystem. This is about controlling >>> a target device. >> >> Ok. The target device is on a bus which is slow and prone to contention. >> >> This hardware is not designed to be monitored with a high precision, so >> reading the temperature at a high rate does not really make sense. > > On the contrary, it needs even more precision and any extra delay adds to > loss on accuracy :-) What I meant is if the hardware designer thought there could be a problem with the thermal zone they would have put another kind of sensor, not one with a i2c based communication. >> Moreover (putting apart a potential contention), the delayed read does >> not change the time interval, which remains the same from the governor >> point of view. > > It does not change the governor update interval and that is a property of > the thermal zone. Correct. And that is the intention of the change. > The actual temperature updates driven by the governor will always > result in a driver call. While a userspace call will not be in the way > of the governor update. > > Sysfs reads, However, with the current code as is, it may cause > jittering on the actual execution of the governor throttle function. > causing the computation of the desired outcome cooling device being skewed. > >> >> In addition, i2c sensors are usually handled in the hwmon subsystem >> which are registered in the thermal framework from there. Those have >> most of their 'read' callback with a cached value in a jiffies based way >> eg. [1]. > > I guess what you are really saying is: go read the hwmon sysfs node, > or, hwmon solves this for us, which unfortunately is not true for all devices. I meant the i2c sensors are under the hwmon subsystem. This subsystem is connected with the thermal framework, so when a hwmon sensor is created, it register this sensor as a thermal zone. >> So the feature already exists for slow devices and are handled in the >> drivers directly via the hwmon subsystem. >> >> From my POV, the feature is not needed in the thermal framework. > > The fact that hwmon does it in some way is another evidence of the > actual problem. Not really, it shows the i2c sensors are in the hwmon subsystems. > Telling that this has to be solved by another subsystem > for a sysfs node that is part of thermal subsystem does not really solve > the problem. Also as I mentioned, this is not common on all hwmon > devices, and not all I2C devices are hwmon devices. In fact, I2C > was just one example of a slow device. There are more I can quote > that are not necessarily under the hwmon case. Yes, please. Can you give examples with existing drivers in the thermal framework and observed issues on specific platforms ? Numbers would help. > Not sure if you missed, but an alternative for the difference of > opinion on how this should behave is to have caching for response > of sysfs read of tz/temp as an option/configuration. Then we let > userspace to choose which behavior it wants. Before that, you have to prove the feature is really needed and show how the patches solves an issue. At this point, this is not demonstrated.
On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote: > > > > On 21/06/2023 07:06, Eduardo Valentin wrote: > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > Hi Eduardo, > > > > > > On 08/06/2023 19:44, Eduardo Valentin wrote: > > > > > > [ ... ] > > > > > > > > Do you have a use case with some measurements to spot an issue or is it > > > > > a potential issue you identified ? > > > > > > > > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, > > > > well technically 9, but rounding for the sake of the example, which gets you > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra > > > > userspace read triggering an unused device update, that is already a 1ms drift. > > > > Basically you looking at 0.5% for each extra userspace read competing in this > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. > > > > I did not even include the lock overhead considered for this CPU ;-) > > > > > > > > Again, this is not about controlling the DIE temperature of the CPU you > > > > are running the thermal subsystem. This is about controlling > > > > a target device. > > > > > > Ok. The target device is on a bus which is slow and prone to contention. > > > > > > This hardware is not designed to be monitored with a high precision, so > > > reading the temperature at a high rate does not really make sense. > > > > On the contrary, it needs even more precision and any extra delay adds to > > loss on accuracy :-) > > What I meant is if the hardware designer thought there could be a > problem with the thermal zone they would have put another kind of > sensor, not one with a i2c based communication. No, that is not a problem in the physical thermal zone. Or to cover for a hardware design flaw. This is an actual typical case. However, yes, designer must take into account any sort of delays or jittering in the control system, and typically one would add some thermal margins on the control system. But to the original point here, we should eliminate unnecessary jittering or delay in the control system. > > > > > Moreover (putting apart a potential contention), the delayed read does > > > not change the time interval, which remains the same from the governor > > > point of view. > > > > It does not change the governor update interval and that is a property of > > the thermal zone. Correct. And that is the intention of the change. > > The actual temperature updates driven by the governor will always > > result in a driver call. While a userspace call will not be in the way > > of the governor update. > > > > Sysfs reads, However, with the current code as is, it may cause > > jittering on the actual execution of the governor throttle function. > > causing the computation of the desired outcome cooling device being skewed. > > > > > > > > In addition, i2c sensors are usually handled in the hwmon subsystem > > > which are registered in the thermal framework from there. Those have > > > most of their 'read' callback with a cached value in a jiffies based way > > > eg. [1]. > > > > I guess what you are really saying is: go read the hwmon sysfs node, > > or, hwmon solves this for us, which unfortunately is not true for all devices. > > I meant the i2c sensors are under the hwmon subsystem. This subsystem is > connected with the thermal framework, so when a hwmon sensor is created, > it register this sensor as a thermal zone. > > > > > So the feature already exists for slow devices and are handled in the > > > drivers directly via the hwmon subsystem. > > > > > > From my POV, the feature is not needed in the thermal framework. > > > > The fact that hwmon does it in some way is another evidence of the > > actual problem. > > Not really, it shows the i2c sensors are in the hwmon subsystems. They are there too. And hwmon also sees same problem of too frequent device update. The problem is there regardless of the subsystem we use to represent the device. Also, I dont buy the argument that this is a problem of hwmon because it is already supported to plug in hwmon devices in the thermal subsystem. That is actually the original design in fact :-). So running a control in the thermal subsystem on top of inputs from hwmon, which happens to support I2C devices, is not a problem for hwmon to solve, since the control is in the thermal subsystem, and hwmon does not offer control solutions. > > > > Telling that this has to be solved by another subsystem > > for a sysfs node that is part of thermal subsystem does not really solve > > the problem. Also as I mentioned, this is not common on all hwmon > > devices, and not all I2C devices are hwmon devices. In fact, I2C > > was just one example of a slow device. There are more I can quote > > that are not necessarily under the hwmon case. > > Yes, please. Can you give examples with existing drivers in the thermal > framework and observed issues on specific platforms ? Numbers would help. I believe I already gave you numbers. And as I explained above, the driver does not need to sit on thermal subsystem only, we already support plugging in I2C/pmbus devices on the thermal subsystem, so basically all drivers on hwmon that has the REGISTER_TZ flag are actual samples for this problem
On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote: > > > > > > > > On 21/06/2023 07:06, Eduardo Valentin wrote: > > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > > > > > Hi Eduardo, > > > > > > > > On 08/06/2023 19:44, Eduardo Valentin wrote: > > > > > > > > [ ... ] > > > > > > > > > > Do you have a use case with some measurements to spot an issue or is it > > > > > > a potential issue you identified ? > > > > > > > > > > > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) > > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone > > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, > > > > > well technically 9, but rounding for the sake of the example, which gets you > > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra > > > > > userspace read triggering an unused device update, that is already a 1ms drift. > > > > > Basically you looking at 0.5% for each extra userspace read competing in this > > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking > > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. > > > > > I did not even include the lock overhead considered for this CPU ;-) > > > > > > > > > > Again, this is not about controlling the DIE temperature of the CPU you > > > > > are running the thermal subsystem. This is about controlling > > > > > a target device. > > > > > > > > Ok. The target device is on a bus which is slow and prone to contention. > > > > > > > > This hardware is not designed to be monitored with a high precision, so > > > > reading the temperature at a high rate does not really make sense. > > > > > > On the contrary, it needs even more precision and any extra delay adds to > > > loss on accuracy :-) > > > > What I meant is if the hardware designer thought there could be a > > problem with the thermal zone they would have put another kind of > > sensor, not one with a i2c based communication. > > No, that is not a problem in the physical thermal zone. Or to cover > for a hardware design flaw. This is an actual typical case. However, > yes, designer must take into account any sort of delays or jittering > in the control system, and typically one would add some thermal margins > on the control system. But to the original point here, we should eliminate > unnecessary jittering or delay in the control system. > > > > > > > > > Moreover (putting apart a potential contention), the delayed read does > > > > not change the time interval, which remains the same from the governor > > > > point of view. > > > > > > It does not change the governor update interval and that is a property of > > > the thermal zone. Correct. And that is the intention of the change. > > > The actual temperature updates driven by the governor will always > > > result in a driver call. While a userspace call will not be in the way > > > of the governor update. > > > > > > Sysfs reads, However, with the current code as is, it may cause > > > jittering on the actual execution of the governor throttle function. > > > causing the computation of the desired outcome cooling device being skewed. > > > > > > > > > > > In addition, i2c sensors are usually handled in the hwmon subsystem > > > > which are registered in the thermal framework from there. Those have > > > > most of their 'read' callback with a cached value in a jiffies based way > > > > eg. [1]. > > > > > > I guess what you are really saying is: go read the hwmon sysfs node, > > > or, hwmon solves this for us, which unfortunately is not true for all devices. > > > > I meant the i2c sensors are under the hwmon subsystem. This subsystem is > > connected with the thermal framework, so when a hwmon sensor is created, > > it register this sensor as a thermal zone. > > > > > > > > So the feature already exists for slow devices and are handled in the > > > > drivers directly via the hwmon subsystem. > > > > > > > > From my POV, the feature is not needed in the thermal framework. > > > > > > The fact that hwmon does it in some way is another evidence of the > > > actual problem. > > > > Not really, it shows the i2c sensors are in the hwmon subsystems. > > They are there too. And hwmon also sees same problem of too frequent > device update. The problem is there regardless of the subsystem we use > to represent the device. Also, I dont buy the argument that this is > a problem of hwmon because it is already supported to plug in > hwmon devices in the thermal subsystem. That is actually the original > design in fact :-). So running a control in the thermal subsystem > on top of inputs from hwmon, which happens to support I2C devices, > is not a problem for hwmon to solve, since the control is in the > thermal subsystem, and hwmon does not offer control solutions. Regardless of where the problem is etc, if my understanding of the patch is correct, it is proposing to change the behavior of a well-known sysfs interface in a way that is likely to be incompatible with at least some of its users. This is an obvious no-go in kernel development and I would expect you to be well aware of it. IOW, if you want the cached value to be returned, a new interface (eg. a new sysfs attribute) is needed. And I think that the use case is not really i2c sensors in general, because at least in some cases they work just fine AFAICS, but a platform with a control loop running in the kernel that depends on sensor reads carried out at a specific, approximately constant, rate that can be disturbed by user space checking the sensor temperature via sysfs at "wrong" times. If at the same time the user space program in question doesn't care about the most recent values reported by the sensor, it may very well use the values cached by the in-kernel control loop.
On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote: > > > > On Thu, Jun 22, 2023 at 6:56 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > On Wed, Jun 21, 2023 at 01:43:26PM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > On 21/06/2023 07:06, Eduardo Valentin wrote: > > > > On Mon, Jun 12, 2023 at 10:17:51AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > > > > > > > > > Hi Eduardo, > > > > > > > > > > On 08/06/2023 19:44, Eduardo Valentin wrote: > > > > > > > > > > [ ... ] > > > > > > > > > > > > Do you have a use case with some measurements to spot an issue or is it > > > > > > > a potential issue you identified ? > > > > > > > > > > > > > > > > > > yes, a governor that is using I2C device as input, behind I2C fast mode (100KHz) > > > > > > and needs to update the zone every 100ms. Each read in this bus, if done alone > > > > > > would be around 500us, takes 10bytes to read the device, it is 10 clocks per byte, > > > > > > well technically 9, but rounding for the sake of the example, which gets you > > > > > > 50 / 100KHz = 500 us. That is for a single read. You add one single extra > > > > > > userspace read triggering an unused device update, that is already a 1ms drift. > > > > > > Basically you looking at 0.5% for each extra userspace read competing in this > > > > > > sysfs node. You add extra devices in the same I2C bus, your governor is looking > > > > > > at more than 1% overhead. And I am talking also about a main CPU of ~800MHz. > > > > > > I did not even include the lock overhead considered for this CPU ;-) > > > > > > > > > > > > Again, this is not about controlling the DIE temperature of the CPU you > > > > > > are running the thermal subsystem. This is about controlling > > > > > > a target device. > > > > > > > > > > Ok. The target device is on a bus which is slow and prone to contention. > > > > > > > > > > This hardware is not designed to be monitored with a high precision, so > > > > > reading the temperature at a high rate does not really make sense. > > > > > > > > On the contrary, it needs even more precision and any extra delay adds to > > > > loss on accuracy :-) > > > > > > What I meant is if the hardware designer thought there could be a > > > problem with the thermal zone they would have put another kind of > > > sensor, not one with a i2c based communication. > > > > No, that is not a problem in the physical thermal zone. Or to cover > > for a hardware design flaw. This is an actual typical case. However, > > yes, designer must take into account any sort of delays or jittering > > in the control system, and typically one would add some thermal margins > > on the control system. But to the original point here, we should eliminate > > unnecessary jittering or delay in the control system. > > > > > > > > > > > > > Moreover (putting apart a potential contention), the delayed read does > > > > > not change the time interval, which remains the same from the governor > > > > > point of view. > > > > > > > > It does not change the governor update interval and that is a property of > > > > the thermal zone. Correct. And that is the intention of the change. > > > > The actual temperature updates driven by the governor will always > > > > result in a driver call. While a userspace call will not be in the way > > > > of the governor update. > > > > > > > > Sysfs reads, However, with the current code as is, it may cause > > > > jittering on the actual execution of the governor throttle function. > > > > causing the computation of the desired outcome cooling device being skewed. > > > > > > > > > > > > > > In addition, i2c sensors are usually handled in the hwmon subsystem > > > > > which are registered in the thermal framework from there. Those have > > > > > most of their 'read' callback with a cached value in a jiffies based way > > > > > eg. [1]. > > > > > > > > I guess what you are really saying is: go read the hwmon sysfs node, > > > > or, hwmon solves this for us, which unfortunately is not true for all devices. > > > > > > I meant the i2c sensors are under the hwmon subsystem. This subsystem is > > > connected with the thermal framework, so when a hwmon sensor is created, > > > it register this sensor as a thermal zone. > > > > > > > > > > > So the feature already exists for slow devices and are handled in the > > > > > drivers directly via the hwmon subsystem. > > > > > > > > > > From my POV, the feature is not needed in the thermal framework. > > > > > > > > The fact that hwmon does it in some way is another evidence of the > > > > actual problem. > > > > > > Not really, it shows the i2c sensors are in the hwmon subsystems. > > > > They are there too. And hwmon also sees same problem of too frequent > > device update. The problem is there regardless of the subsystem we use > > to represent the device. Also, I dont buy the argument that this is > > a problem of hwmon because it is already supported to plug in > > hwmon devices in the thermal subsystem. That is actually the original > > design in fact :-). So running a control in the thermal subsystem > > on top of inputs from hwmon, which happens to support I2C devices, > > is not a problem for hwmon to solve, since the control is in the > > thermal subsystem, and hwmon does not offer control solutions. > > Regardless of where the problem is etc, if my understanding of the > patch is correct, it is proposing to change the behavior of a > well-known sysfs interface in a way that is likely to be incompatible > with at least some of its users. This is an obvious no-go in kernel > development and I would expect you to be well aware of it. yeah I get it. > > IOW, if you want the cached value to be returned, a new interface (eg. > a new sysfs attribute) is needed. Yeah, I am fine with either a new sysfs entry to return the cached value, or a new sysfs entry to change the behavior of the existing /temp, as I mentioned before, either way works for me, if changing the existing one is too intrusive. > > And I think that the use case is not really i2c sensors in general, I2C was just the factual example I had, but you are right, the use case is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C, the actual issue just happen to be easier to see when I2C devices, slower than typical MMIO devices, are being used as input for the control. > because at least in some cases they work just fine AFAICS, but a > platform with a control loop running in the kernel that depends on > sensor reads carried out at a specific, approximately constant, rate > that can be disturbed by user space checking the sensor temperature > via sysfs at "wrong" times. If at the same time the user space > program in question doesn't care about the most recent values reported > by the sensor, it may very well use the values cached by the in-kernel > control loop. That is right, the balance between supporting user space reads and running the control timely is the actual original concern. The problem fades out a bit when you have device reads in the us / ns time scale and control update is in 100s of ms. But becomes more apparent on slower devices, when reads are in ms and policy update is in the 100s ms, that is why the I2C case was quoted. But nothing wrong with I2C, or supporting I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ, the problem is on having to support the control in kernel and a read in userspace that can create jitter to the control. And as you properly stated, for this use case, the userspace does not care about the most recent value of the device, so that is why the change proposes to give cached values. On the flip side though, there may be user space based policies that require the most recent device value. But in this case, the expectation is to disable the in kernel policy and switch the thermal zone to mode == disabled. And that is also why this patch will go the path to request the most recent device value when the /temp sysfs entry is read and the mode is disabled. I would suggest to have an addition sysfs entry that sets the thermal zone into cached mode or not, let's say for the sake of the discussion, we call it 'cached_values', with default to 'not cached'. This way, we could support: a) Default, current situation, where all reads in /temp are always backed up with an actual device .get_temp(). Nothing changes here, keeps reading under /temp, and so long system designer is satisfied with jittering, no need to change anything. b) When one has control in kernel, and frequent userspace reads on /temp but system designer wants to protect the control in kernel to avoid jittering. Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'. Then userspace will get updated values as frequent as the kernel control has and the kernel control will not be disturbed by frequent device reads. c) When one has control in userspace, and wants to have the most frequent device read. Here, one can simply disable the in kernel control by setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set to 'not cached'. Well in fact, the way I thought this originally in this patch was to simply always read the device when /temp is read is 'mode' is 'disabled'. I believe you proposed to have another sysfs entry sysfs entry for reading cached temperature. Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work. The only possible downside is to have two entries to read temperature. Strong opinions on any of the above?
On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote: > > [cut] > > > > Regardless of where the problem is etc, if my understanding of the > > patch is correct, it is proposing to change the behavior of a > > well-known sysfs interface in a way that is likely to be incompatible > > with at least some of its users. This is an obvious no-go in kernel > > development and I would expect you to be well aware of it. > > yeah I get it. > > > > > IOW, if you want the cached value to be returned, a new interface (eg. > > a new sysfs attribute) is needed. > > Yeah, I am fine with either a new sysfs entry to return the cached value, > or a new sysfs entry to change the behavior of the existing /temp, as I > mentioned before, either way works for me, if changing the existing one > is too intrusive. > > > > > And I think that the use case is not really i2c sensors in general, > > I2C was just the factual example I had, but you are right, the use case > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C, > the actual issue just happen to be easier to see when I2C devices, slower > than typical MMIO devices, are being used as input for the control. > > > because at least in some cases they work just fine AFAICS, but a > > platform with a control loop running in the kernel that depends on > > sensor reads carried out at a specific, approximately constant, rate > > that can be disturbed by user space checking the sensor temperature > > via sysfs at "wrong" times. If at the same time the user space > > program in question doesn't care about the most recent values reported > > by the sensor, it may very well use the values cached by the in-kernel > > control loop. > > That is right, the balance between supporting user space reads and > running the control timely is the actual original concern. The problem > fades out a bit when you have device reads in the us / ns time scale > and control update is in 100s of ms. But becomes more apparent on slower > devices, when reads are in ms and policy update is in the 100s ms, that is > why the I2C case was quoted. But nothing wrong with I2C, or supporting > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ, > the problem is on having to support the control in kernel and a read in > userspace that can create jitter to the control. > > And as you properly stated, for this use case, the userspace does not care > about the most recent value of the device, so that is why the change > proposes to give cached values. > > On the flip side though, there may be user space based policies that > require the most recent device value. But in this case, the expectation > is to disable the in kernel policy and switch the thermal zone to > mode == disabled. And that is also why this patch will go the path > to request the most recent device value when the /temp sysfs entry > is read and the mode is disabled. > > I would suggest to have an addition sysfs entry that sets the > thermal zone into cached mode or not, let's say for the sake of the > discussion, we call it 'cached_values', with default to 'not cached'. > This way, we could support: > > a) Default, current situation, where all reads in /temp are always backed up > with an actual device .get_temp(). Nothing changes here, keeps reading > under /temp, and so long system designer is satisfied with jittering, > no need to change anything. > > b) When one has control in kernel, and frequent userspace reads on /temp > but system designer wants to protect the control in kernel to avoid jittering. > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'. > Then userspace will get updated values as frequent as the kernel control has > and the kernel control will not be disturbed by frequent device reads. > > c) When one has control in userspace, and wants to have the most frequent > device read. Here, one can simply disable the in kernel control by > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set > to 'not cached'. Well in fact, the way I thought this originally in this patch > was to simply always read the device when /temp is read is 'mode' is 'disabled'. > > I believe you proposed to have another sysfs entry sysfs entry for reading cached temperature. > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work. > The only possible downside is to have two entries to read temperature. > > Strong opinions on any of the above? So what about adding a new zone attribute that can be used to specify the preferred caching time for the temperature? That is, if the time interval between two consecutive updates of the cached temperature value is less than the value of the new attribute, the cached temperature value will be returned by "temp". Otherwise, it will cause the sensor to be read and the value obtained from it will be returned to user space and cached. If the value of the new attribute is 0, everything will work as it does now (which will also need to be the default behavior).
Hi Rafael, On 30/06/2023 10:16, Rafael J. Wysocki wrote: > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: [ ... ] > So what about adding a new zone attribute that can be used to specify > the preferred caching time for the temperature? > > That is, if the time interval between two consecutive updates of the > cached temperature value is less than the value of the new attribute, > the cached temperature value will be returned by "temp". Otherwise, > it will cause the sensor to be read and the value obtained from it > will be returned to user space and cached. > > If the value of the new attribute is 0, everything will work as it > does now (which will also need to be the default behavior). I'm still not convinced about the feature. Eduardo provided some numbers but they seem based on the characteristics of the I2C, not to a real use case. Eduardo? Before adding more complexity in the thermal framework and yet another sysfs entry, it would be interesting to have an experiment and show the impact of both configurations, not from a timing point of view but with a temperature mitigation accuracy. Without a real use case, this feature does make really sense IMO.
Hi Daniel, On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote: > > > Hi Rafael, > > On 30/06/2023 10:16, Rafael J. Wysocki wrote: > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > [ ... ] > > > So what about adding a new zone attribute that can be used to specify > > the preferred caching time for the temperature? > > > > That is, if the time interval between two consecutive updates of the > > cached temperature value is less than the value of the new attribute, > > the cached temperature value will be returned by "temp". Otherwise, > > it will cause the sensor to be read and the value obtained from it > > will be returned to user space and cached. > > > > If the value of the new attribute is 0, everything will work as it > > does now (which will also need to be the default behavior). > > I'm still not convinced about the feature. > > Eduardo provided some numbers but they seem based on the characteristics > of the I2C, not to a real use case. Eduardo? > > Before adding more complexity in the thermal framework and yet another > sysfs entry, it would be interesting to have an experiment and show the > impact of both configurations, not from a timing point of view but with > a temperature mitigation accuracy. > > Without a real use case, this feature does make really sense IMO. I'm kind of unsure why you think that it is not a good idea in general to have a way to limit the rate of accessing a temperature sensor, for energy-efficiency reasons if nothing more.
On 30/06/2023 12:46, Rafael J. Wysocki wrote: > Hi Daniel, > > On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano > <daniel.lezcano@linaro.org> wrote: >> >> >> Hi Rafael, >> >> On 30/06/2023 10:16, Rafael J. Wysocki wrote: >>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: >> >> [ ... ] >> >>> So what about adding a new zone attribute that can be used to specify >>> the preferred caching time for the temperature? >>> >>> That is, if the time interval between two consecutive updates of the >>> cached temperature value is less than the value of the new attribute, >>> the cached temperature value will be returned by "temp". Otherwise, >>> it will cause the sensor to be read and the value obtained from it >>> will be returned to user space and cached. >>> >>> If the value of the new attribute is 0, everything will work as it >>> does now (which will also need to be the default behavior). >> >> I'm still not convinced about the feature. >> >> Eduardo provided some numbers but they seem based on the characteristics >> of the I2C, not to a real use case. Eduardo? >> >> Before adding more complexity in the thermal framework and yet another >> sysfs entry, it would be interesting to have an experiment and show the >> impact of both configurations, not from a timing point of view but with >> a temperature mitigation accuracy. >> >> Without a real use case, this feature does make really sense IMO. > > I'm kind of unsure why you think that it is not a good idea in general > to have a way to limit the rate of accessing a temperature sensor, for > energy-efficiency reasons if nothing more. I don't think it is not a good idea. I've no judgement with the proposed change. But I'm not convinced it is really useful, that is why having a real use case and some numbers showing that feature solves the issue would be nice. It is illogical we want a fast and accurate response on a specific hardware and then design it with slow sensors and contention prone bus. In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate is usually to monitor CPUs with very fast transitions. With a remote site, the monitoring rate would be much slower, so if there is a contention in the bus because a dumb process is reading constantly the temperature, then it should be negligible. All that are hypothesis, that is why having a real use case would help to figure out the temperature limit drift at mitigation time. Assuming it is really needed, I'm not sure that should be exported via sysfs. It is a driver issue and it may register the thermal zone with a parameter telling the userspace rate limit. On the other side, hwmon and thermal are connected. hwmon drivers register a thermal zone and thermal drivers add themselves in the hwmon sysfs directory. The temperature cache is handled in the driver level in the hwmon subsystems and we want to handle the temperature cache at the thermal sysfs level. How will we cope with this inconsistency? As a side note, slow drivers are usually going under drivers/hwmon.
On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote: > > > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote: > > > > > [cut] > > > > > > > Regardless of where the problem is etc, if my understanding of the > > > patch is correct, it is proposing to change the behavior of a > > > well-known sysfs interface in a way that is likely to be incompatible > > > with at least some of its users. This is an obvious no-go in kernel > > > development and I would expect you to be well aware of it. > > > > yeah I get it. > > > > > > > > IOW, if you want the cached value to be returned, a new interface (eg. > > > a new sysfs attribute) is needed. > > > > Yeah, I am fine with either a new sysfs entry to return the cached value, > > or a new sysfs entry to change the behavior of the existing /temp, as I > > mentioned before, either way works for me, if changing the existing one > > is too intrusive. > > > > > > > > And I think that the use case is not really i2c sensors in general, > > > > I2C was just the factual example I had, but you are right, the use case > > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C, > > the actual issue just happen to be easier to see when I2C devices, slower > > than typical MMIO devices, are being used as input for the control. > > > > > because at least in some cases they work just fine AFAICS, but a > > > platform with a control loop running in the kernel that depends on > > > sensor reads carried out at a specific, approximately constant, rate > > > that can be disturbed by user space checking the sensor temperature > > > via sysfs at "wrong" times. If at the same time the user space > > > program in question doesn't care about the most recent values reported > > > by the sensor, it may very well use the values cached by the in-kernel > > > control loop. > > > > That is right, the balance between supporting user space reads and > > running the control timely is the actual original concern. The problem > > fades out a bit when you have device reads in the us / ns time scale > > and control update is in 100s of ms. But becomes more apparent on slower > > devices, when reads are in ms and policy update is in the 100s ms, that is > > why the I2C case was quoted. But nothing wrong with I2C, or supporting > > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ, > > the problem is on having to support the control in kernel and a read in > > userspace that can create jitter to the control. > > > > And as you properly stated, for this use case, the userspace does not care > > about the most recent value of the device, so that is why the change > > proposes to give cached values. > > > > On the flip side though, there may be user space based policies that > > require the most recent device value. But in this case, the expectation > > is to disable the in kernel policy and switch the thermal zone to > > mode == disabled. And that is also why this patch will go the path > > to request the most recent device value when the /temp sysfs entry > > is read and the mode is disabled. > > > > I would suggest to have an addition sysfs entry that sets the > > thermal zone into cached mode or not, let's say for the sake of the > > discussion, we call it 'cached_values', with default to 'not cached'. > > This way, we could support: > > > > a) Default, current situation, where all reads in /temp are always backed up > > with an actual device .get_temp(). Nothing changes here, keeps reading > > under /temp, and so long system designer is satisfied with jittering, > > no need to change anything. > > > > b) When one has control in kernel, and frequent userspace reads on /temp > > but system designer wants to protect the control in kernel to avoid jittering. > > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'. > > Then userspace will get updated values as frequent as the kernel control has > > and the kernel control will not be disturbed by frequent device reads. > > > > c) When one has control in userspace, and wants to have the most frequent > > device read. Here, one can simply disable the in kernel control by > > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set > > to 'not cached'. Well in fact, the way I thought this originally in this patch > > was to simply always read the device when /temp is read is 'mode' is 'disabled'. > > > > I believe you proposed to have another sysfs entry sysfs entry for reading cached temperature. > > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work. > > The only possible downside is to have two entries to read temperature. > > > > Strong opinions on any of the above? > > So what about adding a new zone attribute that can be used to specify > the preferred caching time for the temperature? > > That is, if the time interval between two consecutive updates of the > cached temperature value is less than the value of the new attribute, > the cached temperature value will be returned by "temp". Otherwise, > it will cause the sensor to be read and the value obtained from it > will be returned to user space and cached. > > If the value of the new attribute is 0, everything will work as it > does now (which will also need to be the default behavior). Yeah, that makes sense to me. I can cook something up in the next version.
Hey Daniel, On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote: > > > > Hi Rafael, > > On 30/06/2023 10:16, Rafael J. Wysocki wrote: > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > [ ... ] > > > So what about adding a new zone attribute that can be used to specify > > the preferred caching time for the temperature? > > > > That is, if the time interval between two consecutive updates of the > > cached temperature value is less than the value of the new attribute, > > the cached temperature value will be returned by "temp". Otherwise, > > it will cause the sensor to be read and the value obtained from it > > will be returned to user space and cached. > > > > If the value of the new attribute is 0, everything will work as it > > does now (which will also need to be the default behavior). > > I'm still not convinced about the feature. > > Eduardo provided some numbers but they seem based on the characteristics > of the I2C, not to a real use case. Eduardo? Why I2C is not a real use case?
Hello, On Fri, Jun 30, 2023 at 02:09:44PM +0200, Daniel Lezcano wrote: > > > > On 30/06/2023 12:46, Rafael J. Wysocki wrote: > > Hi Daniel, > > > > On Fri, Jun 30, 2023 at 12:11 PM Daniel Lezcano > > <daniel.lezcano@linaro.org> wrote: > > > > > > > > > Hi Rafael, > > > > > > On 30/06/2023 10:16, Rafael J. Wysocki wrote: > > > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > > > [ ... ] > > > > > > > So what about adding a new zone attribute that can be used to specify > > > > the preferred caching time for the temperature? > > > > > > > > That is, if the time interval between two consecutive updates of the > > > > cached temperature value is less than the value of the new attribute, > > > > the cached temperature value will be returned by "temp". Otherwise, > > > > it will cause the sensor to be read and the value obtained from it > > > > will be returned to user space and cached. > > > > > > > > If the value of the new attribute is 0, everything will work as it > > > > does now (which will also need to be the default behavior). > > > > > > I'm still not convinced about the feature. > > > > > > Eduardo provided some numbers but they seem based on the characteristics > > > of the I2C, not to a real use case. Eduardo? > > > > > > Before adding more complexity in the thermal framework and yet another > > > sysfs entry, it would be interesting to have an experiment and show the > > > impact of both configurations, not from a timing point of view but with > > > a temperature mitigation accuracy. > > > > > > Without a real use case, this feature does make really sense IMO. > > > > I'm kind of unsure why you think that it is not a good idea in general > > to have a way to limit the rate of accessing a temperature sensor, for > > energy-efficiency reasons if nothing more. > > I don't think it is not a good idea. I've no judgement with the proposed > change. > > But I'm not convinced it is really useful, that is why having a real use > case and some numbers showing that feature solves the issue would be nice. > > It is illogical we want a fast and accurate response on a specific > hardware and then design it with slow sensors and contention prone bus. Totally agree, but at same time, this is real world :-) > > In Eduardo's example, we have 100ms monitoring rate on a I2C. This rate > is usually to monitor CPUs with very fast transitions. With a remote > site, the monitoring rate would be much slower, so if there is a > contention in the bus because a dumb process is reading constantly the > temperature, then it should be negligible. > > All that are hypothesis, that is why having a real use case would help > to figure out the temperature limit drift at mitigation time. Yeah, I guess the problem here is that you are assuming I2C is not a real use case, not sure why. But it is and very common design in fact. > > Assuming it is really needed, I'm not sure that should be exported via > sysfs. It is a driver issue and it may register the thermal zone with a > parameter telling the userspace rate limit. > > On the other side, hwmon and thermal are connected. hwmon drivers > register a thermal zone and thermal drivers add themselves in the hwmon > sysfs directory. The temperature cache is handled in the driver level in > the hwmon subsystems and we want to handle the temperature cache at the > thermal sysfs level. How will we cope with this inconsistency? Yeah, I do not see this, again, as where to handle cache type of design problem only. This is really a protective / defensive code on the thermal core to avoid userspace interfering on a kernel based control. I agree that drivers may be free to go and defend themselves against too frequent userspace requests, like they do, as you already shared a link in another email. But saying that it is up to the driver to do this is basically saying that the thermal subsystem do not care about their own threads being delayed by a too frequent reads on a sysfs entry created by the thermal subsystem, just because it is drivers responsability to cache. To that is a missing defensive code. > > As a side note, slow drivers are usually going under drivers/hwmon. Have you seen this code? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850 I also do not understand when you say slow drivers are usually going under drivers/hwmon, does it really matter? One can design a thermal zone that is connected to a hwmon device as input. Why would that be illogical? > > -- > <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs > > Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | > <http://twitter.com/#!/linaroorg> Twitter | > <http://www.linaro.org/linaro-blog/> Blog >
Eduardo, On 01/07/2023 03:49, Eduardo Valentin wrote: [ ... ] >> All that are hypothesis, that is why having a real use case would help >> to figure out the temperature limit drift at mitigation time. > > Yeah, I guess the problem here is that you are assuming I2C is not a real > use case, not sure why. But it is and very common design in fact. If it is so common you should be able to reproduce the issue and give numbers. At this point, what I read is "that may happen because I2C is slow and we may monitor it at an insane rate, so let's cache the value". >> Assuming it is really needed, I'm not sure that should be exported via >> sysfs. It is a driver issue and it may register the thermal zone with a >> parameter telling the userspace rate limit. >> >> On the other side, hwmon and thermal are connected. hwmon drivers >> register a thermal zone and thermal drivers add themselves in the hwmon >> sysfs directory. The temperature cache is handled in the driver level in >> the hwmon subsystems and we want to handle the temperature cache at the >> thermal sysfs level. How will we cope with this inconsistency? > > Yeah, I do not see this, again, as where to handle cache type of design problem only. > This is really a protective / defensive code on the thermal core to avoid > userspace interfering on a kernel based control. > > > I agree that drivers may be free to go and defend themselves against > too frequent userspace requests, like they do, as you already shared > a link in another email. But saying that it is up to the driver to do this > is basically saying that the thermal subsystem do not care about their > own threads being delayed by a too frequent reads on a sysfs entry > created by the thermal subsystem, just because it is drivers responsability > to cache. To that is a missing defensive code. No, the core code has not to be defensive against bad hardware design. If multiple processes are reading in an infinite loop the temperature, they will constantly take the lock, and as the monitoring thread is a CFS task, this one will be considered as the readers and be delayed, with probably a mitigation temperature drift. Here we have a missing defensive / optimized code against a DoS but it is unrelated to the hardware and the fix is not caching the value. >> As a side note, slow drivers are usually going under drivers/hwmon. > > Have you seen this code? > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517 > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850 Yes, and ? That is what I said, the hwmon and the thermal zone are connected. > I also do not understand when you say slow drivers are usually going under > drivers/hwmon, does it really matter? One can design a thermal zone > that is connected to a hwmon device as input. Why would that be illogical? I'm not saying it is illogical. I'm pointing the slow drivers are under hwmon subsystems and usually don't aim in-kernel mitigation. The get_temp ops is going through hwmon and the drivers may cache the values. So *if* there is an in-kernel mitigation, the value will be already cached usually. I do believe I raised some valid concerns regarding the approach. Could please take them into account instead of eluding them? 1. A real use case with temperature limit drift (easy to reproduce because it is very common) 2. How about the consistency between hwmon and thermal? (one driver but two ways to access the temperature - one may cache and the other not) Another question regarding the I2C example, if another subsystem is using the I2C, won't it take the bus lock and create the contention also? I mean it is possible to create a mitigation drift by reading constantly another sensor value (eg. voltage or whatever) ?
Hi Eduardo, On 01/07/2023 03:38, Eduardo Valentin wrote: > Hey Daniel, > > On Fri, Jun 30, 2023 at 12:11:25PM +0200, Daniel Lezcano wrote: >> >> >> >> Hi Rafael, >> >> On 30/06/2023 10:16, Rafael J. Wysocki wrote: >>> On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: >> >> [ ... ] >> >>> So what about adding a new zone attribute that can be used to specify >>> the preferred caching time for the temperature? >>> >>> That is, if the time interval between two consecutive updates of the >>> cached temperature value is less than the value of the new attribute, >>> the cached temperature value will be returned by "temp". Otherwise, >>> it will cause the sensor to be read and the value obtained from it >>> will be returned to user space and cached. >>> >>> If the value of the new attribute is 0, everything will work as it >>> does now (which will also need to be the default behavior). >> >> I'm still not convinced about the feature. >> >> Eduardo provided some numbers but they seem based on the characteristics >> of the I2C, not to a real use case. Eduardo? > > Why I2C is not a real use case? What I meant is "I2C is slow, ok. But what is the setup where the problem arises?"
On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote: > > > > Eduardo, > > On 01/07/2023 03:49, Eduardo Valentin wrote: > > [ ... ] > > > > All that are hypothesis, that is why having a real use case would help > > > to figure out the temperature limit drift at mitigation time. > > > > Yeah, I guess the problem here is that you are assuming I2C is not a real > > use case, not sure why. But it is and very common design in fact. > > If it is so common you should be able to reproduce the issue and give > numbers. At this point, what I read is "that may happen because I2C is > slow and we may monitor it at an insane rate, so let's cache the value". > > > > Assuming it is really needed, I'm not sure that should be exported via > > > sysfs. It is a driver issue and it may register the thermal zone with a > > > parameter telling the userspace rate limit. > > > > > > On the other side, hwmon and thermal are connected. hwmon drivers > > > register a thermal zone and thermal drivers add themselves in the hwmon > > > sysfs directory. The temperature cache is handled in the driver level in > > > the hwmon subsystems and we want to handle the temperature cache at the > > > thermal sysfs level. How will we cope with this inconsistency? > > > > Yeah, I do not see this, again, as where to handle cache type of design problem only. > > This is really a protective / defensive code on the thermal core to avoid > > userspace interfering on a kernel based control. > > > > > > I agree that drivers may be free to go and defend themselves against > > too frequent userspace requests, like they do, as you already shared > > a link in another email. But saying that it is up to the driver to do this > > is basically saying that the thermal subsystem do not care about their > > own threads being delayed by a too frequent reads on a sysfs entry > > created by the thermal subsystem, just because it is drivers responsability > > to cache. To that is a missing defensive code. > > No, the core code has not to be defensive against bad hardware design. I do not understand why you are calling this a bad hardware design. > > If multiple processes are reading in an infinite loop the temperature, > they will constantly take the lock, and as the monitoring thread is a > CFS task, this one will be considered as the readers and be delayed, > with probably a mitigation temperature drift. Here we have a missing > defensive / optimized code against a DoS but it is unrelated to the > hardware and the fix is not caching the value. I am not even going into the CFS discussion, yet. But if we go that direction, here we are having a case of a userspace process contending into a in kernel process, regardless of the userspace process priority. But again that is not the main point of discussion for this change. > > > > As a side note, slow drivers are usually going under drivers/hwmon. > > > > Have you seen this code? > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517 > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850 > > Yes, and ? I mean , I am sure can git grep for all occurences if that is what are looking for. > > That is what I said, the hwmon and the thermal zone are connected. > > > I also do not understand when you say slow drivers are usually going under > > drivers/hwmon, does it really matter? One can design a thermal zone > > that is connected to a hwmon device as input. Why would that be illogical? > > I'm not saying it is illogical. I'm pointing the slow drivers are under > hwmon subsystems and usually don't aim in-kernel mitigation. The > get_temp ops is going through hwmon and the drivers may cache the > values. So *if* there is an in-kernel mitigation, the value will be > already cached usually. Oh I see. yes, I totally give you that the most common case is to have the in kernel policy written with drivers under thermal, but neglecting the existence of drivers under /hwmon that connects into the thermal subsystem is not fair, is it? They are there, they connect to thermal subsystem and one can certainly have an in kernel control with hwmon drivers as input, I am not understanding why protecting against those cases make the change invalid or overcomplicates the subsystem design. > > I do believe I raised some valid concerns regarding the approach. Could > please take them into account instead of eluding them > > 1. A real use case with temperature limit drift (easy to reproduce > because it is very common) > > 2. How about the consistency between hwmon and thermal? (one driver but > two ways to access the temperature - one may cache and the other not) I am not eluding anything. But this conversation seams to not move forward because of the assumption that building in kernel control based on hwmon drivers is either wrong or uncommon. I fail to see that as the main argument to be discussed mainly because it is a problem that exists. Driver count is also not a meaningful metric to conclude if the problem is common or not. How many motherboard can you think that are out there that may have an lm75 to represent an actual temperature point in the PCB? Or an lm75 that may come from a PCI board? That is what I meant by common design. Seams to me that you are trying to make a point that this problem is not worth having a fix for, even if you already recognized the basic point of it, because (a) this is not happening MMIO based drivers which is the (today) common case for drivers under /thermal and (b) hwmon drivers are supposed to feed in only control in userspace. for both argument that seams to restrict the thermal subsystem to only MMIO base devices drivers to feed into CPU temperature control, which is a limited application of it in real world scenario, for any real life scenario really, mobile. BMC, or any embedded case. When in fact, the abstraction and desing on the thermal subsystem today is pure control of temperature and can be used in any application that does it. Limiting the application of the thermal subsystem to only MMIO based devices is, well, limiting :-). > > Another question regarding the I2C example, if another subsystem is > using the I2C, won't it take the bus lock and create the contention > also? I mean it is possible to create a mitigation drift by reading > constantly another sensor value (eg. voltage or whatever) ? Yes, if a flood of events in the bus happen, then the drift will also happen, specially if reads are aligned with the in kernel monitoring thread update / call to .get_temp(). > > > -- > <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs > > Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | > <http://twitter.com/#!/linaroorg> Twitter | > <http://www.linaro.org/linaro-blog/> Blog >
On Sat, Jul 1, 2023 at 3:37 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > On Fri, Jun 30, 2023 at 10:16:38AM +0200, Rafael J. Wysocki wrote: > > > > > > > > On Wed, Jun 28, 2023 at 11:10 PM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > > > On Fri, Jun 23, 2023 at 07:31:43PM +0200, Rafael J. Wysocki wrote: > > > > > > > > [cut] > > > > > > > > > > Regardless of where the problem is etc, if my understanding of the > > > > patch is correct, it is proposing to change the behavior of a > > > > well-known sysfs interface in a way that is likely to be incompatible > > > > with at least some of its users. This is an obvious no-go in kernel > > > > development and I would expect you to be well aware of it. > > > > > > yeah I get it. > > > > > > > > > > > IOW, if you want the cached value to be returned, a new interface (eg. > > > > a new sysfs attribute) is needed. > > > > > > Yeah, I am fine with either a new sysfs entry to return the cached value, > > > or a new sysfs entry to change the behavior of the existing /temp, as I > > > mentioned before, either way works for me, if changing the existing one > > > is too intrusive. > > > > > > > > > > > And I think that the use case is not really i2c sensors in general, > > > > > > I2C was just the factual example I had, but you are right, the use case > > > is not isolated to I2C sensor. Rather, to be clear I am not blaming I2C, > > > the actual issue just happen to be easier to see when I2C devices, slower > > > than typical MMIO devices, are being used as input for the control. > > > > > > > because at least in some cases they work just fine AFAICS, but a > > > > platform with a control loop running in the kernel that depends on > > > > sensor reads carried out at a specific, approximately constant, rate > > > > that can be disturbed by user space checking the sensor temperature > > > > via sysfs at "wrong" times. If at the same time the user space > > > > program in question doesn't care about the most recent values reported > > > > by the sensor, it may very well use the values cached by the in-kernel > > > > control loop. > > > > > > That is right, the balance between supporting user space reads and > > > running the control timely is the actual original concern. The problem > > > fades out a bit when you have device reads in the us / ns time scale > > > and control update is in 100s of ms. But becomes more apparent on slower > > > devices, when reads are in ms and policy update is in the 100s ms, that is > > > why the I2C case was quoted. But nothing wrong with I2C, or supporting > > > I2C on the thermal subsystem as we do today via the hwmon interface REGISTER_TZ, > > > the problem is on having to support the control in kernel and a read in > > > userspace that can create jitter to the control. > > > > > > And as you properly stated, for this use case, the userspace does not care > > > about the most recent value of the device, so that is why the change > > > proposes to give cached values. > > > > > > On the flip side though, there may be user space based policies that > > > require the most recent device value. But in this case, the expectation > > > is to disable the in kernel policy and switch the thermal zone to > > > mode == disabled. And that is also why this patch will go the path > > > to request the most recent device value when the /temp sysfs entry > > > is read and the mode is disabled. > > > > > > I would suggest to have an addition sysfs entry that sets the > > > thermal zone into cached mode or not, let's say for the sake of the > > > discussion, we call it 'cached_values', with default to 'not cached'. > > > This way, we could support: > > > > > > a) Default, current situation, where all reads in /temp are always backed up > > > with an actual device .get_temp(). Nothing changes here, keeps reading > > > under /temp, and so long system designer is satisfied with jittering, > > > no need to change anything. > > > > > > b) When one has control in kernel, and frequent userspace reads on /temp > > > but system designer wants to protect the control in kernel to avoid jittering. > > > Just keep reading from /temp but set the new sysfs property 'cached_values' to 'cached'. > > > Then userspace will get updated values as frequent as the kernel control has > > > and the kernel control will not be disturbed by frequent device reads. > > > > > > c) When one has control in userspace, and wants to have the most frequent > > > device read. Here, one can simply disable the in kernel control by > > > setting the 'mode' sysfs entry to 'disabled', and making sure the new sysfs property is set > > > to 'not cached'. Well in fact, the way I thought this originally in this patch > > > was to simply always read the device when /temp is read is 'mode' is 'disabled'. > > > > > > I believe you proposed to have another sysfs entry sysfs entry for reading cached temperature. > > > Something like 'temp_cached'. Thinking of it, as I mentioned before, it will work. > > > The only possible downside is to have two entries to read temperature. > > > > > > Strong opinions on any of the above? > > > > So what about adding a new zone attribute that can be used to specify > > the preferred caching time for the temperature? > > > > That is, if the time interval between two consecutive updates of the > > cached temperature value is less than the value of the new attribute, > > the cached temperature value will be returned by "temp". Otherwise, > > it will cause the sensor to be read and the value obtained from it > > will be returned to user space and cached. > > > > If the value of the new attribute is 0, everything will work as it > > does now (which will also need to be the default behavior). > > Yeah, that makes sense to me. I can cook something up in the next version. Yes, please. Also I think that the $subject patch was inspired by observations made on a specific system in practice. It would be good to say what the system is and include some numbers illustrating how severe the problem is (that is, what is expected and what is observed and why the discrepancy is attributed to the effect of direct sensor accesses from user space via sysfs).
On Thu, Jul 06, 2023 at 03:22:55PM +0200, Rafael J. Wysocki wrote: > > > > On Thu, Jul 6, 2023 at 12:50 AM Eduardo Valentin <evalenti@kernel.org> wrote: > > > > On Sat, Jul 01, 2023 at 09:28:31AM +0200, Daniel Lezcano wrote: > > > > > > > > > > > > Eduardo, > > > > > > On 01/07/2023 03:49, Eduardo Valentin wrote: > > > > > > [ ... ] > > > > > > > > All that are hypothesis, that is why having a real use case would help > > > > > to figure out the temperature limit drift at mitigation time. > > > > > > > > Yeah, I guess the problem here is that you are assuming I2C is not a real > > > > use case, not sure why. But it is and very common design in fact. > > > > > > If it is so common you should be able to reproduce the issue and give > > > numbers. At this point, what I read is "that may happen because I2C is > > > slow and we may monitor it at an insane rate, so let's cache the value". > > > > > > > > Assuming it is really needed, I'm not sure that should be exported via > > > > > sysfs. It is a driver issue and it may register the thermal zone with a > > > > > parameter telling the userspace rate limit. > > > > > > > > > > On the other side, hwmon and thermal are connected. hwmon drivers > > > > > register a thermal zone and thermal drivers add themselves in the hwmon > > > > > sysfs directory. The temperature cache is handled in the driver level in > > > > > the hwmon subsystems and we want to handle the temperature cache at the > > > > > thermal sysfs level. How will we cope with this inconsistency? > > > > > > > > Yeah, I do not see this, again, as where to handle cache type of design problem only. > > > > This is really a protective / defensive code on the thermal core to avoid > > > > userspace interfering on a kernel based control. > > > > > > > > > > > > I agree that drivers may be free to go and defend themselves against > > > > too frequent userspace requests, like they do, as you already shared > > > > a link in another email. But saying that it is up to the driver to do this > > > > is basically saying that the thermal subsystem do not care about their > > > > own threads being delayed by a too frequent reads on a sysfs entry > > > > created by the thermal subsystem, just because it is drivers responsability > > > > to cache. To that is a missing defensive code. > > > > > > No, the core code has not to be defensive against bad hardware design. > > > > I do not understand why you are calling this a bad hardware design. > > > > > > > > If multiple processes are reading in an infinite loop the temperature, > > > they will constantly take the lock, and as the monitoring thread is a > > > CFS task, this one will be considered as the readers and be delayed, > > > with probably a mitigation temperature drift. Here we have a missing > > > defensive / optimized code against a DoS but it is unrelated to the > > > hardware and the fix is not caching the value. > > > > I am not even going into the CFS discussion, yet. But if we go that direction, > > here we are having a case of a userspace process contending into > > a in kernel process, regardless of the userspace process priority. > > > > But again that is not the main point of discussion for this change. > > > > > > > > > > As a side note, slow drivers are usually going under drivers/hwmon. > > > > > > > > Have you seen this code? > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/lm75.c#n517 > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/hwmon/hwmon.c#n850 > > > > > > Yes, and ? > > > > > > I mean , I am sure can git grep for all occurences if that is what are looking for. > > > > > > That is what I said, the hwmon and the thermal zone are connected. > > > > > > > I also do not understand when you say slow drivers are usually going under > > > > drivers/hwmon, does it really matter? One can design a thermal zone > > > > that is connected to a hwmon device as input. Why would that be illogical? > > > > > > I'm not saying it is illogical. I'm pointing the slow drivers are under > > > hwmon subsystems and usually don't aim in-kernel mitigation. The > > > get_temp ops is going through hwmon and the drivers may cache the > > > values. So *if* there is an in-kernel mitigation, the value will be > > > already cached usually. > > > > Oh I see. yes, I totally give you that the most common case is to have > > the in kernel policy written with drivers under thermal, but neglecting > > the existence of drivers under /hwmon that connects into the thermal subsystem > > is not fair, is it? They are there, they connect to thermal subsystem > > and one can certainly have an in kernel control with hwmon drivers as input, > > I am not understanding why protecting against those cases make the change > > invalid or overcomplicates the subsystem design. > > > > > > > > I do believe I raised some valid concerns regarding the approach. Could > > > please take them into account instead of eluding them > > > > > > 1. A real use case with temperature limit drift (easy to reproduce > > > because it is very common) > > > > > > 2. How about the consistency between hwmon and thermal? (one driver but > > > two ways to access the temperature - one may cache and the other not) > > > > I am not eluding anything. But this conversation seams to not move forward > > because of the assumption that building in kernel control based on > > hwmon drivers is either wrong or uncommon. I fail to see that as the > > main argument to be discussed mainly because it is a problem that exists. > > Driver count is also not a meaningful metric to conclude if the problem > > is common or not. How many motherboard can you think that are out there > > that may have an lm75 to represent an actual temperature point in the > > PCB? Or an lm75 that may come from a PCI board? That is what I meant > > by common design. > > > > Seams to me that you are trying to make a point that this problem > > is not worth having a fix for, even if you already recognized the > > basic point of it, because (a) this is not happening MMIO based > > drivers which is the (today) common case for drivers under /thermal > > and (b) hwmon drivers are supposed to feed in only control in userspace. > > > > for both argument that seams to restrict the thermal subsystem > > to only MMIO base devices drivers to feed into CPU temperature control, > > which is a limited application of it in real world scenario, for > > any real life scenario really, mobile. BMC, or any embedded case. > > When in fact, the abstraction and desing on the thermal subsystem today > > is pure control of temperature and can be used in any application > > that does it. Limiting the application of the thermal subsystem to > > only MMIO based devices is, well, limiting :-). > > > > > > > > Another question regarding the I2C example, if another subsystem is > > > using the I2C, won't it take the bus lock and create the contention > > > also? I mean it is possible to create a mitigation drift by reading > > > constantly another sensor value (eg. voltage or whatever) ? > > > > Yes, if a flood of events in the bus happen, then the drift will also happen, > > specially if reads are aligned with the in kernel monitoring thread > > update / call to .get_temp(). > > This is all in general terms, though. > > I think that Daniel has been asking for a specific example, because if > a new sysfs i/f is added with a certain issue in mind and then it > turns out to be never used by anyone, because the issue is theoretical > or not severe enough for anyone to care, it is a pure maintenance > burden. It would be good to have at least some assurance that this > will not be the case. > > Daniel, even if you think that the hardware in question is > misdesigned, Linux has a long record of supporting misdesigned > hardware, so this wouldn't be the first time ever. However, I do > agree that it would be good to have a specific example supporting this > case. I will add some examples in the next version with the numbers I quoted in this thread. Also, generally speaking, this series uses the thermal subsystem mostly for board monitoring use cases, which includes applications like BMC, and that has generated some churn because the current set of existing drivers mainly targets in-kernel control for the running system CPU temperature monitoring. However, even since time of thermal device tree binding conception, this limitation was never part of the original thoughts on enhancing the thermal subsystem from ACPI only to a more general solution. For this reason, next iteration I will start with enhancing documentation. I took a brief look in the existing master tree and plenty of the existing documentation references to board monitoring have been lost in reshuffle of thermal documentation files, including thermal device tree bindings, which is a shame. I will recover those examples with refreshed examples. Thermal doc has been not the prettiest anyways so it is always good to spend some time enhancing it.
diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c index b6daea2398da..a240c58d9e08 100644 --- a/drivers/thermal/thermal_sysfs.c +++ b/drivers/thermal/thermal_sysfs.c @@ -35,12 +35,23 @@ static ssize_t temp_show(struct device *dev, struct device_attribute *attr, char *buf) { struct thermal_zone_device *tz = to_thermal_zone(dev); - int temperature, ret; - - ret = thermal_zone_get_temp(tz, &temperature); + int temperature; - if (ret) - return ret; + /* + * don't force new update from external reads + * This way we avoid messing up with time constraints. + */ + if (tz->mode == THERMAL_DEVICE_DISABLED) { + int r; + + r = thermal_zone_get_temp(tz, &temperature); /* holds tz->lock*/ + if (r) + return r; + } else { + mutex_lock(&tz->lock); + temperature = tz->temperature; + mutex_unlock(&tz->lock); + } return sprintf(buf, "%d\n", temperature); }