[RFC,0/3] New thermal interface allowing IPA to get max power

Return-Path: <linux-pm-owner@kernel.org>
From: Lukasz Luba <lukasz.luba@arm.com>
To: linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org
Cc: vireshk@kernel.org, rafael@kernel.org, daniel.lezcano@linaro.org,
    Dietmar.Eggemann@arm.com, lukasz.luba@arm.com, amitk@kernel.org,
    rui.zhang@intel.com, cw00.choi@samsung.com, myungjoo.ham@samsung.com,
    kyungmin.park@samsung.com
Subject: [RFC][PATCH 0/3] New thermal interface allowing IPA to get max power
Date: Tue, 26 Jan 2021 10:39:58 +0000
Message-Id: <20210126104001.20361-1-lukasz.luba@arm.com>
Precedence: bulk

Lukasz Luba Jan. 26, 2021, 10:39 a.m. UTC

Hi all,

This patch set tries to add a feature missing from the Intelligent Power
Allocation (IPA) governor: awareness of the frequency limit set by user space.
A user can set the max allowed frequency for a given device, which affects
its max allowed power. In the current design there is no mechanism to figure
this out. IPA must know the maximum allowed power for every device, because
it is used for the proper power split and divvy-up. When the user limit for
max frequency is not known, IPA assumes it is the highest possible frequency,
which causes a wrong power split across the devices.
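
To make the effect concrete, here is a minimal sketch of a proportional power split. It is a simplified model and not the governor's code: the function name split_power(), the mW units and all numbers in the usage note are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified model of a proportional power split (not the kernel code).
 * Each actor first gets a share of the budget proportional to its request.
 * Anything above an actor's max power is pooled and handed to the actors
 * that still have headroom, in proportion to that headroom.
 */
void split_power(const uint32_t *req, const uint32_t *max,
                 uint32_t *granted, size_t n, uint32_t budget)
{
    uint64_t total_req = 0, spare = 0, headroom = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total_req += req[i];
    if (total_req == 0)
        total_req = 1; /* avoid dividing by zero */

    for (i = 0; i < n; i++) {
        granted[i] = (uint32_t)((uint64_t)req[i] * budget / total_req);
        if (granted[i] > max[i]) {
            spare += granted[i] - max[i];
            granted[i] = max[i];
        }
        headroom += max[i] - granted[i];
    }

    if (headroom == 0)
        return;
    if (spare > headroom)
        spare = headroom;

    /* Hand the pooled surplus to actors that can still use it. */
    for (i = 0; i < n; i++)
        granted[i] += (uint32_t)((uint64_t)(max[i] - granted[i]) * spare / headroom);
}
```

With requests of 2000 and 3000 mW and a 7000 mW budget, telling the split that the second actor can take 6000 mW grants it 4200 mW. If a user limit really caps that actor at 3000 mW, the extra 1200 mW is wasted; passing the real maximum instead moves it to the first actor.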

This new mechanism provides the max allowed frequency to the thermal
framework, and then the max allowed power to IPA.
It is implemented this way because there is currently no way to retrieve
the user limit from PM QoS without uncapping the local thermal limit and
reading the next value. That would be a heavy operation, since it would
have to be done at every polling interval (e.g. 50ms).
Also, the value stored in PM QoS can differ from the real OPP 'rate', so
it would still need converting into a proper OPP for comparison with the EM.
Furthermore, uncapping the device in thermal just to check the user freq
limit is not the safest approach.
Thus, this simple implementation moves the calculation of the proper
frequency to the sysfs write code, since it's called less often. The value
is then used as-is in the thermal framework without any hassle.
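
The idea of doing the conversion once on the write path can be sketched as below. This is only an illustration under stated assumptions: struct dev_limits, store_user_max() and get_user_max() are hypothetical names and do not come from the patches.

```c
#include <stddef.h>

/* Hypothetical per-device state; the field names are illustrative only. */
struct dev_limits {
    const unsigned long *opp_freqs; /* OPP rates in Hz, ascending, >= 1 entry */
    size_t nr_opps;
    unsigned long user_max_freq;    /* cached OPP rate matching the user limit */
};

/*
 * Slow path: runs on a (rare) sysfs write. Rounds the request down to the
 * highest OPP that does not exceed it and caches the result.
 */
void store_user_max(struct dev_limits *d, unsigned long requested_hz)
{
    unsigned long best = d->opp_freqs[0]; /* never go below the lowest OPP */
    size_t i;

    for (i = 0; i < d->nr_opps; i++)
        if (d->opp_freqs[i] <= requested_hz)
            best = d->opp_freqs[i];
    d->user_max_freq = best;
}

/* Fast path: runs on every polling tick; no search and no conversion. */
unsigned long get_user_max(const struct dev_limits *d)
{
    return d->user_max_freq;
}
```

The search cost is paid only when the limit changes, while the reader that runs at every polling interval gets a value that is already a valid OPP rate.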

As it's an RFC, it still lacks the cpufreq sysfs implementation, but that
would be addressed if all agree.

Regards,
Lukasz Luba

Lukasz Luba (3):
  PM /devfreq: add user frequency limits into devfreq struct
  thermal: devfreq_cooling: add new callback to get user limit for min
    state
  thermal: power_allocator: get proper max power limited by user

 drivers/devfreq/devfreq.c             | 41 ++++++++++++++++++++++++---
 drivers/thermal/devfreq_cooling.c     | 33 +++++++++++++++++++++
 drivers/thermal/gov_power_allocator.c | 17 +++++++++--
 include/linux/devfreq.h               |  4 +++
 include/linux/thermal.h               |  1 +
 5 files changed, 90 insertions(+), 6 deletions(-)

Viresh Kumar Jan. 27, 2021, 9:15 a.m. UTC | #1

On 26-01-21, 10:39, Lukasz Luba wrote:
> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
> be addressed if all agree.

Not commenting on the whole stuff but if you ever need something for cpufreq, it
is already there. Look for these.

store_one(scaling_min_freq, min);
store_one(scaling_max_freq, max);

Hopefully they will work just fine.

Lukasz Luba Jan. 27, 2021, 10:11 a.m. UTC | #2

On 1/27/21 9:15 AM, Viresh Kumar wrote:
> On 26-01-21, 10:39, Lukasz Luba wrote:
>> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
>> be addressed if all agree.
> 
> Not commenting on the whole stuff but if you ever need something for cpufreq, it
> is already there. Look for these.
> 
> store_one(scaling_min_freq, min);
> store_one(scaling_max_freq, max);
> 
> Hopefully they will work just fine.
> 

So, can I assume you don't mind me plumbing it into these two?

Yes, I know them and the tricky macro. I just wanted to avoid
one patch for this macro and one patch for cpufreq_cooling.c,
which would use it.

If you agree, Chanwoo agrees for devfreq, and Daniel agrees
to the new callback in the cooling device, then I will continue
by adding the missing patches for the cpufreq-cooling part.

Regards,
Lukasz

Lukasz Luba Feb. 1, 2021, 11:23 a.m. UTC | #3

Daniel, Chanwoo

Gentle ping. Have you had a chance to check these patches?


Rafael J. Wysocki Feb. 1, 2021, 2:19 p.m. UTC | #4

On Tue, Jan 26, 2021 at 11:40 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi all,
>
> This patch set tries to add the missing feature in the Intelligent Power
> Allocation (IPA) governor which is: frequency limit set by user space.
> User can set max allowed frequency for a given device which has impact on
> max allowed power.

If there is more than one frequency that can be limited for the given
device, are you going to add a limit knob for each of them?

> In current design there is no mechanism to figure this
> out. IPA must know the maximum allowed power for every device. It is then
> used for proper power split and divvy-up. When the user limit for max
> frequency is not know, IPA assumes it is the highest possible frequency.
> It causes wrong power split across the devices.

Am I right in thinking that this depends on the Energy Model?

> This new mechanism provides the max allowed frequency to the thermal
> framework and then max allowed power to the IPA.
> The implementation is done in this way because currently there is no way
> to retrieve the limits from the PM QoS, without uncapping the local
> thermal limit and reading the next value.

The above is unclear.  What PM QoS limit are you referring to in the
first place?

> It would be a heavy way of
> doing these things, since it should be done every polling time (e.g. 50ms).
> Also, the value stored in PM QoS can be different than the real OPP 'rate'
> so still would need conversion into proper OPP for comparison with EM.
> Furthermore, uncapping the device in thermal just to check the user freq
> limit is not the safest way.
> Thus, this simple implementation moves the calculation of the proper
> frequency to the sysfs write code, since it's called less often. The value
> is then used as-is in the thermal framework without any hassle.
>
> As it's a RFC, it still misses the cpufreq sysfs implementation,

What exactly do you mean by this?

> but would be addressed if all agree.


Depending on the answers above.

But my general comment would be that it might turn out to be
unrealistic to expect user space to know what frequency limit to use
to get the desired result in terms of constraining power.

Daniel Lezcano Feb. 1, 2021, 2:21 p.m. UTC | #5

Hi Lukasz,

On 01/02/2021 12:23, Lukasz Luba wrote:
> Daniel, Chanwoo
>
> Gentle ping. Have you have a chance to check these patches?

I will review the patches in a couple of days

  -- Daniel





-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Lukasz Luba Feb. 1, 2021, 4:37 p.m. UTC | #6

Hi Rafael,

On 2/1/21 2:19 PM, Rafael J. Wysocki wrote:
> On Tue, Jan 26, 2021 at 11:40 AM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi all,
>>
>> This patch set tries to add the missing feature in the Intelligent Power
>> Allocation (IPA) governor which is: frequency limit set by user space.
>> User can set max allowed frequency for a given device which has impact on
>> max allowed power.
>
> If there is more than one frequency that can be limited for the given
> device, are you going to add a limit knob for each of them?

I might have been unclear. I was referring to the normal sysfs
scaling_max_freq, which sets the max frequency for a CPU:

echo XYZ > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq

It is similar for a devfreq device, like a GPU.

>> In current design there is no mechanism to figure this
>> out. IPA must know the maximum allowed power for every device. It is then
>> used for proper power split and divvy-up. When the user limit for max
>> frequency is not know, IPA assumes it is the highest possible frequency.
>> It causes wrong power split across the devices.
>
> Do I think correctly that this depends on the Energy Model?

Not directly, but IPA uses the max freq to ask the EM for the max power. The
issue is that I don't know this 'max freq' for a given device, because the
user might have set a limit for that device. In that case IPA still blindly
picks up the power for the highest frequency.
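
As a rough sketch of that lookup, the function below returns the power of the highest performance state allowed by a frequency cap. The table layout, the name max_power_mw() and the numbers in the usage note are made up for illustration; this is not the Energy Model API.

```c
#include <stddef.h>

/* One row of a hypothetical, simplified energy-model table. */
struct perf_state {
    unsigned long freq_khz;
    unsigned long power_mw;
};

/*
 * Return the power of the highest performance state whose frequency does
 * not exceed max_freq_khz. The table must be sorted by ascending frequency
 * and contain at least one entry; the lowest state is the fallback.
 */
unsigned long max_power_mw(const struct perf_state *tbl, size_t n,
                           unsigned long max_freq_khz)
{
    unsigned long power = tbl[0].power_mw;
    size_t i;

    for (i = 0; i < n; i++)
        if (tbl[i].freq_khz <= max_freq_khz)
            power = tbl[i].power_mw;
    return power;
}
```

For an invented four-entry GPU table ending at 1000000 kHz / 6000 mW with a 600000 kHz / 3000 mW entry, asking with the hardware maximum gives 6000 mW, while asking with a 600000 kHz user limit gives 3000 mW.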

>> This new mechanism provides the max allowed frequency to the thermal
>> framework and then max allowed power to the IPA.
>> The implementation is done in this way because currently there is no way
>> to retrieve the limits from the PM QoS, without uncapping the local
>> thermal limit and reading the next value.
>
> The above is unclear.  What PM QoS limit are you referring to in the
> first place?

The PM QoS which we use in thermal for setting the frequency limits,
for cpufreq_cooling [1] and for devfreq_cooling [2]. I am able to read
that PM QoS value, but it is the lowest of the requested limits, not
necessarily the one set by the user.
Example:
2000MHz
1800MHz <----- user set this to 'max freq'
1400MHz <----- thermal set that to 'max freq'

Then PM QoS would give me 1400MHz, because it is the effective limit for
the max freq.

That's why I said that PM QoS is not able to give me the user limit,
unless I revert the capping for that device in IPA.
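
The example can be modelled as a plain minimum over all max-frequency requests. This is a sketch of the aggregation idea only, with a hypothetical function name; it is not the PM QoS API.

```c
#include <stddef.h>

/*
 * Effective max frequency: the smallest of all max-frequency requests.
 * Returns no_limit when there are no requests or none is below it.
 */
unsigned long effective_max_freq(const unsigned long *requests, size_t n,
                                 unsigned long no_limit)
{
    unsigned long eff = no_limit;
    size_t i;

    for (i = 0; i < n; i++)
        if (requests[i] < eff)
            eff = requests[i];
    return eff;
}
```

With the user request at 1800 and the thermal request at 1400 (MHz), the aggregate is 1400. The user value only becomes visible once the thermal request is taken out of the set, which is the 'uncapping' described above.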

>> It would be a heavy way of
>> doing these things, since it should be done every polling time (e.g. 50ms).
>> Also, the value stored in PM QoS can be different than the real OPP 'rate'
>> so still would need conversion into proper OPP for comparison with EM.
>> Furthermore, uncapping the device in thermal just to check the user freq
>> limit is not the safest way.
>> Thus, this simple implementation moves the calculation of the proper
>> frequency to the sysfs write code, since it's called less often. The value
>> is then used as-is in the thermal framework without any hassle.
>>
>> As it's a RFC, it still misses the cpufreq sysfs implementation,
>
> What exactly do you mean by this?

I haven't modified cpufreq.c and cpufreq_cooling.c because
maybe for CPUs there is a way to solve it differently, or you might
not want to modify the CPU code at all.

>> but would be addressed if all agree.
>
> Depending on the answers above.
>
> But my general comment would be that it might turn out to be
> unrealistic to expect user space to know what frequency limit to use
> to get the desired result in terms of constraining power.

There are scenarios where middleware (which is aware of what is in
the foreground on a mobile device) might limit the GPU max freq, so as
not to burn power on the highest OPPs.

Regards,
Lukasz

[1] https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpufreq_cooling.c#L443
[2] https://elixir.bootlin.com/linux/latest/source/drivers/thermal/devfreq_cooling.c#L106

Lukasz Luba Feb. 1, 2021, 4:37 p.m. UTC | #7

Hi Daniel,

On 2/1/21 2:21 PM, Daniel Lezcano wrote:
> Hi Lukasz,
>
> On 01/02/2021 12:23, Lukasz Luba wrote:
>> Daniel, Chanwoo
>>
>> Gentle ping. Have you have a chance to check these patches?
>
> I will review the patches in a couple of days

Thank you, I will wait then.

Regards,
Lukasz

>    -- Daniel

Chanwoo Choi Feb. 2, 2021, 9:31 a.m. UTC | #8

Hi Lukasz,

I'll review this patchset by tomorrow.

Thanks.
Chanwoo Choi 



-- 
Best Regards,
Chanwoo Choi
Samsung Electronics

Lukasz Luba Feb. 2, 2021, 9:56 a.m. UTC | #9

On 2/2/21 9:31 AM, Chanwoo Choi wrote:
> Hi Lukasz,
>
> I'll review this patchset until tomorrow.

Thank you Chanwoo, I will wait then.

Lukasz

> Thanks.
> Chanwoo Choi

Daniel Lezcano Feb. 22, 2021, 10:22 a.m. UTC | #10

Hi Lukasz,

Sorry for the delay, it took more time to finish my current work before
commenting on these patches.

On 26/01/2021 11:39, Lukasz Luba wrote:
> Hi all,
>
> This patch set tries to add the missing feature in the Intelligent Power
> Allocation (IPA) governor which is: frequency limit set by user space.

It is unclear whether we are talking about limiting the frequency of the
DVFS device by setting the hardware limit (min-max freq). If that is the
case, then it is an energy model change, and all users of the energy model
must be notified about this change. But I don't see why userspace would
want to change that.

If we just want to set a frequency limit, then that is what we are doing
with the DTPM framework via power numbers.

> User can set max allowed frequency for a given device which has impact on
> max allowed power. In current design there is no mechanism to figure this
> out. IPA must know the maximum allowed power for every device. It is then
> used for proper power split and divvy-up. When the user limit for max
> frequency is not know, IPA assumes it is the highest possible frequency.
> It causes wrong power split across the devices.

That is because IPA introduced the power rebalancing between devices
belonging to the same thermal zone, so the feature was very use-case
specific.

The DTPM is supposed to solve this by providing a unified uW unit to
act on the different power-capable devices in a generic way.

Today DTPM is mapped to userspace using the powercap framework, but
adding a kernel API to let other subsystems act on it directly is being
considered. Maybe you can add those and call them from IPA directly, so
the governor makes the power decision and asks the DTPM to do the changes.

Conceptually, that would be much more consistent and would remove
duplicated code IMO.

Maybe create a power_qos framework to unify the units ...

> This new mechanism provides the max allowed frequency to the thermal
> framework and then max allowed power to the IPA.
> The implementation is done in this way because currently there is no way
> to retrieve the limits from the PM QoS, without uncapping the local
> thermal limit and reading the next value. It would be a heavy way of
> doing these things, since it should be done every polling time (e.g. 50ms).
>
> Also, the value stored in PM QoS can be different than the real OPP 'rate'
> so still would need conversion into proper OPP for comparison with EM.
> Furthermore, uncapping the device in thermal just to check the user freq
> limit is not the safest way.
> Thus, this simple implementation moves the calculation of the proper
> frequency to the sysfs write code, since it's called less often. The value
> is then used as-is in the thermal framework without any hassle.

Sounds like the DTPM is doing exactly that, no?

> As it's a RFC, it still misses the cpufreq sysfs implementation, but would
> be addressed if all agree.

We are talking about power, frequency, an in-kernel governor, and userspace
having knowledge of the max frequency limit to set power.

I'm a bit lost. What is the problem we want to solve here?

Maybe I'm missing something. Is it possible to share a scenario where
userspace acts on the devfreq device and IPA takes a decision, to
illustrate your proposal?


Lukasz Luba Feb. 22, 2021, 12:10 p.m. UTC | #11

Hi Daniel,

On 2/22/21 10:22 AM, Daniel Lezcano wrote:
> Hi Lukasz,
>
> sorry for the delay, it took more time to finish my current work before
> commenting these patches.

No worries, thank you for looking at this.

> We are talking about power, frequency, in-kernel governor, userspace
> having knowledge of max frequency limit to set power.
>
> I'm a bit lost. What is the problem we want to solve here ?
>
> May be I'm missing something. Is it possible to share a scenario where
> the userspace acts on the devfreq and the IPA taking decision to
> illustrate your proposal ?

Sure, here is a description of the configuration and the scenario in which
the issue is present.
SoC with 2 CPU clusters (consuming 1W (little cluster) and 3W (big
cluster) at max freq, plenty of OPPs available) and
1 GPU (consuming ~6W at max, a few OPPs).

Scenario:
IPA is working because the temperature crossed the 1st threshold, and it
tries to control the system to 'converge' to the 2nd temp threshold. It
checks the actors' max possible power [1], gets the current power,
calculates the current budget, splits that budget and grants power across
the actors, so the max allowed frequency is set via QoS.

The state2power() callback called at [1] with argument '0' returns
the power from the EM for the highest OPP. This is fine in most cases. That
power information is used in lines 359 and 364 during the split.

If user space (the aware middleware) wants to switch into a different
power-performance mode, e.g. power-saving, it writes into the device sysfs
to limit the max allowed freq. IPA does not know about it and makes
wrong decisions. It's an issue for GPUs (but also CPUs), which can
consume a lot of power at higher freq. For example, to limit this 6W to
3W, a simple freq write via sysfs is enough, but IPA is completely
unaware of that (as you can see in the code).
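
One way to picture what the governor would need is a mapping from the user's max frequency to the lowest cooling state it may ask about, instead of always using state '0'. The sketch below assumes that cooling state 0 corresponds to the highest frequency; the function name and table are hypothetical and are not taken from the patches.

```c
#include <stddef.h>

/*
 * Map a user max-frequency limit to the lowest cooling state that honours it.
 * freqs_desc[] lists the device frequencies from highest (state 0) to
 * lowest (state n - 1). If even the lowest frequency exceeds the limit,
 * the deepest state n - 1 is returned.
 */
size_t min_state_for_limit(const unsigned long *freqs_desc, size_t n,
                           unsigned long user_max_hz)
{
    size_t state;

    for (state = 0; state + 1 < n; state++)
        if (freqs_desc[state] <= user_max_hz)
            break;
    return state;
}
```

With frequencies of 1000, 800, 600 and 400 MHz and a 600 MHz user limit, this gives state 2, so the max power would be queried for state 2 rather than for state 0.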

The sysfs interface for GPU:
$ cat /sys/class/devfreq/<device>/available_frequencies
400000000 600000000 800000000 1000000000

$ echo 600000000 > /sys/class/devfreq/<device>/max_freq
$ cat /sys/class/devfreq/<device>/max_freq
600000000

IMHO it is not an IPA-specific issue, because DTPM might suffer from this
missing 'user write' information as well. It's just a missing
design feature: providing that user information further to the
other frameworks or governors.

Regards,
Lukasz

[1] https://elixir.bootlin.com/linux/latest/source/drivers/thermal/gov_power_allocator.c#L458
