mbox series

[v5,0/5] Add support for devices in the Energy Model

Message ID 20200318114548.19916-1-lukasz.luba@arm.com
Headers show
Series Add support for devices in the Energy Model | expand

Message

Lukasz Luba March 18, 2020, 11:45 a.m. UTC
Hi all,

This patch set introduces support for devices in the Energy Model (EM)
framework. It will unify the power model for thermal subsystem and make it
simpler. The 1st patch refactors EM framework and adds support for devices.
The 2nd patch changes dev_pm_opp_of_register_em() in OPP/OF which now should
take as an argument struct device pointer. It touches a few trees
(OMAP, NXP, Mediatek) updating their CPUfreq drivers to the new interface.
Patch 4/5 changes thermal devfreq cooling removing old code for calculating
local power table. It simplifies the code and uses EM for requested power
calculation. Last patch 5/5 adds EM to Panfrost driver.

The patch set is based on linux-next tag next-20200317. I have decided to add
a pending patch developed by Matthias [5]. It introduces PM QoS limits in
devfreq cooling, to not break the build test. When it lands into
thermal/linux-next and then in linux-next, I'll drop it from this series.

Changes:
v5:
- devfreq cooling: rebased on top of pending patch introducing PM QoS limits
- devfreq cooling: added Matthias's patch to make this series build check pass
- devfreq cooling: removed OPP disable code and switched to PM QoS
- devfreq cooling: since thermal code always used a pointer to devfreq_dev_status,
  switched to work on a local copy and avoid potential race when either busy_time or
  total_time could change in the background
- devfreq cooling: added _normalize_load() and handle all scenarios when
  busy_time and total_time could have odd values (even raw counters)
- Energy Model patch 2/4: removed prints from cpufreq drivers and added print inside
  dev_pm_opp_of_register_em()
- update patch 2/4 description to better reflect upcoming changes
- collected ACK from Quentin for patch 1/4 and Reviewed-by from Steven for 4/4
v4 [4]:
- devfreq cooling: added two new registration functions, which will take care
  of registering EM for the device and simplify drivers code
  (suggested by Robin and Rob)
- Energy Model: changed unregistering code, added kref to track usage, added
  code freeing tables, added helper function
- added return value to function dev_pm_opp_of_register_em() and updated
  CPUFreq drivers code, added debug prints in case of failure
- updated comments in devfreq cooling removing statement that only
  simple_ondemand devfreq governor is supported to work with power extentions
- fixed spelling in the documentation (reported by Randy)
v3 [3]:
- added back the cpumask 'cpus' in the em_perf_domain due potential cache misses
- removed _is_cpu_em() since there is no need for it
- changed function name from em_pd_energy() to em_cpu_energy(), which is
  optimized for usage from the scheduler making some assumptions and not
  validating arguments to speed-up, there is a comment stressing that it should
  be used only for CPUs em_perf_domain
- changed em_get_pd() to em_pd_get() which is now aligned with em_cpu_get()
  naming
- Energy Model: add code which checks if the EM is already registered for the
  devfreq device
- extended comment in em_cpu_get() describing the need for this function
- fixed build warning reported on x86 by kbuild test robot in devfreq_cooling.c
- updated documentation in the energy-model.rst
- changed print messages from 'energy_model' to 'EM'
- changed dev_warn to dev_dbg, should calm down test scripts in case the
  platform has OPPs less efficient in the OPP table (some of them are there for
  cooling reasons, we shouldn't warn in this case, debug info is enough)
v2 [2]:
- changed EM API em_register_perf_domain() adding cpumask_t pointer
  as last argument (which was discussed with Dietmar and Quentin)
- removed dependency on PM_OPP, thanks to the cpumask_t argument
- removed enum em_type and em->type dependent code
- em_get_pd() can handle CPU device as well as devfreq device
- updated EM documentation
- in devfreq cooling added code which prevents from race condition with
  devfreq governors which are trying to use OPPs while thermal is in the middle
  of disabling them.
- in devfreq cooling added code which updates state of the devfreq device to
  avoid working on stale data when governor has not updated it for a long time
- in devfreq cooling added backward compatibility frequency table for drivers
  which did not provide EM
- added Steven's Reviewed-by to trace code in thermal
- added another CPUFreq driver which needs to be updated to the new API

The v1 can be found here [1].

Regards,
Lukasz Luba

[1] https://lkml.org/lkml/2020/1/16/619
[2] https://lkml.org/lkml/2020/2/6/377
[3] https://lkml.org/lkml/2020/2/21/1910
[4] https://lkml.org/lkml/2020/3/9/471
[5] https://patchwork.kernel.org/patch/11435217/

Lukasz Luba (4):
  PM / EM: add devices to Energy Model
  OPP: refactor dev_pm_opp_of_register_em() and update related drivers
  thermal: devfreq_cooling: Refactor code and switch to use Energy Model
  drm/panfrost: Register devfreq cooling and attempt to add Energy Model

Matthias Kaehlcke (1):
  thermal: devfreq_cooling: Use PM QoS to set frequency limits

 Documentation/power/energy-model.rst        | 133 ++---
 Documentation/scheduler/sched-energy.rst    |   2 +-
 drivers/cpufreq/cpufreq-dt.c                |   2 +-
 drivers/cpufreq/imx6q-cpufreq.c             |   2 +-
 drivers/cpufreq/mediatek-cpufreq.c          |   2 +-
 drivers/cpufreq/omap-cpufreq.c              |   2 +-
 drivers/cpufreq/qcom-cpufreq-hw.c           |   2 +-
 drivers/cpufreq/scmi-cpufreq.c              |  13 +-
 drivers/cpufreq/scpi-cpufreq.c              |   2 +-
 drivers/cpufreq/vexpress-spc-cpufreq.c      |   2 +-
 drivers/gpu/drm/panfrost/panfrost_devfreq.c |   2 +-
 drivers/opp/of.c                            |  76 +--
 drivers/thermal/cpufreq_cooling.c           |  12 +-
 drivers/thermal/devfreq_cooling.c           | 536 ++++++++++----------
 include/linux/devfreq_cooling.h             |  39 +-
 include/linux/energy_model.h                | 111 ++--
 include/linux/pm_opp.h                      |  15 +-
 include/trace/events/thermal.h              |  19 +-
 kernel/power/energy_model.c                 | 465 +++++++++++++----
 kernel/sched/fair.c                         |   2 +-
 kernel/sched/topology.c                     |   4 +-
 21 files changed, 901 insertions(+), 542 deletions(-)

Comments

Daniel Lezcano April 3, 2020, 4:05 p.m. UTC | #1
Hi Lukasz,


On 18/03/2020 12:45, Lukasz Luba wrote:
> Add support of other devices into the Energy Model framework not only the
> CPUs. Change the interface to be more unified which can handle other
> devices as well.

thanks for taking care of that. Overall I like the changes in this patch
but it hard to review in details because the patch is too big :/

Could you split this patch into smaller ones?

eg. (at your convenience)

 - One patch renaming s/cap/perf/

 - One patch adding a new function:

    em_dev_register_perf_domain(struct device *dev,
				unsigned int nr_states,
				struct em_data_callback *cb);

   (+ EXPORT_SYMBOL_GPL)

    And em_register_perf_domain() using it.

 - One converting the em_register_perf_domain() user to
	em_dev_register_perf_domain

 - One adding the different new 'em' functions

 - And finally one removing em_register_perf_domain().


> Acked-by: Quentin Perret <qperret@google.com>
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---

[ ... ]

>  2. Core APIs
> @@ -70,14 +72,16 @@ CONFIG_ENERGY_MODEL must be enabled to use the EM framework.
>  Drivers are expected to register performance domains into the EM framework by
>  calling the following API::
>  
> -  int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
> -			      struct em_data_callback *cb);
> +  int em_register_perf_domain(struct device *dev, unsigned int nr_states,
> +		struct em_data_callback *cb, cpumask_t *cpus);

Isn't possible to get rid of this cpumask by using
cpufreq_cpu_get() which returns the cpufreq's policy and from their get
the related cpus ?

[ ... ]
Daniel Lezcano April 3, 2020, 4:43 p.m. UTC | #2
On 18/03/2020 12:45, Lukasz Luba wrote:
> From: Matthias Kaehlcke <mka@chromium.org>
> 
> Now that devfreq supports limiting the frequency range of a device
> through PM QoS make use of it instead of disabling OPPs that should
> not be used.
> 
> The switch from disabling OPPs to PM QoS introduces a subtle behavioral
> change in case of conflicting requests (min > max): PM QoS gives
> precedence to the MIN_FREQUENCY request, while higher OPPs disabled
> with dev_pm_opp_disable() would override MIN_FREQUENCY.
> 
> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>

This patch is standalone, right? If yes, I will apply it.
Daniel Lezcano April 3, 2020, 5:19 p.m. UTC | #3
On 03/04/2020 19:18, Matthias Kaehlcke wrote:
> Hi Daniel,
> 
> On Fri, Apr 03, 2020 at 06:43:20PM +0200, Daniel Lezcano wrote:
>> On 18/03/2020 12:45, Lukasz Luba wrote:
>>> From: Matthias Kaehlcke <mka@chromium.org>
>>>
>>> Now that devfreq supports limiting the frequency range of a device
>>> through PM QoS make use of it instead of disabling OPPs that should
>>> not be used.
>>>
>>> The switch from disabling OPPs to PM QoS introduces a subtle behavioral
>>> change in case of conflicting requests (min > max): PM QoS gives
>>> precedence to the MIN_FREQUENCY request, while higher OPPs disabled
>>> with dev_pm_opp_disable() would override MIN_FREQUENCY.
>>>
>>> Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
>>> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
>>> Reviewed-by: Chanwoo Choi <cw00.choi@samsung.com>
>>
>> This patch is standalone, right? If yes, I will apply it.
> 
> Yes, it is standalone, please apply

Applied on 'testing', thanks
Lukasz Luba April 6, 2020, 4:07 p.m. UTC | #4
On 4/6/20 3:58 PM, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> 
> On 06/04/2020 15:29, Lukasz Luba wrote:
>> Hi Daniel,
>>
>> Thank you for the review.
>>
>> On 4/3/20 5:05 PM, Daniel Lezcano wrote:
>>>
>>> Hi Lukasz,
>>>
>>>
>>> On 18/03/2020 12:45, Lukasz Luba wrote:
>>>> Add support of other devices into the Energy Model framework not only
>>>> the
>>>> CPUs. Change the interface to be more unified which can handle other
>>>> devices as well.
>>>
>>> thanks for taking care of that. Overall I like the changes in this patch
>>> but it hard to review in details because the patch is too big :/
>>>
>>> Could you split this patch into smaller ones?
>>>
>>> eg. (at your convenience)
>>>
>>>    - One patch renaming s/cap/perf/
>>>
>>>    - One patch adding a new function:
>>>
>>>       em_dev_register_perf_domain(struct device *dev,
>>>                  unsigned int nr_states,
>>>                  struct em_data_callback *cb);
>>>
>>>      (+ EXPORT_SYMBOL_GPL)
>>>
>>>       And em_register_perf_domain() using it.
>>>
>>>    - One converting the em_register_perf_domain() user to
>>>      em_dev_register_perf_domain
>>>
>>>    - One adding the different new 'em' functions
>>>
>>>    - And finally one removing em_register_perf_domain().
>>
>> I agree and will do the split. I could also break the dependencies
>> for future easier merge.
>>
>>>
>>>
>>>> Acked-by: Quentin Perret <qperret@google.com>
>>>> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
>>>> ---
>>>
>>> [ ... ]
>>>
>>>>    2. Core APIs
>>>> @@ -70,14 +72,16 @@ CONFIG_ENERGY_MODEL must be enabled to use the EM
>>>> framework.
>>>>    Drivers are expected to register performance domains into the EM
>>>> framework by
>>>>    calling the following API::
>>>>    -  int em_register_perf_domain(cpumask_t *span, unsigned int
>>>> nr_states,
>>>> -                  struct em_data_callback *cb);
>>>> +  int em_register_perf_domain(struct device *dev, unsigned int
>>>> nr_states,
>>>> +        struct em_data_callback *cb, cpumask_t *cpus);
>>>
>>> Isn't possible to get rid of this cpumask by using
>>> cpufreq_cpu_get() which returns the cpufreq's policy and from their get
>>> the related cpus ?
>>
>> We had similar thoughts with Quentin and I've checked this.
> 
> Yeah, I suspected you already think about that :)
> 
>> Unfortunately, if the policy is a 'new policy' [1] it gets
>> allocated and passed into cpufreq driver ->init(policy) [2].
>> Then that policy is set into per_cpu pointer for each related_cpu [3]:
>>
>> for_each_cpu(j, policy->related_cpus)
>>      per_cpu(cpufreq_cpu_data, j) = policy;
>>
>>   
>> Thus, any calls of functions (i.e. cpufreq_cpu_get()) which try to
>> take this ptr before [3] won't work.
>>
>> We are trying to register EM from cpufreq_driver->init(policy) and the
>> per_cpu policy is likely to be not populated at that phase.
> 
> What is the problem of registering at the end of the cpufreq_online ?

We want to enable driver developers to choose one of two options for the
registration of Energy Model:
1. a simple one via dev_pm_opp_of_register_em(), which uses default
    callback function calculating power based on: voltage, freq
    and DT entry 'dynamic-power-coefficient' for each OPP
2. a more sophisticated, when driver provides callback function, which
   will be called from EM for each OPP to ask for related power;
   This interface could also be used by devices which relay not only
   on one source of 'voltage', i.e. manipulate body bias or have
   other controlling voltage for gates in the new 3D transistors. They
   might provide custom callback function in their cpufreq driver.
   This is used i.e. in cpufreq drivers which use firmware to get power,
   like scmi-cpufreq.c;

To meet this requirement the registration of EM is moved into cpufreq
drivers, not in the framework i.e cpufreq_online(). If we could limit
the support for only option 1. then we could move the registration
call into cpufreq framework and clean the cpufreq drivers.

Regards,
Lukasz