[RFC,v2,0/1] cpuidle: teo: Introduce optional util-awareness

Message ID	20221003144914.160547-1-kajetan.puchalski@arm.com
Headers	show Return-Path: <linux-pm-owner@kernel.org> From: Kajetan Puchalski <kajetan.puchalski@arm.com> To: rafael@kernel.org Cc: daniel.lezcano@linaro.org, lukasz.luba@arm.com, Dietmar.Eggemann@arm.com, dsmythies@telus.net, yu.chen.surf@gmail.com, kajetan.puchalski@arm.com, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v2 0/1] cpuidle: teo: Introduce optional util-awareness Date: Mon, 3 Oct 2022 15:49:13 +0100 Message-Id: <20221003144914.160547-1-kajetan.puchalski@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	cpuidle: teo: Introduce optional util-awareness \| expand [RFC,v2,0/1] cpuidle: teo: Introduce optional util-awareness [RFC,v2,1/1] cpuidle: teo: Introduce optional util-awareness

Kajetan Puchalski Oct. 3, 2022, 2:49 p.m. UTC

Hi,

At the moment, all the available idle governors operate mainly based on their own past performance
without taking into account any scheduling information. Especially on interactive systems, this
results in them frequently selecting a deeper idle state and then waking up before its target
residency is hit, thus leading to increased wakeup latency and lower performance with no power
saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
account for over 24% of all wakeups.

At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
it over C1. Sleeps that happened in C0 while they could have used C1 ('too shallow') only save
less power than they otherwise could have. Too deep sleeps, on the other hand, harm performance
and nullify the potential power saving from using C1 in the first place. While taking this into
account, it is clear that on balance it is preferable for an idle governor to have more too shallow
sleeps instead of more too deep sleeps on those kinds of platforms.

Currently the best available governor under this metric is TEO which on average results in less than
half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
increased performance in the process.

This proposed optional extension to TEO would specifically tune it for minimising too deep
sleeps and minimising latency to achieve better performance. To this end, before selecting the next
idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
CPU is being utilized. This util value is then compared to a threshold defined as a percentage of
the cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the
threshold, the idle state selected by TEO metrics will be reduced by 1, thus selecting a shallower
state. If the util is below the threshold, the governor defaults to the TEO metrics mechanism to try
to select the deepest available idle state based on the closest timer event and its own correctness.

As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle
states are present. If there's just 2 idle states and the the CPU is utilized, we can directly select
the shallowest state and save cycles by skipping the entire metrics mechanism.

Initially I am sending this as a patch for TEO to visualize the proposed mechanism and simplify
the review process. An alternative way of implementing it while not interfering
with existing TEO code would be to fork TEO into a separate but mostly identical for the time being
governor (working name 'idleutil') and then implement util-awareness there, so that the two
approaches can coexist and both be available at runtime instead of relying on a compile-time option.
I am happy to send a patchset doing that if you think it's a cleaner approach than doing it this way.

This approach can outperform all the other currently available governors, at least on mobile device
workloads, which is why I think it is worth keeping as an option.

Additionally, in my view, the reason why it makes more sense to implement this type of mechanism
inside a governor rather than outside using something like QoS or some other way to disable certain
idle states on the fly are the governor's metrics. If we were disabling idle states and reenabling
them without the governor 'knowing' about it, the governor's metrics would end up being updated
based on state selections not caused by the governor itself. This could interfere with the
correctness of said metrics as that's not what they were designed for as far as I understand.
This approach skips metrics updates whenever a state was selected based on the util and not based
on the metrics.

There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base
it on TEO because it performs the best out of all the available options and I didn't think there was
any point in reinventing the wheel on the side of computing governor metrics. If a
better approach comes along at some point, there's no reason why the same idle aware mechanism
couldn't be used with any other metrics algorithm. That would, however, require implemeting it as
a separate governor rather than a TEO add-on.

As for how the extension performs in practice, below I'll add some benchmark results I got while
testing this patchset.

Pixel 6 (Android 12, mainline kernel 5.18):

1. Geekbench 5 (latency-sensitive, heavy load test)

The values below are gmean values across 3 back to back iteration of Geekbench 5.
As GB5 is a heavy benchmark, after more than 3 iterations intense throttling kicks in on mobile devices
resulting in skewed benchmark scores, which makes it difficult to collect reliable results. The actual
values for all of the governors can change between runs as the benchmark might be affected by factors
other than just latency. Nevertheless, on the runs I've seen, util-aware TEO frequently achieved better
scores than all the other governors.

'shallow' is a trivial governor that only ever selects the shallowest available state, included here
for reference and to establish the lower bound of latency possible to achieve through cpuidle.

'gmean too deep %' and 'gmean too shallow %' are percentages of too deep and too shallow sleeps
computed using the new trace event - cpu_idle_miss. The percentage is obtained by counting the two
types of misses over the course of a run and then dividing them by the total number of wakeups.

| metric                                | menu           | teo               | shallow           | teo + util-aware  |
| ------------------------------------- | -------------  | ---------------   | ---------------   | ---------------   |
| gmean score                           | 2716.4 (0.0%)  | 2795 (+2.89%)     | 2780.5 (+2.36%)   | 2830.8 (+4.21%)   |
| gmean too deep %                      | 16.64%         | 9.61%             | 0%                | 4.19%             |
| gmean too shallow %                   | 2.66%          | 5.54%             | 31.47%            | 15.3%             |
| gmean task wakeup latency (gb5)       | 82.05μs (0.0%) | 73.97μs (-9.85%)  | 42.05μs (-48.76%) | 66.91μs (-18.45%) |
| gmean task wakeup latency (asynctask) | 75.66μs (0.0%) | 56.58μs (-25.22%) | 65.78μs (-13.06%) | 55.35μs (-26.84%) |

In case of this benchmark, the difference in latency does seem to translate into better scores.

Additionally, here's a set of runs of Geekbench done after holding the phone in
the fridge for exactly an hour each time in order to minimise the impact of thermal issues.

| metric                                | menu           | teo               | teo + util-aware  |
| ------------------------------------- | -------------  | ---------------   | ---------------   |
| gmean multicore score                 | 2792.1 (0.0%)  | 2845.2 (+1.9%)    | 2857.4 (+2.34%)   |
| gmean single-core score               | 1048.3 (0.0%)  | 1052.6 (+0.41%)   | 1055.3 (+0.67%)   |

2. PCMark Web Browsing (non latency-sensitive, normal usage test)

The table below contains gmean values across 20 back to back iterations of PCMark 2 Web Browsing.

| metric                    | menu           | teo               | shallow          | teo + util-aware  |
| ------------------------- | -------------  | ---------------   | ---------------  | ---------------   |
| gmean score               | 6283.0 (0.0%)  | 6262.9 (-0.32%)   | 6258.4 (-0.39%)  | 6323.7 (+0.65%)   |
| gmean too deep %          | 24.15%         | 10.32%            | 0%               | 3.2%              |
| gmean too shallow %       | 2.81%          | 7.68%             | 27.69%           | 17.189%           |
| gmean power usage [mW]    | 209.1 (0.0%)   | 187.8 (-10.17%)   | 205.5 (-1.71%)   | 205 (-1.96%)      |
| gmean task wakeup latency | 204.6μs (0.0%) | 184.39μs (-9.87%) | 95.55μs (-53.3%) | 95.98μs (-53.09%) |

As this is a web browsing benchmark, the task for which the wakeup latency was recorded was Chrome's
rendering task, ie CrRendererMain. The latency improvement for the actual benchmark task was very
similar.

In this case the large latency improvement does not translate into a notable increase in benchmark score as
this particular benchmark mainly responds to changes in operating frequency. Nevertheless, the small power
saving compared to menu with no decrease in benchmark score indicate that there are no regressions for this
type of workload while using this governor.

Note: The results above were as mentioned obtained on the 5.18 kernel. Results for Geekbench obtained after
backporting CFS patches from the most recent mainline can be found in the pdf linked below [1].
The results and improvements still hold up but the numbers change slightly. Additionally, the pdf contains
plots for all the relevant results obtained with this and other idle governors, alongside power usage numbers.

At the very least this approach seems promising so I wanted to discuss it in RFC form first.
Thank you for taking your time to read this!

--
Kajetan

[1] https://github.com/mrkajetanp/lisa-notebooks/blob/a2361a5b647629bfbfc676b942c8e6498fb9bd03/idle_util_aware.pdf

v1 -> v2:
- rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
- add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
- fix typos in comments
- include a missing header

Kajetan Puchalski (1):
  cpuidle: teo: Introduce optional util-awareness

 drivers/cpuidle/Kconfig         | 12 +++++
 drivers/cpuidle/governors/teo.c | 96 ++++++++++++++++++++++++++++++++-
 2 files changed, 107 insertions(+), 1 deletion(-)

Doug Smythies Oct. 7, 2022, 12:06 a.m. UTC | #1

On Mon, Oct 3, 2022 at 7:49 AM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Hi,
>
> At the moment, all the available idle governors operate mainly based on their own past performance
> without taking into account any scheduling information.

...

I tried V2 on my Intel i5-10600K, but there was little difference when
compared to V1 in terms of enormous increase in processor power
consumption for the periodic type tests I ran. However, sometimes the
power use would drop down to comparable to the normal teo or menu
governors for a short time, which V1 didn't do.

I would suggest a completely new governor for your
only 2 idle states, and of similar power use, scenario.
Incredibly simple:

If CPU utilization >= threshold
  select idle state 0
Else
  Select idle state 1
Endif

As previously mentioned, you are overriding
everything that teo does anyhow.

Note to other readers: I also sent Kajetan an off-list email with more
tests descriptions and web links to results graphs and such.

... Doug

Kajetan Puchalski Oct. 10, 2022, 10:09 a.m. UTC | #2

On Thu, Oct 06, 2022 at 05:06:58PM -0700, Doug Smythies wrote:
> 
> I would suggest a completely new governor for your
> only 2 idle states, and of similar power use, scenario.
> Incredibly simple:
> 
> If CPU utilization >= threshold
>   select idle state 0
> Else
>   Select idle state 1
> Endif

That would be easy to implement at least but sadly just doesn't really
work at all. It would just result in huge amounts of too deep C1 sleeps
which are the main problem for us at the moment. For all intents and
purposes, only ever selecting C0 gives better results than this approach
on our hardware.

> As previously mentioned, you are overriding
> everything that teo does anyhow.

To an extent yes, this just came from observations of how cpuidle
behaves on arm. We tried an approach with only ever using C0 and it was
all right on power usage, the device would just heat up too much and
lose performance. That's why this approach of using TEO when the core is
completely idle and moving to the lower state when it's not achieves the
best of both worlds.

> Note to other readers: I also sent Kajetan an off-list email with more
> tests descriptions and web links to results graphs and such.

Looks like the email got lost in my inbox, could be some email filtering
or something like that.. Could you possibly resend and cc my private
email (kajetan.puchalski@tuta.io)?
So far I've not directly received any of your emails, they show up on
the mailing list archive but not in my inbox. Could be an IT issue on my
end but might be worth checking your email setup regardless.

> ... Doug

Thanks,
Kajetan

Lukasz Luba Oct. 11, 2022, 8:21 a.m. UTC | #3

Hi Doug, Kajetan,

On 10/10/22 11:09, Kajetan Puchalski wrote:
> On Thu, Oct 06, 2022 at 05:06:58PM -0700, Doug Smythies wrote:
>>
>> I would suggest a completely new governor for your
>> only 2 idle states, and of similar power use, scenario.
>> Incredibly simple:
>>
>> If CPU utilization >= threshold
>>    select idle state 0
>> Else
>>    Select idle state 1
>> Endif
> 
> That would be easy to implement at least but sadly just doesn't really
> work at all. It would just result in huge amounts of too deep C1 sleeps
> which are the main problem for us at the moment. For all intents and
> purposes, only ever selecting C0 gives better results than this approach
> on our hardware.
> 
>> As previously mentioned, you are overriding
>> everything that teo does anyhow.
> 
> To an extent yes, this just came from observations of how cpuidle
> behaves on arm. We tried an approach with only ever using C0 and it was
> all right on power usage, the device would just heat up too much and
> lose performance. That's why this approach of using TEO when the core is
> completely idle and moving to the lower state when it's not achieves the
> best of both worlds.
> 
>> Note to other readers: I also sent Kajetan an off-list email with more
>> tests descriptions and web links to results graphs and such.

Thanks Doug for testing this work, we really appreciate that.
Doug maybe you could post these into some public space, so wider
community would also see them. Kajetan has used github to put
a report with testing results containing some graphs/plots.

> 
> Looks like the email got lost in my inbox, could be some email filtering
> or something like that.. Could you possibly resend and cc my private
> email (kajetan.puchalski@tuta.io)?
> So far I've not directly received any of your emails, they show up on
> the mailing list archive but not in my inbox. Could be an IT issue on my
> end but might be worth checking your email setup regardless.

I also have some issues with the email inbox. Me apologies Doug, we will
sort it.

Regards,
Lukasz

Kajetan Puchalski Oct. 13, 2022, 11:12 a.m. UTC | #4

On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > Hi,
> >
> > At the moment, all the available idle governors operate mainly based on their own past performance
> 
> Not true, at least for the menu and teo governors that use the
> information on the distribution of CPU wakeups that is available to
> them and try to predict the next idle duration with the help of it.
> This has a little to do with their performance.

You're right of course, I should have written "their own past
correctness" as that's what I was referring to. I just meant that for
instance with TEO the initial timer-based choice is only adjusted using
the governor's own metrics and not any information from anywhere else in
the system.

> > without taking into account any scheduling information. Especially on interactive systems, this
> > results in them frequently selecting a deeper idle state and then waking up before its target
> > residency is hit, thus leading to increased wakeup latency and lower performance with no power
> > saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> > account for over 24% of all wakeups.
> 
> How is this measured?

Using the cpu_idle_miss trace event. Over the course of a benchmark run
I collect all cpu_idle and cpu_idle_miss trace events. Then I divide the
number of too deep misses by the total number of cpu_idle wakeup events
which gives me the percentage. Those are the percentages described as
'gmean too deep %' in the tables included in the cover letter. Gmean
because I run the benchmarks for many iterations and then take an
average of those percentages to account for outliers.
PCMark Web Browsing is a 'benchmark' that just amounts to browsing the
web on Android, hence I can use data from it to talk about what the
system behaviour under normal usage would be.

> > At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> > it over C1.
> 
> Well, energy-efficiency is relative, so strictly speaking it is
> invalid to say "power efficient enough".

Yes, by 'enough' I meant that the power savings of C0 vs C1 on arm are
fairly comparable as opposed to other platforms. From Doug's data
collected on an Intel CPU, the power usage difference of only-C0
compared to only-C1 was over 20-fold ie 46w vs 2.6w. With only C0
enabled on Pixel 6 that difference is closer to something like 4%. It's
just fundamentally different hardware. With 4% being your ceiling you
can talk about performance/latency tradeoffs etc, if you're talking
about potential over 1700% increases, not so much.

> Also, as far as idle CPUs are concerned, we are talking about the
> situation in which no useful work is done at all, so the state drawing
> less power is always more energy-efficient than the one drawing more
> power.

Yes, assuming the CPU is woken up after the target residency of the
state has been met. If the wakeup happens too early then for that
situation C0 would've been more power efficient than C1 even though C1
technically draws less power, right? That's what we're trying to fix
here, we just noticed that for mobile interactive workloads at least
we're getting this situation way too often.

The result being that this util-aware TEO variant while using much less
C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
PCMark Web Browsing also uses almost 2% less power. Clearly the power is
being wasted on not hitting C1 residency over and over.

> You may argue that predicting idle durations that are too long too
> often leads to both excessive task wakeup latency and excessive energy
> usage at the same time, but this may very well mean that the target
> residency value for C1 is too low.

We get residency values from DT and they're meant to be the descriptions
of each CPU's hardware so I don't think tweaking them to get better
results would be a good idea. Unless I'm misunderstanding what you mean?

> > Currently the best available governor under this metric is TEO which on average results in less than
> > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> > increased performance in the process.
> 
> Well, good to hear that, but some numbers in support of that claim
> would be nice to have too.

Those are the numbers I included in the cover letter for the two
benchmarks, they've been very consistent in terms of the pattern
across all the runs and workloads I've seen. For too deep % for
instance in GB5 we had on average menu 16.6%, TEO 9.6%, TEO+util 4.19%.
For PCMark Web Browsing menu 24.15%, TEO 10.32%, TEO+util 3.2%. The
values differ per-workload but every dataset I've seen had that same
'staircase' pattern.

> > This proposed optional extension to TEO would specifically tune it for minimising too deep
> > sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> > idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> > CPU is being utilized.
> 
> Which has no bearing on what the CPU idle time governors have to do
> which is (1) to predict the next idle duration as precisely as
> reasonably possible and (2) to minimise the cost in terms of task
> wakeup latencies associated with using deep idle states.
> 
> The avg_util value tells us nothing about how much the CPU is going to
> be idle this time and it also tells us nothing about the
> latency-sensitivity of the workload.
> 
> Yes, it tells us how much idle time there was on the given CPU in the
> past, on the average, but there is zero information about the
> distribution of that idle time in it.
> 
> So in the first place please tell me why it fundamentally makes sense
> to use avg_util in CPU idle time management at all.

Right, the idea here is slightly similar to that of temporal locality.
We obviously can't predict the future which is sort of what an idle
governor tries to achieve. Focusing on timer events makes a lot of sense
and is probably close to as good as it gets in estimating future
behaviour.

The observation we're relying on here is simply that if the
CPU was doing enough work in the recent past for its avg_util to still
be raised while going into idle, it is very likely that the same CPU
might be needed again soon. From my tests that assumption tends to be
correct quite often. In those situations, when avg_util is high and the
next timer event is far enough for C1 to be selected, a lot of the time
the CPU does actually get woken up before the residency is hit leading
to all the issues described above.

I don't think using avg_util as the *only* input for idle management
would be a good idea at all. The way I see it, it serves as a very good hint
to determine if we are likely to get a wakeup between now and the next
timer event and provides an additional dimension for decision making.
While the current metrics only adjust themselves after making a certain
number of mistakes and are a "trailing" adjusting mechanism, using
avg_util this way provides a "leading" mechanism that potentially lets
us not make those mistakes in the first place. It's not just theory
either, it very clearly works and gets results, at least on the
platforms/workloads we've been looking at.

On the Intel & power usage angle you might have seen in the discussion,
Doug sent me some interesting data privately. As far as I can tell the
main issue there is that C0 on Intel doesn't actually do power saving so
moving the state selection down to it is a pretty bad idea because C1
could be very close in terms of latency and save much more power.

A potential solution could be altering the v2 to only decrease the state
selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
because arm systems with 2 states use the early exit path anyway. It'd
just amount to changing this hunk:

+       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
+               idx--;

to:

+       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
+               idx--;

What would you think about that? Should make it much less intense for
Intel systems.

Thanks a lot for your interest,
Kajetan

Daniel Lezcano Oct. 20, 2022, 7:52 p.m. UTC | #5

Hi Kajetan,

On 20/10/2022 18:20, Kajetan Puchalski wrote:
> Hi Rafael,
> 
>> The avg_util value tells us nothing about how much the CPU is going to
>> be idle this time and it also tells us nothing about the
>> latency-sensitivity of the workload.
>>
>> Yes, it tells us how much idle time there was on the given CPU in the
>> past, on the average, but there is zero information about the
>> distribution of that idle time in it.
>>
>> So in the first place please tell me why it fundamentally makes sense
>> to use avg_util in CPU idle time management at all.
> 
> I have an alternative suggestion that could be a reasonable way forward
> here. Instead of applying util-awareness on top of TEO where it would
> have to be reconciled with how TEO is currently expected to work, I just
> wrote a simple completely new governor which operates only on timer
> events alongside util values.

I second the idea. I took a long time to investigate how to improve the 
governor and reached the conclusion having a dedicated governor for 
mobile platform makes sense. Also the behavior is very platform dependent.

Regarding the utilization, one of the issue is the kernel threads 
preventing a task to wake up on the same CPU and forcing its migration 
at wake up time. So the prediction is screwed up at that time.

There is a paper talking this issue [1]

I've done a 'mobile' governor, including the next interrupt prediction 
[2]. It is very simple and almost has the same results as the teo on my 
platform (rock960).

I'm not planning to upstream it because I don't have spare time to 
improve the results and take care of the IPIs. part.

Also the paradigm is radically different and you may be interested in 
the approach.

So if you want to rework, improve, test, upstream it, feel free to reuse 
the code.

   -- Daniel

[1] Dynamic workload characterization for power efficient scheduling on 
CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf

[2] 
https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462


> The idea is this:
> 1. Find the deepest state based on residency and time until the next timer event
> 2. If sched_cpu_util() is above the threshold, select a shallower non-polling state
> 
> There's no other metrics or anything else under the current
> implementation. I can't say how it would work on Intel systems and in
> the presence of more idle states but having a completely separate
> governor would be very useful for us to tune it specifically for our use
> cases and types of systems (ie. ones with 2 idle states and no polling
> states).
> 
> As it stands it performs quite well and achieves better results
> (especially in terms of latency) than both menu & TEO but slightly worse
> than the previously suggested TEO + util. As far as we're concerned
> that's okay, we can work from there to try to find a way of doing
> metrics or improving the algorithm that would be more tailored to using
> the util approach. I think it's much cleaner than what we were
> discussing previously since that was effectively overriding most of what
> TEO was doing.
> 
> Here are some numbers to visualize the results. They were all obtained
> in the same way as the ones in the cover letter so you can refer to that
> in case something isn't clear.
> 
> 'teo_util' is of course TEO + util as in the patchset.
> 'idleutil' is this entirely new proposed minimal governor.
> 
> 1. Geekbench 5 (latency-sensitive, heavy load test)
> 
> +-----------------+----------+---------+-------------+
> | metric          | kernel   |   value | perc_diff   |
> |-----------------+----------+---------+-------------|
> | multicore_score | menu     |  2832.3 | 0.0%        |
> | multicore_score | teo      |  2815.3 | -0.6%       |
> | multicore_score | teo_util |  2880.6 | 1.7%        |
> | multicore_score | idleutil |  2859.3 | 0.95%       |
> +-----------------+----------+---------+-------------+
> 
> Percentages & types of idle misses
> 
> +-----------+-------------+--------------+
> | kernel    | type        |   percentage |
> |-----------+-------------+--------------|
> | menu      | too deep    |      15.613% |
> | teo       | too deep    |       9.376% |
> | teo_util  | too deep    |       4.581% |
> | idleutil  | too deep    |       5.464% |
> | menu      | too shallow |       2.611% |
> | teo       | too shallow |       6.099% |
> | teo_util  | too shallow |      14.141% |
> | idleutil  | too shallow |      13.282% |
> +-----------+-------------+--------------+
> 
> Power usage [mW]
> 
> +--------------+----------+----------+---------+-------------+
> | chan_name    | metric   | kernel   |   value | perc_diff   |
> |--------------+----------+----------+---------+-------------|
> | total_power  | gmean    | menu     |  2705.9 | 0.0%        |
> | total_power  | gmean    | teo      |  2668.2 | -1.39%      |
> | total_power  | gmean    | teo_util |  2710.2 | 0.16%       |
> | total_power  | gmean    | idleutil |  2657.9 | -1.78%      |
> +--------------+----------+----------+---------+-------------+
> 
> Wakeup latency
> 
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   |       value | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | AsyncTask #1    | gmean    | menu     | 66.85μs     | 0.0%        |
> | AsyncTask #1    | gmean    | teo      | 66.79μs     | -0.09%      |
> | AsyncTask #1    | gmean    | teo_util | 57.84μs     | -13.47%     |
> | AsyncTask #1    | gmean    | idleutil | 62.61μs     | -6.35%      |
> | labs.geekbench5 | gmean    | menu     | 80.62μs     | 0.0%        |
> | labs.geekbench5 | gmean    | teo      | 94.75μs     | 17.52%      |
> | labs.geekbench5 | gmean    | teo_util | 52.98μs     | -34.28%     |
> | labs.geekbench5 | gmean    | idleutil | 68.58μs     | -14.93%     |
> +-----------------+----------+----------+-------------+-------------+
> 
> 2. PCMark Web Browsing (non latency-sensitive, normal usage test)
> 
> +----------------+----------+---------+-------------+
> | metric         | kernel   |   value | perc_diff   |
> |----------------+----------+---------+-------------|
> | PcmaWebV2Score | menu     |  5232   | 0.0%        |
> | PcmaWebV2Score | teo      |  5219.8 | -0.23%      |
> | PcmaWebV2Score | teo_util |  5249.7 | 0.34%       |
> | PcmaWebV2Score | idleutil |  5215.7 | -0.31%      |
> +----------------+----------+---------+-------------+
> 
> Percentages & types of idle misses
> 
> +-----------+-------------+--------------+
> | kernel    | type        |   percentage |
> |-----------+-------------+--------------|
> | menu      | too deep    |      24.814% |
> | teo       | too deep    |       11.65% |
> | teo_util  | too deep    |       3.753% |
> | idleutil  | too deep    |       4.304% |
> | menu      | too shallow |       3.101% |
> | teo       | too shallow |       8.578% |
> | teo_util  | too shallow |      18.309% |
> | idleutil  | too shallow |      17.638% |
> +-----------+-------------+--------------+
> 
> Power usage [mW]
> 
> +--------------+----------+----------+---------+-------------+
> | chan_name    | metric   | kernel   |   value | perc_diff   |
> |--------------+----------+----------+---------+-------------|
> | total_power  | gmean    | menu     |   179.2 | 0.0%        |
> | total_power  | gmean    | teo      |   184.8 | 3.1%        |
> | total_power  | gmean    | teo_util |   180.5 | 0.71%       |
> | total_power  | gmean    | idleutil |   185   | 3.24%       |
> +--------------+----------+----------+---------+-------------+
> 
> Wakeup latency
> 
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   |       value | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | CrRendererMain  | gmean    | menu     | 236.63μs    | 0.0%        |
> | CrRendererMain  | gmean    | teo      | 201.85μs    | -14.7%      |
> | CrRendererMain  | gmean    | teo_util | 111.76μs    | -52.77%     |
> | CrRendererMain  | gmean    | idleutil | 105.55μs    | -55.39%     |
> | chmark:workload | gmean    | menu     | 100.30μs    | 0.0%        |
> | chmark:workload | gmean    | teo      | 80.20μs     | -20.04%     |
> | chmark:workload | gmean    | teo_util | 53.81μs     | -46.35%     |
> | chmark:workload | gmean    | idleutil | 71.29μs     | -28.92%     |
> | RenderThread    | gmean    | menu     | 37.97μs     | 0.0%        |
> | RenderThread    | gmean    | teo      | 31.69μs     | -16.54%     |
> | RenderThread    | gmean    | teo_util | 34.32μs     | -9.63%      |
> | RenderThread    | gmean    | idleutil | 35.78μs     | -5.77%      |
> | surfaceflinger  | gmean    | menu     | 97.57μs     | 0.0%        |
> | surfaceflinger  | gmean    | teo      | 98.86μs     | 1.31%       |
> | surfaceflinger  | gmean    | teo_util | 72.59μs     | -25.6%      |
> | surfaceflinger  | gmean    | idleutil | 56.23μs     | -42.37%     |
> +-----------------+----------+----------+-------------+-------------+
> 
> I also have similar data for Jankbench & Speedometer with right about
> the same results, I'll skip those for now for brevity.
> Would you like me to send a patch with this new governor instead? What
> would you think about this instead of the previously suggested approach?
> 
> Thanks,
> Kajetan

Lukasz Luba Oct. 27, 2022, 7:56 p.m. UTC | #6

Hi Doug,

Thank you for your effort in testing these patches and different
governors. We really appreciate that, since this helped us to
better understand the platform that you are using. It is different
to what we have and our workloads. That's why I have some comments.

It would be hard to combine these two worlds and requirements.
I have some concerns to the tests, the setup and the platform.
I can see a reason why this patch has to prove the
strengths on this platform and environment.
Please see my comments below.

On 10/13/22 23:12, Doug Smythies wrote:
> Hi All,
> 
> On Thu, Oct 13, 2022 at 4:12 AM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
>>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
>>> <kajetan.puchalski@arm.com> wrote:
> ...
> 
>> On the Intel & power usage angle you might have seen in the discussion,
>> Doug sent me some interesting data privately. As far as I can tell the
>> main issue there is that C0 on Intel doesn't actually do power saving so
>> moving the state selection down to it is a pretty bad idea because C1
>> could be very close in terms of latency and save much more power.
>>
>> A potential solution could be altering the v2 to only decrease the state
>> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
>> because arm systems with 2 states use the early exit path anyway. It'd
>> just amount to changing this hunk:
>>
>> +       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
>> +               idx--;
>>
>> to:
>>
>> +       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
>> +               idx--;
>>
>> What would you think about that? Should make it much less intense for
>> Intel systems.
> 
> I tested the above, which you sent me as patch version v2-2.
> 
> By default, my Intel i5-10600K has 4 idle states:
> 
> $ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/name
> /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL

This active polling state type worries me a bit. We don't have
such on our platforms. Our shallowest idle state is really different.
We don't have active polling and there is no need for such.

> /sys/devices/system/cpu/cpu7/cpuidle/state1/name:C1_ACPI
> /sys/devices/system/cpu/cpu7/cpuidle/state2/name:C2_ACPI
> /sys/devices/system/cpu/cpu7/cpuidle/state3/name:C3_ACPI
> 
> Idle driver governor legend:
> teo: the normal teo idle governor
> menu: the normal menu idle governor
> util or v1: the original patch
> util-v2 or v2: V2 of the patch
> util-v2-2 or v2-2: the suggestion further up in this thread.
> 
> Test 1: Timer based periodic:
> 
> A load sweep from 0 to 100%, then 100% to 0, first 73 hertz, then 113,
> 211,347 and finally 401 hertz work/sleep frequency. Single thread.

This 'Single thread' worries me a bit as well. Probably the
task don't migrate at all over CPUs, or very unlikely.

> 
> http://smythies.com/~doug/linux/idle/teo-util/consume/idle-1/
> 
> Summary, average processor package powers (watts):
> 
> teo              menu          v1               v2             v2-2
> 10.19399    10.74804    22.12791    21.0431    11.27865
>                       5.44%      117.07%     106.43%     10.64%
> 
> There is no performance measurement for this test, it just has to
> finish the work packet before the next period starts. Note that
> overruns do occur as the workload approaches 100%, but I do not record
> that data, as typically the lower workload percentages are the area of
> interest.
> 
> Test 2: Ping-pong test rotating through 6 different cores, with a
> variable packet of work to do at each stop. This test goes gradually
> through different idle states and is not timer based. A different 2
> core test (which I have not done) is used to better explore the idle
> state 0 to idle state 1 transition. This test has a performance
> measurement. The CPU scaling governor was set to performance. HWP was

The 'performance' governor also worries me here. When we fix the
frequency of the CPU then some basic statistics mechanisms would be good
enough for reasoning.

In our world, a few conditions are different:
1. The CPU frequency changes. We work with SchedUtil and adjust the
frequency quite often. Therefore, simple statistics which are not
aware of the frequency change and the impact to the CPU computation
capacity might be misleading. The utilization signal of the CPU runqueue
brings that information to our idle decisions.
2. Single threaded workloads aren't typical apps. When we deal
with many tasks and the task scheduler migrates them across many
CPUs we would like to 'see' this. The 'old-school' statistics
observing only the local CPU usage are not able to figure out
fast enough that some bigger task just migrated to that CPU.
With utilization of the runqueue, we know that upfront, because the task
utilization was subtracted from the old CPU's runqueue and
added to the new CPU's runqueue. Our approach with this util
signal would allow us to make a better decision in these two use cases:
a) task is leaving the CPU and rq util drops dramatically - so we can
go into deeper sleep immediately
b) task just arrived on this CPU and rq util got higher value - so we
shouldn't go into deep idle state, since there is 'not small' task.
3. Power saving on our platform in shallowest idle state was improved
recently and creates a scope for saving power and increase performance.

It would be fair to let TEO continue it's evolution (on the platforms
that it was designed for) and create a new governor which would address
better other platforms and workloads needs.

I will ask Rafael if that can happen. Kajetan has a tiny patch with
basic mechanisms, which performs really good. I will ask him to send it
so Rafael could have a look and decide. We could then develop/improve
that new governor with ideas from other experienced engineers in
mobile platforms.

Regards,
Lukasz

Lukasz Luba Oct. 27, 2022, 8:04 p.m. UTC | #7

Hi Rafael,

On 10/13/22 12:12, Kajetan Puchalski wrote:
> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
>> <kajetan.puchalski@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> At the moment, all the available idle governors operate mainly based on their own past performance
>>
>> Not true, at least for the menu and teo governors that use the
>> information on the distribution of CPU wakeups that is available to
>> them and try to predict the next idle duration with the help of it.
>> This has a little to do with their performance.
> 
> You're right of course, I should have written "their own past
> correctness" as that's what I was referring to. I just meant that for
> instance with TEO the initial timer-based choice is only adjusted using
> the governor's own metrics and not any information from anywhere else in
> the system.
> 

[snip]

Would it be possible to consider a new small and simple idle governor
which is better suited for those other workloads and platforms?
Kajetan has such one and can send to the LKML, so you could have a look.

I have sent some detailed explanation about this to Doug in this
thread (don't want to duplicate it).

It looks like it would be hard to meet both worlds' requirements.

Regards,
Lukasz

Lukasz Luba Oct. 28, 2022, 7:08 a.m. UTC | #8

On 10/20/22 20:52, Daniel Lezcano wrote:
> 
> Hi Kajetan,
> 
> On 20/10/2022 18:20, Kajetan Puchalski wrote:
>> Hi Rafael,
>>
>>> The avg_util value tells us nothing about how much the CPU is going to
>>> be idle this time and it also tells us nothing about the
>>> latency-sensitivity of the workload.
>>>
>>> Yes, it tells us how much idle time there was on the given CPU in the
>>> past, on the average, but there is zero information about the
>>> distribution of that idle time in it.
>>>
>>> So in the first place please tell me why it fundamentally makes sense
>>> to use avg_util in CPU idle time management at all.
>>
>> I have an alternative suggestion that could be a reasonable way forward
>> here. Instead of applying util-awareness on top of TEO where it would
>> have to be reconciled with how TEO is currently expected to work, I just
>> wrote a simple completely new governor which operates only on timer
>> events alongside util values.
> 
> I second the idea. I took a long time to investigate how to improve the 
> governor and reached the conclusion having a dedicated governor for 
> mobile platform makes sense. Also the behavior is very platform dependent.
> 
> Regarding the utilization, one of the issue is the kernel threads 
> preventing a task to wake up on the same CPU and forcing its migration 
> at wake up time. So the prediction is screwed up at that time.
> 
> There is a paper talking this issue [1]
> 
> I've done a 'mobile' governor, including the next interrupt prediction 
> [2]. It is very simple and almost has the same results as the teo on my 
> platform (rock960).
> 
> I'm not planning to upstream it because I don't have spare time to 
> improve the results and take care of the IPIs. part.
> 
> Also the paradigm is radically different and you may be interested in 
> the approach.
> 
> So if you want to rework, improve, test, upstream it, feel free to reuse 
> the code.
> 
>    -- Daniel
> 
> [1] Dynamic workload characterization for power efficient scheduling on 
> CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
> 
> [2] 
> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462 
> 
> 
> 

Thanks Daniel! I forgot about your work in this area. As I have
responded in some other email in this thread, we might start
from a new small governor and than others can contribute.

Even this small governor that Kajetan showed me performs really
good on pixel6.

Regards,
Lukasz

Daniel Lezcano Oct. 28, 2022, 7:11 a.m. UTC | #9

On 28/10/2022 09:08, Lukasz Luba wrote:

[ ... ]

>> [1] Dynamic workload characterization for power efficient scheduling 
>> on CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
>>
>> [2] 
>> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462
>>
>>
> 
> Thanks Daniel! I forgot about your work in this area. As I have
> responded in some other email in this thread, we might start
> from a new small governor and than others can contribute.
> 
> Even this small governor that Kajetan showed me performs really
> good on pixel6.

Do you have others ARM64 platforms to compare with ?

Lukasz Luba Oct. 28, 2022, 7:23 a.m. UTC | #10

On 10/28/22 08:11, Daniel Lezcano wrote:
> On 28/10/2022 09:08, Lukasz Luba wrote:
> 
> [ ... ]
> 
>>> [1] Dynamic workload characterization for power efficient scheduling 
>>> on CMP systems : https://cseweb.ucsd.edu//~tullsen/islped10.pdf
>>>
>>> [2] 
>>> https://git.linaro.org/people/daniel.lezcano/linux.git/commit/?h=cpuidle/mobile-governor-v5.1&id=de1edb05e3c342f0738b414aa84263d6555b7462 
>>>
>>>
>>>
>>
>> Thanks Daniel! I forgot about your work in this area. As I have
>> responded in some other email in this thread, we might start
>> from a new small governor and than others can contribute.
>>
>> Even this small governor that Kajetan showed me performs really
>> good on pixel6.
> 
> Do you have others ARM64 platforms to compare with ?
> 

Yes we have, also some older platforms (~10years now, which
would be nice to check how they would perform).
We also have a big arm64 server to give it a try there. So those results
will be available with the new patch when we decide to go the new
governor.

Rafael J. Wysocki Oct. 28, 2022, 1:12 p.m. UTC | #11

On Thu, Oct 13, 2022 at 1:12 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> > On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> > <kajetan.puchalski@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > At the moment, all the available idle governors operate mainly based on their own past performance
> >
> > Not true, at least for the menu and teo governors that use the
> > information on the distribution of CPU wakeups that is available to
> > them and try to predict the next idle duration with the help of it.
> > This has a little to do with their performance.
>
> You're right of course, I should have written "their own past
> correctness" as that's what I was referring to. I just meant that for
> instance with TEO the initial timer-based choice is only adjusted using
> the governor's own metrics and not any information from anywhere else in
> the system.

The last sentence is just right, so that's what I would say in the changelog.


> > > without taking into account any scheduling information. Especially on interactive systems, this
> > > results in them frequently selecting a deeper idle state and then waking up before its target
> > > residency is hit, thus leading to increased wakeup latency and lower performance with no power
> > > saving. For 'menu' while web browsing on Android for instance, those types of wakeups ('too deep')
> > > account for over 24% of all wakeups.
> >
> > How is this measured?
>
> Using the cpu_idle_miss trace event. Over the course of a benchmark run
> I collect all cpu_idle and cpu_idle_miss trace events. Then I divide the
> number of too deep misses by the total number of cpu_idle wakeup events
> which gives me the percentage. Those are the percentages described as
> 'gmean too deep %' in the tables included in the cover letter. Gmean
> because I run the benchmarks for many iterations and then take an
> average of those percentages to account for outliers.
> PCMark Web Browsing is a 'benchmark' that just amounts to browsing the
> web on Android, hence I can use data from it to talk about what the
> system behaviour under normal usage would be.
>
> > > At the same time, on some platforms C0 can be power efficient enough to warrant wanting to prefer
> > > it over C1.
> >
> > Well, energy-efficiency is relative, so strictly speaking it is
> > invalid to say "power efficient enough".
>
> Yes, by 'enough' I meant that the power savings of C0 vs C1 on arm are
> fairly comparable as opposed to other platforms. From Doug's data
> collected on an Intel CPU, the power usage difference of only-C0
> compared to only-C1 was over 20-fold ie 46w vs 2.6w. With only C0
> enabled on Pixel 6 that difference is closer to something like 4%. It's
> just fundamentally different hardware. With 4% being your ceiling you
> can talk about performance/latency tradeoffs etc, if you're talking
> about potential over 1700% increases, not so much.

The above is very close to a proper problem statement.

IIUC, on the hardware in question the power difference between the
first available idle state (state 0) and the next idle state (state 1)
is relatively small, but the target residency of state 1 is relatively
large and if it is missed, energy is wasted and the extra cost in
terms of latency is relatively high.  At the same time, this is the
idle duration range where the latency matters the most, so it is
desirable to reduce the likelihood of mispredicting higher idle
duration in this range beyond what the teo governor does by itself.

Also, unlike on Intel systems, state 0 actually is an idle state (on
Intel systems state 0 is a polling state and it is there to avoid the
latency cost of C1 in the cases when it wouldn't save any energy due
to the nonzero target residency).

Fair enough.

> > Also, as far as idle CPUs are concerned, we are talking about the
> > situation in which no useful work is done at all, so the state drawing
> > less power is always more energy-efficient than the one drawing more
> > power.
>
> Yes, assuming the CPU is woken up after the target residency of the
> state has been met. If the wakeup happens too early then for that
> situation C0 would've been more power efficient than C1 even though C1
> technically draws less power, right? That's what we're trying to fix
> here, we just noticed that for mobile interactive workloads at least
> we're getting this situation way too often.

Well, the interactive workloads are likely to be similar on any clent
systems (and the term "mobile" is somewhat vague).

> The result being that this util-aware TEO variant while using much less
> C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> being wasted on not hitting C1 residency over and over.

Hmm.  The PCMark Web Browsing table in your cover letter doesn't indicate that.

The "gmean power usage" there for "teo + util-aware" is 205, whereas
for "teo" alone it is 187.8.  This is still arguably balanced by the
latency difference (~100 us vs ~185 us, respectively), but this looks
like trading energy for performance.

On the side note, unmodified "teo" shows some nice 10%-range
improvements in terms of both power and latency over "menu" in this
case, even though it underestimates the idle duration much more often
(which kind of supports the idea that underestimating the idle
duration is better than overestimating it).

> > You may argue that predicting idle durations that are too long too
> > often leads to both excessive task wakeup latency and excessive energy
> > usage at the same time, but this may very well mean that the target
> > residency value for C1 is too low.
>
> We get residency values from DT and they're meant to be the descriptions
> of each CPU's hardware so I don't think tweaking them to get better
> results would be a good idea. Unless I'm misunderstanding what you mean?

I mean that the target residency values from DT may be less than perfect.

Also, they really are input for the governor's decisions, nothing
more, so putting values that are likely to yield more desirable
governor behavior in there is not a bad idea.

> > > Currently the best available governor under this metric is TEO which on average results in less than
> > > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and
> > > increased performance in the process.
> >
> > Well, good to hear that, but some numbers in support of that claim
> > would be nice to have too.
>
> Those are the numbers I included in the cover letter for the two
> benchmarks, they've been very consistent in terms of the pattern
> across all the runs and workloads I've seen. For too deep % for
> instance in GB5 we had on average menu 16.6%, TEO 9.6%, TEO+util 4.19%.
> For PCMark Web Browsing menu 24.15%, TEO 10.32%, TEO+util 3.2%. The
> values differ per-workload but every dataset I've seen had that same
> 'staircase' pattern.

I see.

> > > This proposed optional extension to TEO would specifically tune it for minimising too deep
> > > sleeps and minimising latency to achieve better performance. To this end, before selecting the next
> > > idle state it uses the avg_util signal of a CPU's runqueue in order to determine to what extent the
> > > CPU is being utilized.
> >
> > Which has no bearing on what the CPU idle time governors have to do
> > which is (1) to predict the next idle duration as precisely as
> > reasonably possible and (2) to minimise the cost in terms of task
> > wakeup latencies associated with using deep idle states.
> >
> > The avg_util value tells us nothing about how much the CPU is going to
> > be idle this time and it also tells us nothing about the
> > latency-sensitivity of the workload.
> >
> > Yes, it tells us how much idle time there was on the given CPU in the
> > past, on the average, but there is zero information about the
> > distribution of that idle time in it.
> >
> > So in the first place please tell me why it fundamentally makes sense
> > to use avg_util in CPU idle time management at all.
>
> Right, the idea here is slightly similar to that of temporal locality.
> We obviously can't predict the future which is sort of what an idle
> governor tries to achieve. Focusing on timer events makes a lot of sense
> and is probably close to as good as it gets in estimating future
> behaviour.
>
> The observation we're relying on here is simply that if the
> CPU was doing enough work in the recent past for its avg_util to still
> be raised while going into idle, it is very likely that the same CPU
> might be needed again soon. From my tests that assumption tends to be
> correct quite often. In those situations, when avg_util is high and the
> next timer event is far enough for C1 to be selected, a lot of the time
> the CPU does actually get woken up before the residency is hit leading
> to all the issues described above.

This is quite reasonable overall.

> I don't think using avg_util as the *only* input for idle management
> would be a good idea at all. The way I see it, it serves as a very good hint
> to determine if we are likely to get a wakeup between now and the next
> timer event and provides an additional dimension for decision making.
> While the current metrics only adjust themselves after making a certain
> number of mistakes and are a "trailing" adjusting mechanism, using
> avg_util this way provides a "leading" mechanism that potentially lets
> us not make those mistakes in the first place. It's not just theory
> either, it very clearly works and gets results, at least on the
> platforms/workloads we've been looking at.
>
>
> On the Intel & power usage angle you might have seen in the discussion,
> Doug sent me some interesting data privately. As far as I can tell the
> main issue there is that C0 on Intel doesn't actually do power saving so
> moving the state selection down to it is a pretty bad idea because C1
> could be very close in terms of latency and save much more power.
>
> A potential solution could be altering the v2 to only decrease the state
> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
> because arm systems with 2 states use the early exit path anyway. It'd
> just amount to changing this hunk:
>
> +       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
> +               idx--;
>
> to:
>
> +       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
> +               idx--;
>
> What would you think about that?

Definitely it should not be changed if the previous state is a polling
one which can be checked right away.  That would take care of the
"Intel case" automatically.

> Should make it much less intense for Intel systems.

So I think that this adjustment only makes sense if the current
candidate state is state 1 and state 0 is not polling.  In the other
cases the cost of missing an opportunity to save energy would be too
high for the observed performance gain.

Rafael J. Wysocki Oct. 28, 2022, 1:22 p.m. UTC | #12

Hi,

On Thu, Oct 20, 2022 at 6:21 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> Hi Rafael,
>
> > The avg_util value tells us nothing about how much the CPU is going to
> > be idle this time and it also tells us nothing about the
> > latency-sensitivity of the workload.
> >
> > Yes, it tells us how much idle time there was on the given CPU in the
> > past, on the average, but there is zero information about the
> > distribution of that idle time in it.
> >
> > So in the first place please tell me why it fundamentally makes sense
> > to use avg_util in CPU idle time management at all.
>
> I have an alternative suggestion that could be a reasonable way forward
> here. Instead of applying util-awareness on top of TEO where it would
> have to be reconciled with how TEO is currently expected to work, I just
> wrote a simple completely new governor which operates only on timer
> events alongside util values.
>
> The idea is this:
> 1. Find the deepest state based on residency and time until the next timer event
> 2. If sched_cpu_util() is above the threshold, select a shallower non-polling state
>
> There's no other metrics or anything else under the current
> implementation. I can't say how it would work on Intel systems and in
> the presence of more idle states but having a completely separate
> governor would be very useful for us to tune it specifically for our use
> cases and types of systems (ie. ones with 2 idle states and no polling
> states).

So this is not a totally bad idea IMV and the simplicity of this new
governor is certainly attractive.

However, it is likely to underperform in the cases when the interrupt
activity is not directly related to the average CPU load, like when
CPUs do much work, but they are not interrupted very often.

> As it stands it performs quite well and achieves better results
> (especially in terms of latency) than both menu & TEO but slightly worse
> than the previously suggested TEO + util.

Well, precisely, because teo takes other factors into account too.

> As far as we're concerned
> that's okay, we can work from there to try to find a way of doing
> metrics or improving the algorithm that would be more tailored to using
> the util approach. I think it's much cleaner than what we were
> discussing previously since that was effectively overriding most of what
> TEO was doing.
>
> Here are some numbers to visualize the results. They were all obtained
> in the same way as the ones in the cover letter so you can refer to that
> in case something isn't clear.
>
> 'teo_util' is of course TEO + util as in the patchset.
> 'idleutil' is this entirely new proposed minimal governor.
>
> 1. Geekbench 5 (latency-sensitive, heavy load test)
>
> +-----------------+----------+---------+-------------+
> | metric          | kernel   |   value | perc_diff   |
> |-----------------+----------+---------+-------------|
> | multicore_score | menu     |  2832.3 | 0.0%        |
> | multicore_score | teo      |  2815.3 | -0.6%       |
> | multicore_score | teo_util |  2880.6 | 1.7%        |
> | multicore_score | idleutil |  2859.3 | 0.95%       |
> +-----------------+----------+---------+-------------+
>
> Percentages & types of idle misses
>
> +-----------+-------------+--------------+
> | kernel    | type        |   percentage |
> |-----------+-------------+--------------|
> | menu      | too deep    |      15.613% |
> | teo       | too deep    |       9.376% |
> | teo_util  | too deep    |       4.581% |
> | idleutil  | too deep    |       5.464% |
> | menu      | too shallow |       2.611% |
> | teo       | too shallow |       6.099% |
> | teo_util  | too shallow |      14.141% |
> | idleutil  | too shallow |      13.282% |
> +-----------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+----------+---------+-------------+
> | chan_name    | metric   | kernel   |   value | perc_diff   |
> |--------------+----------+----------+---------+-------------|
> | total_power  | gmean    | menu     |  2705.9 | 0.0%        |
> | total_power  | gmean    | teo      |  2668.2 | -1.39%      |
> | total_power  | gmean    | teo_util |  2710.2 | 0.16%       |
> | total_power  | gmean    | idleutil |  2657.9 | -1.78%      |
> +--------------+----------+----------+---------+-------------+
>
> Wakeup latency
>
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   |       value | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | AsyncTask #1    | gmean    | menu     | 66.85μs     | 0.0%        |
> | AsyncTask #1    | gmean    | teo      | 66.79μs     | -0.09%      |
> | AsyncTask #1    | gmean    | teo_util | 57.84μs     | -13.47%     |
> | AsyncTask #1    | gmean    | idleutil | 62.61μs     | -6.35%      |
> | labs.geekbench5 | gmean    | menu     | 80.62μs     | 0.0%        |
> | labs.geekbench5 | gmean    | teo      | 94.75μs     | 17.52%      |
> | labs.geekbench5 | gmean    | teo_util | 52.98μs     | -34.28%     |
> | labs.geekbench5 | gmean    | idleutil | 68.58μs     | -14.93%     |
> +-----------------+----------+----------+-------------+-------------+
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage test)
>
> +----------------+----------+---------+-------------+
> | metric         | kernel   |   value | perc_diff   |
> |----------------+----------+---------+-------------|
> | PcmaWebV2Score | menu     |  5232   | 0.0%        |
> | PcmaWebV2Score | teo      |  5219.8 | -0.23%      |
> | PcmaWebV2Score | teo_util |  5249.7 | 0.34%       |
> | PcmaWebV2Score | idleutil |  5215.7 | -0.31%      |
> +----------------+----------+---------+-------------+
>
> Percentages & types of idle misses
>
> +-----------+-------------+--------------+
> | kernel    | type        |   percentage |
> |-----------+-------------+--------------|
> | menu      | too deep    |      24.814% |
> | teo       | too deep    |       11.65% |
> | teo_util  | too deep    |       3.753% |
> | idleutil  | too deep    |       4.304% |
> | menu      | too shallow |       3.101% |
> | teo       | too shallow |       8.578% |
> | teo_util  | too shallow |      18.309% |
> | idleutil  | too shallow |      17.638% |
> +-----------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+----------+---------+-------------+
> | chan_name    | metric   | kernel   |   value | perc_diff   |
> |--------------+----------+----------+---------+-------------|
> | total_power  | gmean    | menu     |   179.2 | 0.0%        |
> | total_power  | gmean    | teo      |   184.8 | 3.1%        |
> | total_power  | gmean    | teo_util |   180.5 | 0.71%       |
> | total_power  | gmean    | idleutil |   185   | 3.24%       |
> +--------------+----------+----------+---------+-------------+
>
> Wakeup latency
>
> +-----------------+----------+----------+-------------+-------------+
> | comm            | metric   | kernel   |       value | perc_diff   |
> |-----------------+----------+----------+-------------+-------------|
> | CrRendererMain  | gmean    | menu     | 236.63μs    | 0.0%        |
> | CrRendererMain  | gmean    | teo      | 201.85μs    | -14.7%      |
> | CrRendererMain  | gmean    | teo_util | 111.76μs    | -52.77%     |
> | CrRendererMain  | gmean    | idleutil | 105.55μs    | -55.39%     |
> | chmark:workload | gmean    | menu     | 100.30μs    | 0.0%        |
> | chmark:workload | gmean    | teo      | 80.20μs     | -20.04%     |
> | chmark:workload | gmean    | teo_util | 53.81μs     | -46.35%     |
> | chmark:workload | gmean    | idleutil | 71.29μs     | -28.92%     |
> | RenderThread    | gmean    | menu     | 37.97μs     | 0.0%        |
> | RenderThread    | gmean    | teo      | 31.69μs     | -16.54%     |
> | RenderThread    | gmean    | teo_util | 34.32μs     | -9.63%      |
> | RenderThread    | gmean    | idleutil | 35.78μs     | -5.77%      |
> | surfaceflinger  | gmean    | menu     | 97.57μs     | 0.0%        |
> | surfaceflinger  | gmean    | teo      | 98.86μs     | 1.31%       |
> | surfaceflinger  | gmean    | teo_util | 72.59μs     | -25.6%      |
> | surfaceflinger  | gmean    | idleutil | 56.23μs     | -42.37%     |
> +-----------------+----------+----------+-------------+-------------+
>
> I also have similar data for Jankbench & Speedometer with right about
> the same results, I'll skip those for now for brevity.
> Would you like me to send a patch with this new governor instead? What
> would you think about this instead of the previously suggested approach?

I would still kind of prefer to improve teo so it covers the known use
cases better, especially that modified teo is likely to give you
better results than the new simplistic one.

Please see my other reply for the possible direction of improvement.

Thanks!

Rafael J. Wysocki Oct. 28, 2022, 1:25 p.m. UTC | #13

On Thu, Oct 20, 2022 at 9:52 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
>
> Hi Kajetan,
>
> On 20/10/2022 18:20, Kajetan Puchalski wrote:
> > Hi Rafael,
> >
> >> The avg_util value tells us nothing about how much the CPU is going to
> >> be idle this time and it also tells us nothing about the
> >> latency-sensitivity of the workload.
> >>
> >> Yes, it tells us how much idle time there was on the given CPU in the
> >> past, on the average, but there is zero information about the
> >> distribution of that idle time in it.
> >>
> >> So in the first place please tell me why it fundamentally makes sense
> >> to use avg_util in CPU idle time management at all.
> >
> > I have an alternative suggestion that could be a reasonable way forward
> > here. Instead of applying util-awareness on top of TEO where it would
> > have to be reconciled with how TEO is currently expected to work, I just
> > wrote a simple completely new governor which operates only on timer
> > events alongside util values.
>
> I second the idea. I took a long time to investigate how to improve the
> governor and reached the conclusion having a dedicated governor for
> mobile platform makes sense.

Please define "mobile".

> Also the behavior is very platform dependent.

I'm not sure what you mean.

Rafael J. Wysocki Oct. 28, 2022, 1:29 p.m. UTC | #14

On Thu, Oct 27, 2022 at 9:56 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Doug,
>
> Thank you for your effort in testing these patches and different
> governors. We really appreciate that, since this helped us to
> better understand the platform that you are using. It is different
> to what we have and our workloads. That's why I have some comments.
>
> It would be hard to combine these two worlds and requirements.
> I have some concerns to the tests, the setup and the platform.
> I can see a reason why this patch has to prove the
> strengths on this platform and environment.
> Please see my comments below.
>
> On 10/13/22 23:12, Doug Smythies wrote:
> > Hi All,
> >
> > On Thu, Oct 13, 2022 at 4:12 AM Kajetan Puchalski
> > <kajetan.puchalski@arm.com> wrote:
> >> On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> >>> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> >>> <kajetan.puchalski@arm.com> wrote:
> > ...
> >
> >> On the Intel & power usage angle you might have seen in the discussion,
> >> Doug sent me some interesting data privately. As far as I can tell the
> >> main issue there is that C0 on Intel doesn't actually do power saving so
> >> moving the state selection down to it is a pretty bad idea because C1
> >> could be very close in terms of latency and save much more power.
> >>
> >> A potential solution could be altering the v2 to only decrease the state
> >> selection by 1 if it's above 1, ie 2->1 but not 1->0. It's fine for us
> >> because arm systems with 2 states use the early exit path anyway. It'd
> >> just amount to changing this hunk:
> >>
> >> +       if (cpu_data->utilized && idx > 0 && !dev->states_usage[idx-1].disable)
> >> +               idx--;
> >>
> >> to:
> >>
> >> +       if (cpu_data->utilized && idx > 1 && !dev->states_usage[idx-1].disable)
> >> +               idx--;
> >>
> >> What would you think about that? Should make it much less intense for
> >> Intel systems.
> >
> > I tested the above, which you sent me as patch version v2-2.
> >
> > By default, my Intel i5-10600K has 4 idle states:
> >
> > $ grep . /sys/devices/system/cpu/cpu7/cpuidle/state*/name
> > /sys/devices/system/cpu/cpu7/cpuidle/state0/name:POLL
>
> This active polling state type worries me a bit. We don't have
> such on our platforms. Our shallowest idle state is really different.
> We don't have active polling and there is no need for such.

So as I said in a reply to Kajetan, the way to go is to avoid them
when you do this utilization-based optimization.

CPUIDLE_FLAG_POLLING is for that and it is used already in the code.

Moreover, as I said in the other message, IMO the utilization-based
optimization makes the most sense when the current candidate state is
state 1, so it may not make sense to do it on Intel systems at all.

Rafael J. Wysocki Oct. 28, 2022, 1:37 p.m. UTC | #15

Hi,

On Thu, Oct 27, 2022 at 10:04 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Rafael,
>
> On 10/13/22 12:12, Kajetan Puchalski wrote:
> > On Wed, Oct 12, 2022 at 08:50:39PM +0200, Rafael J. Wysocki wrote:
> >> On Mon, Oct 3, 2022 at 4:50 PM Kajetan Puchalski
> >> <kajetan.puchalski@arm.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> At the moment, all the available idle governors operate mainly based on their own past performance
> >>
> >> Not true, at least for the menu and teo governors that use the
> >> information on the distribution of CPU wakeups that is available to
> >> them and try to predict the next idle duration with the help of it.
> >> This has a little to do with their performance.
> >
> > You're right of course, I should have written "their own past
> > correctness" as that's what I was referring to. I just meant that for
> > instance with TEO the initial timer-based choice is only adjusted using
> > the governor's own metrics and not any information from anywhere else in
> > the system.
> >
>
> [snip]
>
> Would it be possible to consider a new small and simple idle governor
> which is better suited for those other workloads and platforms?
> Kajetan has such one and can send to the LKML, so you could have a look.
>
> I have sent some detailed explanation about this to Doug in this
> thread (don't want to duplicate it).
>
> It looks like it would be hard to meet both worlds' requirements.

It may or may not be the case.  Let's first explore the original idea
of improving "teo" a bit more.

As I said in another message in this thread, there are clear cases in
which the new governor is likely to underperform, because it only
takes 2 sources of information into account (the next timer event and
CPU utilization).  If some more information is to be taken into
account without adding too much overhead, "teo" is the simplest thing
that I could imagine (but perhaps my imagination is limited).

Kajetan Puchalski Oct. 28, 2022, 3 p.m. UTC | #16

On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:

> > The result being that this util-aware TEO variant while using much less
> > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > being wasted on not hitting C1 residency over and over.
> 
> Hmm.  The PCMark Web Browsing table in your cover letter doesn't indicate that.
> 
> The "gmean power usage" there for "teo + util-aware" is 205, whereas
> for "teo" alone it is 187.8.  This is still arguably balanced by the
> latency difference (~100 us vs ~185 us, respectively), but this looks
> like trading energy for performance.

In this case yes, I meant 2% less compared to menu but you're right of
course.

[...]

> Definitely it should not be changed if the previous state is a polling
> one which can be checked right away.  That would take care of the
> "Intel case" automatically.

Makes sense, I already used the polling flag to implement this in this other
governor I mentioned.

> 
> > Should make it much less intense for Intel systems.
> 
> So I think that this adjustment only makes sense if the current
> candidate state is state 1 and state 0 is not polling.  In the other
> cases the cost of missing an opportunity to save energy would be too
> high for the observed performance gain.

Interesting, but only applying it to C1 and only when C0 isn't polling would
make it effectively not do anything on Intel systems, right? From what I've
seen on Doug's plots even C1 is hardly ever used on his platform, most
sleeps end up in the deepest possible state.

Checking for the polling flag is a good idea regardless so I can send a
v3 with that. If you'd like me to also restrict the entire mechanism to
only working on C1 as you suggested then I'm okay with including that in
the v3 as well. What do you think?

Thanks a lot for all your time & input,
Kajetan

Rafael J. Wysocki Oct. 28, 2022, 3:04 p.m. UTC | #17

On Fri, Oct 28, 2022 at 5:01 PM Kajetan Puchalski
<kajetan.puchalski@arm.com> wrote:
>
> On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:
>
> > > The result being that this util-aware TEO variant while using much less
> > > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > > being wasted on not hitting C1 residency over and over.
> >
> > Hmm.  The PCMark Web Browsing table in your cover letter doesn't indicate that.
> >
> > The "gmean power usage" there for "teo + util-aware" is 205, whereas
> > for "teo" alone it is 187.8.  This is still arguably balanced by the
> > latency difference (~100 us vs ~185 us, respectively), but this looks
> > like trading energy for performance.
>
> In this case yes, I meant 2% less compared to menu but you're right of
> course.
>
> [...]
>
> > Definitely it should not be changed if the previous state is a polling
> > one which can be checked right away.  That would take care of the
> > "Intel case" automatically.
>
> Makes sense, I already used the polling flag to implement this in this other
> governor I mentioned.
>
> >
> > > Should make it much less intense for Intel systems.
> >
> > So I think that this adjustment only makes sense if the current
> > candidate state is state 1 and state 0 is not polling.  In the other
> > cases the cost of missing an opportunity to save energy would be too
> > high for the observed performance gain.
>
> Interesting, but only applying it to C1 and only when C0 isn't polling would
> make it effectively not do anything on Intel systems, right?

Indeed.

> From what I've seen on Doug's plots even C1 is hardly ever used on his platform, most
> sleeps end up in the deepest possible state.

That depends a lot on the workload.  There are workloads in which C1
is mostly used and the deeper idle states aren't.

> Checking for the polling flag is a good idea regardless so I can send a
> v3 with that. If you'd like me to also restrict the entire mechanism to
> only working on C1 as you suggested then I'm okay with including that in
> the v3 as well. What do you think?

It would be good to do that and see if there are any significant
differences in the results.

> Thanks a lot for all your time & input,

No problem at all.

Rafael J. Wysocki Oct. 28, 2022, 3:08 p.m. UTC | #18

On Fri, Oct 28, 2022 at 5:04 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Oct 28, 2022 at 5:01 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > On Fri, Oct 28, 2022 at 03:12:43PM +0200, Rafael J. Wysocki wrote:
> >
> > > > The result being that this util-aware TEO variant while using much less
> > > > C1 and decreasing the percentage of too deep sleeps from ~24% to ~3% in
> > > > PCMark Web Browsing also uses almost 2% less power. Clearly the power is
> > > > being wasted on not hitting C1 residency over and over.
> > >
> > > Hmm.  The PCMark Web Browsing table in your cover letter doesn't indicate that.
> > >
> > > The "gmean power usage" there for "teo + util-aware" is 205, whereas
> > > for "teo" alone it is 187.8.  This is still arguably balanced by the
> > > latency difference (~100 us vs ~185 us, respectively), but this looks
> > > like trading energy for performance.
> >
> > In this case yes, I meant 2% less compared to menu but you're right of
> > course.
> >
> > [...]
> >
> > > Definitely it should not be changed if the previous state is a polling
> > > one which can be checked right away.  That would take care of the
> > > "Intel case" automatically.
> >
> > Makes sense, I already used the polling flag to implement this in this other
> > governor I mentioned.
> >
> > >
> > > > Should make it much less intense for Intel systems.
> > >
> > > So I think that this adjustment only makes sense if the current
> > > candidate state is state 1 and state 0 is not polling.  In the other
> > > cases the cost of missing an opportunity to save energy would be too
> > > high for the observed performance gain.
> >
> > Interesting, but only applying it to C1 and only when C0 isn't polling would
> > make it effectively not do anything on Intel systems, right?
>
> Indeed.
>
> > From what I've seen on Doug's plots even C1 is hardly ever used on his platform, most
> > sleeps end up in the deepest possible state.
>
> That depends a lot on the workload.  There are workloads in which C1
> is mostly used and the deeper idle states aren't.
>
> > Checking for the polling flag is a good idea regardless so I can send a
> > v3 with that. If you'd like me to also restrict the entire mechanism to
> > only working on C1 as you suggested then I'm okay with including that in
> > the v3 as well. What do you think?
>
> It would be good to do that and see if there are any significant
> differences in the results.

BTW, you may as well drop the extra #ifdeffery from the v3, I don't
think that it is particularly useful.

[RFC,v2,0/1] cpuidle: teo: Introduce optional util-awareness

Message

Comments