Message ID: 20230105145159.1089531-1-kajetan.puchalski@arm.com
Series: cpuidle: teo: Introduce util-awareness
On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote: > > Modern interactive systems, such as recent Android phones, tend to have power > efficient shallow idle states. Selecting deeper idle states on a device while a > latency-sensitive workload is running can adversely impact performance due to > increased latency. Additionally, if the CPU wakes up from a deeper sleep before > its target residency as is often the case, it results in a waste of energy on > top of that. > > At the moment, none of the available idle governors take any scheduling > information into account. They also tend to overestimate the idle > duration quite often, which causes them to select excessively deep idle > states, thus leading to increased wakeup latency and lower performance with no > power saving. For 'menu' while web browsing on Android for instance, those > types of wakeups ('too deep') account for over 24% of all wakeups. > > At the same time, on some platforms idle state 0 can be power efficient > enough to warrant wanting to prefer it over idle state 1. This is because > the power usage of the two states can be so close that sufficient amounts > of too deep state 1 sleeps can completely offset the state 1 power saving to the > point where it would've been more power efficient to just use state 0 instead. > This is of course for systems where state 0 is not a polling state, such as > arm-based devices. > > Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only > save less power than they otherwise could have. Too deep sleeps, on the other > hand, harm performance and nullify the potential power saving from using state 1 in > the first place. While taking this into account, it is clear that on balance it > is preferable for an idle governor to have more too shallow sleeps instead of > more too deep sleeps on those kinds of platforms. 
> > This patch specifically tunes TEO to prefer shallower idle states in > order to reduce wakeup latency and achieve better performance. > To this end, before selecting the next idle state it uses the avg_util signal > of a CPU's runqueue in order to determine to what extent the CPU is being utilized. > This util value is then compared to a threshold defined as a percentage of the > cpu's capacity (capacity >> 6 ie. ~1.5% in the current implementation). If the > util is above the threshold, the idle state selected by TEO metrics will be > reduced by 1, thus selecting a shallower state. If the util is below the threshold, > the governor defaults to the TEO metrics mechanism to try to select the deepest > available idle state based on the closest timer event and its own correctness. > > The main goal of this is to reduce latency and increase performance for some > workloads. Under some workloads it will result in an increase in power usage > (Geekbench 5) while for other workloads it will also result in a decrease in > power usage compared to TEO (PCMark Web, Jankbench, Speedometer). > > It can provide drastically decreased latency and performance benefits in certain > types of workloads that are sensitive to latency. > > Example test results: > > 1. GB5 (better score, latency & more power usage) > > | metric | menu | teo | teo-util-aware | > | ------------------------------------- | -------------- | ----------------- | ----------------- | > | gmean score | 2826.5 (0.0%) | 2764.8 (-2.18%) | 2865 (1.36%) | > | gmean power usage [mW] | 2551.4 (0.0%) | 2606.8 (2.17%) | 2722.3 (6.7%) | > | gmean too deep % | 14.99% | 9.65% | 4.02% | > | gmean too shallow % | 2.5% | 5.96% | 14.59% | > | gmean task wakeup latency (asynctask) | 78.16μs (0.0%) | 61.60μs (-21.19%) | 54.45μs (-30.34%) | > > 2. 
Jankbench (better score, latency & less power usage) > > | metric | menu | teo | teo-util-aware | > | ------------------------------------- | -------------- | ----------------- | ----------------- | > | gmean frame duration | 13.9 (0.0%) | 14.7 (6.0%) | 12.6 (-9.0%) | > | gmean jank percentage | 1.5 (0.0%) | 2.1 (36.99%) | 1.3 (-17.37%) | > | gmean power usage [mW] | 144.6 (0.0%) | 136.9 (-5.27%) | 121.3 (-16.08%) | > | gmean too deep % | 26.00% | 11.00% | 2.54% | > | gmean too shallow % | 4.74% | 11.89% | 21.93% | > | gmean wakeup latency (RenderThread) | 139.5μs (0.0%) | 116.5μs (-16.49%) | 91.11μs (-34.7%) | > | gmean wakeup latency (surfaceflinger) | 124.0μs (0.0%) | 151.9μs (22.47%) | 87.65μs (-29.33%) | > > Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com> This looks good enough for me. There are still a couple of things I would change in it, but I may as well do that when applying it, so never mind. The most important question for now is what the scheduler people think about calling sched_cpu_util() from a CPU idle governor. Peter, Vincent? > --- > drivers/cpuidle/governors/teo.c | 92 ++++++++++++++++++++++++++++++++- > 1 file changed, 91 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c > index e2864474a98d..2a2be4f45b70 100644 > --- a/drivers/cpuidle/governors/teo.c > +++ b/drivers/cpuidle/governors/teo.c > @@ -2,8 +2,13 @@ > /* > * Timer events oriented CPU idle governor > * > + * TEO governor: > * Copyright (C) 2018 - 2021 Intel Corporation > * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > + * > + * Util-awareness mechanism: > + * Copyright (C) 2022 Arm Ltd. > + * Author: Kajetan Puchalski <kajetan.puchalski@arm.com> > */ > > /** > @@ -99,14 +104,55 @@ > * select the given idle state instead of the candidate one. > * > * 3. By default, select the candidate state. 
> + * > + * Util-awareness mechanism: > + * > + * The idea behind the util-awareness extension is that there are two distinct > + * scenarios for the CPU which should result in two different approaches to idle > + * state selection - utilized and not utilized. > + * > + * In this case, 'utilized' means that the average runqueue util of the CPU is > + * above a certain threshold. > + * > + * When the CPU is utilized while going into idle, more likely than not it will > + * be woken up to do more work soon and so a shallower idle state should be > + * selected to minimise latency and maximise performance. When the CPU is not > + * being utilized, the usual metrics-based approach to selecting the deepest > + * available idle state should be preferred to take advantage of the power > + * saving. > + * > + * In order to achieve this, the governor uses a utilization threshold. > + * The threshold is computed per-cpu as a percentage of the CPU's capacity > + * by bit shifting the capacity value. Based on testing, the shift of 6 (~1.56%) > + * seems to be getting the best results. > + * > + * Before selecting the next idle state, the governor compares the current CPU > + * util to the precomputed util threshold. If it's below, it defaults to the > + * TEO metrics mechanism. If it's above, the idle state will be reduced to C0 > + * as long as C0 is not a polling state. > */ > > #include <linux/cpuidle.h> > #include <linux/jiffies.h> > #include <linux/kernel.h> > +#include <linux/sched.h> > #include <linux/sched/clock.h> > +#include <linux/sched/topology.h> > #include <linux/tick.h> > > +/* > + * The number of bits to shift the cpu's capacity by in order to determine > + * the utilized threshold. > + * > + * 6 was chosen based on testing as the number that achieved the best balance > + * of power and performance on average. 
> + * > + * The resulting threshold is high enough to not be triggered by background > + * noise and low enough to react quickly when activity starts to ramp up. > + */ > +#define UTIL_THRESHOLD_SHIFT 6 > + > + > /* > * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value > * is used for decreasing metrics on a regular basis. > @@ -137,9 +183,11 @@ struct teo_bin { > * @time_span_ns: Time between idle state selection and post-wakeup update. > * @sleep_length_ns: Time till the closest timer event (at the selection time). > * @state_bins: Idle state data bins for this CPU. > - * @total: Grand total of the "intercepts" and "hits" mertics for all bins. > + * @total: Grand total of the "intercepts" and "hits" metrics for all bins. > * @next_recent_idx: Index of the next @recent_idx entry to update. > * @recent_idx: Indices of bins corresponding to recent "intercepts". > + * @util_threshold: Threshold above which the CPU is considered utilized > + * @utilized: Whether the last sleep on the CPU happened while utilized > */ > struct teo_cpu { > s64 time_span_ns; > @@ -148,10 +196,29 @@ struct teo_cpu { > unsigned int total; > int next_recent_idx; > int recent_idx[NR_RECENT]; > + unsigned long util_threshold; > + bool utilized; > }; > > static DEFINE_PER_CPU(struct teo_cpu, teo_cpus); > > +/** > + * teo_cpu_is_utilized - Check if the CPU's util is above the threshold > + * @cpu: Target CPU > + * @cpu_data: Governor CPU data for the target CPU > + */ > +#ifdef CONFIG_SMP > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data) > +{ > + return sched_cpu_util(cpu) > cpu_data->util_threshold; > +} > +#else > +static bool teo_cpu_is_utilized(int cpu, struct teo_cpu *cpu_data) > +{ > + return false; > +} > +#endif > + > /** > * teo_update - Update CPU metrics after wakeup. > * @drv: cpuidle driver containing state data. 
> @@ -323,6 +390,20 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, > goto end; > } > > + cpu_data->utilized = teo_cpu_is_utilized(dev->cpu, cpu_data); > + /* > + * The cpu is being utilized over the threshold there are only 2 states to choose from. > + * No need to consider metrics, choose the shallowest non-polling state and exit. > + */ > + if (drv->state_count < 3 && cpu_data->utilized) { > + for (i = 0; i < drv->state_count; ++i) { > + if (!dev->states_usage[i].disable && !(drv->states[i].flags & CPUIDLE_FLAG_POLLING)) { > + idx = i; > + goto end; > + } > + } > + } > + > /* > * Find the deepest idle state whose target residency does not exceed > * the current sleep length and the deepest idle state not deeper than > @@ -454,6 +535,13 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, > if (idx > constraint_idx) > idx = constraint_idx; > > + /* > + * If the CPU is being utilized over the threshold, > + * choose a shallower non-polling state to improve latency > + */ > + if (cpu_data->utilized) > + idx = teo_find_shallower_state(drv, dev, idx, duration_ns, true); > + > end: > /* > * Don't stop the tick if the selected state is a polling one or if the > @@ -510,9 +598,11 @@ static int teo_enable_device(struct cpuidle_driver *drv, > struct cpuidle_device *dev) > { > struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu); > + unsigned long max_capacity = arch_scale_cpu_capacity(dev->cpu); > int i; > > memset(cpu_data, 0, sizeof(*cpu_data)); > + cpu_data->util_threshold = max_capacity >> UTIL_THRESHOLD_SHIFT; > > for (i = 0; i < NR_RECENT; i++) > cpu_data->recent_idx[i] = -1; > -- > 2.37.1 >
On 1/5/23 15:07, Rafael J. Wysocki wrote:
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
>>
>> [snip]
>>
>> Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
>
> This looks good enough for me.
>
> There are still a couple of things I would change in it, but I may as
> well do that when applying it, so never mind.
>
> The most important question for now is what the scheduler people think
> about calling sched_cpu_util() from a CPU idle governor. Peter,
> Vincent?

We have a precedent in the thermal framework: the IPA thermal governor. It has been there for a while to estimate the power of CPUs in the frequency domain for the cpufreq_cooling device [1]. That's how the sched_cpu_util() API got created. It was then also adopted by PowerCap DTPM [2] for the same power-estimation purpose. It's a function available from include/linux/sched.h, so I don't see any reason not to use it.

[1] https://elixir.bootlin.com/linux/latest/source/drivers/thermal/cpufreq_cooling.c#L151
[2] https://elixir.bootlin.com/linux/latest/source/drivers/powercap/dtpm_cpu.c#L83
On Thu, 5 Jan 2023 at 16:07, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > [snip]
> >
> > Signed-off-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
>
> This looks good enough for me.
>
> There are still a couple of things I would change in it, but I may as
> well do that when applying it, so never mind.
>
> The most important question for now is what the scheduler people think
> about calling sched_cpu_util() from a CPU idle governor. Peter,
> Vincent?

I don't see a problem with using sched_cpu_util() outside the scheduler as it's already used in thermal and dtpm to get cpu utilization.
On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski <kajetan.puchalski@arm.com> wrote: > > Hi, > > At the moment, none of the available idle governors take any scheduling > information into account. They also tend to overestimate the idle > duration quite often, which causes them to select excessively deep idle > states, thus leading to increased wakeup latency and lower performance with no > power saving. For 'menu' while web browsing on Android for instance, those > types of wakeups ('too deep') account for over 24% of all wakeups. > > At the same time, on some platforms idle state 0 can be power efficient > enough to warrant wanting to prefer it over idle state 1. This is because > the power usage of the two states can be so close that sufficient amounts > of too deep state 1 sleeps can completely offset the state 1 power saving to the > point where it would've been more power efficient to just use state 0 instead. > This is of course for systems where state 0 is not a polling state, such as > arm-based devices. > > Sleeps that happened in state 0 while they could have used state 1 ('too shallow') only > save less power than they otherwise could have. Too deep sleeps, on the other > hand, harm performance and nullify the potential power saving from using state 1 in > the first place. While taking this into account, it is clear that on balance it > is preferable for an idle governor to have more too shallow sleeps instead of > more too deep sleeps on those kinds of platforms. > > Currently the best available governor under this metric is TEO which on average results in less than > half the percentage of too deep sleeps compared to 'menu', getting much better wakeup latencies and > increased performance in the process. > > This patchset specifically tunes TEO to prefer shallower idle states in order to reduce wakeup latency > and achieve better performance. 
To this end, before selecting the next idle state it uses the avg_util > signal of a CPU's runqueue in order to determine to what extent the CPU is being utilized. > This util value is then compared to a threshold defined as a percentage of the cpu's capacity > (capacity >> 6 ie. ~1.5% in the current implementation). If the util is above the threshold, the idle > state selected by TEO metrics will be reduced by 1, thus selecting a shallower state. If the util is > below the threshold, the governor defaults to the TEO metrics mechanism to try to select the deepest > available idle state based on the closest timer event and its own correctness. > > The main goal of this is to reduce latency and increase performance for some workloads. Under some > workloads it will result in an increase in power usage (Geekbench 5) while for other workloads it > will also result in a decrease in power usage compared to TEO (PCMark Web, Jankbench, Speedometer). > > As of v2 the patch includes a 'fast exit' path for arm-based and similar systems where only 2 idle > states are present. If there's just 2 idle states and the CPU is utilized, we can directly select > the shallowest state and save cycles by skipping the entire metrics mechanism. > > Under the current implementation, the state will not be reduced by 1 if the change would lead to > selecting a polling state instead of a non-polling state. > > This approach can outperform all the other currently available governors, at least on mobile device > workloads, which is why I think it is worth keeping as an option. > > There is no particular attachment or reliance on TEO for this mechanism, I simply chose to base > it on TEO because it performs the best out of all the available options and I didn't think there was > any point in reinventing the wheel on the side of computing governor metrics. 
> If a better approach comes along at some point, there's no reason why the
> same idle-aware mechanism couldn't be used with any other metrics
> algorithm. That would, however, require implementing it as a separate
> governor rather than a TEO add-on.
>
> As for how the extension performs in practice, below I'll add some
> benchmark results I got while testing this patchset. All the benchmarks
> were run after holding the phone in the fridge for exactly an hour each
> time to minimise the impact of thermal issues.
>
> Pixel 6 (Android 12, mainline kernel 5.18, with newer mainline CFS patches):
>
> 1. Geekbench 5 (latency-sensitive, heavy load test)
>
> The values below are gmean values across 3 back-to-back iterations of
> Geekbench 5. As GB5 is a heavy benchmark, after more than 3 iterations
> intense throttling kicks in on mobile devices, resulting in skewed
> benchmark scores, which makes it difficult to collect reliable results.
> The actual values for all of the governors can change between runs as the
> benchmark might be affected by factors other than just latency.
> Nevertheless, on the runs I've seen, util-aware TEO frequently achieved
> better scores than all the other governors.
>
> Benchmark scores
>
> +-----------------+-------------+---------+-------------+
> | metric          | kernel      |   value |   perc_diff |
> |-----------------+-------------+---------+-------------|
> | multicore_score | menu        |  2826.5 |        0.0% |
> | multicore_score | teo         |  2764.8 |      -2.18% |
> | multicore_score | teo_util_v3 |    2849 |        0.8% |
> | multicore_score | teo_util_v4 |    2865 |       1.36% |
> | score           | menu        |    1053 |        0.0% |
> | score           | teo         |  1050.7 |      -0.22% |
> | score           | teo_util_v3 |  1059.6 |       0.63% |
> | score           | teo_util_v4 |  1057.6 |       0.44% |
> +-----------------+-------------+---------+-------------+
>
> Idle misses
>
> The numbers are percentages of too deep and too shallow sleeps computed
> using the new trace event - cpu_idle_miss.
> The percentage is obtained by counting the two types of misses over the
> course of a run and then dividing them by the total number of wakeups in
> that run.
>
> +-------------+-------------+--------------+
> | wa_path     | type        |   count_perc |
> |-------------+-------------+--------------|
> | menu        | too deep    |      14.994% |
> | teo         | too deep    |       9.649% |
> | teo_util_v3 | too deep    |       4.298% |
> | teo_util_v4 | too deep    |        4.02% |
> | menu        | too shallow |       2.497% |
> | teo         | too shallow |       5.963% |
> | teo_util_v3 | too shallow |      13.773% |
> | teo_util_v4 | too shallow |      14.598% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name    | metric   | kernel      |   value |   perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power  | gmean    | menu        |  2551.4 |        0.0% |
> | total_power  | gmean    | teo         |  2606.8 |       2.17% |
> | total_power  | gmean    | teo_util_v3 |  2670.1 |       4.65% |
> | total_power  | gmean    | teo_util_v4 |  2722.3 |        6.7% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm            | metric   | kernel      |       value |   perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | AsyncTask #1    | gmean    | menu        |     78.16μs |        0.0% |
> | AsyncTask #1    | gmean    | teo         |     61.60μs |     -21.19% |
> | AsyncTask #1    | gmean    | teo_util_v3 |     74.34μs |      -4.89% |
> | AsyncTask #1    | gmean    | teo_util_v4 |     54.45μs |     -30.34% |
> | labs.geekbench5 | gmean    | menu        |     88.55μs |        0.0% |
> | labs.geekbench5 | gmean    | teo         |    100.97μs |      14.02% |
> | labs.geekbench5 | gmean    | teo_util_v3 |     53.57μs |      -39.5% |
> | labs.geekbench5 | gmean    | teo_util_v4 |     59.60μs |      -32.7% |
> +-----------------+----------+-------------+-------------+-------------+
>
> In the case of this benchmark, the difference in latency does seem to
> translate into better scores.
>
> 2. PCMark Web Browsing (non latency-sensitive, normal usage web browsing test)
>
> The table below contains gmean values across 20 back-to-back iterations of
> PCMark 2 Web Browsing.
>
> Benchmark scores
>
> +----------------+-------------+---------+-------------+
> | metric         | kernel      |   value |   perc_diff |
> |----------------+-------------+---------+-------------|
> | PcmaWebV2Score | menu        |    5232 |        0.0% |
> | PcmaWebV2Score | teo         |  5219.8 |      -0.23% |
> | PcmaWebV2Score | teo_util_v3 |  5273.5 |       0.79% |
> | PcmaWebV2Score | teo_util_v4 |  5239.9 |       0.15% |
> +----------------+-------------+---------+-------------+
>
> Idle misses
>
> +-------------+-------------+--------------+
> | wa_path     | type        |   count_perc |
> |-------------+-------------+--------------|
> | menu        | too deep    |      24.814% |
> | teo         | too deep    |       11.65% |
> | teo_util_v3 | too deep    |       3.481% |
> | teo_util_v4 | too deep    |       3.662% |
> | menu        | too shallow |       3.101% |
> | teo         | too shallow |       8.578% |
> | teo_util_v3 | too shallow |      18.326% |
> | teo_util_v4 | too shallow |      18.692% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name    | metric   | kernel      |   value |   perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power  | gmean    | menu        |   179.2 |        0.0% |
> | total_power  | gmean    | teo         |   184.8 |        3.1% |
> | total_power  | gmean    | teo_util_v3 |   177.4 |      -1.02% |
> | total_power  | gmean    | teo_util_v4 |   184.1 |       2.71% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm            | metric   | kernel      |       value |   perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | CrRendererMain  | gmean    | menu        |    236.63μs |        0.0% |
> | CrRendererMain  | gmean    | teo         |    201.85μs |      -14.7% |
> | CrRendererMain  | gmean    | teo_util_v3 |    106.46μs |     -55.01% |
> | CrRendererMain  | gmean    | teo_util_v4 |    106.72μs |      -54.9% |
> | chmark:workload | gmean    | menu        |    100.30μs |        0.0% |
> | chmark:workload | gmean    | teo         |     80.20μs |     -20.04% |
> | chmark:workload | gmean    | teo_util_v3 |     65.88μs |     -34.32% |
> | chmark:workload | gmean    | teo_util_v4 |     57.90μs |     -42.28% |
> | surfaceflinger  | gmean    | menu        |     97.57μs |        0.0% |
> | surfaceflinger  | gmean    | teo         |     98.86μs |       1.31% |
> | surfaceflinger  | gmean    | teo_util_v3 |     56.49μs |      -42.1% |
> | surfaceflinger  | gmean    | teo_util_v4 |     72.68μs |     -25.52% |
> +-----------------+----------+-------------+-------------+-------------+
>
> In this case the large latency improvement does not translate into a
> notable increase in benchmark score, as this particular benchmark mainly
> responds to changes in operating frequency.
>
> 3. Jankbench (locked 60hz screen) (normal usage UI test)
>
> Frame durations
>
> +---------------+------------------+---------+-------------+
> | variable      | kernel           |   value |   perc_diff |
> |---------------+------------------+---------+-------------|
> | mean_duration | menu_60hz        |    13.9 |        0.0% |
> | mean_duration | teo_60hz         |    14.7 |        6.0% |
> | mean_duration | teo_util_v3_60hz |    13.8 |      -0.87% |
> | mean_duration | teo_util_v4_60hz |    12.6 |       -9.0% |
> +---------------+------------------+---------+-------------+
>
> Jank percentage
>
> +------------+------------------+---------+-------------+
> | variable   | kernel           |   value |   perc_diff |
> |------------+------------------+---------+-------------|
> | jank_perc  | menu_60hz        |     1.5 |        0.0% |
> | jank_perc  | teo_60hz         |     2.1 |      36.99% |
> | jank_perc  | teo_util_v3_60hz |     1.3 |     -13.95% |
> | jank_perc  | teo_util_v4_60hz |     1.3 |     -17.37% |
> +------------+------------------+---------+-------------+
>
> Idle misses
>
> +------------------+-------------+--------------+
> | wa_path          | type        |   count_perc |
> |------------------+-------------+--------------|
> | menu_60hz        | too deep    |       26.00% |
> | teo_60hz         | too deep    |       11.00% |
> | teo_util_v3_60hz | too deep    |        2.33% |
> | teo_util_v4_60hz | too deep    |        2.54% |
> | menu_60hz        | too shallow |        4.74% |
> | teo_60hz         | too shallow |       11.89% |
> | teo_util_v3_60hz | too shallow |       21.78% |
> | teo_util_v4_60hz | too shallow |       21.93% |
> +------------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+------------------+---------+-------------+
> | chan_name    | kernel           |   value |   perc_diff |
> |--------------+------------------+---------+-------------|
> | total_power  | menu_60hz        |   144.6 |        0.0% |
> | total_power  | teo_60hz         |   136.9 |      -5.27% |
> | total_power  | teo_util_v3_60hz |   134.2 |      -7.19% |
> | total_power  | teo_util_v4_60hz |   121.3 |     -16.08% |
> +--------------+------------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+------------------+-------------+-------------+
> | comm            | kernel           |       value |   perc_diff |
> |-----------------+------------------+-------------+-------------|
> | RenderThread    | menu_60hz        |    139.52μs |        0.0% |
> | RenderThread    | teo_60hz         |    116.51μs |     -16.49% |
> | RenderThread    | teo_util_v3_60hz |     86.76μs |     -37.82% |
> | RenderThread    | teo_util_v4_60hz |     91.11μs |      -34.7% |
> | droid.benchmark | menu_60hz        |    135.88μs |        0.0% |
> | droid.benchmark | teo_60hz         |    105.21μs |     -22.57% |
> | droid.benchmark | teo_util_v3_60hz |     83.92μs |     -38.24% |
> | droid.benchmark | teo_util_v4_60hz |     83.18μs |     -38.79% |
> | surfaceflinger  | menu_60hz        |    124.03μs |        0.0% |
> | surfaceflinger  | teo_60hz         |    151.90μs |      22.47% |
> | surfaceflinger  | teo_util_v3_60hz |    100.19μs |     -19.22% |
> | surfaceflinger  | teo_util_v4_60hz |     87.65μs |     -29.33% |
> +-----------------+------------------+-------------+-------------+
>
> 4. Speedometer 2 (heavy load web browsing test)
>
> Benchmark scores
>
> +-------------------+-------------+---------+-------------+
> | metric            | kernel      |   value |   perc_diff |
> |-------------------+-------------+---------+-------------|
> | Speedometer Score | menu        |     102 |        0.0% |
> | Speedometer Score | teo         |   104.9 |       2.88% |
> | Speedometer Score | teo_util_v3 |   102.1 |       0.16% |
> | Speedometer Score | teo_util_v4 |   103.8 |       1.83% |
> +-------------------+-------------+---------+-------------+
>
> Idle misses
>
> +-------------+-------------+--------------+
> | wa_path     | type        |   count_perc |
> |-------------+-------------+--------------|
> | menu        | too deep    |       17.95% |
> | teo         | too deep    |        6.46% |
> | teo_util_v3 | too deep    |        0.63% |
> | teo_util_v4 | too deep    |        0.64% |
> | menu        | too shallow |        3.86% |
> | teo         | too shallow |        8.21% |
> | teo_util_v3 | too shallow |       14.72% |
> | teo_util_v4 | too shallow |       14.43% |
> +-------------+-------------+--------------+
>
> Power usage [mW]
>
> +--------------+----------+-------------+---------+-------------+
> | chan_name    | metric   | kernel      |   value |   perc_diff |
> |--------------+----------+-------------+---------+-------------|
> | total_power  | gmean    | menu        |    2059 |        0.0% |
> | total_power  | gmean    | teo         |  2187.8 |       6.26% |
> | total_power  | gmean    | teo_util_v3 |  2212.9 |       7.47% |
> | total_power  | gmean    | teo_util_v4 |  2121.8 |       3.05% |
> +--------------+----------+-------------+---------+-------------+
>
> Task wakeup latency
>
> +-----------------+----------+-------------+-------------+-------------+
> | comm            | metric   | kernel      |       value |   perc_diff |
> |-----------------+----------+-------------+-------------+-------------|
> | CrRendererMain  | gmean    | menu        |     17.18μs |        0.0% |
> | CrRendererMain  | gmean    | teo         |     16.18μs |      -5.82% |
> | CrRendererMain  | gmean    | teo_util_v3 |     18.04μs |       5.05% |
> | CrRendererMain  | gmean    | teo_util_v4 |     18.25μs |       6.27% |
> | RenderThread    | gmean    | menu        |     68.60μs |        0.0% |
> | RenderThread    | gmean    | teo         |     48.44μs |     -29.39% |
> | RenderThread    | gmean    | teo_util_v3 |     48.01μs |     -30.02% |
> | RenderThread    | gmean    | teo_util_v4 |     51.24μs |      -25.3% |
> | surfaceflinger  | gmean    | menu        |     42.23μs |        0.0% |
> | surfaceflinger  | gmean    | teo         |     29.84μs |     -29.33% |
> | surfaceflinger  | gmean    | teo_util_v3 |     24.51μs |     -41.95% |
> | surfaceflinger  | gmean    | teo_util_v4 |     29.64μs |      -29.8% |
> +-----------------+----------+-------------+-------------+-------------+
>
> Thank you for taking your time to read this!
>
> --
> Kajetan
>
> v5 -> v6:
> - amended some wording in the commit description & cover letter
> - included test results in the commit description
> - refactored checking the CPU utilized status to account for !SMP systems
> - dropped the RFC from the patchset header
>
> v4 -> v5:
> - remove the restriction to only apply the mechanism for C1 candidate state
> - clarify some code comments, fix comment style
> - refactor the fast-exit path loop implementation
> - move some cover letter information into the commit description
>
> v3 -> v4:
> - remove the chunk of code skipping metrics updates when the CPU was utilized
> - include new test results and more benchmarks in the cover letter
>
> v2 -> v3:
> - add a patch adding an option to skip polling states in teo_find_shallower_state()
> - only reduce the state if the candidate state is C1 and C0 is not a polling state
> - add a check for polling states in the 2-states fast-exit path
> - remove the ifdefs and Kconfig option
>
> v1 -> v2:
> - rework the mechanism to reduce selected state by 1 instead of directly selecting C0 (suggested by Doug Smythies)
> - add a fast-exit path for systems with 2 idle states to not waste cycles on metrics when utilized
> - fix typos in comments
> - include a missing header
>
>
> Kajetan Puchalski (2):
>   cpuidle: teo: Optionally skip polling states in teo_find_shallower_state()
>   cpuidle: teo: Introduce util-awareness
>
>  drivers/cpuidle/governors/teo.c | 100 ++++++++++++++++++++++++++++++--
>  1 file changed, 96 insertions(+), 4 deletions(-)
>
> --

Both patches in the series applied as 6.3 material, thanks!
On Thu, Jan 12, 2023 at 08:22:24PM +0100, Rafael J. Wysocki wrote:
> On Thu, Jan 5, 2023 at 3:52 PM Kajetan Puchalski
> <kajetan.puchalski@arm.com> wrote:
> >
> > [full cover letter quoted above snipped]
>
> Both patches in the series applied as 6.3 material, thanks!

Thanks a lot, take care!