Message ID | 20210625152603.25960-3-lukasz.luba@arm.com |
---|---|
State | New |
Headers | show |
Series | Improve EAS energy estimation and increase precision | expand |
On 25/06/2021 17:26, Lukasz Luba wrote: > The Energy Model (EM) em_cpu_energy() is responsible for providing good > estimation regarding CPUs energy. It contains proper data structures which > are then used during calculation. The values stored in there are in > milli-Watts precision (or in abstract scale) smaller that 0xffff, which use I guess you refer to 'if (... || power > EM_MAX_POWER)' check in em_create_perf_table() [kernel/power/energy_model.c]. > sufficient unsigned long even on 32-bit machines. There are scenarios where ^^^^^^^^^ Can you describe these scenarios better with one example (EAS placement of an example task on a 2 PD system) which highlights the issue and how it this patch-set solves it? In this example you can list all the things which must be there to create a situation in EAS in which the patch-set helps. > we would like to provide calculated estimations in a better precision and > the values might be 1000 times bigger. This patch makes possible to use Where is this `1000` coming from? > quite big values for also 32-bit machines. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/linux/energy_model.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > index 3f221dbf5f95..2016f5a706e0 100644 > --- a/include/linux/energy_model.h > +++ b/include/linux/energy_model.h > @@ -101,7 +101,7 @@ void em_dev_unregister_perf_domain(struct device *dev); > * Return: the sum of the energy consumed by the CPUs of the domain assuming > * a capacity state satisfying the max utilization of the domain. > */ > -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > +static inline u64 em_cpu_energy(struct em_perf_domain *pd, > unsigned long max_util, unsigned long sum_util, > unsigned long allowed_cpu_cap) > { > @@ -180,7 +180,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > * pd_nrg = ------------------------ (4) > * scale_cpu > */ > - return ps->cost * sum_util / scale_cpu; > + return div_u64((u64)ps->cost * sum_util, scale_cpu); > } > > /** > @@ -217,7 +217,7 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev) > { > return NULL; > } > -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > +static inline u64 em_cpu_energy(struct em_perf_domain *pd, > unsigned long max_util, unsigned long sum_util, > unsigned long allowed_cpu_cap) > { >
On 7/5/21 1:44 PM, Dietmar Eggemann wrote: > On 25/06/2021 17:26, Lukasz Luba wrote: >> The Energy Model (EM) em_cpu_energy() is responsible for providing good >> estimation regarding CPUs energy. It contains proper data structures which >> are then used during calculation. The values stored in there are in >> milli-Watts precision (or in abstract scale) smaller that 0xffff, which use > > I guess you refer to 'if (... || power > EM_MAX_POWER)' check in > em_create_perf_table() [kernel/power/energy_model.c]. Correct > >> sufficient unsigned long even on 32-bit machines. There are scenarios where > ^^^^^^^^^ > > Can you describe these scenarios better with one example (EAS placement > of an example task on a 2 PD system) which highlights the issue and how > it this patch-set solves it? There are two places in the code where it makes a difference: 1. In the find_energy_efficient_cpu() where we are searching for best_delta. We might suffer there when two PDs return the same result, like in the example below. Scenario: Low utilized system e.g. ~200 sum_util for PD0 and ~220 for PD1. There are quite a few small tasks ~10-15 util. These tasks would suffer for the rounding error. Such system utilization has been seen while playing some simple games. In such condition our partner reported 5..10mA less battery drain. Some details: We have two Perf Domains (PDs): PD0 (big) and PD1 (little) Let's compare w/o patch set ('old') and w/ patch set ('new') We are comparing energy w/ task and w/o task placed in the PDs a) 'old' w/o patch set, PD0 task_util = 13 cost = 480 sum_util_w/o_task = 215 sum_util_w_task = 228 scale_cpu = 1024 energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100 energy_w_task = 480 * 228 / 1024 = 106.87 => 106 energy_diff = 106 - 100 = 6 (this is equal to 'old' PD1's energy_diff in 'c)') b) 'new' w/ patch set, PD0 task_util = 13 cost = 480 * 10000 = 4800000 sum_util_w/o_task = 215 sum_util_w_task = 228 energy_w/o_task = 4800000 * 215 / 1024 = 1007812 energy_w_task = 4800000 * 228 / 1024 = 1068750 energy_diff = 1068750 - 1007812 = 60938 (this is not equal to 'new' PD1's energy_diff in 'd)') c) 'old' w/o patch set, PD1 task_util = 13 cost = 160 sum_util_w/o_task = 283 sum_util_w_task = 293 scale_cpu = 355 energy_w/o_task = 160 * 283 / 355 = 127.55 => 127 energy_w_task = 160 * 296 / 355 = 133.41 => 133 energy_diff = 133 - 127 = 6 (this is equal to 'old' PD0's energy_diff in 'a)') d) 'new' w/ patch set, PD1 task_util = 13 cost = 160 * 10000 = 1600000 sum_util_w/o_task = 283 sum_util_w_task = 293 scale_cpu = 355 (no '/ scale_cpu' needed here) energy_w/o_task = 1600000 * 283 / 355 = 1275492 energy_w_task = 1600000 * 296 / 355 = 1334084 energy_diff = 1334084 - 1275492 = 58592 (this is not equal to 'new' PD0's energy_diff in 'b)') 2. Difference in the the last feec() step: margin filter With the patch set the margin comparison also has better resolution, so it's possible to hit better placement thanks to that. Please see the traces below. How to interpret these values: In the first trace below, there is diff=124964 and margin=123381 the EM 'cost' is multiplied by 10000, so we we divide these two, it will be '12 > 12', so it won't be placed into the better PD with lower best delta. In the last 2 examples you would see close values in the prev_delta=49390 best_delta=43945 Without the patch they would be rounded to prev_delta=4 best_delta=4 and the task might be placed wrongly. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ systemd-logind-440 [000] d..5 82.164218: compute_energy: energy=43945, sum_util=9 cpu=4 systemd-logind-440 [000] d..5 82.164232: compute_energy: energy=766601, sum_util=157 cpu=4 systemd-logind-440 [000] d..5 82.164242: compute_energy: energy=766601, sum_util=157 cpu=4 systemd-logind-440 [000] d..5 82.164253: compute_energy: energy=1207500, sum_util=299 cpu=0 systemd-logind-440 [000] d..5 82.164263: compute_energy: energy=1805192, sum_util=447 cpu=0 systemd-logind-440 [000] d..5 82.164273: select_task_rq_fair: EAS: prev_delta=722656 best_delta=597692 diff=124964 margin=123381 systemd-logind-440 [000] d..5 82.164278: select_task_rq_fair: EAS: hit!!! systemd-logind-440 [000] d.h4 134.954038: compute_energy: energy=366210, sum_util=75 cpu=4 systemd-logind-440 [000] d.h4 134.954067: compute_energy: energy=463867, sum_util=95 cpu=4 systemd-logind-440 [000] d.h4 134.954090: compute_energy: energy=463867, sum_util=95 cpu=4 systemd-logind-440 [000] d.h4 134.954117: compute_energy: energy=257347, sum_util=99 cpu=0 systemd-logind-440 [000] d.h4 134.954137: compute_energy: energy=309336, sum_util=119 cpu=0 systemd-logind-440 [000] d.h4 134.954160: select_task_rq_fair: EAS: prev_delta=97657 best_delta=51989 diff=45668 margin=45075 systemd-logind-440 [000] d.h4 134.954171: select_task_rq_fair: EAS: hit!!! <idle>-0 [001] d.s4 226.019763: compute_energy: energy=0, sum_util=0 cpu=4 <idle>-0 [001] d.s4 226.019790: compute_energy: energy=43945, sum_util=9 cpu=4 <idle>-0 [001] d.s4 226.019817: compute_energy: energy=5198, sum_util=2 cpu=0 <idle>-0 [001] d.s4 226.019838: compute_energy: energy=54588, sum_util=21 cpu=0 <idle>-0 [001] d.s4 226.019858: compute_energy: energy=54588, sum_util=21 cpu=0 <idle>-0 [001] d.s4 226.019881: select_task_rq_fair: EAS: prev_delta=49390 best_delta=43945 diff=5445 margin=3411 <idle>-0 [001] d.s4 226.019891: select_task_rq_fair: EAS: hit!!! <idle>-0 [001] d.s4 270.019780: compute_energy: energy=0, sum_util=0 cpu=4 <idle>-0 [001] d.s4 270.019807: compute_energy: energy=43945, sum_util=9 cpu=4 <idle>-0 [001] d.s4 270.019833: compute_energy: energy=5198, sum_util=2 cpu=0 <idle>-0 [001] d.s4 270.019854: compute_energy: energy=54588, sum_util=21 cpu=0 <idle>-0 [001] d.s4 270.019874: compute_energy: energy=54588, sum_util=21 cpu=0 <idle>-0 [001] d.s4 270.019897: select_task_rq_fair: EAS: prev_delta=49390 best_delta=43945 diff=5445 margin=3411 <idle>-0 [001] d.s4 270.019908: select_task_rq_fair: EAS: hit!!! +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > In this example you can list all the things which must be there to > create a situation in EAS in which the patch-set helps. I hope the description above now add more light into this issue. > >> we would like to provide calculated estimations in a better precision and >> the values might be 1000 times bigger. This patch makes possible to use > > Where is this `1000` coming from? It's just a statement that in the next patches we would increase the resolution by a few orders of magnitude. In patch 3/3 it's 10000. I can align with that value also in this statement. Thank you Dietmar for having a look at this! Regards, Lukasz
On Fri, Jun 25, 2021 at 04:26:02PM +0100, Lukasz Luba wrote: > The Energy Model (EM) em_cpu_energy() is responsible for providing good > estimation regarding CPUs energy. It contains proper data structures which > are then used during calculation. The values stored in there are in > milli-Watts precision (or in abstract scale) smaller that 0xffff, which use > sufficient unsigned long even on 32-bit machines. There are scenarios where > we would like to provide calculated estimations in a better precision and > the values might be 1000 times bigger. This patch makes possible to use > quite big values for also 32-bit machines. > > Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> > --- > include/linux/energy_model.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > index 3f221dbf5f95..2016f5a706e0 100644 > --- a/include/linux/energy_model.h > +++ b/include/linux/energy_model.h > @@ -101,7 +101,7 @@ void em_dev_unregister_perf_domain(struct device *dev); > * Return: the sum of the energy consumed by the CPUs of the domain assuming > * a capacity state satisfying the max utilization of the domain. > */ > -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > +static inline u64 em_cpu_energy(struct em_perf_domain *pd, > unsigned long max_util, unsigned long sum_util, > unsigned long allowed_cpu_cap) > { > @@ -180,7 +180,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > * pd_nrg = ------------------------ (4) > * scale_cpu > */ > - return ps->cost * sum_util / scale_cpu; > + return div_u64((u64)ps->cost * sum_util, scale_cpu); So these patches are all rather straight forward, however.. the above is pretty horrific on a 32bit box, and we do quite a few of them per wakeup. Is this really worth the performance penalty on 32bit CPUs? Do you really still care about 32bit CPUs, or is this mostly an artifact of wanting to unconditionally increase the precision?
On 7/7/21 8:07 AM, Peter Zijlstra wrote: > On Fri, Jun 25, 2021 at 04:26:02PM +0100, Lukasz Luba wrote: >> The Energy Model (EM) em_cpu_energy() is responsible for providing good >> estimation regarding CPUs energy. It contains proper data structures which >> are then used during calculation. The values stored in there are in >> milli-Watts precision (or in abstract scale) smaller that 0xffff, which use >> sufficient unsigned long even on 32-bit machines. There are scenarios where >> we would like to provide calculated estimations in a better precision and >> the values might be 1000 times bigger. This patch makes possible to use >> quite big values for also 32-bit machines. >> >> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> >> --- >> include/linux/energy_model.h | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h >> index 3f221dbf5f95..2016f5a706e0 100644 >> --- a/include/linux/energy_model.h >> +++ b/include/linux/energy_model.h >> @@ -101,7 +101,7 @@ void em_dev_unregister_perf_domain(struct device *dev); >> * Return: the sum of the energy consumed by the CPUs of the domain assuming >> * a capacity state satisfying the max utilization of the domain. >> */ >> -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, >> +static inline u64 em_cpu_energy(struct em_perf_domain *pd, >> unsigned long max_util, unsigned long sum_util, >> unsigned long allowed_cpu_cap) >> { >> @@ -180,7 +180,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, >> * pd_nrg = ------------------------ (4) >> * scale_cpu >> */ >> - return ps->cost * sum_util / scale_cpu; >> + return div_u64((u64)ps->cost * sum_util, scale_cpu); > > So these patches are all rather straight forward, however.. the above is > pretty horrific on a 32bit box, and we do quite a few of them per > wakeup. Is this really worth the performance penalty on 32bit CPUs? True, for 2 cluster SoC we might do this 5 times (or less, depends on system state). We don't have new 32bit big.LITTLE platforms, the newest is ~7years old and is actually the only one using EAS. It's not put into new devices AFAIK. > > Do you really still care about 32bit CPUs, or is this mostly an artifact > of wanting to unconditionally increase the precision? > We discussed this internally and weighted the 32bit old big.little. There is a solution, but needs more work and a lot of changes in the whole kernel due to modified EM (affects IPA, DTPM, registration, ...). I have been working on a next step for code that you've pointed: get rid of this runtime division. It would be possible to pre-calculate the: 'ps->cost / scale_cpu' at the moment when EM is registered and store it in the ps->cost. So we would have just: return ps->cost * sum_util The only issue is a late boot of biggest cores, which would destroy the old scale_cpu values for other PDs. I need to probably add RCU locking into the EM and update the other PDs' EMs when the last biggest CPU boots after a few second and registers its EM. For now we would live with this simple code which improves all recent 64bit platforms and is easy to take it into Android common kernel. The next step would be more scattered across other subsystems, so harder to backport to Android 5.4 and others.
On Wed, Jul 07, 2021 at 09:09:08AM +0100, Lukasz Luba wrote: > For now we would live with this simple code which improves > all recent 64bit platforms and is easy to take it into Android > common kernel. The next step would be more scattered across > other subsystems, so harder to backport to Android 5.4 and others. Ah, you *do* only care about 64bit :-) So one option is to only increase precision for 64BIT builds, just like we do for scale_load() and friends.
On 7/7/21 11:01 AM, Peter Zijlstra wrote: > On Wed, Jul 07, 2021 at 09:09:08AM +0100, Lukasz Luba wrote: >> For now we would live with this simple code which improves >> all recent 64bit platforms and is easy to take it into Android >> common kernel. The next step would be more scattered across >> other subsystems, so harder to backport to Android 5.4 and others. > > Ah, you *do* only care about 64bit :-) So one option is to only increase > precision for 64BIT builds, just like we do for scale_load() and > friends. > Your suggestion is potentially a good compromise :-) We could leave the 32bit alone and they would have old code and precision. Thank you for the comments. Let me discuss this internally with my team.
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h index 3f221dbf5f95..2016f5a706e0 100644 --- a/include/linux/energy_model.h +++ b/include/linux/energy_model.h @@ -101,7 +101,7 @@ void em_dev_unregister_perf_domain(struct device *dev); * Return: the sum of the energy consumed by the CPUs of the domain assuming * a capacity state satisfying the max utilization of the domain. */ -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, +static inline u64 em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util, unsigned long allowed_cpu_cap) { @@ -180,7 +180,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, * pd_nrg = ------------------------ (4) * scale_cpu */ - return ps->cost * sum_util / scale_cpu; + return div_u64((u64)ps->cost * sum_util, scale_cpu); } /** @@ -217,7 +217,7 @@ static inline struct em_perf_domain *em_pd_get(struct device *dev) { return NULL; } -static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, +static inline u64 em_cpu_energy(struct em_perf_domain *pd, unsigned long max_util, unsigned long sum_util, unsigned long allowed_cpu_cap) {
The Energy Model (EM) em_cpu_energy() is responsible for providing good estimation regarding CPUs energy. It contains proper data structures which are then used during calculation. The values stored in there are in milli-Watts precision (or in abstract scale) smaller that 0xffff, which use sufficient unsigned long even on 32-bit machines. There are scenarios where we would like to provide calculated estimations in a better precision and the values might be 1000 times bigger. This patch makes possible to use quite big values for also 32-bit machines. Signed-off-by: Lukasz Luba <lukasz.luba@arm.com> --- include/linux/energy_model.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)