
[v7,2/2] sched/fair: update scale invariance of PELT

Message ID 1542711308-25256-3-git-send-email-vincent.guittot@linaro.org
State Superseded
Series sched/fair: update scale invariance of PELT

Commit Message

Vincent Guittot Nov. 20, 2018, 10:55 a.m. UTC
The current implementation of load tracking invariance scales the
contribution with the current frequency and uarch performance (only for
utilization) of the CPU. One main result of this formula is that the
figures are capped by the current capacity of the CPU. Another one is that
load_avg is not invariant because it is not scaled with uarch.

The util_avg of a periodic task that runs r time slots every p time slots
varies in the range:

    U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

where U is the max util_avg value = SCHED_CAPACITY_SCALE

At a lower capacity, the range becomes:

    U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)

where C reflects the compute capacity ratio between the current capacity
and the max capacity.

So C tries to compensate for the change in (1-y^r'), but it cannot be
accurate.

Instead of scaling the contribution value of the PELT algorithm, we should
scale the running time. The PELT signal aims to track the amount of
computation of tasks and/or the rq, so it seems more correct to scale the
running time to reflect the effective amount of computation done since the
last update.

In order to be fully invariant, we need to apply the same amount of
running time and idle time whatever the current capacity. Because running
at lower capacity implies that the task will run longer, we have to ensure
that the same amount of idle time is applied when the system becomes idle
and that no idle time has been "stolen". But reaching the maximum
utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
always-running task whatever the capacity of the CPU (even at max compute
capacity). In this case, we can discard the "stolen" idle time, which
becomes meaningless.

In order to achieve this time scaling, a new clock_pelt is created per rq.
This clock advances at a rate scaled by the current capacity when something
is running on the rq and synchronizes with clock_task when the rq is idle.
With this mechanism, we ensure the same running and idle time whatever the
current capacity. This also makes it possible to simplify the PELT
algorithm by removing all references to uarch and frequency and by applying
the same contribution to utilization and load. Furthermore, the scaling is
done only once per clock update (update_rq_clock_task()) instead of during
each update of the sched_entities and cfs/rt/dl_rq of the rq as in the
current implementation. This is especially interesting when cgroups are
involved, as shown in the results below.
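
For illustration, here is a minimal sketch of how such a per-rq clock could
be advanced; the helpers rq_is_idle() and rq_capacity() are placeholders
for this sketch, and the actual patch additionally handles frequency
invariance and the "stolen" idle time described above:

/*
 * Sketch only: advance a per-rq PELT clock at a rate scaled by the
 * current capacity, and re-sync it with clock_task when the rq is idle
 * so that the same amount of idle time is seen at any capacity.
 */
static void update_rq_clock_pelt_sketch(struct rq *rq, s64 delta)
{
	if (rq_is_idle(rq)) {
		/* Nothing is running: catch up with clock_task. */
		rq->clock_pelt = rq_clock_task(rq);
		return;
	}

	/*
	 * Scale the elapsed time by the current compute capacity: running
	 * at half capacity makes the PELT clock advance at half the rate
	 * of clock_task.
	 */
	delta = (delta * rq_capacity(rq)) >> SCHED_CAPACITY_SHIFT;

	rq->clock_pelt += delta;
}

The PELT accumulation code can then consume this clock directly, which is
what allows the per-update frequency/uarch scaling mentioned above to be
dropped.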

On a hikey (octo Arm64 platform), with the performance cpufreq governor and
only the shallowest c-state enabled, to remove the variance generated by
those power features so that we only track the impact of the PELT
algorithm.

Each test runs 16 times.

./perf bench sched pipe
(higher is better)
kernel	tip/sched/core     + patch
        ops/seconds        ops/seconds         diff
cgroup
root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

hackbench -l 1000
(lower is better)
kernel	tip/sched/core     + patch
        duration(sec)      duration(sec)        diff
cgroup
root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

In addition, the responsiveness of PELT is improved with this new algorithm
when the CPU is not running at max capacity. I have put below some examples
of the time needed to reach some typical utilization values according to
the capacity of the CPU, with the current implementation and with this
patch. These values have been computed based on the geometric series and
the half-life period:

Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
972 (95%)    138ms         not reachable            276ms
486 (47.5%)  30ms          138ms                     60ms
256 (25%)    13ms           32ms                     26ms
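
For reference, these durations follow from the geometric series: with the
32ms half-life, an always-running task starting from zero reaches a
utilization u after roughly t = 32ms * log2(1 / (1 - u/1024)), and with the
time scaling of this patch the duration at half capacity simply doubles
(up to rounding). A stand-alone sketch of that computation, illustration
only and not part of the patch:

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double half_life_ms = 32.0;		/* PELT half-life */
	const double targets[] = { 972, 486, 256 };	/* util out of 1024 */

	for (int i = 0; i < 3; i++) {
		double u = targets[i] / 1024.0;
		/* time for util to ramp from 0 to the target at max capacity */
		double t = half_life_ms * log2(1.0 / (1.0 - u));

		printf("util %4.0f: ~%3.0fms at max capacity, ~%3.0fms at half capacity\n",
		       targets[i], t, 2.0 * t);
	}
	return 0;
}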

On my hikey (octo Arm64 platform) with the schedutil governor, the time to
reach the max OPP when starting from a null utilization decreases from
223ms with the current scale invariance down to 121ms with the new
algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 include/linux/sched.h   |  23 +++-------
 kernel/sched/core.c     |   1 +
 kernel/sched/deadline.c |   6 +--
 kernel/sched/fair.c     |  45 +++++++++++---------
 kernel/sched/pelt.c     |  45 +++++++++++---------
 kernel/sched/pelt.h     | 111 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c       |   6 +--
 kernel/sched/sched.h    |   5 ++-
 8 files changed, 174 insertions(+), 68 deletions(-)

-- 
2.7.4

Comments

Vincent Guittot Nov. 28, 2018, 9:54 a.m. UTC | #1
Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>

> The current implementation of load tracking invariance scales the

> contribution with current frequency and uarch performance (only for

> utilization) of the CPU. One main result of this formula is that the

> figures are capped by current capacity of CPU. Another one is that the

> load_avg is not invariant because not scaled with uarch.

>

> The util_avg of a periodic task that runs r time slots every p time slots

> varies in the range :

>

>     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)

>

> with U is the max util_avg value = SCHED_CAPACITY_SCALE

>

> At a lower capacity, the range becomes:

>

>     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)

>

> with C reflecting the compute capacity ratio between current capacity and

> max capacity.

>

> so C tries to compensate changes in (1-y^r') but it can't be accurate.

>

> Instead of scaling the contribution value of PELT algo, we should scale the

> running time. The PELT signal aims to track the amount of computation of

> tasks and/or rq so it seems more correct to scale the running time to

> reflect the effective amount of computation done since the last update.

>

> In order to be fully invariant, we need to apply the same amount of

> running time and idle time whatever the current capacity. Because running

> at lower capacity implies that the task will run longer, we have to ensure

> that the same amount of idle time will be applied when system becomes idle

> and no idle time has been "stolen". But reaching the maximum utilization

> value (SCHED_CAPACITY_SCALE) means that the task is seen as an

> always-running task whatever the capacity of the CPU (even at max compute

> capacity). In this case, we can discard this "stolen" idle times which

> becomes meaningless.

>

> In order to achieve this time scaling, a new clock_pelt is created per rq.

> The increase of this clock scales with current capacity when something

> is running on rq and synchronizes with clock_task when rq is idle. With

> this mechanism, we ensure the same running and idle time whatever the

> current capacity. This also enables to simplify the pelt algorithm by

> removing all references of uarch and frequency and applying the same

> contribution to utilization and loads. Furthermore, the scaling is done

> only once per update of clock (update_rq_clock_task()) instead of during

> each update of sched_entities and cfs/rt/dl_rq of the rq like the current

> implementation. This is interesting when cgroup are involved as shown in

> the results below:

>

> On a hikey (octo Arm64 platform).

> Performance cpufreq governor and only shallowest c-state to remove variance

> generated by those power features so we only track the impact of pelt algo.

>

> each test runs 16 times

>

> ./perf bench sched pipe

> (higher is better)

> kernel  tip/sched/core     + patch

>         ops/seconds        ops/seconds         diff

> cgroup

> root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%

> level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%

> level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%

>

> hackbench -l 1000

> (lower is better)

> kernel  tip/sched/core     + patch

>         duration(sec)      duration(sec)        diff

> cgroup

> root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%

> level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%

> level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%

>

> Then, the responsiveness of PELT is improved when CPU is not running at max

> capacity with this new algorithm. I have put below some examples of

> duration to reach some typical load values according to the capacity of the

> CPU with current implementation and with this patch. These values has been

> computed based on the geometric series and the half period value:

>

> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)

> 972 (95%)    138ms         not reachable            276ms

> 486 (47.5%)  30ms          138ms                     60ms

> 256 (25%)    13ms           32ms                     26ms

>

> On my hikey (octo Arm64 platform) with schedutil governor, the time to

> reach max OPP when starting from a null utilization, decreases from 223ms

> with current scale invariance down to 121ms with the new algorithm.

>

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>


Is there anything else that I should do for these patches?

Regards,
Vincent
Peter Zijlstra Nov. 28, 2018, 10:02 a.m. UTC | #2
On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> Is there anything else that I should do for these patches ?


IIRC, Morten mentioned they break util_est; Patrick was going to explain.

But yes, I like them.
Patrick Bellasi Nov. 28, 2018, 11:53 a.m. UTC | #3
On 28-Nov 11:02, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> 

> > Is there anything else that I should do for these patches ?

> 

> IIRC, Morten mention they break util_est; Patrick was going to explain.


I guess the problem is that, once we cross the current capacity,
strictly speaking util_avg no longer represents a utilization.

With the new signal this could happen and we end up storing estimated
utilization samples which will overestimate the task requirements.

We will have a spike in estimated utilization at the next wakeup, since we
use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA
in case we collect multiple samples above the current capacity.

So, a possible fix could be to avoid storing util_est samples if we
end up with a utilization above the current capacity.

Something like:

----8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..93e0cf5d8a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (!task_sleep)
 		return;
 
+	/* Skip samples which do not represent an actual utilization */
+	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
+		return;
+
 	/*
 	 * If the PELT values haven't changed since enqueue time,
 	 * skip the util_est update.
---8<---

Could that work?

Maybe use a new utility function to wrap the new check.
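
For illustration, such a wrapper might look like the sketch below; the name
is just a placeholder:

/* Sketch only: a dequeue-time util_est sample is meaningful only if the
 * task's utilization fits within the capacity of the CPU it ran on. */
static inline bool util_est_sample_ok(struct task_struct *p)
{
	return task_util(p) <= capacity_of(task_cpu(p));
}

util_est_dequeue() could then simply bail out early when
!util_est_sample_ok(p).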

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Nov. 28, 2018, 1:33 p.m. UTC | #4
On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 11:02, Peter Zijlstra wrote:

> > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> >

> > > Is there anything else that I should do for these patches ?

> >

> > IIRC, Morten mention they break util_est; Patrick was going to explain.

>

> I guess the problem is that, once we cross the current capacity,

> strictly speaking util_avg does not represent anymore a utilization.

>

> With the new signal this could happen and we end up storing estimated

> utilization samples which will overestimate the task requirements.

>

> We will have a spike in estimated utilization at next wakeup, since we

> use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> case we collect multiple samples above the current capacity.


TBH I don't see how it's different from the current implementation with a
task that was scheduled on a big core and now wakes up on a little core.
The util_est is overestimated as well.

But I'm fine with adding your proposal on to on the patchset

>

> So, a possible fix could be to avoid storing util_est samples if we

> end up with a utilization above the current capacity.

>

> Something like:

>

> ----8<---

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index ac855b2f4774..93e0cf5d8a76 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>         if (!task_sleep)

>                 return;

>

> +       /* Skip samples which do not represent an actual utilization */

> +       if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> +               return;

> +

>         /*

>          * If the PELT values haven't changed since enqueue time,

>          * skip the util_est update.

> ---8<---

>

> Could that work ?

>

> Maybe using a new utility function to wrap the new check.

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Vincent Guittot Nov. 28, 2018, 1:35 p.m. UTC | #5
On Wed, 28 Nov 2018 at 14:33, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>

> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > >

> > > > Is there anything else that I should do for these patches ?

> > >

> > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> >

> > I guess the problem is that, once we cross the current capacity,

> > strictly speaking util_avg does not represent anymore a utilization.

> >

> > With the new signal this could happen and we end up storing estimated

> > utilization samples which will overestimate the task requirements.

> >

> > We will have a spike in estimated utilization at next wakeup, since we

> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > case we collect multiple samples above the current capacity.

>

> TBH I don't see how it's different from current implementation with a

> task that was scheduled on big core and now wakes up on little core.

> The util_est is overestimated as well.

>

> But I'm fine with adding your proposal on to on the patchset

s/on to on/on top of/

>

> >

> > So, a possible fix could be to avoid storing util_est samples if we

> > end up with a utilization above the current capacity.

> >

> > Something like:

> >

> > ----8<---

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index ac855b2f4774..93e0cf5d8a76 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> >         if (!task_sleep)

> >                 return;

> >

> > +       /* Skip samples which do not represent an actual utilization */

> > +       if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > +               return;

> > +

> >         /*

> >          * If the PELT values haven't changed since enqueue time,

> >          * skip the util_est update.

> > ---8<---

> >

> > Could that work ?

> >

> > Maybe using a new utility function to wrap the new check.

> >

> > --

> > #include <best/regards.h>

> >

> > Patrick Bellasi
Patrick Bellasi Nov. 28, 2018, 2:40 p.m. UTC | #6
On 28-Nov 14:33, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > >

> > > > Is there anything else that I should do for these patches ?

> > >

> > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> >

> > I guess the problem is that, once we cross the current capacity,

> > strictly speaking util_avg does not represent anymore a utilization.

> >

> > With the new signal this could happen and we end up storing estimated

> > utilization samples which will overestimate the task requirements.

> >

> > We will have a spike in estimated utilization at next wakeup, since we

> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > case we collect multiple samples above the current capacity.

> 

> TBH I don't see how it's different from current implementation with a

> task that was scheduled on big core and now wakes up on little core.

> The util_est is overestimated as well.


While running below the capacity of a CPU, either big or LITTLE, we
can still measure the actual used bandwidth as long as we have idle
time. If the task is then moved onto a lower capacity core, I think
it's still safe to assume that, likely, it would need more capacity.

Why do you say it's the same?

With your new signal instead, once we cross the current capacity,
utilization is just not utilization anymore. Thus, IMHO it makes sense
to avoid accumulating a sample for what we call "estimated utilization".

I would also say that, with the current implementation which caps
utilization to the current capacity, we get better estimation in
general. At least we can say with absolute precision:

   "the task needs _at least_ that amount of capacity".

Potentially we can also flag the task as being under-provisioned, in
case there was no idle time, and _let a policy_ decide what to do
with it and the granted information we have.

While, with your new signal, once we are over the current capacity,
the "utilization" is just a sort of "random" number at best useful to
drive some conclusions about how long the task has been delayed.

IOW, I fear that we are embedding a policy within a signal which
currently represents something very well defined: how much cpu
bandwidth a task used. Latency/under-provisioning policies should
perhaps be better placed somewhere else.

Perhaps I've missed it in some of the previous discussions:
have we considered/discussed this signal-vs-policy aspect?

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Nov. 28, 2018, 2:55 p.m. UTC | #7
On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 14:33, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > >

> > > > > Is there anything else that I should do for these patches ?

> > > >

> > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > >

> > > I guess the problem is that, once we cross the current capacity,

> > > strictly speaking util_avg does not represent anymore a utilization.

> > >

> > > With the new signal this could happen and we end up storing estimated

> > > utilization samples which will overestimate the task requirements.

> > >

> > > We will have a spike in estimated utilization at next wakeup, since we

> > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > case we collect multiple samples above the current capacity.

> >

> > TBH I don't see how it's different from current implementation with a

> > task that was scheduled on big core and now wakes up on little core.

> > The util_est is overestimated as well.

>

> While running below the capacity of a CPU, either big or LITTLE, we

> can still measure the actual used bandwidth as long as we have idle

> time. If the task is then moved into a lower capacity core, I think

> it's still safe to assume that, likely, it would need more capacity.

>

> Why do you say it's the same ?


In the example of a task that runs 39ms in a period of 80ms that we used
during the previous version, the utilization on the big core will reach
709, and so will util_est. When the task migrates to the little core (512),
util_est is higher than the current cpu capacity.
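
(For reference, the 709 figure follows from the peak value
U * (1-y^r)/(1-y^p) quoted in the commit message, with y^32 = 0.5, r = 39ms
and p = 80ms. A stand-alone sketch of that computation, illustration only,
not kernel code:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* PELT decay: 32ms half-life */
	double r = 39.0, p = 80.0;		/* runtime / period in ms */

	/* peak utilization of the periodic task, out of 1024 */
	printf("peak util ~= %.0f\n",
	       1024.0 * (1.0 - pow(y, r)) / (1.0 - pow(y, p)));
	return 0;
}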

>

> With your new signal instead, once we cross the current capacity,

> utilization is just not anymore utilization. Thus, IMHO it make sense

> avoid to accumulate a sample for what we call "estimated utilization".

>

> I would also say that, with the current implementation which caps

> utilization to the current capacity, we get better estimation in

> general. At least we can say with absolute precision:

>

>    "the task needs _at least_ that amount of capacity".

>

> Potentially we can also flag the task as being under-provisioned, in

> case there was not idle time, and _let a policy_ decide what to do

> with it and the granted information we have.

>

> While, with your new signal, once we are over the current capacity,

> the "utilization" is just a sort of "random" number at best useful to

> drive some conclusions about how long the task has been delayed.

>

> IOW, I fear that we are embedding a policy within a signal which is

> currently representing something very well defined: how much cpu

> bandwidth a task used. While, latency/under-provisioning policies

> perhaps should be better placed somewhere else.

>

> Perhaps I've missed it in some of the previous discussions:

> have we have considered/discussed this signal-vs-policy aspect ?

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Patrick Bellasi Nov. 28, 2018, 3:21 p.m. UTC | #8
On 28-Nov 15:55, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 14:33, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > >

> > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > >

> > > > > > Is there anything else that I should do for these patches ?

> > > > >

> > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > >

> > > > I guess the problem is that, once we cross the current capacity,

> > > > strictly speaking util_avg does not represent anymore a utilization.

> > > >

> > > > With the new signal this could happen and we end up storing estimated

> > > > utilization samples which will overestimate the task requirements.

> > > >

> > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > case we collect multiple samples above the current capacity.

> > >

> > > TBH I don't see how it's different from current implementation with a

> > > task that was scheduled on big core and now wakes up on little core.

> > > The util_est is overestimated as well.

> >

> > While running below the capacity of a CPU, either big or LITTLE, we

> > can still measure the actual used bandwidth as long as we have idle

> > time. If the task is then moved into a lower capacity core, I think

> > it's still safe to assume that, likely, it would need more capacity.

> >

> > Why do you say it's the same ?

> 

> In the example of a task that runs 39ms in period of 80ms that we used

> during previous version,

> the utilization on the big core will reach 709 so will util_est too

> When the task migrates on little core (512), util_est is higher than

> current cpu capacity


Right, and what's the problem?

1) We know that PELT is calibrated to a 32ms period task and in your
   example, since the runtime is higher than the half-life, it's
   correct to estimate a utilization higher than 50%.

   PELT utilization is defined _based on the half-life_: thus
   your task having a 50% duty cycle does not mean we are not correct
   if we report a utilization != 50%.
   It would be as broken as reporting 10% utilization for a task
   running 100ms every 1s.

2) If it was a 70% task on a previous activation, once it's moved into
   a lower capacity CPU it's still correct to assume that it's likely
   going to require the same bandwidth and thus will be
   under-provisioned.

I still don't see where we are wrong in this case :/

To me it looks different from the problem I described.

> > With your new signal instead, once we cross the current capacity,

> > utilization is just not anymore utilization. Thus, IMHO it make sense

> > avoid to accumulate a sample for what we call "estimated utilization".

> >

> > I would also say that, with the current implementation which caps

> > utilization to the current capacity, we get better estimation in

> > general. At least we can say with absolute precision:

> >

> >    "the task needs _at least_ that amount of capacity".

> >

> > Potentially we can also flag the task as being under-provisioned, in

> > case there was not idle time, and _let a policy_ decide what to do

> > with it and the granted information we have.

> >

> > While, with your new signal, once we are over the current capacity,

> > the "utilization" is just a sort of "random" number at best useful to

> > drive some conclusions about how long the task has been delayed.

> >

> > IOW, I fear that we are embedding a policy within a signal which is

> > currently representing something very well defined: how much cpu

> > bandwidth a task used. While, latency/under-provisioning policies

> > perhaps should be better placed somewhere else.

> >

> > Perhaps I've missed it in some of the previous discussions:

> > have we have considered/discussed this signal-vs-policy aspect ?


What's your opinion on the above instead?

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Nov. 28, 2018, 3:42 p.m. UTC | #9
On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 15:55, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > >

> > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > >

> > > > > > > Is there anything else that I should do for these patches ?

> > > > > >

> > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > >

> > > > > I guess the problem is that, once we cross the current capacity,

> > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > >

> > > > > With the new signal this could happen and we end up storing estimated

> > > > > utilization samples which will overestimate the task requirements.

> > > > >

> > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > case we collect multiple samples above the current capacity.

> > > >

> > > > TBH I don't see how it's different from current implementation with a

> > > > task that was scheduled on big core and now wakes up on little core.

> > > > The util_est is overestimated as well.

> > >

> > > While running below the capacity of a CPU, either big or LITTLE, we

> > > can still measure the actual used bandwidth as long as we have idle

> > > time. If the task is then moved into a lower capacity core, I think

> > > it's still safe to assume that, likely, it would need more capacity.

> > >

> > > Why do you say it's the same ?

> >

> > In the example of a task that runs 39ms in period of 80ms that we used

> > during previous version,

> > the utilization on the big core will reach 709 so will util_est too

> > When the task migrates on little core (512), util_est is higher than

> > current cpu capacity

>

> Right, and what's the problem ?


You worry about util_est being higher than the capacity, which is the case
there.

>

> 1) We know that PELT is calibrated to 32ms period task and in your

>    example, since the runtime is higher then the half-life, it's

>    correct to estimate a utilization higher then 50%.

>

>    PELT utilization is defined _based on the half-life_: thus

>    your task having a 50% duty cycle does not mean we are not correct

>    if report a utilization != 50%.

>    It would be as broken as reporting 10% utilization for a task

>    running 100ms every 1s.

>

> 2) If it was a 70% task on a previous activation, once it's moved into

>    a lower capacity CPU it's still correct to assume that it's likely

>    going to require the same bandwidth and thus will be

>    under-provisioned.

>

> I still don't see where we are wrong in this case :/

>

> To me it looks different then the problem I described.

>

> > > With your new signal instead, once we cross the current capacity,

> > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > avoid to accumulate a sample for what we call "estimated utilization".


This is not true. With the example above, the util_est will be exactly the
same on big and little cores with the new signal.

> > >

> > > I would also say that, with the current implementation which caps

> > > utilization to the current capacity, we get better estimation in

> > > general. At least we can say with absolute precision:

> > >

> > >    "the task needs _at least_ that amount of capacity".

> > >

> > > Potentially we can also flag the task as being under-provisioned, in

> > > case there was not idle time, and _let a policy_ decide what to do

> > > with it and the granted information we have.

> > >

> > > While, with your new signal, once we are over the current capacity,

> > > the "utilization" is just a sort of "random" number at best useful to

> > > drive some conclusions about how long the task has been delayed.


see my comment above

> > >

> > > IOW, I fear that we are embedding a policy within a signal which is

> > > currently representing something very well defined: how much cpu

> > > bandwidth a task used. While, latency/under-provisioning policies

> > > perhaps should be better placed somewhere else.

> > >

> > > Perhaps I've missed it in some of the previous discussions:

> > > have we have considered/discussed this signal-vs-policy aspect ?

>

> What's your opinion on the above instead ?


It's not a policy, but it gives better knowledge about the amount of work
done. I have put below the discussion on this subject from the previous
version.

> >

> > With contribution scaling the PELT utilization of a task is a _minimum_

> > utilization. Regardless of where the task is currently/was running (and

> > provided that it doesn't change behaviour) its PELT utilization will

> > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

>

> The main drawback is that the _minimum_ utilization depends on the CPU

> capacity on which the task runs. The two 25% tasks on a 256 capacity

> CPU will have an utilization of 128 as an example

>

> >

> > With time scaling the PELT utilization doesn't really have a meaning on

> > its own. It has to be compared to the capacity of the CPU where it

> > is/was running to know what the its current PELT utilization means. When

>

> I would have said the opposite. The utilization of the task will

> always reflect the same amount of work that has been already done

> whatever the CPU capacity.

> In fact, the new scaling mechanism uses the real amount of work that

> has been already done to compute the utilization signal which is not

> the case currently. This gives more information about the real amount

> of worked that has been computed in the over utilization case.

>

> > the utilization over-shoots the capacity its value is no longer

> > represents utilization, it just means that it has a higher compute

> > demand than is offered on its current CPU and a high value means that it

> > has been suffering longer. It can't be used to predict the actual

> > utilization on an idle 1024 capacity any better than contribution scaled

> > PELT utilization.

>

> I think that it provides earlier detection of over utilization and

> more accurate signal for a longer time duration which can help the

> load balance

> Coming back to 50% task example . I will use a 50ms running time

> during a 100ms period for the example below to make it easier

>

> Starting from 0, the evolution of the utilization is:

>

> With contribution scaling:

>          time  0ms  50ms  100ms  150ms  200ms

> capacity

> 1024           0    666

> 512            0    333   453

> When the CPU start to be over utilized (@100ms), the utilization is

> already too low (453 instead of 666) and scheduler doesn't detect yet

> that we are over utilized

> 256            0    169   226    246    252

> That's even worse with this lower capacity

>

> With time scaling,

>          time  0ms  50ms  100ms  150ms  200ms

> capacity

> 1024           0    666

> 512            0    428   677

> We know that the current capacity is not enough and the utilization

> reflect the correct utilization level compare to 1024 capacity (the

> 666 vs 677 difference comes from the 1024us window so the last window

> is not full in the case of max capacity)

> 256            0    234   468    564    677

> At 100ms, we know that there is not enough capacity. (In fact we know

> that at 56ms). And even at time 200ms, the amount of work is exactly

> what would have been executed on a CPU 4x faster

>

> >

> > This change might not be a showstopper, but it is something to be aware

> > off and take into account wherever PELT utilization is used.

>

> The point above is clearly a big difference between the 2 approaches

> of the no spare cycle case but I think it will help by giving more

> information in the over utilization case

>

> Vincent

> >

> > Morten


>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Patrick Bellasi Nov. 28, 2018, 4:35 p.m. UTC | #10
On 28-Nov 16:42, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> >

> > On 28-Nov 15:55, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > >

> > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > >

> > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > >

> > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > >

> > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > >

> > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > >

> > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > utilization samples which will overestimate the task requirements.

> > > > > >

> > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > case we collect multiple samples above the current capacity.

> > > > >

> > > > > TBH I don't see how it's different from current implementation with a

> > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > The util_est is overestimated as well.

> > > >

> > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > can still measure the actual used bandwidth as long as we have idle

> > > > time. If the task is then moved into a lower capacity core, I think

> > > > it's still safe to assume that, likely, it would need more capacity.

> > > >

> > > > Why do you say it's the same ?

> > >

> > > In the example of a task that runs 39ms in period of 80ms that we used

> > > during previous version,

> > > the utilization on the big core will reach 709 so will util_est too

> > > When the task migrates on little core (512), util_est is higher than

> > > current cpu capacity

> >

> > Right, and what's the problem ?

> 

> you worry about an util_est being higher than capacity which is the case there


I worry about util_est being higher than the capacity the task WAS
running at... not the capacity the task IS running at... if that value
does not correspond to what the task really needs... (more on that at the
end).

> > 1) We know that PELT is calibrated to 32ms period task and in your

> >    example, since the runtime is higher then the half-life, it's

> >    correct to estimate a utilization higher then 50%.

> >

> >    PELT utilization is defined _based on the half-life_: thus

> >    your task having a 50% duty cycle does not mean we are not correct

> >    if report a utilization != 50%.

> >    It would be as broken as reporting 10% utilization for a task

> >    running 100ms every 1s.

> >

> > 2) If it was a 70% task on a previous activation, once it's moved into

> >    a lower capacity CPU it's still correct to assume that it's likely

> >    going to require the same bandwidth and thus will be

> >    under-provisioned.

> >

> > I still don't see where we are wrong in this case :/

> >

> > To me it looks different then the problem I described.

> >

> > > > With your new signal instead, once we cross the current capacity,

> > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > avoid to accumulate a sample for what we call "estimated utilization".

> 

> This is not true. With the example above, the util_est will be exactly the same

>  on big and little cores with the new signal


... AFAIU only if we have idle time...

> > > > I would also say that, with the current implementation which caps

> > > > utilization to the current capacity, we get better estimation in

> > > > general. At least we can say with absolute precision:

> > > >

> > > >    "the task needs _at least_ that amount of capacity".

> > > >

> > > > Potentially we can also flag the task as being under-provisioned, in

> > > > case there was not idle time, and _let a policy_ decide what to do

> > > > with it and the granted information we have.

> > > >

> > > > While, with your new signal, once we are over the current capacity,

> > > > the "utilization" is just a sort of "random" number at best useful to

> > > > drive some conclusions about how long the task has been delayed.

> 

> see my comment above

> 

> > > >

> > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > currently representing something very well defined: how much cpu

> > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > perhaps should be better placed somewhere else.

> > > >

> > > > Perhaps I've missed it in some of the previous discussions:

> > > > have we have considered/discussed this signal-vs-policy aspect ?

> >

> > What's your opinion on the above instead ?

> 

> It's not a policy but it gives better knowledge about the amount a work done

> I have put below discussion on the  subject on previous version


Thanks, I think I've skimmed through it, but it's still useful...

> > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > utilization. Regardless of where the task is currently/was running (and

> > > provided that it doesn't change behaviour) its PELT utilization will

> > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> >

> > The main drawback is that the _minimum_ utilization depends on the CPU

> > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > CPU will have an utilization of 128 as an example

> >

> > >

> > > With time scaling the PELT utilization doesn't really have a meaning on

> > > its own. It has to be compared to the capacity of the CPU where it

> > > is/was running to know what the its current PELT utilization means. When

> >

> > I would have said the opposite. The utilization of the task will

> > always reflect the same amount of work that has been already done

> > whatever the CPU capacity.

> > In fact, the new scaling mechanism uses the real amount of work that

> > has been already done to compute the utilization signal which is not

> > the case currently. This gives more information about the real amount

> > of worked that has been computed in the over utilization case.

> >

> > > the utilization over-shoots the capacity its value is no longer

> > > represents utilization, it just means that it has a higher compute

> > > demand than is offered on its current CPU and a high value means that it

> > > has been suffering longer. It can't be used to predict the actual

> > > utilization on an idle 1024 capacity any better than contribution scaled

> > > PELT utilization.

> >

> > I think that it provides earlier detection of over utilization and

> > more accurate signal for a longer time duration which can help the

> > load balance

> > Coming back to 50% task example . I will use a 50ms running time

> > during a 100ms period for the example below to make it easier

> >

> > Starting from 0, the evolution of the utilization is:

> >

> > With contribution scaling:

> >          time  0ms  50ms  100ms  150ms  200ms

> > capacity

> > 1024           0    666

> > 512            0    333   453

> > When the CPU start to be over utilized (@100ms), the utilization is

> > already too low (453 instead of 666) and scheduler doesn't detect yet

> > that we are over utilized

> > 256            0    169   226    246    252

> > That's even worse with this lower capacity

> >

> > With time scaling,

> >          time  0ms  50ms  100ms  150ms  200ms

> > capacity

> > 1024           0    666

> > 512            0    428   677

> > We know that the current capacity is not enough and the utilization

> > reflect the correct utilization level compare to 1024 capacity (the

> > 666 vs 677 difference comes from the 1024us window so the last window

> > is not full in the case of max capacity)

> > 256            0    234   468    564    677

> > At 100ms, we know that there is not enough capacity. (In fact we know

> > that at 56ms). And even at time 200ms, the amount of work is exactly

> > what would have been executed on a CPU 4x faster

> >

> > >

> > > This change might not be a showstopper, but it is something to be aware

> > > off and take into account wherever PELT utilization is used.

> >

> > The point above is clearly a big difference between the 2 approaches

> > of the no spare cycle case but I think it will help by giving more

> > information in the over utilization case


I like the idea that we ramp up faster and always get to the same
value. I also like the idea that we always reach the same value on
both LITTLE and big.

As long as there is idle time this is working fine; in these cases we
should probably also collect util_est samples.

But what happens when we don't have idle time?

Let's say we have two 15% tasks, co-scheduled on a cpu with <300 capacity.

Aren't these two tasks going to be reported as 50% tasks (after a while)?

If that's the case, these are samples we should not store...

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Nov. 29, 2018, 10:43 a.m. UTC | #11
On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 28-Nov 16:42, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > >

> > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > >

> > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > >

> > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > >

> > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > >

> > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > >

> > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > >

> > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > >

> > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > case we collect multiple samples above the current capacity.

> > > > > >

> > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > The util_est is overestimated as well.

> > > > >

> > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > >

> > > > > Why do you say it's the same ?

> > > >

> > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > during previous version,

> > > > the utilization on the big core will reach 709 so will util_est too

> > > > When the task migrates on little core (512), util_est is higher than

> > > > current cpu capacity

> > >

> > > Right, and what's the problem ?

> >

> > you worry about an util_est being higher than capacity which is the case there

>

> I worry about util_est being higher then the capacity the task WAS

> running... not the capacity the task IS running... if that value does

> not correspond to what the task really need... (more on that at the

> end).

>

> > > 1) We know that PELT is calibrated to 32ms period task and in your

> > >    example, since the runtime is higher then the half-life, it's

> > >    correct to estimate a utilization higher then 50%.

> > >

> > >    PELT utilization is defined _based on the half-life_: thus

> > >    your task having a 50% duty cycle does not mean we are not correct

> > >    if report a utilization != 50%.

> > >    It would be as broken as reporting 10% utilization for a task

> > >    running 100ms every 1s.

> > >

> > > 2) If it was a 70% task on a previous activation, once it's moved into

> > >    a lower capacity CPU it's still correct to assume that it's likely

> > >    going to require the same bandwidth and thus will be

> > >    under-provisioned.

> > >

> > > I still don't see where we are wrong in this case :/

> > >

> > > To me it looks different then the problem I described.

> > >

> > > > > With your new signal instead, once we cross the current capacity,

> > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > avoid to accumulate a sample for what we call "estimated utilization".

> >

> > This is not true. With the example above, the util_est will be exactly the same

> >  on big and little cores with the new signal

>

> ... AFAIU only if we have idle time...

>

> > > > > I would also say that, with the current implementation which caps

> > > > > utilization to the current capacity, we get better estimation in

> > > > > general. At least we can say with absolute precision:

> > > > >

> > > > >    "the task needs _at least_ that amount of capacity".

> > > > >

> > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > with it and the granted information we have.

> > > > >

> > > > > While, with your new signal, once we are over the current capacity,

> > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > drive some conclusions about how long the task has been delayed.

> >

> > see my comment above

> >

> > > > >

> > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > currently representing something very well defined: how much cpu

> > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > perhaps should be better placed somewhere else.

> > > > >

> > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > >

> > > What's your opinion on the above instead ?

> >

> > It's not a policy but it gives better knowledge about the amount a work done

> > I have put below discussion on the  subject on previous version

>

> Thanks, I think I've skimmed through it, but it's sill useful...

>

> > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > utilization. Regardless of where the task is currently/was running (and

> > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > >

> > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > CPU will have an utilization of 128 as an example

> > >

> > > >

> > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > its own. It has to be compared to the capacity of the CPU where it

> > > > is/was running to know what the its current PELT utilization means. When

> > >

> > > I would have said the opposite. The utilization of the task will

> > > always reflect the same amount of work that has been already done

> > > whatever the CPU capacity.

> > > In fact, the new scaling mechanism uses the real amount of work that

> > > has been already done to compute the utilization signal which is not

> > > the case currently. This gives more information about the real amount

> > > of worked that has been computed in the over utilization case.

> > >

> > > > the utilization over-shoots the capacity its value is no longer

> > > > represents utilization, it just means that it has a higher compute

> > > > demand than is offered on its current CPU and a high value means that it

> > > > has been suffering longer. It can't be used to predict the actual

> > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > PELT utilization.

> > >

> > > I think that it provides earlier detection of over utilization and

> > > more accurate signal for a longer time duration which can help the

> > > load balance

> > > Coming back to 50% task example . I will use a 50ms running time

> > > during a 100ms period for the example below to make it easier

> > >

> > > Starting from 0, the evolution of the utilization is:

> > >

> > > With contribution scaling:

> > >          time  0ms  50ms  100ms  150ms  200ms

> > > capacity

> > > 1024           0    666

> > > 512            0    333   453

> > > When the CPU start to be over utilized (@100ms), the utilization is

> > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > that we are over utilized

> > > 256            0    169   226    246    252

> > > That's even worse with this lower capacity

> > >

> > > With time scaling,

> > >          time  0ms  50ms  100ms  150ms  200ms

> > > capacity

> > > 1024           0    666

> > > 512            0    428   677

> > > We know that the current capacity is not enough and the utilization

> > > reflect the correct utilization level compare to 1024 capacity (the

> > > 666 vs 677 difference comes from the 1024us window so the last window

> > > is not full in the case of max capacity)

> > > 256            0    234   468    564    677

> > > At 100ms, we know that there is not enough capacity. (In fact we know

> > > that at 56ms). And even at time 200ms, the amount of work is exactly

> > > what would have been executed on a CPU 4x faster

> > >

> > > >

> > > > This change might not be a showstopper, but it is something to be aware

> > > > off and take into account wherever PELT utilization is used.

> > >

> > > The point above is clearly a big difference between the 2 approaches

> > > of the no spare cycle case but I think it will help by giving more

> > > information in the over utilization case

>

> I like the idea that we ramp up faster and always get to the same

> value. I like also the idea that we always reach the same value on

> both LITTLE and big.

>

> As long as there is idle time this is working fine, in these cases we

> should probably also collect util_est samples.

>

> But what happens when we don't have idle time ?


As shown above, the utilization stays correct for a longer time frame
even after the over-utilization point and provides better over-utilization
detection.

>

> Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

>

> Are not these two tasks being reported as 50% tasks (after a while) ?


Yes they will, but similarly to the above they will stay correct for a
longer time even when they become higher than the current cpu capacity.


>

> If that's the case, these are samples we should not store...

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Peter Zijlstra Nov. 29, 2018, 12:53 p.m. UTC | #12
On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index ac855b2f4774..93e0cf5d8a76 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>  	if (!task_sleep)

>  		return;

>  

> +	/* Skip samples which do not represent an actual utilization */

> +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> +		return;

> +

>  	/*

>  	 * If the PELT values haven't changed since enqueue time,

>  	 * skip the util_est update.


Would you not want something like:

	min(task_util(p), capacity_of(task_cpu(p)))

And is this the only place where we need this?

OTOH, if the task is always running, it will always be running
irrespective of where it runs.

Not storing these samples seems weird though; this is the exact
condition you want to record -- the task is very active; if we skip
these, we'll come back at a low frequency on the next wakeup.
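
For illustration, the clamping alternative could look like the sketch
below; the helper name is a placeholder and this is not the actual patch:

/* Sketch only: clamp the dequeue-time sample to the capacity of the CPU
 * the task was running on, instead of skipping it, so that an
 * always-running task still records a (capped) util_est sample. */
static inline unsigned long task_util_capped(struct task_struct *p)
{
	return min(task_util(p), capacity_of(task_cpu(p)));
}

util_est_dequeue() would then feed task_util_capped(p) into the
MAX(util_avg@dequeue_time, ewma) update instead of the raw task_util(p).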
Patrick Bellasi Nov. 29, 2018, 3 p.m. UTC | #13
On 29-Nov 11:43, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > On 28-Nov 16:42, Vincent Guittot wrote:

> > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > > >

> > > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > > >

> > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > > >

> > > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > > >

> > > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > > >

> > > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > > case we collect multiple samples above the current capacity.

> > > > > > >

> > > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > > The util_est is overestimated as well.

> > > > > >

> > > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > > >

> > > > > > Why do you say it's the same ?

> > > > >

> > > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > > during previous version,

> > > > > the utilization on the big core will reach 709 so will util_est too

> > > > > When the task migrates on little core (512), util_est is higher than

> > > > > current cpu capacity

> > > >

> > > > Right, and what's the problem ?

> > >

> > > you worry about an util_est being higher than capacity which is the case there

> >

> > I worry about util_est being higher then the capacity the task WAS

> > running... not the capacity the task IS running... if that value does

> > not correspond to what the task really need... (more on that at the

> > end).

> >

> > > > 1) We know that PELT is calibrated to 32ms period task and in your

> > > >    example, since the runtime is higher then the half-life, it's

> > > >    correct to estimate a utilization higher then 50%.

> > > >

> > > >    PELT utilization is defined _based on the half-life_: thus

> > > >    your task having a 50% duty cycle does not mean we are not correct

> > > >    if report a utilization != 50%.

> > > >    It would be as broken as reporting 10% utilization for a task

> > > >    running 100ms every 1s.

> > > >

> > > > 2) If it was a 70% task on a previous activation, once it's moved into

> > > >    a lower capacity CPU it's still correct to assume that it's likely

> > > >    going to require the same bandwidth and thus will be

> > > >    under-provisioned.

> > > >

> > > > I still don't see where we are wrong in this case :/

> > > >

> > > > To me it looks different then the problem I described.

> > > >

> > > > > > With your new signal instead, once we cross the current capacity,

> > > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > > avoid to accumulate a sample for what we call "estimated utilization".

> > >

> > > This is not true. With the example above, the util_est will be exactly the same

> > >  on big and little cores with the new signal

> >

> > ... AFAIU only if we have idle time...

> >

> > > > > > I would also say that, with the current implementation which caps

> > > > > > utilization to the current capacity, we get better estimation in

> > > > > > general. At least we can say with absolute precision:

> > > > > >

> > > > > >    "the task needs _at least_ that amount of capacity".

> > > > > >

> > > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > > with it and the granted information we have.

> > > > > >

> > > > > > While, with your new signal, once we are over the current capacity,

> > > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > > drive some conclusions about how long the task has been delayed.

> > >

> > > see my comment above

> > >

> > > > > >

> > > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > > currently representing something very well defined: how much cpu

> > > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > > perhaps should be better placed somewhere else.

> > > > > >

> > > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > > >

> > > > What's your opinion on the above instead ?

> > >

> > > It's not a policy but it gives better knowledge about the amount a work done

> > > I have put below discussion on the  subject on previous version

> >

> > Thanks, I think I've skimmed through it, but it's sill useful...

> >

> > > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > > utilization. Regardless of where the task is currently/was running (and

> > > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > > >

> > > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > > CPU will have an utilization of 128 as an example

> > > >

> > > > >

> > > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > > its own. It has to be compared to the capacity of the CPU where it

> > > > > is/was running to know what the its current PELT utilization means. When

> > > >

> > > > I would have said the opposite. The utilization of the task will

> > > > always reflect the same amount of work that has been already done

> > > > whatever the CPU capacity.

> > > > In fact, the new scaling mechanism uses the real amount of work that

> > > > has been already done to compute the utilization signal which is not

> > > > the case currently. This gives more information about the real amount

> > > > of worked that has been computed in the over utilization case.

> > > >

> > > > > the utilization over-shoots the capacity its value is no longer

> > > > > represents utilization, it just means that it has a higher compute

> > > > > demand than is offered on its current CPU and a high value means that it

> > > > > has been suffering longer. It can't be used to predict the actual

> > > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > > PELT utilization.

> > > >

> > > > I think that it provides earlier detection of over utilization and

> > > > more accurate signal for a longer time duration which can help the

> > > > load balance

> > > > Coming back to 50% task example . I will use a 50ms running time

> > > > during a 100ms period for the example below to make it easier

> > > >

> > > > Starting from 0, the evolution of the utilization is:

> > > >

> > > > With contribution scaling:

> > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > capacity

> > > > 1024           0    666

> > > > 512            0    333   453

> > > > When the CPU start to be over utilized (@100ms), the utilization is

> > > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > > that we are over utilized

> > > > 256            0    169   226    246    252

> > > > That's even worse with this lower capacity

> > > >





> > > > With time scaling,

> > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > capacity

> > > > 1024           0    666

> > > > 512            0    428   677

> > > > 256            0    234   468    564    677


[...]

> > I like the idea that we ramp up faster and always get to the same

> > value. I like also the idea that we always reach the same value on

> > both LITTLE and big.

> >

> > As long as there is idle time this is working fine, in these cases we

> > should probably also collect util_est samples.

> >

> > But what happens when we don't have idle time ?

> 

> As shown above, the utilization stays correct for a longer time frame

> even after the over utilization point and provides better over

> utilization detection

> 

> >

> > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

> >

> > Are not these two tasks being reported as 50% tasks (after a while) ?

> 

> Yes they will but similarly to above they will stay correct for longer

> time even when they become higher than current cpu capacity


Seems we agree that, when there is no idle time:
- the two 15% tasks will be overestimated
- their utilization will reach 50% after a while

If I'm not wrong, we will have:
- 30% CPU util in  ~16ms @1024 capacity
                   ~64ms  @256 capacity

Thus, the tasks will certainly be over-estimated after ~64ms.
Is that correct ?
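
A rough sketch of where these figures come from, assuming a 32ms PELT
half-life and that, with time scaling, clock_pelt advances at
capacity/1024 of wall-clock time (standalone C, not kernel code):

#include <math.h>
#include <stdio.h>

/* Wall-clock ms for a continuously running rq to first reach the given
 * utilization fraction, starting from 0. */
static double time_to_reach(double util_frac, double capacity)
{
	double y = pow(0.5, 1.0 / 32.0);		/* per-ms decay, 32ms half-life */
	double t_pelt = log(1.0 - util_frac) / log(y);	/* ms of pelt time */

	return t_pelt * 1024.0 / capacity;		/* scaled back to wall-clock time */
}

int main(void)
{
	printf("30%% util @1024: ~%.0f ms\n", time_to_reach(0.30, 1024));
	printf("30%% util @256:  ~%.0f ms\n", time_to_reach(0.30, 256));
	return 0;
}

This prints roughly 16ms and 66ms, in line with the ~16ms / ~64ms
estimate above.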

> > If that's the case, these are samples we should not store...


Now, we can argue that 64ms is a pretty long time and thus it's quite
unlikely we will have no idle for such a long time.

Still, I'm wondering if we should keep collecting those samples or
better find a way to detect that and skip the sampling.

-- 
#include <best/regards.h>

Patrick Bellasi
Patrick Bellasi Nov. 29, 2018, 3:13 p.m. UTC | #14
On 29-Nov 13:53, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> 

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index ac855b2f4774..93e0cf5d8a76 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> >  	if (!task_sleep)

> >  		return;

> > 

> > +	/* Skip samples which do not represent an actual utilization */

> > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > +		return;

> > +

> >  	/*

> >  	 * If the PELT values haven't changed since enqueue time,

> >  	 * skip the util_est update.

> 

> Would you not want something like:

> 

> 	min(task_util(p), capacity_of(task_cpu(p)))

> 

> And is this the only place where we need this?


Mmm... even this could be an over-estimation:

I've just posted an example in my last reply to Vincent, end of:

   Message-ID: <20181129150020.GF23094@e110439-lin>
   https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

> OTOH, if the task is always running, it will be always running

> irrespective of where it runs.


That's not what I'm concerned about. I'm concerned about small tasks
which are running on limited capacity (e.g. due to thermal capping)
without idle time. In this case, the new "utilization" signal could
overestimate the real task needs.

> Not storing these samples seems weird though; this is the exact

> condition you want to record -- the task is very active, if we skip

> these, we'll come back at a low frequency on the next wakeup.


When there is no idle time, we don't know whether the reported
utilization, above the cpu capacity, is due to the task actually being
bigger... or just to the new utilization signal converging towards:

    100% / RUNNABLE_TASKS_COUNT
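
A toy simulation of that convergence (plain C, not kernel code; two
always-runnable tasks sharing one CPU in 1ms round-robin slices, 32ms
half-life, no idle time):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-ms decay, 32ms half-life */
	double util[2] = { 0.0, 0.0 };
	int ms, t;

	for (ms = 0; ms < 1000; ms++) {
		int running = ms & 1;		/* the two tasks alternate every 1ms */

		for (t = 0; t < 2; t++) {
			util[t] *= y;
			if (t == running)
				util[t] += (1.0 - y) * 1024.0;
		}
	}

	printf("task0 ~%.0f, task1 ~%.0f\n", util[0], util[1]);
	return 0;
}

Both values settle around 512, i.e. 100% / 2, even though each task
only asks for 15%.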

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Nov. 29, 2018, 4:19 p.m. UTC | #15
On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 29-Nov 11:43, Vincent Guittot wrote:

> > On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > On 28-Nov 16:42, Vincent Guittot wrote:

> > > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > On 28-Nov 15:55, Vincent Guittot wrote:

> > > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > On 28-Nov 14:33, Vincent Guittot wrote:

> > > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote:

> > > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:

> > > > > > > > > >

> > > > > > > > > > > Is there anything else that I should do for these patches ?

> > > > > > > > > >

> > > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain.

> > > > > > > > >

> > > > > > > > > I guess the problem is that, once we cross the current capacity,

> > > > > > > > > strictly speaking util_avg does not represent anymore a utilization.

> > > > > > > > >

> > > > > > > > > With the new signal this could happen and we end up storing estimated

> > > > > > > > > utilization samples which will overestimate the task requirements.

> > > > > > > > >

> > > > > > > > > We will have a spike in estimated utilization at next wakeup, since we

> > > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in

> > > > > > > > > case we collect multiple samples above the current capacity.

> > > > > > > >

> > > > > > > > TBH I don't see how it's different from current implementation with a

> > > > > > > > task that was scheduled on big core and now wakes up on little core.

> > > > > > > > The util_est is overestimated as well.

> > > > > > >

> > > > > > > While running below the capacity of a CPU, either big or LITTLE, we

> > > > > > > can still measure the actual used bandwidth as long as we have idle

> > > > > > > time. If the task is then moved into a lower capacity core, I think

> > > > > > > it's still safe to assume that, likely, it would need more capacity.

> > > > > > >

> > > > > > > Why do you say it's the same ?

> > > > > >

> > > > > > In the example of a task that runs 39ms in period of 80ms that we used

> > > > > > during previous version,

> > > > > > the utilization on the big core will reach 709 so will util_est too

> > > > > > When the task migrates on little core (512), util_est is higher than

> > > > > > current cpu capacity

> > > > >

> > > > > Right, and what's the problem ?

> > > >

> > > > you worry about an util_est being higher than capacity which is the case there

> > >

> > > I worry about util_est being higher then the capacity the task WAS

> > > running... not the capacity the task IS running... if that value does

> > > not correspond to what the task really need... (more on that at the

> > > end).

> > >

> > > > > 1) We know that PELT is calibrated to 32ms period task and in your

> > > > >    example, since the runtime is higher then the half-life, it's

> > > > >    correct to estimate a utilization higher then 50%.

> > > > >

> > > > >    PELT utilization is defined _based on the half-life_: thus

> > > > >    your task having a 50% duty cycle does not mean we are not correct

> > > > >    if report a utilization != 50%.

> > > > >    It would be as broken as reporting 10% utilization for a task

> > > > >    running 100ms every 1s.

> > > > >

> > > > > 2) If it was a 70% task on a previous activation, once it's moved into

> > > > >    a lower capacity CPU it's still correct to assume that it's likely

> > > > >    going to require the same bandwidth and thus will be

> > > > >    under-provisioned.

> > > > >

> > > > > I still don't see where we are wrong in this case :/

> > > > >

> > > > > To me it looks different then the problem I described.

> > > > >

> > > > > > > With your new signal instead, once we cross the current capacity,

> > > > > > > utilization is just not anymore utilization. Thus, IMHO it make sense

> > > > > > > avoid to accumulate a sample for what we call "estimated utilization".

> > > >

> > > > This is not true. With the example above, the util_est will be exactly the same

> > > >  on big and little cores with the new signal

> > >

> > > ... AFAIU only if we have idle time...

> > >

> > > > > > > I would also say that, with the current implementation which caps

> > > > > > > utilization to the current capacity, we get better estimation in

> > > > > > > general. At least we can say with absolute precision:

> > > > > > >

> > > > > > >    "the task needs _at least_ that amount of capacity".

> > > > > > >

> > > > > > > Potentially we can also flag the task as being under-provisioned, in

> > > > > > > case there was not idle time, and _let a policy_ decide what to do

> > > > > > > with it and the granted information we have.

> > > > > > >

> > > > > > > While, with your new signal, once we are over the current capacity,

> > > > > > > the "utilization" is just a sort of "random" number at best useful to

> > > > > > > drive some conclusions about how long the task has been delayed.

> > > >

> > > > see my comment above

> > > >

> > > > > > >

> > > > > > > IOW, I fear that we are embedding a policy within a signal which is

> > > > > > > currently representing something very well defined: how much cpu

> > > > > > > bandwidth a task used. While, latency/under-provisioning policies

> > > > > > > perhaps should be better placed somewhere else.

> > > > > > >

> > > > > > > Perhaps I've missed it in some of the previous discussions:

> > > > > > > have we have considered/discussed this signal-vs-policy aspect ?

> > > > >

> > > > > What's your opinion on the above instead ?

> > > >

> > > > It's not a policy but it gives better knowledge about the amount a work done

> > > > I have put below discussion on the  subject on previous version

> > >

> > > Thanks, I think I've skimmed through it, but it's sill useful...

> > >

> > > > > > With contribution scaling the PELT utilization of a task is a _minimum_

> > > > > > utilization. Regardless of where the task is currently/was running (and

> > > > > > provided that it doesn't change behaviour) its PELT utilization will

> > > > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU.

> > > > >

> > > > > The main drawback is that the _minimum_ utilization depends on the CPU

> > > > > capacity on which the task runs. The two 25% tasks on a 256 capacity

> > > > > CPU will have an utilization of 128 as an example

> > > > >

> > > > > >

> > > > > > With time scaling the PELT utilization doesn't really have a meaning on

> > > > > > its own. It has to be compared to the capacity of the CPU where it

> > > > > > is/was running to know what the its current PELT utilization means. When

> > > > >

> > > > > I would have said the opposite. The utilization of the task will

> > > > > always reflect the same amount of work that has been already done

> > > > > whatever the CPU capacity.

> > > > > In fact, the new scaling mechanism uses the real amount of work that

> > > > > has been already done to compute the utilization signal which is not

> > > > > the case currently. This gives more information about the real amount

> > > > > of worked that has been computed in the over utilization case.

> > > > >

> > > > > > the utilization over-shoots the capacity its value is no longer

> > > > > > represents utilization, it just means that it has a higher compute

> > > > > > demand than is offered on its current CPU and a high value means that it

> > > > > > has been suffering longer. It can't be used to predict the actual

> > > > > > utilization on an idle 1024 capacity any better than contribution scaled

> > > > > > PELT utilization.

> > > > >

> > > > > I think that it provides earlier detection of over utilization and

> > > > > more accurate signal for a longer time duration which can help the

> > > > > load balance

> > > > > Coming back to 50% task example . I will use a 50ms running time

> > > > > during a 100ms period for the example below to make it easier

> > > > >

> > > > > Starting from 0, the evolution of the utilization is:

> > > > >

> > > > > With contribution scaling:

> > > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > > capacity

> > > > > 1024           0    666

> > > > > 512            0    333   453

> > > > > When the CPU start to be over utilized (@100ms), the utilization is

> > > > > already too low (453 instead of 666) and scheduler doesn't detect yet

> > > > > that we are over utilized

> > > > > 256            0    169   226    246    252

> > > > > That's even worse with this lower capacity

> > > > >

>

>

>

>

> > > > > With time scaling,

> > > > >          time  0ms  50ms  100ms  150ms  200ms

> > > > > capacity

> > > > > 1024           0    666

> > > > > 512            0    428   677

> > > > > 256            0    234   468    564    677

>

> [...]

>

> > > I like the idea that we ramp up faster and always get to the same

> > > value. I like also the idea that we always reach the same value on

> > > both LITTLE and big.

> > >

> > > As long as there is idle time this is working fine, in these cases we

> > > should probably also collect util_est samples.

> > >

> > > But what happens when we don't have idle time ?

> >

> > As shown above, the utilization stays correct for a longer time frame

> > even after the over utilization point and provides better over

> > utilization detection

> >

> > >

> > > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

> > >

> > > Are not these two tasks being reported as 50% tasks (after a while) ?

> >

> > Yes they will but similarly to above they will stay correct for longer

> > time even when they become higher than current cpu capacity

>

> Seems we agree that, when there is no idle time:

> - the two 15% tasks will be overestimated

> - their utilization will reach 50% after a while

>

> If I'm not wrong, we will have:

> - 30% CPU util in  ~16ms @1024 capacity

>                    ~64ms  @256 capacity

>

> Thus, the tasks will be certainly over-estimated after ~64ms.

> Is that correct ?


From a pure util_avg pov it's correct
But I'd like to weigh that a bit with the example below

>

> > > If that's the case, these are samples we should not store...

>

> Now, we can argue that 64ms is a pretty long time and thus it's quite

> unlucky we will have no idle for such a long time.

>

> Still, I'm wondering if we should keep collecting those samples or

> better find a way to detect that and skip the sampling.


The problem is that you can have util_avg above capacity even with idle time.
In the 1st example of this thread, the 39ms/80ms task will reach 709,
which is the value saved by util_est on a big core.
But on a core with half the capacity, there is still idle time, so 709 is a
correct value although it is above 512.

In fact, the max will always be above the linear ratio because it's based
on a geometric series.

And this is true even for a 15.6ms/32ms task (same ratio as above),
although the impact is smaller (the max value, which should be saved by
util_est, becomes 587 in this case).
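
For reference, a small sketch of that geometric-series maximum
(standalone C, assuming the standard steady-state bound for a periodic
task with a 32ms half-life):

#include <math.h>
#include <stdio.h>

/* Steady-state maximum util_avg of a task running 'run' ms every
 * 'period' ms, with the 32ms PELT half-life. */
static double pelt_max_util(double run, double period)
{
	double y = pow(0.5, 1.0 / 32.0);

	return 1024.0 * (1.0 - pow(y, run)) / (1.0 - pow(y, period));
}

int main(void)
{
	printf("39ms/80ms   -> ~%.0f\n", pelt_max_util(39.0, 80.0));
	printf("15.6ms/32ms -> ~%.0f\n", pelt_max_util(15.6, 32.0));
	return 0;
}

This prints ~709 and ~587; both are above the linear 48.75% ratio
(~499), which is the point about the geometric series.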






>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Patrick Bellasi Jan. 10, 2019, 3:30 p.m. UTC | #16
On 29-Nov 17:19, Vincent Guittot wrote:
> On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > On 29-Nov 11:43, Vincent Guittot wrote:


[...]

> > Seems we agree that, when there is no idle time:

> > - the two 15% tasks will be overestimated

> > - their utilization will reach 50% after a while

> >

> > If I'm not wrong, we will have:

> > - 30% CPU util in  ~16ms @1024 capacity

> >                    ~64ms  @256 capacity

> >

> > Thus, the tasks will be certainly over-estimated after ~64ms.

> > Is that correct ?

> 

> From a pure util_avg pov it's correct

> But i'd like to weight that a bit with the example below

> 

> > Now, we can argue that 64ms is a pretty long time and thus it's quite

> > unlucky we will have no idle for such a long time.

> >

> > Still, I'm wondering if we should keep collecting those samples or

> > better find a way to detect that and skip the sampling.

> 

> The problem is that you can have util_avg above capacity even with idle time

> In the 1st example of this thread, the 39ms/80ms task will reach 709

> which is the value saved by util_est on a big core

> But on core with half capacity, there is still idle time so 709 is a

> correct value although above 512


Right, I see your point and (in principle) I like the idea of
collecting samples for tasks which happen to run at a lower capacity
than required and the utilization value makes sense...

> In fact, max will be always above the linear ratio because it's based

> on geometric series

> 

> And this is true even with 15.6ms/32ms (same ratio as above) task

> although the impact is smaller (max value, which should be saved by

> util est, becomes  587 in this case).


However that's not always the case... as per my example above.

Moreover, we should also consider that util_est is mainly meant to be
a lower bound for a task's utilization.
That's why task_util_est() already returns the actual util_avg when
it's higher than the estimated utilization.

With your new signal and without any special check on sample
collection, if a task is limited because of thermal capping for
example, we could end up overestimating its utilization and thus
perhaps generating an unwanted frequency spike when the capping is
relaxed... and (even worse) it will take some more activations for the
estimated utilization to converge back to the actual utilization.

Since we cannot easily know if there is idle time in a CPU when a task
completes an activation with a utilization higher than the CPU
capacity, I would prefer to just skip the sampling with
something like:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9332863d122a..485053026533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 {
 	long last_ewma_diff;
 	struct util_est ue;
+	int cpu;
 
 	if (!sched_feat(UTIL_EST))
 		return;
@@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
 		return;
 
+	/*
+	 * To avoid overestimation of actual task utilization, skip updates if
+	 * we cannot guarantee there is idle time on this CPU.
+	 */
+	cpu = cpu_of(rq_of(cfs_rq));
+	if (task_util(p) > capacity_of(cpu))
+		return;
+
 	/*
 	 * Update Task's estimated utilization
 	 *
---8<---

At least this will ensure that util_est always provides an actual
measured lower bound for a task's utilization.

If you think this makes sense, feel free to add such a patch on
top of your series.

Cheers Patrick

-- 
#include <best/regards.h>

Patrick Bellasi
Vincent Guittot Jan. 11, 2019, 2:29 p.m. UTC | #17
On Thu, 10 Jan 2019 at 16:30, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>

> On 29-Nov 17:19, Vincent Guittot wrote:

> > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote:

> > > On 29-Nov 11:43, Vincent Guittot wrote:

>

> [...]

>

> > > Seems we agree that, when there is no idle time:

> > > - the two 15% tasks will be overestimated

> > > - their utilization will reach 50% after a while

> > >

> > > If I'm not wrong, we will have:

> > > - 30% CPU util in  ~16ms @1024 capacity

> > >                    ~64ms  @256 capacity

> > >

> > > Thus, the tasks will be certainly over-estimated after ~64ms.

> > > Is that correct ?

> >

> > From a pure util_avg pov it's correct

> > But i'd like to weight that a bit with the example below

> >

> > > Now, we can argue that 64ms is a pretty long time and thus it's quite

> > > unlucky we will have no idle for such a long time.

> > >

> > > Still, I'm wondering if we should keep collecting those samples or

> > > better find a way to detect that and skip the sampling.

> >

> > The problem is that you can have util_avg above capacity even with idle time

> > In the 1st example of this thread, the 39ms/80ms task will reach 709

> > which is the value saved by util_est on a big core

> > But on core with half capacity, there is still idle time so 709 is a

> > correct value although above 512

>

> Right, I see your point and (in principle) I like the idea of

> collecting samples for tasks which happen to run at a lower capacity

> then required and the utilization value makes sense...

>

> > In fact, max will be always above the linear ratio because it's based

> > on geometric series

> >

> > And this is true even with 15.6ms/32ms (same ratio as above) task

> > although the impact is smaller (max value, which should be saved by

> > util est, becomes  587 in this case).

>

> However that's not always the case... as per my example above.

>

> Moreover, we should also consider that util_est is mainly meant to be

> a lower-bound for tasks utilization.

> That's why task_util_est() already returns the actual util_avg when

> it's higher than the estimated utilization.


I can imagine that the fact that we use max(util_avg, util_est) helps
to keep using a correct utilization in the scheduler when util_avg goes
above the cpu capacity while there is still idle time

>

> With your new signal and without any special check on samples

> collection, if a task is limited because of thermal capping for

> example, we could end up overestimating its utilization and thus

> perhaps generating an unwanted frequency spike when the capping is

> relaxed... and (even worst) it will take some more activations for the

> estimated utilization to converge back to the actual utilization.

>

> Since we cannot easily know if there is idle time in a CPU when a task

> completes an activation with a utilization higher then the CPU

> capacity, I would better prefer to just skip the sampling with

> something like:

>

> ---8<---

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 9332863d122a..485053026533 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>  {

>         long last_ewma_diff;

>         struct util_est ue;

> +       int cpu;

>

>         if (!sched_feat(UTIL_EST))

>                 return;

> @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

>         if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))

>                 return;

>

> +       /*

> +        * To avoid overestimation of actual task utilization, skip updates if

> +        * we cannot grant there is idle time in this CPU.

> +        */

> +       cpu = cpu_of(rq_of(cfs_rq));

> +       if (task_util(p) > cpu_capacity(cpu))

> +               return;

> +

>         /*

>          * Update Task's estimated utilization

>          *

> ---8<---

>

> At least this will ensure that util_est always provides an actual

> measured lower bound for a task utilization.

>

> If you think this makes sense, feel free to add such a patch on

> top of your series.


ok. I'm going to add it when rebasing the series

Thanks
Vincent
>

> Cheers Patrick

>

> --

> #include <best/regards.h>

>

> Patrick Bellasi
Peter Zijlstra Jan. 24, 2019, 9:07 a.m. UTC | #18
Sorry; trying to get back to this and re-reading the old conversations.

On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote:
> On 29-Nov 13:53, Peter Zijlstra wrote:

> > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> > 

> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > > index ac855b2f4774..93e0cf5d8a76 100644

> > > --- a/kernel/sched/fair.c

> > > +++ b/kernel/sched/fair.c

> > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> > >  	if (!task_sleep)

> > >  		return;

> > > 

> > > +	/* Skip samples which do not represent an actual utilization */

> > > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > > +		return;

> > > +

> > >  	/*

> > >  	 * If the PELT values haven't changed since enqueue time,

> > >  	 * skip the util_est update.

> > 

> > Would you not want something like:

> > 

> > 	min(task_util(p), capacity_of(task_cpu(p)))

> > 

> > And is this the only place where we need this?

> 

> Mmm... even this could be an over-estimation:

> 

> I've just posted an example in my last reply to Vincent, end of:

> 

>    Message-ID: <20181129150020.GF23094@e110439-lin>

>    https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/


In particular this bit:

 | Seems we agree that, when there is no idle time:
 | - the two 15% tasks will be overestimated
 | - their utilization will reach 50% after a while

Right?

> > OTOH, if the task is always running, it will be always running

> > irrespective of where it runs.

> 

> That's not what I'm concerned about. I'm concerned about small tasks

> which are running on limited capacity (e.g. due to thermal capping)

> without idle time. In this case, the new "utilization" signal could

> overestimate the real task needs.

> 

> > Not storing these samples seems weird though; this is the exact

> > condition you want to record -- the task is very active, if we skip

> > these, we'll come back at a low frequency on the next wakeup.

> 

> When there is not idle time, we don't know if the reported

> utilization, above the cpu capacity, is due to the task being bigger...

> or just the new utilization signal converging towards:

> 

>     100% / RUNNABLE_TASKS_COUNT


So if I'm not mistaken we then have 3 cases:

 1) runnable == util <= capacity

    no contention, idle

 2) runnable == util > capacity

    no contention, no idle

 3) runnable > util

    contention, no idle

For 1) we can use: 'util'
For 2) we can use: 'capacity'
For 3) we can use: 'util * capacity >> 10'

(note that 2 is a special case of 3 when u=1)

This should work right?
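
A minimal standalone sketch of that clamping (plain C, not a kernel
patch; util, runnable and capacity are all assumed to be on the 1024
fixed-point scale):

#define SCHED_CAPACITY_SHIFT	10

/*
 * util_est sample for the three cases above:
 *  1) runnable == util <= capacity : util
 *  2) runnable == util >  capacity : capacity
 *  3) runnable >  util             : util * capacity >> 10
 */
unsigned long util_est_sample(unsigned long util, unsigned long runnable,
			      unsigned long capacity)
{
	if (runnable == util)
		return util <= capacity ? util : capacity;

	return (util * capacity) >> SCHED_CAPACITY_SHIFT;
}

In practice runnable and util are rarely exactly equal, which is part
of the "discrete stuff" caveat raised further down the thread.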

Now, instead of doing complicated things like that, you instead figure
that when there's no idle there's also no dequeue happening and we can
simply short-cut by skipping the entire thing, forgetting everything
about 2,3.

Did I get that right?
Patrick Bellasi Jan. 24, 2019, 2:04 p.m. UTC | #19
On 24-Jan 10:07, Peter Zijlstra wrote:
> 

> Sorry; trying to get back to this and re-reading the old conversations.

> 

> On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote:

> > On 29-Nov 13:53, Peter Zijlstra wrote:

> > > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> > > 

> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > > > index ac855b2f4774..93e0cf5d8a76 100644

> > > > --- a/kernel/sched/fair.c

> > > > +++ b/kernel/sched/fair.c

> > > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)

> > > >  	if (!task_sleep)

> > > >  		return;

> > > > 

> > > > +	/* Skip samples which do not represent an actual utilization */

> > > > +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))

> > > > +		return;

> > > > +

> > > >  	/*

> > > >  	 * If the PELT values haven't changed since enqueue time,

> > > >  	 * skip the util_est update.

> > > 

> > > Would you not want something like:

> > > 

> > > 	min(task_util(p), capacity_of(task_cpu(p)))

> > > 

> > > And is this the only place where we need this?

> > 

> > Mmm... even this could be an over-estimation:

> > 

> > I've just posted an example in my last reply to Vincent, end of:

> > 

> >    Message-ID: <20181129150020.GF23094@e110439-lin>

> >    https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/

> 

> In particular this bit:

> 

>  | Seems we agree that, when there is no idle time:

>  | - the two 15% tasks will be overestimated

>  | - their utilization will reach 50% after a while

> 

> Right?

> 

> > > OTOH, if the task is always running, it will be always running

> > > irrespective of where it runs.

> > 

> > That's not what I'm concerned about. I'm concerned about small tasks

> > which are running on limited capacity (e.g. due to thermal capping)

> > without idle time. In this case, the new "utilization" signal could

> > overestimate the real task needs.

> > 

> > > Not storing these samples seems weird though; this is the exact

> > > condition you want to record -- the task is very active, if we skip

> > > these, we'll come back at a low frequency on the next wakeup.

> > 

> > When there is not idle time, we don't know if the reported

> > utilization, above the cpu capacity, is due to the task being bigger...

> > or just the new utilization signal converging towards:

> > 

> >     100% / RUNNABLE_TASKS_COUNT

> 

> So if I'm not mistaken we then have 3 cases:

> 

>  1) runnable == util <= capacity

> 

>     no contention, idle

> 

>  2) runnable == util > capacity

> 

>     no contention, no idle

> 

>  3) runnable > util

> 

>     contention, no idle

> 

> For 1) we can use: 'util'

> For 2) we can use: 'capacity'

> For 3) we can use: 'util * capacity >> 10'

> 

> (note that 2 is a special case of 3 when u=1)

> 

> This should work right?


I think there is a case, similar to 2, in which the new 'util' could
potentially be used. That's the case for example of a 20% (estimated)
utilization task running alone on a 15% capacity CPU, for a single
activation. In that case such a task will complete and be dequeued
with:

   runnable == util > capacity

The problem is that we need to be sure there was not contention... and
that seems to be difficult to detect.

> Now, instead of doing complicated things like that, you instead figure

> that when there's no idle there's also no dequeue happening and we can

> simply short-cut by skipping the entire thing, forgetting everything

> about 2,3.

> 

> Did I get that right?


More or less... just saying that 1 is the only easy-to-detect scenario
in which we are guaranteed that the utilization represents an actual
bandwidth request, and thus the only one whose values are safe to sample
for estimated utilization. For the other cases, since anyway:

   util_est := max(max(ewma, last_util), util_avg)

util_est will just keep representing a safe and actually measured
lower bound for the expected utilization of a task, without
side-effecting the EWMA, which has a "slow" update dynamic.
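
A toy model of that composition (hypothetical helper names, not the
kernel code) shows why skipping a dubious sample is relatively
harmless: the EWMA and the last stored sample stay untouched, while a
genuinely busy task is still covered by the live util_avg through the
max:

/* Estimated utilization as seen by consumers: the max of the slow
 * EWMA, the last stored dequeue sample and the live util_avg. */
unsigned long task_util_estimate(unsigned long ewma, unsigned long last_util,
				 unsigned long util_avg)
{
	unsigned long est = ewma > last_util ? ewma : last_util;

	return est > util_avg ? est : util_avg;
}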

-- 
#include <best/regards.h>

Patrick Bellasi
Peter Zijlstra Jan. 29, 2019, 7:42 p.m. UTC | #20
On Thu, Jan 24, 2019 at 02:04:32PM +0000, Patrick Bellasi wrote:

> > So if I'm not mistaken we then have 3 cases:

> > 

> >  1) runnable == util <= capacity

> > 

> >     no contention, idle

> > 

> >  2) runnable == util > capacity

> > 

> >     no contention, no idle

> > 

> >  3) runnable > util

> > 

> >     contention, no idle

> > 

> > For 1) we can use: 'util'

> > For 2) we can use: 'capacity'

> > For 3) we can use: 'util * capacity >> 10'

> > 

> > (note that 2 is a special case of 3 when u=1)

> > 

> > This should work right?

> 

> I think there is a case, similar to 2, in which the new 'util' could

> potentially be used. That's the case for example of a 20% (estimated)

> utilization task running alone on a 15% capacity CPU, for a single

> activation. In that case such a task will complete and be dequeued

> with:

> 

>    runnable == util > capacity

> 

> The problem is that we need to be sure there was not contention... and

> that seems to be difficult to detect.


When there is contention, runnable and util should diverge. Given this is
all discrete stuff, there's a few funnies, but who cares about those :-)

> > Now, instead of doing complicated things like that, you instead figure

> > that when there's no idle there's also no dequeue happening and we can

> > simply short-cut by skipping the entire thing, forgetting everything

> > about 2,3.

> > 

> > Did I get that right?

> 

> More or less... just saying that 1 is the only easy to detect scenario

> in which we are granted the utilization represents an actual bandwidth

> request and thus the only safe values to sample for estimated

> utilization. For the other cases, since anyway:

> 

>    util_est := max(max(ewma, last_util), util_avg)

> 

> util_est will just keep representing a safe and actually measured

> lower-bound for the expected utilization of a task, without

> side-affecting the EWMA which has a "slow" update dynamic.


Right, so maybe we should expound the comment here a bit; but otherwise
I'm inclined to merge that v9 Vincent posted.
diff mbox series

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f8a541..d61a523 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,12 +356,6 @@  struct util_est {
  * For cfs_rq, it is the aggregated load_avg of all runnable and
  * blocked sched_entities.
  *
- * load_avg may also take frequency scaling into account:
- *
- *   load_avg = runnable% * scale_load_down(load) * freq%
- *
- * where freq% is the CPU frequency normalized to the highest frequency.
- *
  * [util_avg definition]
  *
  *   util_avg = running% * SCHED_CAPACITY_SCALE
@@ -370,17 +364,14 @@  struct util_est {
  * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable
  * and blocked sched_entities.
  *
- * util_avg may also factor frequency scaling and CPU capacity scaling:
- *
- *   util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity%
- *
- * where freq% is the same as above, and capacity% is the CPU capacity
- * normalized to the greatest capacity (due to uarch differences, etc).
+ * load_avg and util_avg don't directly factor frequency scaling and CPU
+ * capacity scaling. The scaling is done through the rq_clock_pelt that
+ * is used for computing those signals (see update_rq_clock_pelt())
  *
- * N.B., the above ratios (runnable%, running%, freq%, and capacity%)
- * themselves are in the range of [0, 1]. To do fixed point arithmetics,
- * we therefore scale them to as large a range as necessary. This is for
- * example reflected by util_avg's SCHED_CAPACITY_SCALE.
+ * N.B., the above ratios (runnable% and running%) themselves are in the
+ * range of [0, 1]. To do fixed point arithmetics, we therefore scale them
+ * to as large a range as necessary. This is for example reflected by
+ * util_avg's SCHED_CAPACITY_SCALE.
  *
  * [Overflow issue]
  *
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5afb868..79d57da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -180,6 +180,7 @@  static void update_rq_clock_task(struct rq *rq, s64 delta)
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
 #endif
+	update_rq_clock_pelt(rq, delta);
 }
 
 void update_rq_clock(struct rq *rq)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 470ba6b..a9f1d62 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1767,7 +1767,7 @@  pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	deadline_queue_push_tasks(rq);
 
 	if (rq->curr->sched_class != &dl_sched_class)
-		update_dl_rq_load_avg(rq_clock_task(rq), rq, 0);
+		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	return p;
 }
@@ -1776,7 +1776,7 @@  static void put_prev_task_dl(struct rq *rq, struct task_struct *p)
 {
 	update_curr_dl(rq);
 
-	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1)
 		enqueue_pushable_dl_task(rq, p);
 }
@@ -1793,7 +1793,7 @@  static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued)
 {
 	update_curr_dl(rq);
 
-	update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 	/*
 	 * Even when we have runtime, update_curr_dl() might have resulted in us
 	 * not being the leftmost task anymore. In that case NEED_RESCHED will
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9cd1f32..4debba7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,9 +674,8 @@  static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return calc_delta_fair(sched_slice(cfs_rq, se), se);
 }
 
-#ifdef CONFIG_SMP
 #include "pelt.h"
-#include "sched-pelt.h"
+#ifdef CONFIG_SMP
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
@@ -764,7 +763,7 @@  void post_init_entity_util_avg(struct sched_entity *se)
 			 * such that the next switched_to_fair() has the
 			 * expected state.
 			 */
-			se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
+			se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
 			return;
 		}
 	}
@@ -3110,7 +3109,7 @@  void set_task_rq_fair(struct sched_entity *se,
 	p_last_update_time = prev->avg.last_update_time;
 	n_last_update_time = next->avg.last_update_time;
 #endif
-	__update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se);
+	__update_load_avg_blocked_se(p_last_update_time, se);
 	se->avg.last_update_time = n_last_update_time;
 }
 
@@ -3245,11 +3244,11 @@  update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf
 
 	/*
 	 * runnable_sum can't be lower than running_sum
-	 * As running sum is scale with CPU capacity wehreas the runnable sum
-	 * is not we rescale running_sum 1st
+	 * Rescale running sum to be in the same range as runnable sum
+	 * running_sum is in [0 : LOAD_AVG_MAX <<  SCHED_CAPACITY_SHIFT]
+	 * runnable_sum is in [0 : LOAD_AVG_MAX]
 	 */
-	running_sum = se->avg.util_sum /
-		arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+	running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT;
 	runnable_sum = max(runnable_sum, running_sum);
 
 	load_sum = (s64)se_weight(se) * runnable_sum;
@@ -3352,7 +3351,7 @@  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
 
 /**
  * update_cfs_rq_load_avg - update the cfs_rq's load/util averages
- * @now: current time, as per cfs_rq_clock_task()
+ * @now: current time, as per cfs_rq_clock_pelt()
  * @cfs_rq: cfs_rq to update
  *
  * The cfs_rq avg is the direct sum of all its entities (blocked and runnable)
@@ -3397,7 +3396,7 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 		decayed = 1;
 	}
 
-	decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq);
+	decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
 
 #ifndef CONFIG_64BIT
 	smp_wmb();
@@ -3487,9 +3486,7 @@  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
-	u64 now = cfs_rq_clock_task(cfs_rq);
-	struct rq *rq = rq_of(cfs_rq);
-	int cpu = cpu_of(rq);
+	u64 now = cfs_rq_clock_pelt(cfs_rq);
 	int decayed;
 
 	/*
@@ -3497,7 +3494,7 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	 * track group sched_entity load average for task_h_load calc in migration
 	 */
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-		__update_load_avg_se(now, cpu, cfs_rq, se);
+		__update_load_avg_se(now, cfs_rq, se);
 
 	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
 	decayed |= propagate_entity_load_avg(se);
@@ -3549,7 +3546,7 @@  void sync_entity_load_avg(struct sched_entity *se)
 	u64 last_update_time;
 
 	last_update_time = cfs_rq_last_update_time(cfs_rq);
-	__update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se);
+	__update_load_avg_blocked_se(last_update_time, se);
 }
 
 /*
@@ -6763,6 +6760,12 @@  done: __maybe_unused;
 	if (new_tasks > 0)
 		goto again;
 
+	/*
+	 * rq is about to be idle, check if we need to update the
+	 * lost_idle_time of clock_pelt
+	 */
+	update_idle_rq_clock_pelt(rq);
+
 	return NULL;
 }
 
@@ -7422,7 +7425,7 @@  static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
 			update_tg_load_avg(cfs_rq, 0);
 
 		/* Propagate pending load changes to the parent, if any: */
@@ -7443,8 +7446,8 @@  static void update_blocked_averages(int cpu)
 	}
 
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class);
+	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
 	update_irq_load_avg(rq, 0);
 	/* Don't need periodic decay once load/util_avg are null */
 	if (others_have_blocked(rq))
@@ -7514,11 +7517,11 @@  static inline void update_blocked_averages(int cpu)
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
-	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
+	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
 
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class);
+	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
 	update_irq_load_avg(rq, 0);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 90fb5bc..befce29 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -26,7 +26,6 @@ 
 
 #include <linux/sched.h>
 #include "sched.h"
-#include "sched-pelt.h"
 #include "pelt.h"
 
 /*
@@ -106,16 +105,12 @@  static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
  *                     n=1
  */
 static __always_inline u32
-accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
+accumulate_sum(u64 delta, struct sched_avg *sa,
 	       unsigned long load, unsigned long runnable, int running)
 {
-	unsigned long scale_freq, scale_cpu;
 	u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
 	u64 periods;
 
-	scale_freq = arch_scale_freq_capacity(cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
-
 	delta += sa->period_contrib;
 	periods = delta / 1024; /* A period is 1024us (~1ms) */
 
@@ -137,13 +132,12 @@  accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
 	}
 	sa->period_contrib = delta;
 
-	contrib = cap_scale(contrib, scale_freq);
 	if (load)
 		sa->load_sum += load * contrib;
 	if (runnable)
 		sa->runnable_load_sum += runnable * contrib;
 	if (running)
-		sa->util_sum += contrib * scale_cpu;
+		sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 
 	return periods;
 }
@@ -177,7 +171,7 @@  accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
 static __always_inline int
-___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
+___update_load_sum(u64 now, struct sched_avg *sa,
 		  unsigned long load, unsigned long runnable, int running)
 {
 	u64 delta;
@@ -221,7 +215,7 @@  ___update_load_sum(u64 now, int cpu, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, cpu, sa, load, runnable, running))
+	if (!accumulate_sum(delta, sa, load, runnable, running))
 		return 0;
 
 	return 1;
@@ -267,9 +261,9 @@  ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
  *   runnable_load_avg = \Sum se->avg.runable_load_avg
  */
 
-int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
+int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) {
+	if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
 		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
 		return 1;
 	}
@@ -277,9 +271,9 @@  int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se)
 	return 0;
 }
 
-int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se)
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq,
+	if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq,
 				cfs_rq->curr == se)) {
 
 		___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
@@ -290,9 +284,9 @@  int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e
 	return 0;
 }
 
-int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
+int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
 {
-	if (___update_load_sum(now, cpu, &cfs_rq->avg,
+	if (___update_load_sum(now, &cfs_rq->avg,
 				scale_load_down(cfs_rq->load.weight),
 				scale_load_down(cfs_rq->runnable_weight),
 				cfs_rq->curr != NULL)) {
@@ -317,7 +311,7 @@  int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
 
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
-	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
+	if (___update_load_sum(now, &rq->avg_rt,
 				running,
 				running,
 				running)) {
@@ -340,7 +334,7 @@  int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
-	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
+	if (___update_load_sum(now, &rq->avg_dl,
 				running,
 				running,
 				running)) {
@@ -365,22 +359,31 @@  int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 int update_irq_load_avg(struct rq *rq, u64 running)
 {
 	int ret = 0;
+
+	/*
+	 * We can't use clock_pelt because irq time is not accounted in
+	 * clock_task. Instead we directly scale the running time to
+	 * reflect the real amount of computation
+	 */
+	running = cap_scale(running, arch_scale_freq_capacity(cpu_of(rq)));
+	running = cap_scale(running, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
+
 	/*
 	 * We know the time that has been used by interrupt since last update
 	 * but we don't when. Let be pessimistic and assume that interrupt has
 	 * happened just before the update. This is not so far from reality
 	 * because interrupt will most probably wake up task and trig an update
-	 * of rq clock during which the metric si updated.
+	 * of rq clock during which the metric is updated.
 	 * We start to decay with normal context time and then we add the
 	 * interrupt context time.
 	 * We can safely remove running from rq->clock because
 	 * rq->clock += delta with delta >= running
 	 */
-	ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq,
+	ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
 				0,
 				0,
 				0);
-	ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq,
+	ret += ___update_load_sum(rq->clock, &rq->avg_irq,
 				1,
 				1,
 				1);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 7e56b48..ae269ba 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -1,8 +1,9 @@ 
 #ifdef CONFIG_SMP
+#include "sched-pelt.h"
 
-int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se);
-int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se);
-int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq);
+int __update_load_avg_blocked_se(u64 now, struct sched_entity *se);
+int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se);
+int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq);
 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running);
 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running);
 
@@ -42,6 +43,98 @@  static inline void cfs_se_util_change(struct sched_avg *avg)
 	WRITE_ONCE(avg->util_est.enqueued, enqueued);
 }
 
+/*
+ * The clock_pelt scales the time to reflect the effective amount of
+ * computation done during the running delta time but then sync back to
+ * clock_task when rq is idle.
+ *
+ *
+ * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
+ * @ max capacity  ------******---------------******---------------
+ * @ half capacity ------************---------************---------
+ * clock pelt      | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16
+ *
+ */
+static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
+{
+	if (unlikely(is_idle_task(rq->curr))) {
+		/* The rq is idle, we can sync to clock_task */
+		rq->clock_pelt  = rq_clock_task(rq);
+		return;
+	}
+
+	/*
+	 * When a rq runs at a lower compute capacity, it will need
+	 * more time to do the same amount of work than at max
+	 * capacity. In order to be invariant, we scale the delta to
+	 * reflect how much work has been really done.
+	 * Running longer results in stealing idle time that will
+	 * disturb the load signal compared to max capacity. This
+	 * stolen idle time will be automatically reflected when the
+	 * rq will be idle and the clock will be synced with
+	 * rq_clock_task.
+	 */
+
+	/*
+	 * Scale the elapsed time to reflect the real amount of
+	 * computation
+	 */
+	delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq)));
+	delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
+
+	rq->clock_pelt += delta;
+}
+
+/*
+ * When rq becomes idle, we have to check if it has lost idle time
+ * because it was fully busy. A rq is fully used when the /Sum util_sum
+ * is greater or equal to:
+ * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
+ * For optimization and rounding purposes, we don't take into account
+ * the position in the current window (period_contrib) and we use the higher
+ * bound of util_sum to decide.
+ */
+static inline void update_idle_rq_clock_pelt(struct rq *rq)
+{
+	u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
+	u32 util_sum = rq->cfs.avg.util_sum;
+	util_sum += rq->avg_rt.util_sum;
+	util_sum += rq->avg_dl.util_sum;
+
+	/*
+	 * Reflecting stolen time makes sense only if the idle
+	 * phase would be present at max capacity. As soon as the
+	 * utilization of a rq has reached the maximum value, it is
+	 * considered as an always running rq without idle time to
+	 * steal. This potential idle time is considered as lost in
+	 * this case. We keep track of this lost idle time compared to
+	 * rq's clock_task.
+	 */
+	if (util_sum >= divider)
+		rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+}
+
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq->clock_pelt - rq->lost_idle_time;
+}
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
+static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+{
+	if (unlikely(cfs_rq->throttle_count))
+		return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time;
+
+	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
+}
+#else
+static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
+{
+	return rq_clock_pelt(rq_of(cfs_rq));
+}
+#endif
+
 #else
 
 static inline int
@@ -67,6 +160,18 @@  update_irq_load_avg(struct rq *rq, u64 running)
 {
 	return 0;
 }
+
+static inline u64 rq_clock_pelt(struct rq *rq)
+{
+	return rq_clock_task(rq);
+}
+
+static inline void
+update_rq_clock_pelt(struct rq *rq, s64 delta) { }
+
+static inline void
+update_idle_rq_clock_pelt(struct rq *rq) { }
+
 #endif
 
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9aa3287..579c60e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1587,7 +1587,7 @@  pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * rt task
 	 */
 	if (rq->curr->sched_class != &rt_sched_class)
-		update_rt_rq_load_avg(rq_clock_task(rq), rq, 0);
+		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 
 	return p;
 }
@@ -1596,7 +1596,7 @@  static void put_prev_task_rt(struct rq *rq, struct task_struct *p)
 {
 	update_curr_rt(rq);
 
-	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
 	/*
 	 * The previous task needs to be made eligible for pushing
@@ -2327,7 +2327,7 @@  static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
 	struct sched_rt_entity *rt_se = &p->rt;
 
 	update_curr_rt(rq);
-	update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1);
 
 	watchdog(rq, p);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67e2d56..9a8ae27 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -840,7 +840,10 @@  struct rq {
 
 	unsigned int		clock_update_flags;
 	u64			clock;
-	u64			clock_task;
+	/* Ensure that all clocks are in the same cache line */
+	u64			clock_task ____cacheline_aligned;
+	u64			clock_pelt;
+	unsigned long		lost_idle_time;
 
 	atomic_t		nr_iowait;