| Message ID | 1542711308-25256-3-git-send-email-vincent.guittot@linaro.org |
|---|---|
| State | Superseded |
| Series | sched/fair: update scale invariance of PELT |
Hi,

On Tue, 20 Nov 2018 at 11:55, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> The current implementation of load tracking invariance scales the
> contribution with the current frequency and uarch performance (only for
> utilization) of the CPU. One main result of this formula is that the
> figures are capped by the current capacity of the CPU. Another one is that
> load_avg is not invariant because it is not scaled with uarch.
>
> The util_avg of a periodic task that runs r time slots every p time slots
> varies in the range:
>
>     U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
>
> with U the max util_avg value = SCHED_CAPACITY_SCALE.
>
> At a lower capacity, the range becomes:
>
>     U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p)
>
> with C reflecting the compute capacity ratio between the current capacity
> and the max capacity.
>
> So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
>
> Instead of scaling the contribution value of the PELT algorithm, we should
> scale the running time. The PELT signal aims to track the amount of
> computation of tasks and/or rq, so it seems more correct to scale the
> running time to reflect the effective amount of computation done since the
> last update.
>
> In order to be fully invariant, we need to apply the same amount of
> running time and idle time whatever the current capacity. Because running
> at lower capacity implies that the task will run longer, we have to ensure
> that the same amount of idle time will be applied when the system becomes
> idle and no idle time has been "stolen". But reaching the maximum
> utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
> always-running task whatever the capacity of the CPU (even at max compute
> capacity). In this case, we can discard this "stolen" idle time, which
> becomes meaningless.
>
> In order to achieve this time scaling, a new clock_pelt is created per rq.
> The increase of this clock scales with the current capacity when something
> is running on the rq and synchronizes with clock_task when the rq is idle.
> With this mechanism, we ensure the same running and idle time whatever the
> current capacity. This also enables us to simplify the PELT algorithm by
> removing all references to uarch and frequency and applying the same
> contribution to utilization and loads. Furthermore, the scaling is done
> only once per clock update (update_rq_clock_task()) instead of during each
> update of the sched_entities and cfs/rt/dl_rq of the rq as in the current
> implementation. This is interesting when cgroups are involved, as shown in
> the results below:
>
> On a hikey (octo Arm64 platform).
> Performance cpufreq governor and only the shallowest c-state, to remove the
> variance generated by those power features, so we only track the impact of
> the PELT algorithm.
>
> Each test runs 16 times.
>
> ./perf bench sched pipe
> (higher is better)
> kernel    tip/sched/core      + patch
>           ops/seconds         ops/seconds         diff
> cgroup
> root      59652(+/- 0.18%)    59876(+/- 0.24%)    +0.38%
> level1    55608(+/- 0.27%)    55923(+/- 0.24%)    +0.57%
> level2    52115(+/- 0.29%)    52564(+/- 0.22%)    +0.86%
>
> hackbench -l 1000
> (lower is better)
> kernel    tip/sched/core      + patch
>           duration(sec)       duration(sec)       diff
> cgroup
> root      4.453(+/- 2.37%)    4.383(+/- 2.88%)    -1.57%
> level1    4.859(+/- 8.50%)    4.830(+/- 7.07%)    -0.60%
> level2    5.063(+/- 9.83%)    4.928(+/- 9.66%)    -2.66%
>
> Also, the responsiveness of PELT is improved with this new algorithm when
> the CPU is not running at max capacity. I have put below some examples of
> the duration needed to reach some typical load values, according to the
> capacity of the CPU, with the current implementation and with this patch.
> These values have been computed based on the geometric series and the
> half-period value:
>
> Util (%)     max capacity    half capacity(mainline)    half capacity(w/ patch)
> 972 (95%)    138ms           not reachable              276ms
> 486 (47.5%)  30ms            138ms                      60ms
> 256 (25%)    13ms            32ms                       26ms
>
> On my hikey (octo Arm64 platform) with the schedutil governor, the time to
> reach the max OPP when starting from a null utilization decreases from
> 223ms with the current scale invariance down to 121ms with the new
> algorithm.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Is there anything else that I should do for these patches ?

Regards,
Vincent
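As an aside, the clock_pelt idea described in the quoted cover letter can be
summarized with a small kernel-style sketch. This is only an illustration of
the prose above, not code from the series: the struct and function names are
invented here, and the real update also has to account for "stolen" idle time
and uses the arch scale factors of the rq's CPU.

/*
 * Simplified sketch of the time-scaling idea: while the rq is busy, the
 * PELT clock advances slower than clock_task in proportion to the current
 * frequency and uarch capacity; when the rq goes idle it is re-synced with
 * clock_task, so the remaining time shows up as idle time.
 */
struct rq_pelt_sketch {
	u64 clock_task;		/* task clock, in ns */
	u64 clock_pelt;		/* scaled clock used by PELT, in ns */
};

static void update_clock_pelt_sketch(struct rq_pelt_sketch *rq, u64 delta,
				     unsigned long cpu_capacity,
				     unsigned long freq_capacity,
				     bool rq_is_idle)
{
	rq->clock_task += delta;

	if (rq_is_idle) {
		/* Nothing is running: PELT time and task time move together. */
		rq->clock_pelt = rq->clock_task;
		return;
	}

	/*
	 * Scale the elapsed running time by frequency and uarch capacity so
	 * that the same amount of "PELT time" corresponds to the same amount
	 * of work whatever the current operating point.
	 */
	delta = delta * freq_capacity >> SCHED_CAPACITY_SHIFT;
	delta = delta * cpu_capacity >> SCHED_CAPACITY_SHIFT;
	rq->clock_pelt += delta;
}

With this scheme the PELT accrual itself no longer needs any per-update
frequency or uarch scaling, which is what allows the simplification mentioned
in the cover letter.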
On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
> Is there anything else that I should do for these patches ?
IIRC, Morten mention they break util_est; Patrick was going to explain.
But yes, I like them.
On 28-Nov 11:02, Peter Zijlstra wrote:
> On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote:
>
> > Is there anything else that I should do for these patches ?
>
> IIRC, Morten mention they break util_est; Patrick was going to explain.

I guess the problem is that, once we cross the current capacity,
strictly speaking util_avg does not represent anymore a utilization.

With the new signal this could happen and we end up storing estimated
utilization samples which will overestimate the task requirements.

We will have a spike in estimated utilization at next wakeup, since we
use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the
EWMA in case we collect multiple samples above the current capacity.

So, a possible fix could be to avoid storing util_est samples if we
end up with a utilization above the current capacity.

Something like:

----8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac855b2f4774..93e0cf5d8a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	if (!task_sleep)
 		return;
 
+	/* Skip samples which do not represent an actual utilization */
+	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
+		return;
+
 	/*
 	 * If the PELT values haven't changed since enqueue time,
 	 * skip the util_est update.
---8<---

Could that work ?

Maybe using a new utility function to wrap the new check.

--
#include <best/regards.h>

Patrick Bellasi
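Patrick suggests wrapping the new check in a utility function; one possible
shape for such a helper is sketched below. The helper name is hypothetical
(invented here); only task_util(), task_cpu() and capacity_of() from the
fair.c code in the diff above are assumed to exist.

/*
 * Hypothetical wrapper for the check added in the diff above.
 */
static inline bool task_util_fits_capacity(struct task_struct *p)
{
	/*
	 * A util_avg above the capacity of the CPU the task ran on is not a
	 * faithful utilization sample: the task may simply have been
	 * capacity-constrained.
	 */
	return task_util(p) <= capacity_of(task_cpu(p));
}

util_est_dequeue() would then bail out with
"if (unlikely(!task_util_fits_capacity(p))) return;" instead of open-coding
the comparison.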
On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 28-Nov 11:02, Peter Zijlstra wrote:
> > IIRC, Morten mention they break util_est; Patrick was going to explain.
>
> I guess the problem is that, once we cross the current capacity,
> strictly speaking util_avg does not represent anymore a utilization.
>
> With the new signal this could happen and we end up storing estimated
> utilization samples which will overestimate the task requirements.
>
> We will have a spike in estimated utilization at next wakeup, since we
> use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the
> EWMA in case we collect multiple samples above the current capacity.

TBH I don't see how it's different from current implementation with a
task that was scheduled on big core and now wakes up on little core.
The util_est is overestimated as well.

But I'm fine with adding your proposal on to on the patchset

[...]
On Wed, 28 Nov 2018 at 14:33, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
[...]
>
> But I'm fine with adding your proposal on to on the patchset

s/on to on/on top of/

[...]
On 28-Nov 14:33, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
[...]
> > We will have a spike in estimated utilization at next wakeup, since we
> > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the
> > EWMA in case we collect multiple samples above the current capacity.
>
> TBH I don't see how it's different from current implementation with a
> task that was scheduled on big core and now wakes up on little core.
> The util_est is overestimated as well.

While running below the capacity of a CPU, either big or LITTLE, we
can still measure the actual used bandwidth as long as we have idle
time. If the task is then moved into a lower capacity core, I think
it's still safe to assume that, likely, it would need more capacity.

Why do you say it's the same ?

With your new signal instead, once we cross the current capacity,
utilization is just not anymore utilization. Thus, IMHO it make sense
avoid to accumulate a sample for what we call "estimated utilization".

I would also say that, with the current implementation which caps
utilization to the current capacity, we get better estimation in
general. At least we can say with absolute precision:

   "the task needs _at least_ that amount of capacity".

Potentially we can also flag the task as being under-provisioned, in
case there was not idle time, and _let a policy_ decide what to do
with it and the granted information we have.

While, with your new signal, once we are over the current capacity,
the "utilization" is just a sort of "random" number at best useful to
drive some conclusions about how long the task has been delayed.

IOW, I fear that we are embedding a policy within a signal which is
currently representing something very well defined: how much cpu
bandwidth a task used. While, latency/under-provisioning policies
perhaps should be better placed somewhere else.

Perhaps I've missed it in some of the previous discussions:
have we have considered/discussed this signal-vs-policy aspect ?

--
#include <best/regards.h>

Patrick Bellasi
On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 28-Nov 14:33, Vincent Guittot wrote:
[...]
> > TBH I don't see how it's different from current implementation with a
> > task that was scheduled on big core and now wakes up on little core.
> > The util_est is overestimated as well.
>
> While running below the capacity of a CPU, either big or LITTLE, we
> can still measure the actual used bandwidth as long as we have idle
> time. If the task is then moved into a lower capacity core, I think
> it's still safe to assume that, likely, it would need more capacity.
>
> Why do you say it's the same ?

In the example of a task that runs 39ms in period of 80ms that we used
during previous version,
the utilization on the big core will reach 709 so will util_est too
When the task migrates on little core (512), util_est is higher than
current cpu capacity

[...]
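For reference, the 709 figure above follows from the converged upper bound
given in the cover letter, U * (1-y^r)/(1-y^p), with r = 39ms and p = 80ms at
full capacity. A quick standalone check (not kernel code; the 1ms PELT period
and y^32 = 0.5 are the usual PELT constants):

/* Standalone check of the 709 figure for a 39ms-every-80ms task. */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT decay: y^32 = 0.5 */
	double r = 39.0, p = 80.0;		/* 39ms running every 80ms */
	double util = 1024.0 * (1.0 - pow(y, r)) / (1.0 - pow(y, p));

	printf("converged util_avg peak: %.0f\n", util);	/* ~709 */
	return 0;
}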
On 28-Nov 15:55, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
[...]
> > While running below the capacity of a CPU, either big or LITTLE, we
> > can still measure the actual used bandwidth as long as we have idle
> > time. If the task is then moved into a lower capacity core, I think
> > it's still safe to assume that, likely, it would need more capacity.
> >
> > Why do you say it's the same ?
>
> In the example of a task that runs 39ms in period of 80ms that we used
> during previous version,
> the utilization on the big core will reach 709 so will util_est too
> When the task migrates on little core (512), util_est is higher than
> current cpu capacity

Right, and what's the problem ?

1) We know that PELT is calibrated to 32ms period task and in your
   example, since the runtime is higher then the half-life, it's
   correct to estimate a utilization higher then 50%.

   PELT utilization is defined _based on the half-life_: thus
   your task having a 50% duty cycle does not mean we are not correct
   if report a utilization != 50%.
   It would be as broken as reporting 10% utilization for a task
   running 100ms every 1s.

2) If it was a 70% task on a previous activation, once it's moved into
   a lower capacity CPU it's still correct to assume that it's likely
   going to require the same bandwidth and thus will be
   under-provisioned.

I still don't see where we are wrong in this case :/

To me it looks different then the problem I described.

> > With your new signal instead, once we cross the current capacity,
> > utilization is just not anymore utilization. Thus, IMHO it make sense
> > avoid to accumulate a sample for what we call "estimated utilization".
> >
> > I would also say that, with the current implementation which caps
> > utilization to the current capacity, we get better estimation in
> > general. At least we can say with absolute precision:
> >
> >    "the task needs _at least_ that amount of capacity".
[...]
> > IOW, I fear that we are embedding a policy within a signal which is
> > currently representing something very well defined: how much cpu
> > bandwidth a task used. While, latency/under-provisioning policies
> > perhaps should be better placed somewhere else.
> >
> > Perhaps I've missed it in some of the previous discussions:
> > have we have considered/discussed this signal-vs-policy aspect ?

What's your opinion on the above instead ?

--
#include <best/regards.h>

Patrick Bellasi
On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 28-Nov 15:55, Vincent Guittot wrote:
[...]
> > In the example of a task that runs 39ms in period of 80ms that we used
> > during previous version,
> > the utilization on the big core will reach 709 so will util_est too
> > When the task migrates on little core (512), util_est is higher than
> > current cpu capacity
>
> Right, and what's the problem ?

you worry about an util_est being higher than capacity which is the case there

> 1) We know that PELT is calibrated to 32ms period task and in your
>    example, since the runtime is higher then the half-life, it's
>    correct to estimate a utilization higher then 50%.
>
>    PELT utilization is defined _based on the half-life_: thus
>    your task having a 50% duty cycle does not mean we are not correct
>    if report a utilization != 50%.
>    It would be as broken as reporting 10% utilization for a task
>    running 100ms every 1s.
>
> 2) If it was a 70% task on a previous activation, once it's moved into
>    a lower capacity CPU it's still correct to assume that it's likely
>    going to require the same bandwidth and thus will be
>    under-provisioned.
>
> I still don't see where we are wrong in this case :/
>
> To me it looks different then the problem I described.
>
> > > With your new signal instead, once we cross the current capacity,
> > > utilization is just not anymore utilization. Thus, IMHO it make sense
> > > avoid to accumulate a sample for what we call "estimated utilization".

This is not true. With the example above, the util_est will be exactly the same
on big and little cores with the new signal

> > > I would also say that, with the current implementation which caps
> > > utilization to the current capacity, we get better estimation in
> > > general. At least we can say with absolute precision:
> > >
> > >    "the task needs _at least_ that amount of capacity".
[...]
> > > While, with your new signal, once we are over the current capacity,
> > > the "utilization" is just a sort of "random" number at best useful to
> > > drive some conclusions about how long the task has been delayed.

see my comment above

> > > IOW, I fear that we are embedding a policy within a signal which is
> > > currently representing something very well defined: how much cpu
> > > bandwidth a task used. While, latency/under-provisioning policies
> > > perhaps should be better placed somewhere else.
> > >
> > > Perhaps I've missed it in some of the previous discussions:
> > > have we have considered/discussed this signal-vs-policy aspect ?
>
> What's your opinion on the above instead ?

It's not a policy but it gives better knowledge about the amount a work done
I have put below discussion on the subject on previous version

> > With contribution scaling the PELT utilization of a task is a _minimum_
> > utilization. Regardless of where the task is currently/was running (and
> > provided that it doesn't change behaviour) its PELT utilization will
> > approximate its _minimum_ utilization on an idle 1024 capacity CPU.
>
> The main drawback is that the _minimum_ utilization depends on the CPU
> capacity on which the task runs. The two 25% tasks on a 256 capacity
> CPU will have an utilization of 128 as an example
>
> > With time scaling the PELT utilization doesn't really have a meaning on
> > its own. It has to be compared to the capacity of the CPU where it
> > is/was running to know what the its current PELT utilization means. When
>
> I would have said the opposite. The utilization of the task will
> always reflect the same amount of work that has been already done
> whatever the CPU capacity.
> In fact, the new scaling mechanism uses the real amount of work that
> has been already done to compute the utilization signal which is not
> the case currently. This gives more information about the real amount
> of worked that has been computed in the over utilization case.
>
> > the utilization over-shoots the capacity its value is no longer
> > represents utilization, it just means that it has a higher compute
> > demand than is offered on its current CPU and a high value means that it
> > has been suffering longer. It can't be used to predict the actual
> > utilization on an idle 1024 capacity any better than contribution scaled
> > PELT utilization.
>
> I think that it provides earlier detection of over utilization and
> more accurate signal for a longer time duration which can help the
> load balance
> Coming back to 50% task example. I will use a 50ms running time
> during a 100ms period for the example below to make it easier
>
> Starting from 0, the evolution of the utilization is:
>
> With contribution scaling:
>   time      0ms   50ms   100ms   150ms   200ms
>   capacity
>   1024      0     666
>   512       0     333    453
> When the CPU start to be over utilized (@100ms), the utilization is
> already too low (453 instead of 666) and scheduler doesn't detect yet
> that we are over utilized
>   256       0     169    226     246     252
> That's even worse with this lower capacity
>
> With time scaling,
>   time      0ms   50ms   100ms   150ms   200ms
>   capacity
>   1024      0     666
>   512       0     428    677
> We know that the current capacity is not enough and the utilization
> reflect the correct utilization level compare to 1024 capacity (the
> 666 vs 677 difference comes from the 1024us window so the last window
> is not full in the case of max capacity)
>   256       0     234    468     564     677
> At 100ms, we know that there is not enough capacity. (In fact we know
> that at 56ms). And even at time 200ms, the amount of work is exactly
> what would have been executed on a CPU 4x faster
>
> > This change might not be a showstopper, but it is something to be aware
> > off and take into account wherever PELT utilization is used.
>
> The point above is clearly a big difference between the 2 approaches
> of the no spare cycle case but I think it will help by giving more
> information in the over utilization case
>
> Vincent
>
> > Morten
On 28-Nov 16:42, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
[...]
> > > In the example of a task that runs 39ms in period of 80ms that we used
> > > during previous version,
> > > the utilization on the big core will reach 709 so will util_est too
> > > When the task migrates on little core (512), util_est is higher than
> > > current cpu capacity
> >
> > Right, and what's the problem ?
>
> you worry about an util_est being higher than capacity which is the case there

I worry about util_est being higher then the capacity the task WAS
running... not the capacity the task IS running... if that value does
not correspond to what the task really need... (more on that at the
end).

[...]
> > > > With your new signal instead, once we cross the current capacity,
> > > > utilization is just not anymore utilization. Thus, IMHO it make sense
> > > > avoid to accumulate a sample for what we call "estimated utilization".
>
> This is not true. With the example above, the util_est will be exactly the same
> on big and little cores with the new signal

... AFAIU only if we have idle time...

[...]
> > What's your opinion on the above instead ?
>
> It's not a policy but it gives better knowledge about the amount a work done
> I have put below discussion on the subject on previous version

Thanks, I think I've skimmed through it, but it's sill useful...

[...]
> > I think that it provides earlier detection of over utilization and
> > more accurate signal for a longer time duration which can help the
> > load balance
[...]
> > The point above is clearly a big difference between the 2 approaches
> > of the no spare cycle case but I think it will help by giving more
> > information in the over utilization case

I like the idea that we ramp up faster and always get to the same
value. I like also the idea that we always reach the same value on
both LITTLE and big.

As long as there is idle time this is working fine, in these cases we
should probably also collect util_est samples.

But what happens when we don't have idle time ?

Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.

Are not these two tasks being reported as 50% tasks (after a while) ?

If that's the case, these are samples we should not store...

--
#include <best/regards.h>

Patrick Bellasi
On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> On 28-Nov 16:42, Vincent Guittot wrote:
[...]
> I like the idea that we ramp up faster and always get to the same
> value. I like also the idea that we always reach the same value on
> both LITTLE and big.
>
> As long as there is idle time this is working fine, in these cases we
> should probably also collect util_est samples.
>
> But what happens when we don't have idle time ?

As shown above, the utilization stays correct for a longer time frame
even after the over utilization point and provides better over
utilization detection

> Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.
>
> Are not these two tasks being reported as 50% tasks (after a while) ?

Yes they will but similarly to above they will stay correct for longer
time even when they become higher than current cpu capacity

> If that's the case, these are samples we should not store...
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ac855b2f4774..93e0cf5d8a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
>  	if (!task_sleep)
>  		return;
>  
> +	/* Skip samples which do not represent an actual utilization */
> +	if (unlikely(task_util(p) > capacity_of(task_cpu(p))))
> +		return;
> +
>  	/*
>  	 * If the PELT values haven't changed since enqueue time,
>  	 * skip the util_est update.

Would you not want something like:

	min(task_util(p), capacity_of(task_cpu(p)))

And is this the only place where we need this?

OTOH, if the task is always running, it will be always running
irrespective of where it runs.

Not storing these samples seems weird though; this is the exact
condition you want to record -- the task is very active, if we skip
these, we'll come back at a low frequency on the next wakeup.
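Peter's alternative is to clamp the sample rather than skip it. A hedged
sketch of what that could look like is below; the helper name and its
placement are assumptions, while task_util(), task_cpu(), capacity_of() and
min_t() are existing kernel helpers already used in this thread.

/*
 * Illustrative sketch of the "clamp instead of skip" alternative.
 */
static unsigned long task_util_est_sample(struct task_struct *p)
{
	/*
	 * Above the CPU's capacity the signal no longer tracks how much the
	 * task actually ran, so cap the recorded sample at that capacity.
	 */
	return min_t(unsigned long, task_util(p),
		     capacity_of(task_cpu(p)));
}

util_est_dequeue() would then record this clamped value instead of the raw
task_util(p); whether clamping or skipping is preferable is exactly what the
follow-ups below discuss.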
On 29-Nov 11:43, Vincent Guittot wrote:
> On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
[...]
> > As long as there is idle time this is working fine, in these cases we
> > should probably also collect util_est samples.
> >
> > But what happens when we don't have idle time ?
>
> As shown above, the utilization stays correct for a longer time frame
> even after the over utilization point and provides better over
> utilization detection
>
> > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity.
> >
> > Are not these two tasks being reported as 50% tasks (after a while) ?
>
> Yes they will but similarly to above they will stay correct for longer
> time even when they become higher than current cpu capacity

Seems we agree that, when there is no idle time:
- the two 15% tasks will be overestimated
- their utilization will reach 50% after a while

If I'm not wrong, we will have:
- 30% CPU util in  ~16ms @1024 capacity
                   ~64ms @256  capacity

Thus, the tasks will be certainly over-estimated after ~64ms.
Is that correct ?

> > If that's the case, these are samples we should not store...

Now, we can argue that 64ms is a pretty long time and thus it's quite
unlucky we will have no idle for such a long time.

Still, I'm wondering if we should keep collecting those samples or
better find a way to detect that and skip the sampling.

--
#include <best/regards.h>

Patrick Bellasi
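Patrick's ~16ms/~64ms estimate can be sanity-checked from the PELT geometric
series. A standalone back-of-the-envelope computation (not kernel code,
assuming the 1ms PELT period and y^32 = 0.5):

/*
 * With no idle time, how long until the time-scaled PELT signal of a fully
 * loaded rq reaches 30% of SCHED_CAPACITY_SCALE?
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* PELT decay per 1ms period */
	double target = 0.30;			/* two 15% tasks */
	/* PELT periods needed for 1 - y^t to cross the target */
	double periods = log(1.0 - target) / log(y);

	/* At capacity c the scaled clock advances c/1024 as fast as walltime */
	printf("~%.0fms @1024 capacity\n", periods);
	printf("~%.0fms @256  capacity\n", periods * 1024.0 / 256.0);
	return 0;
}

This prints roughly 16ms and 66ms, in line with the ~16ms/~64ms figures above.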
On 29-Nov 13:53, Peter Zijlstra wrote: > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote: > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index ac855b2f4774..93e0cf5d8a76 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) > > if (!task_sleep) > > return; > > > > + /* Skip samples which do not represent an actual utilization */ > > + if (unlikely(task_util(p) > capacity_of(task_cpu(p)))) > > + return; > > + > > /* > > * If the PELT values haven't changed since enqueue time, > > * skip the util_est update. > > Would you not want something like: > > min(task_util(p), capacity_of(task_cpu(p))) > > And is this the only place where we need this? Mmm... even this could be an over-estimation: I've just posted an example in my last reply to Vincent, end of: Message-ID: <20181129150020.GF23094@e110439-lin> https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/ > OTOH, if the task is always running, it will be always running > irrespective of where it runs. That's not what I'm concerned about. I'm concerned about small tasks which are running on limited capacity (e.g. due to thermal capping) without idle time. In this case, the new "utilization" signal could overestimate the real task needs. > Not storing these samples seems weird though; this is the exact > condition you want to record -- the task is very active, if we skip > these, we'll come back at a low frequency on the next wakeup. When there is not idle time, we don't know if the reported utilization, above the cpu capacity, is due to the task being bigger... or just the new utilization signal converging towards: 100% / RUNNABLE_TASKS_COUNT -- #include <best/regards.h> Patrick Bellasi
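For reference, that convergence value follows directly from the time-scaled signal: with N always-runnable tasks of equal weight sharing a CPU that has no idle time, each task accumulates roughly 1/N of the scaled running time, so each util_avg converges towards

  SCHED_CAPACITY_SCALE / N = 1024 / N

e.g. 512 (50%) each for two co-scheduled tasks, regardless of what they would actually consume on an idle big CPU.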
On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 29-Nov 11:43, Vincent Guittot wrote: > > On Wed, 28 Nov 2018 at 17:35, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > On 28-Nov 16:42, Vincent Guittot wrote: > > > > On Wed, 28 Nov 2018 at 16:21, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > > On 28-Nov 15:55, Vincent Guittot wrote: > > > > > > On Wed, 28 Nov 2018 at 15:40, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > > > > On 28-Nov 14:33, Vincent Guittot wrote: > > > > > > > > On Wed, 28 Nov 2018 at 12:53, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > > > > > > > On 28-Nov 11:02, Peter Zijlstra wrote: > > > > > > > > > > On Wed, Nov 28, 2018 at 10:54:13AM +0100, Vincent Guittot wrote: > > > > > > > > > > > > > > > > > > > > > Is there anything else that I should do for these patches ? > > > > > > > > > > > > > > > > > > > > IIRC, Morten mention they break util_est; Patrick was going to explain. > > > > > > > > > > > > > > > > > > I guess the problem is that, once we cross the current capacity, > > > > > > > > > strictly speaking util_avg does not represent anymore a utilization. > > > > > > > > > > > > > > > > > > With the new signal this could happen and we end up storing estimated > > > > > > > > > utilization samples which will overestimate the task requirements. > > > > > > > > > > > > > > > > > > We will have a spike in estimated utilization at next wakeup, since we > > > > > > > > > use MAX(util_avg@dequeue_time, ewma). Potentially we also inflate the EWMA in > > > > > > > > > case we collect multiple samples above the current capacity. > > > > > > > > > > > > > > > > TBH I don't see how it's different from current implementation with a > > > > > > > > task that was scheduled on big core and now wakes up on little core. > > > > > > > > The util_est is overestimated as well. > > > > > > > > > > > > > > While running below the capacity of a CPU, either big or LITTLE, we > > > > > > > can still measure the actual used bandwidth as long as we have idle > > > > > > > time. If the task is then moved into a lower capacity core, I think > > > > > > > it's still safe to assume that, likely, it would need more capacity. > > > > > > > > > > > > > > Why do you say it's the same ? > > > > > > > > > > > > In the example of a task that runs 39ms in period of 80ms that we used > > > > > > during previous version, > > > > > > the utilization on the big core will reach 709 so will util_est too > > > > > > When the task migrates on little core (512), util_est is higher than > > > > > > current cpu capacity > > > > > > > > > > Right, and what's the problem ? > > > > > > > > you worry about an util_est being higher than capacity which is the case there > > > > > > I worry about util_est being higher then the capacity the task WAS > > > running... not the capacity the task IS running... if that value does > > > not correspond to what the task really need... (more on that at the > > > end). > > > > > > > > 1) We know that PELT is calibrated to 32ms period task and in your > > > > > example, since the runtime is higher then the half-life, it's > > > > > correct to estimate a utilization higher then 50%. > > > > > > > > > > PELT utilization is defined _based on the half-life_: thus > > > > > your task having a 50% duty cycle does not mean we are not correct > > > > > if report a utilization != 50%. > > > > > It would be as broken as reporting 10% utilization for a task > > > > > running 100ms every 1s. 
> > > > > > > > > > 2) If it was a 70% task on a previous activation, once it's moved into > > > > > a lower capacity CPU it's still correct to assume that it's likely > > > > > going to require the same bandwidth and thus will be > > > > > under-provisioned. > > > > > > > > > > I still don't see where we are wrong in this case :/ > > > > > > > > > > To me it looks different then the problem I described. > > > > > > > > > > > > With your new signal instead, once we cross the current capacity, > > > > > > > utilization is just not anymore utilization. Thus, IMHO it make sense > > > > > > > avoid to accumulate a sample for what we call "estimated utilization". > > > > > > > > This is not true. With the example above, the util_est will be exactly the same > > > > on big and little cores with the new signal > > > > > > ... AFAIU only if we have idle time... > > > > > > > > > > I would also say that, with the current implementation which caps > > > > > > > utilization to the current capacity, we get better estimation in > > > > > > > general. At least we can say with absolute precision: > > > > > > > > > > > > > > "the task needs _at least_ that amount of capacity". > > > > > > > > > > > > > > Potentially we can also flag the task as being under-provisioned, in > > > > > > > case there was not idle time, and _let a policy_ decide what to do > > > > > > > with it and the granted information we have. > > > > > > > > > > > > > > While, with your new signal, once we are over the current capacity, > > > > > > > the "utilization" is just a sort of "random" number at best useful to > > > > > > > drive some conclusions about how long the task has been delayed. > > > > > > > > see my comment above > > > > > > > > > > > > > > > > > > IOW, I fear that we are embedding a policy within a signal which is > > > > > > > currently representing something very well defined: how much cpu > > > > > > > bandwidth a task used. While, latency/under-provisioning policies > > > > > > > perhaps should be better placed somewhere else. > > > > > > > > > > > > > > Perhaps I've missed it in some of the previous discussions: > > > > > > > have we have considered/discussed this signal-vs-policy aspect ? > > > > > > > > > > What's your opinion on the above instead ? > > > > > > > > It's not a policy but it gives better knowledge about the amount a work done > > > > I have put below discussion on the subject on previous version > > > > > > Thanks, I think I've skimmed through it, but it's sill useful... > > > > > > > > > With contribution scaling the PELT utilization of a task is a _minimum_ > > > > > > utilization. Regardless of where the task is currently/was running (and > > > > > > provided that it doesn't change behaviour) its PELT utilization will > > > > > > approximate its _minimum_ utilization on an idle 1024 capacity CPU. > > > > > > > > > > The main drawback is that the _minimum_ utilization depends on the CPU > > > > > capacity on which the task runs. The two 25% tasks on a 256 capacity > > > > > CPU will have an utilization of 128 as an example > > > > > > > > > > > > > > > > > With time scaling the PELT utilization doesn't really have a meaning on > > > > > > its own. It has to be compared to the capacity of the CPU where it > > > > > > is/was running to know what the its current PELT utilization means. When > > > > > > > > > > I would have said the opposite. The utilization of the task will > > > > > always reflect the same amount of work that has been already done > > > > > whatever the CPU capacity. 
> > > > > In fact, the new scaling mechanism uses the real amount of work that > > > > > has been already done to compute the utilization signal which is not > > > > > the case currently. This gives more information about the real amount > > > > > of worked that has been computed in the over utilization case. > > > > > > > > > > > the utilization over-shoots the capacity its value is no longer > > > > > > represents utilization, it just means that it has a higher compute > > > > > > demand than is offered on its current CPU and a high value means that it > > > > > > has been suffering longer. It can't be used to predict the actual > > > > > > utilization on an idle 1024 capacity any better than contribution scaled > > > > > > PELT utilization. > > > > > > > > > > I think that it provides earlier detection of over utilization and > > > > > more accurate signal for a longer time duration which can help the > > > > > load balance > > > > > Coming back to 50% task example . I will use a 50ms running time > > > > > during a 100ms period for the example below to make it easier > > > > > > > > > > Starting from 0, the evolution of the utilization is: > > > > > > > > > > With contribution scaling: > > > > > time 0ms 50ms 100ms 150ms 200ms > > > > > capacity > > > > > 1024 0 666 > > > > > 512 0 333 453 > > > > > When the CPU start to be over utilized (@100ms), the utilization is > > > > > already too low (453 instead of 666) and scheduler doesn't detect yet > > > > > that we are over utilized > > > > > 256 0 169 226 246 252 > > > > > That's even worse with this lower capacity > > > > > > > > > > > > > > With time scaling, > > > > > time 0ms 50ms 100ms 150ms 200ms > > > > > capacity > > > > > 1024 0 666 > > > > > 512 0 428 677 > > > > > 256 0 234 468 564 677 > > [...] > > > > I like the idea that we ramp up faster and always get to the same > > > value. I like also the idea that we always reach the same value on > > > both LITTLE and big. > > > > > > As long as there is idle time this is working fine, in these cases we > > > should probably also collect util_est samples. > > > > > > But what happens when we don't have idle time ? > > > > As shown above, the utilization stays correct for a longer time frame > > even after the over utilization point and provides better over > > utilization detection > > > > > > > > Let say we have 2 15% tasks, co-scheduled on a cpu with <300 capacity. > > > > > > Are not these two tasks being reported as 50% tasks (after a while) ? > > > > Yes they will but similarly to above they will stay correct for longer > > time even when they become higher than current cpu capacity > > Seems we agree that, when there is no idle time: > - the two 15% tasks will be overestimated > - their utilization will reach 50% after a while > > If I'm not wrong, we will have: > - 30% CPU util in ~16ms @1024 capacity > ~64ms @256 capacity > > Thus, the tasks will be certainly over-estimated after ~64ms. > Is that correct ? From a pure util_avg pov it's correct But i'd like to weight that a bit with the example below > > > > If that's the case, these are samples we should not store... > > Now, we can argue that 64ms is a pretty long time and thus it's quite > unlucky we will have no idle for such a long time. > > Still, I'm wondering if we should keep collecting those samples or > better find a way to detect that and skip the sampling. 
The problem is that you can have util_avg above capacity even with idle time.

In the 1st example of this thread, the 39ms/80ms task will reach 709, which is the value saved by util_est on a big core. But on a core with half capacity, there is still idle time, so 709 is a correct value although it is above 512.

In fact, the max will always be above the linear ratio because it's based on a geometric series.

And this is true even for a 15.6ms/32ms task (same ratio as above), although the impact is smaller (the max value, which should be saved by util_est, becomes 587 in this case).

>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
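For reference, assuming the usual 32ms PELT half-life, the 709 and 587 figures follow from the peak utilization of a periodic task that runs r ms every p ms (the same expression used in the patch description), which is always above the linear duty cycle:

  util_max = 1024 * (1 - y^r) / (1 - y^p),   with y^32 = 0.5

  r/p = 39/80  :  1024 * (1 - 0.5^(39/32)) / (1 - 0.5^(80/32))  ~= 709
  r/p = 15.6/32:  1024 * (1 - 0.5^(15.6/32)) / (1 - 0.5^(32/32)) ~= 587

whereas the linear ratio would only give 1024 * 39/80 ~= 499.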
On 29-Nov 17:19, Vincent Guittot wrote: > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 29-Nov 11:43, Vincent Guittot wrote: [...] > > Seems we agree that, when there is no idle time: > > - the two 15% tasks will be overestimated > > - their utilization will reach 50% after a while > > > > If I'm not wrong, we will have: > > - 30% CPU util in ~16ms @1024 capacity > > ~64ms @256 capacity > > > > Thus, the tasks will be certainly over-estimated after ~64ms. > > Is that correct ? > > From a pure util_avg pov it's correct > But i'd like to weight that a bit with the example below > > > Now, we can argue that 64ms is a pretty long time and thus it's quite > > unlucky we will have no idle for such a long time. > > > > Still, I'm wondering if we should keep collecting those samples or > > better find a way to detect that and skip the sampling. > > The problem is that you can have util_avg above capacity even with idle time > In the 1st example of this thread, the 39ms/80ms task will reach 709 > which is the value saved by util_est on a big core > But on core with half capacity, there is still idle time so 709 is a > correct value although above 512 Right, I see your point and (in principle) I like the idea of collecting samples for tasks which happen to run at a lower capacity then required and the utilization value makes sense... > In fact, max will be always above the linear ratio because it's based > on geometric series > > And this is true even with 15.6ms/32ms (same ratio as above) task > although the impact is smaller (max value, which should be saved by > util est, becomes 587 in this case). However that's not always the case... as per my example above. Moreover, we should also consider that util_est is mainly meant to be a lower-bound for tasks utilization. That's why task_util_est() already returns the actual util_avg when it's higher than the estimated utilization. With your new signal and without any special check on samples collection, if a task is limited because of thermal capping for example, we could end up overestimating its utilization and thus perhaps generating an unwanted frequency spike when the capping is relaxed... and (even worst) it will take some more activations for the estimated utilization to converge back to the actual utilization. Since we cannot easily know if there is idle time in a CPU when a task completes an activation with a utilization higher then the CPU capacity, I would better prefer to just skip the sampling with something like: ---8<--- diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9332863d122a..485053026533 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) { long last_ewma_diff; struct util_est ue; + int cpu; if (!sched_feat(UTIL_EST)) return; @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100))) return; + /* + * To avoid overestimation of actual task utilization, skip updates if + * we cannot grant there is idle time in this CPU. + */ + cpu = cpu_of(rq_of(cfs_rq)); + if (task_util(p) > cpu_capacity(cpu)) + return; + /* * Update Task's estimated utilization * ---8<--- At least this will ensure that util_est always provides an actual measured lower bound for a task utilization. If you think this makes sense, feel free to add such a patch on top of your series. 
Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi
On Thu, 10 Jan 2019 at 16:30, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > On 29-Nov 17:19, Vincent Guittot wrote: > > On Thu, 29 Nov 2018 at 16:00, Patrick Bellasi <patrick.bellasi@arm.com> wrote: > > > On 29-Nov 11:43, Vincent Guittot wrote: > > [...] > > > > Seems we agree that, when there is no idle time: > > > - the two 15% tasks will be overestimated > > > - their utilization will reach 50% after a while > > > > > > If I'm not wrong, we will have: > > > - 30% CPU util in ~16ms @1024 capacity > > > ~64ms @256 capacity > > > > > > Thus, the tasks will be certainly over-estimated after ~64ms. > > > Is that correct ? > > > > From a pure util_avg pov it's correct > > But i'd like to weight that a bit with the example below > > > > > Now, we can argue that 64ms is a pretty long time and thus it's quite > > > unlucky we will have no idle for such a long time. > > > > > > Still, I'm wondering if we should keep collecting those samples or > > > better find a way to detect that and skip the sampling. > > > > The problem is that you can have util_avg above capacity even with idle time > > In the 1st example of this thread, the 39ms/80ms task will reach 709 > > which is the value saved by util_est on a big core > > But on core with half capacity, there is still idle time so 709 is a > > correct value although above 512 > > Right, I see your point and (in principle) I like the idea of > collecting samples for tasks which happen to run at a lower capacity > then required and the utilization value makes sense... > > > In fact, max will be always above the linear ratio because it's based > > on geometric series > > > > And this is true even with 15.6ms/32ms (same ratio as above) task > > although the impact is smaller (max value, which should be saved by > > util est, becomes 587 in this case). > > However that's not always the case... as per my example above. > > Moreover, we should also consider that util_est is mainly meant to be > a lower-bound for tasks utilization. > That's why task_util_est() already returns the actual util_avg when > it's higher than the estimated utilization. I can imagine that the fact that we use max(util_avg, util_est) helps to keep using correct utilization in the scheduler when util_avg goes above cpu capacity whereas there is still idle time > > With your new signal and without any special check on samples > collection, if a task is limited because of thermal capping for > example, we could end up overestimating its utilization and thus > perhaps generating an unwanted frequency spike when the capping is > relaxed... and (even worst) it will take some more activations for the > estimated utilization to converge back to the actual utilization. 
> > Since we cannot easily know if there is idle time in a CPU when a task > completes an activation with a utilization higher then the CPU > capacity, I would better prefer to just skip the sampling with > something like: > > ---8<--- > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 9332863d122a..485053026533 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -3639,6 +3639,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) > { > long last_ewma_diff; > struct util_est ue; > + int cpu; > > if (!sched_feat(UTIL_EST)) > return; > @@ -3672,6 +3673,14 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) > if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100))) > return; > > + /* > + * To avoid overestimation of actual task utilization, skip updates if > + * we cannot grant there is idle time in this CPU. > + */ > + cpu = cpu_of(rq_of(cfs_rq)); > + if (task_util(p) > cpu_capacity(cpu)) > + return; > + > /* > * Update Task's estimated utilization > * > ---8<--- > > At least this will ensure that util_est always provides an actual > measured lower bound for a task utilization. > > If you think this makes sense, feel free to add such a patch on > top of your series. ok. I'm going to add it when rebasing the series Thanks Vincent > > Cheers Patrick > > -- > #include <best/regards.h> > > Patrick Bellasi
Sorry; trying to get back to this and re-reading the old conversations. On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote: > On 29-Nov 13:53, Peter Zijlstra wrote: > > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote: > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index ac855b2f4774..93e0cf5d8a76 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) > > > if (!task_sleep) > > > return; > > > > > > + /* Skip samples which do not represent an actual utilization */ > > > + if (unlikely(task_util(p) > capacity_of(task_cpu(p)))) > > > + return; > > > + > > > /* > > > * If the PELT values haven't changed since enqueue time, > > > * skip the util_est update. > > > > Would you not want something like: > > > > min(task_util(p), capacity_of(task_cpu(p))) > > > > And is this the only place where we need this? > > Mmm... even this could be an over-estimation: > > I've just posted an example in my last reply to Vincent, end of: > > Message-ID: <20181129150020.GF23094@e110439-lin> > https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/ In particular this bit: | Seems we agree that, when there is no idle time: | - the two 15% tasks will be overestimated | - their utilization will reach 50% after a while Right? > > OTOH, if the task is always running, it will be always running > > irrespective of where it runs. > > That's not what I'm concerned about. I'm concerned about small tasks > which are running on limited capacity (e.g. due to thermal capping) > without idle time. In this case, the new "utilization" signal could > overestimate the real task needs. > > > Not storing these samples seems weird though; this is the exact > > condition you want to record -- the task is very active, if we skip > > these, we'll come back at a low frequency on the next wakeup. > > When there is not idle time, we don't know if the reported > utilization, above the cpu capacity, is due to the task being bigger... > or just the new utilization signal converging towards: > > 100% / RUNNABLE_TASKS_COUNT So if I'm not mistaken we then have 3 cases: 1) runnable == util <= capacity no contention, idle 2) runnable == util > capacity no contention, no idle 3) runnable > util contention, no idle For 1) we can use: 'util' For 2) we can use: 'capacity' For 3) we can use: 'util * capacity >> 10' (note that 2 is a special case of 3 when u=1) This should work right? Now, instead of doing complicated things like that, you instead figure that when there's no idle there's also no dequeue happening and we can simply short-cut by skipping the entire thing, forgetting everything about 2,3. Did I get that right?
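A minimal sketch of those three cases side by side; the helper name and the standalone form are made up here for illustration, this is not code from the series:

/*
 * Illustrative-only sketch of the three estimation cases above;
 * util_est_sample() is a made-up name, not a kernel function.
 */
#define SCHED_CAPACITY_SHIFT	10	/* as in the kernel */

static unsigned long util_est_sample(unsigned long util,
				     unsigned long runnable,
				     unsigned long capacity)
{
	/* 1) runnable == util <= capacity: no contention, idle */
	if (runnable <= capacity)
		return util;

	/* 2) runnable == util > capacity: no contention, no idle */
	if (runnable == util)
		return capacity;

	/* 3) runnable > util: contention, no idle */
	return (util * capacity) >> SCHED_CAPACITY_SHIFT;
}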
On 24-Jan 10:07, Peter Zijlstra wrote: > > Sorry; trying to get back to this and re-reading the old conversations. > > On Thu, Nov 29, 2018 at 03:13:16PM +0000, Patrick Bellasi wrote: > > On 29-Nov 13:53, Peter Zijlstra wrote: > > > On Wed, Nov 28, 2018 at 11:53:36AM +0000, Patrick Bellasi wrote: > > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > > index ac855b2f4774..93e0cf5d8a76 100644 > > > > --- a/kernel/sched/fair.c > > > > +++ b/kernel/sched/fair.c > > > > @@ -3661,6 +3661,10 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) > > > > if (!task_sleep) > > > > return; > > > > > > > > + /* Skip samples which do not represent an actual utilization */ > > > > + if (unlikely(task_util(p) > capacity_of(task_cpu(p)))) > > > > + return; > > > > + > > > > /* > > > > * If the PELT values haven't changed since enqueue time, > > > > * skip the util_est update. > > > > > > Would you not want something like: > > > > > > min(task_util(p), capacity_of(task_cpu(p))) > > > > > > And is this the only place where we need this? > > > > Mmm... even this could be an over-estimation: > > > > I've just posted an example in my last reply to Vincent, end of: > > > > Message-ID: <20181129150020.GF23094@e110439-lin> > > https://lore.kernel.org/lkml/20181129150020.GF23094@e110439-lin/ > > In particular this bit: > > | Seems we agree that, when there is no idle time: > | - the two 15% tasks will be overestimated > | - their utilization will reach 50% after a while > > Right? > > > > OTOH, if the task is always running, it will be always running > > > irrespective of where it runs. > > > > That's not what I'm concerned about. I'm concerned about small tasks > > which are running on limited capacity (e.g. due to thermal capping) > > without idle time. In this case, the new "utilization" signal could > > overestimate the real task needs. > > > > > Not storing these samples seems weird though; this is the exact > > > condition you want to record -- the task is very active, if we skip > > > these, we'll come back at a low frequency on the next wakeup. > > > > When there is not idle time, we don't know if the reported > > utilization, above the cpu capacity, is due to the task being bigger... > > or just the new utilization signal converging towards: > > > > 100% / RUNNABLE_TASKS_COUNT > > So if I'm not mistaken we then have 3 cases: > > 1) runnable == util <= capacity > > no contention, idle > > 2) runnable == util > capacity > > no contention, no idle > > 3) runnable > util > > contention, no idle > > For 1) we can use: 'util' > For 2) we can use: 'capacity' > For 3) we can use: 'util * capacity >> 10' > > (note that 2 is a special case of 3 when u=1) > > This should work right? I think there is a case, similar to 2, in which the new 'util' could potentially be used. That's the case for example of a 20% (estimated) utilization task running alone on a 15% capacity CPU, for a single activation. In that case such a task will complete and be dequeued with: runnable == util > capacity The problem is that we need to be sure there was not contention... and that seems to be difficult to detect. > Now, instead of doing complicated things like that, you instead figure > that when there's no idle there's also no dequeue happening and we can > simply short-cut by skipping the entire thing, forgetting everything > about 2,3. > > Did I get that right? More or less... 
just saying that 1 is the only easy-to-detect scenario in which we are guaranteed that the utilization represents an actual bandwidth request, and thus the only safe value to sample for the estimated utilization. For the other cases, since anyway:

   util_est := max(max(ewma, last_util), util_avg)

util_est will just keep representing a safe, actually measured lower bound for the expected utilization of a task, without affecting the EWMA, which has a "slow" update dynamic.

--
#include <best/regards.h>

Patrick Bellasi
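To make that lower-bound property concrete, here is a minimal sketch (function and parameter names are made up, not code from the series) of the max composition a reader of util_est effectively sees; skipping a sample only freezes ewma/last_util, while the live util_avg remains visible through the max:

/* Illustrative-only sketch of the util_est composition discussed above. */
static unsigned long task_util_est_sketch(unsigned long ewma,
					  unsigned long last_util,
					  unsigned long util_avg)
{
	unsigned long est = ewma > last_util ? ewma : last_util;

	return est > util_avg ? est : util_avg;
}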
On Thu, Jan 24, 2019 at 02:04:32PM +0000, Patrick Bellasi wrote: > > So if I'm not mistaken we then have 3 cases: > > > > 1) runnable == util <= capacity > > > > no contention, idle > > > > 2) runnable == util > capacity > > > > no contention, no idle > > > > 3) runnable > util > > > > contention, no idle > > > > For 1) we can use: 'util' > > For 2) we can use: 'capacity' > > For 3) we can use: 'util * capacity >> 10' > > > > (note that 2 is a special case of 3 when u=1) > > > > This should work right? > > I think there is a case, similar to 2, in which the new 'util' could > potentially be used. That's the case for example of a 20% (estimated) > utilization task running alone on a 15% capacity CPU, for a single > activation. In that case such a task will complete and be dequeued > with: > > runnable == util > capacity > > The problem is that we need to be sure there was not contention... and > that seems to be difficult to detect. When there is contention runnable and util should diverge. Given this is all discrete stuff, there's a few funnies, but who cares about those :-) > > Now, instead of doing complicated things like that, you instead figure > > that when there's no idle there's also no dequeue happening and we can > > simply short-cut by skipping the entire thing, forgetting everything > > about 2,3. > > > > Did I get that right? > > More or less... just saying that 1 is the only easy to detect scenario > in which we are granted the utilization represents an actual bandwidth > request and thus the only safe values to sample for estimated > utilization. For the other cases, since anyway: > > util_est := max(max(ewma, last_util), util_avg) > > util_est will just keep representing a safe and actually measured > lower-bound for the expected utilization of a task, without > side-affecting the EWMA which has a "slow" update dynamic. Right, so maybe we should expound the comment here a bit; but otherwise I'm inclined to merge that v9 Vincent posted.
diff --git a/include/linux/sched.h b/include/linux/sched.h index 8f8a541..d61a523 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -356,12 +356,6 @@ struct util_est { * For cfs_rq, it is the aggregated load_avg of all runnable and * blocked sched_entities. * - * load_avg may also take frequency scaling into account: - * - * load_avg = runnable% * scale_load_down(load) * freq% - * - * where freq% is the CPU frequency normalized to the highest frequency. - * * [util_avg definition] * * util_avg = running% * SCHED_CAPACITY_SCALE @@ -370,17 +364,14 @@ struct util_est { * a CPU. For cfs_rq, it is the aggregated util_avg of all runnable * and blocked sched_entities. * - * util_avg may also factor frequency scaling and CPU capacity scaling: - * - * util_avg = running% * SCHED_CAPACITY_SCALE * freq% * capacity% - * - * where freq% is the same as above, and capacity% is the CPU capacity - * normalized to the greatest capacity (due to uarch differences, etc). + * load_avg and util_avg don't direcly factor frequency scaling and CPU + * capacity scaling. The scaling is done through the rq_clock_pelt that + * is used for computing those signals (see update_rq_clock_pelt()) * - * N.B., the above ratios (runnable%, running%, freq%, and capacity%) - * themselves are in the range of [0, 1]. To do fixed point arithmetics, - * we therefore scale them to as large a range as necessary. This is for - * example reflected by util_avg's SCHED_CAPACITY_SCALE. + * N.B., the above ratios (runnable% and running%) themselves are in the + * range of [0, 1]. To do fixed point arithmetics, we therefore scale them + * to as large a range as necessary. This is for example reflected by + * util_avg's SCHED_CAPACITY_SCALE. * * [Overflow issue] * diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 5afb868..79d57da 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -180,6 +180,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta) if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY)) update_irq_load_avg(rq, irq_delta + steal); #endif + update_rq_clock_pelt(rq, delta); } void update_rq_clock(struct rq *rq) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 470ba6b..a9f1d62 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1767,7 +1767,7 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) deadline_queue_push_tasks(rq); if (rq->curr->sched_class != &dl_sched_class) - update_dl_rq_load_avg(rq_clock_task(rq), rq, 0); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); return p; } @@ -1776,7 +1776,7 @@ static void put_prev_task_dl(struct rq *rq, struct task_struct *p) { update_curr_dl(rq); - update_dl_rq_load_avg(rq_clock_task(rq), rq, 1); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1); if (on_dl_rq(&p->dl) && p->nr_cpus_allowed > 1) enqueue_pushable_dl_task(rq, p); } @@ -1793,7 +1793,7 @@ static void task_tick_dl(struct rq *rq, struct task_struct *p, int queued) { update_curr_dl(rq); - update_dl_rq_load_avg(rq_clock_task(rq), rq, 1); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 1); /* * Even when we have runtime, update_curr_dl() might have resulted in us * not being the leftmost task anymore. 
In that case NEED_RESCHED will diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9cd1f32..4debba7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -674,9 +674,8 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se) return calc_delta_fair(sched_slice(cfs_rq, se), se); } -#ifdef CONFIG_SMP #include "pelt.h" -#include "sched-pelt.h" +#ifdef CONFIG_SMP static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu); static unsigned long task_h_load(struct task_struct *p); @@ -764,7 +763,7 @@ void post_init_entity_util_avg(struct sched_entity *se) * such that the next switched_to_fair() has the * expected state. */ - se->avg.last_update_time = cfs_rq_clock_task(cfs_rq); + se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq); return; } } @@ -3110,7 +3109,7 @@ void set_task_rq_fair(struct sched_entity *se, p_last_update_time = prev->avg.last_update_time; n_last_update_time = next->avg.last_update_time; #endif - __update_load_avg_blocked_se(p_last_update_time, cpu_of(rq_of(prev)), se); + __update_load_avg_blocked_se(p_last_update_time, se); se->avg.last_update_time = n_last_update_time; } @@ -3245,11 +3244,11 @@ update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cf /* * runnable_sum can't be lower than running_sum - * As running sum is scale with CPU capacity wehreas the runnable sum - * is not we rescale running_sum 1st + * Rescale running sum to be in the same range as runnable sum + * running_sum is in [0 : LOAD_AVG_MAX << SCHED_CAPACITY_SHIFT] + * runnable_sum is in [0 : LOAD_AVG_MAX] */ - running_sum = se->avg.util_sum / - arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq))); + running_sum = se->avg.util_sum >> SCHED_CAPACITY_SHIFT; runnable_sum = max(runnable_sum, running_sum); load_sum = (s64)se_weight(se) * runnable_sum; @@ -3352,7 +3351,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum /** * update_cfs_rq_load_avg - update the cfs_rq's load/util averages - * @now: current time, as per cfs_rq_clock_task() + * @now: current time, as per cfs_rq_clock_pelt() * @cfs_rq: cfs_rq to update * * The cfs_rq avg is the direct sum of all its entities (blocked and runnable) @@ -3397,7 +3396,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq) decayed = 1; } - decayed |= __update_load_avg_cfs_rq(now, cpu_of(rq_of(cfs_rq)), cfs_rq); + decayed |= __update_load_avg_cfs_rq(now, cfs_rq); #ifndef CONFIG_64BIT smp_wmb(); @@ -3487,9 +3486,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s /* Update task and its cfs_rq load average */ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - u64 now = cfs_rq_clock_task(cfs_rq); - struct rq *rq = rq_of(cfs_rq); - int cpu = cpu_of(rq); + u64 now = cfs_rq_clock_pelt(cfs_rq); int decayed; /* @@ -3497,7 +3494,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s * track group sched_entity load average for task_h_load calc in migration */ if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) - __update_load_avg_se(now, cpu, cfs_rq, se); + __update_load_avg_se(now, cfs_rq, se); decayed = update_cfs_rq_load_avg(now, cfs_rq); decayed |= propagate_entity_load_avg(se); @@ -3549,7 +3546,7 @@ void sync_entity_load_avg(struct sched_entity *se) u64 last_update_time; last_update_time = cfs_rq_last_update_time(cfs_rq); - __update_load_avg_blocked_se(last_update_time, cpu_of(rq_of(cfs_rq)), se); + __update_load_avg_blocked_se(last_update_time, 
se); } /* @@ -6763,6 +6760,12 @@ done: __maybe_unused; if (new_tasks > 0) goto again; + /* + * rq is about to be idle, check if we need to update the + * lost_idle_time of clock_pelt + */ + update_idle_rq_clock_pelt(rq); + return NULL; } @@ -7422,7 +7425,7 @@ static void update_blocked_averages(int cpu) if (throttled_hierarchy(cfs_rq)) continue; - if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq)) + if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) update_tg_load_avg(cfs_rq, 0); /* Propagate pending load changes to the parent, if any: */ @@ -7443,8 +7446,8 @@ static void update_blocked_averages(int cpu) } curr_class = rq->curr->sched_class; - update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class); - update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class); update_irq_load_avg(rq, 0); /* Don't need periodic decay once load/util_avg are null */ if (others_have_blocked(rq)) @@ -7514,11 +7517,11 @@ static inline void update_blocked_averages(int cpu) rq_lock_irqsave(rq, &rf); update_rq_clock(rq); - update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq); + update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq); curr_class = rq->curr->sched_class; - update_rt_rq_load_avg(rq_clock_task(rq), rq, curr_class == &rt_sched_class); - update_dl_rq_load_avg(rq_clock_task(rq), rq, curr_class == &dl_sched_class); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class); + update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class); update_irq_load_avg(rq, 0); #ifdef CONFIG_NO_HZ_COMMON rq->last_blocked_load_update_tick = jiffies; diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c index 90fb5bc..befce29 100644 --- a/kernel/sched/pelt.c +++ b/kernel/sched/pelt.c @@ -26,7 +26,6 @@ #include <linux/sched.h> #include "sched.h" -#include "sched-pelt.h" #include "pelt.h" /* @@ -106,16 +105,12 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3) * n=1 */ static __always_inline u32 -accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, +accumulate_sum(u64 delta, struct sched_avg *sa, unsigned long load, unsigned long runnable, int running) { - unsigned long scale_freq, scale_cpu; u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */ u64 periods; - scale_freq = arch_scale_freq_capacity(cpu); - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); - delta += sa->period_contrib; periods = delta / 1024; /* A period is 1024us (~1ms) */ @@ -137,13 +132,12 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, } sa->period_contrib = delta; - contrib = cap_scale(contrib, scale_freq); if (load) sa->load_sum += load * contrib; if (runnable) sa->runnable_load_sum += runnable * contrib; if (running) - sa->util_sum += contrib * scale_cpu; + sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; return periods; } @@ -177,7 +171,7 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa, * = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}] */ static __always_inline int -___update_load_sum(u64 now, int cpu, struct sched_avg *sa, +___update_load_sum(u64 now, struct sched_avg *sa, unsigned long load, unsigned long runnable, int running) { u64 delta; @@ -221,7 +215,7 @@ ___update_load_sum(u64 now, int cpu, struct sched_avg *sa, * Step 1: accumulate *_sum since last_update_time. If we haven't * crossed period boundaries, finish. 
*/ - if (!accumulate_sum(delta, cpu, sa, load, runnable, running)) + if (!accumulate_sum(delta, sa, load, runnable, running)) return 0; return 1; @@ -267,9 +261,9 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna * runnable_load_avg = \Sum se->avg.runable_load_avg */ -int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) +int __update_load_avg_blocked_se(u64 now, struct sched_entity *se) { - if (___update_load_sum(now, cpu, &se->avg, 0, 0, 0)) { + if (___update_load_sum(now, &se->avg, 0, 0, 0)) { ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); return 1; } @@ -277,9 +271,9 @@ int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se) return 0; } -int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se) +int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se) { - if (___update_load_sum(now, cpu, &se->avg, !!se->on_rq, !!se->on_rq, + if (___update_load_sum(now, &se->avg, !!se->on_rq, !!se->on_rq, cfs_rq->curr == se)) { ___update_load_avg(&se->avg, se_weight(se), se_runnable(se)); @@ -290,9 +284,9 @@ int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_e return 0; } -int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) +int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq) { - if (___update_load_sum(now, cpu, &cfs_rq->avg, + if (___update_load_sum(now, &cfs_rq->avg, scale_load_down(cfs_rq->load.weight), scale_load_down(cfs_rq->runnable_weight), cfs_rq->curr != NULL)) { @@ -317,7 +311,7 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq) int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) { - if (___update_load_sum(now, rq->cpu, &rq->avg_rt, + if (___update_load_sum(now, &rq->avg_rt, running, running, running)) { @@ -340,7 +334,7 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running) int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) { - if (___update_load_sum(now, rq->cpu, &rq->avg_dl, + if (___update_load_sum(now, &rq->avg_dl, running, running, running)) { @@ -365,22 +359,31 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running) int update_irq_load_avg(struct rq *rq, u64 running) { int ret = 0; + + /* + * We can't use clock_pelt because irq time is not accounted in + * clock_task. Instead we directly scale the running time to + * reflect the real amount of computation + */ + running = cap_scale(running, arch_scale_freq_capacity(cpu_of(rq))); + running = cap_scale(running, arch_scale_cpu_capacity(NULL, cpu_of(rq))); + /* * We know the time that has been used by interrupt since last update * but we don't when. Let be pessimistic and assume that interrupt has * happened just before the update. This is not so far from reality * because interrupt will most probably wake up task and trig an update - * of rq clock during which the metric si updated. + * of rq clock during which the metric is updated. * We start to decay with normal context time and then we add the * interrupt context time. 
* We can safely remove running from rq->clock because * rq->clock += delta with delta >= running */ - ret = ___update_load_sum(rq->clock - running, rq->cpu, &rq->avg_irq, + ret = ___update_load_sum(rq->clock - running, &rq->avg_irq, 0, 0, 0); - ret += ___update_load_sum(rq->clock, rq->cpu, &rq->avg_irq, + ret += ___update_load_sum(rq->clock, &rq->avg_irq, 1, 1, 1); diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h index 7e56b48..ae269ba 100644 --- a/kernel/sched/pelt.h +++ b/kernel/sched/pelt.h @@ -1,8 +1,9 @@ #ifdef CONFIG_SMP +#include "sched-pelt.h" -int __update_load_avg_blocked_se(u64 now, int cpu, struct sched_entity *se); -int __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entity *se); -int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq); +int __update_load_avg_blocked_se(u64 now, struct sched_entity *se); +int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se); +int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq); int update_rt_rq_load_avg(u64 now, struct rq *rq, int running); int update_dl_rq_load_avg(u64 now, struct rq *rq, int running); @@ -42,6 +43,98 @@ static inline void cfs_se_util_change(struct sched_avg *avg) WRITE_ONCE(avg->util_est.enqueued, enqueued); } +/* + * The clock_pelt scales the time to reflect the effective amount of + * computation done during the running delta time but then sync back to + * clock_task when rq is idle. + * + * + * absolute time | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16 + * @ max capacity ------******---------------******--------------- + * @ half capacity ------************---------************--------- + * clock pelt | 1| 2| 3| 4| 7| 8| 9| 10| 11|14|15|16 + * + */ +static inline void update_rq_clock_pelt(struct rq *rq, s64 delta) +{ + if (unlikely(is_idle_task(rq->curr))) { + /* The rq is idle, we can sync to clock_task */ + rq->clock_pelt = rq_clock_task(rq); + return; + } + + /* + * When a rq runs at a lower compute capacity, it will need + * more time to do the same amount of work than at max + * capacity. In order to be invariant, we scale the delta to + * reflect how much work has been really done. + * Running longer results in stealing idle time that will + * disturb the load signal compared to max capacity. This + * stolen idle time will be automatically reflected when the + * rq will be idle and the clock will be synced with + * rq_clock_task. + */ + + /* + * Scale the elapsed time to reflect the real amount of + * computation + */ + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu_of(rq))); + delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq))); + + rq->clock_pelt += delta; +} + +/* + * When rq becomes idle, we have to check if it has lost idle time + * because it was fully busy. A rq is fully used when the /Sum util_sum + * is greater or equal to: + * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT; + * For optimization and computing rounding purpose, we don't take into account + * the position in the current window (period_contrib) and we use the higher + * bound of util_sum to decide. + */ +static inline void update_idle_rq_clock_pelt(struct rq *rq) +{ + u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX; + u32 util_sum = rq->cfs.avg.util_sum; + util_sum += rq->avg_rt.util_sum; + util_sum += rq->avg_dl.util_sum; + + /* + * Reflecting stolen time makes sense only if the idle + * phase would be present at max capacity. 
As soon as the + * utilization of a rq has reached the maximum value, it is + * considered as an always runnig rq without idle time to + * steal. This potential idle time is considered as lost in + * this case. We keep track of this lost idle time compare to + * rq's clock_task. + */ + if (util_sum >= divider) + rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt; +} + +static inline u64 rq_clock_pelt(struct rq *rq) +{ + return rq->clock_pelt - rq->lost_idle_time; +} + +#ifdef CONFIG_CFS_BANDWIDTH +/* rq->task_clock normalized against any time this cfs_rq has spent throttled */ +static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) +{ + if (unlikely(cfs_rq->throttle_count)) + return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time; + + return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time; +} +#else +static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq) +{ + return rq_clock_pelt(rq_of(cfs_rq)); +} +#endif + #else static inline int @@ -67,6 +160,18 @@ update_irq_load_avg(struct rq *rq, u64 running) { return 0; } + +static inline u64 rq_clock_pelt(struct rq *rq) +{ + return rq_clock_task(rq); +} + +static inline void +update_rq_clock_pelt(struct rq *rq, s64 delta) { } + +static inline void +update_idle_rq_clock_pelt(struct rq *rq) { } + #endif diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 9aa3287..579c60e 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1587,7 +1587,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) * rt task */ if (rq->curr->sched_class != &rt_sched_class) - update_rt_rq_load_avg(rq_clock_task(rq), rq, 0); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0); return p; } @@ -1596,7 +1596,7 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p) { update_curr_rt(rq); - update_rt_rq_load_avg(rq_clock_task(rq), rq, 1); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1); /* * The previous task needs to be made eligible for pushing @@ -2327,7 +2327,7 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued) struct sched_rt_entity *rt_se = &p->rt; update_curr_rt(rq); - update_rt_rq_load_avg(rq_clock_task(rq), rq, 1); + update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 1); watchdog(rq, p); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 67e2d56..9a8ae27 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -840,7 +840,10 @@ struct rq { unsigned int clock_update_flags; u64 clock; - u64 clock_task; + /* Ensure that all clocks are in the same cache line */ + u64 clock_task ____cacheline_aligned; + u64 clock_pelt; + unsigned long lost_idle_time; atomic_t nr_iowait;
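The clock_pelt bookkeeping added in pelt.h above can be summarised with a small userspace toy model; this is illustrative only (made-up names, no frequency term, no lost_idle_time handling):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

static unsigned long long clock_task, clock_pelt;

/* something is running: scale the elapsed delta like cap_scale() does */
static void account_running(unsigned long long delta_us, unsigned long capacity)
{
	clock_task += delta_us;
	clock_pelt += (delta_us * capacity) >> SCHED_CAPACITY_SHIFT;
}

/* rq goes idle: sync back so the same idle time is seen as at max capacity */
static void account_idle(unsigned long long delta_us)
{
	clock_task += delta_us;
	clock_pelt = clock_task;
}

int main(void)
{
	account_running(4000, 512);	/* 4ms at half capacity -> 2ms of pelt time */
	printf("task=%llu pelt=%llu\n", clock_task, clock_pelt);
	account_idle(2000);		/* idling resyncs clock_pelt to clock_task */
	printf("task=%llu pelt=%llu\n", clock_task, clock_pelt);
	return 0;
}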
The current implementation of load tracking invariance scales the contribution with current frequency and uarch performance (only for utilization) of the CPU. One main result of this formula is that the figures are capped by current capacity of CPU. Another one is that the load_avg is not invariant because not scaled with uarch. The util_avg of a periodic task that runs r time slots every p time slots varies in the range : U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p) with U is the max util_avg value = SCHED_CAPACITY_SCALE At a lower capacity, the range becomes: U * C * (1-y^r')/(1-y^p) * y^i' < Utilization < U * C * (1-y^r')/(1-y^p) with C reflecting the compute capacity ratio between current capacity and max capacity. so C tries to compensate changes in (1-y^r') but it can't be accurate. Instead of scaling the contribution value of PELT algo, we should scale the running time. The PELT signal aims to track the amount of computation of tasks and/or rq so it seems more correct to scale the running time to reflect the effective amount of computation done since the last update. In order to be fully invariant, we need to apply the same amount of running time and idle time whatever the current capacity. Because running at lower capacity implies that the task will run longer, we have to ensure that the same amount of idle time will be applied when system becomes idle and no idle time has been "stolen". But reaching the maximum utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an always-running task whatever the capacity of the CPU (even at max compute capacity). In this case, we can discard this "stolen" idle times which becomes meaningless. In order to achieve this time scaling, a new clock_pelt is created per rq. The increase of this clock scales with current capacity when something is running on rq and synchronizes with clock_task when rq is idle. With this mechanism, we ensure the same running and idle time whatever the current capacity. This also enables to simplify the pelt algorithm by removing all references of uarch and frequency and applying the same contribution to utilization and loads. Furthermore, the scaling is done only once per update of clock (update_rq_clock_task()) instead of during each update of sched_entities and cfs/rt/dl_rq of the rq like the current implementation. This is interesting when cgroup are involved as shown in the results below: On a hikey (octo Arm64 platform). Performance cpufreq governor and only shallowest c-state to remove variance generated by those power features so we only track the impact of pelt algo. each test runs 16 times ./perf bench sched pipe (higher is better) kernel tip/sched/core + patch ops/seconds ops/seconds diff cgroup root 59652(+/- 0.18%) 59876(+/- 0.24%) +0.38% level1 55608(+/- 0.27%) 55923(+/- 0.24%) +0.57% level2 52115(+/- 0.29%) 52564(+/- 0.22%) +0.86% hackbench -l 1000 (lower is better) kernel tip/sched/core + patch duration(sec) duration(sec) diff cgroup root 4.453(+/- 2.37%) 4.383(+/- 2.88%) -1.57% level1 4.859(+/- 8.50%) 4.830(+/- 7.07%) -0.60% level2 5.063(+/- 9.83%) 4.928(+/- 9.66%) -2.66% Then, the responsiveness of PELT is improved when CPU is not running at max capacity with this new algorithm. I have put below some examples of duration to reach some typical load values according to the capacity of the CPU with current implementation and with this patch. 
These values have been computed based on the geometric series and the half-period value:

 Util (%)     max capacity   half capacity(mainline)   half capacity(w/ patch)
 972 (95%)    138ms          not reachable             276ms
 486 (47.5%)   30ms          138ms                      60ms
 256 (25%)     13ms           32ms                      26ms

On my hikey (octo Arm64 platform) with schedutil governor, the time to reach the max OPP when starting from a null utilization decreases from 223ms with the current scale invariance down to 121ms with the new algorithm.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 23 +++-------
 kernel/sched/core.c | 1 +
 kernel/sched/deadline.c | 6 +--
 kernel/sched/fair.c | 45 +++++++++++---------
 kernel/sched/pelt.c | 45 +++++++++++---------
 kernel/sched/pelt.h | 111 ++++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/rt.c | 6 +--
 kernel/sched/sched.h | 5 ++-
 8 files changed, 174 insertions(+), 68 deletions(-)

--
2.7.4