Message ID: 1527253951-22709-1-git-send-email-vincent.guittot@linaro.org
Series: track CPU utilization
On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote:
> When both cfs and rt tasks compete to run on a CPU, we can see some frequency
> drops with schedutil governor. In such case, the cfs_rq's utilization doesn't
> reflect anymore the utilization of cfs tasks but only the remaining part that
> is not used by rt tasks. We should monitor the stolen utilization and take
> it into account when selecting OPP. This patchset doesn't change the OPP
> selection policy for RT tasks but only for CFS tasks

So the problem is that when RT/DL/stop/IRQ happens and preempts CFS tasks, time continues and the CFS load tracking will see !running and decay things. Then, when we get back to CFS, we'll have lower load/util than we expected.

In particular, your focus is on OPP selection, where we would have, say, u=1 (an always-running task); after being preempted by our RT task for a while, it will now have u=.5. With the effect that when the RT task goes to sleep we'll drop our OPP to .5 max -- which is 'wrong', right?

Your solution is to track RT/DL/stop/IRQ with the same PELT averages as we track cfs util, such that we can then add the various averages to reconstruct the actual utilisation signal.

This should work for the utilization signal on UP. It gets more complicated when we consider that PELT migrates the signal around on SMP, but we don't do that for the per-rq signals we have for RT/DL/stop/IRQ.

There is also the 'complaint' that this ends up with 2 util signals for DL, complicating things.

So this patch-set tracks the !cfs occupation using the same function, which is all good. But what if, instead of using that to compensate the OPP selection, we employ it to renormalize the util signal?

If we normalize util against the dynamic (rt_avg affected) cpu_capacity, then I think your initial problem goes away. Because while the RT task will push the util to .5, it will at the same time push the CPU capacity to .5, and renormalized that gives 1.

NOTE: the renorm would then become something like:

    scale_cpu = arch_scale_cpu_capacity() / rt_frac();

On IRC I mentioned stopping the CFS clock when preempted, and while that would result in fixed numbers, Vincent was right in pointing out that the numbers will be difficult to interpret, since the meaning will be purely CPU local, and I'm not sure you can actually fix it again with normalization. Imagine running a .3 RT task; that would push the (always running) CFS down to .7, but because we discard all !cfs time, it actually reads 1. If we try and normalize that we'll end up with ~1.43, which is of course completely broken.

_However_, all that happens for util also happens for load. So the above scenario will also make the CPU appear less loaded than it actually is.

Now, we actually try and compensate for that by decreasing the capacity of the CPU. But because the existing rt_avg and PELT signals are so out-of-tune, this is likely to be less than ideal. With that fixed, however, the best this appears to do is, as per the above, preserve the actual load. But what we really wanted is to actually inflate the load, such that someone will take load from us -- we're doing less actual work after all.

Possibly, we can do something like:

    scale_cpu_capacity / (rt_frac^2)

for load; then we inflate the load and could maybe get rid of all this capacity_of() sprinkling, but that needs more thinking.

But I really feel we need to consider both util and load, as this issue affects both.
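[Editorial note: to make the renormalization above concrete, here is a minimal C sketch. It is not code from the posted series; rt_frac() is the assumed helper named in the message, taken here to return the fraction of capacity left to CFS in the [1..SCHED_CAPACITY_SCALE] range. Instead of inflating scale_cpu, the sketch divides the CFS utilization by that fraction, which is the same comparison seen from the other side.]

/*
 * Hypothetical sketch of the renormalization discussed above -- not part
 * of the posted patch set. rt_frac() is an assumed helper.
 */
static unsigned long renorm_cfs_util(int cpu, unsigned long util_cfs)
{
	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
	unsigned long cfs_frac = rt_frac(cpu);	/* assumed helper */

	if (!cfs_frac)
		return max;

	/*
	 * Example from above: an always-running CFS task decayed to
	 * util = .5 * max while RT pressure pushed cfs_frac to .5, so
	 * the division restores ~max.
	 */
	return min(max, util_cfs * SCHED_CAPACITY_SCALE / cfs_frac);
}

The rt_frac^2 variant mentioned for load would, in the same spirit, divide a second time by the CFS fraction so that the load is inflated rather than merely preserved.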
On Monday 04 Jun 2018 at 18:50:47 (+0200), Peter Zijlstra wrote: > On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote: > > When both cfs and rt tasks compete to run on a CPU, we can see some frequency > > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't > > reflect anymore the utilization of cfs tasks but only the remaining part that > > is not used by rt tasks. We should monitor the stolen utilization and take > > it into account when selecting OPP. This patchset doesn't change the OPP > > selection policy for RT tasks but only for CFS tasks > > So the problem is that when RT/DL/stop/IRQ happens and preempts CFS > tasks, time continues and the CFS load tracking will see !running and > decay things. > > Then, when we get back to CFS, we'll have lower load/util than we > expected. > > In particular, your focus is on OPP selection, and where we would have > say: u=1 (always running task), after being preempted by our RT task for > a while, it will now have u=.5. With the effect that when the RT task > goes sleep we'll drop our OPP to .5 max -- which is 'wrong', right? > > Your solution is to track RT/DL/stop/IRQ with the identical PELT average > as we track cfs util. Such that we can then add the various averages to > reconstruct the actual utilisation signal. > > This should work for the case of the utilization signal on UP. When we > consider that PELT migrates the signal around on SMP, but we don't do > that to the per-rq signals we have for RT/DL/stop/IRQ. > > There is also the 'complaint' that this ends up with 2 util signals for > DL, complicating things. > > > So this patch-set tracks the !cfs occupation using the same function, > which is all good. But what, if instead of using that to compensate the > OPP selection, we employ that to renormalize the util signal? > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > then I think your initial problem goes away. Because while the RT task > will push the util to .5, it will at the same time push the CPU capacity > to .5, and renormalized that gives 1. > > NOTE: the renorm would then become something like: > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); Isn't it equivalent ? I mean, you can remove RT/DL/stop/IRQ from the CPU capacity and compare the CFS util_avg against that, or you can add RT/DL/stop/IRQ to the CFS util_avg and compare it to arch_scale_cpu_capacity(). Both should be interchangeable no ? By adding RT/DL/IRQ PELT signals to the CFS util_avg, Vincent is proposing to go with the latter I think. But aren't the signals we currently use to account for RT/DL/stop/IRQ in cpu_capacity good enough for that ? Can't we just add the diff between capacity_orig_of and capacity_of to the CFS util and do OPP selection with that (for !nr_rt_running) ? Maybe add a min with dl running_bw to be on the safe side ... ? > > > On IRC I mentioned stopping the CFS clock when preempted, and while that > would result in fixed numbers, Vincent was right in pointing out the > numbers will be difficult to interpret, since the meaning will be purely > CPU local and I'm not sure you can actually fix it again with > normalization. > > Imagine, running a .3 RT task, that would push the (always running) CFS > down to .7, but because we discard all !cfs time, it actually has 1. If > we try and normalize that we'll end up with ~1.43, which is of course > completely broken. > > > _However_, all that happens for util, also happens for load. 
So the above > scenario will also make the CPU appear less loaded than it actually is. > > Now, we actually try and compensate for that by decreasing the capacity > of the CPU. But because the existing rt_avg and PELT signals are so > out-of-tune, this is likely to be less than ideal. With that fixed > however, the best this appears to do is, as per the above, preserve the > actual load. But what we really wanted is to actually inflate the load, > such that someone will take load from us -- we're doing less actual work > after all. > > Possibly, we can do something like: > > scale_cpu_capacity / (rt_frac^2) > > for load, then we inflate the load and could maybe get rid of all this > capacity_of() sprinkling, but that needs more thinking. > > > But I really feel we need to consider both util and load, as this issue > affects both.
On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, May 25, 2018 at 03:12:21PM +0200, Vincent Guittot wrote: >> When both cfs and rt tasks compete to run on a CPU, we can see some frequency >> drops with schedutil governor. In such case, the cfs_rq's utilization doesn't >> reflect anymore the utilization of cfs tasks but only the remaining part that >> is not used by rt tasks. We should monitor the stolen utilization and take >> it into account when selecting OPP. This patchset doesn't change the OPP >> selection policy for RT tasks but only for CFS tasks > > So the problem is that when RT/DL/stop/IRQ happens and preempts CFS > tasks, time continues and the CFS load tracking will see !running and > decay things. > > Then, when we get back to CFS, we'll have lower load/util than we > expected. > > In particular, your focus is on OPP selection, and where we would have > say: u=1 (always running task), after being preempted by our RT task for > a while, it will now have u=.5. With the effect that when the RT task > goes sleep we'll drop our OPP to .5 max -- which is 'wrong', right? yes that's the typical example > > Your solution is to track RT/DL/stop/IRQ with the identical PELT average > as we track cfs util. Such that we can then add the various averages to > reconstruct the actual utilisation signal. yes and get the whole cpu utilization > > This should work for the case of the utilization signal on UP. When we > consider that PELT migrates the signal around on SMP, but we don't do > that to the per-rq signals we have for RT/DL/stop/IRQ. > > There is also the 'complaint' that this ends up with 2 util signals for > DL, complicating things. yes. that's the main point of discussion how to balance dl bandwidth and dl utilization > > > So this patch-set tracks the !cfs occupation using the same function, > which is all good. But what, if instead of using that to compensate the > OPP selection, we employ that to renormalize the util signal? > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > then I think your initial problem goes away. Because while the RT task > will push the util to .5, it will at the same time push the CPU capacity > to .5, and renormalized that gives 1. > > NOTE: the renorm would then become something like: > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); > > > On IRC I mentioned stopping the CFS clock when preempted, and while that > would result in fixed numbers, Vincent was right in pointing out the > numbers will be difficult to interpret, since the meaning will be purely > CPU local and I'm not sure you can actually fix it again with > normalization. > > Imagine, running a .3 RT task, that would push the (always running) CFS > down to .7, but because we discard all !cfs time, it actually has 1. If > we try and normalize that we'll end up with ~1.43, which is of course > completely broken. > > > _However_, all that happens for util, also happens for load. So the above > scenario will also make the CPU appear less loaded than it actually is. The load will continue to increase because we track runnable state and not running for the load > > Now, we actually try and compensate for that by decreasing the capacity > of the CPU. But because the existing rt_avg and PELT signals are so > out-of-tune, this is likely to be less than ideal. With that fixed > however, the best this appears to do is, as per the above, preserve the > actual load. 
> But what we really wanted is to actually inflate the load, such that
> someone will take load from us -- we're doing less actual work after all.
>
> Possibly, we can do something like:
>
>     scale_cpu_capacity / (rt_frac^2)
>
> for load, then we inflate the load and could maybe get rid of all this
> capacity_of() sprinkling, but that needs more thinking.
>
> But I really feel we need to consider both util and load, as this issue
> affects both.

My initial idea was to take the max between dl bandwidth and dl util_avg, but util_avg can be higher than the bandwidth, and using it would make schedutil select a higher OPP for no good reason when nothing is running around that needs the compute capacity.

As you mentioned, scale_rt_capacity gives the remaining capacity for cfs, and it will behave like cfs util_avg now that it uses PELT. So as long as cfs util_avg < scale_rt_capacity (we probably need a margin), we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting the OPP, because we have remaining spare capacity; but once cfs util_avg == scale_rt_capacity, we make sure to use the max OPP.

I will run some tests to make sure that they all behave correctly with such a policy.
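[Editorial note: a rough C sketch of the policy described above, for illustration only -- it is not code from the series. scale_rt_capacity() is assumed to return the fraction of capacity left once rt/dl/irq pressure is removed, in the [1..SCHED_CAPACITY_SCALE] range; cpu_util_rt() and dl_reserved_bw() are made-up helpers standing for the rt PELT utilization and the dl reserved bandwidth, both scaled to the capacity range.]

/* Illustrative sketch only -- helper names are assumptions. */
static unsigned long sugov_cpu_util(int cpu, unsigned long util_cfs)
{
	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
	unsigned long cfs_cap;

	/* Capacity actually left to cfs once rt/dl/irq pressure is removed. */
	cfs_cap = (max * scale_rt_capacity(cpu)) >> SCHED_CAPACITY_SHIFT;

	/* cfs fills whatever is left: no spare capacity, ask for the max OPP. */
	if (util_cfs >= cfs_cap)	/* a margin would likely be added here */
		return max;

	/* Otherwise request the sum of what each class needs. */
	return min(max, util_cfs + cpu_util_rt(cpu) + dl_reserved_bw(cpu));
}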
Hi Quentin, On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: > This patchset initially tracked only the utilization of RT rq. During > OSPM summit, it has been discussed the opportunity to extend it in order > to get an estimate of the utilization of the CPU. > > - Patches 1-3 correspond to the content of patchset v4 and add utilization > tracking for rt_rq. > > When both cfs and rt tasks compete to run on a CPU, we can see some frequency > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't > reflect anymore the utilization of cfs tasks but only the remaining part that > is not used by rt tasks. We should monitor the stolen utilization and take > it into account when selecting OPP. This patchset doesn't change the OPP > selection policy for RT tasks but only for CFS tasks > > A rt-app use case which creates an always running cfs thread and a rt threads > that wakes up periodically with both threads pinned on same CPU, show lot of > frequency switches of the CPU whereas the CPU never goes idles during the > test. I can share the json file that I used for the test if someone is > interested in. > > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), > the cpufreq statistics outputs (stats are reset just before the test) : > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > without patchset : 1230 > with patchset : 14 I have attached the rt-app json file that I use for this test > > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see > performance improvements: > > - Without patchset : > Test execution summary: > total time: 15.0009s > total number of events: 4903 > total time taken by event execution: 14.9972 > per-request statistics: > min: 1.23ms > avg: 3.06ms > max: 13.16ms > approx. 95 percentile: 12.73ms > > Threads fairness: > events (avg/stddev): 4903.0000/0.00 > execution time (avg/stddev): 14.9972/0.00 > > - With patchset: > Test execution summary: > total time: 15.0014s > total number of events: 7694 > total time taken by event execution: 14.9979 > per-request statistics: > min: 1.23ms > avg: 1.95ms > max: 10.49ms > approx. 95 percentile: 10.39ms > > Threads fairness: > events (avg/stddev): 7694.0000/0.00 > execution time (avg/stddev): 14.9979/0.00 > > The performance improvement is 56% for this use case. > > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar > problem as with rt_rq > > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove > dl and rt from sched_rt_avg_update > > - Patches 7-8 add utilization tracking for interrupt and use it select OPP > A test with iperf on hikey 6220 gives: > w/o patchset w/ patchset > Tx 276 Mbits/sec 304 Mbits/sec +10% > Rx 299 Mbits/sec 328 Mbits/sec +09% > > 8 iterations of iperf -c server_address -r -t 5 > stdev is lower than 1% > Only WFI idle state is enable (shallowest arm idle state) > > - Patches 9 removes the unused sched_avg_update code > > - Patch 10 removes the unused sched_time_avg_ms > > Change since v3: > - add support of periodic update of blocked utilization > - rebase on lastest tip/sched/core > > Change since v2: > - move pelt code into a dedicated pelt.c file > - rebase on load tracking changes > > Change since v1: > - Only a rebase. 
I have addressed the comments on previous version in > patch 1/2 > > Vincent Guittot (10): > sched/pelt: Move pelt related code in a dedicated file > sched/rt: add rt_rq utilization tracking > cpufreq/schedutil: add rt utilization tracking > sched/dl: add dl_rq utilization tracking > cpufreq/schedutil: get max utilization > sched: remove rt and dl from sched_avg > sched/irq: add irq utilization tracking > cpufreq/schedutil: take into account interrupt > sched: remove rt_avg code > proc/sched: remove unused sched_time_avg_ms > > include/linux/sched/sysctl.h | 1 - > kernel/sched/Makefile | 2 +- > kernel/sched/core.c | 38 +--- > kernel/sched/cpufreq_schedutil.c | 24 ++- > kernel/sched/deadline.c | 7 +- > kernel/sched/fair.c | 381 +++---------------------------------- > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ > kernel/sched/pelt.h | 63 +++++++ > kernel/sched/rt.c | 10 +- > kernel/sched/sched.h | 57 ++++-- > kernel/sysctl.c | 8 - > 11 files changed, 563 insertions(+), 423 deletions(-) > create mode 100644 kernel/sched/pelt.c > create mode 100644 kernel/sched/pelt.h > > -- > 2.7.4 >
Hi Vincent, On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: > Hi Quentin, > > On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: > > This patchset initially tracked only the utilization of RT rq. During > > OSPM summit, it has been discussed the opportunity to extend it in order > > to get an estimate of the utilization of the CPU. > > > > - Patches 1-3 correspond to the content of patchset v4 and add utilization > > tracking for rt_rq. > > > > When both cfs and rt tasks compete to run on a CPU, we can see some frequency > > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't > > reflect anymore the utilization of cfs tasks but only the remaining part that > > is not used by rt tasks. We should monitor the stolen utilization and take > > it into account when selecting OPP. This patchset doesn't change the OPP > > selection policy for RT tasks but only for CFS tasks > > > > A rt-app use case which creates an always running cfs thread and a rt threads > > that wakes up periodically with both threads pinned on same CPU, show lot of > > frequency switches of the CPU whereas the CPU never goes idles during the > > test. I can share the json file that I used for the test if someone is > > interested in. > > > > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), > > the cpufreq statistics outputs (stats are reset just before the test) : > > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > > without patchset : 1230 > > with patchset : 14 > > I have attached the rt-app json file that I use for this test Thank you very much ! I did a quick test with a much simpler fix to this RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). I get the following results on Hikey960: Without patch: cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans 12 cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans 640 With patch cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans 8 cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans 12 Yes the rt_avg stuff is out of sync with the PELT signal, but do you think this is an actual issue for realistic use-cases ? What about the diff below (just a quick hack to show the idea) applied on tip/sched/core ? ---8<--- diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index a8ba6d1f262a..23a4fb1c2c25 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) sg_cpu->util_dl = cpu_util_dl(rq); } +unsigned long scale_rt_capacity(int cpu); static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) { struct rq *rq = cpu_rq(sg_cpu->cpu); + int cpu = sg_cpu->cpu; + unsigned long util, dl_bw; if (rq->rt.rt_nr_running) return sg_cpu->max; @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) * util_cfs + util_dl as requested freq. However, cpufreq is not yet * ready for such an interface. So, we only do the latter for now. */ - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); + util >>= SCHED_CAPACITY_SHIFT; + util = arch_scale_cpu_capacity(NULL, cpu) - util; + util += sg_cpu->util_cfs; + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; + + /* Make sure to always provide the reserved freq to DL. 
*/ + return max(util, dl_bw); } static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f01f0f395f9a..0e87cbe47c8b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, return load_idx; } -static unsigned long scale_rt_capacity(int cpu) +unsigned long scale_rt_capacity(int cpu) { struct rq *rq = cpu_rq(cpu); u64 total, used, age_stamp, avg; --->8--- > > > > > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see > > performance improvements: > > > > - Without patchset : > > Test execution summary: > > total time: 15.0009s > > total number of events: 4903 > > total time taken by event execution: 14.9972 > > per-request statistics: > > min: 1.23ms > > avg: 3.06ms > > max: 13.16ms > > approx. 95 percentile: 12.73ms > > > > Threads fairness: > > events (avg/stddev): 4903.0000/0.00 > > execution time (avg/stddev): 14.9972/0.00 > > > > - With patchset: > > Test execution summary: > > total time: 15.0014s > > total number of events: 7694 > > total time taken by event execution: 14.9979 > > per-request statistics: > > min: 1.23ms > > avg: 1.95ms > > max: 10.49ms > > approx. 95 percentile: 10.39ms > > > > Threads fairness: > > events (avg/stddev): 7694.0000/0.00 > > execution time (avg/stddev): 14.9979/0.00 > > > > The performance improvement is 56% for this use case. > > > > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar > > problem as with rt_rq > > > > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove > > dl and rt from sched_rt_avg_update > > > > - Patches 7-8 add utilization tracking for interrupt and use it select OPP > > A test with iperf on hikey 6220 gives: > > w/o patchset w/ patchset > > Tx 276 Mbits/sec 304 Mbits/sec +10% > > Rx 299 Mbits/sec 328 Mbits/sec +09% > > > > 8 iterations of iperf -c server_address -r -t 5 > > stdev is lower than 1% > > Only WFI idle state is enable (shallowest arm idle state) > > > > - Patches 9 removes the unused sched_avg_update code > > > > - Patch 10 removes the unused sched_time_avg_ms > > > > Change since v3: > > - add support of periodic update of blocked utilization > > - rebase on lastest tip/sched/core > > > > Change since v2: > > - move pelt code into a dedicated pelt.c file > > - rebase on load tracking changes > > > > Change since v1: > > - Only a rebase. 
I have addressed the comments on previous version in > > patch 1/2 > > > > Vincent Guittot (10): > > sched/pelt: Move pelt related code in a dedicated file > > sched/rt: add rt_rq utilization tracking > > cpufreq/schedutil: add rt utilization tracking > > sched/dl: add dl_rq utilization tracking > > cpufreq/schedutil: get max utilization > > sched: remove rt and dl from sched_avg > > sched/irq: add irq utilization tracking > > cpufreq/schedutil: take into account interrupt > > sched: remove rt_avg code > > proc/sched: remove unused sched_time_avg_ms > > > > include/linux/sched/sysctl.h | 1 - > > kernel/sched/Makefile | 2 +- > > kernel/sched/core.c | 38 +--- > > kernel/sched/cpufreq_schedutil.c | 24 ++- > > kernel/sched/deadline.c | 7 +- > > kernel/sched/fair.c | 381 +++---------------------------------- > > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ > > kernel/sched/pelt.h | 63 +++++++ > > kernel/sched/rt.c | 10 +- > > kernel/sched/sched.h | 57 ++++-- > > kernel/sysctl.c | 8 - > > 11 files changed, 563 insertions(+), 423 deletions(-) > > create mode 100644 kernel/sched/pelt.c > > create mode 100644 kernel/sched/pelt.h > > > > -- > > 2.7.4 > >
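[Editorial note: for readability, the aggregation in the quick hack above amounts to the following, restated as one function with comments. This is the same logic as the diff, not a further change; dl.this_bw, BW_SHIFT and scale_rt_capacity() are the existing kernel symbols it relies on.]

static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	int cpu = sg_cpu->cpu;
	unsigned long util, dl_bw;

	/* An RT task is runnable: go straight to the max OPP. */
	if (rq->rt.rt_nr_running)
		return sg_cpu->max;

	/* Capacity currently eaten by rt/dl/irq, as seen by rt_avg ... */
	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
	util >>= SCHED_CAPACITY_SHIFT;
	util = arch_scale_cpu_capacity(NULL, cpu) - util;

	/* ... plus what cfs itself is asking for. */
	util += sg_cpu->util_cfs;

	/* Bandwidth reserved by admitted DL tasks, scaled to the capacity range. */
	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* Make sure to always provide the reserved freq to DL. */
	return max(util, dl_bw);
}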
On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote: > Hi Vincent, > > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: >> Hi Quentin, >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: >> > This patchset initially tracked only the utilization of RT rq. During >> > OSPM summit, it has been discussed the opportunity to extend it in order >> > to get an estimate of the utilization of the CPU. >> > >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization >> > tracking for rt_rq. >> > >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't >> > reflect anymore the utilization of cfs tasks but only the remaining part that >> > is not used by rt tasks. We should monitor the stolen utilization and take >> > it into account when selecting OPP. This patchset doesn't change the OPP >> > selection policy for RT tasks but only for CFS tasks >> > >> > A rt-app use case which creates an always running cfs thread and a rt threads >> > that wakes up periodically with both threads pinned on same CPU, show lot of >> > frequency switches of the CPU whereas the CPU never goes idles during the >> > test. I can share the json file that I used for the test if someone is >> > interested in. >> > >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), >> > the cpufreq statistics outputs (stats are reset just before the test) : >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> > without patchset : 1230 >> > with patchset : 14 >> >> I have attached the rt-app json file that I use for this test > > Thank you very much ! I did a quick test with a much simpler fix to this > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). > I get the following results on Hikey960: > > Without patch: > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > 12 > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > 640 > With patch > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > 8 > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > 12 > > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think > this is an actual issue for realistic use-cases ? yes I think that it's worth syncing and consolidating things on the same metric. The result will be saner and more robust as we will have the same behavior > > What about the diff below (just a quick hack to show the idea) applied > on tip/sched/core ? > > ---8<--- > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index a8ba6d1f262a..23a4fb1c2c25 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > sg_cpu->util_dl = cpu_util_dl(rq); > } > > +unsigned long scale_rt_capacity(int cpu); > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > { > struct rq *rq = cpu_rq(sg_cpu->cpu); > + int cpu = sg_cpu->cpu; > + unsigned long util, dl_bw; > > if (rq->rt.rt_nr_running) > return sg_cpu->max; > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > * ready for such an interface. So, we only do the latter for now. 
> */ > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > + util >>= SCHED_CAPACITY_SHIFT; > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > + util += sg_cpu->util_cfs; > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > + > + /* Make sure to always provide the reserved freq to DL. */ > + return max(util, dl_bw); > } > > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f01f0f395f9a..0e87cbe47c8b 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, > return load_idx; > } > > -static unsigned long scale_rt_capacity(int cpu) > +unsigned long scale_rt_capacity(int cpu) > { > struct rq *rq = cpu_rq(cpu); > u64 total, used, age_stamp, avg; > --->8--- > > > >> >> > >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see >> > performance improvements: >> > >> > - Without patchset : >> > Test execution summary: >> > total time: 15.0009s >> > total number of events: 4903 >> > total time taken by event execution: 14.9972 >> > per-request statistics: >> > min: 1.23ms >> > avg: 3.06ms >> > max: 13.16ms >> > approx. 95 percentile: 12.73ms >> > >> > Threads fairness: >> > events (avg/stddev): 4903.0000/0.00 >> > execution time (avg/stddev): 14.9972/0.00 >> > >> > - With patchset: >> > Test execution summary: >> > total time: 15.0014s >> > total number of events: 7694 >> > total time taken by event execution: 14.9979 >> > per-request statistics: >> > min: 1.23ms >> > avg: 1.95ms >> > max: 10.49ms >> > approx. 95 percentile: 10.39ms >> > >> > Threads fairness: >> > events (avg/stddev): 7694.0000/0.00 >> > execution time (avg/stddev): 14.9979/0.00 >> > >> > The performance improvement is 56% for this use case. >> > >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar >> > problem as with rt_rq >> > >> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove >> > dl and rt from sched_rt_avg_update >> > >> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP >> > A test with iperf on hikey 6220 gives: >> > w/o patchset w/ patchset >> > Tx 276 Mbits/sec 304 Mbits/sec +10% >> > Rx 299 Mbits/sec 328 Mbits/sec +09% >> > >> > 8 iterations of iperf -c server_address -r -t 5 >> > stdev is lower than 1% >> > Only WFI idle state is enable (shallowest arm idle state) >> > >> > - Patches 9 removes the unused sched_avg_update code >> > >> > - Patch 10 removes the unused sched_time_avg_ms >> > >> > Change since v3: >> > - add support of periodic update of blocked utilization >> > - rebase on lastest tip/sched/core >> > >> > Change since v2: >> > - move pelt code into a dedicated pelt.c file >> > - rebase on load tracking changes >> > >> > Change since v1: >> > - Only a rebase. 
I have addressed the comments on previous version in >> > patch 1/2 >> > >> > Vincent Guittot (10): >> > sched/pelt: Move pelt related code in a dedicated file >> > sched/rt: add rt_rq utilization tracking >> > cpufreq/schedutil: add rt utilization tracking >> > sched/dl: add dl_rq utilization tracking >> > cpufreq/schedutil: get max utilization >> > sched: remove rt and dl from sched_avg >> > sched/irq: add irq utilization tracking >> > cpufreq/schedutil: take into account interrupt >> > sched: remove rt_avg code >> > proc/sched: remove unused sched_time_avg_ms >> > >> > include/linux/sched/sysctl.h | 1 - >> > kernel/sched/Makefile | 2 +- >> > kernel/sched/core.c | 38 +--- >> > kernel/sched/cpufreq_schedutil.c | 24 ++- >> > kernel/sched/deadline.c | 7 +- >> > kernel/sched/fair.c | 381 +++---------------------------------- >> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ >> > kernel/sched/pelt.h | 63 +++++++ >> > kernel/sched/rt.c | 10 +- >> > kernel/sched/sched.h | 57 ++++-- >> > kernel/sysctl.c | 8 - >> > 11 files changed, 563 insertions(+), 423 deletions(-) >> > create mode 100644 kernel/sched/pelt.c >> > create mode 100644 kernel/sched/pelt.h >> > >> > -- >> > 2.7.4 >> > > >
Hi Quentin,

On 05/06/18 11:57, Quentin Perret wrote:

[...]

> What about the diff below (just a quick hack to show the idea) applied
> on tip/sched/core ?
>
> ---8<---
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index a8ba6d1f262a..23a4fb1c2c25 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> 	sg_cpu->util_dl = cpu_util_dl(rq);
> }
>
> +unsigned long scale_rt_capacity(int cpu);
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> 	struct rq *rq = cpu_rq(sg_cpu->cpu);
> +	int cpu = sg_cpu->cpu;
> +	unsigned long util, dl_bw;
>
> 	if (rq->rt.rt_nr_running)
> 		return sg_cpu->max;
> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> 	 * ready for such an interface. So, we only do the latter for now.
> 	 */
> -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);

Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,
since we use max below, we will probably have the same problem that we
discussed on Vincent's approach (overestimation of DL contribution while
we could use running_bw).

> +	util >>= SCHED_CAPACITY_SHIFT;
> +	util = arch_scale_cpu_capacity(NULL, cpu) - util;
> +	util += sg_cpu->util_cfs;
> +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

Why this_bw instead of running_bw?

Thanks,

- Juri
On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote:
> Hi Quentin,
>
> On 05/06/18 11:57, Quentin Perret wrote:
>
> [...]
>
> > What about the diff below (just a quick hack to show the idea) applied
> > on tip/sched/core ?
> >
> > ---8<---
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index a8ba6d1f262a..23a4fb1c2c25 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > 	sg_cpu->util_dl = cpu_util_dl(rq);
> > }
> >
> > +unsigned long scale_rt_capacity(int cpu);
> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > {
> > 	struct rq *rq = cpu_rq(sg_cpu->cpu);
> > +	int cpu = sg_cpu->cpu;
> > +	unsigned long util, dl_bw;
> >
> > 	if (rq->rt.rt_nr_running)
> > 		return sg_cpu->max;
> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > 	 * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > 	 * ready for such an interface. So, we only do the latter for now.
> > 	 */
> > -	return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > +	util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu);
>
> Sorry to be pedantic, but this (ATM) includes DL avg contribution, so,
> since we use max below, we will probably have the same problem that we
> discussed on Vincent's approach (overestimation of DL contribution while
> we could use running_bw).

Ah no, you're right, this isn't great for long running deadline tasks. We should definitely account for the running_bw here, not the dl avg...

I was trying to address the issue of RT stealing time from CFS here, but the DL integration isn't quite right with this patch as-is, I agree ...

> > +	util >>= SCHED_CAPACITY_SHIFT;
> > +	util = arch_scale_cpu_capacity(NULL, cpu) - util;
> > +	util += sg_cpu->util_cfs;
> > +	dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>
> Why this_bw instead of running_bw?

So IIUC, this_bw should basically give you the absolute reservation (== the sum of runtime/deadline ratios of all DL tasks on that rq).

The reason I added this max is that I'm still not sure I understand how we can safely drop the freq below that point? If we don't guarantee to always stay at least at the freq required by DL, aren't we risking to have a deadline task stuck at a low freq because of rate limiting? In this case, if that task uses all of its runtime then you might start missing deadlines ...

My feeling is that the only safe thing to do is to guarantee to never go below the freq required by DL, and to optimistically add CFS tasks without raising the OPP if we have good reasons to think that DL is using less than it required (which is what we should get by using running_bw above I suppose). Does that make any sense ?

Thanks !
Quentin
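[Editorial note: a hedged sketch of the compromise described above, focusing only on the DL part. It is an illustration, not a posted patch; running_bw and this_bw are the existing dl_rq fields, and the conversion to the capacity range mirrors the quick hack above.]

/*
 * Illustration only: optimistically account DL via running_bw when
 * summing with CFS, but never request less than the admitted
 * reservation (this_bw).
 */
static unsigned long sugov_aggregate_util_dl(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	unsigned long util, dl_res, dl_run;

	/* Absolute reservation: sum of admitted runtime/deadline ratios. */
	dl_res = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* What the currently active DL tasks are asking for right now. */
	dl_run = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* Optimistically add CFS on top of DL's current needs ... */
	util = sg_cpu->util_cfs + dl_run;

	/* ... but never drop below the DL reservation. */
	return min(sg_cpu->max, max(util, dl_res));
}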
On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote: > On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote: > > Hi Vincent, > > > > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: > >> Hi Quentin, > >> > >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: > >> > This patchset initially tracked only the utilization of RT rq. During > >> > OSPM summit, it has been discussed the opportunity to extend it in order > >> > to get an estimate of the utilization of the CPU. > >> > > >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization > >> > tracking for rt_rq. > >> > > >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency > >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't > >> > reflect anymore the utilization of cfs tasks but only the remaining part that > >> > is not used by rt tasks. We should monitor the stolen utilization and take > >> > it into account when selecting OPP. This patchset doesn't change the OPP > >> > selection policy for RT tasks but only for CFS tasks > >> > > >> > A rt-app use case which creates an always running cfs thread and a rt threads > >> > that wakes up periodically with both threads pinned on same CPU, show lot of > >> > frequency switches of the CPU whereas the CPU never goes idles during the > >> > test. I can share the json file that I used for the test if someone is > >> > interested in. > >> > > >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), > >> > the cpufreq statistics outputs (stats are reset just before the test) : > >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > >> > without patchset : 1230 > >> > with patchset : 14 > >> > >> I have attached the rt-app json file that I use for this test > > > > Thank you very much ! I did a quick test with a much simpler fix to this > > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). > > I get the following results on Hikey960: > > > > Without patch: > > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > > 12 > > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > > 640 > > With patch > > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > > 8 > > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > > 12 > > > > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think > > this is an actual issue for realistic use-cases ? > > yes I think that it's worth syncing and consolidating things on the > same metric. The result will be saner and more robust as we will have > the same behavior TBH I'm not disagreeing with that, the PELT-everywhere approach feels cleaner in a way, but do you have a use-case in mind where this will definitely help ? I mean, yes the rt_avg is a slow response to the RT pressure, but is this always a problem ? Ramping down slower might actually help in some cases no ? > > > > > What about the diff below (just a quick hack to show the idea) applied > > on tip/sched/core ? 
> > > > ---8<--- > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > --- a/kernel/sched/cpufreq_schedutil.c > > +++ b/kernel/sched/cpufreq_schedutil.c > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > sg_cpu->util_dl = cpu_util_dl(rq); > > } > > > > +unsigned long scale_rt_capacity(int cpu); > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > { > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > + int cpu = sg_cpu->cpu; > > + unsigned long util, dl_bw; > > > > if (rq->rt.rt_nr_running) > > return sg_cpu->max; > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > * ready for such an interface. So, we only do the latter for now. > > */ > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > + util >>= SCHED_CAPACITY_SHIFT; > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > + util += sg_cpu->util_cfs; > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > + > > + /* Make sure to always provide the reserved freq to DL. */ > > + return max(util, dl_bw); > > } > > > > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index f01f0f395f9a..0e87cbe47c8b 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, > > return load_idx; > > } > > > > -static unsigned long scale_rt_capacity(int cpu) > > +unsigned long scale_rt_capacity(int cpu) > > { > > struct rq *rq = cpu_rq(cpu); > > u64 total, used, age_stamp, avg; > > --->8--- > > > > > > > >> > >> > > >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see > >> > performance improvements: > >> > > >> > - Without patchset : > >> > Test execution summary: > >> > total time: 15.0009s > >> > total number of events: 4903 > >> > total time taken by event execution: 14.9972 > >> > per-request statistics: > >> > min: 1.23ms > >> > avg: 3.06ms > >> > max: 13.16ms > >> > approx. 95 percentile: 12.73ms > >> > > >> > Threads fairness: > >> > events (avg/stddev): 4903.0000/0.00 > >> > execution time (avg/stddev): 14.9972/0.00 > >> > > >> > - With patchset: > >> > Test execution summary: > >> > total time: 15.0014s > >> > total number of events: 7694 > >> > total time taken by event execution: 14.9979 > >> > per-request statistics: > >> > min: 1.23ms > >> > avg: 1.95ms > >> > max: 10.49ms > >> > approx. 95 percentile: 10.39ms > >> > > >> > Threads fairness: > >> > events (avg/stddev): 7694.0000/0.00 > >> > execution time (avg/stddev): 14.9979/0.00 > >> > > >> > The performance improvement is 56% for this use case. 
> >> > > >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar > >> > problem as with rt_rq > >> > > >> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove > >> > dl and rt from sched_rt_avg_update > >> > > >> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP > >> > A test with iperf on hikey 6220 gives: > >> > w/o patchset w/ patchset > >> > Tx 276 Mbits/sec 304 Mbits/sec +10% > >> > Rx 299 Mbits/sec 328 Mbits/sec +09% > >> > > >> > 8 iterations of iperf -c server_address -r -t 5 > >> > stdev is lower than 1% > >> > Only WFI idle state is enable (shallowest arm idle state) > >> > > >> > - Patches 9 removes the unused sched_avg_update code > >> > > >> > - Patch 10 removes the unused sched_time_avg_ms > >> > > >> > Change since v3: > >> > - add support of periodic update of blocked utilization > >> > - rebase on lastest tip/sched/core > >> > > >> > Change since v2: > >> > - move pelt code into a dedicated pelt.c file > >> > - rebase on load tracking changes > >> > > >> > Change since v1: > >> > - Only a rebase. I have addressed the comments on previous version in > >> > patch 1/2 > >> > > >> > Vincent Guittot (10): > >> > sched/pelt: Move pelt related code in a dedicated file > >> > sched/rt: add rt_rq utilization tracking > >> > cpufreq/schedutil: add rt utilization tracking > >> > sched/dl: add dl_rq utilization tracking > >> > cpufreq/schedutil: get max utilization > >> > sched: remove rt and dl from sched_avg > >> > sched/irq: add irq utilization tracking > >> > cpufreq/schedutil: take into account interrupt > >> > sched: remove rt_avg code > >> > proc/sched: remove unused sched_time_avg_ms > >> > > >> > include/linux/sched/sysctl.h | 1 - > >> > kernel/sched/Makefile | 2 +- > >> > kernel/sched/core.c | 38 +--- > >> > kernel/sched/cpufreq_schedutil.c | 24 ++- > >> > kernel/sched/deadline.c | 7 +- > >> > kernel/sched/fair.c | 381 +++---------------------------------- > >> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ > >> > kernel/sched/pelt.h | 63 +++++++ > >> > kernel/sched/rt.c | 10 +- > >> > kernel/sched/sched.h | 57 ++++-- > >> > kernel/sysctl.c | 8 - > >> > 11 files changed, 563 insertions(+), 423 deletions(-) > >> > create mode 100644 kernel/sched/pelt.c > >> > create mode 100644 kernel/sched/pelt.h > >> > > >> > -- > >> > 2.7.4 > >> > > > > >
On 05/06/18 14:05, Quentin Perret wrote: > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > Hi Quentin, > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > [...] > > > > > What about the diff below (just a quick hack to show the idea) applied > > > on tip/sched/core ? > > > > > > ---8<--- > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > --- a/kernel/sched/cpufreq_schedutil.c > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > } > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > { > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > + int cpu = sg_cpu->cpu; > > > + unsigned long util, dl_bw; > > > > > > if (rq->rt.rt_nr_running) > > > return sg_cpu->max; > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > * ready for such an interface. So, we only do the latter for now. > > > */ > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > since we use max below, we will probably have the same problem that we > > discussed on Vincent's approach (overestimation of DL contribution while > > we could use running_bw). > > Ah no, you're right, this isn't great for long running deadline tasks. > We should definitely account for the running_bw here, not the dl avg... > > I was trying to address the issue of RT stealing time from CFS here, but > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > + util += sg_cpu->util_cfs; > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > Why this_bw instead of running_bw? > > So IIUC, this_bw should basically give you the absolute reservation (== the > sum of runtime/deadline ratios of all DL tasks on that rq). Yep. > The reason I added this max is because I'm still not sure to understand > how we can safely drop the freq below that point ? If we don't guarantee > to always stay at least at the freq required by DL, aren't we risking to > start a deadline tasks stuck at a low freq because of rate limiting ? In > this case, if that tasks uses all of its runtime then you might start > missing deadlines ... We decided to avoid (software) rate limiting for DL with e97a90f7069b ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > My feeling is that the only safe thing to do is to guarantee to never go > below the freq required by DL, and to optimistically add CFS tasks > without raising the OPP if we have good reasons to think that DL is > using less than it required (which is what we should get by using > running_bw above I suppose). Does that make any sense ? Then we can't still avoid the hardware limits, so using running_bw is a trade off between safety (especially considering soft real-time scenarios) and energy consumption (which seems to be working in practice). Thanks, - Juri
On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote: > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote: >> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote: >> > Hi Vincent, >> > >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: >> >> Hi Quentin, >> >> >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: >> >> > This patchset initially tracked only the utilization of RT rq. During >> >> > OSPM summit, it has been discussed the opportunity to extend it in order >> >> > to get an estimate of the utilization of the CPU. >> >> > >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization >> >> > tracking for rt_rq. >> >> > >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that >> >> > is not used by rt tasks. We should monitor the stolen utilization and take >> >> > it into account when selecting OPP. This patchset doesn't change the OPP >> >> > selection policy for RT tasks but only for CFS tasks >> >> > >> >> > A rt-app use case which creates an always running cfs thread and a rt threads >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of >> >> > frequency switches of the CPU whereas the CPU never goes idles during the >> >> > test. I can share the json file that I used for the test if someone is >> >> > interested in. >> >> > >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), >> >> > the cpufreq statistics outputs (stats are reset just before the test) : >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> >> > without patchset : 1230 >> >> > with patchset : 14 >> >> >> >> I have attached the rt-app json file that I use for this test >> > >> > Thank you very much ! I did a quick test with a much simpler fix to this >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). >> > I get the following results on Hikey960: >> > >> > Without patch: >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> > 12 >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans >> > 640 >> > With patch >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> > 8 >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans >> > 12 >> > >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think >> > this is an actual issue for realistic use-cases ? >> >> yes I think that it's worth syncing and consolidating things on the >> same metric. The result will be saner and more robust as we will have >> the same behavior > > TBH I'm not disagreeing with that, the PELT-everywhere approach feels > cleaner in a way, but do you have a use-case in mind where this will > definitely help ? > > I mean, yes the rt_avg is a slow response to the RT pressure, but is > this always a problem ? Ramping down slower might actually help in some > cases no ? I would say no because when one will decrease the other one will not increase at the same pace and we will have some wrong behavior or decision > >> >> > >> > What about the diff below (just a quick hack to show the idea) applied >> > on tip/sched/core ? 
>> > >> > ---8<--- >> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >> > index a8ba6d1f262a..23a4fb1c2c25 100644 >> > --- a/kernel/sched/cpufreq_schedutil.c >> > +++ b/kernel/sched/cpufreq_schedutil.c >> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) >> > sg_cpu->util_dl = cpu_util_dl(rq); >> > } >> > >> > +unsigned long scale_rt_capacity(int cpu); >> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >> > { >> > struct rq *rq = cpu_rq(sg_cpu->cpu); >> > + int cpu = sg_cpu->cpu; >> > + unsigned long util, dl_bw; >> > >> > if (rq->rt.rt_nr_running) >> > return sg_cpu->max; >> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet >> > * ready for such an interface. So, we only do the latter for now. >> > */ >> > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); >> > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); >> > + util >>= SCHED_CAPACITY_SHIFT; >> > + util = arch_scale_cpu_capacity(NULL, cpu) - util; >> > + util += sg_cpu->util_cfs; >> > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; >> > + >> > + /* Make sure to always provide the reserved freq to DL. */ >> > + return max(util, dl_bw); >> > } >> > >> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> > index f01f0f395f9a..0e87cbe47c8b 100644 >> > --- a/kernel/sched/fair.c >> > +++ b/kernel/sched/fair.c >> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, >> > return load_idx; >> > } >> > >> > -static unsigned long scale_rt_capacity(int cpu) >> > +unsigned long scale_rt_capacity(int cpu) >> > { >> > struct rq *rq = cpu_rq(cpu); >> > u64 total, used, age_stamp, avg; >> > --->8--- >> > >> > >> > >> >> >> >> > >> >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see >> >> > performance improvements: >> >> > >> >> > - Without patchset : >> >> > Test execution summary: >> >> > total time: 15.0009s >> >> > total number of events: 4903 >> >> > total time taken by event execution: 14.9972 >> >> > per-request statistics: >> >> > min: 1.23ms >> >> > avg: 3.06ms >> >> > max: 13.16ms >> >> > approx. 95 percentile: 12.73ms >> >> > >> >> > Threads fairness: >> >> > events (avg/stddev): 4903.0000/0.00 >> >> > execution time (avg/stddev): 14.9972/0.00 >> >> > >> >> > - With patchset: >> >> > Test execution summary: >> >> > total time: 15.0014s >> >> > total number of events: 7694 >> >> > total time taken by event execution: 14.9979 >> >> > per-request statistics: >> >> > min: 1.23ms >> >> > avg: 1.95ms >> >> > max: 10.49ms >> >> > approx. 95 percentile: 10.39ms >> >> > >> >> > Threads fairness: >> >> > events (avg/stddev): 7694.0000/0.00 >> >> > execution time (avg/stddev): 14.9979/0.00 >> >> > >> >> > The performance improvement is 56% for this use case. 
>> >> > >> >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar >> >> > problem as with rt_rq >> >> > >> >> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove >> >> > dl and rt from sched_rt_avg_update >> >> > >> >> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP >> >> > A test with iperf on hikey 6220 gives: >> >> > w/o patchset w/ patchset >> >> > Tx 276 Mbits/sec 304 Mbits/sec +10% >> >> > Rx 299 Mbits/sec 328 Mbits/sec +09% >> >> > >> >> > 8 iterations of iperf -c server_address -r -t 5 >> >> > stdev is lower than 1% >> >> > Only WFI idle state is enable (shallowest arm idle state) >> >> > >> >> > - Patches 9 removes the unused sched_avg_update code >> >> > >> >> > - Patch 10 removes the unused sched_time_avg_ms >> >> > >> >> > Change since v3: >> >> > - add support of periodic update of blocked utilization >> >> > - rebase on lastest tip/sched/core >> >> > >> >> > Change since v2: >> >> > - move pelt code into a dedicated pelt.c file >> >> > - rebase on load tracking changes >> >> > >> >> > Change since v1: >> >> > - Only a rebase. I have addressed the comments on previous version in >> >> > patch 1/2 >> >> > >> >> > Vincent Guittot (10): >> >> > sched/pelt: Move pelt related code in a dedicated file >> >> > sched/rt: add rt_rq utilization tracking >> >> > cpufreq/schedutil: add rt utilization tracking >> >> > sched/dl: add dl_rq utilization tracking >> >> > cpufreq/schedutil: get max utilization >> >> > sched: remove rt and dl from sched_avg >> >> > sched/irq: add irq utilization tracking >> >> > cpufreq/schedutil: take into account interrupt >> >> > sched: remove rt_avg code >> >> > proc/sched: remove unused sched_time_avg_ms >> >> > >> >> > include/linux/sched/sysctl.h | 1 - >> >> > kernel/sched/Makefile | 2 +- >> >> > kernel/sched/core.c | 38 +--- >> >> > kernel/sched/cpufreq_schedutil.c | 24 ++- >> >> > kernel/sched/deadline.c | 7 +- >> >> > kernel/sched/fair.c | 381 +++---------------------------------- >> >> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ >> >> > kernel/sched/pelt.h | 63 +++++++ >> >> > kernel/sched/rt.c | 10 +- >> >> > kernel/sched/sched.h | 57 ++++-- >> >> > kernel/sysctl.c | 8 - >> >> > 11 files changed, 563 insertions(+), 423 deletions(-) >> >> > create mode 100644 kernel/sched/pelt.c >> >> > create mode 100644 kernel/sched/pelt.h >> >> > >> >> > -- >> >> > 2.7.4 >> >> > >> > >> >
On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote: > On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote: > > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote: > >> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote: > >> > Hi Vincent, > >> > > >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: > >> >> Hi Quentin, > >> >> > >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: > >> >> > This patchset initially tracked only the utilization of RT rq. During > >> >> > OSPM summit, it has been discussed the opportunity to extend it in order > >> >> > to get an estimate of the utilization of the CPU. > >> >> > > >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization > >> >> > tracking for rt_rq. > >> >> > > >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency > >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't > >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that > >> >> > is not used by rt tasks. We should monitor the stolen utilization and take > >> >> > it into account when selecting OPP. This patchset doesn't change the OPP > >> >> > selection policy for RT tasks but only for CFS tasks > >> >> > > >> >> > A rt-app use case which creates an always running cfs thread and a rt threads > >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of > >> >> > frequency switches of the CPU whereas the CPU never goes idles during the > >> >> > test. I can share the json file that I used for the test if someone is > >> >> > interested in. > >> >> > > >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), > >> >> > the cpufreq statistics outputs (stats are reset just before the test) : > >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > >> >> > without patchset : 1230 > >> >> > with patchset : 14 > >> >> > >> >> I have attached the rt-app json file that I use for this test > >> > > >> > Thank you very much ! I did a quick test with a much simpler fix to this > >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). > >> > I get the following results on Hikey960: > >> > > >> > Without patch: > >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > >> > 12 > >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > >> > 640 > >> > With patch > >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans > >> > 8 > >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans > >> > 12 > >> > > >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think > >> > this is an actual issue for realistic use-cases ? > >> > >> yes I think that it's worth syncing and consolidating things on the > >> same metric. The result will be saner and more robust as we will have > >> the same behavior > > > > TBH I'm not disagreeing with that, the PELT-everywhere approach feels > > cleaner in a way, but do you have a use-case in mind where this will > > definitely help ? > > > > I mean, yes the rt_avg is a slow response to the RT pressure, but is > > this always a problem ? Ramping down slower might actually help in some > > cases no ? > > I would say no because when one will decrease the other one will not > increase at the same pace and we will have some wrong behavior or > decision I think I get your point. 
Yes, sometimes, the slow-moving rt_avg can be off a little bit (which can be good or bad, depending in the case) if your RT task runs a lot with very changing behaviour. And again, I'm not fundamentally against the idea of having extra complexity for RT/IRQ PELT signals _if_ we have a use-case. But is there a real use-case where we really need all of that ? That's a true question, I honestly don't have the answer :-) > > > > >> > >> > > >> > What about the diff below (just a quick hack to show the idea) applied > >> > on tip/sched/core ? > >> > > >> > ---8<--- > >> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > >> > index a8ba6d1f262a..23a4fb1c2c25 100644 > >> > --- a/kernel/sched/cpufreq_schedutil.c > >> > +++ b/kernel/sched/cpufreq_schedutil.c > >> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > >> > sg_cpu->util_dl = cpu_util_dl(rq); > >> > } > >> > > >> > +unsigned long scale_rt_capacity(int cpu); > >> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > >> > { > >> > struct rq *rq = cpu_rq(sg_cpu->cpu); > >> > + int cpu = sg_cpu->cpu; > >> > + unsigned long util, dl_bw; > >> > > >> > if (rq->rt.rt_nr_running) > >> > return sg_cpu->max; > >> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > >> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > >> > * ready for such an interface. So, we only do the latter for now. > >> > */ > >> > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > >> > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > >> > + util >>= SCHED_CAPACITY_SHIFT; > >> > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > >> > + util += sg_cpu->util_cfs; > >> > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > >> > + > >> > + /* Make sure to always provide the reserved freq to DL. */ > >> > + return max(util, dl_bw); > >> > } > >> > > >> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) > >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> > index f01f0f395f9a..0e87cbe47c8b 100644 > >> > --- a/kernel/sched/fair.c > >> > +++ b/kernel/sched/fair.c > >> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, > >> > return load_idx; > >> > } > >> > > >> > -static unsigned long scale_rt_capacity(int cpu) > >> > +unsigned long scale_rt_capacity(int cpu) > >> > { > >> > struct rq *rq = cpu_rq(cpu); > >> > u64 total, used, age_stamp, avg; > >> > --->8--- > >> > > >> > > >> > > >> >> > >> >> > > >> >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see > >> >> > performance improvements: > >> >> > > >> >> > - Without patchset : > >> >> > Test execution summary: > >> >> > total time: 15.0009s > >> >> > total number of events: 4903 > >> >> > total time taken by event execution: 14.9972 > >> >> > per-request statistics: > >> >> > min: 1.23ms > >> >> > avg: 3.06ms > >> >> > max: 13.16ms > >> >> > approx. 95 percentile: 12.73ms > >> >> > > >> >> > Threads fairness: > >> >> > events (avg/stddev): 4903.0000/0.00 > >> >> > execution time (avg/stddev): 14.9972/0.00 > >> >> > > >> >> > - With patchset: > >> >> > Test execution summary: > >> >> > total time: 15.0014s > >> >> > total number of events: 7694 > >> >> > total time taken by event execution: 14.9979 > >> >> > per-request statistics: > >> >> > min: 1.23ms > >> >> > avg: 1.95ms > >> >> > max: 10.49ms > >> >> > approx. 
95 percentile: 10.39ms > >> >> > > >> >> > Threads fairness: > >> >> > events (avg/stddev): 7694.0000/0.00 > >> >> > execution time (avg/stddev): 14.9979/0.00 > >> >> > > >> >> > The performance improvement is 56% for this use case. > >> >> > > >> >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar > >> >> > problem as with rt_rq > >> >> > > >> >> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove > >> >> > dl and rt from sched_rt_avg_update > >> >> > > >> >> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP > >> >> > A test with iperf on hikey 6220 gives: > >> >> > w/o patchset w/ patchset > >> >> > Tx 276 Mbits/sec 304 Mbits/sec +10% > >> >> > Rx 299 Mbits/sec 328 Mbits/sec +09% > >> >> > > >> >> > 8 iterations of iperf -c server_address -r -t 5 > >> >> > stdev is lower than 1% > >> >> > Only WFI idle state is enable (shallowest arm idle state) > >> >> > > >> >> > - Patches 9 removes the unused sched_avg_update code > >> >> > > >> >> > - Patch 10 removes the unused sched_time_avg_ms > >> >> > > >> >> > Change since v3: > >> >> > - add support of periodic update of blocked utilization > >> >> > - rebase on lastest tip/sched/core > >> >> > > >> >> > Change since v2: > >> >> > - move pelt code into a dedicated pelt.c file > >> >> > - rebase on load tracking changes > >> >> > > >> >> > Change since v1: > >> >> > - Only a rebase. I have addressed the comments on previous version in > >> >> > patch 1/2 > >> >> > > >> >> > Vincent Guittot (10): > >> >> > sched/pelt: Move pelt related code in a dedicated file > >> >> > sched/rt: add rt_rq utilization tracking > >> >> > cpufreq/schedutil: add rt utilization tracking > >> >> > sched/dl: add dl_rq utilization tracking > >> >> > cpufreq/schedutil: get max utilization > >> >> > sched: remove rt and dl from sched_avg > >> >> > sched/irq: add irq utilization tracking > >> >> > cpufreq/schedutil: take into account interrupt > >> >> > sched: remove rt_avg code > >> >> > proc/sched: remove unused sched_time_avg_ms > >> >> > > >> >> > include/linux/sched/sysctl.h | 1 - > >> >> > kernel/sched/Makefile | 2 +- > >> >> > kernel/sched/core.c | 38 +--- > >> >> > kernel/sched/cpufreq_schedutil.c | 24 ++- > >> >> > kernel/sched/deadline.c | 7 +- > >> >> > kernel/sched/fair.c | 381 +++---------------------------------- > >> >> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ > >> >> > kernel/sched/pelt.h | 63 +++++++ > >> >> > kernel/sched/rt.c | 10 +- > >> >> > kernel/sched/sched.h | 57 ++++-- > >> >> > kernel/sysctl.c | 8 - > >> >> > 11 files changed, 563 insertions(+), 423 deletions(-) > >> >> > create mode 100644 kernel/sched/pelt.c > >> >> > create mode 100644 kernel/sched/pelt.h > >> >> > > >> >> > -- > >> >> > 2.7.4 > >> >> > > >> > > >> >
On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote: > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote: >> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote: >> > On Tuesday 05 Jun 2018 at 13:59:56 (+0200), Vincent Guittot wrote: >> >> On 5 June 2018 at 12:57, Quentin Perret <quentin.perret@arm.com> wrote: >> >> > Hi Vincent, >> >> > >> >> > On Tuesday 05 Jun 2018 at 10:36:26 (+0200), Vincent Guittot wrote: >> >> >> Hi Quentin, >> >> >> >> >> >> On 25 May 2018 at 15:12, Vincent Guittot <vincent.guittot@linaro.org> wrote: >> >> >> > This patchset initially tracked only the utilization of RT rq. During >> >> >> > OSPM summit, it has been discussed the opportunity to extend it in order >> >> >> > to get an estimate of the utilization of the CPU. >> >> >> > >> >> >> > - Patches 1-3 correspond to the content of patchset v4 and add utilization >> >> >> > tracking for rt_rq. >> >> >> > >> >> >> > When both cfs and rt tasks compete to run on a CPU, we can see some frequency >> >> >> > drops with schedutil governor. In such case, the cfs_rq's utilization doesn't >> >> >> > reflect anymore the utilization of cfs tasks but only the remaining part that >> >> >> > is not used by rt tasks. We should monitor the stolen utilization and take >> >> >> > it into account when selecting OPP. This patchset doesn't change the OPP >> >> >> > selection policy for RT tasks but only for CFS tasks >> >> >> > >> >> >> > A rt-app use case which creates an always running cfs thread and a rt threads >> >> >> > that wakes up periodically with both threads pinned on same CPU, show lot of >> >> >> > frequency switches of the CPU whereas the CPU never goes idles during the >> >> >> > test. I can share the json file that I used for the test if someone is >> >> >> > interested in. >> >> >> > >> >> >> > For a 15 seconds long test on a hikey 6220 (octo core cortex A53 platfrom), >> >> >> > the cpufreq statistics outputs (stats are reset just before the test) : >> >> >> > $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> >> >> > without patchset : 1230 >> >> >> > with patchset : 14 >> >> >> >> >> >> I have attached the rt-app json file that I use for this test >> >> > >> >> > Thank you very much ! I did a quick test with a much simpler fix to this >> >> > RT-steals-time-from-CFS issue using just the existing scale_rt_capacity(). >> >> > I get the following results on Hikey960: >> >> > >> >> > Without patch: >> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> >> > 12 >> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans >> >> > 640 >> >> > With patch >> >> > cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans >> >> > 8 >> >> > cat /sys/devices/system/cpu/cpufreq/policy4/stats/total_trans >> >> > 12 >> >> > >> >> > Yes the rt_avg stuff is out of sync with the PELT signal, but do you think >> >> > this is an actual issue for realistic use-cases ? >> >> >> >> yes I think that it's worth syncing and consolidating things on the >> >> same metric. The result will be saner and more robust as we will have >> >> the same behavior >> > >> > TBH I'm not disagreeing with that, the PELT-everywhere approach feels >> > cleaner in a way, but do you have a use-case in mind where this will >> > definitely help ? >> > >> > I mean, yes the rt_avg is a slow response to the RT pressure, but is >> > this always a problem ? Ramping down slower might actually help in some >> > cases no ? 
>> >> I would say no because when one will decrease the other one will not >> increase at the same pace and we will have some wrong behavior or >> decision > > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be > off a little bit (which can be good or bad, depending in the case) if your > RT task runs a lot with very changing behaviour. And again, I'm not > fundamentally against the idea of having extra complexity for RT/IRQ PELT > signals _if_ we have a use-case. But is there a real use-case where we > really need all of that ? That's a true question, I honestly don't have > the answer :-) The iperf test result is another example of the benefit > >> >> > >> >> >> >> > >> >> > What about the diff below (just a quick hack to show the idea) applied >> >> > on tip/sched/core ? >> >> > >> >> > ---8<--- >> >> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >> >> > index a8ba6d1f262a..23a4fb1c2c25 100644 >> >> > --- a/kernel/sched/cpufreq_schedutil.c >> >> > +++ b/kernel/sched/cpufreq_schedutil.c >> >> > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) >> >> > sg_cpu->util_dl = cpu_util_dl(rq); >> >> > } >> >> > >> >> > +unsigned long scale_rt_capacity(int cpu); >> >> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >> >> > { >> >> > struct rq *rq = cpu_rq(sg_cpu->cpu); >> >> > + int cpu = sg_cpu->cpu; >> >> > + unsigned long util, dl_bw; >> >> > >> >> > if (rq->rt.rt_nr_running) >> >> > return sg_cpu->max; >> >> > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >> >> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet >> >> > * ready for such an interface. So, we only do the latter for now. >> >> > */ >> >> > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); >> >> > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); >> >> > + util >>= SCHED_CAPACITY_SHIFT; >> >> > + util = arch_scale_cpu_capacity(NULL, cpu) - util; >> >> > + util += sg_cpu->util_cfs; >> >> > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; >> >> > + >> >> > + /* Make sure to always provide the reserved freq to DL. */ >> >> > + return max(util, dl_bw); >> >> > } >> >> > >> >> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags) >> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> >> > index f01f0f395f9a..0e87cbe47c8b 100644 >> >> > --- a/kernel/sched/fair.c >> >> > +++ b/kernel/sched/fair.c >> >> > @@ -7868,7 +7868,7 @@ static inline int get_sd_load_idx(struct sched_domain *sd, >> >> > return load_idx; >> >> > } >> >> > >> >> > -static unsigned long scale_rt_capacity(int cpu) >> >> > +unsigned long scale_rt_capacity(int cpu) >> >> > { >> >> > struct rq *rq = cpu_rq(cpu); >> >> > u64 total, used, age_stamp, avg; >> >> > --->8--- >> >> > >> >> > >> >> > >> >> >> >> >> >> > >> >> >> > If we replace the cfs thread of rt-app by a sysbench cpu test, we can see >> >> >> > performance improvements: >> >> >> > >> >> >> > - Without patchset : >> >> >> > Test execution summary: >> >> >> > total time: 15.0009s >> >> >> > total number of events: 4903 >> >> >> > total time taken by event execution: 14.9972 >> >> >> > per-request statistics: >> >> >> > min: 1.23ms >> >> >> > avg: 3.06ms >> >> >> > max: 13.16ms >> >> >> > approx. 
95 percentile: 12.73ms >> >> >> > >> >> >> > Threads fairness: >> >> >> > events (avg/stddev): 4903.0000/0.00 >> >> >> > execution time (avg/stddev): 14.9972/0.00 >> >> >> > >> >> >> > - With patchset: >> >> >> > Test execution summary: >> >> >> > total time: 15.0014s >> >> >> > total number of events: 7694 >> >> >> > total time taken by event execution: 14.9979 >> >> >> > per-request statistics: >> >> >> > min: 1.23ms >> >> >> > avg: 1.95ms >> >> >> > max: 10.49ms >> >> >> > approx. 95 percentile: 10.39ms >> >> >> > >> >> >> > Threads fairness: >> >> >> > events (avg/stddev): 7694.0000/0.00 >> >> >> > execution time (avg/stddev): 14.9979/0.00 >> >> >> > >> >> >> > The performance improvement is 56% for this use case. >> >> >> > >> >> >> > - Patches 4-5 add utilization tracking for dl_rq in order to solve similar >> >> >> > problem as with rt_rq >> >> >> > >> >> >> > - Patches 6 uses dl and rt utilization in the scale_rt_capacity() and remove >> >> >> > dl and rt from sched_rt_avg_update >> >> >> > >> >> >> > - Patches 7-8 add utilization tracking for interrupt and use it select OPP >> >> >> > A test with iperf on hikey 6220 gives: >> >> >> > w/o patchset w/ patchset >> >> >> > Tx 276 Mbits/sec 304 Mbits/sec +10% >> >> >> > Rx 299 Mbits/sec 328 Mbits/sec +09% >> >> >> > >> >> >> > 8 iterations of iperf -c server_address -r -t 5 >> >> >> > stdev is lower than 1% >> >> >> > Only WFI idle state is enable (shallowest arm idle state) >> >> >> > >> >> >> > - Patches 9 removes the unused sched_avg_update code >> >> >> > >> >> >> > - Patch 10 removes the unused sched_time_avg_ms >> >> >> > >> >> >> > Change since v3: >> >> >> > - add support of periodic update of blocked utilization >> >> >> > - rebase on lastest tip/sched/core >> >> >> > >> >> >> > Change since v2: >> >> >> > - move pelt code into a dedicated pelt.c file >> >> >> > - rebase on load tracking changes >> >> >> > >> >> >> > Change since v1: >> >> >> > - Only a rebase. I have addressed the comments on previous version in >> >> >> > patch 1/2 >> >> >> > >> >> >> > Vincent Guittot (10): >> >> >> > sched/pelt: Move pelt related code in a dedicated file >> >> >> > sched/rt: add rt_rq utilization tracking >> >> >> > cpufreq/schedutil: add rt utilization tracking >> >> >> > sched/dl: add dl_rq utilization tracking >> >> >> > cpufreq/schedutil: get max utilization >> >> >> > sched: remove rt and dl from sched_avg >> >> >> > sched/irq: add irq utilization tracking >> >> >> > cpufreq/schedutil: take into account interrupt >> >> >> > sched: remove rt_avg code >> >> >> > proc/sched: remove unused sched_time_avg_ms >> >> >> > >> >> >> > include/linux/sched/sysctl.h | 1 - >> >> >> > kernel/sched/Makefile | 2 +- >> >> >> > kernel/sched/core.c | 38 +--- >> >> >> > kernel/sched/cpufreq_schedutil.c | 24 ++- >> >> >> > kernel/sched/deadline.c | 7 +- >> >> >> > kernel/sched/fair.c | 381 +++---------------------------------- >> >> >> > kernel/sched/pelt.c | 395 +++++++++++++++++++++++++++++++++++++++ >> >> >> > kernel/sched/pelt.h | 63 +++++++ >> >> >> > kernel/sched/rt.c | 10 +- >> >> >> > kernel/sched/sched.h | 57 ++++-- >> >> >> > kernel/sysctl.c | 8 - >> >> >> > 11 files changed, 563 insertions(+), 423 deletions(-) >> >> >> > create mode 100644 kernel/sched/pelt.c >> >> >> > create mode 100644 kernel/sched/pelt.h >> >> >> > >> >> >> > -- >> >> >> > 2.7.4 >> >> >> > >> >> > >> >> >
On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > On 05/06/18 14:05, Quentin Perret wrote: > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > Hi Quentin, > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > [...] > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > on tip/sched/core ? > > > > > > > > ---8<--- > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > } > > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > { > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > + int cpu = sg_cpu->cpu; > > > > + unsigned long util, dl_bw; > > > > > > > > if (rq->rt.rt_nr_running) > > > > return sg_cpu->max; > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > * ready for such an interface. So, we only do the latter for now. > > > > */ > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > since we use max below, we will probably have the same problem that we > > > discussed on Vincent's approach (overestimation of DL contribution while > > > we could use running_bw). > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > We should definitely account for the running_bw here, not the dl avg... > > > > I was trying to address the issue of RT stealing time from CFS here, but > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > + util += sg_cpu->util_cfs; > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > Why this_bw instead of running_bw? > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > sum of runtime/deadline ratios of all DL tasks on that rq). > > Yep. > > > The reason I added this max is because I'm still not sure to understand > > how we can safely drop the freq below that point ? If we don't guarantee > > to always stay at least at the freq required by DL, aren't we risking to > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > this case, if that tasks uses all of its runtime then you might start > > missing deadlines ... > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). Right, I spotted that one, but yeah you could also be limited by HW ... > > > My feeling is that the only safe thing to do is to guarantee to never go > > below the freq required by DL, and to optimistically add CFS tasks > > without raising the OPP if we have good reasons to think that DL is > > using less than it required (which is what we should get by using > > running_bw above I suppose). Does that make any sense ? 
> > Then we can't still avoid the hardware limits, so using running_bw is a > trade off between safety (especially considering soft real-time > scenarios) and energy consumption (which seems to be working in > practice). Ok, I see ... Have you guys already tried something like my patch above (keeping the freq >= this_bw) in real-world use cases ? Is this costing that much energy in practice ? If we fill the gaps left by DL (when it doesn't use all the runtime) with CFS tasks that might not be so bad ... Thank you very much for taking the time to explain all this, I really appreciate :-) Quentin
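The dl_bw term in the hack above is just the reserved DL bandwidth rescaled from its fixed-point representation (BW_SHIFT fractional bits) to the 1024-based capacity scale. A minimal sketch of that conversion, with a hypothetical helper name:

/*
 * Hypothetical helper (not part of the patch set): convert a deadline
 * bandwidth value, stored with BW_SHIFT fractional bits, to the
 * SCHED_CAPACITY_SCALE range used for frequency selection.
 */
static unsigned long dl_bw_to_capacity(u64 bw)
{
	return (unsigned long)((bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT);
}

/* floor with the hack above:             dl_bw_to_capacity(rq->dl.this_bw)    */
/* floor with GRUB-style reclaiming (PA): dl_bw_to_capacity(rq->dl.running_bw) */

The first keeps the frequency request at the full static reservation (the sum of runtime/deadline ratios); the second only accounts for tasks that are currently active, which is exactly the trade-off discussed below.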
On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote: > On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote: > > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote: > >> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote: > >> I would say no because when one will decrease the other one will not > >> increase at the same pace and we will have some wrong behavior or > >> decision > > > > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be > > off a little bit (which can be good or bad, depending in the case) if your > > RT task runs a lot with very changing behaviour. And again, I'm not > > fundamentally against the idea of having extra complexity for RT/IRQ PELT > > signals _if_ we have a use-case. But is there a real use-case where we > > really need all of that ? That's a true question, I honestly don't have > > the answer :-) > > The iperf test result is another example of the benefit The iperf test result ? The sysbench test you mean ?
On 05/06/18 15:01, Quentin Perret wrote: > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > > On 05/06/18 14:05, Quentin Perret wrote: > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > > Hi Quentin, > > > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > > > [...] > > > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > > on tip/sched/core ? > > > > > > > > > > ---8<--- > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > > } > > > > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > { > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > > + int cpu = sg_cpu->cpu; > > > > > + unsigned long util, dl_bw; > > > > > > > > > > if (rq->rt.rt_nr_running) > > > > > return sg_cpu->max; > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > > * ready for such an interface. So, we only do the latter for now. > > > > > */ > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > > since we use max below, we will probably have the same problem that we > > > > discussed on Vincent's approach (overestimation of DL contribution while > > > > we could use running_bw). > > > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > > We should definitely account for the running_bw here, not the dl avg... > > > > > > I was trying to address the issue of RT stealing time from CFS here, but > > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > > + util += sg_cpu->util_cfs; > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > > > Why this_bw instead of running_bw? > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > > sum of runtime/deadline ratios of all DL tasks on that rq). > > > > Yep. > > > > > The reason I added this max is because I'm still not sure to understand > > > how we can safely drop the freq below that point ? If we don't guarantee > > > to always stay at least at the freq required by DL, aren't we risking to > > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > > this case, if that tasks uses all of its runtime then you might start > > > missing deadlines ... > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > > Right, I spotted that one, but yeah you could also be limited by HW ... 
> > > > > > My feeling is that the only safe thing to do is to guarantee to never go > > > below the freq required by DL, and to optimistically add CFS tasks > > > without raising the OPP if we have good reasons to think that DL is > > > using less than it required (which is what we should get by using > > > running_bw above I suppose). Does that make any sense ? > > > > Then we can't still avoid the hardware limits, so using running_bw is a > > trade off between safety (especially considering soft real-time > > scenarios) and energy consumption (which seems to be working in > > practice). > > Ok, I see ... Have you guys already tried something like my patch above > (keeping the freq >= this_bw) in real world use cases ? Is this costing > that much energy in practice ? If we fill the gaps left by DL (when it IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so he might add some numbers to my words above. I didn't (yet). But, please consider that I might be reserving (for example) 50% of bandwidth for my heavy and time-sensitive task and then have that task wake up only once in a while (but I'll be keeping clock speed up for the whole time). :/ > doesn't use all the runtime) with CFS tasks that might no be so bad ... > > Thank you very much for taking the time to explain all this, I really > appreciate :-) Sure. Thanks for participating in the discussion! Best, - Juri
On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: > > So this patch-set tracks the !cfs occupation using the same function, > > which is all good. But what, if instead of using that to compensate the > > OPP selection, we employ that to renormalize the util signal? > > > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > > then I think your initial problem goes away. Because while the RT task > > will push the util to .5, it will at the same time push the CPU capacity > > to .5, and renormalized that gives 1. > > > > NOTE: the renorm would then become something like: > > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); Should probably be: scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac()) > > > > > > On IRC I mentioned stopping the CFS clock when preempted, and while that > > would result in fixed numbers, Vincent was right in pointing out the > > numbers will be difficult to interpret, since the meaning will be purely > > CPU local and I'm not sure you can actually fix it again with > > normalization. > > > > Imagine, running a .3 RT task, that would push the (always running) CFS > > down to .7, but because we discard all !cfs time, it actually has 1. If > > we try and normalize that we'll end up with ~1.43, which is of course > > completely broken. > > > > > > _However_, all that happens for util, also happens for load. So the above > > scenario will also make the CPU appear less loaded than it actually is. > > The load will continue to increase because we track runnable state and > not running for the load Duh yes. So renormalizing it once, like proposed for util would actually do the right thing there too. Would not that allow us to get rid of much of the capacity magic in the load balance code? /me thinks more.. Bah, no.. because you don't want this dynamic renormalization part of the sums. So you want to keep it after the fact. :/ > As you mentioned, scale_rt_capacity give the remaining capacity for > cfs and it will behave like cfs util_avg now that it uses PELT. So as > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > OPP because we have remaining spare capacity but if cfs util_avg == > scale_rt_capacity, we make sure to use max OPP. Good point, when cfs-util < cfs-cap then there is idle time and the util number is 'right', when cfs-util == cfs-cap we're overcommitted and should go max. Since the util and cap values are aligned that should track nicely.
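A rough sketch of how the renormalisation against (1 - rt_frac()) and the cfs-util vs cfs-cap rule could fit together; the function and parameter names below are hypothetical and all values are on the SCHED_CAPACITY_SCALE:

/*
 * Sketch only: 'non_cfs' is the capacity currently consumed by
 * RT/DL/stop/IRQ, 'cfs_util' the PELT utilization of the cfs_rq.
 */
static unsigned long cfs_util_renormalized(unsigned long max,
					   unsigned long non_cfs,
					   unsigned long cfs_util)
{
	unsigned long cfs_cap = max - non_cfs;	/* capacity left for CFS */

	/* cfs-util == cfs-cap: no idle time left, report fully busy */
	if (cfs_util >= cfs_cap)
		return max;

	/* idle time exists: scale util back to the full-capacity range */
	return cfs_util * max / cfs_cap;
}

With the .5 RT example: an always-running CFS task ends up with cfs_util == cfs_cap == 512 and is reported as 1024, while a CFS task using .25 of the CPU is reported as 512, i.e. divided by (1 - rt_frac()).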
On Tuesday 05 Jun 2018 at 15:09:54 (+0100), Quentin Perret wrote: > On Tuesday 05 Jun 2018 at 15:55:43 (+0200), Vincent Guittot wrote: > > On 5 June 2018 at 15:52, Quentin Perret <quentin.perret@arm.com> wrote: > > > On Tuesday 05 Jun 2018 at 15:18:38 (+0200), Vincent Guittot wrote: > > >> On 5 June 2018 at 15:12, Quentin Perret <quentin.perret@arm.com> wrote: > > >> I would say no because when one will decrease the other one will not > > >> increase at the same pace and we will have some wrong behavior or > > >> decision > > > > > > I think I get your point. Yes, sometimes, the slow-moving rt_avg can be > > > off a little bit (which can be good or bad, depending in the case) if your > > > RT task runs a lot with very changing behaviour. And again, I'm not > > > fundamentally against the idea of having extra complexity for RT/IRQ PELT > > > signals _if_ we have a use-case. But is there a real use-case where we > > > really need all of that ? That's a true question, I honestly don't have > > > the answer :-) > > > > The iperf test result is another example of the benefit > > The iperf test result ? The sysbench test you mean ? Ah sorry I missed that one from the cover letter ... I'll look into that then :-) Thanks, Quentin
On 05/06/18 16:18, Peter Zijlstra wrote: > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: [...] > > As you mentioned, scale_rt_capacity give the remaining capacity for > > cfs and it will behave like cfs util_avg now that it uses PELT. So as > > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > > OPP because we have remaining spare capacity but if cfs util_avg == > > scale_rt_capacity, we make sure to use max OPP. > > Good point, when cfs-util < cfs-cap then there is idle time and the util > number is 'right', when cfs-util == cfs-cap we're overcommitted and > should go max. > > Since the util and cap values are aligned that should track nicely. Yeah. Makes sense to me as well. :)
On 05-Jun 16:18, Peter Zijlstra wrote: > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: > > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: > > > > So this patch-set tracks the !cfs occupation using the same function, > > > which is all good. But what, if instead of using that to compensate the > > > OPP selection, we employ that to renormalize the util signal? > > > > > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > > > then I think your initial problem goes away. Because while the RT task > > > will push the util to .5, it will at the same time push the CPU capacity > > > to .5, and renormalized that gives 1. And would not that mean also that a 50% task co-scheduled with the same 50% RT task, will be reported as a 100% util_avg task? > > > > > > NOTE: the renorm would then become something like: > > > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); > > Should probably be: > > scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac()) > > > > > > > > > > On IRC I mentioned stopping the CFS clock when preempted, and while that > > > would result in fixed numbers, Vincent was right in pointing out the > > > numbers will be difficult to interpret, since the meaning will be purely > > > CPU local and I'm not sure you can actually fix it again with > > > normalization. > > > > > > Imagine, running a .3 RT task, that would push the (always running) CFS > > > down to .7, but because we discard all !cfs time, it actually has 1. If > > > we try and normalize that we'll end up with ~1.43, which is of course > > > completely broken. > > > > > > > > > _However_, all that happens for util, also happens for load. So the above > > > scenario will also make the CPU appear less loaded than it actually is. > > > > The load will continue to increase because we track runnable state and > > not running for the load > > Duh yes. So renormalizing it once, like proposed for util would actually > do the right thing there too. Would not that allow us to get rid of > much of the capacity magic in the load balance code? > > /me thinks more.. > > Bah, no.. because you don't want this dynamic renormalization part of > the sums. So you want to keep it after the fact. :/ > > > As you mentioned, scale_rt_capacity give the remaining capacity for > > cfs and it will behave like cfs util_avg now that it uses PELT. So as > > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > > OPP because we have remaining spare capacity but if cfs util_avg == > > scale_rt_capacity, we make sure to use max OPP. What will happen for the 50% task of the example above? > Good point, when cfs-util < cfs-cap then there is idle time and the util > number is 'right', when cfs-util == cfs-cap we're overcommitted and > should go max. Again I cannot easily read the example above... Would that mean that a 50% CFS task, preempted by a 50% RT task (which already set OPP to max while RUNNABLE) will end up running at the max OPP too? > Since the util and cap values are aligned that should track nicely. True... the only potential issue I see is that we are steering PELT behaviors towards better driving schedutil to run high-demand workloads while _maybe_ affecting quite sensibly the capacity of PELT to describe how much CPU a task uses. Ultimately, utilization has always been a metric on "how much you use"... 
while here it seems to me we are bending it to be something to define "how fast you have to run". -- #include <best/regards.h> Patrick Bellasi
On Tue, Jun 05, 2018 at 04:38:26PM +0100, Patrick Bellasi wrote: > On 05-Jun 16:18, Peter Zijlstra wrote: > > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: > > > As you mentioned, scale_rt_capacity give the remaining capacity for > > > cfs and it will behave like cfs util_avg now that it uses PELT. So as > > > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > > > OPP because we have remaining spare capacity but if cfs util_avg == > > > scale_rt_capacity, we make sure to use max OPP. > > What will happen for the 50% task of the example above? When the cfs-cap reaches 50% (where cfs_cap := 1 - rt_avg - dl_avg - stop_avg - irq_avg) a cfs-util of 50% means that there is no idle time. So util will still be 50%, nothing funny. But frequency selection will see util==cap and select max (it might not have because reduction could be due to IRQ pressure for example). At the moment cfs-cap rises (>50%), and the cfs-util stays at 50%, we'll have 50% utilization. We know there is idle time, the task could use more if it wanted to. > > Good point, when cfs-util < cfs-cap then there is idle time and the util > > number is 'right', when cfs-util == cfs-cap we're overcommitted and > > should go max. > > Again I cannot easily read the example above... > > Would that mean that a 50% CFS task, preempted by a 50% RT task (which > already set OPP to max while RUNNABLE) will end up running at the max > OPP too? Yes, because there is no idle time. When there is no idle time, max freq is the right frequency. The moment cfs-util drops below cfs-cap, we'll stop running at max, because at that point we've found idle time to reduce frequency with. > > Since the util and cap values are aligned that should track nicely. > > True... the only potential issue I see is that we are steering PELT > behaviors towards better driving schedutil to run high-demand > workloads while _maybe_ affecting quite sensibly the capacity of PELT > to describe how much CPU a task uses. > > Ultimately, utilization has always been a metric on "how much you > use"... while here it seems to me we are bending it to be something to > define "how fast you have to run". This latest proposal does not in fact change the util tracking. But in general, 'how much do you use' can be a very difficult question, see the whole turbo / hardware managed dvfs discussion a week or so ago.
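Putting numbers on the two cases (SCHED_CAPACITY_SCALE = 1024, figures illustrative only):

  cfs_cap = 1024 - (rt_avg + dl_avg + stop_avg + irq_avg)

  50% RT + always-running CFS task: cfs_cap = 512, cfs_util saturates at
  512, so cfs_util == cfs_cap -> no idle time, request the max OPP.

  30% RT + 50% CFS task: cfs_cap ~ 717, cfs_util = 512 < cfs_cap -> idle
  time exists, so the summed request (rt ~ 307 + cfs 512) ~ 819, i.e.
  roughly 80% of the max OPP.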
On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote: > On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: > > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: > > > > So this patch-set tracks the !cfs occupation using the same function, > > > which is all good. But what, if instead of using that to compensate the > > > OPP selection, we employ that to renormalize the util signal? > > > > > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > > > then I think your initial problem goes away. Because while the RT task > > > will push the util to .5, it will at the same time push the CPU capacity > > > to .5, and renormalized that gives 1. > > > > > > NOTE: the renorm would then become something like: > > > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); > > Should probably be: > > scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac()) > > > > > > > > > > On IRC I mentioned stopping the CFS clock when preempted, and while that > > > would result in fixed numbers, Vincent was right in pointing out the > > > numbers will be difficult to interpret, since the meaning will be purely > > > CPU local and I'm not sure you can actually fix it again with > > > normalization. > > > > > > Imagine, running a .3 RT task, that would push the (always running) CFS > > > down to .7, but because we discard all !cfs time, it actually has 1. If > > > we try and normalize that we'll end up with ~1.43, which is of course > > > completely broken. > > > > > > > > > _However_, all that happens for util, also happens for load. So the above > > > scenario will also make the CPU appear less loaded than it actually is. > > > > The load will continue to increase because we track runnable state and > > not running for the load > > Duh yes. So renormalizing it once, like proposed for util would actually > do the right thing there too. Would not that allow us to get rid of > much of the capacity magic in the load balance code? > > /me thinks more.. > > Bah, no.. because you don't want this dynamic renormalization part of > the sums. So you want to keep it after the fact. :/ > > > As you mentioned, scale_rt_capacity give the remaining capacity for > > cfs and it will behave like cfs util_avg now that it uses PELT. So as > > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > > OPP because we have remaining spare capacity but if cfs util_avg == > > scale_rt_capacity, we make sure to use max OPP. > > Good point, when cfs-util < cfs-cap then there is idle time and the util > number is 'right', when cfs-util == cfs-cap we're overcommitted and > should go max. > > Since the util and cap values are aligned that should track nicely. So Vincent proposed to have a margin between cfs util and cfs cap to be sure there is a little bit of idle time. This is _exactly_ what the overutilized flag in EAS does. That would actually make a lot of sense to use that flag in schedutil. The idea is basically to say, if there isn't enough idle time on all CPUs, the util signal are kinda wrong, so let's not make any decisions (task placement or OPP selection) based on that. If overutilized, go to max freq. Does that make sense ? Thanks, Quentin
On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote: > On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote: >> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: >> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: >> >> > > So this patch-set tracks the !cfs occupation using the same function, >> > > which is all good. But what, if instead of using that to compensate the >> > > OPP selection, we employ that to renormalize the util signal? >> > > >> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, >> > > then I think your initial problem goes away. Because while the RT task >> > > will push the util to .5, it will at the same time push the CPU capacity >> > > to .5, and renormalized that gives 1. >> > > >> > > NOTE: the renorm would then become something like: >> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); >> >> Should probably be: >> >> scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac()) >> >> > > >> > > >> > > On IRC I mentioned stopping the CFS clock when preempted, and while that >> > > would result in fixed numbers, Vincent was right in pointing out the >> > > numbers will be difficult to interpret, since the meaning will be purely >> > > CPU local and I'm not sure you can actually fix it again with >> > > normalization. >> > > >> > > Imagine, running a .3 RT task, that would push the (always running) CFS >> > > down to .7, but because we discard all !cfs time, it actually has 1. If >> > > we try and normalize that we'll end up with ~1.43, which is of course >> > > completely broken. >> > > >> > > >> > > _However_, all that happens for util, also happens for load. So the above >> > > scenario will also make the CPU appear less loaded than it actually is. >> > >> > The load will continue to increase because we track runnable state and >> > not running for the load >> >> Duh yes. So renormalizing it once, like proposed for util would actually >> do the right thing there too. Would not that allow us to get rid of >> much of the capacity magic in the load balance code? >> >> /me thinks more.. >> >> Bah, no.. because you don't want this dynamic renormalization part of >> the sums. So you want to keep it after the fact. :/ >> >> > As you mentioned, scale_rt_capacity give the remaining capacity for >> > cfs and it will behave like cfs util_avg now that it uses PELT. So as >> > long as cfs util_avg < scale_rt_capacity(we probably need a margin) >> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting >> > OPP because we have remaining spare capacity but if cfs util_avg == >> > scale_rt_capacity, we make sure to use max OPP. >> >> Good point, when cfs-util < cfs-cap then there is idle time and the util >> number is 'right', when cfs-util == cfs-cap we're overcommitted and >> should go max. >> >> Since the util and cap values are aligned that should track nicely. > > So Vincent proposed to have a margin between cfs util and cfs cap to be > sure there is a little bit of idle time. This is _exactly_ what the > overutilized flag in EAS does. That would actually make a lot of sense > to use that flag in schedutil. The idea is basically to say, if there > isn't enough idle time on all CPUs, the util signal are kinda wrong, so > let's not make any decisions (task placement or OPP selection) based on > that. If overutilized, go to max freq. Does that make sense ? 
Yes, it's similar to the overutilized flag except that: - this is done per CPU, whereas overutilization is for the whole system - the test is done at every freq update, not only during some cfs event, and it uses the last up-to-date value rather than a periodically updated snapshot of the value - this is also done without EAS Then for the margin, it has to be discussed whether it is really needed or not > > Thanks, > Quentin
On 6 June 2018 at 11:59, Vincent Guittot <vincent.guittot@linaro.org> wrote: > On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote: >> On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote: [snip] >>> >>> > As you mentioned, scale_rt_capacity give the remaining capacity for >>> > cfs and it will behave like cfs util_avg now that it uses PELT. So as >>> > long as cfs util_avg < scale_rt_capacity(we probably need a margin) >>> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting >>> > OPP because we have remaining spare capacity but if cfs util_avg == >>> > scale_rt_capacity, we make sure to use max OPP. >>> >>> Good point, when cfs-util < cfs-cap then there is idle time and the util >>> number is 'right', when cfs-util == cfs-cap we're overcommitted and >>> should go max. >>> >>> Since the util and cap values are aligned that should track nicely. I have re-run my tests and the results seem to be OK so far. I'm going to clean up the code used for the test a bit and send a new version of the proposal >> >> So Vincent proposed to have a margin between cfs util and cfs cap to be >> sure there is a little bit of idle time. This is _exactly_ what the >> overutilized flag in EAS does. That would actually make a lot of sense >> to use that flag in schedutil. The idea is basically to say, if there >> isn't enough idle time on all CPUs, the util signal are kinda wrong, so >> let's not make any decisions (task placement or OPP selection) based on >> that. If overutilized, go to max freq. Does that make sense ? > > Yes it's similar to the overutilized except that > - this is done per cpu and whereas overutilization is for the whole system > - the test is done at every freq update and not only during some cfs > event and it uses the last up to date value and not a periodically > updated snapshot of the value > - this is done also without EAS > > Then for the margin, it has to be discussed if it is really needed or not > >> >> Thanks, >> Quentin
On Wednesday 06 Jun 2018 at 11:59:04 (+0200), Vincent Guittot wrote: > On 6 June 2018 at 11:44, Quentin Perret <quentin.perret@arm.com> wrote: > > On Tuesday 05 Jun 2018 at 16:18:09 (+0200), Peter Zijlstra wrote: > >> On Mon, Jun 04, 2018 at 08:08:58PM +0200, Vincent Guittot wrote: > >> > On 4 June 2018 at 18:50, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >> > > So this patch-set tracks the !cfs occupation using the same function, > >> > > which is all good. But what, if instead of using that to compensate the > >> > > OPP selection, we employ that to renormalize the util signal? > >> > > > >> > > If we normalize util against the dynamic (rt_avg affected) cpu_capacity, > >> > > then I think your initial problem goes away. Because while the RT task > >> > > will push the util to .5, it will at the same time push the CPU capacity > >> > > to .5, and renormalized that gives 1. > >> > > > >> > > NOTE: the renorm would then become something like: > >> > > scale_cpu = arch_scale_cpu_capacity() / rt_frac(); > >> > >> Should probably be: > >> > >> scale_cpu = atch_scale_cpu_capacity() / (1 - rt_frac()) > >> > >> > > > >> > > > >> > > On IRC I mentioned stopping the CFS clock when preempted, and while that > >> > > would result in fixed numbers, Vincent was right in pointing out the > >> > > numbers will be difficult to interpret, since the meaning will be purely > >> > > CPU local and I'm not sure you can actually fix it again with > >> > > normalization. > >> > > > >> > > Imagine, running a .3 RT task, that would push the (always running) CFS > >> > > down to .7, but because we discard all !cfs time, it actually has 1. If > >> > > we try and normalize that we'll end up with ~1.43, which is of course > >> > > completely broken. > >> > > > >> > > > >> > > _However_, all that happens for util, also happens for load. So the above > >> > > scenario will also make the CPU appear less loaded than it actually is. > >> > > >> > The load will continue to increase because we track runnable state and > >> > not running for the load > >> > >> Duh yes. So renormalizing it once, like proposed for util would actually > >> do the right thing there too. Would not that allow us to get rid of > >> much of the capacity magic in the load balance code? > >> > >> /me thinks more.. > >> > >> Bah, no.. because you don't want this dynamic renormalization part of > >> the sums. So you want to keep it after the fact. :/ > >> > >> > As you mentioned, scale_rt_capacity give the remaining capacity for > >> > cfs and it will behave like cfs util_avg now that it uses PELT. So as > >> > long as cfs util_avg < scale_rt_capacity(we probably need a margin) > >> > we keep using dl bandwidth + cfs util_avg + rt util_avg for selecting > >> > OPP because we have remaining spare capacity but if cfs util_avg == > >> > scale_rt_capacity, we make sure to use max OPP. > >> > >> Good point, when cfs-util < cfs-cap then there is idle time and the util > >> number is 'right', when cfs-util == cfs-cap we're overcommitted and > >> should go max. > >> > >> Since the util and cap values are aligned that should track nicely. > > > > So Vincent proposed to have a margin between cfs util and cfs cap to be > > sure there is a little bit of idle time. This is _exactly_ what the > > overutilized flag in EAS does. That would actually make a lot of sense > > to use that flag in schedutil. 
The idea is basically to say, if there > > isn't enough idle time on all CPUs, the util signal are kinda wrong, so > > let's not make any decisions (task placement or OPP selection) based on > > that. If overutilized, go to max freq. Does that make sense ? > > Yes it's similar to the overutilized except that > - this is done per cpu and whereas overutilization is for the whole system Is this a good thing ? It has to be discussed. Anyways, the patch from Morten which is part of the latest EAS posting (v3) introduces a cpu_overutilized() function which does what you want I think. > - the test is done at every freq update and not only during some cfs > event and it uses the last up to date value and not a periodically > updated snapshot of the value Yeah good point. Now, the overutilized flag is attached to the root domain so you should be able to set/clear it from RT/DL whenever that makes sense I suppose. That's just a flag about the current state of the system so I don't see why it should be touched only by CFS. > - this is done also without EAS The overutilized flag doesn't have to come with EAS if it is useful for something else (OPP selection). > > Then for the margin, it has to be discussed if it is really needed or not +1 Thanks, Quentin
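For the margin, a per-CPU test along the lines of what the EAS series does could look like the sketch below; the helper name and the 1280 constant (i.e. keep roughly 20% of the capacity idle) are assumptions here, not something taken from this patch set:

/* ~1.25 * SCHED_CAPACITY_SCALE: require ~20% of spare capacity */
#define CAPACITY_MARGIN	1280

static inline bool capacity_overutilized(unsigned long util, unsigned long capacity)
{
	/* true when util * 1.25 exceeds capacity, i.e. not enough idle time */
	return capacity * SCHED_CAPACITY_SCALE < util * CAPACITY_MARGIN;
}

Schedutil could then jump straight to the max OPP whenever the current CPU (rather than the whole root domain) is overutilized in this sense, and trust the summed utilizations otherwise.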
Hi Quentin, Il 05/06/2018 16:13, Juri Lelli ha scritto: > On 05/06/18 15:01, Quentin Perret wrote: >> On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: >>> On 05/06/18 14:05, Quentin Perret wrote: >>>> On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: >>>>> Hi Quentin, >>>>> >>>>> On 05/06/18 11:57, Quentin Perret wrote: >>>>> >>>>> [...] >>>>> >>>>>> What about the diff below (just a quick hack to show the idea) applied >>>>>> on tip/sched/core ? >>>>>> >>>>>> ---8<--- >>>>>> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c >>>>>> index a8ba6d1f262a..23a4fb1c2c25 100644 >>>>>> --- a/kernel/sched/cpufreq_schedutil.c >>>>>> +++ b/kernel/sched/cpufreq_schedutil.c >>>>>> @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) >>>>>> sg_cpu->util_dl = cpu_util_dl(rq); >>>>>> } >>>>>> >>>>>> +unsigned long scale_rt_capacity(int cpu); >>>>>> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >>>>>> { >>>>>> struct rq *rq = cpu_rq(sg_cpu->cpu); >>>>>> + int cpu = sg_cpu->cpu; >>>>>> + unsigned long util, dl_bw; >>>>>> >>>>>> if (rq->rt.rt_nr_running) >>>>>> return sg_cpu->max; >>>>>> @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) >>>>>> * util_cfs + util_dl as requested freq. However, cpufreq is not yet >>>>>> * ready for such an interface. So, we only do the latter for now. >>>>>> */ >>>>>> - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); >>>>>> + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); >>>>> >>>>> Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, >>>>> since we use max below, we will probably have the same problem that we >>>>> discussed on Vincent's approach (overestimation of DL contribution while >>>>> we could use running_bw). >>>> >>>> Ah no, you're right, this isn't great for long running deadline tasks. >>>> We should definitely account for the running_bw here, not the dl avg... >>>> >>>> I was trying to address the issue of RT stealing time from CFS here, but >>>> the DL integration isn't quite right which this patch as-is, I agree ... >>>> >>>>> >>>>>> + util >>= SCHED_CAPACITY_SHIFT; >>>>>> + util = arch_scale_cpu_capacity(NULL, cpu) - util; >>>>>> + util += sg_cpu->util_cfs; >>>>>> + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; >>>>> >>>>> Why this_bw instead of running_bw? >>>> >>>> So IIUC, this_bw should basically give you the absolute reservation (== the >>>> sum of runtime/deadline ratios of all DL tasks on that rq). >>> >>> Yep. >>> >>>> The reason I added this max is because I'm still not sure to understand >>>> how we can safely drop the freq below that point ? If we don't guarantee >>>> to always stay at least at the freq required by DL, aren't we risking to >>>> start a deadline tasks stuck at a low freq because of rate limiting ? In >>>> this case, if that tasks uses all of its runtime then you might start >>>> missing deadlines ... >>> >>> We decided to avoid (software) rate limiting for DL with e97a90f7069b >>> ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). >> >> Right, I spotted that one, but yeah you could also be limited by HW ... 
>> >>> >>>> My feeling is that the only safe thing to do is to guarantee to never go >>>> below the freq required by DL, and to optimistically add CFS tasks >>>> without raising the OPP if we have good reasons to think that DL is >>>> using less than it required (which is what we should get by using >>>> running_bw above I suppose). Does that make any sense ? >>> >>> Then we still can't avoid the hardware limits, so using running_bw is a >>> trade-off between safety (especially considering soft real-time >>> scenarios) and energy consumption (which seems to be working in >>> practice). >> >> Ok, I see ... Have you guys already tried something like my patch above >> (keeping the freq >= this_bw) in real world use cases ? Is this costing >> that much energy in practice ? If we fill the gaps left by DL (when it > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so > he might add some numbers to my words above. I didn't (yet). But, please > consider that I might be reserving (for example) 50% of bandwidth for my > heavy and time-sensitive task and then have that task wake up only once > in a while (but I'll be keeping clock speed up for the whole time). :/ As far as I can remember, we never tested the energy consumption of running_bw vs this_bw, as at OSPM'17 we had already decided to use running_bw, implementing GRUB-PA. The rationale is that, as Juri pointed out, the amount of spare (i.e. reclaimable) bandwidth in this_bw is very user-dependent. For example, the user can let this_bw be much higher than the measured bandwidth, just to be sure that the deadlines are met even in corner cases. In practice, this means that the task executes for quite a short time and then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, at the 0-lag time). Using this_bw rather than running_bw, the CPU frequency would remain at the same fixed value even when the task is blocked. I understand that in some cases it could even be better (i.e. no waste of energy in frequency switching). However, IMHO, these are corner cases and in the average case it is better to rely on running_bw and reduce the CPU frequency accordingly. Best regards, Claudio
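For readers following the this_bw vs running_bw discussion: both live in struct dl_rq as fixed-point values with BW_SHIFT fractional bits, so converting either one to the capacity scale used by schedutil is the same one-liner as in the quoted hack; only the semantics differ. The snippet below is purely illustrative (the variable names dl_reserved/dl_active are made up), a minimal sketch assuming rq = cpu_rq(cpu).

	/* reserved bandwidth: sum of runtime/deadline of all DL tasks on the rq */
	dl_reserved = (rq->dl.this_bw    * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/*
	 * active (GRUB) bandwidth: only the tasks currently "active"; a blocked
	 * task's share is dropped at its 0-lag time, which is what lets the
	 * requested frequency come back down shortly after the task blocks
	 */
	dl_active   = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;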
Hi Claudio, On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote: > Hi Quentin, > > Il 05/06/2018 16:13, Juri Lelli ha scritto: > > On 05/06/18 15:01, Quentin Perret wrote: > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > > > > On 05/06/18 14:05, Quentin Perret wrote: > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > > > > Hi Quentin, > > > > > > > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > > > > > > > [...] > > > > > > > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > > > > on tip/sched/core ? > > > > > > > > > > > > > > ---8<--- > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > > > > } > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > { > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > > > > + int cpu = sg_cpu->cpu; > > > > > > > + unsigned long util, dl_bw; > > > > > > > if (rq->rt.rt_nr_running) > > > > > > > return sg_cpu->max; > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > > > > * ready for such an interface. So, we only do the latter for now. > > > > > > > */ > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > > > > since we use max below, we will probably have the same problem that we > > > > > > discussed on Vincent's approach (overestimation of DL contribution while > > > > > > we could use running_bw). > > > > > > > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > > > > We should definitely account for the running_bw here, not the dl avg... > > > > > > > > > > I was trying to address the issue of RT stealing time from CFS here, but > > > > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > > > > + util += sg_cpu->util_cfs; > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > > > > > > > Why this_bw instead of running_bw? > > > > > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > > > > sum of runtime/deadline ratios of all DL tasks on that rq). > > > > > > > > Yep. > > > > > > > > > The reason I added this max is because I'm still not sure to understand > > > > > how we can safely drop the freq below that point ? If we don't guarantee > > > > > to always stay at least at the freq required by DL, aren't we risking to > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > > > > this case, if that tasks uses all of its runtime then you might start > > > > > missing deadlines ... 
> > > > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > > > > > > Right, I spotted that one, but yeah you could also be limited by HW ... > > > > > > > > > > > > My feeling is that the only safe thing to do is to guarantee to never go > > > > > below the freq required by DL, and to optimistically add CFS tasks > > > > > without raising the OPP if we have good reasons to think that DL is > > > > > using less than it required (which is what we should get by using > > > > > running_bw above I suppose). Does that make any sense ? > > > > > > > > Then we can't still avoid the hardware limits, so using running_bw is a > > > > trade off between safety (especially considering soft real-time > > > > scenarios) and energy consumption (which seems to be working in > > > > practice). > > > > > > Ok, I see ... Have you guys already tried something like my patch above > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing > > > that much energy in practice ? If we fill the gaps left by DL (when it > > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so > > he might add some numbers to my words above. I didn't (yet). But, please > > consider that I might be reserving (for example) 50% of bandwidth for my > > heavy and time sensitive task and then have that task wake up only once > > in a while (but I'll be keeping clock speed up for the whole time). :/ > > As far as I can remember, we never tested energy consumption of running_bw > vs this_bw, as at OSPM'17 we had already decided to use running_bw > implementing GRUB-PA. > The rationale is that, as Juri pointed out, the amount of spare (i.e. > reclaimable) bandwidth in this_bw is very user-dependent. For example, > the user can let this_bw be much higher than the measured bandwidth, just > to be sure that the deadlines are met even in corner cases. Ok I see the issue. Trusting userspace isn't necessarily the right thing to do, I totally agree with that. > In practice, this means that the task executes for quite a short time and > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, > at the 0lag time). > Using this_bw rather than running_bw, the CPU frequency would remain at > the same fixed value even when the task is blocked. > I understand that on some cases it could even be better (i.e. no waste > of energy in frequency switch). +1, I'm pretty sure using this_bw is pretty much always worst than using running_bw from an energy standpoint,. The waste of energy in frequency changes should be less than the energy wasted by staying at a too high frequency for a long time, otherwise DVFS isn't a good idea to begin with :-) > However, IMHO, these are corner cases and in the average case it is better > to rely on running_bw and reduce the CPU frequency accordingly. My point was that accepting to go at a lower frequency than required by this_bw is fundamentally unsafe. If you're at a low frequency when a DL task starts, there are real situations where you won't be able to increase the frequency immediately, which can eventually lead to missing deadlines. Now, if this risk is known, has been discussed, and is accepted, that's fair enough. I'm just too late for the discussion :-) Thank you for your time ! Quentin
Hi Quentin, 2018-06-06 15:20 GMT+02:00 Quentin Perret <quentin.perret@arm.com>: > > Hi Claudio, > > On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote: > > Hi Quentin, > > > > Il 05/06/2018 16:13, Juri Lelli ha scritto: > > > On 05/06/18 15:01, Quentin Perret wrote: > > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > > > > > On 05/06/18 14:05, Quentin Perret wrote: > > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > > > > > Hi Quentin, > > > > > > > > > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > > > > > on tip/sched/core ? > > > > > > > > > > > > > > > > ---8<--- > > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > > > > > } > > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > { > > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > > > > > + int cpu = sg_cpu->cpu; > > > > > > > > + unsigned long util, dl_bw; > > > > > > > > if (rq->rt.rt_nr_running) > > > > > > > > return sg_cpu->max; > > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > > > > > * ready for such an interface. So, we only do the latter for now. > > > > > > > > */ > > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > > > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > > > > > since we use max below, we will probably have the same problem that we > > > > > > > discussed on Vincent's approach (overestimation of DL contribution while > > > > > > > we could use running_bw). > > > > > > > > > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > > > > > We should definitely account for the running_bw here, not the dl avg... > > > > > > > > > > > > I was trying to address the issue of RT stealing time from CFS here, but > > > > > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > > > > > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > > > > > + util += sg_cpu->util_cfs; > > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > > > > > > > > > Why this_bw instead of running_bw? > > > > > > > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > > > > > sum of runtime/deadline ratios of all DL tasks on that rq). > > > > > > > > > > Yep. > > > > > > > > > > > The reason I added this max is because I'm still not sure to understand > > > > > > how we can safely drop the freq below that point ? 
If we don't guarantee > > > > > > to always stay at least at the freq required by DL, aren't we risking to > > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > > > > > this case, if that tasks uses all of its runtime then you might start > > > > > > missing deadlines ... > > > > > > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > > > > > > > > Right, I spotted that one, but yeah you could also be limited by HW ... > > > > > > > > > > > > > > > My feeling is that the only safe thing to do is to guarantee to never go > > > > > > below the freq required by DL, and to optimistically add CFS tasks > > > > > > without raising the OPP if we have good reasons to think that DL is > > > > > > using less than it required (which is what we should get by using > > > > > > running_bw above I suppose). Does that make any sense ? > > > > > > > > > > Then we can't still avoid the hardware limits, so using running_bw is a > > > > > trade off between safety (especially considering soft real-time > > > > > scenarios) and energy consumption (which seems to be working in > > > > > practice). > > > > > > > > Ok, I see ... Have you guys already tried something like my patch above > > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing > > > > that much energy in practice ? If we fill the gaps left by DL (when it > > > > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so > > > he might add some numbers to my words above. I didn't (yet). But, please > > > consider that I might be reserving (for example) 50% of bandwidth for my > > > heavy and time sensitive task and then have that task wake up only once > > > in a while (but I'll be keeping clock speed up for the whole time). :/ > > > > As far as I can remember, we never tested energy consumption of running_bw > > vs this_bw, as at OSPM'17 we had already decided to use running_bw > > implementing GRUB-PA. > > The rationale is that, as Juri pointed out, the amount of spare (i.e. > > reclaimable) bandwidth in this_bw is very user-dependent. For example, > > the user can let this_bw be much higher than the measured bandwidth, just > > to be sure that the deadlines are met even in corner cases. > > Ok I see the issue. Trusting userspace isn't necessarily the right thing > to do, I totally agree with that. > > > In practice, this means that the task executes for quite a short time and > > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, > > at the 0lag time). > > Using this_bw rather than running_bw, the CPU frequency would remain at > > the same fixed value even when the task is blocked. > > I understand that on some cases it could even be better (i.e. no waste > > of energy in frequency switch). > > +1, I'm pretty sure using this_bw is pretty much always worst than > using running_bw from an energy standpoint,. The waste of energy in > frequency changes should be less than the energy wasted by staying at a > too high frequency for a long time, otherwise DVFS isn't a good idea to > begin with :-) > > > However, IMHO, these are corner cases and in the average case it is better > > to rely on running_bw and reduce the CPU frequency accordingly. > > My point was that accepting to go at a lower frequency than required by > this_bw is fundamentally unsafe. 
If you're at a low frequency when a DL > task starts, there are real situations where you won't be able to > increase the frequency immediately, which can eventually lead to missing > deadlines. I see. Unfortunately, I'm having quite crazy days so I couldn't follow the original discussion on LKML properly. My fault. Anyway, let me answer your question (if this time I have understood it correctly). You're right: the tests have shown that whenever the DL task period gets comparable with the time needed to switch frequency, the number of missed deadlines becomes non-negligible. To give you a rough idea, this already happens with periods of 10 msec on an Odroid XU4. The reason is that the task instance starts at too low a frequency, and the system can't switch frequency in time to meet the deadline. This is a known issue, partially discussed during the RT Summit'17. However, the community has been more in favour of reducing the energy consumption than meeting firm deadlines. If you need a safe system, in fact, you'd better think about disabling DVFS completely and relying on a fixed CPU frequency. A possible trade-off could be a further entry in sysfs to let system designers switch from (default) running_bw to (more pessimistic) this_bw. However, I'm not sure the community wants a further knob on sysfs just to make RT people happier :) Best, Claudio
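Just to make the suggested trade-off concrete: no such knob exists, and everything below is hypothetical (the sysctl name and helper are invented for illustration only). A switch between the two bandwidths in the frequency-selection path could be as small as this sketch:

/* Hypothetical knob: 0 = running_bw (default), 1 = this_bw. Does not exist. */
unsigned int sysctl_sched_dl_freq_this_bw;

static unsigned long dl_freq_util_sketch(struct rq *rq)
{
	u64 bw = sysctl_sched_dl_freq_this_bw ? rq->dl.this_bw
					      : rq->dl.running_bw;

	return (bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
}

Whether such a knob is acceptable is exactly the open question raised above: it trades a cleaner default for one more tunable that only RT-oriented deployments would flip.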
On Wednesday 06 Jun 2018 at 15:53:27 (+0200), Claudio Scordino wrote: > Hi Quentin, > > 2018-06-06 15:20 GMT+02:00 Quentin Perret <quentin.perret@arm.com>: > > > > Hi Claudio, > > > > On Wednesday 06 Jun 2018 at 15:05:58 (+0200), Claudio Scordino wrote: > > > Hi Quentin, > > > > > > Il 05/06/2018 16:13, Juri Lelli ha scritto: > > > > On 05/06/18 15:01, Quentin Perret wrote: > > > > > On Tuesday 05 Jun 2018 at 15:15:18 (+0200), Juri Lelli wrote: > > > > > > On 05/06/18 14:05, Quentin Perret wrote: > > > > > > > On Tuesday 05 Jun 2018 at 14:11:53 (+0200), Juri Lelli wrote: > > > > > > > > Hi Quentin, > > > > > > > > > > > > > > > > On 05/06/18 11:57, Quentin Perret wrote: > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > What about the diff below (just a quick hack to show the idea) applied > > > > > > > > > on tip/sched/core ? > > > > > > > > > > > > > > > > > > ---8<--- > > > > > > > > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > > > > > > > > > index a8ba6d1f262a..23a4fb1c2c25 100644 > > > > > > > > > --- a/kernel/sched/cpufreq_schedutil.c > > > > > > > > > +++ b/kernel/sched/cpufreq_schedutil.c > > > > > > > > > @@ -180,9 +180,12 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu) > > > > > > > > > sg_cpu->util_dl = cpu_util_dl(rq); > > > > > > > > > } > > > > > > > > > +unsigned long scale_rt_capacity(int cpu); > > > > > > > > > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > > { > > > > > > > > > struct rq *rq = cpu_rq(sg_cpu->cpu); > > > > > > > > > + int cpu = sg_cpu->cpu; > > > > > > > > > + unsigned long util, dl_bw; > > > > > > > > > if (rq->rt.rt_nr_running) > > > > > > > > > return sg_cpu->max; > > > > > > > > > @@ -197,7 +200,14 @@ static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu) > > > > > > > > > * util_cfs + util_dl as requested freq. However, cpufreq is not yet > > > > > > > > > * ready for such an interface. So, we only do the latter for now. > > > > > > > > > */ > > > > > > > > > - return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs)); > > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) * scale_rt_capacity(cpu); > > > > > > > > > > > > > > > > Sorry to be pedantinc, but this (ATM) includes DL avg contribution, so, > > > > > > > > since we use max below, we will probably have the same problem that we > > > > > > > > discussed on Vincent's approach (overestimation of DL contribution while > > > > > > > > we could use running_bw). > > > > > > > > > > > > > > Ah no, you're right, this isn't great for long running deadline tasks. > > > > > > > We should definitely account for the running_bw here, not the dl avg... > > > > > > > > > > > > > > I was trying to address the issue of RT stealing time from CFS here, but > > > > > > > the DL integration isn't quite right which this patch as-is, I agree ... > > > > > > > > > > > > > > > > > > > > > > > > + util >>= SCHED_CAPACITY_SHIFT; > > > > > > > > > + util = arch_scale_cpu_capacity(NULL, cpu) - util; > > > > > > > > > + util += sg_cpu->util_cfs; > > > > > > > > > + dl_bw = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT; > > > > > > > > > > > > > > > > Why this_bw instead of running_bw? > > > > > > > > > > > > > > So IIUC, this_bw should basically give you the absolute reservation (== the > > > > > > > sum of runtime/deadline ratios of all DL tasks on that rq). > > > > > > > > > > > > Yep. 
> > > > > > > > > > > > > The reason I added this max is because I'm still not sure to understand > > > > > > > how we can safely drop the freq below that point ? If we don't guarantee > > > > > > > to always stay at least at the freq required by DL, aren't we risking to > > > > > > > start a deadline tasks stuck at a low freq because of rate limiting ? In > > > > > > > this case, if that tasks uses all of its runtime then you might start > > > > > > > missing deadlines ... > > > > > > > > > > > > We decided to avoid (software) rate limiting for DL with e97a90f7069b > > > > > > ("sched/cpufreq: Rate limits for SCHED_DEADLINE"). > > > > > > > > > > Right, I spotted that one, but yeah you could also be limited by HW ... > > > > > > > > > > > > > > > > > > My feeling is that the only safe thing to do is to guarantee to never go > > > > > > > below the freq required by DL, and to optimistically add CFS tasks > > > > > > > without raising the OPP if we have good reasons to think that DL is > > > > > > > using less than it required (which is what we should get by using > > > > > > > running_bw above I suppose). Does that make any sense ? > > > > > > > > > > > > Then we can't still avoid the hardware limits, so using running_bw is a > > > > > > trade off between safety (especially considering soft real-time > > > > > > scenarios) and energy consumption (which seems to be working in > > > > > > practice). > > > > > > > > > > Ok, I see ... Have you guys already tried something like my patch above > > > > > (keeping the freq >= this_bw) in real world use cases ? Is this costing > > > > > that much energy in practice ? If we fill the gaps left by DL (when it > > > > > > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both approaches, so > > > > he might add some numbers to my words above. I didn't (yet). But, please > > > > consider that I might be reserving (for example) 50% of bandwidth for my > > > > heavy and time sensitive task and then have that task wake up only once > > > > in a while (but I'll be keeping clock speed up for the whole time). :/ > > > > > > As far as I can remember, we never tested energy consumption of running_bw > > > vs this_bw, as at OSPM'17 we had already decided to use running_bw > > > implementing GRUB-PA. > > > The rationale is that, as Juri pointed out, the amount of spare (i.e. > > > reclaimable) bandwidth in this_bw is very user-dependent. For example, > > > the user can let this_bw be much higher than the measured bandwidth, just > > > to be sure that the deadlines are met even in corner cases. > > > > Ok I see the issue. Trusting userspace isn't necessarily the right thing > > to do, I totally agree with that. > > > > > In practice, this means that the task executes for quite a short time and > > > then blocks (with its bandwidth reclaimed, hence the CPU frequency reduced, > > > at the 0lag time). > > > Using this_bw rather than running_bw, the CPU frequency would remain at > > > the same fixed value even when the task is blocked. > > > I understand that on some cases it could even be better (i.e. no waste > > > of energy in frequency switch). > > > > +1, I'm pretty sure using this_bw is pretty much always worst than > > using running_bw from an energy standpoint,. 
The waste of energy in > > frequency changes should be less than the energy wasted by staying at a > > too high frequency for a long time, otherwise DVFS isn't a good idea to > > begin with :-) > > > > > However, IMHO, these are corner cases and in the average case it is better > > > to rely on running_bw and reduce the CPU frequency accordingly. > > > > My point was that accepting to go at a lower frequency than required by > > this_bw is fundamentally unsafe. If you're at a low frequency when a DL > > task starts, there are real situations where you won't be able to > > increase the frequency immediately, which can eventually lead to missing > > deadlines. > > > I see. Unfortunately, I'm having quite crazy days so I couldn't follow > the original discussion on LKML properly. My fault. No problem ! > Anyway, to answer your question (if this time I have understood it correctly). > > You're right: the tests have shown that whenever the DL task period > gets comparable with the time for switching frequency, the amount of > missed deadlines becomes not negligible. > To give you a rough idea, this already happens with periods of 10msec > on a Odroid XU4. > The reason is that the task instance starts at a too low frequency, > and the system can't switch frequency in time for meeting the > deadline. Ok that's very interesting ... > > This is a known issue, partially discussed during the RT Summit'17. > However, the community has been more in favour of reducing the energy > consumption than meeting firm deadlines. > If you need a safe system, in fact, you'd better thinking about > disabling DVFS completely and relying on a fixed CPU frequency. Yeah, understood. My proposition was sort of a middle-ground solution. I was basically suggesting to select OPPs with something like: max(this_bw, running_bw + cfs_util) The idea is that you're always guaranteed to request a high enough frequency for this_bw, and you can opportunistically try to run CFS tasks without raising the OPP if running_bw is low. In the worst case, the CFS tasks will be preempted and delayed a bit. But DL should always be safe. And if the CFS activity is sufficient to fill the gap between running_bw and this_bw, then you should be pretty energy efficient as well. Now, that doesn't always solve the issue you mentioned earlier. If there isn't much CFS traffic going on, and if the delta between this_bw and running_bw is high, then you'll be stuck at a high freq although the system utilization is low ... As usual, it's just a trade-off :-) > > A possible trade-off could be a further entry in sys to let system > designers switching from (default) running_bw to (more pessimistic) > this_bw. > However, I'm not sure the community wants a further knob on sysfs just > to make RT people happier :) > > Best, > > Claudio
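Written on top of the earlier quick hack, the middle-ground aggregation described above would look roughly like the sketch below. It is not a tested patch, just the max(this_bw, running_bw + cfs_util) idea spelled out; util_cfs is the value already read in sugov_get_util().

static unsigned long sugov_aggregate_util_sketch(struct sugov_cpu *sg_cpu)
{
	struct rq *rq = cpu_rq(sg_cpu->cpu);
	unsigned long dl_floor, dl_act, util;

	/* Hard floor: never request less than the reserved DL bandwidth. */
	dl_floor = (rq->dl.this_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* Opportunistic part: active (GRUB) DL bandwidth plus CFS utilization. */
	dl_act = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	util = max(dl_floor, dl_act + sg_cpu->util_cfs);

	return min(sg_cpu->max, util);
}

The design intent is exactly what the email says: DL is always covered by the this_bw floor, while CFS activity is allowed to fill the gap between running_bw and this_bw before the request rises above the floor.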
Hi all, sorry; I missed the beginning of this thread... Anyway, below I add some comments: On Wed, 6 Jun 2018 15:05:58 +0200 Claudio Scordino <claudio@evidence.eu.com> wrote: [...] > >> Ok, I see ... Have you guys already tried something like my patch > >> above (keeping the freq >= this_bw) in real world use cases ? Is > >> this costing that much energy in practice ? If we fill the gaps > >> left by DL (when it > > > > IIRC, Claudio (now Cc-ed) did experiment a bit with both > > approaches, so he might add some numbers to my words above. I > > didn't (yet). But, please consider that I might be reserving (for > > example) 50% of bandwidth for my heavy and time sensitive task and > > then have that task wake up only once in a while (but I'll be > > keeping clock speed up for the whole time). :/ > > As far as I can remember, we never tested energy consumption of > running_bw vs this_bw, as at OSPM'17 we had already decided to use > running_bw implementing GRUB-PA. The rationale is that, as Juri > pointed out, the amount of spare (i.e. reclaimable) bandwidth in > this_bw is very user-dependent. Yes, I agree with this. The appropriateness of using this_bw or running_bw is highly workload-dependent... If a periodic task consumes much less than its runtime (or if a sporadic task has inter-activation times much larger than the SCHED_DEADLINE period), then running_bw has to be preferred. But if a periodic task consumes almost all of its runtime before blocking, then this_bw has to be preferred... But this also depends on the hardware: if the frequency switch time is small, then running_bw is more appropriate... On the other hand, this_bw works much better if the frequency switch time is high. (Talking about this, I remember Claudio measured frequency switch times as large as almost 3 ms... Is this really due to hardware issues? Or maybe there is some software issue involved? 3 ms looks like a lot of time...) Anyway, this is why I proposed to use some kind of /sys/... file to control the kind of deadline utilization used for frequency scaling: in this way, the system designer / administrator, who hopefully has the needed information about workload and hardware, can optimize the frequency scaling behaviour by deciding if running_bw or this_bw will be used. Luca > For example, the user can let this_bw > be much higher than the measured bandwidth, just to be sure that the > deadlines are met even in corner cases. In practice, this means that > the task executes for quite a short time and then blocks (with its > bandwidth reclaimed, hence the CPU frequency reduced, at the 0lag > time). Using this_bw rather than running_bw, the CPU frequency would > remain at the same fixed value even when the task is blocked. I > understand that on some cases it could even be better (i.e. no waste > of energy in frequency switch). However, IMHO, these are corner cases > and in the average case it is better to rely on running_bw and reduce > the CPU frequency accordingly. > > Best regards, > > Claudio
Hi, On Wed, 6 Jun 2018 14:20:46 +0100 Quentin Perret <quentin.perret@arm.com> wrote: [...] > > However, IMHO, these are corner cases and in the average case it is > > better to rely on running_bw and reduce the CPU frequency > > accordingly. > My point was that accepting to go at a lower frequency than required > by this_bw is fundamentally unsafe. If you're at a low frequency when > a DL task starts, there are real situations where you won't be able > to increase the frequency immediately, which can eventually lead to > missing deadlines. Now, if this risk is known, has been discussed, > and is accepted, that's fair enough. I'm just too late for the > discussion :-) Well, our conclusion was that this issue can be addressed when designing the scheduling parameters: - If we do not consider frequency scaling, a task can respect its deadlines if the SCHED_DEADLINE runtime is larger than the task's execution time and the SCHED_DEADLINE period is smaller than the task's period (and if some kind of "global" admission test is respected) - Considering frequency scaling (and 0-time frequency switches), the SCHED_DEADLINE runtime must be larger than the task execution time at the highest frequency - If the frequency switch time is larger than 0, then the SCHED_DEADLINE runtime must be larger than the task execution time (at the highest frequency) plus the frequency switch time If this third condition is respected, I think that deadline misses can be avoided even if running_bw is used (and the CPU takes a considerable time to switch frequency). Of course, this requires an over-allocation of runtime (and the global admission test is more likely to fail)... Luca
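A quick worked illustration of the over-allocation the third condition implies; the numbers are made up for illustration only, reusing the ~3 ms switch time and 10 ms period mentioned elsewhere in this thread.

/*
 * Illustration only.
 *
 *   task period            = 10 ms
 *   execution time at fmax =  4 ms
 *   frequency switch time  =  3 ms
 *
 *   runtime without the third condition: >= 4 ms       -> reserves 4/10 = 40% of a CPU
 *   runtime with    the third condition: >= 4 + 3 ms   -> reserves 7/10 = 70% of a CPU
 *
 * The task still executes for at most 4 ms per period, but admission
 * control now has to account for 70% of a CPU for it, so fewer such
 * tasks can be admitted on the same CPU.
 */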
Hi Luca, On Wednesday 06 Jun 2018 at 23:05:36 (+0200), luca abeni wrote: > Hi, > > On Wed, 6 Jun 2018 14:20:46 +0100 > Quentin Perret <quentin.perret@arm.com> wrote: > [...] > > > However, IMHO, these are corner cases and in the average case it is > > > better to rely on running_bw and reduce the CPU frequency > > > accordingly. > > > > My point was that accepting to go at a lower frequency than required > > by this_bw is fundamentally unsafe. If you're at a low frequency when > > a DL task starts, there are real situations where you won't be able > > to increase the frequency immediately, which can eventually lead to > > missing deadlines. Now, if this risk is known, has been discussed, > > and is accepted, that's fair enough. I'm just too late for the > > discussion :-) > > Well, our conclusion was that this issue can be addressed when > designing the scheduling parameters: > - If we do not consider frequency scaling, a task can respect its > deadlines if the SCHED_DEADLINE runtime is larger than the task's > execution time and the SCHED_DEADLINE period is smaller than the > task's period (and if some kind of "global" admission test is > respected) > - Considering frequency scaling (and 0-time frequency switches), the > SCHED_DEADLINE runtime must be larger than the task execution time at > the highest frequency > - If the frequency switch time is larger than 0, then the > SCHED_DEADLINE runtime must be larger than the task execution time > (at the highest frequency) plus the frequency switch time > > If this third condition is respected, I think that deadline misses can > be avoided even if running_bw is used (and the CPU takes a considerable > time to switch frequency). Of course, this requires an over-allocation > of runtime (and the global admission test has more probabilities to > fail)... Ah, right, this third condition should definitely be a valid workaround to the issue I mentioned earlier. And the runtime parameter is already very much target-dependent I guess, so it should be fine to add another target-specific component (the frequency-switching time) to the runtime estimation. And, if you really need to have tight runtimes to fit all of your tasks, then you should just use a fixed frequency I guess ... At least the current implementation gives a choice to the user between being energy-efficient using sugov with over-allocated runtimes and having tighter runtimes with the performance/powersave/userspace governor, so that's all good :-) Thank you very much, Quentin