Message ID | 1527253951-22709-6-git-send-email-vincent.guittot@linaro.org |
---|---|
State | New |
Series | track CPU utilization
Hi Vincent,

On 25/05/18 15:12, Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> static inline unsigned long cpu_util_dl(struct rq *rq)
> {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

I'd be tempted to say that we actually want to cap to this one above
instead of using the max (as you are proposing below), or the
(theoretical) power reduction benefits of using DEADLINE for certain
tasks might vanish.

> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> +
> + return util;

Anyway, just a quick thought. I guess we should experiment with this a
bit. Unfortunately, I don't have an Arm platform at hand for testing.
Claudio, Luca (now Cc-ed), would you be able to fire some tests with
this change?

Oh, adding Joel and Alessio as well, who experimented with DEADLINE
lately.

Thanks,

- Juri
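For reference, the conversion being discussed turns the DL class's fixed-point
running bandwidth into the 0..1024 capacity scale. A minimal userspace sketch
of the arithmetic (illustrative only; it assumes BW_SHIFT is 20 and
SCHED_CAPACITY_SCALE is 1024, as defined in kernel/sched/sched.h):

#include <stdio.h>

#define BW_SHIFT		20	/* assumption: as in kernel/sched/sched.h */
#define SCHED_CAPACITY_SCALE	1024

int main(void)
{
	/* A DL task with 10ms runtime every 100ms reserves 1/10 of the CPU;
	 * running_bw holds that fraction in BW_SHIFT fixed point. */
	unsigned long long running_bw = (10ULL << BW_SHIFT) / 100;
	unsigned long util = (running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	printf("running_bw=%llu -> util=%lu of %d\n",
	       running_bw, util, SCHED_CAPACITY_SCALE);	/* util ~= 102 */
	return 0;
}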
Hi Juri,

On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Vincent,
>
> On 25/05/18 15:12, Vincent Guittot wrote:
>> Now that we have both the dl class bandwidth requirement and the dl class
>> utilization, we can use the max of the 2 values when aggregating the
>> utilization of the CPU.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>> kernel/sched/sched.h | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 4526ba6..0eb07a8 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> static inline unsigned long cpu_util_dl(struct rq *rq)
>> {
>> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>
> I'd be tempted to say that we actually want to cap to this one above
> instead of using the max (as you are proposing below), or the
> (theoretical) power reduction benefits of using DEADLINE for certain
> tasks might vanish.

The problem that I'm facing is that the sched_entity bandwidth is
removed after the 0-lag time and rq->dl.running_bw goes back to zero.
But if the DL task has preempted a CFS task, the utilization of the
CFS task will be lower than reality and schedutil will set a lower
OPP even though the CPU never goes idle. The example with an RT task
described in the cover letter can be run with a DL task and will give
similar results.
avg_dl.util_avg tracks the utilization of the rq as seen by the scheduler,
whereas rq->dl.running_bw gives the minimum needed to match the DL requirement.

>
>> +
>> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> +
>> + return util;
>
> Anyway, just a quick thought. I guess we should experiment with this a
> bit. Unfortunately, I don't have an Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
>
> Oh, adding Joel and Alessio as well, who experimented with DEADLINE
> lately.
>
> Thanks,
>
> - Juri
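To put rough numbers on the scenario Vincent describes: while preempted, a
CFS task is runnable but not running, so its util_avg decays with the PELT
half-life of 32ms even though the CPU stays busy. A toy model (ignoring
util_est, which only takes a sample at dequeue and so never updates for an
always-running task):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* A CFS task that had built up util_avg ~1024 (always running) is
	 * preempted by a DL task; while it waits, PELT decays its util_avg
	 * with a 32ms half-life, so schedutil sees a shrinking request. */
	int ms;

	for (ms = 8; ms <= 64; ms *= 2)
		printf("after %2dms of preemption: cfs util_avg ~ %4.0f\n",
		       ms, 1024.0 * pow(0.5, ms / 32.0));
	return 0;
}

This prints roughly 861, 724, 512 and 256 for 8, 16, 32 and 64ms of
preemption.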
On 28/05/18 16:57, Vincent Guittot wrote:
> Hi Juri,
>
> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> > Hi Vincent,
> >
> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> utilization, we can use the max of the 2 values when aggregating the
> >> utilization of the CPU.
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >> kernel/sched/sched.h | 6 +++++-
> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 4526ba6..0eb07a8 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> {
> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >
> > I'd be tempted to say that we actually want to cap to this one above
> > instead of using the max (as you are proposing below), or the
> > (theoretical) power reduction benefits of using DEADLINE for certain
> > tasks might vanish.
>
> The problem that I'm facing is that the sched_entity bandwidth is
> removed after the 0-lag time and rq->dl.running_bw goes back to zero.
> But if the DL task has preempted a CFS task, the utilization of the
> CFS task will be lower than reality and schedutil will set a lower
> OPP even though the CPU never goes idle. The example with an RT task
> described in the cover letter can be run with a DL task and will give
> similar results.
> avg_dl.util_avg tracks the utilization of the rq as seen by the scheduler,
> whereas rq->dl.running_bw gives the minimum needed to match the DL requirement.

Mmm, I see. Note that I'm only being cautious: what you propose might
work OK, but it seems to me that we might lose some of the benefits of
running tasks with DEADLINE if we start selecting frequency as you
propose even when such tasks are running.

An idea might be to copy the running_bw util into dl.util_avg when a DL task
goes to sleep, and then decay the latter as for the RT contribution. What
do you think?
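A rough sketch of the idea Juri describes (the helper name and hook point are
hypothetical; it would have to run when the last DL task on the rq blocks,
before running_bw is released at the 0-lag time):

/*
 * Hypothetical sketch, not a tested patch: seed avg_dl.util_avg with the
 * utilization implied by running_bw when the last DL task goes to sleep,
 * and let the regular PELT update path decay it from there, as for RT.
 */
static inline void dl_seed_util_avg(struct rq *rq)
{
	unsigned long bw_util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	WRITE_ONCE(rq->avg_dl.util_avg, bw_util);
}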
On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 28/05/18 16:57, Vincent Guittot wrote:
>> Hi Juri,
>>
>> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
>> > Hi Vincent,
>> >
>> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> utilization, we can use the max of the 2 values when aggregating the
>> >> utilization of the CPU.
>> >>
>> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> >> ---
>> >> kernel/sched/sched.h | 6 +++++-
>> >> 1 file changed, 5 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index 4526ba6..0eb07a8 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> {
>> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >
>> > I'd be tempted to say that we actually want to cap to this one above
>> > instead of using the max (as you are proposing below), or the
>> > (theoretical) power reduction benefits of using DEADLINE for certain
>> > tasks might vanish.
>>
>> The problem that I'm facing is that the sched_entity bandwidth is
>> removed after the 0-lag time and rq->dl.running_bw goes back to zero.
>> But if the DL task has preempted a CFS task, the utilization of the
>> CFS task will be lower than reality and schedutil will set a lower
>> OPP even though the CPU never goes idle. The example with an RT task
>> described in the cover letter can be run with a DL task and will give
>> similar results.
>> avg_dl.util_avg tracks the utilization of the rq as seen by the scheduler,
>> whereas rq->dl.running_bw gives the minimum needed to match the DL requirement.
>
> Mmm, I see. Note that I'm only being cautious: what you propose might
> work OK, but it seems to me that we might lose some of the benefits of
> running tasks with DEADLINE if we start selecting frequency as you
> propose even when such tasks are running.

I see your point. Taking into account the number of runnable CFS tasks
to choose between rq->dl.running_bw and avg_dl.util_avg could help.

>
> An idea might be to copy the running_bw util into dl.util_avg when a DL task
> goes to sleep, and then decay the latter as for the RT contribution. What
> do you think?

Not sure that this will work, because you will overwrite the value each
time a DL task goes to sleep, and the decay will mainly happen on the
update when the last DL task goes to sleep, which might not reflect what
has been used by DL tasks but only the requirement of the last running
DL task.
The other benefit of PELT is to have utilization tracking which uses the
same curve as CFS, so the increase of cfs_rq->avg.util_avg and the
decrease of avg_dl.util_avg will compensate each other (or the opposite).
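The heuristic Vincent mentions could look roughly like this (an untested
sketch, not part of the posted series):

static inline unsigned long cpu_util_dl_sketch(struct rq *rq)
{
	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/*
	 * Only let the PELT signal raise the estimate above the DL
	 * bandwidth while CFS tasks are runnable, i.e. while something
	 * is actually being starved of CPU time by the DL class.
	 */
	if (rq->cfs.h_nr_running)
		util = max_t(unsigned long, util,
			     READ_ONCE(rq->avg_dl.util_avg));

	return util;
}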
On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
[..]
> > +
> > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > +
> > + return util;
>
> Anyway, just a quick thought. I guess we should experiment with this a
> bit. Unfortunately, I don't have an Arm platform at hand for testing.
> Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> this change?
>
> Oh, adding Joel and Alessio as well, who experimented with DEADLINE
> lately.

I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
beyond what the running bandwidth is, or at least do that only if CFS tasks
are running and being preempted, as you/Vincent mentioned in one of the
threads.

With our DL experiments, I didn't measure power but got it to a point where
the OPP is scaling correctly based on DL parameters. I think Alessio did
measure power in his setup but I can't recall now. Alessio?

thanks,

- Joel
On 28/05/18 22:08, Joel Fernandes wrote:
> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> [..]
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > > +
> > > + return util;
> >
> > Anyway, just a quick thought. I guess we should experiment with this a
> > bit. Unfortunately, I don't have an Arm platform at hand for testing.
> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> > this change?
> >
> > Oh, adding Joel and Alessio as well, who experimented with DEADLINE
> > lately.
>
> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
> beyond what the running bandwidth is, or at least do that only if CFS tasks
> are running and being preempted, as you/Vincent mentioned in one of the
> threads.

It's however a bit awkward that we might be running at a higher OPP when
CFS tasks are running (even though they are of lower priority). :/

> With our DL experiments, I didn't measure power but got it to a point where
> the OPP is scaling correctly based on DL parameters. I think Alessio did
> measure power in his setup but I can't recall now. Alessio?

I see. Thanks.
On 29 May 2018 at 08:31, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 28/05/18 22:08, Joel Fernandes wrote:
>> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
>> [..]
>> > > +
>> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>> > > +
>> > > + return util;
>> >
>> > Anyway, just a quick thought. I guess we should experiment with this a
>> > bit. Unfortunately, I don't have an Arm platform at hand for testing.
>> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
>> > this change?
>> >
>> > Oh, adding Joel and Alessio as well, who experimented with DEADLINE
>> > lately.
>>
>> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
>> beyond what the running bandwidth is, or at least do that only if CFS tasks
>> are running and being preempted, as you/Vincent mentioned in one of the
>> threads.
>
> It's however a bit awkward that we might be running at a higher OPP when
> CFS tasks are running (even though they are of lower priority). :/

Even if CFS tasks have lower priority, that doesn't mean that we should
not take their needs into account.
In the same way, we run at max OPP as soon as an RT task is runnable.

>
>> With our DL experiments, I didn't measure power but got it to a point where
>> the OPP is scaling correctly based on DL parameters. I think Alessio did
>> measure power in his setup but I can't recall now. Alessio?
>
> I see. Thanks.
Hi Vincent,

On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> Now that we have both the dl class bandwidth requirement and the dl class
> utilization, we can use the max of the 2 values when aggregating the
> utilization of the CPU.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/sched.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 4526ba6..0eb07a8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> static inline unsigned long cpu_util_dl(struct rq *rq)
> {
> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> +
> + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));

Would it make sense to use a UTIL_EST version of that signal here? I
don't think that would make sense for the RT class with your patch-set
since you only really use the blocked part of the signal for RT IIUC,
but would that work for DL?

> +
> + return util;
> }
>
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> --
> 2.7.4
>
On 29/05/18 08:48, Vincent Guittot wrote:
> On 29 May 2018 at 08:31, Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 28/05/18 22:08, Joel Fernandes wrote:
> >> On Mon, May 28, 2018 at 12:12:34PM +0200, Juri Lelli wrote:
> >> [..]
> >> > > +
> >> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> >> > > +
> >> > > + return util;
> >> >
> >> > Anyway, just a quick thought. I guess we should experiment with this a
> >> > bit. Unfortunately, I don't have an Arm platform at hand for testing.
> >> > Claudio, Luca (now Cc-ed), would you be able to fire some tests with
> >> > this change?
> >> >
> >> > Oh, adding Joel and Alessio as well, who experimented with DEADLINE
> >> > lately.
> >>
> >> I also feel that for power reasons, dl.util_avg shouldn't drive the OPP
> >> beyond what the running bandwidth is, or at least do that only if CFS tasks
> >> are running and being preempted, as you/Vincent mentioned in one of the
> >> threads.
> >
> > It's however a bit awkward that we might be running at a higher OPP when
> > CFS tasks are running (even though they are of lower priority). :/
>
> Even if CFS tasks have lower priority, that doesn't mean that we should
> not take their needs into account.
> In the same way, we run at max OPP as soon as an RT task is runnable.

Sure. What I fear is a small CFS utilization generating spikes because
dl.util_avg became big while running DL tasks. Not sure that can happen,
though, because such DL tasks should be throttled anyway.
On 29/05/18 09:40, Quentin Perret wrote:
> Hi Vincent,
>
> On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > Now that we have both the dl class bandwidth requirement and the dl class
> > utilization, we can use the max of the 2 values when aggregating the
> > utilization of the CPU.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/sched.h | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 4526ba6..0eb07a8 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > static inline unsigned long cpu_util_dl(struct rq *rq)
> > {
> > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > +
> > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
>
> Would it make sense to use a UTIL_EST version of that signal here? I
> don't think that would make sense for the RT class with your patch-set
> since you only really use the blocked part of the signal for RT IIUC,
> but would that work for DL?

Well, UTIL_EST for DL looks pretty much like what we already do by
computing utilization based on dl.running_bw. That's why I was thinking
of using that as a starting point for the dl.util_avg decay phase.
On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> On 29/05/18 09:40, Quentin Perret wrote:
> > Hi Vincent,
> >
> > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > Now that we have both the dl class bandwidth requirement and the dl class
> > > utilization, we can use the max of the 2 values when aggregating the
> > > utilization of the CPU.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > > kernel/sched/sched.h | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 4526ba6..0eb07a8 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > static inline unsigned long cpu_util_dl(struct rq *rq)
> > > {
> > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > +
> > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> >
> > Would it make sense to use a UTIL_EST version of that signal here? I
> > don't think that would make sense for the RT class with your patch-set
> > since you only really use the blocked part of the signal for RT IIUC,
> > but would that work for DL?
>
> Well, UTIL_EST for DL looks pretty much like what we already do by
> computing utilization based on dl.running_bw. That's why I was thinking
> of using that as a starting point for the dl.util_avg decay phase.

Hmmm I see your point, but running_bw and the util_avg are fundamentally
different ... I mean, the util_avg doesn't know about the period, which is
an issue in itself I guess ...

If you have a long running DL task (say 100ms runtime) with a long period
(say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
the util_avg can go quite high, which means that you might end up
executing this task at max OPP. So if we really want to drive OPPs like
that for deadline, a util_est-like version of this util_avg signal
should help. Now, you can also argue that going to max OPP for a task
that _we know_ uses 1/10 of the CPU capacity isn't right ...
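Quentin's example, worked out numerically (a sketch assuming BW_SHIFT is 20
and the 32ms PELT half-life):

#include <math.h>
#include <stdio.h>

#define BW_SHIFT		20	/* assumption: as in kernel/sched/sched.h */
#define SCHED_CAPACITY_SCALE	1024

int main(void)
{
	/* DL task with 100ms runtime every 1s: running_bw is ~0.1 in
	 * BW_SHIFT fixed point. */
	unsigned long long running_bw = (100ULL << BW_SHIFT) / 1000;
	unsigned long bw_util = (running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

	/* Peak PELT util_avg after running 100ms straight from zero,
	 * with the 32ms half-life: 1024 * (1 - 2^(-100/32)). */
	double pelt_peak = 1024.0 * (1.0 - pow(0.5, 100.0 / 32.0));

	printf("running_bw util: ~%lu/1024 (~10%% of capacity)\n", bw_util);
	printf("util_avg peak:   ~%.0f/1024 (~89%% of capacity)\n", pelt_peak);
	return 0;
}

The gap between ~102 and ~907 is exactly why the max of the two signals can
push a 10%-bandwidth DL task toward the highest OPP.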
On 30/05/18 09:37, Quentin Perret wrote:
> On Tuesday 29 May 2018 at 11:52:03 (+0200), Juri Lelli wrote:
> > On 29/05/18 09:40, Quentin Perret wrote:
> > > Hi Vincent,
> > >
> > > On Friday 25 May 2018 at 15:12:26 (+0200), Vincent Guittot wrote:
> > > > Now that we have both the dl class bandwidth requirement and the dl class
> > > > utilization, we can use the max of the 2 values when aggregating the
> > > > utilization of the CPU.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > > kernel/sched/sched.h | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > > index 4526ba6..0eb07a8 100644
> > > > --- a/kernel/sched/sched.h
> > > > +++ b/kernel/sched/sched.h
> > > > @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > > > #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > > static inline unsigned long cpu_util_dl(struct rq *rq)
> > > > {
> > > > - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > > +
> > > > + util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
> > >
> > > Would it make sense to use a UTIL_EST version of that signal here? I
> > > don't think that would make sense for the RT class with your patch-set
> > > since you only really use the blocked part of the signal for RT IIUC,
> > > but would that work for DL?
> >
> > Well, UTIL_EST for DL looks pretty much like what we already do by
> > computing utilization based on dl.running_bw. That's why I was thinking
> > of using that as a starting point for the dl.util_avg decay phase.
>
> Hmmm I see your point, but running_bw and the util_avg are fundamentally
> different ... I mean, the util_avg doesn't know about the period, which is
> an issue in itself I guess ...
>
> If you have a long running DL task (say 100ms runtime) with a long period
> (say 1s), the running_bw should represent ~1/10 of the CPU capacity, but
> the util_avg can go quite high, which means that you might end up
> executing this task at max OPP. So if we really want to drive OPPs like
> that for deadline, a util_est-like version of this util_avg signal
> should help. Now, you can also argue that going to max OPP for a task
> that _we know_ uses 1/10 of the CPU capacity isn't right ...

Yep, that's my point. :)
Hi Vincent, Juri,

On 28-May 18:34, Vincent Guittot wrote:
> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:
> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> Hi Juri,
> >>
> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> >> > Hi Vincent,
> >> >
> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> >> utilization, we can use the max of the 2 values when aggregating the
> >> >> utilization of the CPU.
> >> >>
> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> >> ---
> >> >> kernel/sched/sched.h | 6 +++++-
> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> index 4526ba6..0eb07a8 100644
> >> >> --- a/kernel/sched/sched.h
> >> >> +++ b/kernel/sched/sched.h
> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >> {
> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >
> >> > I'd be tempted to say that we actually want to cap to this one above
> >> > instead of using the max (as you are proposing below), or the
> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> > tasks might vanish.
> >>
> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> removed after the 0-lag time and rq->dl.running_bw goes back to zero.
> >> But if the DL task has preempted a CFS task, the utilization of the
> >> CFS task will be lower than reality and schedutil will set a lower
> >> OPP even though the CPU never goes idle.

With UTIL_EST enabled I don't expect an OPP reduction below the
expected utilization of a CFS task.

IOW, when a periodic CFS task is preempted by a DL one, what we use
for OPP selection once the DL task is over is still the estimated
utilization for the CFS task itself. Thus, schedutil will eventually
(since we have quite conservative down scaling thresholds) go down to
the right OPP to serve that task.

> >> >> The example with an RT task described in the cover letter can be
> >> >> run with a DL task and will give similar results.

In the cover letter you say:

   An rt-app use case which creates an always-running cfs thread and an
   rt thread that wakes up periodically, with both threads pinned on the
   same CPU, shows a lot of frequency switches of the CPU whereas the
   CPU never goes idle during the test.

I would say that's a quite specific corner case where your always
running CFS task has never accumulated a util_est sample.

Do we really have these cases in real systems?

Otherwise, it seems to me that we are trying to solve quite specific
corner cases by adding a non-negligible level of "complexity".

Moreover, I also have the impression that we can fix these
use-cases by:

 - improving the way we accumulate samples in util_est,
   i.e. by discarding preemption time

 - maybe by improving the utilization aggregation in schedutil to
   better understand DL requirements,
   i.e. a 10% utilization with a 100ms running time is way different
   from the same utilization with a 1ms running time

--
#include <best/regards.h>

Patrick Bellasi
On 31 May 2018 at 12:27, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
>
> Hi Vincent, Juri,
>
> On 28-May 18:34, Vincent Guittot wrote:
>> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:
>> > On 28/05/18 16:57, Vincent Guittot wrote:
>> >> Hi Juri,
>> >>
>> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
>> >> > Hi Vincent,
>> >> >
>> >> > On 25/05/18 15:12, Vincent Guittot wrote:
>> >> >> Now that we have both the dl class bandwidth requirement and the dl class
>> >> >> utilization, we can use the max of the 2 values when aggregating the
>> >> >> utilization of the CPU.
>> >> >>
>> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> >> >> ---
>> >> >> kernel/sched/sched.h | 6 +++++-
>> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> >> index 4526ba6..0eb07a8 100644
>> >> >> --- a/kernel/sched/sched.h
>> >> >> +++ b/kernel/sched/sched.h
>> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
>> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
>> >> >> {
>> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
>> >> >
>> >> > I'd be tempted to say that we actually want to cap to this one above
>> >> > instead of using the max (as you are proposing below), or the
>> >> > (theoretical) power reduction benefits of using DEADLINE for certain
>> >> > tasks might vanish.
>> >>
>> >> The problem that I'm facing is that the sched_entity bandwidth is
>> >> removed after the 0-lag time and rq->dl.running_bw goes back to zero.
>> >> But if the DL task has preempted a CFS task, the utilization of the
>> >> CFS task will be lower than reality and schedutil will set a lower
>> >> OPP even though the CPU never goes idle.
>
> With UTIL_EST enabled I don't expect an OPP reduction below the
> expected utilization of a CFS task.

I'm not sure I fully catch what you mean, but all the tests that I ran
are using util_est (which is enabled by default, if I'm not wrong). So
all the OPP drops that have been observed were with util_est.

>
> IOW, when a periodic CFS task is preempted by a DL one, what we use
> for OPP selection once the DL task is over is still the estimated
> utilization for the CFS task itself. Thus, schedutil will eventually
> (since we have quite conservative down scaling thresholds) go down to
> the right OPP to serve that task.
>
>> >> The example with an RT task described in the cover letter can be
>> >> run with a DL task and will give similar results.
>
> In the cover letter you say:
>
>    An rt-app use case which creates an always-running cfs thread and an
>    rt thread that wakes up periodically, with both threads pinned on the
>    same CPU, shows a lot of frequency switches of the CPU whereas the
>    CPU never goes idle during the test.
>
> I would say that's a quite specific corner case where your always
> running CFS task has never accumulated a util_est sample.
>
> Do we really have these cases in real systems?

My example is deliberately an extreme one because it's easier to
highlight the problem.

>
> Otherwise, it seems to me that we are trying to solve quite specific
> corner cases by adding a non-negligible level of "complexity".
By complexity, do you mean:

Taking into account the number of runnable CFS tasks to choose between
rq->dl.running_bw and avg_dl.util_avg?

I'm preparing a patchset that will provide the cfs waiting time in
addition to dl/rt util_avg for almost no additional cost. I will try
to send the proposal later today.

>
> Moreover, I also have the impression that we can fix these
> use-cases by:
>
>  - improving the way we accumulate samples in util_est,
>    i.e. by discarding preemption time
>
>  - maybe by improving the utilization aggregation in schedutil to
>    better understand DL requirements,
>    i.e. a 10% utilization with a 100ms running time is way different
>    from the same utilization with a 1ms running time
>
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On Thursday 31 May 2018 at 15:02:04 (+0200), Vincent Guittot wrote:
> On 31 May 2018 at 12:27, Patrick Bellasi <patrick.bellasi@arm.com> wrote:
> >
> > Hi Vincent, Juri,
> >
> > On 28-May 18:34, Vincent Guittot wrote:
> >> On 28 May 2018 at 17:22, Juri Lelli <juri.lelli@redhat.com> wrote:
> >> > On 28/05/18 16:57, Vincent Guittot wrote:
> >> >> Hi Juri,
> >> >>
> >> >> On 28 May 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> >> >> > Hi Vincent,
> >> >> >
> >> >> > On 25/05/18 15:12, Vincent Guittot wrote:
> >> >> >> Now that we have both the dl class bandwidth requirement and the dl class
> >> >> >> utilization, we can use the max of the 2 values when aggregating the
> >> >> >> utilization of the CPU.
> >> >> >>
> >> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> >> >> ---
> >> >> >> kernel/sched/sched.h | 6 +++++-
> >> >> >> 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >> >>
> >> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> >> index 4526ba6..0eb07a8 100644
> >> >> >> --- a/kernel/sched/sched.h
> >> >> >> +++ b/kernel/sched/sched.h
> >> >> >> @@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >> >> >> #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> >> >> >> static inline unsigned long cpu_util_dl(struct rq *rq)
> >> >> >> {
> >> >> >> - return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >> + unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> >> >> >
> >> >> > I'd be tempted to say that we actually want to cap to this one above
> >> >> > instead of using the max (as you are proposing below), or the
> >> >> > (theoretical) power reduction benefits of using DEADLINE for certain
> >> >> > tasks might vanish.
> >> >>
> >> >> The problem that I'm facing is that the sched_entity bandwidth is
> >> >> removed after the 0-lag time and rq->dl.running_bw goes back to zero.
> >> >> But if the DL task has preempted a CFS task, the utilization of the
> >> >> CFS task will be lower than reality and schedutil will set a lower
> >> >> OPP even though the CPU never goes idle.
> >
> > With UTIL_EST enabled I don't expect an OPP reduction below the
> > expected utilization of a CFS task.
>
> I'm not sure I fully catch what you mean, but all the tests that I ran
> are using util_est (which is enabled by default, if I'm not wrong). So
> all the OPP drops that have been observed were with util_est.
>
> >
> > IOW, when a periodic CFS task is preempted by a DL one, what we use
> > for OPP selection once the DL task is over is still the estimated
> > utilization for the CFS task itself. Thus, schedutil will eventually
> > (since we have quite conservative down scaling thresholds) go down to
> > the right OPP to serve that task.
> >
> >> >> The example with an RT task described in the cover letter can be
> >> >> run with a DL task and will give similar results.
> >
> > In the cover letter you say:
> >
> >    An rt-app use case which creates an always-running cfs thread and an
> >    rt thread that wakes up periodically, with both threads pinned on the
> >    same CPU, shows a lot of frequency switches of the CPU whereas the
> >    CPU never goes idle during the test.
> >
> > I would say that's a quite specific corner case where your always
> > running CFS task has never accumulated a util_est sample.
> >
> > Do we really have these cases in real systems?
>
> My example is deliberately an extreme one because it's easier to
> highlight the problem.
>
> >
> > Otherwise, it seems to me that we are trying to solve quite specific
> > corner cases by adding a non-negligible level of "complexity".
>
> By complexity, do you mean:
>
> Taking into account the number of runnable CFS tasks to choose between
> rq->dl.running_bw and avg_dl.util_avg?
>
> I'm preparing a patchset that will provide the cfs waiting time in
> addition to dl/rt util_avg for almost no additional cost. I will try
> to send the proposal later today.

The code below adds the tracking of the waiting level of cfs tasks due to
rt/dl preemption. This waiting time can then be used when selecting an OPP,
instead of the dl util_avg, which could become higher than the dl bandwidth
with a "long" runtime.

We need only one new call, for the 1st cfs task that is enqueued, to get
these additional metrics. The call to arch_scale_cpu_capacity() can be
removed once the latter is taken into account when computing the load
(which currently scales only with frequency).

For rt tasks, we must keep taking util_avg into account to have an idea of
the rt level on the cpu; for dl, that is given by the bandwidth.

---
 kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
 kernel/sched/pelt.c  |  8 ++++++--
 kernel/sched/sched.h |  4 +++-
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eac1f9a..1682ea7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif

+static inline void update_cfs_wait_util_avg(struct rq *rq)
+{
+	/*
+	 * If cfs is already enqueued, we don't have anything to do because
+	 * we already updated the non-waiting time.
+	 */
+	if (rq->cfs.h_nr_running)
+		return;
+
+	/*
+	 * If rt is running, we update the non-waiting time before
+	 * cfs.h_nr_running is increased.
+	 */
+	if (rq->curr->sched_class == &rt_sched_class)
+		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
+
+	/*
+	 * If dl is running, we update the non-waiting time before
+	 * cfs.h_nr_running is increased.
+	 */
+	if (rq->curr->sched_class == &dl_sched_class)
+		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5159,6 +5183,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;

+	/* Update the tracking of waiting time */
+	update_cfs_wait_util_avg(rq);
+
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index a559a53..ef8905e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -322,9 +322,11 @@ int __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)

 int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
 {
+	unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
 	if (___update_load_sum(now, rq->cpu, &rq->avg_rt,
 				running,
-				running,
+				waiting,
 				running)) {

 		___update_load_avg(&rq->avg_rt, 1, 1);
@@ -345,9 +347,11 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)

 int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
 {
+	unsigned long waiting = rq->cfs.h_nr_running ? arch_scale_cpu_capacity(NULL, rq->cpu) : 0;
+
 	if (___update_load_sum(now, rq->cpu, &rq->avg_dl,
 				running,
-				running,
+				waiting,
 				running)) {

 		___update_load_avg(&rq->avg_dl, 1, 1);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0ea94de..9f72a05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2184,7 +2184,9 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
 {
 	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

-	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+	util = max_t(unsigned long, util,
+		     READ_ONCE(rq->avg_dl.runnable_load_avg));
+	trace_printk("cpu_util_dl cpu%d %u %lu %llu", rq->cpu, rq->cfs.h_nr_running, READ_ONCE(rq->avg_dl.util_avg),
--
2.7.4

> >
> > Moreover, I also have the impression that we can fix these
> > use-cases by:
> >
> >  - improving the way we accumulate samples in util_est,
> >    i.e. by discarding preemption time
> >
> >  - maybe by improving the utilization aggregation in schedutil to
> >    better understand DL requirements,
> >    i.e. a 10% utilization with a 100ms running time is way different
> >    from the same utilization with a 1ms running time
> >
> >
> > --
> > #include <best/regards.h>
> >
> > Patrick Bellasi
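To get a feel for what the new "waiting" contribution converges to, here is
a toy userspace model of the patch above (a geometric approximation of PELT
with a 32ms half-life and 1ms steps; the kernel's implementation uses 1024us
segments and integer math):

#include <math.h>
#include <stdio.h>

int main(void)
{
	const double y = pow(0.5, 1.0 / 32.0);	/* per-ms decay factor */
	double waiting = 0.0, lo = 1e9, hi = 0.0;
	int t;

	for (t = 0; t < 2000; t++) {
		/* 10ms DL burst every 100ms while one CFS task stays
		 * runnable: the waiting signal integrates at full capacity
		 * (1024) during the burst and decays the rest of the time. */
		int dl_running = (t % 100) < 10;

		waiting = waiting * y + (dl_running ? 1024.0 * (1.0 - y) : 0.0);
		if (t >= 1900) {	/* observe the last full cycle */
			if (waiting < lo)
				lo = waiting;
			if (waiting > hi)
				hi = waiting;
		}
	}
	/* Settles into an oscillation between ~32 and ~225 out of 1024. */
	printf("waiting signal: min ~%.0f, max ~%.0f (of 1024)\n", lo, hi);
	return 0;
}

So for a 10% DL duty cycle the waiting signal peaks at roughly 22% of
capacity right after each burst and decays between bursts, rather than
saturating the way a long-running DL task's util_avg can.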
On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> > >> >> The example with an RT task described in the cover letter can be
> > >> >> run with a DL task and will give similar results.
> > >
> > > In the cover letter you say:
> > >
> > >    An rt-app use case which creates an always-running cfs thread and an
> > >    rt thread that wakes up periodically, with both threads pinned on the
> > >    same CPU, shows a lot of frequency switches of the CPU whereas the
> > >    CPU never goes idle during the test.
> > >
> > > I would say that's a quite specific corner case where your always
> > > running CFS task has never accumulated a util_est sample.
> > >
> > > Do we really have these cases in real systems?
> >
> > My example is deliberately an extreme one because it's easier to
> > highlight the problem.
> >
> > >
> > > Otherwise, it seems to me that we are trying to solve quite specific
> > > corner cases by adding a non-negligible level of "complexity".
> >
> > By complexity, do you mean:
> >
> > Taking into account the number of runnable CFS tasks to choose between
> > rq->dl.running_bw and avg_dl.util_avg?
> >
> > I'm preparing a patchset that will provide the cfs waiting time in
> > addition to dl/rt util_avg for almost no additional cost. I will try
> > to send the proposal later today.
>
> The code below adds the tracking of the waiting level of cfs tasks due to
> rt/dl preemption. This waiting time can then be used when selecting an OPP,
> instead of the dl util_avg, which could become higher than the dl bandwidth
> with a "long" runtime.
>
> We need only one new call, for the 1st cfs task that is enqueued, to get
> these additional metrics. The call to arch_scale_cpu_capacity() can be
> removed once the latter is taken into account when computing the load
> (which currently scales only with frequency).
>
> For rt tasks, we must keep taking util_avg into account to have an idea of
> the rt level on the cpu; for dl, that is given by the bandwidth.
>
> ---
>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
>  kernel/sched/pelt.c  |  8 ++++++--
>  kernel/sched/sched.h |  4 +++-
>  3 files changed, 36 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eac1f9a..1682ea7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>
> +static inline void update_cfs_wait_util_avg(struct rq *rq)
> +{
> +	/*
> +	 * If cfs is already enqueued, we don't have anything to do because
> +	 * we already updated the non-waiting time.
> +	 */
> +	if (rq->cfs.h_nr_running)
> +		return;
> +
> +	/*
> +	 * If rt is running, we update the non-waiting time before
> +	 * cfs.h_nr_running is increased.
> +	 */
> +	if (rq->curr->sched_class == &rt_sched_class)
> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
> +
> +	/*
> +	 * If dl is running, we update the non-waiting time before
> +	 * cfs.h_nr_running is increased.
> +	 */
> +	if (rq->curr->sched_class == &dl_sched_class)
> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
> +}
> +

Please correct me if I'm wrong, but the CFS preemption-decay happens in
set_next_entity -> update_load_avg when the CFS task is scheduled again after
the preemption. Then can we not fix this issue by doing our UTIL_EST magic
from set_next_entity? But yeah, probably we need to be careful with overhead...

IMO I feel it's overkill to account dl_avg when we already have DL's running
bandwidth we can use. I understand it may be too instantaneous, but perhaps we
can fix CFS's problems within CFS itself and not have to do this kind of
extra external accounting?

I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
from within the CFS class as is being done above.

thanks,

- Joel
On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
> On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> > >> >> The example with an RT task described in the cover letter can be
>> > >> >> run with a DL task and will give similar results.
>> > >
>> > > In the cover letter you say:
>> > >
>> > >    An rt-app use case which creates an always-running cfs thread and an
>> > >    rt thread that wakes up periodically, with both threads pinned on the
>> > >    same CPU, shows a lot of frequency switches of the CPU whereas the
>> > >    CPU never goes idle during the test.
>> > >
>> > > I would say that's a quite specific corner case where your always
>> > > running CFS task has never accumulated a util_est sample.
>> > >
>> > > Do we really have these cases in real systems?
>> >
>> > My example is deliberately an extreme one because it's easier to
>> > highlight the problem.
>> >
>> > >
>> > > Otherwise, it seems to me that we are trying to solve quite specific
>> > > corner cases by adding a non-negligible level of "complexity".
>> >
>> > By complexity, do you mean:
>> >
>> > Taking into account the number of runnable CFS tasks to choose between
>> > rq->dl.running_bw and avg_dl.util_avg?
>> >
>> > I'm preparing a patchset that will provide the cfs waiting time in
>> > addition to dl/rt util_avg for almost no additional cost. I will try
>> > to send the proposal later today.
>>
>> The code below adds the tracking of the waiting level of cfs tasks due to
>> rt/dl preemption. This waiting time can then be used when selecting an OPP,
>> instead of the dl util_avg, which could become higher than the dl bandwidth
>> with a "long" runtime.
>>
>> We need only one new call, for the 1st cfs task that is enqueued, to get
>> these additional metrics. The call to arch_scale_cpu_capacity() can be
>> removed once the latter is taken into account when computing the load
>> (which currently scales only with frequency).
>>
>> For rt tasks, we must keep taking util_avg into account to have an idea of
>> the rt level on the cpu; for dl, that is given by the bandwidth.
>>
>> ---
>>  kernel/sched/fair.c  | 27 +++++++++++++++++++++++++++
>>  kernel/sched/pelt.c  |  8 ++++++--
>>  kernel/sched/sched.h |  4 +++-
>>  3 files changed, 36 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eac1f9a..1682ea7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5148,6 +5148,30 @@ static inline void hrtick_update(struct rq *rq)
>>  }
>>  #endif
>>
>> +static inline void update_cfs_wait_util_avg(struct rq *rq)
>> +{
>> +	/*
>> +	 * If cfs is already enqueued, we don't have anything to do because
>> +	 * we already updated the non-waiting time.
>> +	 */
>> +	if (rq->cfs.h_nr_running)
>> +		return;
>> +
>> +	/*
>> +	 * If rt is running, we update the non-waiting time before
>> +	 * cfs.h_nr_running is increased.
>> +	 */
>> +	if (rq->curr->sched_class == &rt_sched_class)
>> +		update_rt_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +
>> +	/*
>> +	 * If dl is running, we update the non-waiting time before
>> +	 * cfs.h_nr_running is increased.
>> +	 */
>> +	if (rq->curr->sched_class == &dl_sched_class)
>> +		update_dl_rq_load_avg(rq_clock_task(rq), rq, 1);
>> +}
>> +
>
> Please correct me if I'm wrong, but the CFS preemption-decay happens in
> set_next_entity -> update_load_avg when the CFS task is scheduled again after
> the preemption. Then can we not fix this issue by doing our UTIL_EST magic
> from set_next_entity? But yeah, probably we need to be careful with overhead...
util_est is there to keep track of the last max. I'm not sure that trying
to add some magic to take preemption into account is the right way to go.
Mixing several pieces of information in the same metric just adds more
fuzziness to the meaning of the metric.

>
> IMO I feel it's overkill to account dl_avg when we already have DL's running
> bandwidth we can use. I understand it may be too instantaneous, but perhaps we

We keep using the dl bandwidth, which is quite correct for dl needs but
doesn't reflect how it has disturbed the other classes.

> can fix CFS's problems within CFS itself and not have to do this kind of
> extra external accounting?
>
> I also feel it's better if we don't have to call update_{rt,dl}_rq_load_avg
> from within the CFS class as is being done above.
>
> thanks,
>
> - Joel
>
Hi Vincent,

On 04/06/18 08:41, Vincent Guittot wrote:
> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:

[...]

> > IMO I feel it's overkill to account dl_avg when we already have DL's running
> > bandwidth we can use. I understand it may be too instantaneous, but perhaps we
>
> We keep using the dl bandwidth, which is quite correct for dl needs but
> doesn't reflect how it has disturbed the other classes.
>
> > can fix CFS's problems within CFS itself and not have to do this kind of
> > extra external accounting?

I would also keep accounting for waiting time due to higher prio classes
all inside CFS. My impression, when discussing it with you on IRC, was
that we should be able to do that by not decaying cfs.util_avg when CFS
is preempted (creating a new signal for it). Isn't this enough?

I feel we should try to keep cross-class accounting/interaction at a
minimum.

Thanks,

- Juri
On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:
> Hi Vincent,
>
> On 04/06/18 08:41, Vincent Guittot wrote:
>> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
>> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>
> [...]
>
>> > IMO I feel it's overkill to account dl_avg when we already have DL's running
>> > bandwidth we can use. I understand it may be too instantaneous, but perhaps we
>>
>> We keep using the dl bandwidth, which is quite correct for dl needs but
>> doesn't reflect how it has disturbed the other classes.
>>
>> > can fix CFS's problems within CFS itself and not have to do this kind of
>> > extra external accounting?
>
> I would also keep accounting for waiting time due to higher prio classes
> all inside CFS. My impression, when discussing it with you on IRC, was
> that we should be able to do that by not decaying cfs.util_avg when CFS
> is preempted (creating a new signal for it). Isn't this enough?

We don't just want to not decay a signal, but to increase the signal to
reflect the amount of preemption.
Then, we can't do that in a current signal. So you would like to add
another metric in cfs_rq?
The place doesn't really matter, to be honest, cfs_rq or dl_rq, but
you will not avoid adding calls in the dl class to start/stop the
accounting of the preemption.

>
> I feel we should try to keep cross-class accounting/interaction at a
> minimum.

Accounting for cross-class preemption can't be done without
cross-class accounting.

Regards,
Vincent

>
> Thanks,
>
> - Juri
On 04/06/18 09:14, Vincent Guittot wrote:
> On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:
> > Hi Vincent,
> >
> > On 04/06/18 08:41, Vincent Guittot wrote:
> >> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
> >
> > [...]
> >
> >> > IMO I feel it's overkill to account dl_avg when we already have DL's running
> >> > bandwidth we can use. I understand it may be too instantaneous, but perhaps we
> >>
> >> We keep using the dl bandwidth, which is quite correct for dl needs but
> >> doesn't reflect how it has disturbed the other classes.
> >>
> >> > can fix CFS's problems within CFS itself and not have to do this kind of
> >> > extra external accounting?
> >
> > I would also keep accounting for waiting time due to higher prio classes
> > all inside CFS. My impression, when discussing it with you on IRC, was
> > that we should be able to do that by not decaying cfs.util_avg when CFS
> > is preempted (creating a new signal for it). Isn't this enough?
>
> We don't just want to not decay a signal, but to increase the signal to
> reflect the amount of preemption.

OK.

> Then, we can't do that in a current signal. So you would like to add
> another metric in cfs_rq?

Since it's CFS related, I'd say it should fit in CFS.

> The place doesn't really matter, to be honest, cfs_rq or dl_rq, but
> you will not avoid adding calls in the dl class to start/stop the
> accounting of the preemption.
>
> >
> > I feel we should try to keep cross-class accounting/interaction at a
> > minimum.
>
> Accounting for cross-class preemption can't be done without
> cross-class accounting.

Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
a higher prio class and act accordingly?

Thanks,

- Juri
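The check Juri has in mind might look roughly like this inside
pick_next_task_fair() (a hypothetical fragment; account_cfs_wait_time() is a
made-up helper, and, as Vincent replies below, prev only reveals that it was
not CFS, not which higher class ran):

	/* Somewhere in pick_next_task_fair(), before returning the
	 * next CFS task: */
	if (prev->sched_class != &fair_sched_class && rq->cfs.h_nr_running) {
		/*
		 * CFS tasks stayed runnable while another class ran, so
		 * they were waiting. Note that prev's class only tells
		 * us it was not CFS, not whether it was RT, DL or stop.
		 */
		account_cfs_wait_time(rq);	/* made-up helper */
	}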
On 4 June 2018 at 12:12, Juri Lelli <juri.lelli@redhat.com> wrote:
> On 04/06/18 09:14, Vincent Guittot wrote:
>> On 4 June 2018 at 09:04, Juri Lelli <juri.lelli@redhat.com> wrote:
>> > Hi Vincent,
>> >
>> > On 04/06/18 08:41, Vincent Guittot wrote:
>> >> On 1 June 2018 at 19:45, Joel Fernandes <joelaf@google.com> wrote:
>> >> > On Fri, Jun 01, 2018 at 03:53:07PM +0200, Vincent Guittot wrote:
>> >
>> > [...]
>> >
>> >> > IMO I feel it's overkill to account dl_avg when we already have DL's running
>> >> > bandwidth we can use. I understand it may be too instantaneous, but perhaps we
>> >>
>> >> We keep using the dl bandwidth, which is quite correct for dl needs but
>> >> doesn't reflect how it has disturbed the other classes.
>> >>
>> >> > can fix CFS's problems within CFS itself and not have to do this kind of
>> >> > extra external accounting?
>> >
>> > I would also keep accounting for waiting time due to higher prio classes
>> > all inside CFS. My impression, when discussing it with you on IRC, was
>> > that we should be able to do that by not decaying cfs.util_avg when CFS
>> > is preempted (creating a new signal for it). Isn't this enough?
>>
>> We don't just want to not decay a signal, but to increase the signal to
>> reflect the amount of preemption.
>
> OK.
>
>> Then, we can't do that in a current signal. So you would like to add
>> another metric in cfs_rq?
>
> Since it's CFS related, I'd say it should fit in CFS.

It's dl and cfs, as the goal is to track cfs preempted by dl.
This means creating a new struct whereas some fields would be unused in
the avg_dl struct, and duplicating some calls to ___update_load_sum, as
we already track avg_dl for removing sched_rt_avg_update, and
update_dl/rt_rq_load_avg are already called in fair.c for updating the
blocked load.

>
>> The place doesn't really matter, to be honest, cfs_rq or dl_rq, but
>> you will not avoid adding calls in the dl class to start/stop the
>> accounting of the preemption.
>>
>> >
>> > I feel we should try to keep cross-class accounting/interaction at a
>> > minimum.
>>
>> Accounting for cross-class preemption can't be done without
>> cross-class accounting.
>
> Mmm, can't we distinguish in, say, pick_next_task_fair() if prev was of
> a higher prio class and act accordingly?

We will not be able to tell the difference between rt/dl/stop preemption
by using only pick_next_task_fair.

Thanks

>
> Thanks,
>
> - Juri
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4526ba6..0eb07a8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2194,7 +2194,11 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
-	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+	unsigned long util = (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
+
+	util = max_t(unsigned long, util, READ_ONCE(rq->avg_dl.util_avg));
+
+	return util;
 }

 static inline unsigned long cpu_util_cfs(struct rq *rq)
Now that we have both the dl class bandwidth requirement and the dl class
utilization, we can use the max of the 2 values when aggregating the
utilization of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/sched.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--
2.7.4