Message ID: 20230820210640.585311-3-qyousef@layalina.io
State: New
Series: Fix dvfs_headroom escaping uclamp constraints
On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
>
> DVFS headroom is applied after we calculate the effective_cpu_util()
> which is where we honour uclamp constraints. It makes more sense to
> apply the headroom there once and let all users naturally get the right
> thing without having to sprinkle the call around in various places.

You have to take care not to mix scheduler and cpufreq constraints and
policies.

uclamp is a scheduler feature to highlight that the utilization
requirement of a task can't go above a level.

DVFS headroom is a cpufreq decision to anticipate either a hw limitation
and responsiveness problem, or performance hints.

They come from different sources and rationales, and this patch mixes
them, which I'm not sure is a correct thing to do.

> Before this fix running
>
> 	uclampset -M 800 cat /dev/zero > /dev/null
>
> Will cause the test system to run at max freq of 2.8GHz. After the fix
> it runs at 2.2GHz instead which is the correct value that matches the
> capacity of 800.

So a task with a utilization of 800 will run at a higher freq than a
task clamped to 800 by uclamp? Why should they run at different freqs
for the same utilization?

> Note that similar problem exist for uclamp_min. If util was 50, and
> uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> constraints, we'll end up with util of 125 instead of 100. IOW, we get
> boosted twice, first time by uclamp_min, and second time by dvfs
> headroom.
> Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
> ---
>  include/linux/energy_model.h     |  1 -
>  kernel/sched/core.c              | 11 ++++++++---
>  kernel/sched/cpufreq_schedutil.c |  5 ++---
>  3 files changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> index 6ebde4e69e98..adec808b371a 100644
> --- a/include/linux/energy_model.h
> +++ b/include/linux/energy_model.h
> @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
>  	scale_cpu = arch_scale_cpu_capacity(cpu);
>  	ps = &pd->table[pd->nr_perf_states - 1];
>
> -	max_util = apply_dvfs_headroom(max_util);
>  	max_util = min(max_util, allowed_cpu_cap);
>  	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index efe3848978a0..441d433c83cd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 * frequency will be gracefully reduced with the utilization decay.
>  	 */
>  	util = util_cfs + cpu_util_rt(rq);
> -	if (type == FREQUENCY_UTIL)
> +	if (type == FREQUENCY_UTIL) {
> +		util = apply_dvfs_headroom(util);

This is not the same as before, because the utilization has not been
scaled by irq steal time yet.

>  		util = uclamp_rq_util_with(rq, util, p);
> +	}
>
>  	dl_util = cpu_util_dl(rq);
>
> @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 *              max - irq
>  	 *   U' = irq + --------- * U
>  	 *                 max
> +	 *
> +	 * We only need to apply dvfs headroom to irq part since the util part
> +	 * already had it applied.
>  	 */
>  	util = scale_irq_capacity(util, irq, max);
> -	util += irq;
> +	util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
>
>  	/*
>  	 * Bandwidth required by DEADLINE must always be granted while, for
> @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
>  	 * an interface. So, we only do the latter for now.
>  	 */
>  	if (type == FREQUENCY_UTIL)
> -		util += cpu_bw_dl(rq);
> +		util += apply_dvfs_headroom(cpu_bw_dl(rq));

If we follow your reasoning with uclamp on the dl bandwidth, should we
not skip this as well?

>
>  	return min(max, util);
>  }
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 916c4d3d6192..0c7565ac31fb 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -143,7 +143,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	unsigned int freq = arch_scale_freq_invariant() ?
>  				policy->cpuinfo.max_freq : policy->cur;
>
> -	util = apply_dvfs_headroom(util);
>  	freq = map_util_freq(util, freq, max);
>
>  	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
> @@ -406,8 +405,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>  	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
>  		sg_cpu->util = prev_util;
>
> -	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
> -				   apply_dvfs_headroom(sg_cpu->util), max_cap);
> +	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_dl,
> +				   sg_cpu->util, max_cap);
>
>  	sg_cpu->sg_policy->last_freq_update_time = time;
>  }
> --
> 2.34.1
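As an aside, the double boost the commit message describes (util 50 with uclamp_min 100 ending up at 125, and uclamp_max 800 still reaching max frequency) can be reproduced with a small standalone sketch. `apply_dvfs_headroom()` below mirrors the kernel's 25% headroom (`util + util/4`), and `clamp_util()` stands in for `uclamp_rq_util_with()`; the names and signatures are simplified for illustration, not the kernel's.

```c
/* Illustrative sketch of the ordering problem, not kernel code. */

/* 25% DVFS headroom, as in the kernel's util + (util >> 2) mapping. */
static unsigned long apply_dvfs_headroom(unsigned long util)
{
	return util + (util >> 2);
}

/* Stand-in for uclamp_rq_util_with(): clamp util into [umin, umax]. */
static unsigned long clamp_util(unsigned long util, unsigned long umin,
				unsigned long umax)
{
	if (util < umin)
		return umin;
	if (util > umax)
		return umax;
	return util;
}

/* Pre-fix ordering: clamp first, headroom second -> double boost. */
static unsigned long freq_util_old(unsigned long util, unsigned long umin,
				   unsigned long umax)
{
	return apply_dvfs_headroom(clamp_util(util, umin, umax));
}

/* Patched ordering: headroom first, clamp second -> clamps are final. */
static unsigned long freq_util_new(unsigned long util, unsigned long umin,
				   unsigned long umax)
{
	return clamp_util(apply_dvfs_headroom(util), umin, umax);
}
```

With util = 50 and uclamp_min = 100, the old ordering yields 125 (boosted by the clamp and then by the headroom) while the new one yields 100; with util = 1024 and uclamp_max = 800, the old ordering yields 1000 (close to max frequency) while the new one respects the 800 cap.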
On 08/29/23 16:35, Vincent Guittot wrote:
> On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > DVFS headroom is applied after we calculate the effective_cpu_util()
> > which is where we honour uclamp constraints. It makes more sense to
> > apply the headroom there once and let all users naturally get the right
> > thing without having to sprinkle the call around in various places.
>
> You have to take care of not mixing scheduler and cpufreq constraint and policy.
>
> uclamp is a scheduler feature to highlight that the utilization
> requirement of a task can't go above a level.

uclamp is a performance hint; utilization is just how we represent it
internally. A task with a uclamp of 800 is not the same as a task whose
util is actually 800. In our previous discussions around util_fits_cpu()
we had a similar discussion on how the two can't be treated the same.

Same with uclamp_min; if it is set to 1024 but the task's util is, say,
100, this task shouldn't cause overutilized, as its actual utilization
fits; it just asked to run at a higher performance point. The actual
utilization has to be in the overutilized range - unless uclamp_max
forces it to fit, of course.

So uclamp_max sets a cap on the performance that the task is allowed to
reach, and this translates into frequency selection and cpu selection
(in case of HMP) by design.

I don't think we're mixing matters here. But the headroom should be
applied to actual utilization, not uclamp. If the rq is capped to
a specific performance level, we should honour this.

We do the same with iowait boost, where it is capped by uclamp_max.
A task capped by uclamp_max shouldn't be the trigger of running at
a frequency higher than this point. Otherwise we'd need to expose all
these internal details to the user, which I think we all agree isn't
something to consider at all.

> dvfs head room is a cpufreq decision to anticipate either hw
> limitation and responsiveness problem or performance hints.
>
> they come from different sources and rational and this patch mixed
> them which i'm not sure is a correct thing to do

I don't think I'm mixing them up, to be honest.

The governor is driven by effective_cpu_util() to tell it what the
effective cpu utilization is when making a frequency selection. This
function takes into account all the factors that could impact frequency
selection, including all types of rq pressure (except thermal). I think
it is appropriate to take the headroom into account there and make sure
we honour the uclamp_max hint to cap the performance appropriately,
based on the effective uclamp_max value of the rq.

For example, if the actual util was 640, then after the headroom it'd be
800. And if uclamp_max is 800, then this task will still get the 1.25
headroom. We are not changing this behavior.

But if the task goes beyond that, then it'll stop at 800 because this is
what the request is all about. A headroom beyond this point is not
required because the task (or the rq, to be more precise) is capped to
this performance point, and regardless of how busy the cpu gets in terms
of real utilization or headroom, it shouldn't go beyond this point. ie:
if a task is at 1024 but uclamp_max is 800, then it'll only get the
performance equivalent to what OPP@800 would give.

If we don't do that, the uclamp_max range is effectively 0-819 (based on
the current headroom setting of 1.25), which is not what we want or what
we're telling userspace. Userspace sees the whole system's performance
levels abstracted from 0 - 1024, as it should. The only effect they
would observe - and there's no way around it - is that OPPs are discrete
points. So in reality our 0-1024 is a staircase where a range of util
values will map to the same OPP and then we'll get a jump. The user can
end up requesting for example 700 and 720 and not notice any difference
because they both map to the same OPP. I don't think we can fix that -
but maybe I should add it to the uclamp doc as a caveat when setting
uclamp.

> > Before this fix running
> >
> > 	uclampset -M 800 cat /dev/zero > /dev/null
> >
> > Will cause the test system to run at max freq of 2.8GHz. After the fix
> > it runs at 2.2GHz instead which is the correct value that matches the
> > capacity of 800.
>
> So a task with an utilization of 800 will run at higher freq than a
> task clamped to 800 by uclamp ? Why should they run at different freq
> for the same utilization ?

Because uclamp sets an upper limit on the performance level the task
should be able to achieve. Imagine a task of 1024 capped to 800: it
should not run at max frequency, right? What's the point of the
uclamp_max hint if the headroom will cause it to run at max anyway? We'd
lose the meaning of the hint. And if this headroom changes in the
future, people will start observing different behavior for existing
uclamp_max settings on the same system because of this rightfully hidden
and unaccounted-for factor.

> > Note that similar problem exist for uclamp_min. If util was 50, and
> > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> > constraints, we'll end up with util of 125 instead of 100. IOW, we get
> > boosted twice, first time by uclamp_min, and second time by dvfs
> > headroom.
> > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > --- > > include/linux/energy_model.h | 1 - > > kernel/sched/core.c | 11 ++++++++--- > > kernel/sched/cpufreq_schedutil.c | 5 ++--- > > 3 files changed, 10 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > > index 6ebde4e69e98..adec808b371a 100644 > > --- a/include/linux/energy_model.h > > +++ b/include/linux/energy_model.h > > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > > scale_cpu = arch_scale_cpu_capacity(cpu); > > ps = &pd->table[pd->nr_perf_states - 1]; > > > > - max_util = apply_dvfs_headroom(max_util); > > max_util = min(max_util, allowed_cpu_cap); > > freq = map_util_freq(max_util, ps->frequency, scale_cpu); > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index efe3848978a0..441d433c83cd 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > * frequency will be gracefully reduced with the utilization decay. > > */ > > util = util_cfs + cpu_util_rt(rq); > > - if (type == FREQUENCY_UTIL) > > + if (type == FREQUENCY_UTIL) { > > + util = apply_dvfs_headroom(util); > > This is not the same as before because utilization has not being > scaled by irq steal time yet We do the scaling below, no? 
AFAICS, we had:

	(util_cfs + util_rt + irq + ((max-irq) * (util_cfs + util_rt) / max) + dl_bw) * scale

Using

	U = (util_cfs + util_rt) * scale

we can write this after the multiplication:

	U + irq * scale + ((max-irq) * U / max) + dl_bw * scale

> >  		util = uclamp_rq_util_with(rq, util, p);
> > +	}
> >
> >  	dl_util = cpu_util_dl(rq);
> >
> > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >  	 *              max - irq
> >  	 *   U' = irq + --------- * U
> >  	 *                 max
> > +	 *
> > +	 * We only need to apply dvfs headroom to irq part since the util part
> > +	 * already had it applied.
> >  	 */
> >  	util = scale_irq_capacity(util, irq, max);
> > -	util += irq;
> > +	util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
> >
> >  	/*
> >  	 * Bandwidth required by DEADLINE must always be granted while, for
> > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >  	 * an interface. So, we only do the latter for now.
> >  	 */
> >  	if (type == FREQUENCY_UTIL)
> > -		util += cpu_bw_dl(rq);
> > +		util += apply_dvfs_headroom(cpu_bw_dl(rq));
>
> If we follow your reasoning with uclamp on the dl bandwidth, should we
> not skip this as well ?

I do remove this in patch 4. I can fold that one into this one if you
like. I opted to keep the current behavior in this patch and remove
these later in patch 4.

I do think that both RT and DL shouldn't need the DVFS headroom in
general, as they both 'disable' it by default.


Thanks!

--
Qais Yousef
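The irq scaling step being debated here can be checked with a small numeric sketch. `scale_irq_capacity()` below mirrors the math of the kernel helper of the same name, implementing the `U' = irq + (max - irq)/max * U` formula from the quoted comment; `combined_util()` then adds the irq and DL terms, ignoring clamping and headroom. The concrete numbers in the tests are illustrative only.

```c
/*
 * Sketch of the irq scaling step:
 *
 *              max - irq
 *   U' = irq + --------- * U
 *                 max
 */
static unsigned long scale_irq_capacity(unsigned long util,
					unsigned long irq,
					unsigned long max)
{
	/* Shrink util by the fraction of capacity stolen by irq. */
	return util * (max - irq) / max;
}

/* Combined frequency-selection utilization, before any clamp/headroom. */
static unsigned long combined_util(unsigned long util_cfs_rt,
				   unsigned long irq,
				   unsigned long dl_bw,
				   unsigned long max)
{
	return scale_irq_capacity(util_cfs_rt, irq, max) + irq + dl_bw;
}
```

For instance, cfs+rt = 800 with irq = 256, dl_bw = 100 and max = 1024 gives 800 * 768/1024 + 256 + 100 = 956.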
On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 08/29/23 16:35, Vincent Guittot wrote:
> > On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > DVFS headroom is applied after we calculate the effective_cpu_util()
> > > which is where we honour uclamp constraints. It makes more sense to
> > > apply the headroom there once and let all users naturally get the right
> > > thing without having to sprinkle the call around in various places.
> >
> > You have to take care of not mixing scheduler and cpufreq constraint and policy.
> >
> > uclamp is a scheduler feature to highlight that the utilization
> > requirement of a task can't go above a level.
>
> uclamp is a performance hint, which utilization is how we represent it
> internally. A task with uclamp of 800 is not the same as util being actually
> 800. In our previous discussions around util_fits_cpu() we had similar
> discussion on how the two can't be treated the same.
>
> Same with uclamp_min; if it is set to 1024 but there is a task with util say
> 100, this task shouldn't cause overutilized as its actual utilization actually
> fits, but it just asked to run at a higher performance point. The actual
> utilization has to be in the over utilized range. Unless uclamp_max forces it
> to fit of course.
>
> So uclamp_max sets a cap to the performance that the task is allowed to reach.
> And this translates into frequency selection and cpu selection (in case of HMP)
> by design.

The DVFS headroom is a sole property of cpufreq, and the scheduler
doesn't care about it. The scheduler should provide some cpu utilization
hints; then what cpufreq does based on the utilization, the HW
constraints and the policy is out of the scheduler's scope.

This headroom is there only because the energy model wants to forecast
what frequency cpufreq will select later on, but the scheduler doesn't
care.

> I don't think we're mixing matters here.
But the headroom should be applied to > actual utilization, not uclamp. If the rq is capped to a specific performance > level, we should honour this. > > We do the same with iowait boost where it is capped by uclamp_max. A task > capped by uclamp_max shouldn't be the trigger of running at a frequency higher > than this point. Otherwise we'd need to expose all these internal details to > the user, which I think we all agree isn't something to consider at all. > > > > > dvfs head room is a cpufreq decision to anticipate either hw > > limitation and responsiveness problem or performance hints. > > > > they come from different sources and rational and this patch mixed > > them which i'm not sure is a correct thing to do > > I don't think I'm mixing them up to be honest. > > The governor is driven by effective_cpu_util() to tell it what is the > effective_cpu_util() when making frequency selection. This function takes into > account all the factors that could impact frequency selection including all type > of rq pressures (except thermal). I think it is appropriate to take headroom > into account there and make sure we honour uclamp_max hint to cap the > performance appropriately based on the effective uclamp_max value of the rq. > > For example if actually util was 640, then after the headroom it'd be 800. And > if uclamp_max is 800, then this task will still get the 1.25 headroom. We are > not changing this behavior. > > But if the task goes beyond that, then it'll stop at 800 because this what the > request is all about. A headroom beyond this point is not required because the > task (or rq to be more precise) is capped to this performance point and > regardless how busy the cpu gets in terms of real utilization or headroom, it > shouldn't go beyond this point. ie: if a task is a 1024 but uclamp_max of is > 800 then it'll only get a performance equivalent to OPP@800 would give. 
> If we don't do that uclamp_max range effectively is from 0-819 (based on
> current headroom setting of 1.25). Which is not what we want or what we're
> telling userspace. Userspace sees the whole system performance levels
> abstracted from 0 - 1024. As it should. The only effect they would observe and
> there's no way around it is that OPPs are discrete points. So in reality our
> 0-1024 is a staircase where a range of util values will map to the same OPP and
> then we'll get a jump. So the user can end up requesting for example 700 and
> 720 and not notice any difference because they both map to the same OPP.
> I don't think we can fix that - but maybe I should add it to the uclamp doc as
> a caveat when setting uclamp.
>
> > > Before this fix running
> > >
> > > 	uclampset -M 800 cat /dev/zero > /dev/null
> > >
> > > Will cause the test system to run at max freq of 2.8GHz. After the fix
> > > it runs at 2.2GHz instead which is the correct value that matches the
> > > capacity of 800.
> >
> > So a task with an utilization of 800 will run at higher freq than a
> > task clamped to 800 by uclamp ? Why should they run at different freq
> > for the same utilization ?
>
> Because uclamp sets an upper limit on the performance level the task should be
> able to achieve. Imagine a task is 1024 and capped to 800, it should not run at
> max frequency, right? What's the point of the uclamp_max hint if the headroom

Who knows? Here we try to predict the coming utilization of the cpu,
and this prediction is only partial, so cpufreq might want to keep some
headroom even if uclamp_max is applied to a cfs rq, to anticipate
unpredictable irq stolen time.

IMO, dvfs_headroom should never be visible in scheduler code. This is
the case now; it's only visible to cpufreq, which is the "owner", and
the energy model, which tries to predict cpufreq's behavior.

> will cause it to run at max anyway? We lost the meaning of the hint.
And if > this headroom changes in the future, people will start observing different > behavior for existing uclamp_max settings on the same system because of this > this rightfully hidden and unaccounted for factor. > > > > > > > > > Note that similar problem exist for uclamp_min. If util was 50, and > > > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp > > > constraints, we'll end up with util of 125 instead of 100. IOW, we get > > > boosted twice, first time by uclamp_min, and second time by dvfs > > > headroom. > > > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > --- > > > include/linux/energy_model.h | 1 - > > > kernel/sched/core.c | 11 ++++++++--- > > > kernel/sched/cpufreq_schedutil.c | 5 ++--- > > > 3 files changed, 10 insertions(+), 7 deletions(-) > > > > > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > > > index 6ebde4e69e98..adec808b371a 100644 > > > --- a/include/linux/energy_model.h > > > +++ b/include/linux/energy_model.h > > > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > > > scale_cpu = arch_scale_cpu_capacity(cpu); > > > ps = &pd->table[pd->nr_perf_states - 1]; > > > > > > - max_util = apply_dvfs_headroom(max_util); > > > max_util = min(max_util, allowed_cpu_cap); > > > freq = map_util_freq(max_util, ps->frequency, scale_cpu); > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > index efe3848978a0..441d433c83cd 100644 > > > --- a/kernel/sched/core.c > > > +++ b/kernel/sched/core.c > > > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > * frequency will be gracefully reduced with the utilization decay. 
> > >  	 */
> > >  	util = util_cfs + cpu_util_rt(rq);
> > > -	if (type == FREQUENCY_UTIL)
> > > +	if (type == FREQUENCY_UTIL) {
> > > +		util = apply_dvfs_headroom(util);
> >
> > This is not the same as before because utilization has not being
> > scaled by irq steal time yet
>
> We do the scaling below, no?
>
> AFAICS, we had:
>
> 	(util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max) + dl_bw) * scale

I think it's only:

	((util_cfs + util_rt) * (max - irq) / max + irq + dl_bw) * scale

and it's effectively similar.

> Using U = (util_cfs + util_rt) * scale
>
> we can write this after the multiplication
>
> 	U + irq * scale + ((max-irq)*U/max) + dl_bw * scale
>
> > >  		util = uclamp_rq_util_with(rq, util, p);
> > > +	}
> > >
> > >  	dl_util = cpu_util_dl(rq);
> > >
> > > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >  	 *              max - irq
> > >  	 *   U' = irq + --------- * U
> > >  	 *                 max
> > > +	 *
> > > +	 * We only need to apply dvfs headroom to irq part since the util part
> > > +	 * already had it applied.
> > >  	 */
> > >  	util = scale_irq_capacity(util, irq, max);
> > > -	util += irq;
> > > +	util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
> > >
> > >  	/*
> > >  	 * Bandwidth required by DEADLINE must always be granted while, for
> > > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >  	 * an interface. So, we only do the latter for now.
> > >  	 */
> > >  	if (type == FREQUENCY_UTIL)
> > > -		util += cpu_bw_dl(rq);
> > > +		util += apply_dvfs_headroom(cpu_bw_dl(rq));
> >
> > If we follow your reasoning with uclamp on the dl bandwidth, should we
> > not skip this as well ?
>
> I do remove this in patch 4. Can fold that one into this one if you like.
> I opted to keep the current behavior in this patch and remove these later in
> patch 4.
>
> I do think that both RT and DL shouldn't need DVFS headroom in general as they
> both 'disable' it by default.
And the scheduler in general doesn't care about the DVFS headroom.

I don't think this is the right way to fix your problem of selecting
the right frequency on your platform. It would be better to get a proper
interface between the EM and cpufreq, like
cpufreq_give_me_freq_for_this_utilization_ctx.

And cpufreq needs 2 pieces of information:
- the utilization of the cpu
- the uclamp/performance/bandwidth hints on this cpu, because the
  uclamp_max hints are screwed up by irq scaling anyway.

For example:

	irq = 256
	util = 800 but uclamp = 512

If I follow your reasoning about uclamp being a performance hint, and
uclamp = 512 meaning that we should not go above 512 just for cfs+rt, we
should stay at 512 in this case, because irq only needs 256, which is
below the 512 bandwidth. But the utilization reported to cpufreq will be
above 512 with either the current (720) formula or your proposal (608).
In the case of uclamp, it should be applied after having been scaled by
irq time. So we should have reported a utilization of 720 with
a bandwidth requirement of 512, and then cpufreq can apply its headroom
if needed.

> Thanks!
>
> --
> Qais Yousef
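The two-input interface suggested above can be sketched as follows. This is only an illustration of the idea (raw utilization and the performance hint handed to cpufreq separately, with cpufreq applying its own headroom); the function name, signature and the 25% figure are assumptions, not a proposed kernel API.

```c
/*
 * Hypothetical governor-side helper: takes the irq-scaled cpu
 * utilization and a uclamp_max-style performance cap as two separate
 * inputs, and lets cpufreq decide how to combine them.
 */
static unsigned long gov_target_util(unsigned long util,
				     unsigned long perf_cap)
{
	/* cpufreq's own 25% anticipation headroom on raw utilization. */
	util = util + (util >> 2);

	/* The performance cap is applied last, so it is always honoured. */
	return util < perf_cap ? util : perf_cap;
}
```

With util = 600 and a cap of 512 the headroom would push the request to 750, but the cap wins and 512 is returned; an unclamped util of 400 gets the full headroom and returns 500.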
On 09/07/23 15:47, Vincent Guittot wrote: > On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > On 08/29/23 16:35, Vincent Guittot wrote: > > > On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > DVFS headroom is applied after we calculate the effective_cpu_util() > > > > which is where we honour uclamp constraints. It makes more sense to > > > > apply the headroom there once and let all users naturally get the right > > > > thing without having to sprinkle the call around in various places. > > > > > > You have to take care of not mixing scheduler and cpufreq constraint and policy. > > > > > > uclamp is a scheduler feature to highlight that the utilization > > > requirement of a task can't go above a level. > > > > uclamp is a performance hint, which utilization is how we represent it > > internally. A task with uclamp of 800 is not the same as util being actually > > 800. In our previous discussions around util_fits_cpu() we had similar > > discussion on how the two can't be treated the same. > > > > Same with uclamp_min; if it is set to 1024 but there is a task with util say > > 100, this task shouldn't cause overutilized as its actual utilization actually > > fits, but it just asked to run at a higher performance point. The actual > > utilization has to be in the over utilized range. Unless uclamp_max forces it > > to fit of course. > > > > So uclamp_max sets a cap to the performance that the task is allowed to reach. > > And this translates into frequency selection and cpu selection (in case of HMP) > > by design. > > The dvfs head room is a sole property of cpufreq and the scheduler > doesn't care about it. Scheduler should provide some cpu utilization > hints. Then what cpufreq will do based on the utilization, the HW > constraint and the policy is out of scheduler scope. 
> > This headroom is there only because energy model wants to forecast > what will be the frequency that cpufreq will select later on but > scheduler doesn't care > > > > > I don't think we're mixing matters here. But the headroom should be applied to > > actual utilization, not uclamp. If the rq is capped to a specific performance > > level, we should honour this. > > > > We do the same with iowait boost where it is capped by uclamp_max. A task > > capped by uclamp_max shouldn't be the trigger of running at a frequency higher > > than this point. Otherwise we'd need to expose all these internal details to > > the user, which I think we all agree isn't something to consider at all. > > > > > > > > dvfs head room is a cpufreq decision to anticipate either hw > > > limitation and responsiveness problem or performance hints. > > > > > > they come from different sources and rational and this patch mixed > > > them which i'm not sure is a correct thing to do > > > > I don't think I'm mixing them up to be honest. > > > > The governor is driven by effective_cpu_util() to tell it what is the > > effective_cpu_util() when making frequency selection. This function takes into > > account all the factors that could impact frequency selection including all type > > of rq pressures (except thermal). I think it is appropriate to take headroom > > into account there and make sure we honour uclamp_max hint to cap the > > performance appropriately based on the effective uclamp_max value of the rq. > > > > For example if actually util was 640, then after the headroom it'd be 800. And > > if uclamp_max is 800, then this task will still get the 1.25 headroom. We are > > not changing this behavior. > > > > But if the task goes beyond that, then it'll stop at 800 because this what the > > request is all about. 
A headroom beyond this point is not required because the > > task (or rq to be more precise) is capped to this performance point and > > regardless how busy the cpu gets in terms of real utilization or headroom, it > > shouldn't go beyond this point. ie: if a task is a 1024 but uclamp_max of is > > 800 then it'll only get a performance equivalent to OPP@800 would give. > > > > If we don't do that uclamp_max range effectively is from 0-819 (based on > > current headroom setting of 1.25). Which is not what we want or what we're > > telling userspace. Userspace sees the whole system performance levels > > abstracted from 0 - 1024. As it should. The only effect they would observe and > > there's no way around it is that OPPs are discrete points. So in reality our > > 0-1024 is a staircase where a range of util values will map to the same OPP and > > then we'll get a jump. So the user can end up requesting for example 700 and > > 720 and not notice any difference because they both map to the same OPP. > > I don't think we can fix that - but maybe I should add it to the uclamp doc as > > a caveat when setting uclamp. > > > > > > > > > > > > > Before this fix running > > > > > > > > uclampset -M 800 cat /dev/zero > /dev/null > > > > > > > > Will cause the test system to run at max freq of 2.8GHz. After the fix > > > > it runs at 2.2GHz instead which is the correct value that matches the > > > > capacity of 800. > > > > > > So a task with an utilization of 800 will run at higher freq than a > > > task clamped to 800 by uclamp ? Why should they run at different freq > > > for the same utilization ? > > > > Because uclamp sets an upper limit on the performance level the task should be > > able to achieve. Imagine a task is 1024 and capped to 800, it should not run at > > max frequency, right? What's the point of the uclamp_max hint if the headroom > > Who knows ? 
> Here we try to predict the coming utilization of the cpu
> and this prediction is only partial so cpufreq might want to keep some
> headroom even if uclamp max is applied to a cfs rq. Anticipate
> unpredictable irq stolen time

I struggle to see why we need headroom after uclamp_max. But I think I'm
starting to see what you're getting at. I think you'd like not to make
assumptions in the scheduler and to hide this logic in the governor
itself?

> IMO, dvfs_headroom should never be visible in scheduler code. This is
> the case now; it's only visible by cpufreq which is the "owner" and
> energy model which tries to predict cpufreq behavior

Okay.

> > will cause it to run at max anyway? We lost the meaning of the hint. And if
> > this headroom changes in the future, people will start observing different
> > behavior for existing uclamp_max settings on the same system because of this
> > rightfully hidden and unaccounted for factor.
> >
> > > > Note that similar problem exist for uclamp_min. If util was 50, and
> > > > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp
> > > > constraints, we'll end up with util of 125 instead of 100. IOW, we get
> > > > boosted twice, first time by uclamp_min, and second time by dvfs
> > > > headroom.
> > > > > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > > --- > > > > include/linux/energy_model.h | 1 - > > > > kernel/sched/core.c | 11 ++++++++--- > > > > kernel/sched/cpufreq_schedutil.c | 5 ++--- > > > > 3 files changed, 10 insertions(+), 7 deletions(-) > > > > > > > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > > > > index 6ebde4e69e98..adec808b371a 100644 > > > > --- a/include/linux/energy_model.h > > > > +++ b/include/linux/energy_model.h > > > > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > > > > scale_cpu = arch_scale_cpu_capacity(cpu); > > > > ps = &pd->table[pd->nr_perf_states - 1]; > > > > > > > > - max_util = apply_dvfs_headroom(max_util); > > > > max_util = min(max_util, allowed_cpu_cap); > > > > freq = map_util_freq(max_util, ps->frequency, scale_cpu); > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > index efe3848978a0..441d433c83cd 100644 > > > > --- a/kernel/sched/core.c > > > > +++ b/kernel/sched/core.c > > > > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > * frequency will be gracefully reduced with the utilization decay. > > > > */ > > > > util = util_cfs + cpu_util_rt(rq); > > > > - if (type == FREQUENCY_UTIL) > > > > + if (type == FREQUENCY_UTIL) { > > > > + util = apply_dvfs_headroom(util); > > > > > > This is not the same as before because utilization has not being > > > scaled by irq steal time yet > > > > We do the scaling below, no? 
> > > > AFAICS, we had: > > > > (util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale > > I think it's only : > ((util_cfs + util_rt) * (max -irq) / max + irq + dl_bw) * scale > > and it' s effectively similar +1 > > > > > Using U = (util_cfs + util_rt) * scale > > > > we can write this after the multiplication > > > > U + irq * scale + ((max-irq)*U/max) + dl_bw * scale > > > > > > > > > util = uclamp_rq_util_with(rq, util, p); > > > > + } > > > > > > > > dl_util = cpu_util_dl(rq); > > > > > > > > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > * max - irq > > > > * U' = irq + --------- * U > > > > * max > > > > + * > > > > + * We only need to apply dvfs headroom to irq part since the util part > > > > + * already had it applied. > > > > */ > > > > util = scale_irq_capacity(util, irq, max); > > > > - util += irq; > > > > + util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq; > > > > > > > > /* > > > > * Bandwidth required by DEADLINE must always be granted while, for > > > > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > * an interface. So, we only do the latter for now. > > > > */ > > > > if (type == FREQUENCY_UTIL) > > > > - util += cpu_bw_dl(rq); > > > > + util += apply_dvfs_headroom(cpu_bw_dl(rq)); > > > > > > If we follow your reasoning with uclamp on the dl bandwidth, should we > > > not skip this as well ? > > > > I do remove this in patch 4. Can fold that one into this one if you like. > > I opted to keep the current behavior in this patch and remove these later in > > patch 4. > > > > I do think that both RT and DL shouldn't need DVFS headroom in general as they > > both 'disable' it by default. > > and the scheduler in general doesn't care about DVFS headroom. > > I don't think it's the right way to fix your problem of selecting the > right frequency on your platform. 
It would be better to get a proper > interface between EM and cpufreq like > cpufreq_give_me_freq_for_this_utilization_ctx. Hmm is the problem then that effective_cpu_util() was moved to scheduler and it should move back to schedutil and provide a proper interface to connect the two in abstracted way? > > And cpufreq needs 2 information: > - the utilization of the cpu > - the uclamp/performance/bandwidth hints on this cpu because the > uclamp_max hints are screwed up by irq scaling anyway. > > For example: > > irq = 256 > util = 800 but uclamp=512 > > If I follow your reasoning about the uclamp being a performance hint > and uclamp=512 means that we should not go above 512 just for cfs+rt, Yes. We already limit it to 512 now. With the change we'll limit it after the headroom is applied to them. > we should stay at 512 in this case because irq only needs 256 which is > below the 512 bandwidth. The utilization reported to cpufreq will be Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt will be restricted by their uclamp settings _after_ the headroom is applied. > above 512 whatever the current (720) formula or your proposal (608). > In the case of uclamp, it should be applied after having been scaled > by irq time. I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers. 
So the way I'm proposing it here util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 util = uclamp_rq_util_with(rq, util, NULL) = 512 util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384 util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704 util += dvfs_headroom(dl_bw) = 704 > > So we should have reported utilization of 720 with a bandwidth > requirement of 512 and then cpufreq can applies its headroom if needed The key part I'm changing is this util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 util = uclamp_rq_util_with(rq, util, NULL) = 512 Before we had (assume irq, rt and dl are 0 for simplicity and a single task is running) util = cfs = 800 util = uclamp_rq_util_with(rq, util, NULL) = 512 util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640 So we are running higher than we are allowed to. So applying the headroom after taking uclamp constraints into account is the problem. IIUC and your concerns are about how scheduler and schedutil are interacting loses some abstraction, then yeah this makes sense and can refactor code to better abstract the two and keep them decoupled. But if you think the order the headroom is applied should be the way it is, then this doesn't fix the problem. A good use case to consider is a single task running at 1024. If I want to limit it to 80% using uclamp_max, how can this be enforced? With current code, the task will continue to run at 1024; which is the problem. Running at 640 instead of 512 is still a problem too as this could be 1 or 2 OPP higher than what we want to allow; and there could be significant power difference between those OPPs. Thanks! -- Qais Yousef
On Thu, 7 Sept 2023 at 23:55, Qais Yousef <qyousef@layalina.io> wrote: > > On 09/07/23 15:47, Vincent Guittot wrote: > > On Tue, 29 Aug 2023 at 18:37, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > On 08/29/23 16:35, Vincent Guittot wrote: > > > > On Sun, 20 Aug 2023 at 23:08, Qais Yousef <qyousef@layalina.io> wrote: > > > > > > > > > > DVFS headroom is applied after we calculate the effective_cpu_util() > > > > > which is where we honour uclamp constraints. It makes more sense to > > > > > apply the headroom there once and let all users naturally get the right > > > > > thing without having to sprinkle the call around in various places. > > > > > > > > You have to take care of not mixing scheduler and cpufreq constraint and policy. > > > > > > > > uclamp is a scheduler feature to highlight that the utilization > > > > requirement of a task can't go above a level. > > > > > > uclamp is a performance hint, which utilization is how we represent it > > > internally. A task with uclamp of 800 is not the same as util being actually > > > 800. In our previous discussions around util_fits_cpu() we had similar > > > discussion on how the two can't be treated the same. > > > > > > Same with uclamp_min; if it is set to 1024 but there is a task with util say > > > 100, this task shouldn't cause overutilized as its actual utilization actually > > > fits, but it just asked to run at a higher performance point. The actual > > > utilization has to be in the over utilized range. Unless uclamp_max forces it > > > to fit of course. > > > > > > So uclamp_max sets a cap to the performance that the task is allowed to reach. > > > And this translates into frequency selection and cpu selection (in case of HMP) > > > by design. > > > > The dvfs head room is a sole property of cpufreq and the scheduler > > doesn't care about it. Scheduler should provide some cpu utilization > > hints. 
Then what cpufreq will do based on the utilization, the HW > > constraint and the policy is out of scheduler scope. > > > > This headroom is there only because energy model wants to forecast > > what will be the frequency that cpufreq will select later on but > > scheduler doesn't care > > > > > > > > I don't think we're mixing matters here. But the headroom should be applied to > > > actual utilization, not uclamp. If the rq is capped to a specific performance > > > level, we should honour this. > > > > > > We do the same with iowait boost where it is capped by uclamp_max. A task > > > capped by uclamp_max shouldn't be the trigger of running at a frequency higher > > > than this point. Otherwise we'd need to expose all these internal details to > > > the user, which I think we all agree isn't something to consider at all. > > > > > > > > > > > dvfs head room is a cpufreq decision to anticipate either hw > > > > limitation and responsiveness problem or performance hints. > > > > > > > > they come from different sources and rational and this patch mixed > > > > them which i'm not sure is a correct thing to do > > > > > > I don't think I'm mixing them up to be honest. > > > > > > The governor is driven by effective_cpu_util() to tell it what is the > > > effective_cpu_util() when making frequency selection. This function takes into > > > account all the factors that could impact frequency selection including all type > > > of rq pressures (except thermal). I think it is appropriate to take headroom > > > into account there and make sure we honour uclamp_max hint to cap the > > > performance appropriately based on the effective uclamp_max value of the rq. > > > > > > For example if actually util was 640, then after the headroom it'd be 800. And > > > if uclamp_max is 800, then this task will still get the 1.25 headroom. We are > > > not changing this behavior. 
> > > > > > But if the task goes beyond that, then it'll stop at 800 because this what the > > > request is all about. A headroom beyond this point is not required because the > > > task (or rq to be more precise) is capped to this performance point and > > > regardless how busy the cpu gets in terms of real utilization or headroom, it > > > shouldn't go beyond this point. ie: if a task is a 1024 but uclamp_max of is > > > 800 then it'll only get a performance equivalent to OPP@800 would give. > > > > > > If we don't do that uclamp_max range effectively is from 0-819 (based on > > > current headroom setting of 1.25). Which is not what we want or what we're > > > telling userspace. Userspace sees the whole system performance levels > > > abstracted from 0 - 1024. As it should. The only effect they would observe and > > > there's no way around it is that OPPs are discrete points. So in reality our > > > 0-1024 is a staircase where a range of util values will map to the same OPP and > > > then we'll get a jump. So the user can end up requesting for example 700 and > > > 720 and not notice any difference because they both map to the same OPP. > > > I don't think we can fix that - but maybe I should add it to the uclamp doc as > > > a caveat when setting uclamp. > > > > > > > > > > > > > > > > > Before this fix running > > > > > > > > > > uclampset -M 800 cat /dev/zero > /dev/null > > > > > > > > > > Will cause the test system to run at max freq of 2.8GHz. After the fix > > > > > it runs at 2.2GHz instead which is the correct value that matches the > > > > > capacity of 800. > > > > > > > > So a task with an utilization of 800 will run at higher freq than a > > > > task clamped to 800 by uclamp ? Why should they run at different freq > > > > for the same utilization ? > > > > > > Because uclamp sets an upper limit on the performance level the task should be > > > able to achieve. Imagine a task is 1024 and capped to 800, it should not run at > > > max frequency, right? 
What's the point of the uclamp_max hint if the headroom > > > > Who knows ? Here we try to predict the coming utilization of the cpu > > and this prediction is only partial so cpufreq might want to keep some > > headroom even if uclamp max is applied to a cfs rq. Anticipate > > unpredictable irq stolen time > > I struggle to see why we need headroom after uclamp_max. But I think I'm > starting to see what you're getting at. I think you'd like not to make > assumption in the scheduler and hide this logic in the governor itself? yes > > > > > IMO, dvfs_headroom should never be visible in scheduler code. This is > > the case now; it's only visible by cpufreq which is the "owner" and > > energy model which tries to predict cpufreq behavior > > Okay. > > > > > > will cause it to run at max anyway? We lost the meaning of the hint. And if > > > this headroom changes in the future, people will start observing different > > > behavior for existing uclamp_max settings on the same system because of this > > > this rightfully hidden and unaccounted for factor. > > > > > > > > > > > > > > > > > Note that similar problem exist for uclamp_min. If util was 50, and > > > > > uclamp_min is 100. Since we apply_dvfs_headroom() after apply uclamp > > > > > constraints, we'll end up with util of 125 instead of 100. IOW, we get > > > > > boosted twice, first time by uclamp_min, and second time by dvfs > > > > > headroom. 
> > > > > > > > > > Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> > > > > > --- > > > > > include/linux/energy_model.h | 1 - > > > > > kernel/sched/core.c | 11 ++++++++--- > > > > > kernel/sched/cpufreq_schedutil.c | 5 ++--- > > > > > 3 files changed, 10 insertions(+), 7 deletions(-) > > > > > > > > > > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h > > > > > index 6ebde4e69e98..adec808b371a 100644 > > > > > --- a/include/linux/energy_model.h > > > > > +++ b/include/linux/energy_model.h > > > > > @@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, > > > > > scale_cpu = arch_scale_cpu_capacity(cpu); > > > > > ps = &pd->table[pd->nr_perf_states - 1]; > > > > > > > > > > - max_util = apply_dvfs_headroom(max_util); > > > > > max_util = min(max_util, allowed_cpu_cap); > > > > > freq = map_util_freq(max_util, ps->frequency, scale_cpu); > > > > > > > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > > > > index efe3848978a0..441d433c83cd 100644 > > > > > --- a/kernel/sched/core.c > > > > > +++ b/kernel/sched/core.c > > > > > @@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > > * frequency will be gracefully reduced with the utilization decay. > > > > > */ > > > > > util = util_cfs + cpu_util_rt(rq); > > > > > - if (type == FREQUENCY_UTIL) > > > > > + if (type == FREQUENCY_UTIL) { > > > > > + util = apply_dvfs_headroom(util); > > > > > > > > This is not the same as before because utilization has not being > > > > scaled by irq steal time yet > > > > > > We do the scaling below, no? 
> > > > > > AFAICS, we had: > > > > > > (util_cfs + util_rt + irq + ((max-irq)*(util_cfs + util_rt)/max)+ dl_bw) * scale > > > > I think it's only : > > ((util_cfs + util_rt) * (max -irq) / max + irq + dl_bw) * scale > > > > and it' s effectively similar > > +1 > > > > > > > > > Using U = (util_cfs + util_rt) * scale > > > > > > we can write this after the multiplication > > > > > > U + irq * scale + ((max-irq)*U/max) + dl_bw * scale > > > > > > > > > > > > util = uclamp_rq_util_with(rq, util, p); > > > > > + } > > > > > > > > > > dl_util = cpu_util_dl(rq); > > > > > > > > > > @@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > > * max - irq > > > > > * U' = irq + --------- * U > > > > > * max > > > > > + * > > > > > + * We only need to apply dvfs headroom to irq part since the util part > > > > > + * already had it applied. > > > > > */ > > > > > util = scale_irq_capacity(util, irq, max); > > > > > - util += irq; > > > > > + util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq; > > > > > > > > > > /* > > > > > * Bandwidth required by DEADLINE must always be granted while, for > > > > > @@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, > > > > > * an interface. So, we only do the latter for now. > > > > > */ > > > > > if (type == FREQUENCY_UTIL) > > > > > - util += cpu_bw_dl(rq); > > > > > + util += apply_dvfs_headroom(cpu_bw_dl(rq)); > > > > > > > > If we follow your reasoning with uclamp on the dl bandwidth, should we > > > > not skip this as well ? > > > > > > I do remove this in patch 4. Can fold that one into this one if you like. > > > I opted to keep the current behavior in this patch and remove these later in > > > patch 4. > > > > > > I do think that both RT and DL shouldn't need DVFS headroom in general as they > > > both 'disable' it by default. > > > > and the scheduler in general doesn't care about DVFS headroom. 
> > > > I don't think it's the right way to fix your problem of selecting the > > right frequency on your platform. It would be better to get a proper > > interface between EM and cpufreq like > > cpufreq_give_me_freq_for_this_utilization_ctx. > > Hmm is the problem then that effective_cpu_util() was moved to scheduler and it > should move back to schedutil and provide a proper interface to connect the two > in abstracted way? probably yes > > > > > And cpufreq needs 2 information: > > - the utilization of the cpu > > - the uclamp/performance/bandwidth hints on this cpu because the > > uclamp_max hints are screwed up by irq scaling anyway. > > > > For example: > > > > irq = 256 > > util = 800 but uclamp=512 > > > > If I follow your reasoning about the uclamp being a performance hint > > and uclamp=512 means that we should not go above 512 just for cfs+rt, > > Yes. We already limit it to 512 now. With the change we'll limit it after the > headroom is applied to them. > > > we should stay at 512 in this case because irq only needs 256 which is > > below the 512 bandwidth. The utilization reported to cpufreq will be > > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt > will be restricted by their uclamp settings _after_ the headroom is applied. But then you mixed some utilization level (irq pressure) which is time related with some performance/bandwidth limitation which is freq related. And I think that part of the reason why we can't agree on where to put headroom and how to take into account uclamp > > > above 512 whatever the current (720) formula or your proposal (608). > > In the case of uclamp, it should be applied after having been scaled > > by irq time. > > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers. 
My bad, I finally decided to use an irq pressure of 128 in my calculation but forgot to change the value in my email > > So the way I'm proposing it here > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > util = uclamp_rq_util_with(rq, util, NULL) = 512 > util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384 > util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704 > util += dvfs_headroom(dl_bw) = 704 > > > > > So we should have reported utilization of 720 with a bandwidth > > requirement of 512 and then cpufreq can applies its headroom if needed > > The key part I'm changing is this > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is > running) > > util = cfs = 800 > util = uclamp_rq_util_with(rq, util, NULL) = 512 > util = dvfs_headroom(util) = 512 * 0.25 * 512 = 640 > > So we are running higher than we are allowed to. So applying the headroom > after taking uclamp constraints into account is the problem. But I think it's not correct because you mixed some utilization level with some performance requirement. IIRC, you said that the uclamp/min/max must not be seen as an utilization level but a performance hint. In such case, you can't add it to some other utilization because it defeats its original purpose > > IIUC and your concerns are about how scheduler and schedutil are interacting > loses some abstraction, then yeah this makes sense and can refactor code to > better abstract the two and keep them decoupled. > > But if you think the order the headroom is applied should be the way it is, > then this doesn't fix the problem. > > A good use case to consider is a single task running at 1024. If I want to > limit it to 80% using uclamp_max, how can this be enforced? With current code, > the task will continue to run at 1024; which is the problem. 
Single case is the easy part > > Running at 640 instead of 512 is still a problem too as this could be 1 or > 2 OPP higher than what we want to allow; and there could be significant power > difference between those OPPs. That's my point. Your proposal tries to fix the simple case of 1 task with a uclamp_max but If we want to move in this direction, we should fix all cases. I probably need to put my idea in real code to see what it means but we should provide 2 value to cpufreq: an utilization level and a performance hint which would max aggregate the various performance hints that we have > > > Thanks! > > -- > Qais Yousef
On 09/08/23 16:30, Vincent Guittot wrote: > > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt > > will be restricted by their uclamp settings _after_ the headroom is applied. > > But then you mixed some utilization level (irq pressure) which is > time related with some performance/bandwidth limitation which is freq > related. And I think that part of the reason why we can't agree on > where to put headroom and how to take into account uclamp I did not change how this part works. We can drop patch 4 completely if this is what is causing the confusion. What I changed is the order of applying uclamp_rq_util_with() and apply_dvfs_headroom(). The rest is still the same as-is in the code today. > > > > > > above 512 whatever the current (720) formula or your proposal (608). > > > In the case of uclamp, it should be applied after having been scaled > > > by irq time. > > > > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers. > > My bad, I finally decided to use an irq pressure of 128 in my > calculation but forgot to change the value in my email > > > > > So the way I'm proposing it here > > > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384 > > util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704 > > util += dvfs_headroom(dl_bw) = 704 > > > > > > > > So we should have reported utilization of 720 with a bandwidth > > > requirement of 512 and then cpufreq can applies its headroom if needed > > > > The key part I'm changing is this > > > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > > > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is > > running) > > > > util = cfs = 800 > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > util = dvfs_headroom(util) = 512 * 0.25 * 512 = 
640 > > > > So we are running higher than we are allowed to. So applying the headroom > > after taking uclamp constraints into account is the problem. > > But I think it's not correct because you mixed some utilization level > with some performance requirement. IIRC, you said that the > uclamp/min/max must not be seen as an utilization level but a > performance hint. In such case, you can't add it to some other > utilization because it defeats its original purpose But this is what we do today already. I didn't change this part. See below. > > > > > IIUC and your concerns are about how scheduler and schedutil are interacting > > loses some abstraction, then yeah this makes sense and can refactor code to > > better abstract the two and keep them decoupled. > > > > But if you think the order the headroom is applied should be the way it is, > > then this doesn't fix the problem. > > > > A good use case to consider is a single task running at 1024. If I want to > > limit it to 80% using uclamp_max, how can this be enforced? With current code, > > the task will continue to run at 1024; which is the problem. > > Single case is the easy part > > > > > Running at 640 instead of 512 is still a problem too as this could be 1 or > > 2 OPP higher than what we want to allow; and there could be significant power > > difference between those OPPs. > > That's my point. > > Your proposal tries to fix the simple case of 1 task with a uclamp_max > but If we want to move in this direction, we should fix all cases. I think I am addressing all cases. I don't think I understand the problem you're trying to highlight. Is the RFC patch 4 is what is causing the confusion? That one is not related to fixing uclamp_max; it's just me questioning the status quo of which pressure signal requires the headroom. The main change being done here actually is to apply_dvfs_headroom() *before* doing uclamp_rq_util_with(). I am not sure how you see this mixing. 
Current code performs apply_dvfs_headroom() *after*; which what causes the CPU to run at a performance level higher than rq->uclamp[UCLAMP_MAX]. It doesn't matter how many tasks on the rq, if rq->uclamp[UCLAMP_MAX] is set to 800, then the CPU should not vote to max (assuminig all other pressures are 0). Handling of irq pressure etc is not changed. > I probably need to put my idea in real code to see what it means but > we should provide 2 value to cpufreq: an utilization level and a > performance hint which would max aggregate the various performance > hints that we have Hmm. This reads to me code structure change. From my perspective; the correct behavior is to apply the headroom before restricting the performance level. It seems you're seeing a better way to aggregate all the pressures when taking uclamp into account. Which I am starting to see your point if this is what you're saying. Though I haven't changed the existing relationship. I did try to question some things in patch 4, my thoughts were specific to headroom, but it seems you have more generalized form. I do agree in general it seems scheduler and schedutil are getting tightly coupled code wise and it would be good to provide generic cpufreq interfaces to potentially allow other governors to hook in and benefit. Whether this is something desired or not, I don't know as I remember Peter and Rafael wanted schedutil to grow to become the de-facto governor. It seems the right thing anyway. Thanks! -- Qais Yousef
On 10/09/2023 19:46, Qais Yousef wrote:
> On 09/08/23 16:30, Vincent Guittot wrote:
> 
[...]
>>>> above 512 whatever the current (720) formula or your proposal (608).
>>>> In the case of uclamp, it should be applied after having been scaled
>>>> by irq time.
>>>
>>> I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers.
>>
>> My bad, I finally decided to use an irq pressure of 128 in my
>> calculation but forgot to change the value in my email
>>
>>>
>>> So the way I'm proposing it here
>>>
>>> util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>>> util = uclamp_rq_util_with(rq, util, NULL) = 512
>>> util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
>>> util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
>>> util += dvfs_headroom(dl_bw) = 704
>>>
>>>>
>>>> So we should have reported utilization of 720 with a bandwidth
>>>> requirement of 512 and then cpufreq can applies its headroom if needed
>>>
>>> The key part I'm changing is this
>>>
>>> util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
>>> util = uclamp_rq_util_with(rq, util, NULL) = 512
>>>
>>> Before we had (assume irq, rt and dl are 0 for simplicity and a single task is
>>> running)
>>>
>>> util = cfs = 800
>>> util = uclamp_rq_util_with(rq, util, NULL) = 512
>>> util = dvfs_headroom(util) = 512 * 0.25 * 512 = 640
>>>
>>> So we are running higher than we are allowed to. So applying the headroom
>>> after taking uclamp constraints into account is the problem.

I'm not sure I understood all the example math in this thread correctly:

Examples:

irq = 128 or 256
util = 800 uclamp = 512

--- current code:

((util_cfs + util_rt) * ((max - irq) / max) + irq + dl_bw) * scale
<- uclamped(cfs+rt) ->
<-- scale_irq_capacity() -->|<-- map_util_perf() / (headroom())

irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0) * 1.25 = 576 * 1.25 = 720
irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0) * 1.25 = 640 * 1.25 = 800

--- new approach:

irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0.25 * 128) = 608
irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0.25 * 256) = 704

<-> uclamped(cfs+rt+headroom(cfs+rt))
<- scale_irq_capacity() ->
<-- headroom(irq) ? -->

Is this correct?

[...]
On Sun, 10 Sept 2023 at 19:46, Qais Yousef <qyousef@layalina.io> wrote: > > On 09/08/23 16:30, Vincent Guittot wrote: > > > > Hmm IIUC, no. irq pressure will be added on top as normal. It's just cfs+rt > > > will be restricted by their uclamp settings _after_ the headroom is applied. > > > > But then you mixed some utilization level (irq pressure) which is > > time related with some performance/bandwidth limitation which is freq > > related. And I think that part of the reason why we can't agree on > > where to put headroom and how to take into account uclamp > > I did not change how this part works. We can drop patch 4 completely if this is > what is causing the confusion. > > What I changed is the order of applying uclamp_rq_util_with() and > apply_dvfs_headroom(). The rest is still the same as-is in the code today. > > > > > > > > > > above 512 whatever the current (720) formula or your proposal (608). > > > > In the case of uclamp, it should be applied after having been scaled > > > > by irq time. > > > > > > I lost you a bit here. I'm not sure how you reached the 720 and 608 numbers. 
> > > > My bad, I finally decided to use an irq pressure of 128 in my > > calculation but forgot to change the value in my email > > > > > > > > So the way I'm proposing it here > > > > > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > > util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384 > > > util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704 > > > util += dvfs_headroom(dl_bw) = 704 > > > > > > > > > > > So we should have reported utilization of 720 with a bandwidth > > > > requirement of 512 and then cpufreq can applies its headroom if needed > > > > > > The key part I'm changing is this > > > > > > util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000 > > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > > > > > Before we had (assume irq, rt and dl are 0 for simplicity and a single task is > > > running) > > > > > > util = cfs = 800 > > > util = uclamp_rq_util_with(rq, util, NULL) = 512 > > > util = dvfs_headroom(util) = 512 * 0.25 * 512 = 640 > > > > > > So we are running higher than we are allowed to. So applying the headroom > > > after taking uclamp constraints into account is the problem. > > > > But I think it's not correct because you mixed some utilization level > > with some performance requirement. IIRC, you said that the > > uclamp/min/max must not be seen as an utilization level but a > > performance hint. In such case, you can't add it to some other > > utilization because it defeats its original purpose > > But this is what we do today already. I didn't change this part. See below. And it seems that what is done today doesn't work correctly for you. Your proposal to include cpufreq headroom into the scheduler is not correct IMO and it only applies for some cases. Then, the cpufreq driver can have some really good reason to apply some headroom even with an uclamp value but it can't make any decision. 
I think that we should use another way to fix your problem with how uclamp than reordering how headroom is applied by cpufreq. Mixing utilization and performance in one signal hide some differences that cpufreq can make a good use of. As an example: cfs util = 650 cfs uclamp = 800 irq = 128 cfs with headroom 650*1.25=812 is clamped to 800 Final utilization will be : 800(1024-128)/1024+128*1.25=860 which is above the target of 800. When we look at the detail, we have: cfs util once scaled to the full range is only 650(1024-128)/1024= 568 After applying irq (even with some headroom) 568+128*1.25 = 728 which is below the uclamp of 800 so shouldn't we stay at 800 in this case ? > > > > > > > > > IIUC and your concerns are about how scheduler and schedutil are interacting > > > loses some abstraction, then yeah this makes sense and can refactor code to > > > better abstract the two and keep them decoupled. > > > > > > But if you think the order the headroom is applied should be the way it is, > > > then this doesn't fix the problem. > > > > > > A good use case to consider is a single task running at 1024. If I want to > > > limit it to 80% using uclamp_max, how can this be enforced? With current code, > > > the task will continue to run at 1024; which is the problem. > > > > Single case is the easy part > > > > > > > > Running at 640 instead of 512 is still a problem too as this could be 1 or > > > 2 OPP higher than what we want to allow; and there could be significant power > > > difference between those OPPs. > > > > That's my point. > > > > Your proposal tries to fix the simple case of 1 task with a uclamp_max > > but If we want to move in this direction, we should fix all cases. > > I think I am addressing all cases. I don't think I understand the problem > you're trying to highlight. Is the RFC patch 4 is what is causing the > confusion? 
> That one is not related to fixing uclamp_max; it's just me
> questioning the status quo of which pressure signal requires the
> headroom.

No, patch 4 doesn't cause me confusion.

> The main change being done here actually is to apply_dvfs_headroom()
> *before* doing uclamp_rq_util_with(). I am not sure how you see this
> mixing.

Because dvfs_headroom is a cpufreq hint and you want to apply it
somewhere else.

> Current code performs apply_dvfs_headroom() *after*; which is what
> causes the CPU to run at a performance level higher than
> rq->uclamp[UCLAMP_MAX].
>
> It doesn't matter how many tasks are on the rq, if
> rq->uclamp[UCLAMP_MAX] is set to 800, then the CPU should not vote to
> max (assuming all other pressures are 0).

You can't remove the irq pressure from the picture. If
rq->uclamp[UCLAMP_MAX] is set to 800, meaning that the cpu must not go
above 800, it should apply also after taking into account other inputs,
at least up to some level, as described in my example above.

> Handling of irq pressure etc is not changed.
>
> > I probably need to put my idea in real code to see what it means but
> > we should provide 2 values to cpufreq: an utilization level and a
> > performance hint which would max aggregate the various performance
> > hints that we have
>
> Hmm. This reads to me as a code structure change. From my perspective,
> the correct behavior is to apply the headroom before restricting the
> performance level.
>
> It seems you're seeing a better way to aggregate all the pressures when
> taking uclamp into account. Which I am starting to see your point if
> this is what you're saying. Though I haven't changed the existing
> relationship. I did try to question some things in patch 4; my thoughts
> were specific to headroom, but it seems you have a more generalized
> form.
>
> I do agree in general it seems scheduler and schedutil are getting
> tightly coupled code wise and it would be good to provide generic
> cpufreq interfaces to potentially allow other governors to hook in and
> benefit.
>
> Whether this is something desired or not, I don't know, as I remember
> Peter and Rafael wanted schedutil to grow to become the de-facto
> governor. It seems the right thing anyway.
>
> Thanks!
>
> --
> Qais Yousef
On 09/12/23 18:03, Vincent Guittot wrote:
> And it seems that what is done today doesn't work correctly for you.
> Your proposal to include the cpufreq headroom into the scheduler is not
> correct IMO and it only applies for some cases. Also, the cpufreq
> driver can have some really good reason to apply some headroom even
> with an uclamp value, but it can't make any decision.
>
> I think that we should use another way to fix your problem with how
> uclamp is handled than reordering how the headroom is applied by
> cpufreq. Mixing utilization and performance in one signal hides some
> differences that cpufreq could make good use of.
>
> As an example:
>
> cfs util = 650
> cfs uclamp = 800
> irq = 128
>
> cfs with headroom 650*1.25 = 812 is clamped to 800
>
> Final utilization will be: 800*(1024-128)/1024 + 128*1.25 = 860 which
> is above the target of 800.
>
> When we look at the detail, we have:
>
> cfs util once scaled to the full range is only 650*(1024-128)/1024 = 568
>
> After applying irq (even with some headroom) 568 + 128*1.25 = 728 which
> is below the uclamp of 800, so shouldn't we stay at 800 in this case ?

Shouldn't it be (568+128)*1.25 = 870? Which is almost the 860 above. We
clamped the 812 to 800, and with rounding errors that almost accounts
for the 10 points difference between 870 and 860.

I might have gotten the math wrong. But what I saw is that we have

	util = (X + Y + Z) * A

and what I did

	util = AX + AY + AZ

so maybe I messed something up, but I just did the multiplication with
the headroom to each element individually rather than after the sum.

So yeah, if I messed that part up, then that wasn't intentional and
should be done differently. But I still can't see it.

> > The main change being done here actually is to apply_dvfs_headroom()
> > *before* doing uclamp_rq_util_with(). I am not sure how you see this
> > mixing.
>
> Because dvfs_headroom is a cpufreq hint and you want to apply it
> somewhere else.
I am still not sure if you mean we are mixing up the code and we need
better abstraction, or something else. Beside the abstraction problem,
which I agree with, I can't see what I am mixing up yet :(

Sorry, I think I need more of a helping hand to see it.

> > Current code performs apply_dvfs_headroom() *after*; which is what
> > causes the CPU to run at a performance level higher than
> > rq->uclamp[UCLAMP_MAX].
> >
> > It doesn't matter how many tasks are on the rq, if
> > rq->uclamp[UCLAMP_MAX] is set to 800, then the CPU should not vote to
> > max (assuming all other pressures are 0).
>
> You can't remove the irq pressure from the picture. If
> rq->uclamp[UCLAMP_MAX] is set to 800, meaning that the cpu must not go
> above 800, it should apply also after taking into account other inputs,
> at least up to some level, as described in my example above.

I was trying to simplify to understand what you mean, as I still don't
think I see the problem you're highlighting.

Thanks!

--
Qais Yousef
On 09/12/23 15:40, Dietmar Eggemann wrote:
> On 10/09/2023 19:46, Qais Yousef wrote:
> > On 09/08/23 16:30, Vincent Guittot wrote:
> >
> > [...]
>
> >>>> above 512 whatever the current (720) formula or your proposal (608).
> >>>> In the case of uclamp, it should be applied after having been scaled
> >>>> by irq time.
> >>>
> >>> I lost you a bit here. I'm not sure how you reached the 720 and 608
> >>> numbers.
> >>
> >> My bad, I finally decided to use an irq pressure of 128 in my
> >> calculation but forgot to change the value in my email
> >>
> >>>
> >>> So the way I'm proposing it here
> >>>
> >>> util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >>> util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>> util = scale_rq_capacity(512, 256, 1024) = 0.75 * 512 = 384
> >>> util += dvfs_headroom(irq) = 384 + 256 + 0.25 * 256 = 704
> >>> util += dvfs_headroom(dl_bw) = 704
> >>>
> >>>>
> >>>> So we should have reported utilization of 720 with a bandwidth
> >>>> requirement of 512 and then cpufreq can apply its headroom if needed
> >>>
> >>> The key part I'm changing is this
> >>>
> >>> util = cfs + rt + dvfs_headroom(cfs+rt) = 800 + 0.25 * 800 = 1000
> >>> util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>>
> >>> Before we had (assume irq, rt and dl are 0 for simplicity and a
> >>> single task is running)
> >>>
> >>> util = cfs = 800
> >>> util = uclamp_rq_util_with(rq, util, NULL) = 512
> >>> util = dvfs_headroom(util) = 512 + 0.25 * 512 = 640
> >>>
> >>> So we are running higher than we are allowed to. So applying the
> >>> headroom after taking uclamp constraints into account is the problem.
>
> I'm not sure I understood all the example math in this thread correctly:
>
> Examples:
>
> irq = 128 or 256
>
> util = 800, uclamp = 512
>
> --- current code:
>
> ((util_cfs + util_rt) * ((max - irq) / max) + irq + dl_bw) * scale
>   <- uclamped(cfs+rt) ->
>  <--  scale_irq_capacity()  -->|<-- map_util_perf() / (headroom()) -->
>
> irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0) * 1.25 = 576 * 1.25 = 720
>
> irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0) * 1.25 = 640 * 1.25 = 800
>
> --- new approach:
>
> irq = 128: (512 * (1024 - 128) / 1024 + 128 + 0.25 * 128) = 608
>
> irq = 256: (512 * (1024 - 256) / 1024 + 256 + 0.25 * 256) = 704
>
>             <->  uclamped(cfs+rt+headroom(cfs+rt))
>             <-  scale_irq_capacity()  ->
>                                          <-- headroom(irq) ? -->
>
> Is this correct?

Yes, this is my understanding too. But I'm not sure anymore, as it seems
I'm missing something from what Vincent is saying.

Thanks!

--
Qais Yousef
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index 6ebde4e69e98..adec808b371a 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,7 +243,6 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
 
-	max_util = apply_dvfs_headroom(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index efe3848978a0..441d433c83cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7439,8 +7439,10 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
 	util = util_cfs + cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL)
+	if (type == FREQUENCY_UTIL) {
+		util = apply_dvfs_headroom(util);
 		util = uclamp_rq_util_with(rq, util, p);
+	}
 
 	dl_util = cpu_util_dl(rq);
 
@@ -7471,9 +7473,12 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 *              max - irq
 	 *   U' = irq + --------- * U
 	 *                 max
+	 *
+	 * We only need to apply dvfs headroom to irq part since the util part
+	 * already had it applied.
 	 */
 	util = scale_irq_capacity(util, irq, max);
-	util += irq;
+	util += type == FREQUENCY_UTIL ? apply_dvfs_headroom(irq) : irq;
 
 	/*
 	 * Bandwidth required by DEADLINE must always be granted while, for
@@ -7486,7 +7491,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 	 * an interface. So, we only do the latter for now.
 	 */
 	if (type == FREQUENCY_UTIL)
-		util += cpu_bw_dl(rq);
+		util += apply_dvfs_headroom(cpu_bw_dl(rq));
 
 	return min(max, util);
 }
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 916c4d3d6192..0c7565ac31fb 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -143,7 +143,6 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	util = apply_dvfs_headroom(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -406,8 +405,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
-				   apply_dvfs_headroom(sg_cpu->util), max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_dl,
+				   sg_cpu->util, max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }
DVFS headroom is applied after we calculate the effective_cpu_util(),
which is where we honour uclamp constraints. It makes more sense to
apply the headroom there once and let all users naturally get the right
thing without having to sprinkle the call around in various places.

Before this fix, running

	uclampset -M 800 cat /dev/zero > /dev/null

will cause the test system to run at the max freq of 2.8GHz. After the
fix it runs at 2.2GHz instead, which is the correct value that matches
the capacity of 800.

Note that a similar problem exists for uclamp_min. If util was 50 and
uclamp_min is 100, since we apply_dvfs_headroom() after applying uclamp
constraints, we end up with a util of 125 instead of 100. IOW, we get
boosted twice: the first time by uclamp_min, and a second time by the
dvfs headroom.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/energy_model.h     |  1 -
 kernel/sched/core.c              | 11 ++++++++---
 kernel/sched/cpufreq_schedutil.c |  5 ++---
 3 files changed, 10 insertions(+), 7 deletions(-)