Message ID | 20230820210640.585311-2-qyousef@layalina.io |
---|---|
State | New |
Headers | show |
Series | Fix dvfs_headroom escaping uclamp constraints | expand |
On 08/21/23 18:38, Dietmar Eggemann wrote: > On 20/08/2023 23:06, Qais Yousef wrote: > > We are providing headroom for the utilization to grow until the next > > decision point to pick the next frequency. Give the function a better > > name and give it some documentation. It is not really mapping anything. > > Wasn't the original aim to have a counterpart to task scheduler's > fits_capacity(), i.e. implement a frequency tipping point at 80%? > > #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024) > > > (util / max) = 0.8, hence 1.25 for the frequency-invariant case? > > https://lkml.kernel.org/r/11678919.CQLTrQTYxG@vostro.rjw.lan > > next_freq = 1.25 * max_freq * util / max > > 1,25 * util <-- map_util_perf() > > [...] > > Difference is that EAS deals with `util_cfs` and `capacity` whereas > power (CPUfreq and EM) deals with `util` and `capacity_orig`. And this > is where `capacity pressure` comes in for EAS (or fair.c). > > In this regard, I'm not sure why we should rename the function? I don't see any relation between the two to be honest. Both are magical numbers actually that no longer mean any sense in real world and people modify them in practice, as you know. As we brought up the topic in pelt half life and other avenues, this 25% is a headroom for the util to grow. And this headroom is required for DVFS reasons as evident by the current definition being in cpufreq.h. fits_capacity() is about misfit detection - I don't see any relation to selecting frequency. But voodoo relationship that is based on past circumstances I don't see holding true as both code and systems have evolved significantly since then. I think you're trying to make sure we've hit the top frequency before we force an upmigration early. But this is artificial and not a real necessity. Consider for example a single task running on a medium core with a capacity of 750. It has a steady state utilization of 300. Why should it run at 25% higher OPP which can be one or 2 freq jumps difference? And since the power cost is not linear, the power difference could be big. Not a cost I'd pay to artificially coordinate with misfit detection. And how about SMP systems that don't care about misfit detection? Why should they be tied to this magical 25% headroom? Note that DVFS hardware is a lot better nowadays that it is common to see a reaction time of 500us. So if this 300 task goes through a transition and its util starts growing again, we'll react to this within 500us; that's barely a few percents jump compared to the 25% one. I think this is what the headroom should be about, hence the new name. I will send the patches to address this issue separately soon enough; they're almost ready actually. I didn't expect the name of the function to be an issue here to be honest and thought it is clear a headroom for dvfs reaction time. Thanks! -- Qais Yousef
diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h index b9caa01dfac4..6ebde4e69e98 100644 --- a/include/linux/energy_model.h +++ b/include/linux/energy_model.h @@ -243,7 +243,7 @@ static inline unsigned long em_cpu_energy(struct em_perf_domain *pd, scale_cpu = arch_scale_cpu_capacity(cpu); ps = &pd->table[pd->nr_perf_states - 1]; - max_util = map_util_perf(max_util); + max_util = apply_dvfs_headroom(max_util); max_util = min(max_util, allowed_cpu_cap); freq = map_util_freq(max_util, ps->frequency, scale_cpu); diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h index bdd31ab93bc5..f0069b354ac8 100644 --- a/include/linux/sched/cpufreq.h +++ b/include/linux/sched/cpufreq.h @@ -29,10 +29,35 @@ static inline unsigned long map_util_freq(unsigned long util, return freq * util / cap; } -static inline unsigned long map_util_perf(unsigned long util) +/* + * DVFS decision are made at discrete points. If CPU stays busy, the util will + * continue to grow, which means it could need to run at a higher frequency + * before the next decision point was reached. IOW, we can't follow the util as + * it grows immediately, but there's a delay before we issue a request to go to + * higher frequency. The headroom caters for this delay so the system continues + * to run at adequate performance point. + * + * This function provides enough headroom to provide adequate performance + * assuming the CPU continues to be busy. + * + * At the moment it is a constant multiplication with 1.25. + * + * TODO: The headroom should be a function of the delay. 25% is too high + * especially on powerful systems. For example, if the delay is 500us, it makes + * more sense to give a small headroom as the next decision point is not far + * away and will follow the util if it continues to rise. On the other hand if + * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle + * at a lower frequency if it never goes to idle until then. + */ +static inline unsigned long apply_dvfs_headroom(unsigned long util) { return util + (util >> 2); } +#else +static inline unsigned long apply_dvfs_headroom(unsigned long util) +{ + return util; +} #endif /* CONFIG_CPU_FREQ */ #endif /* _LINUX_SCHED_CPUFREQ_H */ diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 4492608b7d7f..916c4d3d6192 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -143,7 +143,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy, unsigned int freq = arch_scale_freq_invariant() ? policy->cpuinfo.max_freq : policy->cur; - util = map_util_perf(util); + util = apply_dvfs_headroom(util); freq = map_util_freq(util, freq, max); if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update) @@ -406,8 +406,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time, sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util) sg_cpu->util = prev_util; - cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl), - map_util_perf(sg_cpu->util), max_cap); + cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl), + apply_dvfs_headroom(sg_cpu->util), max_cap); sg_cpu->sg_policy->last_freq_update_time = time; }
We are providing headroom for the utilization to grow until the next decision point to pick the next frequency. Give the function a better name and give it some documentation. It is not really mapping anything. Provide a dummy definition for !CONFIG_CPUFREQ which will be required for later patches. Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> --- include/linux/energy_model.h | 2 +- include/linux/sched/cpufreq.h | 27 ++++++++++++++++++++++++++- kernel/sched/cpufreq_schedutil.c | 6 +++--- 3 files changed, 30 insertions(+), 5 deletions(-)