diff mbox series

[1/4] sched: cpufreq: Rename map_util_perf to apply_dvfs_headroom

Message ID 20230820210640.585311-2-qyousef@layalina.io
State New
Headers show
Series Fix dvfs_headroom escaping uclamp constraints | expand

Commit Message

Qais Yousef Aug. 20, 2023, 9:06 p.m. UTC
We are providing headroom for the utilization to grow until the next
decision point to pick the next frequency. Give the function a better
name and give it some documentation. It is not really mapping anything.

Provide a dummy definition for !CONFIG_CPUFREQ which will be required
for later patches.

Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io>
---
 include/linux/energy_model.h     |  2 +-
 include/linux/sched/cpufreq.h    | 27 ++++++++++++++++++++++++++-
 kernel/sched/cpufreq_schedutil.c |  6 +++---
 3 files changed, 30 insertions(+), 5 deletions(-)

Comments

Qais Yousef Aug. 26, 2023, 8:03 p.m. UTC | #1
On 08/21/23 18:38, Dietmar Eggemann wrote:
> On 20/08/2023 23:06, Qais Yousef wrote:
> > We are providing headroom for the utilization to grow until the next
> > decision point to pick the next frequency. Give the function a better
> > name and give it some documentation. It is not really mapping anything.
> 
> Wasn't the original aim to have a counterpart to task scheduler's
> fits_capacity(), i.e. implement a frequency tipping point at 80%?
> 
>   #define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
> 
> 
> (util / max) = 0.8, hence 1.25 for the frequency-invariant case?
> 
> https://lkml.kernel.org/r/11678919.CQLTrQTYxG@vostro.rjw.lan
> 
>   next_freq = 1.25 * max_freq * util / max
> 
>               1,25 *            util  <-- map_util_perf()
> 
> [...]
> 
> Difference is that EAS deals with `util_cfs` and `capacity` whereas
> power (CPUfreq and EM) deals with `util` and `capacity_orig`. And this
> is where `capacity pressure` comes in for EAS (or fair.c).
> 
> In this regard, I'm not sure why we should rename the function?

I don't see any relation between the two to be honest. Both are magical numbers
actually that no longer mean any sense in real world and people modify them in
practice, as you know.

As we brought up the topic in pelt half life and other avenues, this 25% is
a headroom for the util to grow. And this headroom is required for DVFS reasons
as evident by the current definition being in cpufreq.h.

fits_capacity() is about misfit detection - I don't see any relation to
selecting frequency. But voodoo relationship that is based on past
circumstances I don't see holding true as both code and systems have evolved
significantly since then. I think you're trying to make sure we've hit the top
frequency before we force an upmigration early. But this is artificial and not
a real necessity.

Consider for example a single task running on a medium core with a capacity of
750. It has a steady state utilization of 300. Why should it run at 25% higher
OPP which can be one or 2 freq jumps difference? And since the power cost is
not linear, the power difference could be big. Not a cost I'd pay to
artificially coordinate with misfit detection. And how about SMP systems that
don't care about misfit detection? Why should they be tied to this magical 25%
headroom?

Note that DVFS hardware is a lot better nowadays that it is common to see
a reaction time of 500us. So if this 300 task goes through a transition and its
util starts growing again, we'll react to this within 500us; that's barely
a few percents jump compared to the 25% one. I think this is what the headroom
should be about, hence the new name.

I will send the patches to address this issue separately soon enough; they're
almost ready actually. I didn't expect the name of the function to be an issue
here to be honest and thought it is clear a headroom for dvfs reaction time.


Thanks!

--
Qais Yousef
diff mbox series

Patch

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index b9caa01dfac4..6ebde4e69e98 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -243,7 +243,7 @@  static inline unsigned long em_cpu_energy(struct em_perf_domain *pd,
 	scale_cpu = arch_scale_cpu_capacity(cpu);
 	ps = &pd->table[pd->nr_perf_states - 1];
 
-	max_util = map_util_perf(max_util);
+	max_util = apply_dvfs_headroom(max_util);
 	max_util = min(max_util, allowed_cpu_cap);
 	freq = map_util_freq(max_util, ps->frequency, scale_cpu);
 
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..f0069b354ac8 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -29,10 +29,35 @@  static inline unsigned long map_util_freq(unsigned long util,
 	return freq * util / cap;
 }
 
-static inline unsigned long map_util_perf(unsigned long util)
+/*
+ * DVFS decision are made at discrete points. If CPU stays busy, the util will
+ * continue to grow, which means it could need to run at a higher frequency
+ * before the next decision point was reached. IOW, we can't follow the util as
+ * it grows immediately, but there's a delay before we issue a request to go to
+ * higher frequency. The headroom caters for this delay so the system continues
+ * to run at adequate performance point.
+ *
+ * This function provides enough headroom to provide adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication with 1.25.
+ *
+ * TODO: The headroom should be a function of the delay. 25% is too high
+ * especially on powerful systems. For example, if the delay is 500us, it makes
+ * more sense to give a small headroom as the next decision point is not far
+ * away and will follow the util if it continues to rise. On the other hand if
+ * the delay is 10ms, then we need a bigger headroom so the CPU won't struggle
+ * at a lower frequency if it never goes to idle until then.
+ */
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
 {
 	return util + (util >> 2);
 }
+#else
+static inline unsigned long apply_dvfs_headroom(unsigned long util)
+{
+	return util;
+}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 4492608b7d7f..916c4d3d6192 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -143,7 +143,7 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	util = map_util_perf(util);
+	util = apply_dvfs_headroom(util);
 	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
@@ -406,8 +406,8 @@  static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
 	    sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
 		sg_cpu->util = prev_util;
 
-	cpufreq_driver_adjust_perf(sg_cpu->cpu, map_util_perf(sg_cpu->bw_dl),
-				   map_util_perf(sg_cpu->util), max_cap);
+	cpufreq_driver_adjust_perf(sg_cpu->cpu, apply_dvfs_headroom(sg_cpu->bw_dl),
+				   apply_dvfs_headroom(sg_cpu->util), max_cap);
 
 	sg_cpu->sg_policy->last_freq_update_time = time;
 }