Message ID | 2989520.e9J7NaK4W3@rjwysocki.net |
---|---|
State | New |
Series | cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT |
On 11/29/24 16:00, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Make it possible to use EAS with cpufreq drivers that implement the
> :setpolicy() callback instead of using generic cpufreq governors.
>
> This is going to be necessary for using EAS with intel_pstate in its
> default configuration.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>
> This is the minimum of what's needed, but I'd really prefer to move
> the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> internals in topology.c feels like a butcher shop kind of exercise.

Makes sense, something like cpufreq_eas_capable().

>
> Besides, as I said before, I remain unconvinced about the usefulness
> of these checks at all.  Yes, one is supposed to get the best results
> from EAS when running schedutil, but what if they just want to try
> something else with EAS?  What if they can get better results with
> that other thing, surprisingly enough?

How do you imagine this to work then?
I assume we don't make any 'resulting-OPP-guesses' like
sugov_effective_cpu_perf() for any of the setpolicy governors.
Neither for dbs and I guess userspace.
What about standard powersave and performance?
Do we just have a cpufreq callback to ask which OPP to use for
the energy calculation? Assume lowest/highest?
(I don't think there is hardware where lowest/highest makes a
difference, so maybe not bothering with the complexity could
be an option, too.)

>
> ---
>  kernel/sched/topology.c |   10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> Index: linux-pm/kernel/sched/topology.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/topology.c
> +++ linux-pm/kernel/sched/topology.c
> @@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
>  	bool any_asym_capacity = false;
>  	struct cpufreq_policy *policy;
>  	struct cpufreq_governor *gov;
> +	bool cpufreq_ok;
>  	int i;
>
>  	/* EAS is enabled for asymmetric CPU capacity topologies. */
> @@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
>  		return false;
>  	}
>
> -	/* Do not attempt EAS if schedutil is not being used. */
> +	/* Do not attempt EAS if cpufreq is not configured adequately */
>  	for_each_cpu(i, cpu_mask) {
>  		policy = cpufreq_cpu_get(i);
>  		if (!policy) {
> @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
>  			}
>  			return false;
>  		}
> +		/* Require schedutil or a "setpolicy" driver */
>  		gov = policy->governor;
> +		cpufreq_ok = gov == &schedutil_gov ||
> +				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
>  		cpufreq_cpu_put(policy);
> -		if (gov != &schedutil_gov) {
> +		if (!cpufreq_ok) {
>  			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> +				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
>  					cpumask_pr_args(cpu_mask));
>  			}
>  			return false;

The logic here looks fine to me FWIW.
On Wed, Dec 11, 2024 at 12:44 PM Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 12/11/24 11:29, Rafael J. Wysocki wrote:
> > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> >>
> >> On 11/29/24 16:00, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make it possible to use EAS with cpufreq drivers that implement the
> >>> :setpolicy() callback instead of using generic cpufreq governors.
> >>>
> >>> This is going to be necessary for using EAS with intel_pstate in its
> >>> default configuration.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>
> >>> This is the minimum of what's needed, but I'd really prefer to move
> >>> the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> >>> internals in topology.c feels like a butcher shop kind of exercise.
> >>
> >> Makes sense, something like cpufreq_eas_capable().
> >>
> >>>
> >>> Besides, as I said before, I remain unconvinced about the usefulness
> >>> of these checks at all.  Yes, one is supposed to get the best results
> >>> from EAS when running schedutil, but what if they just want to try
> >>> something else with EAS?  What if they can get better results with
> >>> that other thing, surprisingly enough?
> >>
> >> How do you imagine this to work then?
> >> I assume we don't make any 'resulting-OPP-guesses' like
> >> sugov_effective_cpu_perf() for any of the setpolicy governors.
> >> Neither for dbs and I guess userspace.
> >> What about standard powersave and performance?
> >> Do we just have a cpufreq callback to ask which OPP to use for
> >> the energy calculation? Assume lowest/highest?
> >> (I don't think there is hardware where lowest/highest makes a
> >> difference, so maybe not bothering with the complexity could
> >> be an option, too.)
> >
> > In the "setpolicy" case there is no way to reliably predict the OPP
> > that is going to be used, so why bother?
> >
> > In the other cases, and if the OPPs are actually known, EAS may still
> > make assumptions regarding which of them will be used that will match
> > the schedutil selection rules, but if the cpufreq governor happens to
> > choose a different OPP, this is not the end of the world.
>
> "Not the end of the world" as in the model making incorrect assumptions.
> With the significant power-performance overlaps we see in mobile systems
> taking sugov's guess while using powersave/performance (the !setpolicy
> case) at least will make worse decisions.
> See here for reference, first slide.
> https://lpc.events/event/16/contributions/1194/attachments/1114/2139/LPC2022_Energy_model_accuracy.pdf

I've never said it won't make worse decisions, but whoever decides
which governor to use should be able to check which one is better.

> What about the config space, are you fine with everything relying on
> CONFIG_CPU_FREQ_GOV_SCHEDUTIL?

Yes, that's fine.

I think that schedutil should be the default governor for EAS, but I
don't see why it should be regarded as the only one possible and so
enforced.
On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> <christian.loehle@arm.com> wrote:
> >
> > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Make it possible to use EAS with cpufreq drivers that implement the
> > > :setpolicy() callback instead of using generic cpufreq governors.
> > >
> > > This is going to be necessary for using EAS with intel_pstate in its
> > > default configuration.
> > >
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > ---
> > >
> > > This is the minimum of what's needed, but I'd really prefer to move
> > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > internals in topology.c feels like a butcher shop kind of exercise.
> >
> > Makes sense, something like cpufreq_eas_capable().
> >
> > >
> > > Besides, as I said before, I remain unconvinced about the usefulness
> > > of these checks at all.  Yes, one is supposed to get the best results
> > > from EAS when running schedutil, but what if they just want to try
> > > something else with EAS?  What if they can get better results with
> > > that other thing, surprisingly enough?
> >
> > How do you imagine this to work then?
> > I assume we don't make any 'resulting-OPP-guesses' like
> > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > Neither for dbs and I guess userspace.
> > What about standard powersave and performance?
> > Do we just have a cpufreq callback to ask which OPP to use for
> > the energy calculation? Assume lowest/highest?
> > (I don't think there is hardware where lowest/highest makes a
> > difference, so maybe not bothering with the complexity could
> > be an option, too.)
>
> In the "setpolicy" case there is no way to reliably predict the OPP
> that is going to be used, so why bother?
>
> In the other cases, and if the OPPs are actually known, EAS may still
> make assumptions regarding which of them will be used that will match
> the schedutil selection rules, but if the cpufreq governor happens to
> choose a different OPP, this is not the end of the world.

Should we add a new cpufreq governor fops to return the best estimate
of the compute capacity selection? Something like
cpufreq_effective_cpu_perf(cpu, actual, min, max)
EAS needs to estimate what would be the next OPP; schedutil uses
sugov_effective_cpu_perf() and other governors could provide their own

>
> Yes, you could have been more energy-efficient had you chosen to use
> schedutil, but you chose otherwise and that's what you get.

Calling sugov_effective_cpu_perf() for another governor than schedutil
doesn't make sense.

And do we handle the case when
CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected?

>
> > >
> > > ---
> > >  kernel/sched/topology.c |   10 +++++++---
> > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > >
> > > Index: linux-pm/kernel/sched/topology.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/topology.c
> > > +++ linux-pm/kernel/sched/topology.c
> > > @@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
> > >  	bool any_asym_capacity = false;
> > >  	struct cpufreq_policy *policy;
> > >  	struct cpufreq_governor *gov;
> > > +	bool cpufreq_ok;
> > >  	int i;
> > >
> > >  	/* EAS is enabled for asymmetric CPU capacity topologies. */
> > > @@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
> > >  		return false;
> > >  	}
> > >
> > > -	/* Do not attempt EAS if schedutil is not being used. */
> > > +	/* Do not attempt EAS if cpufreq is not configured adequately */
> > >  	for_each_cpu(i, cpu_mask) {
> > >  		policy = cpufreq_cpu_get(i);
> > >  		if (!policy) {
> > > @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
> > >  			}
> > >  			return false;
> > >  		}
> > > +		/* Require schedutil or a "setpolicy" driver */
> > >  		gov = policy->governor;
> > > +		cpufreq_ok = gov == &schedutil_gov ||
> > > +				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
> > >  		cpufreq_cpu_put(policy);
> > > -		if (gov != &schedutil_gov) {
> > > +		if (!cpufreq_ok) {
> > >  			if (sched_debug()) {
> > > -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> > > +				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
> > >  					cpumask_pr_args(cpu_mask));
> > >  			}
> > >  			return false;
> >
> > The logic here looks fine to me FWIW.
> >
> >
On Wed, Dec 11, 2024 at 2:25 PM Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Wed, 11 Dec 2024 at 12:29, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Wed, Dec 11, 2024 at 11:33 AM Christian Loehle
> > <christian.loehle@arm.com> wrote:
> > >
> > > On 11/29/24 16:00, Rafael J. Wysocki wrote:
> > > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >
> > > > Make it possible to use EAS with cpufreq drivers that implement the
> > > > :setpolicy() callback instead of using generic cpufreq governors.
> > > >
> > > > This is going to be necessary for using EAS with intel_pstate in its
> > > > default configuration.
> > > >
> > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > ---
> > > >
> > > > This is the minimum of what's needed, but I'd really prefer to move
> > > > the cpufreq vs EAS checks into cpufreq because messing around cpufreq
> > > > internals in topology.c feels like a butcher shop kind of exercise.
> > >
> > > Makes sense, something like cpufreq_eas_capable().
> > >
> > > >
> > > > Besides, as I said before, I remain unconvinced about the usefulness
> > > > of these checks at all.  Yes, one is supposed to get the best results
> > > > from EAS when running schedutil, but what if they just want to try
> > > > something else with EAS?  What if they can get better results with
> > > > that other thing, surprisingly enough?
> > >
> > > How do you imagine this to work then?
> > > I assume we don't make any 'resulting-OPP-guesses' like
> > > sugov_effective_cpu_perf() for any of the setpolicy governors.
> > > Neither for dbs and I guess userspace.
> > > What about standard powersave and performance?
> > > Do we just have a cpufreq callback to ask which OPP to use for
> > > the energy calculation? Assume lowest/highest?
> > > (I don't think there is hardware where lowest/highest makes a
> > > difference, so maybe not bothering with the complexity could
> > > be an option, too.)
> >
> > In the "setpolicy" case there is no way to reliably predict the OPP
> > that is going to be used, so why bother?
> >
> > In the other cases, and if the OPPs are actually known, EAS may still
> > make assumptions regarding which of them will be used that will match
> > the schedutil selection rules, but if the cpufreq governor happens to
> > choose a different OPP, this is not the end of the world.
>
> Should we add a new cpufreq governor fops to return the best estimate
> of the compute capacity selection? Something like
> cpufreq_effective_cpu_perf(cpu, actual, min, max)
> EAS needs to estimate what would be the next OPP; schedutil uses
> sugov_effective_cpu_perf() and other governors could provide their own

Generally, yes.  And documented for that matter.

But it doesn't really tell you the OPP, but the performance level that
is going to be set for the given list of arguments IIUC.  An energy
model is needed to find an OPP for the given perf level.  Or generally
the cost of it for that matter.

> > Yes, you could have been more energy-efficient had you chosen to use
> > schedutil, but you chose otherwise and that's what you get.
>
> Calling sugov_effective_cpu_perf() for another governor than schedutil
> doesn't make sense.

It will work for intel_pstate in the "setpolicy" mode to a reasonable
approximation AFAICS.

> And do we handle the case when
> CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not selected?

I don't think it's necessary to handle it.
On 29/11/2024 17:00, Rafael J. Wysocki wrote:

[...]

> @@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
>  			}
>  			return false;
>  		}
> +		/* Require schedutil or a "setpolicy" driver */
>  		gov = policy->governor;
> +		cpufreq_ok = gov == &schedutil_gov ||
> +				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
>  		cpufreq_cpu_put(policy);
> -		if (gov != &schedutil_gov) {
> +		if (!cpufreq_ok) {
>  			if (sched_debug()) {
> -				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
> +				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
>  					cpumask_pr_args(cpu_mask));
>  			}
>  			return false;

build_perf_domains(), which calls sched_is_eas_possible(), has schedutil (4)
mentioned in the function header as well:

/*
 * EAS can be used on a root domain if it meets all the following conditions:
 *    1. an Energy Model (EM) is available;
 *    2. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
 *    3. no SMT is detected.
 *    4. schedutil is driving the frequency of all CPUs of the rd; <-- !
 *    5. frequency invariance support is present;
 */

IMHO, this patch should remove the function header since the conditions in
sched_is_eas_possible() have comments already or are self-explanatory.
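Earlier in the thread, a per-governor hook along the lines of cpufreq_effective_cpu_perf(cpu, actual, min, max) was floated, with the open question of what to do for governors that do not provide one. The following is a minimal userspace sketch of that idea only, not kernel code and not an existing cpufreq interface. Every `model_*` name is hypothetical, the 25% headroom in the schedutil-style estimate is an assumption of this sketch, and falling back to the schedutil-style estimate when a governor has no hook is just one possible reading of the "not the end of the world" position.

```c
#include <stddef.h>

/* Hypothetical hook: performance level a governor expects to request. */
typedef unsigned long (*model_perf_fn)(int cpu, unsigned long actual,
				       unsigned long min, unsigned long max);

/* Simplified stand-in for a governor; not the kernel's struct. */
struct model_gov_ops {
	const char *name;
	model_perf_fn effective_cpu_perf;	/* may be NULL */
};

/* Schedutil-style estimate: add headroom, then clamp to [min, max]. */
static unsigned long model_schedutil_perf(int cpu, unsigned long actual,
					  unsigned long min, unsigned long max)
{
	(void)cpu;
	actual += actual >> 2;		/* assumed 25% headroom */
	if (actual < max)
		max = actual;
	return min > max ? min : max;
}

/* A "performance"-like governor always asks for the upper limit. */
static unsigned long model_performance_perf(int cpu, unsigned long actual,
					    unsigned long min, unsigned long max)
{
	(void)cpu; (void)actual; (void)min;
	return max;
}

/* A "powersave"-like governor always asks for the lower limit. */
static unsigned long model_powersave_perf(int cpu, unsigned long actual,
					  unsigned long min, unsigned long max)
{
	(void)cpu; (void)actual; (void)max;
	return min;
}

/* Dispatch to the governor's hook, else assume schedutil-like behaviour. */
static unsigned long
model_cpufreq_effective_cpu_perf(const struct model_gov_ops *gov, int cpu,
				 unsigned long actual, unsigned long min,
				 unsigned long max)
{
	if (gov && gov->effective_cpu_perf)
		return gov->effective_cpu_perf(cpu, actual, min, max);
	return model_schedutil_perf(cpu, actual, min, max);
}
```

With actual = 400, min = 100 and max = 1024, the schedutil-style estimate in this model is 500, while the performance-like and powersave-like hooks return 1024 and 100 respectively. As noted in the thread, such a value is a performance level, not an OPP; an energy model would still be needed to map it to a cost.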
Index: linux-pm/kernel/sched/topology.c
===================================================================
--- linux-pm.orig/kernel/sched/topology.c
+++ linux-pm/kernel/sched/topology.c
@@ -217,6 +217,7 @@ static bool sched_is_eas_possible(const
 	bool any_asym_capacity = false;
 	struct cpufreq_policy *policy;
 	struct cpufreq_governor *gov;
+	bool cpufreq_ok;
 	int i;
 
 	/* EAS is enabled for asymmetric CPU capacity topologies. */
@@ -251,7 +252,7 @@ static bool sched_is_eas_possible(const
 		return false;
 	}
 
-	/* Do not attempt EAS if schedutil is not being used. */
+	/* Do not attempt EAS if cpufreq is not configured adequately */
 	for_each_cpu(i, cpu_mask) {
 		policy = cpufreq_cpu_get(i);
 		if (!policy) {
@@ -261,11 +262,14 @@ static bool sched_is_eas_possible(const
 			}
 			return false;
 		}
+		/* Require schedutil or a "setpolicy" driver */
 		gov = policy->governor;
+		cpufreq_ok = gov == &schedutil_gov ||
+				(!gov && policy->policy != CPUFREQ_POLICY_UNKNOWN);
 		cpufreq_cpu_put(policy);
-		if (gov != &schedutil_gov) {
+		if (!cpufreq_ok) {
 			if (sched_debug()) {
-				pr_info("rd %*pbl: Checking EAS, schedutil is mandatory\n",
+				pr_info("rd %*pbl: Checking EAS, unsuitable cpufreq governor\n",
 					cpumask_pr_args(cpu_mask));
 			}
 			return false;
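
The new condition can also be read as a standalone predicate, roughly what a helper such as the cpufreq_eas_capable() suggested in review might encapsulate. The sketch below is a userspace model for illustration only: the `model_*` types are simplified stand-ins rather than the kernel definitions, and the helper name and its NULL handling are assumptions of this sketch, not part of the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the policy values; UNKNOWN means no "setpolicy" policy set. */
enum model_policy {
	MODEL_POLICY_UNKNOWN,
	MODEL_POLICY_POWERSAVE,
	MODEL_POLICY_PERFORMANCE,
};

/* Stand-in for a cpufreq governor object. */
struct model_governor {
	const char *name;
};

/* Stand-in for the two policy fields the patch inspects. */
struct model_cpufreq_policy {
	const struct model_governor *governor;	/* NULL for "setpolicy" drivers */
	enum model_policy policy;
};

static const struct model_governor model_schedutil = { "schedutil" };
static const struct model_governor model_ondemand = { "ondemand" };

/*
 * Hypothetical helper mirroring the patch's cpufreq_ok expression:
 * accept schedutil, or no governor at all together with a known policy.
 */
static bool model_cpufreq_eas_capable(const struct model_cpufreq_policy *p)
{
	if (!p)
		return false;

	return p->governor == &model_schedutil ||
	       (!p->governor && p->policy != MODEL_POLICY_UNKNOWN);
}
```

In this model a policy driven by schedutil passes, one driven by another generic governor such as ondemand is rejected, and a governor-less policy passes only when its policy field is something other than unknown, which corresponds to the "setpolicy" case the patch is meant to admit.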