
[RFC,v0.2,0/9] cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT

Message ID 5861970.DvuYhMxLoT@rjwysocki.net

Message

Rafael J. Wysocki Nov. 29, 2024, 3:55 p.m. UTC
Hi Everyone,

This is a new iteration of the "EAS for intel_pstate" work:

https://lore.kernel.org/linux-pm/3607404.iIbC2pHGDl@rjwysocki.net/

It contains a few new patches and almost all of the patches sent previously
have been updated.

The following paragraph from the original cover letter still applies:

"The underlying observation is that on the platforms targeted by these changes,
Lunar Lake at the time of this writing, the "small" CPUs (E-cores), when run at
the same performance level, are always more energy-efficient than the "big" or
"performance" CPUs (P-cores).  This means that, regardless of the scale-
invariant utilization of a task, as long as there is enough spare capacity on
E-cores, the relative cost of running it there is always lower."

Thus the idea is still to register a perf domain per CPU type, but this time
there may be more than just two of them because of the first patch.

The states table in each of these perf domains is still one-element and that
element only contains the cost value, but this time the costs are computed
and not prescribed (see the last patch).  Nevertheless, the expected effect
is still that the perf domains (or CPU types) with lower cost values will
be preferred so long as there is enough spare capacity in them.
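To illustrate the selection logic described above, here is a minimal
user-space sketch (not the actual scheduler code) of why a lower-cost
perf domain wins as long as it has spare capacity.  The struct, field
names and numbers are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-element perf-domain state: only the cost matters. */
struct pd_state {
	unsigned long cost;       /* abstract energy cost of running here */
	unsigned long spare_cap;  /* remaining scale-invariant capacity */
};

/*
 * Pick the lowest-cost domain that still has enough spare capacity for
 * a task of the given utilization; return -1 if none fits.
 */
static int pick_domain(const struct pd_state *pd, size_t n, unsigned long util)
{
	unsigned long best_cost = ~0UL;
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pd[i].spare_cap >= util && pd[i].cost < best_cost) {
			best_cost = pd[i].cost;
			best = (int)i;
		}
	}
	return best;
}
```

With a cheap E-core domain and a pricier P-core domain, small tasks land
on the former and only spill over to the latter once its spare capacity
runs out, which is the behavior the quoted paragraph describes.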

The first two patches are not really RFC, but they are included here because
patches [8-9/9] depend on patch [1/9].  They will be resent next week as
non-RFC 6.14-candidate material.

They make a significant difference because the number of CPU types is no longer
known in advance, so the cost values for each of them cannot be prescribed.
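Since the costs now have to be computed, it may help to recall the
conventional EM cost arithmetic (cost = fmax * power / freq, as in
em_compute_costs()); a rough sketch of that computation with made-up
frequency and power numbers, not necessarily what patch [9/9] ends up
doing:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Conventional EM per-state cost: power scaled by the ratio of the
 * domain's maximum frequency to the state's frequency.  With a
 * one-element states table, freq == fmax and the cost reduces to the
 * power value itself, so only relative power across domains matters.
 */
static uint64_t em_state_cost(uint64_t fmax_khz, uint64_t power,
			      uint64_t freq_khz)
{
	return fmax_khz * power / freq_khz;
}
```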

Patch [3/9] is also a change that I'd like to make regardless of what
happens to the rest of the series because it effectively moves EM code
from the schedutil governor to EM where it belongs.  Of course, patch
[9/9] also depends on it.

Patch [4/9] differs from its previous version,

https://lore.kernel.org/linux-pm/1889415.atdPhlSkOF@rjwysocki.net/

because gov is NULL not only when it is not used at all, but also during the
cpufreq policy init and exit, so the check in the patch had to be adjusted
to match the former case only.  [As a side note, I don't think that the code
modified by patch [4/9] belongs in sched/topology, as it messes around with
cpufreq internals.  At the very least, it should be moved to cpufreq and
called from sched_is_eas_possible(), but I'm also not convinced that it is
necessary at all.  This is not directly related to the $subject series, though.]

Patch [5/9] adds a new function needed by patch [9/9] and it is the same as
its previous version:

https://lore.kernel.org/linux-pm/2223963.Mh6RI2rZIc@rjwysocki.net/

Patch [6/9] is almost the same as its previous version:

https://lore.kernel.org/linux-pm/1821040.VLH7GnMWUR@rjwysocki.net/

but its changelog has been expanded a bit as suggested by Dietmar.  It
simply rearranges the EM code without changing its functionality, so the
next patch is more straightforward.

Patch [7/9] is a somewhat updated counterpart of

https://lore.kernel.org/linux-pm/2017201.usQuhbGJ8B@rjwysocki.net/

It still changes the EM code to allow a perf domain with a one-element states
table to be registered without providing the :active_power() callback (which
is then done in the last patch), but it is somewhat simpler.  It also
contains some discussion regarding the requirement that the capacity of
all CPUs in a perf domain must be the same.  In a short summary, I'm not
convinced that it is actually valid.

Patches [8-9/9] modify intel_pstate.  The first one is preparatory, but it
is useful for explaining the basic concept, which is "hybrid domains" that
each contain CPUs of the same type.

The last patch is just the registration of EM perf domains (one for each hybrid
domain), expanding them when needed and rebuilding sched domains in some corner
cases.  It also contains some discussion that doesn't technically belong to the
changelog, but is useful for explaining the background for some decisions.

Please refer to the individual patch changelogs for details.

For easier access, the series is available on the experimental/intel_pstate
branch in linux-pm.git:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=experimental/intel_pstate

Thanks!


Comments

Christian Loehle Dec. 12, 2024, 5:04 p.m. UTC | #1
On 11/29/24 16:21, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Hybrid platforms contain different types of CPUs.  They may differ
> by micro-architecture, by cache topology, by manufacturing process, by
> the interconnect access design etc.  Of course, this means that power-
> performance curves for CPUs of different types are generally different.
> 
> Because of these differences, CPUs of different types need to be handled
> differently in certain situations and so it is convenient to operate
> groups of CPUs that each contain CPUs of the same type.  In intel_pstate,
> each of them will be represented by a struct hybrid_domain object and
> referred to as a hybrid domain.
> 
> A key problem is how to identify the type of a CPU so as to know which
> hybrid domain it belongs to.  In principle, there are a few ways to do
> it, but none of them is perfectly reliable.
> 
> From the computational perspective, an important factor is how many
> instructions (on average) can be executed by the given CPU when it is
> running at a specific frequency, often referred to as the IPC
> (instructions per cycle) ratio of the given CPU to the least-capable
> CPU in the system.  In intel_pstate this ratio is represented by the
> performance-to-frequency scaling factor which needs to be used to get
> a frequency in kHz for a given HWP performance level of the given CPU.
> Since HWP performance levels are in the same units for all CPUs in a
> hybrid system, the smaller the scaling factor, the larger the IPC ratio
> for the given CPU.
> 
> Of course, the performance-to-frequency scaling factor must be the
> same for all CPUs of the same type.  While it may be the same for CPUs
> of different types, there is only one case in which that actually
> happens (Meteor Lake platforms with two types of E-cores) and it is not
> expected to happen again in the future.  Moreover, when it happens,
> there is no straightforward way to distinguish CPUs of different types
> with the same scaling factor in general.
> 
> For this reason, the scaling factor is as good as it gets for CPU
> type identification and so it is used for building hybrid domains in
> intel_pstate.
> 
> On hybrid systems, every CPU is added to a hybrid domain at the
> initialization time.  If a hybrid domain with a matching scaling
> factor is already present at that point, the CPU will be added to it.
> Otherwise, a new hybrid domain will be created and the CPU will be
> put into it.  The domain's scaling factor will then be set to the
> one of the new CPU.

Just two irrelevant typos below, although for the unfamiliar maybe an
example debug message output from any Arrow Lake would make this more
concrete?

> 
> So far, the new code doesn't do much beyond printing debud messages,

s/debud/debug

> but subsequently the EAS support for intel_pstate will be based on it.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c |   57 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -943,6 +943,62 @@ static struct cpudata *hybrid_max_perf_c
>   */
>  static DEFINE_MUTEX(hybrid_capacity_lock);
>  
> +#ifdef CONFIG_ENERGY_MODEL
> +/*
> + * A hybrid domain is a collection of CPUs with the same perf-to-frequency
> + * scaling factor.
> + */
> +struct hybrid_domain {
> +	struct hybrid_domain *next;
> +	cpumask_t cpumask;
> +	int scaling;
> +};
> +
> +static struct hybrid_domain *hybrid_domains;
> +
> +static void hybrid_add_to_domain(struct cpudata *cpudata)
> +{
> +	int scaling = cpudata->pstate.scaling;
> +	int cpu = cpudata->cpu;
> +	struct hybrid_domain *hd;
> +
> +	/* Do this only on hubrid platforms. */

s/hubrid/hybrid