diff mbox series

[RFC,v021,5/9] PM: EM: Introduce em_dev_expand_perf_domain()

Message ID 3353401.44csPzL39Z@rjwysocki.net
State New
Headers show
Series cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT | expand

Commit Message

Rafael J. Wysocki Nov. 29, 2024, 4:02 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Introduce a helper function for adding a CPU to an existing EM perf
domain.

Subsequently, this will be used by the intel_pstate driver to add new
CPUs to existing perf domains when those CPUs go online for the first
time after the initialization of the driver.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v0.1 -> v0.2: No changes

---
 include/linux/energy_model.h |    5 +++++
 kernel/power/energy_model.c  |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

Comments

Rafael J. Wysocki Dec. 17, 2024, 8:40 p.m. UTC | #1
On Tue, Dec 17, 2024 at 10:38 AM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 29/11/2024 17:02, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Introduce a helper function for adding a CPU to an existing EM perf
> > domain.
> >
> > Subsequently, this will be used by the intel_pstate driver to add new
> > CPUs to existing perf domains when those CPUs go online for the first
> > time after the initialization of the driver.
> >
> > No intentional functional impact.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > v0.1 -> v0.2: No changes
>
> Could you add information why this new EM interface is needed?

There is some of it in the changelog already.

In fact, it is only needed in a corner case when the system starts
with some CPUs offline and they only go online later (as a result of
an explicit request from user space).  That is the only case when a
new CPU may need to be added to an existing perf domain.

> IIRC, you can't use the existing way (cpufreq_driver::register_em) since
> it gets called to early (3) for the PD cpumasks to be ready. This issue
> will be there for any system in which uarch domains are not congruent
> with clock domains which we hadn't have to deal with Arm's heterogeneous
> CPUs so far.

Basically, yes.

> __init intel_pstate_init()
>
>   intel_pstate_register_driver()
>
>     cpufreq_register_driver()
>
>       subsys_interface_register()
>
>         sif->add_dev() -> cpufreq_add_dev()
>
>           cpufreq_online()
>
>             if (!new_policy && cpufreq_driver->online)
>
>             else
>
>               cpufreq_driver->init() -> intel_pstate_cpu_init()
>
>                 __intel_pstate_cpu_init()
>
>                   intel_pstate_init_cpu()
>
>                     intel_pstate_get_cpu_pstates()
>
>                       hybrid_add_to_domain()
>
>                         em_dev_expand_perf_domain()              <-- (1)
>
>                   intel_pstate_init_acpi_perf_limits()
>
>                     intel_pstate_set_itmt_prio()                 <-- (2)
>
>             if (new_policy)
>
>               cpufreq_driver->register_em()                      <-- (3)
>
>     hybrid_init_cpu_capacity_scaling()
>
>       hybrid_refresh_cpu_capacity_scaling()
>
>         __hybrid_refresh_cpu_capacity_scaling()                  <-- (4)
>
>         hybrid_register_all_perf_domains()
>
>           hybrid_register_perf_domain()
>
>             em_dev_register_perf_domain()                        <-- (5)
>
>       /* Enable EAS */
>       sched_clear_itmt_support()                                 <-- (6)
>
> Debugging this on a 'nosmt' i7-13700K (online mask =
> [0,2,4,6,8,10,12,14,16-23]
>
> (1) Add CPU to existing hybrid PD or create new hybrid PD.

Not exactly.

(1) is just to expand an existing perf domain if the CPU is new (was
offline all the time previously).

Likewise, the direct hybrid_register_perf_domain() call in
hybrid_add_to_domain() is to add a new perf domain if the given CPU is
new (was offline all the time previously) and is the first one of the
given type (say, the system is starting with all E-cores offline).
It won't succeed before (4).

For CPUs that are online to start with, hybrid_add_to_domain() assigns
them to hybrid domains and PDs are created for them in
hybrid_register_all_perf_domains().

> (2) Triggers sched domain rebuild (+ enabling EAS) already here during
>     startup ?

This is just to enable ITMT which is the default mechanism for Intel
hybrid platforms.  It also requires a rebuild of sched domains to be
enabled.

>     IMHO, reason is that max_highest_perf > min_highest_perf because of
>     different itmt prio

Yes (which means that the platform is at least not homogenous).

This really has been introduced for the handling of favored cores on
otherwise non-hybrid platforms (say Tiger Lake).

>     Happens for CPU8 on my machine (after CPU8 is added to hybrid PD
>     0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)).
>     So it looks like EAS is enabled before (6) ?

No, it is ITMT because CPU8 is a favored core.

> (3) ARM's way to do (5)
> (4) Setting hybrid_max_perf_cpu
> (5) Register EM here
> (6) Actual call to initially triggers sched domain rebuild (+ enable
>     EAS) (done already in (2) on my machine)

This is the second rebuild of sched domains to turn off ITMT and
enable EAS.  The previous one is to enable ITMT.

The earlier enabling of ITMT could be avoided I think, but that would
be a complication on platforms that contain favored cores but
otherwise are not hybrid.

> So (3) is not possible for Intel hybrid since the policy's cpumask(s)

It is possible in principle, but not particularly convenient because
at that point it is not clear whether or not the platform is really
hybrid and SMT is off and so whether or not EAS is to be used.

> contain only one CPUs, i.e. CPUs are not sharing clock.
> And those cpumasks have to be build under (1) to be used in (5)?

They are built by the caller of (1) to be precise, but yes.
diff mbox series

Patch

Index: linux-pm/kernel/power/energy_model.c
===================================================================
--- linux-pm.orig/kernel/power/energy_model.c
+++ linux-pm/kernel/power/energy_model.c
@@ -676,6 +676,38 @@  void em_dev_unregister_perf_domain(struc
 }
 EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain);
 
+/**
+ * em_dev_expand_perf_domain() - Expand CPU perf domain
+ * @dev: CPU device of a CPU in the perf domain.
+ * @new_cpu: CPU to add to the perf domain.
+ */
+int em_dev_expand_perf_domain(struct device *dev, int new_cpu)
+{
+	struct device *new_cpu_dev;
+	struct em_perf_domain *pd;
+
+	if (IS_ERR_OR_NULL(dev) || !_is_cpu_device(dev))
+		return -EINVAL;
+
+	new_cpu_dev = get_cpu_device(new_cpu);
+	if (!new_cpu_dev)
+		return -EINVAL;
+
+	guard(mutex)(&em_pd_mutex);
+
+	if (em_pd_get(new_cpu_dev))
+		return -EEXIST;
+
+	pd = em_pd_get(dev);
+	if (!pd)
+		return -EINVAL;
+
+	cpumask_set_cpu(new_cpu, em_span_cpus(pd));
+	new_cpu_dev->em_pd = pd;
+
+	return 0;
+}
+
 static struct em_perf_table __rcu *em_table_dup(struct em_perf_domain *pd)
 {
 	struct em_perf_table __rcu *em_table;
Index: linux-pm/include/linux/energy_model.h
===================================================================
--- linux-pm.orig/include/linux/energy_model.h
+++ linux-pm/include/linux/energy_model.h
@@ -172,6 +172,7 @@  int em_dev_register_perf_domain(struct d
 				struct em_data_callback *cb, cpumask_t *span,
 				bool microwatts);
 void em_dev_unregister_perf_domain(struct device *dev);
+int em_dev_expand_perf_domain(struct device *dev, int new_cpu);
 struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd);
 void em_table_free(struct em_perf_table __rcu *table);
 int em_dev_compute_costs(struct device *dev, struct em_perf_state *table,
@@ -354,6 +355,10 @@  int em_dev_register_perf_domain(struct d
 static inline void em_dev_unregister_perf_domain(struct device *dev)
 {
 }
+static inline int em_dev_expand_perf_domain(struct device *dev, int new_cpu)
+{
+	return -EINVAL;
+}
 static inline struct em_perf_domain *em_cpu_get(int cpu)
 {
 	return NULL;