Message ID | 3353401.44csPzL39Z@rjwysocki.net |
---|---|
State | New |
Headers | show |
Series | cpufreq: intel_pstate: Enable EAS on hybrid platforms without SMT | expand |
On Tue, Dec 17, 2024 at 10:38 AM Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > > On 29/11/2024 17:02, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > > > Introduce a helper function for adding a CPU to an existing EM perf > > domain. > > > > Subsequently, this will be used by the intel_pstate driver to add new > > CPUs to existing perf domains when those CPUs go online for the first > > time after the initialization of the driver. > > > > No intentional functional impact. > > > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > > --- > > > > v0.1 -> v0.2: No changes > > Could you add information why this new EM interface is needed? There is some of it in the changelog already. In fact, it is only needed in a corner case when the system starts with some CPUs offline and they only go online later (as a result of an explicit request from user space). That is the only case when a new CPU may need to be added to an existing perf domain. > IIRC, you can't use the existing way (cpufreq_driver::register_em) since > it gets called to early (3) for the PD cpumasks to be ready. This issue > will be there for any system in which uarch domains are not congruent > with clock domains which we hadn't have to deal with Arm's heterogeneous > CPUs so far. Basically, yes. > __init intel_pstate_init() > > intel_pstate_register_driver() > > cpufreq_register_driver() > > subsys_interface_register() > > sif->add_dev() -> cpufreq_add_dev() > > cpufreq_online() > > if (!new_policy && cpufreq_driver->online) > > else > > cpufreq_driver->init() -> intel_pstate_cpu_init() > > __intel_pstate_cpu_init() > > intel_pstate_init_cpu() > > intel_pstate_get_cpu_pstates() > > hybrid_add_to_domain() > > em_dev_expand_perf_domain() <-- (1) > > intel_pstate_init_acpi_perf_limits() > > intel_pstate_set_itmt_prio() <-- (2) > > if (new_policy) > > cpufreq_driver->register_em() <-- (3) > > hybrid_init_cpu_capacity_scaling() > > hybrid_refresh_cpu_capacity_scaling() > > __hybrid_refresh_cpu_capacity_scaling() <-- (4) > > hybrid_register_all_perf_domains() > > hybrid_register_perf_domain() > > em_dev_register_perf_domain() <-- (5) > > /* Enable EAS */ > sched_clear_itmt_support() <-- (6) > > Debugging this on a 'nosmt' i7-13700K (online mask = > [0,2,4,6,8,10,12,14,16-23] > > (1) Add CPU to existing hybrid PD or create new hybrid PD. Not exactly. (1) is just to expand an existing perf domain if the CPU is new (was offline all the time previously). Likewise, the direct hybrid_register_perf_domain() call in hybrid_add_to_domain() is to add a new perf domain if the given CPU is new (was offline all the time previously) and is the first one of the given type (say, the system is starting with all E-cores offline). It won't succeed before (4). For CPUs that are online to start with, hybrid_add_to_domain() assigns them to hybrid domains and PDs are created for them in hybrid_register_all_perf_domains(). > (2) Triggers sched domain rebuild (+ enabling EAS) already here during > startup ? This is just to enable ITMT which is the default mechanism for Intel hybrid platforms. It also requires a rebuild of sched domains to be enabled. > IMHO, reason is that max_highest_perf > min_highest_perf because of > different itmt prio Yes (which means that the platform is at least not homogenous). This really has been introduced for the handling of favored cores on otherwise non-hybrid platforms (say Tiger Lake). > Happens for CPU8 on my machine (after CPU8 is added to hybrid PD > 0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)). > So it looks like EAS is enabled before (6) ? No, it is ITMT because CPU8 is a favored core. > (3) ARM's way to do (5) > (4) Setting hybrid_max_perf_cpu > (5) Register EM here > (6) Actual call to initially triggers sched domain rebuild (+ enable > EAS) (done already in (2) on my machine) This is the second rebuild of sched domains to turn off ITMT and enable EAS. The previous one is to enable ITMT. The earlier enabling of ITMT could be avoided I think, but that would be a complication on platforms that contain favored cores but otherwise are not hybrid. > So (3) is not possible for Intel hybrid since the policy's cpumask(s) It is possible in principle, but not particularly convenient because at that point it is not clear whether or not the platform is really hybrid and SMT is off and so whether or not EAS is to be used. > contain only one CPUs, i.e. CPUs are not sharing clock. > And those cpumasks have to be build under (1) to be used in (5)? They are built by the caller of (1) to be precise, but yes.
On 17/12/2024 21:40, Rafael J. Wysocki wrote: > On Tue, Dec 17, 2024 at 10:38 AM Dietmar Eggemann > <dietmar.eggemann@arm.com> wrote: >> >> On 29/11/2024 17:02, Rafael J. Wysocki wrote: >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> >>> Introduce a helper function for adding a CPU to an existing EM perf >>> domain. >>> >>> Subsequently, this will be used by the intel_pstate driver to add new >>> CPUs to existing perf domains when those CPUs go online for the first >>> time after the initialization of the driver. >>> >>> No intentional functional impact. >>> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> >>> --- >>> >>> v0.1 -> v0.2: No changes >> >> Could you add information why this new EM interface is needed? > > There is some of it in the changelog already. > > In fact, it is only needed in a corner case when the system starts > with some CPUs offline and they only go online later (as a result of > an explicit request from user space). That is the only case when a > new CPU may need to be added to an existing perf domain. OK, I see. Arm doesn't need this since we derive the masks from the CPUfreq policies so far. I just verified, we both keep hotplugged-out CPU within the PD. That's why we mask the PD cpus with cpu_online_mask in: find_energy_efficient_cpu() ... for (; pd; pd = pd->next) cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask); >> IIRC, you can't use the existing way (cpufreq_driver::register_em) since >> it gets called to early (3) for the PD cpumasks to be ready. This issue >> will be there for any system in which uarch domains are not congruent >> with clock domains which we hadn't have to deal with Arm's heterogeneous >> CPUs so far. > > Basically, yes. > >> __init intel_pstate_init() >> >> intel_pstate_register_driver() >> >> cpufreq_register_driver() >> >> subsys_interface_register() >> >> sif->add_dev() -> cpufreq_add_dev() >> >> cpufreq_online() >> >> if (!new_policy && cpufreq_driver->online) >> >> else >> >> cpufreq_driver->init() -> intel_pstate_cpu_init() >> >> __intel_pstate_cpu_init() >> >> intel_pstate_init_cpu() >> >> intel_pstate_get_cpu_pstates() >> >> hybrid_add_to_domain() >> >> em_dev_expand_perf_domain() <-- (1) >> >> intel_pstate_init_acpi_perf_limits() >> >> intel_pstate_set_itmt_prio() <-- (2) >> >> if (new_policy) >> >> cpufreq_driver->register_em() <-- (3) >> >> hybrid_init_cpu_capacity_scaling() >> >> hybrid_refresh_cpu_capacity_scaling() >> >> __hybrid_refresh_cpu_capacity_scaling() <-- (4) >> >> hybrid_register_all_perf_domains() >> >> hybrid_register_perf_domain() >> >> em_dev_register_perf_domain() <-- (5) >> >> /* Enable EAS */ >> sched_clear_itmt_support() <-- (6) >> >> Debugging this on a 'nosmt' i7-13700K (online mask = >> [0,2,4,6,8,10,12,14,16-23] >> >> (1) Add CPU to existing hybrid PD or create new hybrid PD. > > Not exactly. > > (1) is just to expand an existing perf domain if the CPU is new (was > offline all the time previously). OK. > > Likewise, the direct hybrid_register_perf_domain() call in > hybrid_add_to_domain() is to add a new perf domain if the given CPU is > new (was offline all the time previously) and is the first one of the > given type (say, the system is starting with all E-cores offline). > It won't succeed before (4). > > For CPUs that are online to start with, hybrid_add_to_domain() assigns > them to hybrid domains and PDs are created for them in > hybrid_register_all_perf_domains(). Understood. > >> (2) Triggers sched domain rebuild (+ enabling EAS) already here during >> startup ? > > This is just to enable ITMT which is the default mechanism for Intel > hybrid platforms. It also requires a rebuild of sched domains to be > enabled. > >> IMHO, reason is that max_highest_perf > min_highest_perf because of >> different itmt prio > > Yes (which means that the platform is at least not homogenous). > > This really has been introduced for the handling of favored cores on > otherwise non-hybrid platforms (say Tiger Lake). > >> Happens for CPU8 on my machine (after CPU8 is added to hybrid PD >> 0,2,4,6,8) (itmt prio for CPU8=69 (1024) instead of 68 (1009)). >> So it looks like EAS is enabled before (6) ? > > No, it is ITMT because CPU8 is a favored core. > >> (3) ARM's way to do (5) >> (4) Setting hybrid_max_perf_cpu >> (5) Register EM here >> (6) Actual call to initially triggers sched domain rebuild (+ enable >> EAS) (done already in (2) on my machine) > > This is the second rebuild of sched domains to turn off ITMT and > enable EAS. The previous one is to enable ITMT. > > The earlier enabling of ITMT could be avoided I think, but that would > be a complication on platforms that contain favored cores but > otherwise are not hybrid. OK, the fact that EAS will already be enabled in (2) is not an issue IMHO. > >> So (3) is not possible for Intel hybrid since the policy's cpumask(s) > > It is possible in principle, but not particularly convenient because > at that point it is not clear whether or not the platform is really > hybrid and SMT is off and so whether or not EAS is to be used. > >> contain only one CPUs, i.e. CPUs are not sharing clock. >> And those cpumasks have to be build under (1) to be used in (5)? > > They are built by the caller of (1) to be precise, but yes. OK. I still see an issue in putting all performance CPUs in one PD. find_energy_efficient_cpu() assumes that all PD CPUs has the same CPU capacity value. find_energy_efficient_cpu() ... for (; pd; pd = pd->next) { ... /* Account external pressure for the energy estimation */ cpu = cpumask_first(cpus); cpu_actual_cap = get_actual_cpu_capacity(cpu); --> (1) Even though X86 does not implement hw_load_avg() or cpufreq_get_pressure() we would still assume that all PD CPUs have the same cpu_capacity: get_actual_cpu_capacity() <-- (1) capacity = arch_scale_cpu_capacity(cpu) Now the error introduced is small (1024 versus 1009) on my i7-13700K but it's there. How big can those diffs based om itmt prio be?
Index: linux-pm/kernel/power/energy_model.c =================================================================== --- linux-pm.orig/kernel/power/energy_model.c +++ linux-pm/kernel/power/energy_model.c @@ -676,6 +676,38 @@ void em_dev_unregister_perf_domain(struc } EXPORT_SYMBOL_GPL(em_dev_unregister_perf_domain); +/** + * em_dev_expand_perf_domain() - Expand CPU perf domain + * @dev: CPU device of a CPU in the perf domain. + * @new_cpu: CPU to add to the perf domain. + */ +int em_dev_expand_perf_domain(struct device *dev, int new_cpu) +{ + struct device *new_cpu_dev; + struct em_perf_domain *pd; + + if (IS_ERR_OR_NULL(dev) || !_is_cpu_device(dev)) + return -EINVAL; + + new_cpu_dev = get_cpu_device(new_cpu); + if (!new_cpu_dev) + return -EINVAL; + + guard(mutex)(&em_pd_mutex); + + if (em_pd_get(new_cpu_dev)) + return -EEXIST; + + pd = em_pd_get(dev); + if (!pd) + return -EINVAL; + + cpumask_set_cpu(new_cpu, em_span_cpus(pd)); + new_cpu_dev->em_pd = pd; + + return 0; +} + static struct em_perf_table __rcu *em_table_dup(struct em_perf_domain *pd) { struct em_perf_table __rcu *em_table; Index: linux-pm/include/linux/energy_model.h =================================================================== --- linux-pm.orig/include/linux/energy_model.h +++ linux-pm/include/linux/energy_model.h @@ -172,6 +172,7 @@ int em_dev_register_perf_domain(struct d struct em_data_callback *cb, cpumask_t *span, bool microwatts); void em_dev_unregister_perf_domain(struct device *dev); +int em_dev_expand_perf_domain(struct device *dev, int new_cpu); struct em_perf_table __rcu *em_table_alloc(struct em_perf_domain *pd); void em_table_free(struct em_perf_table __rcu *table); int em_dev_compute_costs(struct device *dev, struct em_perf_state *table, @@ -354,6 +355,10 @@ int em_dev_register_perf_domain(struct d static inline void em_dev_unregister_perf_domain(struct device *dev) { } +static inline int em_dev_expand_perf_domain(struct device *dev, int new_cpu) +{ + return -EINVAL; +} static inline struct em_perf_domain *em_cpu_get(int cpu) { return NULL;