Message ID: 20231212142730.998913-1-vincent.guittot@linaro.org
Series: Rework system pressure interface to the scheduler
On 12-12-23, 15:27, Vincent Guittot wrote:
> Provide to the scheduler a feedback about the temporary max available
> capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> filtered as the pressure will happen for dozens ms or more.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
>  include/linux/cpufreq.h   | 10 ++++++++
>  2 files changed, 58 insertions(+)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 44db4f59c4cc..7d5f71be8d29 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
>  }
>  EXPORT_SYMBOL(cpufreq_get_policy);
>
> +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> +
> +/**
> + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> + * @cpus        : The related CPUs for which max capacity has been reduced
> + * @capped_freq : The maximum allowed frequency that CPUs can run at
> + *
> + * Update the value of cpufreq pressure for all @cpus in the mask. The
> + * cpumask should include all (online+offline) affected CPUs, to avoid
> + * operating on stale data when hot-plug is used for some CPUs. The
> + * @capped_freq reflects the currently allowed max CPUs frequency due to
> + * freq_qos capping. It might be also a boost frequency value, which is bigger
> + * than the internal 'capacity_freq_ref' max frequency. In such case the
> + * pressure value should simply be removed, since this is an indication that
> + * there is no capping. The @capped_freq must be provided in kHz.
> + */
> +static void cpufreq_update_pressure(const struct cpumask *cpus,

Since this is defined as 'static', why not just pass policy here ?

> +                                    unsigned long capped_freq)
> +{
> +        unsigned long max_capacity, capacity, pressure;
> +        u32 max_freq;
> +        int cpu;
> +
> +        cpu = cpumask_first(cpus);
> +        max_capacity = arch_scale_cpu_capacity(cpu);

This anyway expects all of them to be from the same policy ..

> +        max_freq = arch_scale_freq_ref(cpu);
> +
> +        /*
> +         * Handle properly the boost frequencies, which should simply clean
> +         * the thermal pressure value.
> +         */
> +        if (max_freq <= capped_freq)
> +                capacity = max_capacity;
> +        else
> +                capacity = mult_frac(max_capacity, capped_freq, max_freq);
> +
> +        pressure = max_capacity - capacity;
> +

Extra blank line here.

> +
> +        for_each_cpu(cpu, cpus)
> +                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> +}
> +
>  /**
>   * cpufreq_set_policy - Modify cpufreq policy parameters.
>   * @policy: Policy object to modify.
> @@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>  {
>          struct cpufreq_policy_data new_data;
>          struct cpufreq_governor *old_gov;
> +        struct cpumask *cpus;
>          int ret;
>
>          memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
> @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>          policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
>          trace_cpu_frequency_limits(policy);
>
> +        cpus = policy->related_cpus;

You don't need the extra variable anyway, but lets just pass policy
instead to the routine.

> +        cpufreq_update_pressure(cpus, policy->max);
> +
>          policy->cached_target_freq = UINT_MAX;
>
>          pr_debug("new min and max freqs are %u - %u kHz\n",
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index afda5f24d3dd..b1d97edd3253 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
>  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
>  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
>  bool has_target_index(void);
> +
> +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +        return per_cpu(cpufreq_pressure, cpu);
> +}
>  #else
>  static inline unsigned int cpufreq_get(unsigned int cpu)
>  {
> @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
>          return false;
>  }
>  static inline void disable_cpufreq(void) { }
> +static inline unsigned long cpufreq_get_pressure(int cpu)
> +{
> +        return 0;
> +}
>  #endif
>
>  #ifdef CONFIG_CPU_FREQ_STAT
> --
> 2.34.1
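Patch 2/4 of the series ("sched: Take cpufreq feedback into account") is not quoted in this thread, so the consumer side of the new per-CPU value is not visible here. To make the arithmetic above concrete: with max_capacity = 1024 and arch_scale_freq_ref() returning 2000000 kHz, a freq_qos cap at 1000000 kHz gives capacity = mult_frac(1024, 1000000, 2000000) = 512, so a pressure of 512 is written for every CPU of the policy, while a capped frequency at or above the reference frequency writes 0. Purely as an illustration, not code from the posted series (effective_cpu_capacity() is a hypothetical helper name), a scheduler-side reader of the value could look roughly like this:

/* Illustrative sketch only: fold the cpufreq pressure into the capacity
 * seen by the scheduler. Not part of the posted patches. */
static unsigned long effective_cpu_capacity(int cpu)
{
        unsigned long max_cap = arch_scale_cpu_capacity(cpu);
        unsigned long pressure = cpufreq_get_pressure(cpu);

        /* The pressure can never exceed the CPU's original capacity. */
        return max_cap - min(pressure, max_cap);
}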
On Wed, 13 Dec 2023 at 08:17, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 12-12-23, 15:27, Vincent Guittot wrote:
> > Provide to the scheduler a feedback about the temporary max available
> > capacity. Unlike arch_update_thermal_pressure, this doesn't need to be
> > filtered as the pressure will happen for dozens ms or more.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  drivers/cpufreq/cpufreq.c | 48 +++++++++++++++++++++++++++++++++++++++
> >  include/linux/cpufreq.h   | 10 ++++++++
> >  2 files changed, 58 insertions(+)
> >
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 44db4f59c4cc..7d5f71be8d29 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -2563,6 +2563,50 @@ int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu)
> >  }
> >  EXPORT_SYMBOL(cpufreq_get_policy);
> >
> > +DEFINE_PER_CPU(unsigned long, cpufreq_pressure);
> > +EXPORT_PER_CPU_SYMBOL_GPL(cpufreq_pressure);
> > +
> > +/**
> > + * cpufreq_update_pressure() - Update cpufreq pressure for CPUs
> > + * @cpus        : The related CPUs for which max capacity has been reduced
> > + * @capped_freq : The maximum allowed frequency that CPUs can run at
> > + *
> > + * Update the value of cpufreq pressure for all @cpus in the mask. The
> > + * cpumask should include all (online+offline) affected CPUs, to avoid
> > + * operating on stale data when hot-plug is used for some CPUs. The
> > + * @capped_freq reflects the currently allowed max CPUs frequency due to
> > + * freq_qos capping. It might be also a boost frequency value, which is bigger
> > + * than the internal 'capacity_freq_ref' max frequency. In such case the
> > + * pressure value should simply be removed, since this is an indication that
> > + * there is no capping. The @capped_freq must be provided in kHz.
> > + */
> > +static void cpufreq_update_pressure(const struct cpumask *cpus,
>
> Since this is defined as 'static', why not just pass policy here ?

Mainly because we only need the cpumask and also because this follows
the same pattern as other places like arch_topology.c

>
> > +                                    unsigned long capped_freq)
> > +{
> > +        unsigned long max_capacity, capacity, pressure;
> > +        u32 max_freq;
> > +        int cpu;
> > +
> > +        cpu = cpumask_first(cpus);
> > +        max_capacity = arch_scale_cpu_capacity(cpu);
>
> This anyway expects all of them to be from the same policy ..
>
> > +        max_freq = arch_scale_freq_ref(cpu);
> > +
> > +        /*
> > +         * Handle properly the boost frequencies, which should simply clean
> > +         * the thermal pressure value.
> > +         */
> > +        if (max_freq <= capped_freq)
> > +                capacity = max_capacity;
> > +        else
> > +                capacity = mult_frac(max_capacity, capped_freq, max_freq);
> > +
> > +        pressure = max_capacity - capacity;
> > +
>
> Extra blank line here.
>
> > +
> > +        for_each_cpu(cpu, cpus)
> > +                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
> > +}
> > +
> >  /**
> >   * cpufreq_set_policy - Modify cpufreq policy parameters.
> >   * @policy: Policy object to modify.
> > @@ -2584,6 +2628,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >  {
> >          struct cpufreq_policy_data new_data;
> >          struct cpufreq_governor *old_gov;
> > +        struct cpumask *cpus;
> >          int ret;
> >
> >          memcpy(&new_data.cpuinfo, &policy->cpuinfo, sizeof(policy->cpuinfo));
> > @@ -2618,6 +2663,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> >          policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H);
> >          trace_cpu_frequency_limits(policy);
> >
> > +        cpus = policy->related_cpus;
>
> You don't need the extra variable anyway, but lets just pass policy
> instead to the routine.

In fact I have followed what was done in cpufreq_cooling.c with
arch_update_thermal_pressure(). Will remove it.

>
> > +        cpufreq_update_pressure(cpus, policy->max);
> > +
> >          policy->cached_target_freq = UINT_MAX;
> >
> >          pr_debug("new min and max freqs are %u - %u kHz\n",
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index afda5f24d3dd..b1d97edd3253 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -241,6 +241,12 @@ struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
> >  void cpufreq_enable_fast_switch(struct cpufreq_policy *policy);
> >  void cpufreq_disable_fast_switch(struct cpufreq_policy *policy);
> >  bool has_target_index(void);
> > +
> > +DECLARE_PER_CPU(unsigned long, cpufreq_pressure);
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +        return per_cpu(cpufreq_pressure, cpu);
> > +}
> >  #else
> >  static inline unsigned int cpufreq_get(unsigned int cpu)
> >  {
> > @@ -263,6 +269,10 @@ static inline bool cpufreq_supports_freq_invariance(void)
> >          return false;
> >  }
> >  static inline void disable_cpufreq(void) { }
> > +static inline unsigned long cpufreq_get_pressure(int cpu)
> > +{
> > +        return 0;
> > +}
> >  #endif
> >
> >  #ifdef CONFIG_CPU_FREQ_STAT
> > --
> > 2.34.1
>
> --
> viresh
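Following the agreement above to drop the local 'cpus' variable and hand the policy itself to the helper, the reworked function in a future v2 might look roughly like the sketch below. This is illustrative only, not the actual v2 patch:

/* Sketch of the helper taking the policy directly, as suggested in the
 * review above. Not the posted code. */
static void cpufreq_update_pressure(struct cpufreq_policy *policy)
{
        unsigned long max_capacity, capacity, pressure;
        unsigned long capped_freq = policy->max;
        u32 max_freq;
        int cpu;

        cpu = cpumask_first(policy->related_cpus);
        max_capacity = arch_scale_cpu_capacity(cpu);
        max_freq = arch_scale_freq_ref(cpu);

        /* A capped frequency at or above the reference clears the pressure. */
        if (max_freq <= capped_freq)
                capacity = max_capacity;
        else
                capacity = mult_frac(max_capacity, capped_freq, max_freq);

        pressure = max_capacity - capacity;

        for_each_cpu(cpu, policy->related_cpus)
                WRITE_ONCE(per_cpu(cpufreq_pressure, cpu), pressure);
}

The call site in cpufreq_set_policy() would then reduce to cpufreq_update_pressure(policy), with no extra cpumask variable.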
Hi Vincent,

I've been waiting for this feature, thanks!


On 12/12/23 14:27, Vincent Guittot wrote:
> Following the consolidation and cleanup of CPU capacity in [1], this serie
> reworks how the scheduler gets the pressures on CPUs. We need to take into
> account all pressures applied by cpufreq on the compute capacity of a CPU
> for dozens of ms or more and not only cpufreq cooling device or HW
> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> - one from cpufreq and freq_qos
> - one from HW high freq mitigiation.
>
> The next step will be to add a dedicated interface for long standing
> capping of the CPU capacity (i.e. for seconds or more) like the
> scaling_max_freq of cpufreq sysfs. The latter is already taken into
> account by this serie but as a temporary pressure which is not always the
> best choice when we know that it will happen for seconds or more.
>
> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>
> Vincent Guittot (4):
>   cpufreq: Add a cpufreq pressure feedback for the scheduler
>   sched: Take cpufreq feedback into account
>   thermal/cpufreq: Remove arch_update_thermal_pressure()
>   sched: Rename arch_update_thermal_pressure into
>     arch_update_hw_pressure
>
>  arch/arm/include/asm/topology.h               |  6 +--
>  arch/arm64/include/asm/topology.h             |  6 +--
>  drivers/base/arch_topology.c                  | 26 ++++-----
>  drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>  drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>  drivers/thermal/cpufreq_cooling.c             |  3 --
>  include/linux/arch_topology.h                 |  8 +--
>  include/linux/cpufreq.h                       | 10 ++++
>  include/linux/sched/topology.h                |  8 +--
>  .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>  include/trace/events/sched.h                  |  2 +-
>  init/Kconfig                                  | 12 ++---
>  kernel/sched/core.c                           |  8 +--
>  kernel/sched/fair.c                           | 53 ++++++++++---------
>  kernel/sched/pelt.c                           | 18 +++----
>  kernel/sched/pelt.h                           | 16 +++---
>  kernel/sched/sched.h                          |  4 +-
>  17 files changed, 152 insertions(+), 94 deletions(-)
>  rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>

I would like to test it, but something worries me. Why there is 0/5 in
this subject and only 4 patches?

Could you tell me your base branch that I can apply this, please?

Regards,
Lukasz
On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Vincent,
>
> I've been waiting for this feature, thanks!
>
>
> On 12/12/23 14:27, Vincent Guittot wrote:
> > Following the consolidation and cleanup of CPU capacity in [1], this serie
> > reworks how the scheduler gets the pressures on CPUs. We need to take into
> > account all pressures applied by cpufreq on the compute capacity of a CPU
> > for dozens of ms or more and not only cpufreq cooling device or HW
> > mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
> > - one from cpufreq and freq_qos
> > - one from HW high freq mitigiation.
> >
> > The next step will be to add a dedicated interface for long standing
> > capping of the CPU capacity (i.e. for seconds or more) like the
> > scaling_max_freq of cpufreq sysfs. The latter is already taken into
> > account by this serie but as a temporary pressure which is not always the
> > best choice when we know that it will happen for seconds or more.
> >
> > [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
> >
> > Vincent Guittot (4):
> >   cpufreq: Add a cpufreq pressure feedback for the scheduler
> >   sched: Take cpufreq feedback into account
> >   thermal/cpufreq: Remove arch_update_thermal_pressure()
> >   sched: Rename arch_update_thermal_pressure into
> >     arch_update_hw_pressure
> >
> >  arch/arm/include/asm/topology.h               |  6 +--
> >  arch/arm64/include/asm/topology.h             |  6 +--
> >  drivers/base/arch_topology.c                  | 26 ++++-----
> >  drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
> >  drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
> >  drivers/thermal/cpufreq_cooling.c             |  3 --
> >  include/linux/arch_topology.h                 |  8 +--
> >  include/linux/cpufreq.h                       | 10 ++++
> >  include/linux/sched/topology.h                |  8 +--
> >  .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
> >  include/trace/events/sched.h                  |  2 +-
> >  init/Kconfig                                  | 12 ++---
> >  kernel/sched/core.c                           |  8 +--
> >  kernel/sched/fair.c                           | 53 ++++++++++---------
> >  kernel/sched/pelt.c                           | 18 +++----
> >  kernel/sched/pelt.h                           | 16 +++---
> >  kernel/sched/sched.h                          |  4 +-
> >  17 files changed, 152 insertions(+), 94 deletions(-)
> >  rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
> >
>
> I would like to test it, but something worries me. Why there is 0/5 in
> this subject and only 4 patches?

I removed a patch from the series but copied/pasted the cover letter
subject without noticing the /5 instead of /4

>
> Could you tell me your base branch that I can apply this, please?

It applies on top of tip/sched/core + [1]
and you can find it here:
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure

>
> Regards,
> Lukasz
On 12/14/23 08:29, Vincent Guittot wrote:
> On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>> Hi Vincent,
>>
>> I've been waiting for this feature, thanks!
>>
>>
>> On 12/12/23 14:27, Vincent Guittot wrote:
>>> Following the consolidation and cleanup of CPU capacity in [1], this serie
>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>> for dozens of ms or more and not only cpufreq cooling device or HW
>>> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
>>> - one from cpufreq and freq_qos
>>> - one from HW high freq mitigiation.
>>>
>>> The next step will be to add a dedicated interface for long standing
>>> capping of the CPU capacity (i.e. for seconds or more) like the
>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>> account by this serie but as a temporary pressure which is not always the
>>> best choice when we know that it will happen for seconds or more.
>>>
>>> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>>>
>>> Vincent Guittot (4):
>>>    cpufreq: Add a cpufreq pressure feedback for the scheduler
>>>    sched: Take cpufreq feedback into account
>>>    thermal/cpufreq: Remove arch_update_thermal_pressure()
>>>    sched: Rename arch_update_thermal_pressure into
>>>      arch_update_hw_pressure
>>>
>>>   arch/arm/include/asm/topology.h               |  6 +--
>>>   arch/arm64/include/asm/topology.h             |  6 +--
>>>   drivers/base/arch_topology.c                  | 26 ++++-----
>>>   drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>>>   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>>>   drivers/thermal/cpufreq_cooling.c             |  3 --
>>>   include/linux/arch_topology.h                 |  8 +--
>>>   include/linux/cpufreq.h                       | 10 ++++
>>>   include/linux/sched/topology.h                |  8 +--
>>>   .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>>>   include/trace/events/sched.h                  |  2 +-
>>>   init/Kconfig                                  | 12 ++---
>>>   kernel/sched/core.c                           |  8 +--
>>>   kernel/sched/fair.c                           | 53 ++++++++++---------
>>>   kernel/sched/pelt.c                           | 18 +++----
>>>   kernel/sched/pelt.h                           | 16 +++---
>>>   kernel/sched/sched.h                          |  4 +-
>>>   17 files changed, 152 insertions(+), 94 deletions(-)
>>>   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>>>
>>
>> I would like to test it, but something worries me. Why there is 0/5 in
>> this subject and only 4 patches?
>
> I removed a patch from the series but copied/pasted the cover letter
> subject without noticing the /5 instead of /4

OK

>
>>
>> Could you tell me your base branch that I can apply this, please?
>
> It applies on top of tip/sched/core + [1]
> and you can find it here:
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure

Thanks for the info and handy link.
On 12/14/23 08:32, Lukasz Luba wrote:
>
>
> On 12/14/23 08:29, Vincent Guittot wrote:
>> On Thu, 14 Dec 2023 at 09:21, Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> I've been waiting for this feature, thanks!
>>>
>>>
>>> On 12/12/23 14:27, Vincent Guittot wrote:
>>>> Following the consolidation and cleanup of CPU capacity in [1], this serie
>>>> reworks how the scheduler gets the pressures on CPUs. We need to take into
>>>> account all pressures applied by cpufreq on the compute capacity of a CPU
>>>> for dozens of ms or more and not only cpufreq cooling device or HW
>>>> mitigiations. we split the pressure applied on CPU's capacity in 2 parts:
>>>> - one from cpufreq and freq_qos
>>>> - one from HW high freq mitigiation.
>>>>
>>>> The next step will be to add a dedicated interface for long standing
>>>> capping of the CPU capacity (i.e. for seconds or more) like the
>>>> scaling_max_freq of cpufreq sysfs. The latter is already taken into
>>>> account by this serie but as a temporary pressure which is not always the
>>>> best choice when we know that it will happen for seconds or more.
>>>>
>>>> [1] https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guittot@linaro.org/
>>>>
>>>> Vincent Guittot (4):
>>>>    cpufreq: Add a cpufreq pressure feedback for the scheduler
>>>>    sched: Take cpufreq feedback into account
>>>>    thermal/cpufreq: Remove arch_update_thermal_pressure()
>>>>    sched: Rename arch_update_thermal_pressure into
>>>>      arch_update_hw_pressure
>>>>
>>>>   arch/arm/include/asm/topology.h               |  6 +--
>>>>   arch/arm64/include/asm/topology.h             |  6 +--
>>>>   drivers/base/arch_topology.c                  | 26 ++++-----
>>>>   drivers/cpufreq/cpufreq.c                     | 48 +++++++++++++++++
>>>>   drivers/cpufreq/qcom-cpufreq-hw.c             |  4 +-
>>>>   drivers/thermal/cpufreq_cooling.c             |  3 --
>>>>   include/linux/arch_topology.h                 |  8 +--
>>>>   include/linux/cpufreq.h                       | 10 ++++
>>>>   include/linux/sched/topology.h                |  8 +--
>>>>   .../{thermal_pressure.h => hw_pressure.h}     | 14 ++---
>>>>   include/trace/events/sched.h                  |  2 +-
>>>>   init/Kconfig                                  | 12 ++---
>>>>   kernel/sched/core.c                           |  8 +--
>>>>   kernel/sched/fair.c                           | 53 ++++++++++---------
>>>>   kernel/sched/pelt.c                           | 18 +++----
>>>>   kernel/sched/pelt.h                           | 16 +++---
>>>>   kernel/sched/sched.h                          |  4 +-
>>>>   17 files changed, 152 insertions(+), 94 deletions(-)
>>>>   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} (55%)
>>>>
>>>
>>> I would like to test it, but something worries me. Why there is 0/5 in
>>> this subject and only 4 patches?
>>
>> I removed a patch from the series but copied/pasted the cover letter
>> subject without noticing the /5 instead of /4
>
> OK
>
>>
>>>
>>> Could you tell me your base branch that I can apply this, please?
>>
>> It applies on top of tip/sched/core + [1]
>> and you can find it here:
>> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure
>
> Thanks for the info and handy link.
>

I've tested your patches with: DTPM/PowerCap + thermal gov + cpufreq
sysfs scaling_max_freq. It works fine in all my cases (couldn't cause
any issues).

If you'd like to test DTPM you will need 2 fixes pending in Rafael's tree.

So, I'm looking forward to your v2 to continue reviewing it.