mbox series

[v2,0/3] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems

Message ID 4941491.31r3eYUQgx@rjwysocki.net
Headers show
Series x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems | expand

Message

Rafael J. Wysocki Aug. 12, 2024, 12:35 p.m. UTC
Hi Everyone,

This is an update of

https://lore.kernel.org/linux-pm/4908113.GXAFRqVoOG@rjwysocki.net/

which addresses a few issues that have transpired since it was posted.

The most visible difference is that hybrid CPU capacity scaling is now
allowed to be used when frequency-invariance is not functional which
simplifies things and gets rid of a couple of corner cases.

The changes in this series are available from this git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=intel_pstate-testing

(with an extra debug commit on top).  Ricardo has tested them and reports
that they cause task utilization to be mostly scale-invariant, as expected.

The original cover letter quoted below still applies:

The purpose of this series is to provide the scheduler with asymmetric CPU
capacity information on x86 hybrid systems based on Intel hardware.

The asymmetric CPU capacity information is important on hybrid systems as it
allows utilization to be computed for tasks in a consistent way across all
CPUs in the system, regardless of their capacity.  This, in turn, allows
the schedutil cpufreq governor to set CPU performance levels consistently
in the cases when tasks migrate between CPUs of different capacities.  It
should also help to improve task placement and load balancing decisions on
hybrid systems and it is key for anything along the lines of EAS.

The information in question comes from the MSR_HWP_CAPABILITIES register and
is provided to the scheduler by the intel_pstate driver, as per the changelog
of patch [3/3].  Patch [2/3] introduces the arch infrastructure needed for
that (in the form of a per-CPU capacity variable) and patch [1/3] is a
preliminary code adjustment.

This is based on an RFC posted previously

https://lore.kernel.org/linux-pm/7663799.EvYhyI6sBW@kreacher/

but differs from it quite a bit (except for the first patch).  The most
significant difference is based on the observation that frequency-
invariance needs to adjusted to the capacity scaling on hybrid systems
for the complete scale-invariance to work as expected.

Thank you!

Comments

Chen Yu Aug. 13, 2024, 1:27 a.m. UTC | #1
Hi Rafael,

On 2024-08-12 at 14:42:26 +0200, Rafael J. Wysocki wrote:
> +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> +			   unsigned long max_freq, unsigned long base_freq)
> +{
> +	if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> +			   div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> +		WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> +			   div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
>

Would the capacity update be frequently invoked? Just wonder if we could
first READ_ONCE() to check if the value is already the value we want to
change to, to avoid one write and less cache snoop overhead (in case other
CPU reads this CPU's capacity)

thanks,
Chenyu
Rafael J. Wysocki Aug. 13, 2024, 11:07 a.m. UTC | #2
Hi,

On Tue, Aug 13, 2024 at 3:27 AM Chen Yu <yu.c.chen@intel.com> wrote:
>
> Hi Rafael,
>
> On 2024-08-12 at 14:42:26 +0200, Rafael J. Wysocki wrote:
> > +void arch_set_cpu_capacity(int cpu, unsigned long cap, unsigned long base_cap,
> > +                        unsigned long max_freq, unsigned long base_freq)
> > +{
> > +     if (static_branch_likely(&arch_hybrid_cap_scale_key)) {
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->capacity,
> > +                        div_u64(cap << SCHED_CAPACITY_SHIFT, base_cap));
> > +             WRITE_ONCE(per_cpu_ptr(arch_cpu_scale, cpu)->freq_ratio,
> > +                        div_u64(max_freq << SCHED_CAPACITY_SHIFT, base_freq));
> >
>
> Would the capacity update be frequently invoked?

I hope not.

> Just wonder if we could
> first READ_ONCE() to check if the value is already the value we want to
> change to, to avoid one write and less cache snoop overhead (in case other
> CPU reads this CPU's capacity)

Well, I'd rather not complicate the code beyond what is necessary
unless this is demonstrated to make a measurable difference.

Besides, AFAICS the only case in the caller in which the same values
can be passed to arch_set_cpu_capacity() is the CPU offline one and
that should not happen too often.