[v3,0/2] x86 / intel_pstate: Set asymmetric CPU capacity on hybrid systems

Message ID 3310447.aeNJFYEL58@rjwysocki.net

Message

Rafael J. Wysocki Aug. 28, 2024, 11:45 a.m. UTC
Hi Everyone,

This is an update of

https://lore.kernel.org/linux-pm/4941491.31r3eYUQgx@rjwysocki.net/

which was an update of

https://lore.kernel.org/linux-pm/4908113.GXAFRqVoOG@rjwysocki.net/

It addresses Ricardo's review comments and fixes an issue with intel_pstate
operation mode changes that would cause it to attempt to enable hybrid CPU
capacity scaling after it has already been enabled during initialization.

The most visible difference with respect to the previous version is that
patch [1/3] has been dropped.  It is no longer needed, based on the
observation that sched_clear_itmt_support() would cause sched domains
to be rebuilt.

Other than this, there are cosmetic differences in patch [1/2] (previously [2/3]),
and the new code in intel_pstate_register_driver() in patch [2/2] (previously [3/3])
has been squashed into hybrid_init_cpu_scaling(), which now checks whether or
not to enable hybrid CPU capacity scaling (as it may have been enabled already).

This series is available from the following git branch:

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/log/?h=intel_pstate-testing

(with an extra debug commit on top).

The original cover letter quoted below still applies:

The purpose of this series is to provide the scheduler with asymmetric CPU
capacity information on x86 hybrid systems based on Intel hardware.

The asymmetric CPU capacity information is important on hybrid systems as it
allows utilization to be computed for tasks in a consistent way across all
CPUs in the system, regardless of their capacity.  This, in turn, allows
the schedutil cpufreq governor to set CPU performance levels consistently
in the cases when tasks migrate between CPUs of different capacities.  It
should also help to improve task placement and load balancing decisions on
hybrid systems and it is key for anything along the lines of EAS.

The information in question comes from the MSR_HWP_CAPABILITIES register and
is provided to the scheduler by the intel_pstate driver, as per the changelog
of patch [3/3].  Patch [2/3] introduces the arch infrastructure needed for
that (in the form of a per-CPU capacity variable) and patch [1/3] is a
preliminary code adjustment.
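
For illustration only, here is a minimal user-space sketch of how a per-CPU
capacity could be derived from the highest-performance field of the HWP
capabilities value.  It is not taken from the patches: the helper names
hwp_highest_perf() and perf_to_capacity() are hypothetical, and it assumes
that the capacity is simply the CPU's highest performance level scaled to
1024 against the largest value found in the system.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Highest performance is reported in bits 7:0 of the HWP capabilities value. */
static inline unsigned int hwp_highest_perf(uint64_t hwp_cap)
{
	return (unsigned int)(hwp_cap & 0xff);
}

/*
 * Hypothetical helper: scale one CPU's highest performance level against
 * the largest level in the system, so that the biggest CPU gets 1024.
 */
static unsigned long perf_to_capacity(unsigned int perf, unsigned int max_perf)
{
	assert(max_perf > 0 && perf <= max_perf);
	return ((unsigned long)perf << SCHED_CAPACITY_SHIFT) / max_perf;
}
```

With made-up performance levels of 64 for the biggest CPU and 40 for a
smaller one, perf_to_capacity() yields 1024 and 640, respectively.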

This is based on an RFC posted previously

https://lore.kernel.org/linux-pm/7663799.EvYhyI6sBW@kreacher/

but differs from it quite a bit (except for the first patch).  The most
significant difference is based on the observation that frequency-invariance
needs to be adjusted to the capacity scaling on hybrid systems
for the complete scale-invariance to work as expected.

Thank you!

Comments

Ricardo Neri Sept. 4, 2024, 7:25 a.m. UTC | #1
Tested-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> # scale invariance

You can look at the scaling invariance these patches achieve here

https://pasteboard.co/dhBAUjfr36Tx.png

I tested these patches on a Meteor Lake system. It has CPUs with three
levels of capacity (Pcore, Ecore, and Lcore).

The "Requested work" plot shows a sawtooth pattern of the amount of work
requested as a percentage of the maximum amount of work that can be
obtained from the biggest CPU running at its maximum frequency. The work
consists of calling getcpu() continuously within a time window of constant
duration, with the requested percentage varying from window to window.

The "Achieved work" plot shows that the Ecores and Lcores cannot complete
as much work as the Pcores even when fully busy (see the "Busy %" plot).
Also, bigger CPUs have more idle time.

The "Scale freq capacity" plot shows that the current frequency of each CPU
is now scaled to 1024 by that CPU's own max frequency. It no longer
uses the single arch_max_freq_ratio value. Capacity now scales correctly:
when running at its maximum frequency, the current capacity (see
"Current capacity" plot and refer to cap_scale()) now matches the value
from arch_scale_cpu_capacity() (see "CPU capacity" plot).
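
As a rough sketch of the relationship described above (not the kernel code
itself): the cap_scale() macro below has the same shape as the scheduler's,
while freq_scale() and current_capacity() are hypothetical helpers that
assume the frequency factor is computed against each CPU's own maximum
frequency.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Same shape as the scheduler's cap_scale(): scale v by s / 1024. */
#define cap_scale(v, s)	(((v) * (s)) >> SCHED_CAPACITY_SHIFT)

/* Hypothetical helper: 1024 when the CPU runs at its own max frequency. */
static unsigned long long freq_scale(unsigned long long cur_khz,
				     unsigned long long max_khz)
{
	assert(max_khz > 0 && cur_khz <= max_khz);
	return (cur_khz << SCHED_CAPACITY_SHIFT) / max_khz;
}

/* Hypothetical helper: capacity available at the current frequency. */
static unsigned long long current_capacity(unsigned long long cpu_capacity,
					   unsigned long long cur_khz,
					   unsigned long long max_khz)
{
	return cap_scale(cpu_capacity, freq_scale(cur_khz, max_khz));
}
```

With a made-up CPU capacity of 640 and a maximum frequency of 3.8 GHz,
current_capacity() returns 640 at 3.8 GHz and 320 at 1.9 GHz, so the value
at the maximum frequency matches the CPU capacity.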

The "Task utilization" plot shows that task->util_avg is now invariant
across CPUs.
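
To illustrate what "invariant" means here, below is a deliberately
simplified model (it ignores PELT decay entirely). The helper
invariant_util() is hypothetical: it assumes utilization is the busy
fraction of a CPU, expressed on a 0..1024 scale, scaled first by the
frequency factor and then by the CPU capacity.

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/* Same shape as the scheduler's cap_scale(): scale v by s / 1024. */
#define cap_scale(v, s)	(((v) * (s)) >> SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical, simplified model: busy is the fraction of time the task
 * runs (0..1024), freq_scale is the frequency factor (0..1024) and
 * cpu_capacity is the capacity of the CPU the task runs on (0..1024).
 */
static unsigned long invariant_util(unsigned long busy,
				    unsigned long freq_scale,
				    unsigned long cpu_capacity)
{
	assert(busy <= 1024 && freq_scale <= 1024 && cpu_capacity <= 1024);
	return cap_scale(cap_scale(busy, freq_scale), cpu_capacity);
}
```

In this model, the same amount of work gives the same result everywhere:
25% busy on a CPU of capacity 1024 at full frequency, 50% busy on a CPU of
capacity 512 at full frequency, and 100% busy on that smaller CPU at half
frequency all yield a utilization of 256. The capacities are made-up values.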
Rafael J. Wysocki Sept. 4, 2024, 11:30 a.m. UTC | #2
Thank you!