
[RESEND,V2,0/9] Implement AMD Pstate EPP Driver

Message ID 20221010162248.348141-1-Perry.Yuan@amd.com

Message

Yuan, Perry Oct. 10, 2022, 4:22 p.m. UTC
Hi all,

This patchset implements a new AMD CPU frequency driver instance,
"amd-pstate-epp", for better performance and power control.
CPPC has a parameter called energy performance preference (EPP).
The CCLK DPM controller uses the EPP to drive the frequency at which a core
operates during short periods of activity.
EPP values will be used for different OS profiles (balanced, performance, power savings).

AMD Energy Performance Preference (EPP) provides a hint to the hardware
on whether software wants to bias toward performance (0x0) or energy efficiency (0xff).
The low-level power firmware calculates the runtime frequency according to the EPP
value, so the EPP hint affects how quickly the CPU core frequency responds.

We use the RAPL interface with the "perf" tool to get the package energy data.
Performance Per Watt (PPW) Calculation:

The PPW calculation follows the paper below:
https://software.intel.com/content/dam/develop/external/us/en/documents/performance-per-what-paper.pdf

The formula below is taken from that paper to measure the PPW:

(F / t) / P = F * t / (t * E) = F / E,

"F" is the number of frames per second.
"P" is power measured in watts.
"E" is energy measured in joules.

Gitsource Benchmark Data on ROME Server CPU
+------------------------------+------------------------------+------------+------------------+
| Kernel Module                | PPW (1 / s * J)              | Energy (J) | PPW Improv. (%)  |
+==============================+==============================+============+==================+
| acpi-cpufreq:schedutil       | 5.85658E-05                  | 17074.8    | base             |
+------------------------------+------------------------------+------------+------------------+
| acpi-cpufreq:ondemand        | 5.03079E-05                  | 19877.6    | -14.10%          |
+------------------------------+------------------------------+------------+------------------+
| acpi-cpufreq:performance     | 5.88132E-05                  | 17003      | 0.42%            |
+------------------------------+------------------------------+------------+------------------+
| amd-pstate:ondemand          | 4.60295E-05                  | 21725.2    | -21.41%          |
+------------------------------+------------------------------+------------+------------------+
| amd-pstate:schedutil         | 4.70026E-05                  | 21275.4    | -19.7%           |
+------------------------------+------------------------------+------------+------------------+
| amd-pstate:performance       | 5.80094E-05                  | 17238.6    | -0.95%           |
+------------------------------+------------------------------+------------+------------------+
| EPP:performance              | 5.8292E-05                   | 17155      | -0.47%           |
+------------------------------+------------------------------+------------+------------------+
| EPP: balance performance:    | 6.71709E-05                  | 14887.4    | 14.69%           |
+------------------------------+------------------------------+------------+------------------+
| EPP:power                    | 6.66951E-05                  | 4993.6     | 13.88%           |
+------------------------------+------------------------------+------------+------------------+

Tbench Benchmark Data on ROME Server CPU
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| Kernel Module                               | PPW MB / (s * J)  | Throughput (MB/s) | Energy (J)  | PPW Improvement (%) |
+=============================================+===================+===================+=============+=====================+
| acpi_cpufreq: schedutil                     | 46.39             | 17191             | 37057.3     | base                |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| acpi_cpufreq: ondemand                      | 51.51             | 19269.5           | 37406.5     | 11.04 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| acpi_cpufreq: performance                   | 45.96             | 17063.7           | 37123.7     | -0.74 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| EPP:powersave: performance(0)               | 54.46             | 20263.1           | 37205       | 17.87 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| EPP:powersave: balance performance          | 55.03             | 20481.9           | 37221.5     | 19.14 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| EPP:powersave: balance_power                | 54.43             | 20245.9           | 37194.2     | 17.77 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| EPP:powersave: power(255)                   | 54.26             | 20181.7           | 37197.4     | 17.40 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| amd-pstate: schedutil                       | 48.22             | 17844.9           | 37006.6     | 3.80 %              |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| amd-pstate: ondemand                        | 61.30             | 22988             | 37503.4     | 33.72 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+
| amd-pstate: performance                     | 54.52             | 20252.6           | 37147.8     | 17.81 %             |
+---------------------------------------------+-------------------+-------------------+-------------+---------------------+

Changes from v1:
 * rebased to v6.0
 * addressed feedback from Mario on the suspend/resume patch
 * addressed feedback from Nathan on EPP support for MSR-type processors
 * fixed some typos and code style indentation problems
 * updated commit comments for patch 4/7
 * changed the `epp_enabled` module param name to `epp`
 * set the default epp mode to false
 * added testing for the x86_energy_perf_policy utility patchset (will
   send that utility patchset in another thread)

Perry Yuan (9):
  ACPI: CPPC: Add AMD pstate energy performance preference cppc control
  cpufreq: amd_pstate: add module parameter to load amd pstate EPP
    driver
  cpufreq: cpufreq: export cpufreq cpu release and acquire
  x86/msr: Add the MSR definition for AMD CPPC boost state
  Documentation: amd-pstate: add EPP profiles introduction
  cpufreq: amd_pstate: add AMD pstate EPP support for shared memory type
    processor
  cpufreq: amd_pstate: add AMD Pstate EPP support for the MSR based
    processors
  cpufreq: amd_pstate: implement amd pstate cpu online and offline
    callback
  cpufreq: amd-pstate: implement suspend and resume callbacks

 Documentation/admin-guide/pm/amd-pstate.rst |  19 +
 arch/x86/include/asm/msr-index.h            |   7 +
 drivers/acpi/cppc_acpi.c                    | 128 ++-
 drivers/cpufreq/amd-pstate.c                | 949 +++++++++++++++++++-
 drivers/cpufreq/cpufreq.c                   |   2 +
 include/acpi/cppc_acpi.h                    |  17 +
 6 files changed, 1115 insertions(+), 7 deletions(-)

Comments

Russell Haley Oct. 12, 2022, 12:06 p.m. UTC | #1
Although I am very much in favor of having some kernel interface to the
EPP MSR for AMD CPUs just as for Intel, I have some reservations about
the units in the tables, and whether performance per watt, measured in
this way by these benchmarks, is an appropriate figure of merit for
cpufreq governors.

On 10/10/22 11:22, Perry Yuan wrote:

> The PPW calculation is referred by below paper:
> https://software.intel.com/content/dam/develop/external/us/en/documents/performance-per-what-paper.pdf
> 
> Below formula is referred from below spec to measure the PPW:
> 
> (F / t) / P = F * t / (t * E) = F / E,
> 
> "F" is the number of frames per second.
> "P" is power measured in watts.
> "E" is energy measured in joules.

In the whitepaper, "F" is not the number of frames per second.  It is
the number of frames.  The number of frames per second is "F/t", where
"t" is the number of seconds. Following the dimensional analysis:

    Frames
   --------- / Watts
    seconds

    Frames      Joules
 = --------- / ---------
    seconds     seconds

    Frames      seconds
 = --------- * ---------
    seconds     Joules

    Frames
 = ---------
    Joules

All the seconds cancel, and performance per watt reduces to completed
work divided by energy, as you would expect. However, in the benchmark
tables, seconds always appear in the PPW unit.
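
The cancellation can also be checked numerically. The sketch below uses made-up values for the frame count, runtime, and energy (none of them come from the benchmark tables), and the function names are only illustrative. It computes performance per watt the long way and compares it with work divided by energy.

```python
import math


def perf_per_watt(frames: float, seconds: float, joules: float) -> float:
    """Performance per watt the long way: (frames / s) / (J / s)."""
    frames_per_second = frames / seconds
    watts = joules / seconds
    return frames_per_second / watts


def work_per_joule(frames: float, joules: float) -> float:
    """The same figure after the seconds cancel: frames / J."""
    return frames / joules


# Hypothetical run: 6000 frames rendered in 100 s using 2500 J.
long_way = perf_per_watt(6000, 100.0, 2500.0)
short_way = work_per_joule(6000, 2500.0)
print(long_way, short_way)

# Stretching the same work and energy over twice the time changes nothing.
print(perf_per_watt(6000, 200.0, 2500.0))
```

The runtime has no effect on the result, which is why a time unit has no place in the PPW column.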

Furthermore...

> Gitsource Benchmark Data on ROME Server CPU
> +------------------------------+------------------------------+------------+------------------+
> | Kernel Module                | PPW (1 / s * J)              |Energy(J) | PPW Improvement (%)|
> +==============================+==============================+============+==================+
> | acpi-cpufreq:schedutil       | 5.85658E-05                  | 17074.8    | base             |
> +------------------------------+------------------------------+------------+------------------+
> | acpi-cpufreq:ondemand        | 5.03079E-05                  | 19877.6    | -14.10%          |
> +------------------------------+------------------------------+------------+------------------+
> | acpi-cpufreq:performance     | 5.88132E-05                  | 17003      | 0.42%            |
> +------------------------------+------------------------------+------------+------------------+
> | amd-pstate:ondemand          | 4.60295E-05                  | 21725.2    | -21.41%          |
> +------------------------------+------------------------------+------------+------------------+
> | amd-pstate:schedutil         | 4.70026E-05                  | 21275.4    | -19.7%           |
> +------------------------------+------------------------------+------------+------------------+
> | amd-pstate:performance       | 5.80094E-05                  | 17238.6    | -0.95%           |
> +------------------------------+------------------------------+------------+------------------+
> | EPP:performance              | 5.8292E-05                   | 17155      | -0.47%           |
> +------------------------------+------------------------------+------------+------------------+
> | EPP: balance performance:    | 6.71709E-05                  | 14887.4    | 14.69%           |
> +------------------------------+------------------------------+------------+------------------+
> | EPP:power                    | 6.66951E-05                  | 4993.6     | 13.88%           |
> +------------------------------+------------------------------+------------+------------------+

The numbers in the PPW column are equal to 1/Energy, so the math works
out even if the units are mislabeled. But neither the actual performance
nor anything that can be used to derive it appears in the table.
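
A quick way to check that relationship is sketched below, using four rows copied from the quoted Gitsource table. The helper names are illustrative, and the tolerance only allows for the rounding of the printed values.

```python
import math

# (driver:governor, PPW as printed, energy in joules), copied from the table.
rows = [
    ("acpi-cpufreq:schedutil", 5.85658e-05, 17074.8),
    ("acpi-cpufreq:ondemand", 5.03079e-05, 19877.6),
    ("acpi-cpufreq:performance", 5.88132e-05, 17003.0),
    ("EPP: balance performance", 6.71709e-05, 14887.4),
]


def matches_inverse_energy(ppw: float, energy_j: float, rel_tol: float = 1e-4) -> bool:
    """True if the printed PPW equals 1 / energy within rounding."""
    return math.isclose(ppw, 1.0 / energy_j, rel_tol=rel_tol)


def improvement_pct(ppw: float, base_ppw: float) -> float:
    """PPW change relative to the baseline row, in percent."""
    return (ppw / base_ppw - 1.0) * 100.0


base_ppw = rows[0][1]
for name, ppw, energy_j in rows:
    print(f"{name:26s} 1/E={1.0 / energy_j:.5e} "
          f"match={matches_inverse_energy(ppw, energy_j)} "
          f"delta={improvement_pct(ppw, base_ppw):+.2f}%")
```

Since the PPW column carries no information beyond the energy column, the "improvement" percentages are just ratios of energy used.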

As far as I can tell, this benchmark, which compiles git from source,
should be entirely CPU bound.  That is, it is occupying at least one CPU
core for the entire runtime. [1] For such tasks, to a first order
approximation you can run the CPU at 1/2 frequency and finish the task
with 1/4 the energy in 2x the time. Since the time units vanish,
"performance per watt" can look good when performance and watts are both
low. So you very much need to have performance in the table.
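
The first-order approximation can be written out as a small model. This is only a sketch under stated assumptions: supply voltage scales linearly with frequency, so dynamic power goes as f^3; runtime of a CPU-bound task goes as 1/f; static power is ignored; and all quantities are normalized to the full-speed run.

```python
def relative_runtime(freq_ratio: float) -> float:
    """Runtime of a CPU-bound task scales as 1/f."""
    return 1.0 / freq_ratio


def relative_power(freq_ratio: float) -> float:
    """Dynamic power ~ C * V^2 * f; with V proportional to f this is f^3."""
    return freq_ratio ** 3


def relative_energy(freq_ratio: float) -> float:
    """Energy is power times runtime, so it scales as f^2."""
    return relative_power(freq_ratio) * relative_runtime(freq_ratio)


def relative_ppw(freq_ratio: float) -> float:
    """Work per joule for a fixed amount of work."""
    return 1.0 / relative_energy(freq_ratio)


for f in (1.0, 0.5):
    print(f"f={f}: time x{relative_runtime(f)}, "
          f"energy x{relative_energy(f)}, PPW x{relative_ppw(f)}")
```

Under these assumptions, halving the frequency quadruples the "performance per watt" while doubling the runtime, which is exactly the effect the table cannot show without a performance column.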

I can think of a couple ways to handle this problem. The empirical
approach would be to use the userspace governor and scaling_setspeed to
iteratively find a fixed frequency with similar benchmark performance to
each driver/governor, and then report the energy usage. The "benchmark"
should probably be a sum of multiple runtime benchmarks, or a harmonic
mean of multiple rate benchmarks, because the advantage a governor is
supposed to have is the ability to adapt to different workloads and/or
different phases of computation.
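
A minimal sketch of the empirical approach follows. It assumes the standard cpufreq sysfs layout (`scaling_governor` and `scaling_setspeed` under each CPU's `cpufreq` directory), that the loaded driver offers the userspace governor, and that the caller has root privileges. The function names and the `sysfs_root` parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path
from statistics import harmonic_mean


def pin_frequency(cpu: int, khz: int,
                  sysfs_root: str = "/sys/devices/system/cpu") -> None:
    """Select the userspace governor for one CPU and request a fixed frequency (kHz)."""
    cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    (cpufreq / "scaling_governor").write_text("userspace\n")
    (cpufreq / "scaling_setspeed").write_text(f"{khz}\n")


def combined_rate_score(rates: list[float]) -> float:
    """Combine several rate benchmarks (e.g. MB/s) with a harmonic mean."""
    return harmonic_mean(rates)


# Example sweep (commented out because it writes to sysfs and needs root):
# for khz in (1_000_000, 2_000_000, 3_000_000):
#     pin_frequency(0, khz)
#     ... run the benchmarks, record throughput and energy ...

# Hypothetical rates from two benchmarks at one fixed frequency.
print(combined_rate_score([40.0, 60.0]))
```

The harmonic mean weights each rate benchmark by the time it would take to do equal work, so one very fast result cannot hide a slow one.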

Alternately, one might use perf^3/watt as the figure of merit. That's
an ED2P metric [2], and you'd be comparing governors on their ability to
make the CPU look like a "better" CPU by identifying tasks that waste a
lot of available cycles stalled on things outside the CPU core clock
domain (DRAM, I/O) and running them at lower frequency and higher
instructions per available cycle.

I've heard about perf^2/watt being used, but I don't know what, if any,
theoretical basis it has.
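
To illustrate how these figures of merit behave, here is a sketch that reuses the same first-order assumptions as before (performance of a CPU-bound task proportional to f, dynamic power proportional to f^3, static power ignored). The function name and the normalized units are illustrative only.

```python
def figures_of_merit(freq_ratio: float) -> dict:
    """Relative perf^n / watt for a CPU-bound task at a given frequency ratio."""
    perf = freq_ratio          # throughput scales with frequency
    watts = freq_ratio ** 3    # dynamic power, assuming V proportional to f
    return {
        "perf/W": perf / watts,
        "perf^2/W": perf ** 2 / watts,
        "perf^3/W": perf ** 3 / watts,
    }


for f in (0.5, 0.75, 1.0):
    print(f, figures_of_merit(f))
```

In this model only perf^3/watt stays constant as the frequency changes, so a governor cannot improve it just by slowing down a CPU-bound task. Any gain has to come from work that does not scale with the core clock.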

On another note, if PPW of CPU-bound tasks is maximized based on energy
counted with the CPU package energy MSR only (assuming it's even
calibrated), without including DRAM and baseline consumers like fans,
HDDs, southbridge, displays, NICs, radios, etc., then the PPW of the
system as a whole is certain to be worse. This is the idea behind
race-to-idle. On the other hand, CPU package power can be the correct
measure for deadline-type workloads where finishing the task sooner
doesn't allow powering down the machine. That's stuff like
line-speed-limited network servers and scrolling in web browsers. In
that case, the only thing that goes to sleep when the task is done is
the CPU, so the only energy that counts is the energy burnt in the CPU.
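
The two cases can be compared with a toy model. Everything in it is an assumption made for illustration: package power goes as f^3, runtime as 1/f, idle CPU power is zero, and the rest of the system draws a constant baseline (here set equal to the CPU's full-speed power).

```python
def package_energy(freq_ratio: float) -> float:
    """CPU package energy for one unit of work: power ~ f^3, runtime ~ 1/f."""
    return freq_ratio ** 3 * (1.0 / freq_ratio)


def run_then_power_off(freq_ratio: float, baseline_power: float) -> float:
    """System energy when the whole machine can power down after the task."""
    runtime = 1.0 / freq_ratio
    return package_energy(freq_ratio) + baseline_power * runtime


def fixed_window(freq_ratio: float, baseline_power: float, window: float) -> float:
    """System energy when the rest of the machine stays on for a fixed window."""
    return package_energy(freq_ratio) + baseline_power * window


BASELINE = 1.0  # hypothetical baseline power, relative to full-speed CPU power
WINDOW = 2.0    # hypothetical deadline, long enough for the half-speed run

for f in (1.0, 0.5):
    print(f"f={f}: package={package_energy(f)}, "
          f"race-to-idle={run_then_power_off(f, BASELINE)}, "
          f"deadline={fixed_window(f, BASELINE, WINDOW)}")
```

With these numbers, half speed wins on package energy and in the deadline case, but loses once the baseline consumers are charged for the longer runtime.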

> Tbench Benchmark Data on ROME Server CPU
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | Kernel Module                               | PPW MB / (s * J)  |Throughput(MB/s)| Energy (J)|PPW Improvement(%)|
> +=============================================+===================+==============+=============+==================+
> | acpi_cpufreq: schedutil                     | 46.39             | 17191        | 37057.3     | base             |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | acpi_cpufreq: ondemand                      | 51.51             | 19269.5      | 37406.5     | 11.04 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | acpi_cpufreq: performance                   | 45.96             | 17063.7      | 37123.7     | -0.74 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | EPP:powersave: performance(0)               | 54.46             | 20263.1      | 37205       | 17.87 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | EPP:powersave: balance performance          | 55.03             | 20481.9      | 37221.5     | 19.14 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | EPP:powersave: balance_power                | 54.43             | 20245.9      | 37194.2     | 17.77 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | EPP:powersave: power(255)                   | 54.26             | 20181.7      | 37197.4     | 17.40 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | amd-pstate: schedutil                       | 48.22             | 17844.9      | 37006.6     | 3.80 %           |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | amd-pstate: ondemand                        | 61.30             | 22988        | 37503.4     | 33.72 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+
> | amd-pstate: performance                     | 54.52             | 20252.6      | 37147.8     | 17.81 %          |
> +---------------------------------------------+-------------------+--------------+-------------+------------------+

For this one it seems like PPW is calculated as Throughput/Energy * 100?
The benchmark looks a lot like the result of running the script at [3].
It looks like the script would multiply by 99 though?  And also the
bogus time units do not appear in the script, so if that's a newer
version I'm glad it's fixed.
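
The Throughput/Energy * 100 guess can be checked against the quoted numbers. The sketch below copies four rows from the Tbench table; the function name is illustrative, and the 0.005 tolerance only allows for rounding to two decimals.

```python
# (driver: governor, PPW as printed, throughput in MB/s, energy in joules)
rows = [
    ("acpi_cpufreq: schedutil", 46.39, 17191.0, 37057.3),
    ("acpi_cpufreq: ondemand", 51.51, 19269.5, 37406.5),
    ("EPP:powersave: balance performance", 55.03, 20481.9, 37221.5),
    ("amd-pstate: ondemand", 61.30, 22988.0, 37503.4),
]


def ppw_from_row(throughput_mb_s: float, energy_j: float) -> float:
    """Reproduce the printed PPW as throughput / energy * 100."""
    return throughput_mb_s / energy_j * 100.0


for name, printed, throughput, energy in rows:
    computed = ppw_from_row(throughput, energy)
    print(f"{name:36s} printed={printed:.2f} computed={computed:.4f} "
          f"ok={abs(computed - printed) <= 0.005}")
```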

But I ran tbench on my own machine, single-thread to reduce the impact
of background activity, and got this:

+---------------+---------------------+--------+
| CPU Frequency | Throughput ( MB/s ) | Perf % |
+===============+=====================+========+
| 1 GHz         |  85.78              | Base   |
| 2 GHz         | 174.35              | 203 %  |
| 3 GHz         | 264.04              | 308 %  |
| 4 GHz         | 352.86              | 411 %  |
+---------------+---------------------+--------+

Which implies tbench is 100% clock-frequency-bound [1, 4], and so this
benchmark is equivalent to measuring the average clock frequency over
the runtime. I think that means the most interesting number in your
table is the throughput.
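
For reference, the Perf % column and the scaling relative to clock frequency can be derived from the measurements in the table above as follows (a sketch; the names are illustrative):

```python
# Frequency in GHz -> tbench throughput in MB/s, from the table above.
measurements = {1: 85.78, 2: 174.35, 3: 264.04, 4: 352.86}
BASE_GHZ = 1


def perf_pct(ghz: int) -> float:
    """Throughput relative to the 1 GHz run, in percent."""
    return measurements[ghz] / measurements[BASE_GHZ] * 100.0


def scaling_efficiency(ghz: int) -> float:
    """Throughput ratio divided by frequency ratio; 1.0 means perfect 1:1 scaling."""
    return (measurements[ghz] / measurements[BASE_GHZ]) / (ghz / BASE_GHZ)


for ghz in (2, 3, 4):
    print(f"{ghz} GHz: {perf_pct(ghz):.0f} %  efficiency={scaling_efficiency(ghz):.3f}")
```

An efficiency at or slightly above 1.0 at every step is what makes throughput a direct proxy for average clock frequency here.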

Somehow, amd-pstate:ondemand is running the CPU faster on average than
even amd-pstate:performance and EPP:powersave:performance, which
*should* be choosing the highest possible frequency at all times.

1. As I understand it, the intent in the schedutil governor is to run
CPU-bound tasks at maximum performance, and if you want to trade energy
for time, userspace should set cpu.uclamp.max in the cgroup.  Any
CPU-bound benchmark that runs slower under the schedutil governor than
under the performance governor can then be considered a bug. There are
many such bugs, and tbench is one of them.  But I agree with the
philosophy: 1:1 scaling with CPU frequency is the best possible, and no
governor should be running such a workload below scaling_max_freq.

2. http://www.eecs.umich.edu/courses/eecs470/OLD/w14/lectures/470L14W14.pdf

3.
https://patchwork.kernel.org/project/linux-pm/patch/20220914061105.1982477-3-li.meng@amd.com/

4. I suspect the >100% scaling is due to the relative overhead of
background tasks and scheduling being less at higher clock frequency.