
[0/1] cpuidle/menu: Address performance drop from favoring physical over polling cpuidle state

Message ID 20240809073120.250974-1-aboorvad@linux.ibm.com
Series cpuidle/menu: Address performance drop from favoring physical over polling cpuidle state

Message

Aboorva Devarajan Aug. 9, 2024, 7:31 a.m. UTC
This patch aims to discuss a potential performance degradation that can occur
in certain workloads when the menu governor prioritizes selecting a physical
idle state over a polling state for short idle durations. 

Note: This patch is intended to showcase a performance degradation. Applying
it could lead to increased power consumption, since it trades power efficiency
for performance by preferring the polling state more often.

==================================================
System details in which the degradation is observed:

$ uname -r
6.10.0+

$ lscpu
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   160
  On-line CPU(s) list:    0-159
Model name:               POWER10 (architected), altivec supported
  Model:                  2.0 (pvr 0080 0200)
  Thread(s) per core:     8
  Core(s) per socket:     3
  Socket(s):              6
  Physical sockets:       4
  Physical chips:         2
  Physical cores/chip:    6
Virtualization features:
  Hypervisor vendor:      pHyp
  Virtualization type:    para
Caches (sum of all):
  L1d:                    1.3 MiB (40 instances)
  L1i:                    1.9 MiB (40 instances)
  L2:                     40 MiB (40 instances)
  L3:                     160 MiB (40 instances)
NUMA:
  NUMA node(s):           6
  NUMA node0 CPU(s):      0-31
  NUMA node1 CPU(s):      32-71
  NUMA node2 CPU(s):      72-79
  NUMA node3 CPU(s):      80-87
  NUMA node4 CPU(s):      88-119
  NUMA node5 CPU(s):      120-159


$ cpupower idle-info
CPUidle driver: pseries_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 2
Available idle states: snooze CEDE
snooze:
Flags/Description: snooze
Latency: 0
Residency: 0
Usage: 6229
Duration: 402142
CEDE:
Flags/Description: CEDE
Latency: 12
Residency: 120
Usage: 191411
Duration: 36329999037

==================================================

The menu governor contains a condition that prefers a physical idle state,
such as CEDE, over a polling state when the physical state's exit latency
meets the latency requirement. This can lead to performance drops in
workloads with frequent short idle periods.

The specific condition that causes the degradation is shown below (menu governor):

```
if (s->target_residency_ns > predicted_ns) {
    ...
    if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
        s->exit_latency_ns <= latency_req &&
        s->target_residency_ns <= data->next_timer_ns) {
        predicted_ns = s->target_residency_ns;
        idx = i;
        break;
    }
    ...
}
```

This condition can cause the menu governor to choose the CEDE state on Power
Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
when the predicted idle duration is much shorter than the target residency
of the physical state. This misprediction leads to performance degradation
in certain workloads.
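
As a rough illustration, the following standalone sketch (not kernel code)
plugs this system's CEDE numbers into the shape of the condition above. The
values for latency_req, next_timer_ns and the ~30 us predicted idle duration
are assumptions chosen to match the workload described below:

```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Illustrative values: CEDE numbers from the cpupower output above,
	 * the rest are assumed for the sake of the example. */
	uint64_t predicted_ns        = 30 * 1000ULL;        /* ~30 us predicted idle */
	uint64_t target_residency_ns = 120 * 1000ULL;       /* CEDE residency */
	uint64_t exit_latency_ns     = 12 * 1000ULL;        /* CEDE exit latency */
	uint64_t latency_req         = UINT64_MAX;          /* no PM QoS constraint */
	uint64_t next_timer_ns       = 4 * 1000 * 1000ULL;  /* next timer far away */
	bool idx_is_polling          = true;                /* snooze selected so far */

	/* Mirrors the shape of the menu governor condition shown above. */
	if (target_residency_ns > predicted_ns &&
	    idx_is_polling &&
	    exit_latency_ns <= latency_req &&
	    target_residency_ns <= next_timer_ns)
		printf("CEDE is selected despite ~%llu us of predicted idle time\n",
		       (unsigned long long)(predicted_ns / 1000));
	else
		printf("the polling state (snooze) is kept\n");

	return 0;
}
```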

==================================================
Test Results
==================================================

This issue can be clearly observed with the test below.

A test with multiple wakee threads and a single waker thread was run to
demonstrate this issue. The waker thread periodically wakes up the wakee
threads after a specific sleep duration, creating a repeating sleep -> wake
pattern. The test was run for a fixed period, and cpuidle statistics were
collected.

./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
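
For reference, below is a minimal sketch of the sleep -> wake pattern the
test generates. This is not the cpuidle-test source: a single wakee, a pipe
as the wakeup mechanism and a hard-coded ~50 us interval are assumptions,
and CPU affinity setup and statistics collection are omitted.

```
#include <unistd.h>

int main(void)
{
	int fds[2];
	char byte = 0;

	if (pipe(fds))
		return 1;

	if (fork() == 0) {
		/* Wakee: block on the pipe so the CPU can go idle, get
		 * woken by the waker, then immediately block again. */
		while (read(fds[0], &byte, 1) == 1)
			;
		return 0;
	}

	/* Waker: sleep ~50 us, then wake the wakee; repeat for a while. */
	for (int i = 0; i < 1000000; i++) {
		usleep(50);
		if (write(fds[1], &byte, 1) != 1)
			break;
	}
	close(fds[1]);	/* EOF lets the wakee exit */
	return 0;
}
```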

==================================================
Results (Baseline Kernel):
==================================================
Wakee 0[PID 8295] affined to CPUs: 10,
Wakee 2[PID 8297] affined to CPUs: 30,
Wakee 3[PID 8298] affined to CPUs: 40,
Wakee 1[PID 8296] affined to CPUs: 20,
Wakee 4[PID 8299] affined to CPUs: 50,
Wakee 5[PID 8300] affined to CPUs: 60,
Wakee 6[PID 8301] affined to CPUs: 70,
Waker[PID 8302] affined to CPUs: 0,

|-----------------------------------|-------------------------|-----------------------------|
| Metric                            | snooze                  | CEDE                        |
|-----------------------------------|-------------------------|-----------------------------|
| Usage                             | 47815                   | 2030160                     |
| Above                             | 0                       | 2030043                     |
| Below                             | 0                       | 0                           |
| Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
| Overall average sleep duration    | 28.721 us               |                             |
| Overall average wakeup latency    | 6.858 us                |                             |
|-----------------------------------|-------------------------|-----------------------------|

In this test, without the patch, the CPU often enters the CEDE state for
sleep durations of around 20-30 microseconds, even though the CEDE state's
target residency is 120 microseconds. This happens because the menu governor
prioritizes the physical idle state (CEDE) whenever its exit latency is
within the latency limit, and because it compares the target residency
against next_timer_ns, which can be much farther off from the actual idle
duration, instead of against the predicted idle duration.

Inference: The "Above" metric in the table counts instances where a state
was entered but the observed idle duration turned out to be shorter than the
state's target residency. The data shows that the CEDE state's "Above" count
is high: around 99.9% of CEDE entries were for idle durations shorter than
the state's target residency.

==================================================
Results (With Patch):
==================================================
Wakee 0[PID 8295] affined to CPUs: 10,
Wakee 2[PID 8297] affined to CPUs: 30,
Wakee 3[PID 8298] affined to CPUs: 40,
Wakee 1[PID 8296] affined to CPUs: 20,
Wakee 4[PID 8299] affined to CPUs: 50,
Wakee 5[PID 8300] affined to CPUs: 60,
Wakee 6[PID 8301] affined to CPUs: 70,
Waker[PID 8302] affined to CPUs: 0,

|-----------------------------------|-------------------------|-----------------------------|
| Metric                            | snooze                  | CEDE                        |
|-----------------------------------|-------------------------|-----------------------------|
| Usage                             | 1853343                 | 6                           |
| Above                             | 0                       | 6                           |
| Below                             | 2                       | 0                           |
| Time Spent (us)                   | 51055087 (85.09%)       | 356 (0.00%)                 |
| Overall average sleep duration    | 32.268 us               |                             |
| Overall average wakeup latency    | 4.382 us                |                             |
|-----------------------------------|-------------------------|-----------------------------|

When the condition that prioritizes the selection of physical idle states
based on exit latency is removed, these metrics improve significantly.
The reduction in the "Above" metric for the CEDE state shows that the
governor's cpuidle state selection is now more accurate than before,
leading to better overall performance.

==================================================
Performance Improvements:
==================================================
The patch improved the performance of OLTP benchmarks by around 5-10%.

The following table summarizes the performance improvements observed in
multiple OLTP query tests:

|----|----------|------------------|---------------------------|
| #  | Base (s) | Base + Patch (s) | % Runtime Change          |
|    | Kernel   | Kernel           |    (lower is better)      |
|----|----------|------------------|---------------------------|
| 1  | 15.77    | 14.14            | -10.34%                   |
| 2  | 0.87     | 0.86             | -0.90%                    |
| 3  | 4.05     | 3.78             | -6.70%                    |
| 4  | 0.61     | 0.59             | -3.10%                    |
| 5  | 1.18     | 1.14             | -3.47%                    |
| 6  | 4.11     | 3.84             | -6.59%                    |
| 7  | 0.75     | 0.75             | -0.24%                    |
| 8  | 14.91    | 13.56            | -9.03%                    |
| 9  | 4.02     | 3.86             | -3.90%                    |
| 10 | 14.93    | 13.65            | -8.57%                    |
| 11 | 15.62    | 14.03            | -10.20%                   |
| 12 | 2.15     | 2.05             | -4.52%                    |
| 13 | 7.92     | 7.35             | -7.29%                    |
| 14 | 4.08     | 4.07             | -0.08%                    |
| 15 | 2.28     | 2.10             | -8.12%                    |
| 16 | 1.14     | 1.11             | -3.09%                    |
| 17 | 2.14     | 2.06             | -3.64%                    |
| 18 | 2.54     | 2.36             | -7.16%                    |
| 19 | 3.72     | 3.68             | -1.05%                    |
| 20 | 2.12     | 2.05             | -3.28%                    |
| 21 | 8.17     | 7.46             | -8.69%                    |
| 22 | 3.98     | 3.83             | -3.58%                    |
| 23 | 2.45     | 2.30             | -6.12%                    |
| 24 | 2.13     | 2.06             | -3.61%                    |
| 25 | 15.05    | 13.74            | -8.70%                    |
| 26 | 4.49     | 4.16             | -7.43%                    |
| 27 | 3.75     | 3.66             | -2.37%                    |
| 28 | 2.90     | 2.87             | -0.96%                    |
| 29 | 4.37     | 4.03             | -7.83%                    |
| 30 | 2.57     | 2.43             | -5.19%                    |
| 31 | 2.78     | 2.71             | -2.30%                    |
| 32 | 3.94     | 3.78             | -4.27%                    |
| 33 | 0.70     | 0.69             | -1.48%                    |
| 34 | 2.15     | 2.08             | -3.62%                    |
| 35 | 8.30     | 7.52             | -9.38%                    |
| 36 | 2.15     | 2.08             | -3.25%                    |
| 37 | 4.37     | 4.03             | -7.88%                    |
| 38 | 4.16     | 3.88             | -6.78%                    |
| 39 | 1.12     | 1.12             | -0.02%                    |
| 40 | 1.22     | 1.17             | -4.19%                    |
| 41 | 8.45     | 7.67             | -9.27%                    |
| 42 | 7.91     | 7.31             | -7.51%                    |
| 43 | 1.12     | 1.12             | -0.08%                    |
| 44 | 0.68     | 0.65             | -3.86%                    |
| 45 | 2.12     | 2.00             | -5.47%                    |
| 46 | 3.11     | 3.04             | -2.36%                    |
| 47 | 2.29     | 2.13             | -6.88%                    |
| 48 | 8.37     | 7.61             | -9.02%                    |
| 49 | 2.61     | 2.46             | -5.75%                    |
| 50 | 7.92     | 7.30             | -7.79%                    |
| 51 | 2.85     | 2.77             | -2.58%                    |
| 52 | 0.79     | 0.78             | -0.87%                    |
| 53 | 2.90     | 2.79             | -3.63%                    |
| 54 | 8.03     | 7.40             | -7.87%                    |
| 55 | 2.14     | 2.00             | -6.53%                    |
| 56 | 15.54    | 13.93            | -10.38%                   |
| 57 | 8.07     | 7.45             | -7.62%                    |
| 58 | 8.48     | 7.70             | -9.20%                    |
| 59 | 1.08     | 1.04             | -3.80%                    |
| 60 | 1.14     | 1.13             | -0.91%                    |
| 61 | 4.28     | 3.95             | -7.84%                    |
| 62 | 8.40     | 7.62             | -9.36%                    |
| 63 | 7.91     | 7.38             | -6.78%                    |
| 64 | 2.16     | 2.10             | -2.92%                    |
| 65 | 7.78     | 7.18             | -7.60%                    |
| 66 | 4.22     | 4.19             | -0.84%                    |
| 67 | 2.99     | 2.88             | -3.61%                    |
| 68 | 8.44     | 7.56             | -10.33%                   |
| 69 | 4.10     | 3.74             | -8.89%                    |
| 70 | 4.25     | 3.98             | -6.37%                    |
| 71 | 0.61     | 0.61             | -0.39%                    |
| 72 | 3.89     | 3.56             | -8.39%                    |
|----|----------|------------------|---------------------------|


==================================================
Observation:
==================================================

The condition under discussion is intended to prioritize physical idle
states when their exit latency is within acceptable limits, thereby improving
power efficiency and overall SMT folding. However, on systems where the
target residency is significantly higher than the exit latency, it leads to
frequent selection of states like CEDE on Power systems, which is inefficient
for short sleep durations.

On systems where the target residency and exit latency are nearly equal,
however, this condition might not adversely affect performance.

In general, there seems to be a preference for prioritizing energy
efficiency, as reflected in commit 0c313cb ("cpuidle: menu: Fallback to
polling if the next timer event is near"). That change went back to using
next_timer_ns as the comparison point instead of the predicted idle
duration. While next_timer_ns is more precise, it might not account for
very short idle durations, which can hurt performance in certain
situations, even though it helps reduce energy consumption.

Any feedback or suggestions on how to better balance the power-performance
tradeoff in the menu governor, so that the performance issue in the
specific cases mentioned above can be addressed, would be very helpful.

Thanks
Aboorva

Aboorva Devarajan (1):
  cpuidle/menu: avoid prioritizing physical state over polling state

 drivers/cpuidle/governors/menu.c | 11 -----------
 1 file changed, 11 deletions(-)

Comments

Christian Loehle Aug. 13, 2024, 12:56 p.m. UTC | #1
On 8/9/24 08:31, Aboorva Devarajan wrote:
> This patch aims to discuss a potential performance degradation that can occur
> in certain workloads when the menu governor prioritizes selecting a physical
> idle state over a polling state for short idle durations. 
> 
> Note: This patch is intended to showcase a performance degradation, applying
> this patch could lead to increased power consumption due to the trade-off between
> performance and power efficiency, potentially causing a higher preference for
> performance at the expense of power usage.
> 

Not really a menu expert, but at this point I don't know who dares call
themselves one.
The elephant in the room would be: Does teo work better for you?

> ==================================================
> System details in which the degradation is observed:
> 
> $ uname -r
> 6.10.0+
> 
> $ lscpu
> Architecture:             ppc64le
>   Byte Order:             Little Endian
> CPU(s):                   160
>   On-line CPU(s) list:    0-159
> Model name:               POWER10 (architected), altivec supported
>   Model:                  2.0 (pvr 0080 0200)
>   Thread(s) per core:     8
>   Core(s) per socket:     3
>   Socket(s):              6
>   Physical sockets:       4
>   Physical chips:         2
>   Physical cores/chip:    6
> Virtualization features:
>   Hypervisor vendor:      pHyp
>   Virtualization type:    para
> Caches (sum of all):
>   L1d:                    1.3 MiB (40 instances)
>   L1i:                    1.9 MiB (40 instances)
>   L2:                     40 MiB (40 instances)
>   L3:                     160 MiB (40 instances)
> NUMA:
>   NUMA node(s):           6
>   NUMA node0 CPU(s):      0-31
>   NUMA node1 CPU(s):      32-71
>   NUMA node2 CPU(s):      72-79
>   NUMA node3 CPU(s):      80-87
>   NUMA node4 CPU(s):      88-119
>   NUMA node5 CPU(s):      120-159
> 
> 
> $ cpupower idle-info
> CPUidle driver: pseries_idle
> CPUidle governor: menu
> analyzing CPU 0:
> 
> Number of idle states: 2
> Available idle states: snooze CEDE
> snooze:
> Flags/Description: snooze
> Latency: 0
> Residency: 0
> Usage: 6229
> Duration: 402142
> CEDE:
> Flags/Description: CEDE
> Latency: 12
> Residency: 120
> Usage: 191411
> Duration: 36329999037
> 
> ==================================================
> 
> The menu governor contains a condition that selects physical idle states over,
> such as the CEDE state over polling state, by checking if their exit latency meets
> the latency requirements. This can lead to performance drops in workloads with
> frequent short idle periods.
> 
> The specific condition which causes degradation is as below (menu governor):
> 
> ```
> if (s->target_residency_ns > predicted_ns) {
>     ...
>     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>         s->exit_latency_ns <= latency_req &&
>         s->target_residency_ns <= data->next_timer_ns) {
>         predicted_ns = s->target_residency_ns;
>         idx = i;
>         break;
>     }
>     ...
> }
> ```
> 
> This condition can cause the menu governor to choose the CEDE state on Power
> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> when the predicted idle duration is much shorter than the target residency
> of the physical state. This misprediction leads to performance degradation
> in certain workloads.
> 

So clearly the condition
s->target_residency_ns <= data->next_timer_ns)
is supposed to prevent this, but data->next_timer_ns isn't accurate,
have you got any idea what it's set to in your workload usually?
Seems like your workload is timer-based, so the idle duration should be
predicted accurately.


> ==================================================
> Test Results
> ==================================================
> 
> This issue can be clearly observed with the below test.
> 
> A test with multiple wakee threads and a single waker thread was run to
> demonstrate this issue. The waker thread periodically wakes up the wakee
> threads after a specific sleep duration, creating a repeating of sleep -> wake
> pattern. The test was run for a stipulated period, and cpuidle statistics are
> collected.
> 
> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
> 
> ==================================================
> Results (Baseline Kernel):
> ==================================================
> Wakee 0[PID 8295] affined to CPUs: 10,
> Wakee 2[PID 8297] affined to CPUs: 30,
> Wakee 3[PID 8298] affined to CPUs: 40,
> Wakee 1[PID 8296] affined to CPUs: 20,
> Wakee 4[PID 8299] affined to CPUs: 50,
> Wakee 5[PID 8300] affined to CPUs: 60,
> Wakee 6[PID 8301] affined to CPUs: 70,
> Waker[PID 8302] affined to CPUs: 0,
> 
> |-----------------------------------|-------------------------|-----------------------------|
> | Metric                            | snooze                  | CEDE                        |
> |-----------------------------------|-------------------------|-----------------------------|
> | Usage                             | 47815                   | 2030160                     |
> | Above                             | 0                       | 2030043                     |
> | Below                             | 0                       | 0                           |
> | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
> | Overall average sleep duration    | 28.721 us               |                             |
> | Overall average wakeup latency    | 6.858 us                |                             |
> |-----------------------------------|-------------------------|-----------------------------|
> 
> In this test, without the patch, the CPU often enters the CEDE state for
> sleep durations of around 20-30 microseconds, even though the CEDE state's
> residency time is 120 microseconds. This happens because the menu governor
> prioritizes the physical idle state (CEDE) if its exit latency is within
> the latency limits. It also uses next_timer_ns for comparison, which can
> be farther off than the actual idle duration as it is more predictable,
> instead of using predicted idle duration as a comparision point with the
> target residency.

Ideally that shouldn't be the case though (next_timer_ns be farther off the
actual idle duration).

> [snip]
Aboorva Devarajan Aug. 20, 2024, 8:51 a.m. UTC | #2
On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:

Hi Christian,

Thanks a lot for your comments.
...
> On 8/9/24 08:31, Aboorva Devarajan wrote:
> > This patch aims to discuss a potential performance degradation that can occur
> > in certain workloads when the menu governor prioritizes selecting a physical
> > idle state over a polling state for short idle durations. 
> > 
> > Note: This patch is intended to showcase a performance degradation, applying
> > this patch could lead to increased power consumption due to the trade-off between
> > performance and power efficiency, potentially causing a higher preference for
> > performance at the expense of power usage.
> > 
> 
> Not really a menu expert, but at this point I don't know who dares call
> themselves one.
> The elephant in the room would be: Does teo work better for you?
> 

I ran some tests with the teo governor enabled, but it didn’t make a
lot of difference. The results are presented below.

> > ==================================================
> > System details in which the degradation is observed:
> > 
> > $ uname -r
> > 6.10.0+
> > 
> > $ lscpu
> > Architecture:             ppc64le
> >   Byte Order:             Little Endian
> > CPU(s):                   160
> >   On-line CPU(s) list:    0-159
> > Model name:               POWER10 (architected), altivec supported
> >   Model:                  2.0 (pvr 0080 0200)
> >   Thread(s) per core:     8
> >   Core(s) per socket:     3
> >   Socket(s):              6
> >   Physical sockets:       4
> >   Physical chips:         2
> >   Physical cores/chip:    6
> > Virtualization features:
> >   Hypervisor vendor:      pHyp
> >   Virtualization type:    para
> > Caches (sum of all):
> >   L1d:                    1.3 MiB (40 instances)
> >   L1i:                    1.9 MiB (40 instances)
> >   L2:                     40 MiB (40 instances)
> >   L3:                     160 MiB (40 instances)
> > NUMA:
> >   NUMA node(s):           6
> >   NUMA node0 CPU(s):      0-31
> >   NUMA node1 CPU(s):      32-71
> >   NUMA node2 CPU(s):      72-79
> >   NUMA node3 CPU(s):      80-87
> >   NUMA node4 CPU(s):      88-119
> >   NUMA node5 CPU(s):      120-159
> > 
> > 
> > $ cpupower idle-info
> > CPUidle driver: pseries_idle
> > CPUidle governor: menu
> > analyzing CPU 0:
> > 
> > Number of idle states: 2
> > Available idle states: snooze CEDE
> > snooze:
> > Flags/Description: snooze
> > Latency: 0
> > Residency: 0
> > Usage: 6229
> > Duration: 402142
> > CEDE:
> > Flags/Description: CEDE
> > Latency: 12
> > Residency: 120
> > Usage: 191411
> > Duration: 36329999037
> > 
> > ==================================================
> > 
> > The menu governor contains a condition that selects physical idle states over,
> > such as the CEDE state over polling state, by checking if their exit latency meets
> > the latency requirements. This can lead to performance drops in workloads with
> > frequent short idle periods.
> > 
> > The specific condition which causes degradation is as below (menu governor):
> > 
> > ```
> > if (s->target_residency_ns > predicted_ns) {
> >     ...
> >     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
> >         s->exit_latency_ns <= latency_req &&
> >         s->target_residency_ns <= data->next_timer_ns) {
> >         predicted_ns = s->target_residency_ns;
> >         idx = i;
> >         break;
> >     }
> >     ...
> > }
> > ```
> > 
> > This condition can cause the menu governor to choose the CEDE state on Power
> > Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
> > when the predicted idle duration is much shorter than the target residency
> > of the physical state. This misprediction leads to performance degradation
> > in certain workloads.
> > 
> 
> So clearly the condition
> s->target_residency_ns <= data->next_timer_ns)
> is supposed to prevent this, but data->next_timer_ns isn't accurate,
> have you got any idea what it's set to in your workload usually?
> Seems like your workload is timer-based, so the idle duration should be
> predicted accurately.
> 

Yes, that's right. Ideally this condition should have prevented this,
but `data->next_timer_ns` is almost always greater than the actual
idle duration, which seems inaccurate.

> 
> > ==================================================
> > Test Results
> > ==================================================
> > 
> > This issue can be clearly observed with the below test.
> > 
> > A test with multiple wakee threads and a single waker thread was run to
> > demonstrate this issue. The waker thread periodically wakes up the wakee
> > threads after a specific sleep duration, creating a repeating of sleep -> wake
> > pattern. The test was run for a stipulated period, and cpuidle statistics are
> > collected.
> > 
> > ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
> > 
> > ==================================================
> > Results (Baseline Kernel):
> > ==================================================
> > Wakee 0[PID 8295] affined to CPUs: 10,
> > Wakee 2[PID 8297] affined to CPUs: 30,
> > Wakee 3[PID 8298] affined to CPUs: 40,
> > Wakee 1[PID 8296] affined to CPUs: 20,
> > Wakee 4[PID 8299] affined to CPUs: 50,
> > Wakee 5[PID 8300] affined to CPUs: 60,
> > Wakee 6[PID 8301] affined to CPUs: 70,
> > Waker[PID 8302] affined to CPUs: 0,
> > 
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Metric                            | snooze                  | CEDE                        |
> > |-----------------------------------|-------------------------|-----------------------------|
> > | Usage                             | 47815                   | 2030160                     |
> > | Above                             | 0                       | 2030043                     |
> > | Below                             | 0                       | 0                           |
> > | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
> > | Overall average sleep duration    | 28.721 us               |                             |
> > | Overall average wakeup latency    | 6.858 us                |                             |
> > |-----------------------------------|-------------------------|-----------------------------|
> > 
> > In this test, without the patch, the CPU often enters the CEDE state for
> > sleep durations of around 20-30 microseconds, even though the CEDE state's
> > residency time is 120 microseconds. This happens because the menu governor
> > prioritizes the physical idle state (CEDE) if its exit latency is within
> > the latency limits. It also uses next_timer_ns for comparison, which can
> > be farther off than the actual idle duration as it is more predictable,
> > instead of using predicted idle duration as a comparision point with the
> > target residency.
> 
> Ideally that shouldn't be the case though (next_timer_ns be farther off the actual idle duration)

I ran some experiments based on your suggestions. Below are the
relative average runtimes and percentage differences compared to
the base version:

Picked a single representative workload for simplicity:

Note: the lower the % Difference, the better.
|---------------------------|-----------------|--------------|
| Configuration             | Average Runtime | % Difference |
|---------------------------|-----------------|--------------|
| Base (menu)               | 1.00            | 0.00%        |
| Base + Patch 1 (menu)     | 0.92            | -8.00%       |
| Base + Patch 2 (menu)     | 0.98            | -2.00%       |
| Base (teo)                | 1.01            | +1.00%       |
|---------------------------|-----------------|--------------|
Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/  
Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/

It seems that Patch 2 does provide a slight improvement in runtime, but
not as significant as Patch 1. Additionally, teo does not seem to help in
this case.

Regarding the condition `s->target_residency_ns <= data->next_timer_ns`,
it appears that `data->next_timer_ns` is consistently greater than both
the actual idle duration and `s->target_residency_ns`, so this condition
nearly always holds true, indicating some inaccuracy. I'll investigate
this further and follow up with more details.
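
A rough sketch of the kind of instrumentation this could use (a hypothetical
helper, not part of the posted patch; it would have to be called from
drivers/cpuidle/governors/menu.c wherever both the stale next_timer_ns
prediction and the measured idle time are at hand):

```
/*
 * Hypothetical debug helper for menu.c: log how far next_timer_ns was from
 * the idle duration that was actually observed. Both values are passed in,
 * so no assumption is made about the exact call site in the update path.
 */
static void menu_log_timer_miss(u64 next_timer_ns, u64 measured_ns)
{
	if (next_timer_ns > measured_ns)
		trace_printk("next_timer_ns overshoot: predicted %llu ns, slept %llu ns\n",
			     next_timer_ns, measured_ns);
}
```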

Regards,
Aboorva
Christian Loehle Aug. 21, 2024, 10:55 a.m. UTC | #3
On 8/20/24 09:51, Aboorva Devarajan wrote:
> On Tue, 2024-08-13 at 13:56 +0100, Christian Loehle wrote:
> 
> Hi Christian,
> 
> Thanks a lot for your comments.
> ...
>> On 8/9/24 08:31, Aboorva Devarajan wrote:
>>> This patch aims to discuss a potential performance degradation that can occur
>>> in certain workloads when the menu governor prioritizes selecting a physical
>>> idle state over a polling state for short idle durations. 
>>>
>>> Note: This patch is intended to showcase a performance degradation, applying
>>> this patch could lead to increased power consumption due to the trade-off between
>>> performance and power efficiency, potentially causing a higher preference for
>>> performance at the expense of power usage.
>>>
>>
>> Not really a menu expert, but at this point I don't know who dares call
>> themselves one.
>> The elephant in the room would be: Does teo work better for you?
>>
> 
> I ran some tests with the teo governor enabled, but it didn’t make a
> lot of difference. The results are presented below.
> 
>>> ==================================================
>>> System details in which the degradation is observed:
>>>
>>> $ uname -r
>>> 6.10.0+
>>>
>>> $ lscpu
>>> Architecture:             ppc64le
>>>   Byte Order:             Little Endian
>>> CPU(s):                   160
>>>   On-line CPU(s) list:    0-159
>>> Model name:               POWER10 (architected), altivec supported
>>>   Model:                  2.0 (pvr 0080 0200)
>>>   Thread(s) per core:     8
>>>   Core(s) per socket:     3
>>>   Socket(s):              6
>>>   Physical sockets:       4
>>>   Physical chips:         2
>>>   Physical cores/chip:    6
>>> Virtualization features:
>>>   Hypervisor vendor:      pHyp
>>>   Virtualization type:    para
>>> Caches (sum of all):
>>>   L1d:                    1.3 MiB (40 instances)
>>>   L1i:                    1.9 MiB (40 instances)
>>>   L2:                     40 MiB (40 instances)
>>>   L3:                     160 MiB (40 instances)
>>> NUMA:
>>>   NUMA node(s):           6
>>>   NUMA node0 CPU(s):      0-31
>>>   NUMA node1 CPU(s):      32-71
>>>   NUMA node2 CPU(s):      72-79
>>>   NUMA node3 CPU(s):      80-87
>>>   NUMA node4 CPU(s):      88-119
>>>   NUMA node5 CPU(s):      120-159
>>>
>>>
>>> $ cpupower idle-info
>>> CPUidle driver: pseries_idle
>>> CPUidle governor: menu
>>> analyzing CPU 0:
>>>
>>> Number of idle states: 2
>>> Available idle states: snooze CEDE
>>> snooze:
>>> Flags/Description: snooze
>>> Latency: 0
>>> Residency: 0
>>> Usage: 6229
>>> Duration: 402142
>>> CEDE:
>>> Flags/Description: CEDE
>>> Latency: 12
>>> Residency: 120
>>> Usage: 191411
>>> Duration: 36329999037
>>>
>>> ==================================================
>>>
>>> The menu governor contains a condition that selects physical idle states over,
>>> such as the CEDE state over polling state, by checking if their exit latency meets
>>> the latency requirements. This can lead to performance drops in workloads with
>>> frequent short idle periods.
>>>
>>> The specific condition which causes degradation is as below (menu governor):
>>>
>>> ```
>>> if (s->target_residency_ns > predicted_ns) {
>>>     ...
>>>     if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
>>>         s->exit_latency_ns <= latency_req &&
>>>         s->target_residency_ns <= data->next_timer_ns) {
>>>         predicted_ns = s->target_residency_ns;
>>>         idx = i;
>>>         break;
>>>     }
>>>     ...
>>> }
>>> ```
>>>
>>> This condition can cause the menu governor to choose the CEDE state on Power
>>> Systems (residency: 120 us, exit latency: 12 us) over a polling state, even
>>> when the predicted idle duration is much shorter than the target residency
>>> of the physical state. This misprediction leads to performance degradation
>>> in certain workloads.
>>>
>>
>> So clearly the condition
>> s->target_residency_ns <= data->next_timer_ns)
>> is supposed to prevent this, but data->next_timer_ns isn't accurate,
>> have you got any idea what it's set to in your workload usually?
>> Seems like your workload is timer-based, so the idle duration should be
>> predicted accurately.
>>
> 
> Yes, that's right ideally this condition should have prevented this,
> but `data->next_timer_ns` is almost always greater than the actual
> idle duration which seems inaccurate.
> 
>>
>>> ==================================================
>>> Test Results
>>> ==================================================
>>>
>>> This issue can be clearly observed with the below test.
>>>
>>> A test with multiple wakee threads and a single waker thread was run to
>>> demonstrate this issue. The waker thread periodically wakes up the wakee
>>> threads after a specific sleep duration, creating a repeating of sleep -> wake
>>> pattern. The test was run for a stipulated period, and cpuidle statistics are
>>> collected.
>>>
>>> ./cpuidle-test -a 0 -b 10 -b 20 -b 30 -b 40 -b 50 -b 60 -b 70 -r 20 -t 60
>>>
>>> ==================================================
>>> Results (Baseline Kernel):
>>> ==================================================
>>> Wakee 0[PID 8295] affined to CPUs: 10,
>>> Wakee 2[PID 8297] affined to CPUs: 30,
>>> Wakee 3[PID 8298] affined to CPUs: 40,
>>> Wakee 1[PID 8296] affined to CPUs: 20,
>>> Wakee 4[PID 8299] affined to CPUs: 50,
>>> Wakee 5[PID 8300] affined to CPUs: 60,
>>> Wakee 6[PID 8301] affined to CPUs: 70,
>>> Waker[PID 8302] affined to CPUs: 0,
>>>
>>> |-----------------------------------|-------------------------|-----------------------------|
>>> | Metric                            | snooze                  | CEDE                        |
>>> |-----------------------------------|-------------------------|-----------------------------|
>>> | Usage                             | 47815                   | 2030160                     |
>>> | Above                             | 0                       | 2030043                     |
>>> | Below                             | 0                       | 0                           |
>>> | Time Spent (us)                   | 976317 (1.63%)          | 51046474 (85.08%)           |
>>> | Overall average sleep duration    | 28.721 us               |                             |
>>> | Overall average wakeup latency    | 6.858 us                |                             |
>>> |-----------------------------------|-------------------------|-----------------------------|
>>>
>>> In this test, without the patch, the CPU often enters the CEDE state for
>>> sleep durations of around 20-30 microseconds, even though the CEDE state's
>>> residency time is 120 microseconds. This happens because the menu governor
>>> prioritizes the physical idle state (CEDE) if its exit latency is within
>>> the latency limits. It also uses next_timer_ns for comparison, which can
>>> be farther off than the actual idle duration as it is more predictable,
>>> instead of using predicted idle duration as a comparision point with the
>>> target residency.
>>
>> Ideally that shouldn't be the case though (next_timer_ns be farther off the actual idle duration)
> 
> I ran some experiments based on your suggestions. Below are the
> relative average runtimes and percentage differences compared to
> the base version:
> 
> Picked a single representative workload for simplicity:
> 
> Note: Lower (% Difference) the better.
> |---------------------------|-----------------|--------------|
> | Configuration             | Average Runtime | % Difference |
> |---------------------------|-----------------|--------------|
> | Base (menu)               | 1.00            | 0.00%        |
> | Base + Patch 1 (menu)     | 0.92            | -8.00%       |
> | Base + Patch 2 (menu)     | 0.98            | -2.00%       |
> | Base (teo)                | 1.01            | +1.00%       |
> |---------------------------|-----------------|--------------|
> Patch 1: https://lore.kernel.org/all/20240809073120.250974-2-aboorvad@linux.ibm.com/  
> Patch 2: https://lore.kernel.org/all/c20a07e4-b9e6-4a66-80f5-63d679b17c3b@arm.com/
> 
> It seems that Patch 2 does provide a slight improvement in runtime, but
> not significantly like Patch 1. Additionally, teo does not seem
> to help in this case.
> 
> Regarding the condition `s->target_residency_ns <= data->next_timer_ns`,
> it appears that `data->next_timer_ns` is consistently greater than
> both actual idle duration and `s->target_residency_ns` so this condition
> nearly always holds true, indicating some inaccuracy. I'll investigate
> this further and follow up with more details.

The wakeup source(s), since they don't seem to be timer events, would be
interesting, although a bit of a hassle to get right.
What's the workload anyway?

> 
> Regards,
> Aboorva
>
Christian Loehle Sept. 19, 2024, 9:43 a.m. UTC | #4
On 9/19/24 09:49, Christian Loehle wrote:
> On 9/19/24 06:02, Aboorva Devarajan wrote:
>> On Wed, 2024-08-21 at 11:55 +0100, Christian Loehle wrote:
> 
>>
>> Timer based wakeup:
>>
>> +------------------------+---------------------+---------------------+
>> | Metric                 | Without Patch       | With Patch          |
>> +------------------------+---------------------+---------------------+
>> | Wakee thread - CPU     | 110                 | 110                 |
>> | Waker thread - CPU     | 20                  | 20                  |
>> | Sleep Interval         | 50 us               | 50 us               |
>> | Total Wakeups          | -                   | -                   |
>> | Avg Wakeup Latency     | -                   | -                   |
>> | Actual Sleep time      | 52.639 us           | 52.627 us           |
>> +------------------------+---------------------+---------------------+
>> | Idle State 0 Usage Diff| 94,879              | 94,879              |
>> | Idle State 0 Time Diff | 4,700,323 ns        | 4,697,576 ns        |
>> | Idle State 1 Usage Diff| 0                   | 0                   |
>> | Idle State 1 Time Diff | 0 ns                | 0 ns                |
>> +------------------------+---------------------+---------------------+
>> | Total Above Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>>
>> In timer-based wakeups, the menu governor effectively predicts the idle
>> duration both with and without the patch. This ensures that there are
>> few or no instances of "Above" usage, allowing the CPU to remain in the
>> correct idle state.
>>
>> The condition (s->target_residency_ns <= data->next_timer_ns) in the menu
>> governor ensures that a physical idle state is not prioritized when a
>> timer event is expected before the target residency of the first physical
>> idle state.
>>
>> As a result, the patch has no impact in this case, and performance
>> remains stable with timer based wakeups.
>>
>> Pipe based wakeup (non-timer wakeup):
>>
>> +------------------------+---------------------+---------------------+
>> | Metric                 | Without Patch       | With Patch          |
>> +------------------------+---------------------+---------------------+
>> | Wakee thread - CPU     | 110                 | 110                 |
>> | Waker thread - CPU     | 20                  | 20                  |
>> | Sleep Interval         | 50 us               | 50 us               |
>> | Total Wakeups          | 97031               | 96583               |
>> | Avg Wakeup Latency     | 7.070 us            | 4.879 us            |
>> | Actual Sleep time      | 51.366 us           | 51.605 us           |
>> +------------------------+---------------------+---------------------+
>> | Idle State 0 Usage Diff| 1209                | 96,586              |
>> | Idle State 0 Time Diff | 55,579 ns           | 4,510,003 ns        |
>> | Idle State 1 Usage Diff| 95,826              | 5                   |
>> | Idle State 1 Time Diff | 4,522,639 ns        | 198 ns              |
>> +------------------------+---------------------+---------------------+
>> +------------------------+---------------------+---------------------+
>> | **Total Above Usage**  | 95,824 (98.75%)     | 5 (0.01%)           |
>> +------------------------+---------------------+---------------------+     
>> +------------------------+---------------------+---------------------+
>> | Total Below Usage      | 0 (0.00%)           | 0 (0.00%)           |
>> +------------------------+---------------------+---------------------+
>>
>> In the pipe-based wakeup scenario, before the patch was applied, the 
>> "Above" metric was notably high around 98.75%. This suggests that the
>> menu governor frequently selected a deeper idle state like CEDE, even
>> when the idle period was relatively short.
>>
>> This happened because the menu governor is inclined to prioritize the
>> physical idle state (CEDE) even when the target residency time of the
>> physical idle state (s->target_residency_ns) was longer than the
>> predicted idle time (predicted_ns), so data->next_timer_ns won't be
>> relevant here in non-timer wakeups.
>>
>> In this test, despite the actual idle period being around 50 microseconds,
>> the menu governor still chose CEDE state, which has a target residency of
>> 120 microseconds.
> 
> And the idle_miss statistics confirms that this was mostly wrong decisions
> in hindsight.
> I'll go through the menu code again, this indeed shouldn't be happening.
> I'd be very surprised if upstream teo performed as badly (or actually badly
> at all) on this, although it doesn't seem to help your actual workload,
> would you mind giving that a try too?
> 
> Regards,
> Christian

So with a workload as static as this, the recent interval detection of menu
should take effect. It records the last 8 idle interval lengths and does some
statistics to see if they are somewhat stable, and uses those instead of the
next timer event.
While I wouldn't vouch for the statistics just yet, the results don't seem
to be completely unreasonable.
Do you mind trying a) printing some snapshots of these intervals and the
resulting return value of get_typical_interval()?
b) Trying this out? Setting these values to something around 50us returns
50us for me as expected and therefore should not select an idle state above
that.
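
For (a), something along these lines could do; this is only a sketch, the
helper name is made up, and it would have to be called from menu.c wherever
get_typical_interval()'s result is at hand:

```
/* Hypothetical debug helper for menu.c: snapshot the eight recorded idle
 * intervals (in us) together with the typical value derived from them. */
static void menu_dump_intervals(struct menu_device *data, unsigned int typical_us)
{
	trace_printk("intervals: %u %u %u %u %u %u %u %u -> typical %u us\n",
		     data->intervals[0], data->intervals[1],
		     data->intervals[2], data->intervals[3],
		     data->intervals[4], data->intervals[5],
		     data->intervals[6], data->intervals[7],
		     typical_us);
}
```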

-->8--

--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -112,6 +112,8 @@ struct menu_device {
        int             interval_ptr;
 };
 
+static int intervals_dummy[] = {50, 52, 48, 50, 51, 49, 51, 51};
+
 static inline int which_bucket(u64 duration_ns, unsigned int nr_iowaiters)
 {
        int bucket = 0;
@@ -177,7 +179,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
        sum = 0;
        divisor = 0;
        for (i = 0; i < INTERVALS; i++) {
-               unsigned int value = data->intervals[i];
+               unsigned int value = intervals_dummy[i];
                if (value <= thresh) {
                        sum += value;
                        divisor++;
@@ -200,7 +202,7 @@ static unsigned int get_typical_interval(struct menu_device *data)
        /* Then try to determine variance */
        variance = 0;
        for (i = 0; i < INTERVALS; i++) {
-               unsigned int value = data->intervals[i];
+               unsigned int value = intervals_dummy[i];
                if (value <= thresh) {
                        int64_t diff = (int64_t)value - avg;
                        variance += diff * diff;