mbox series

[RFC,0/3] sched: delayed thread migration

Message ID 20201020154445.119701-1-redha.gouicem@gmail.com
Headers show
Series sched: delayed thread migration | expand

Message

Redha Oct. 20, 2020, 3:44 p.m. UTC
This patch series proposes a tweak in the thread placement strategy of CFS
to make a better use of cores running at a high frequency. It is the result
of a published work at ATC'20, available here:
https://www.usenix.org/conference/atc20/presentation/gouicern

We address the frequency inversion problem that stems from new DVFS
capabilities of CPUs where frequency can be different among cores of the
same chip. On applications with a large number of fork/wait patterns, we
see that CFS tends to place new threads on idle cores running at a low
frequency, while cores where the parent is running are at a high frequency
and become idle immediately after. This means that a low frequency core is
used while a high frequency core becomes idle. A more detailed analysis of
this behavior is available in the paper but here is small example with 2
cores c0 and c1 and 2 threads A and B.

Legend: - means high frequency
        . means low frequency
	  nothing means nothing is running.

c0:   A------fork()-wait()
                  \
c1:                B................--------

c0 becomes idle while at a high frequency and c1 becomes busy and executes
B at a low frequency for a long time.

The fix we propose is to delay the migration of threads in order to
see if the parent's core can be used quickly before using another core.
When choosing a core to place a new or waking up thread (in
select_task_rq_fair()), we check if the destination core is idle or busy
frequency-wise. If it is busy, we keep the current behavior and move the
thread to this new core. If it is idle, we place the thread locally
(parent's core on fork, waker's core on wakeup). We also arm a
high-resolution timer that will trigger a migration of this thread to the
new core chosen by the original algorithm of CFS. If the thread is able to
run quickly on the local core, we cancel the timer (e.g. the parent thread
calls the wait() syscall). Else, the thread will be moved to the core
decided by CFS and run quickly. When the parent thread waits, the previous
example becomes:

c0:   A------fork()-wait()-B---------------
                  \/
c1:

If the parent thread keeps running, the example becomes:

c0:   A------fork()---------timer------------
                  \/           \
c1:                             B.............--------

In the first case, we use a high frequency core (c0) and let c1 be idle. In
the second case, we loose a few us of execution time for B.

There are two configurable parameters in this patch:
     - the frequency threshold above which we consider a core busy
       frequency-wise. By default, the threshold is set to 0, which
       disables the delayed migrations. On our test server (an Intel Xeon
       E7-8870 v4), an efficient value was the minimal frequency of the
       CPU. On another test machine (AMD Ryzen 5 3400G), it was a frequency
       a little bit higher than the minimal one.
     - the delay of the high-resolution timer. We choose a default value of
       50 us since it is the mean delay we observed between a fork and a
       wait during a kernel build.
Both parameters can be modified through files in /proc/sys/kernel/. Since
this is kind of work in progress, we are still looking to find better ways
to choose the best threshold and delay, as both values probably depend on
the machine.

In terms of performance, we benchmarked 60 applications on a 5.4 kernel at
the time of the paper's publication. The detailed results are available in
the paper but overall, we improve the performance of 23 applications, with
a best improvement of 56%. Only 3 applications suffer large performance
degradations (more than 5%), but the degradation is "small" (at most 8%).
We also think there might be energy savings since we try to avoid waking
idle cores. We did see small improvements on our machines, but nothing as
good as in terms of performance (at most 5% improvement).

On the current development kernel (on the sched/core branch, commit
f166b111e049), we run a few of these benchmarks to confirm our results on
an up-to-date kernel. We run the build of the kernel (complete and
scheduler only), hackbench, 2 applications from the NAS benchmark suite and
openssl. We select these applications because they had either very good or
very bad results on the 5.4 kernel. We present the mean of 10 runs, with
the powersave scaling governor and the intel_pstate driver in active mode.

 Benchmark    | unit | 5.9-f166b111e049 |          patched
 -------------+------+------------------+------------------
			LOWER IS BETTER
 -------------+------+------------------+------------------
 hackbench    |    s |            5.661 |    5.612 (- 0.9%)
 kbuild-sched |    s |            7.370 |    6.309 (-14.4%)
 kbuild       |    s |           34.798 |   34.788 (- 0.0%)
 nas_ft.C     |    s |           10.080 |   10.012 (- 0.7%)
 nas_sp.B     |    s |            6.653 |    7.033 (+ 5.7%)
 -------------+------+------------------+------------------
 			HIGHER IS BETTER
 -------------+------+------------------+------------------
 openssl      |sign/s|         15086.89 | 15071.08 (- 0.1%)

On this small subset, the only application negatively impacted application
is an HPC workload, so it is less of a problem since HPC people usually
bypass the scheduler altogether. We still have a large beneficial impact on
the scheduler build that proeminently has the fork/wait pattern and
frequency inversions.



The first patch of the series is not specific to scheduling. It allows us
(or anyone else) to use the cpufreq infrastructure at a different sampling
rate without compromising the cpufreq subsystem and applications that
depend on it.

The main idea behind this patch series is to bring to light the frequency
inversion problem that will become more and more prominent with new CPUs
that feature per-core DVFS. The solution proposed is a first idea for
solving this problem that still needs to be tested across more CPUs and
with more applications.


Redha Gouicem (3):
  cpufreq: x86: allow external frequency measures
  sched: core: x86: query frequency at each tick
  sched/fair: delay thread migration on fork/wakeup/exec

 arch/x86/kernel/cpu/aperfmperf.c | 31 ++++++++++++++++++++---
 drivers/cpufreq/cpufreq.c        |  5 ++++
 include/linux/cpufreq.h          |  1 +
 include/linux/sched.h            |  4 +++
 include/linux/sched/sysctl.h     |  3 +++
 kernel/sched/core.c              | 43 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              | 42 ++++++++++++++++++++++++++++++-
 kernel/sched/sched.h             |  7 ++++++
 kernel/sysctl.c                  | 14 +++++++++++
 9 files changed, 145 insertions(+), 5 deletions(-)

Comments

Peter Zijlstra Oct. 21, 2020, 7:26 a.m. UTC | #1
On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:
> 
> The first patch of the series is not specific to scheduling. It allows us
> (or anyone else) to use the cpufreq infrastructure at a different sampling
> rate without compromising the cpufreq subsystem and applications that
> depend on it.

It's also completely redudant as the scheduler already reads aperf/mperf
on every tick. Clearly you didn't do your homework ;-)

> The main idea behind this patch series is to bring to light the frequency
> inversion problem that will become more and more prominent with new CPUs
> that feature per-core DVFS. The solution proposed is a first idea for
> solving this problem that still needs to be tested across more CPUs and
> with more applications.

Which is why schedutil (the only cpufreq gov anybody should be using) is
integrated with the scheduler and closes the loop and tells the CPU
about the expected load.
Redha Oct. 21, 2020, 10:40 a.m. UTC | #2
On 21/10/2020 09:26, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 05:44:38PM +0200, Redha Gouicem wrote:

>> The first patch of the series is not specific to scheduling. It allows us

>> (or anyone else) to use the cpufreq infrastructure at a different sampling

>> rate without compromising the cpufreq subsystem and applications that

>> depend on it.

> It's also completely redudant as the scheduler already reads aperf/mperf

> on every tick. Clearly you didn't do your homework ;-)

My bad. I worked on this a year ago, just never got time to submit to the
lkml and I should have re-done my homework more thoroughly before
submitting. The paper was submitted approximately at the same time as the
patch introducing support frequency invariance and frequency reading at
every tick (1 week apart!)
Again, my bad.

>

>> The main idea behind this patch series is to bring to light the frequency

>> inversion problem that will become more and more prominent with new CPUs

>> that feature per-core DVFS. The solution proposed is a first idea for

>> solving this problem that still needs to be tested across more CPUs and

>> with more applications.

> Which is why schedutil (the only cpufreq gov anybody should be using) is

> integrated with the scheduler and closes the loop and tells the CPU

> about the expected load.

>

While I agree that schedutil is probably a good option, I'm not sure we
treat exactly the same problem. schedutil aims at mapping the frequency of
the CPU to the actual load. What I'm saying is that since it takes some
time for the frequency to match the load, why not account for the frequency
when making placement/migration decisions. I know that with the frequency
invariance code, capacity accounts for frequency, which means that thread
placement decisions do account for frequency indirectly. However, we still
have performance improvements with our patch for the workloads with
fork/wait patterns. I really believe that we can still gain performance if
we make decisions while accounting for the frequency more directly.
Peter Zijlstra Oct. 21, 2020, 11:48 a.m. UTC | #3
On Wed, Oct 21, 2020 at 12:40:44PM +0200, Redha wrote:

> >> The main idea behind this patch series is to bring to light the frequency
> >> inversion problem that will become more and more prominent with new CPUs
> >> that feature per-core DVFS. The solution proposed is a first idea for
> >> solving this problem that still needs to be tested across more CPUs and
> >> with more applications.
> > Which is why schedutil (the only cpufreq gov anybody should be using) is
> > integrated with the scheduler and closes the loop and tells the CPU
> > about the expected load.
> >
> While I agree that schedutil is probably a good option, I'm not sure we
> treat exactly the same problem. schedutil aims at mapping the frequency of
> the CPU to the actual load. What I'm saying is that since it takes some
> time for the frequency to match the load, why not account for the frequency
> when making placement/migration decisions.

Because overhead, mostly :/ EAS does some of that. Mostly wakeup CPU
selection is already a bottle-neck for some applications (see the fight
over select_idle_sibling()).

Programming a timer is out of budget for high rate wakeup workloads.
Worse, you also don't prime the CPU to ramp up during the enforced
delay.

Also, two new config knobs :-(

> I know that with the frequency invariance code, capacity accounts for
> frequency, which means that thread placement decisions do account for
> frequency indirectly. However, we still have performance improvements
> with our patch for the workloads with fork/wait patterns. I really
> believe that we can still gain performance if we make decisions while
> accounting for the frequency more directly.

So I don't think that's fundamentally a DVFS problem though, just
something that's exacerbated by it. There's a long history with trying
to detect this pattern, see for example WF_SYNC and wake_affine().

(we even had some code in the early CFS days that measured overlap
between tasks, to account for the period between waking up the recipient
and blocking on the answer, but that never worked reliably either, so
that went the way of the dodo)

The classical micro-benchmark is pipe-bench, which ping-pongs a single
byte between two tasks over a pipe. If you run that on a single CPU it
is _much_ faster then when the tasks get split up. DVFS is just like
caches here, yet another reason.

schedutil does solve the problem where, when we migrate a task, the CPU
would have to individually re-learn the DVFS state. By using the
scheduler statistics we can program the DVFS state up-front, on
migration. Instead of waiting for it.

So from that PoV, schedutil fundamentally solves the individual DVFS
problem as best is possible. It closes the control loop; we no longer
have individually operating control loops that are unaware of one
another.