
[0/4,v4] sched/rt: track rt rq utilization

Message ID 1521199541-15308-1-git-send-email-vincent.guittot@linaro.org

Message

Vincent Guittot March 16, 2018, 11:25 a.m. UTC
When both cfs and rt tasks compete to run on a CPU, we can see some frequency
drops with the schedutil governor. In such a case, the cfs_rq's utilization no
longer reflects the utilization of cfs tasks but only the part that is not
used by rt tasks. We should monitor this stolen utilization and take it into
account when selecting the OPP. This patchset doesn't change the OPP selection
policy for RT tasks, only for CFS tasks.
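
To make the intent concrete, here is a minimal standalone sketch (not the
patch itself) of why the stolen utilization matters for frequency selection:
if the governor only sees the cfs_rq utilization, it sizes the OPP for the
leftover cfs share instead of the real CPU load. The 1.25 margin mimics
schedutil's get_next_freq() heuristic; the capacity and frequency values are
made up for illustration.

/*
 * Standalone sketch (not the patch): frequency request with and without
 * the utilization stolen by rt tasks.
 */
#include <stdio.h>

#define MAX_CAPACITY	1024U	/* SCHED_CAPACITY_SCALE */

static unsigned int pick_freq(unsigned int util, unsigned int max_freq)
{
	unsigned long freq = max_freq + (max_freq >> 2);	/* 1.25 * max_freq */

	freq = freq * util / MAX_CAPACITY;
	return freq > max_freq ? max_freq : (unsigned int)freq;
}

int main(void)
{
	unsigned int max_freq = 1200000;	/* kHz, hypothetical policy max */
	unsigned int cfs_util = 512;		/* cfs_rq utilization (~50%) */
	unsigned int rt_util = 400;		/* utilization stolen by rt tasks (~40%) */
	unsigned int sum = cfs_util + rt_util;

	if (sum > MAX_CAPACITY)
		sum = MAX_CAPACITY;

	printf("cfs only        : %u kHz\n", pick_freq(cfs_util, max_freq));
	printf("cfs + rt stolen : %u kHz\n", pick_freq(sum, max_freq));
	return 0;
}

In this example the cfs-only view asks for ~750 MHz while the combined view
asks for the maximum OPP, which matches what the always-running cfs plus
periodic rt workload described next actually needs.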

An rt-app use case that creates an always-running cfs thread and an rt thread
that wakes up periodically, with both threads pinned on the same CPU, shows a
lot of CPU frequency switches even though the CPU never goes idle during the
test. I can share the json file that I used for the test if someone is
interested.

For a 15 second long test on a hikey 6220 (octo core Cortex-A53 platform),
the cpufreq statistics output (stats are reset just before the test):
$ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
without patchset : 1230
with patchset : 14

If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
performance improvements:

- Without patchset :
Test execution summary:
    total time:                          15.0009s
    total number of events:              4903
    total time taken by event execution: 14.9972
    per-request statistics:
         min:                                  1.23ms
         avg:                                  3.06ms
         max:                                 13.16ms
         approx.  95 percentile:              12.73ms

Threads fairness:
    events (avg/stddev):           4903.0000/0.00
    execution time (avg/stddev):   14.9972/0.00

- With patchset:
Test execution summary:
    total time:                          15.0014s
    total number of events:              7694
    total time taken by event execution: 14.9979
    per-request statistics:
         min:                                  1.23ms
         avg:                                  1.95ms
         max:                                 10.49ms
         approx.  95 percentile:              10.39ms

Threads fairness:
    events (avg/stddev):           7694.0000/0.00
    execution time (avg/stddev):   14.9979/0.00

The performance improvement is 56% for this use case.

Patch 1 moves the pelt code into a dedicated pelt.c file.
Patch 2 tracks the utilization of rt_rq.
Patch 3 adds the rt_rq's utilization when selecting the OPP for cfs tasks.
Patch 4 adds support for periodic update of the blocked rt utilization.

Changes since v3:
- add support for periodic update of blocked utilization
- rebase on latest tip/sched/core

Changes since v2:
- move pelt code into a dedicated pelt.c file
- rebase on load tracking changes

Changes since v1:
- Only a rebase. I have addressed the comments on the previous version in
  patch 1/2

Vincent Guittot (4):
  sched/pelt: Move pelt related code in a dedicated file
  sched/rt: add rt_rq utilization tracking
  cpufreq/schedutil: add rt utilization tracking
  sched/nohz: monitor rt utilization

 kernel/sched/Makefile            |   2 +-
 kernel/sched/cpufreq_schedutil.c |   4 +-
 kernel/sched/fair.c              | 321 ++-----------------------------------
 kernel/sched/pelt.c              | 331 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/pelt.h              |  24 +++
 kernel/sched/rt.c                |   8 +
 kernel/sched/sched.h             |  28 ++++
 7 files changed, 410 insertions(+), 308 deletions(-)
 create mode 100644 kernel/sched/pelt.c
 create mode 100644 kernel/sched/pelt.h

-- 
2.7.4

Comments

Peter Zijlstra April 14, 2018, 10:07 a.m. UTC | #1
What I don't see in this patch-set is removal of the current rt_avg
stuff.

And I didn't look closely enough; but are the root cfs and rt pelt
windows aligned? They really should be; otherwise you can't combine them
sanely.
Vincent Guittot April 14, 2018, 11:42 a.m. UTC | #2
On 14 April 2018 at 12:07, Peter Zijlstra <peterz@infradead.org> wrote:
>
> What I don't see in this patch-set is removal of the current rt_avg
> stuff.

This RT load tracking doesn't replace the current rt_avg because they don't
use the same period and don't provide the same function:
- the current rt_avg uses sysctl_sched_time_avg to define the averaging
  period, and its default period is 1 second, whereas PELT uses a fixed
  period;
- the current rt_avg also tracks irq accounting, which this patch doesn't do.
  This is probably doable but will need more complex changes.

Replacing the current rt_avg with this new RT utilization tracking would
therefore require more complex changes, so I didn't want to add them in this
first step.
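
To illustrate the difference in averaging periods, here is a small standalone
sketch (illustration only, nothing from the patchset): PELT decays past
contributions geometrically every 1024us period with y^32 = 0.5, i.e. a ~32ms
half-life, whereas rt_avg averages over the sysctl_sched_time_avg window
(1 second by default). Link with -lm.

#include <stdio.h>
#include <math.h>

int main(void)
{
	/* PELT decay factor per 1024us period: y = 0.5^(1/32) ~= 0.9786 */
	double y = pow(0.5, 1.0 / 32.0);
	int delays_ms[] = { 32, 100, 332, 1000 };
	unsigned int i;

	for (i = 0; i < sizeof(delays_ms) / sizeof(delays_ms[0]); i++) {
		/* approximate one 1024us period per ms, close enough here */
		double remaining = pow(y, delays_ms[i]);

		printf("after %4d ms: %5.1f%% of the old contribution remains\n",
		       delays_ms[i], remaining * 100.0);
	}
	return 0;
}

After ~100 ms only about 11% of an old contribution remains in the PELT
signal; with a 1 second averaging period the same activity would still be
almost fully visible.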


>
> And I didn't look closely enough; but are the root cfs and rt pelt
> windows aligned? They really should be; otherwise you can't combine them
> sanely.

No, they are not aligned.
I agree that this could generate some variation on the sum. I'm going
to fix this point.
Dietmar Eggemann April 15, 2018, 11:56 a.m. UTC | #3
On 03/16/2018 12:25 PM, Vincent Guittot wrote:

[...]

> For a 15 second long test on a hikey 6220 (octo core Cortex-A53 platform),
> the cpufreq statistics output (stats are reset just before the test):
> $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
> without patchset : 1230
> with patchset : 14
>
> If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
> performance improvements:
>
> - Without patchset :
> Test execution summary:
>      total time:                          15.0009s
>      total number of events:              4903
>      total time taken by event execution: 14.9972
>      per-request statistics:
>           min:                                  1.23ms
>           avg:                                  3.06ms
>           max:                                 13.16ms
>           approx.  95 percentile:              12.73ms
>
> Threads fairness:
>      events (avg/stddev):           4903.0000/0.00
>      execution time (avg/stddev):   14.9972/0.00
>
> - With patchset:
> Test execution summary:
>      total time:                          15.0014s
>      total number of events:              7694
>      total time taken by event execution: 14.9979
>      per-request statistics:
>           min:                                  1.23ms
>           avg:                                  1.95ms
>           max:                                 10.49ms
>           approx.  95 percentile:              10.39ms
>
> Threads fairness:
>      events (avg/stddev):           7694.0000/0.00
>      execution time (avg/stddev):   14.9979/0.00
>
> The performance improvement is 56% for this use case.


How do you get this number? Normally we use the 'total time' value.

[...]
Vincent Guittot April 15, 2018, noon UTC | #4
Hi Dietmar,

On 15 April 2018 at 13:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 03/16/2018 12:25 PM, Vincent Guittot wrote:
>
> [...]
>
>> For a 15 second long test on a hikey 6220 (octo core Cortex-A53
>> platform),
>> the cpufreq statistics output (stats are reset just before the test):
>> $ cat /sys/devices/system/cpu/cpufreq/policy0/stats/total_trans
>> without patchset : 1230
>> with patchset : 14
>>
>> If we replace the cfs thread of rt-app by a sysbench cpu test, we can see
>> performance improvements:
>>
>> - Without patchset :
>> Test execution summary:
>>      total time:                          15.0009s
>>      total number of events:              4903
>>      total time taken by event execution: 14.9972
>>      per-request statistics:
>>           min:                                  1.23ms
>>           avg:                                  3.06ms
>>           max:                                 13.16ms
>>           approx.  95 percentile:              12.73ms
>>
>> Threads fairness:
>>      events (avg/stddev):           4903.0000/0.00
>>      execution time (avg/stddev):   14.9972/0.00
>>
>> - With patchset:
>> Test execution summary:
>>      total time:                          15.0014s
>>      total number of events:              7694
>>      total time taken by event execution: 14.9979
>>      per-request statistics:
>>           min:                                  1.23ms
>>           avg:                                  1.95ms
>>           max:                                 10.49ms
>>           approx.  95 percentile:              10.39ms
>>
>> Threads fairness:
>>      events (avg/stddev):           7694.0000/0.00
>>      execution time (avg/stddev):   14.9979/0.00
>>
>> The performance improvement is 56% for this use case.
>
> How do you get this number? Normally we use the 'total time' value.


The test stops after a defined amount of time, --max-time=15 in my case, so
the 'total time' is always ~15s; the improvement is the increase in the number
of events completed within that time (7694 vs 4903).

>
> [...]