[RFC,0/5] CFS load tracking trace events

Message ID 20170328063541.12912-1-dietmar.eggemann@arm.com

Message

Dietmar Eggemann March 28, 2017, 6:35 a.m. UTC
This patch-set introduces trace events for load (and utilization)
tracking for the following three cfs scheduler bricks: cfs_rq,
sched_entity and task_group.

I've decided to send it out because people are discussing problems with
PELT-related functionality and to provide a base for the related
discussion next week at the OSPM summit in Pisa.

The requirements for the design are:

 (1) Support for all combinations related to the following kernel
     configuration options: CONFIG_SMP, CONFIG_FAIR_GROUP_SCHED,
     CONFIG_SCHED_AUTOGROUP, CONFIG_DEBUG_KERNEL.

 (2) All key=value pairs have to appear in all combinations of the
     kernel configuration options mentioned under (1).

 (3) Stable interface, i.e. only use keys for key=value pairs which
     are independent of the underlying (load tracking) implementation.

 (4) Minimally invasive to the actual cfs scheduler code, i.e. the trace
     events should not have to be guarded by an if condition (e.g.
     entity_is_task(se) to distinguish between a task and a task_group)
     or by a kernel configuration switch.

The trace events only expose one load (and one utilization) key=value
pair besides those used to identify the cfs scheduler brick. They do
not provide any internal implementation details of the actual (PELT)
load-tracking mechanism.
This limitation might be too strict, since for a cfs_rq we can only trace
either runnable load (cfs_rq->runnable_load_avg) or runnable/blocked load
(cfs_rq->avg.load_avg).
Other load metrics like instantaneous load ({cfs_rq|se}->load.weight)
are not traced but could be easily added.
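
To make the shape of the interface concrete, here is a minimal sketch of
what the cfs_rq event could look like. The prototype (passing cpu, path,
load and util directly) and the field names are assumptions for this
example; the actual patches may instead pass the cfs_rq pointer and
derive the fields inside TP_fast_assign():

    TRACE_EVENT(sched_load_cfs_rq,

        TP_PROTO(int cpu, const char *path, unsigned long load,
                 unsigned long util),

        TP_ARGS(cpu, path, load, util),

        TP_STRUCT__entry(
            __field(int, cpu)
            __string(path, path)
            __field(unsigned long, load)
            __field(unsigned long, util)
        ),

        TP_fast_assign(
            __entry->cpu  = cpu;
            __assign_str(path, path);
            __entry->load = load;
            __entry->util = util;
        ),

        TP_printk("cpu=%d path=%s load=%lu util=%lu",
                  __entry->cpu, __get_str(path), __entry->load,
                  __entry->util)
    );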

The following keys are used to identify the cfs scheduler brick:

 (1) Cpu number the cfs scheduler brick is attached to.

 (2) Task_group path and (css) id.

 (3) Task name and pid.

In case a key does not apply, either because of an unset kernel
configuration option or because a sched_entity can represent either a
task or a task_group, its value is set to an invalid default (a rough
sketch of this default handling follows the list):

 (1) Task_group path: "(null)"

 (2) Task group id:   -1

 (3) Task name:       "(null)"

 (4) Task pid:        -1
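
As an illustration of how these defaults could be filled in for a
sched_entity, here is a rough TP_fast_assign() fragment. It assumes a
comm field declared as a TASK_COMM_LEN char array, and it assumes
entity_is_task() and task_of() are reachable from the trace point,
which, as noted further down, they currently are not outside
kernel/sched/fair.c:

    TP_fast_assign(
        /* NULL for a group entity, i.e. name/pid get the defaults */
        struct task_struct *p = entity_is_task(se) ? task_of(se) : NULL;

        __entry->pid  = p ? p->pid : -1;
        strlcpy(__entry->comm, p ? p->comm : "(null)", TASK_COMM_LEN);
        __entry->load = se->avg.load_avg;
        __entry->util = se->avg.util_avg;
    ),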

Load tracking trace events are placed into kernel/sched/fair.c:

 (1) for cfs_rq:

     - In PELT core function __update_load_avg().
    
     - During sched_entity attach/detach.

     - During sched_entity load/util propagation down the task_group
       hierarchy.

 (2) for sched_entity:

     - In PELT core function __update_load_avg().
    
     - During sched_entity load/util propagation down the task_group
       hierarchy.

 (3) for task_group:

     - In its PELT update function.

An alternative to using __update_load_avg() would be to put the trace
events into update_cfs_rq_load_avg() for cfs_rq's and into
set_task_rq_fair(), update_load_avg() and sync_entity_load_avg() for
sched_entities. This would also get rid of the if (cfs_rq)/else condition
in __update_load_avg().
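
For concreteness, the current placement in __update_load_avg() might look
roughly like the following; the trace point names are assumed for this
example:

    static __always_inline int
    __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
                      unsigned long weight, int running, struct cfs_rq *cfs_rq)
    {
        ...

        if (cfs_rq)
            trace_sched_load_cfs_rq(cfs_rq);
        else
            trace_sched_load_se(container_of(sa, struct sched_entity, avg));

        return decayed;
    }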

These trace events still look a bit fragile. First, this patch-set has to
use cfs-specific data structures in the global task scheduler trace file
include/trace/events/sched.h. Second, the accessor functions (rq_of(),
task_of(), etc.) are private to the cfs scheduler class; if they were
public, these trace events would be easier to code. That said,
group_cfs_rq() is already made public by this patch-set.

This version is based on tip/sched/core as of yesterday (bc4278987e38).
It has been compile-tested on ~160 configurations via 0day's kbuild test
robot.

Dietmar Eggemann (5):
  sched/autogroup: Define autogroup_path() for !CONFIG_SCHED_DEBUG
  sched/events: Introduce cfs_rq load tracking trace event
  sched/fair: Export group_cfs_rq()
  sched/events: Introduce sched_entity load tracking trace event
  sched/events: Introduce task_group load tracking trace event

 include/linux/sched.h        |  10 +++
 include/trace/events/sched.h | 143 +++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/autogroup.c     |   2 -
 kernel/sched/autogroup.h     |   2 -
 kernel/sched/fair.c          |  26 ++++----
 5 files changed, 167 insertions(+), 16 deletions(-)

-- 
2.11.0

Comments

Vincent Guittot March 28, 2017, 10:05 a.m. UTC | #1
On 28 March 2017 at 08:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> [...]
>
> The following keys are used to identify the cfs scheduler brick:
>
>  (1) Cpu number the cfs scheduler brick is attached to.
>
>  (2) Task_group path and (css) id.
>
>  (3) Task name and pid.

Do you really need both path/name and id/pid ?

The path/name looks quite intrusive so can't we just use id/pid ?

[...]

Dietmar Eggemann March 28, 2017, 1:45 p.m. UTC | #2
On 03/28/2017 12:05 PM, Vincent Guittot wrote:
> On 28 March 2017 at 08:35, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

[...]

>> The following keys are used to identify the cfs scheduler brick:
>>
>>  (1) Cpu number the cfs scheduler brick is attached to.
>>
>>  (2) Task_group path and (css) id.
>>
>>  (3) Task name and pid.
>
> Do you really need both path/name and id/pid ?
>
> The path/name looks quite intrusive so can't we just use id/pid ?

One problem is that all autogroups use id=0.

Another thing with task_groups is that dealing with path="/tg1/tg11" is 
so much more intuitive than id="7".

IMHO, we do need task name and pid to be able to clearly identify a task
(same name but different pid, or during the fork phase, where the forkee
still has the name of the forker).

You're right, the implementation with path is more complicated, but I
guess it's worth it. We could get rid of 'id' though.
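
For reference, resolving the path key could look roughly like the sketch
below. The helper name and buffer handling are assumptions for this
example; task_group_is_autogroup(), autogroup_path() and cgroup_path()
are the existing kernel helpers:

    /* hypothetical helper, illustrative only */
    static inline void cfs_rq_tg_path(struct cfs_rq *cfs_rq, char *path, int len)
    {
    #ifdef CONFIG_FAIR_GROUP_SCHED
        if (cfs_rq->tg && task_group_is_autogroup(cfs_rq->tg))
            autogroup_path(cfs_rq->tg, path, len);
        else if (cfs_rq->tg && cfs_rq->tg->css.cgroup)
            cgroup_path(cfs_rq->tg->css.cgroup, path, len);
        else
    #endif
            strlcpy(path, "(null)", len);
    }

This would cover both autogroups (where the css id is always 0) and
regular cgroups, and would make a separate id key redundant.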

[...]