
[7/7,v3] sched: fix wrong utilization accounting when switching to fair class

Message ID 1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
State Superseded

Commit Message

Vincent Guittot Sept. 12, 2016, 7:47 a.m. UTC
When a task switches to the fair scheduling class, the period between now
and the last update of its utilization is accounted as running time,
whatever happened during this period. This wrong accounting applies to
the task and also to the task group branch.

When changing a property of a running task, like its list of allowed CPUs
or its scheduling class, we follow the sequence:
-dequeue task
-put task
-change the property
-set task as current task
-enqueue task

The end of this sequence doesn't follow the normal ordering, which is:
-enqueue a task
-then set the task as current task.

This wrong ordering is the root cause of wrong utilization accounting.
Update the sequence to follow the right one:
-dequeue task
-put task
-change the property
-enqueue task
-set task as current task
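
The corrected ordering maps to code like the following (a minimal sketch
of the pattern this patch applies in kernel/sched/core.c; locking and the
surrounding function are elided):

	queued = task_on_rq_queued(p);
	running = task_current(rq, p);

	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE);
	if (running)
		put_prev_task(rq, p);

	/* change the property: allowed CPUs, priority, sched class, ... */

	if (queued)
		enqueue_task(rq, p, ENQUEUE_RESTORE);
	if (running)
		p->sched_class->set_curr_task(rq);	/* now after the enqueue */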

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/core.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

-- 
1.9.1

Comments

Vincent Guittot Sept. 15, 2016, 3:36 p.m. UTC | #1
On 15 September 2016 at 15:18, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>> When a task switches to the fair scheduling class, the period between now
>> and the last update of its utilization is accounted as running time,
>> whatever happened during this period. This wrong accounting applies to
>> the task and also to the task group branch.
>>
>> When changing a property of a running task, like its list of allowed CPUs
>> or its scheduling class, we follow the sequence:
>> -dequeue task
>> -put task
>> -change the property
>> -set task as current task
>> -enqueue task
>>
>> The end of this sequence doesn't follow the normal ordering, which is:
>> -enqueue a task
>> -then set the task as current task.
>>
>> This wrong ordering is the root cause of wrong utilization accounting.
>> Update the sequence to follow the right one:
>> -dequeue task
>> -put task
>> -change the property
>> -enqueue task
>> -set task as current task
>
> But enqueue_entity depends on cfs_rq->curr, which is set by
> set_curr_task_fair().

With this sequence, cfs_rq->curr is NULL and the cfs_rq is "idle", as
the entity has been dequeued and put back in the rb tree for the time
it takes to change the properties.

enqueue_entity() uses cfs_rq->curr == se for two things:
- updating current: with this sequence, current is now NULL, so there
is nothing to do
- skipping the enqueue of the se in the rb tree: with this sequence,
the se is put in the rb tree during the enqueue and taken back out
when the task is set as current task

I don't see any functional issue, but we are not doing the same steps
with the new sequence
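
For reference, the two uses of cfs_rq->curr in enqueue_entity() look
roughly like this (a trimmed sketch of the fair.c code of that time, not
the exact function):

	static void
	enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
	{
		/*
		 * With the new ordering, cfs_rq->curr is NULL here, so
		 * update_curr() has no current entity to account.
		 */
		update_curr(cfs_rq);
		...
		/*
		 * And se != cfs_rq->curr now always holds, so the se always
		 * goes into the rb tree.
		 */
		if (se != cfs_rq->curr)
			__enqueue_entity(cfs_rq, se);
		se->on_rq = 1;
	}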

>
> Also, the normalize comment in dequeue_entity() worries me, 'someone'
> didn't update that when he moved update_min_vruntime() around.
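
(For context, the comment in question sits on the normalization at the
end of dequeue_entity(); roughly, from the fair.c of that time:

	/*
	 * Normalize the entity after updating the min_vruntime because the
	 * update can refer to the ->curr item and we need to reflect this
	 * movement in our normalized position.
	 */
	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime;

The "after" in that comment no longer matches the code, since
update_min_vruntime() had been moved to after the normalization.)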
Vincent Guittot Sept. 16, 2016, 12:45 p.m. UTC | #2
On 16 September 2016 at 12:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>
> > -dequeue task
> > -put task
> > -change the property
> > -enqueue task
> > -set task as current task
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 3e52d08..7a9c9b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1105,10 +1105,10 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
> >
> >       p->sched_class->set_cpus_allowed(p, new_mask);
> >
> > -     if (running)
> > -             p->sched_class->set_curr_task(rq);
> >       if (queued)
> >               enqueue_task(rq, p, ENQUEUE_RESTORE);
> > +     if (running)
> > +             p->sched_class->set_curr_task(rq);
> >  }
> >
> >  /*
>
> So one thing that I've wanted to do for a while, but never managed to
> come up with a sensible way to do, is encapsulate this pattern.
>
> The two options I came up with are:
>
> #define FOO(p, stmt) \
> ({ \
>         struct rq *rq = task_rq(p); \
>         bool queued = task_on_rq_queued(p); \
>         bool running = task_current(rq, p); \
>         int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */ \
>  \
>         if (queued) \
>                 dequeue_task(rq, p, queue_flags); \
>         if (running) \
>                 put_prev_task(rq, p); \
>  \
>         stmt; \
>  \
>         if (queued) \
>                 enqueue_task(rq, p, queue_flags); \
>         if (running) \
>                 set_curr_task(rq, p); \
> })
>
> and
>
> void foo(struct task_struct *p, void (*func)(struct task_struct *, int *))
> {
>         struct rq *rq = task_rq(p);
>         bool queued = task_on_rq_queued(p);
>         bool running = task_current(rq, p);
>         int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */
>
>         if (queued)
>                 dequeue_task(rq, p, queue_flags);
>         if (running)
>                 put_prev_task(rq, p);
>
>         func(p, &queue_flags);
>
>         if (queued)
>                 enqueue_task(rq, p, queue_flags);
>         if (running)
>                 set_curr_task(rq, p);
> }
>
> Neither results in particularly pretty code. Although I suppose if I'd
> have to pick one I'd go for the macro variant.
>
> Opinions? I'm fine with leaving the code as is, just wanted to throw
> this out there.

I'm not convinced by such an encapsulation, as it adds the constraint
of having a function to pass, which is not always the case, and it hides
a bit what happens in this function.
What about creating a task_FOO_save and a task_FOO_restore macro? Something like:

#define task_FOO_save(p, rq, flags) \
({ \
        bool queued = task_on_rq_queued(p); \
        bool running = task_current(rq, p); \
        int queue_flags = DEQUEUE_SAVE; /* also ENQUEUE_RESTORE */ \
 \
        if (queued) \
                dequeue_task(rq, p, queue_flags); \
        if (running) \
                put_prev_task(rq, p); \
        flags = queued | running << 1; \
})

#define task_FOO_restore(p, rq, flags) \
({ \
        bool queued = flags & 0x1; \
        bool running = flags & 0x2; \
        int queue_flags = ENQUEUE_RESTORE; \
 \
        if (queued) \
                enqueue_task(rq, p, queue_flags); \
        if (running) \
                set_curr_task(rq, p); \
})
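
As a usage sketch (hypothetical, built only on the two macros above),
do_set_cpus_allowed() would then shrink to something like:

void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
{
	struct rq *rq = task_rq(p);
	int flags;

	task_FOO_save(p, rq, flags);
	p->sched_class->set_cpus_allowed(p, new_mask);
	task_FOO_restore(p, rq, flags);
}

Packing queued/running into flags means the two macros don't have to
share local variables across the property change.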
Vincent Guittot Sept. 16, 2016, 2:23 p.m. UTC | #3
On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 15, 2016 at 05:36:58PM +0200, Vincent Guittot wrote:
>> On 15 September 2016 at 15:18, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Sep 12, 2016 at 09:47:52AM +0200, Vincent Guittot wrote:
>
>> >> Update the sequence to follow the right one:
>> >> -dequeue task
>> >> -put task
>> >> -change the property
>> >> -enqueue task
>> >> -set task as current task
>> >
>> > But enqueue_entity depends on cfs_rq->curr, which is set by
>> > set_curr_task_fair().
>>
>> With this sequence, cfs_rq->curr is NULL and the cfs_rq is "idle", as
>> the entity has been dequeued and put back in the rb tree for the time
>> it takes to change the properties.
>>
>> enqueue_entity() uses cfs_rq->curr == se for two things:
>> - updating current: with this sequence, current is now NULL, so there
>> is nothing to do
>> - skipping the enqueue of the se in the rb tree: with this sequence,
>> the se is put in the rb tree during the enqueue and taken back out
>> when the task is set as current task
>>
>> I don't see any functional issue, but we are not doing the same steps
>> with the new sequence
>
> So I think you're right in that it should work.
>
> I also think we can then simplify enqueue_entity() in that it will never
> be possible to enqueue current with your change.
>
> But my brain just isn't working today, so who knows.
>
>> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
>> > didn't update that when he moved update_min_vruntime() around.
>
> I now worry more, so we do:
>
>         dequeue_task := dequeue_task_fair (p == current)
>           dequeue_entity
>             update_curr()
>               update_min_vruntime()
>             vruntime -= min_vruntime
>             update_min_vruntime()
>               // use cfs_rq->curr, which we just normalized !

Yes, but does it really change cfs_rq->min_vruntime in this case?

If curr is the task with the smallest vruntime of the cfs_rq,
cfs_rq->min_vruntime has been aligned with curr->vruntime during
update_curr(). So vruntime -= min_vruntime is guaranteed to be less than
cfs_rq->min_vruntime, and cfs_rq->min_vruntime stays unchanged.

If curr is not the task with the smallest vruntime of the cfs_rq,
cfs_rq->min_vruntime has been aligned with the leftmost entity, and
vruntime -= min_vruntime will not change anything during the 2nd
update_min_vruntime(), as it will be either greater than
leftmost->vruntime or less than cfs_rq->min_vruntime.
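
A quick worked example of the first case, with made-up numbers: take
cfs_rq->min_vruntime == 100 and curr leftmost with curr->vruntime == 100.
update_curr() has already aligned min_vruntime to 100; the normalization
sets curr->vruntime = 100 - 100 = 0, and the 2nd update_min_vruntime()
computes max_vruntime(100, 0) == 100, so min_vruntime indeed stays
unchanged. (At least as long as the arithmetic doesn't wrap; see Peter's
follow-up below.)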

>
>         put_prev_task := put_prev_task_fair
>           put_prev_entity
>             cfs_rq->curr = NULL;
>
>
> Now the point of the latter update_min_vruntime() is to advance
> min_vruntime when the task we removed was the one holding it back.
>
> However, it means that if we do dequeue+enqueue, we're further in the
> future (ie. we get penalized).
>
> So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
> But as said above, my brain isn't co-operating much today.
Vincent Guittot Sept. 20, 2016, 1:06 p.m. UTC | #4
On 20 September 2016 at 13:54, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Sep 16, 2016 at 04:23:16PM +0200, Vincent Guittot wrote:
>> On 16 September 2016 at 14:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> >> > Also, the normalize comment in dequeue_entity() worries me, 'someone'
>> >> > didn't update that when he moved update_min_vruntime() around.
>> >
>> > I now worry more, so we do:
>> >
>> >         dequeue_task := dequeue_task_fair (p == current)
>> >           dequeue_entity
>> >             update_curr()
>> >               update_min_vruntime()
>> >             vruntime -= min_vruntime
>> >             update_min_vruntime()
>> >               // use cfs_rq->curr, which we just normalized !
>>
>> Yes, but does it really change cfs_rq->min_vruntime in this case?
>
> So let me see; it does:
>
>         vruntime = cfs_rq->min_vruntime;
>
>         if (curr) // true
>           vruntime = curr->vruntime; // == vruntime - min_vruntime
>
>         if (leftmost) // possible
>           if (curr) // true
>             vruntime = min_vruntime(vruntime, se->vruntime);
>               if (se->vruntime - (curr->vruntime - min_vruntime)) < 0 // false
>
>         min_vruntime = max_vruntime(min_vruntime, vruntime);
>           if ((curr->vruntime - min_vruntime) - min_vruntime) > 0)
>
> The problem is that double subtraction of min_vruntime can wrap.
> The thing is, min_vruntime is the 0-point in our modular space, it
> normalizes vruntime (ideally min_vruntime would be our 0-lag point,
> resulting in vruntime - min_vruntime being the lag).
>
> The moment min_vruntime grows past S64_MAX/2, -2*min_vruntime wraps into

Fair enough.

> positive space again and the test above becomes true and we'll select
> the normalized @curr vruntime as new min_vruntime and weird stuff will
> happen.
>
> Also, even if things magically worked out, it's still very icky to mix
> the normalized vruntime into things.

I agree
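
For reference, the vruntime comparisons in fair.c are written to be
wrap-safe for a single subtraction; a sketch of max_vruntime(), which
should match the in-tree helper:

static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max_vruntime);
	if (delta > 0)
		max_vruntime = vruntime;

	return max_vruntime;
}

The (s64) cast keeps one wrap of the u64 space comparing correctly, but a
value that has had min_vruntime subtracted twice can land anywhere in that
space, which is the breakage described above.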

>
>> >         put_prev_task := put_prev_task_fair
>> >           put_prev_entity
>> >             cfs_rq->curr = NULL;
>> >
>> > Now the point of the latter update_min_vruntime() is to advance
>> > min_vruntime when the task we removed was the one holding it back.
>> >
>> > However, it means that if we do dequeue+enqueue, we're further in the
>> > future (ie. we get penalized).
>> >
>> > So I'm inclined to simply remove the (2nd) update_min_vruntime() call.
>> > But as said above, my brain isn't co-operating much today.
>
> OK, so not sure we can actually remove it, we do want it to move
> min_vruntime forward (sometimes). We just don't want it to do so when
> DEQUEUE_SAVE -- we want to get back where we left off, nor do we want to
> muck about with touching normalized values.
>
> Another fun corner case is DEQUEUE_SLEEP; in that case we do not
> normalize, but we still want to advance min_vruntime if this was the one
> holding it back.
>
> I ended up with the below, but I'm not sure I like it much. Let me prod
> a wee bit more to see if there's not something else we can do.
>
> Google has this patch-set replacing min_vruntime with an actual global
> 0-lag, which greatly simplifies things. If only they'd post it sometime
> :/ /me prods pjt and ben with a sharp stick :-)
>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 986c10c25176..77566a340cbf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -462,17 +462,23 @@ static inline int entity_before(struct sched_entity *a,
>
>  static void update_min_vruntime(struct cfs_rq *cfs_rq)
>  {
> +       struct sched_entity *curr = cfs_rq->curr;
> +
>         u64 vruntime = cfs_rq->min_vruntime;
>
> -       if (cfs_rq->curr)
> -               vruntime = cfs_rq->curr->vruntime;
> +       if (curr) {
> +               if (curr->on_rq)
> +                       vruntime = curr->vruntime;
> +               else
> +                       curr = NULL;
> +       }
>
>         if (cfs_rq->rb_leftmost) {
>                 struct sched_entity *se = rb_entry(cfs_rq->rb_leftmost,
>                                                    struct sched_entity,
>                                                    run_node);
>
> -               if (!cfs_rq->curr)
> +               if (!curr)
>                         vruntime = se->vruntime;
>                 else
>                         vruntime = min_vruntime(vruntime, se->vruntime);
> @@ -3483,8 +3489,16 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>         /* return excess runtime on last dequeue */
>         return_cfs_rq_runtime(cfs_rq);
>
> -       update_min_vruntime(cfs_rq);
>         update_cfs_shares(cfs_rq);
> +
> +       /*
> +        * Now advance min_vruntime if @se was the entity holding it back,
> +        * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
> +        * put back on, and if we advance min_vruntime, we'll be placed back
> +        * further than we started -- ie. we'll be penalized.
> +        */
> +       if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
> +               update_min_vruntime(cfs_rq);
>  }
>
>  /*

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3e52d08..7a9c9b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1105,10 +1105,10 @@  void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
 
 	p->sched_class->set_cpus_allowed(p, new_mask);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 }
 
 /*
@@ -3687,10 +3687,10 @@  void rt_mutex_setprio(struct task_struct *p, int prio)
 
 	p->prio = prio;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, queue_flag);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 out_unlock:
@@ -4243,8 +4243,6 @@  static int __sched_setscheduler(struct task_struct *p,
 	prev_class = p->sched_class;
 	__setscheduler(rq, p, attr, pi);
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued) {
 		/*
 		 * We enqueue to tail when the priority of a task is
@@ -4255,6 +4253,8 @@  static int __sched_setscheduler(struct task_struct *p,
 
 		enqueue_task(rq, p, queue_flags);
 	}
+	if (running)
+		p->sched_class->set_curr_task(rq);
 
 	check_class_changed(rq, p, prev_class, oldprio);
 	preempt_disable(); /* avoid rq from going away on us */
@@ -5417,10 +5417,10 @@  void sched_setnuma(struct task_struct *p, int nid)
 
 	p->numa_preferred_nid = nid;
 
-	if (running)
-		p->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, p, ENQUEUE_RESTORE);
+	if (running)
+		p->sched_class->set_curr_task(rq);
 	task_rq_unlock(rq, p, &rf);
 }
 #endif /* CONFIG_NUMA_BALANCING */
@@ -7868,10 +7868,10 @@  void sched_move_task(struct task_struct *tsk)
 
 	sched_change_group(tsk, TASK_MOVE_GROUP);
 
-	if (unlikely(running))
-		tsk->sched_class->set_curr_task(rq);
 	if (queued)
 		enqueue_task(rq, tsk, ENQUEUE_RESTORE | ENQUEUE_MOVE);
+	if (unlikely(running))
+		tsk->sched_class->set_curr_task(rq);
 
 	task_rq_unlock(rq, tsk, &rf);
 }