Message ID | 1498885573-18984-1-git-send-email-vincent.guittot@linaro.org |
---|---|
State | Accepted |
Commit | f235a54f00449c611f85173fe8a66c4d189c5ce1 |
On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
> The running state is a subset of runnable state which means that running
> can't be set if runnable (weight) is cleared. There are corner cases
> where the current sched_entity has already been dequeued but cfs_rq->curr
> has not been updated yet and still points to the dequeued sched_entity.
> If ___update_load_avg is called at that time, weight will be 0 and running
> will be set, which is not possible.
>
> This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
> The current sched_entity has been dequeued so se->on_rq is cleared and
> cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
> cleared when picking the idle thread). Because the cfs_rq becomes idle,
> idle_balance() is called and ends up calling update_blocked_averages()
> with these wrong running and runnable states.
>
> Add a test in ___update_load_avg to correct the running state in this case.

Cute, however did you find that?
On 4 July 2017 at 09:27, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
>> The running state is a subset of runnable state which means that running
>> can't be set if runnable (weight) is cleared. There are corner cases
>> where the current sched_entity has already been dequeued but cfs_rq->curr
>> has not been updated yet and still points to the dequeued sched_entity.
>> If ___update_load_avg is called at that time, weight will be 0 and running
>> will be set, which is not possible.
>>
>> This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
>> The current sched_entity has been dequeued so se->on_rq is cleared and
>> cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
>> cleared when picking the idle thread). Because the cfs_rq becomes idle,
>> idle_balance() is called and ends up calling update_blocked_averages()
>> with these wrong running and runnable states.
>>
>> Add a test in ___update_load_avg to correct the running state in this case.
>
> Cute, however did you find that?

In fact, while rebasing and running more tests on my patch "update scale
invariance of PELT", which changes how the load and utilization are scaled,
I have seen that sometimes the utilization was increasing but not the load
when the CPU was going into idle state, because the stolen_idle time was
applied as idle time for the load but as running time for the utilization.
This patch has highlighted the problem.
On Tue, Jul 04, 2017 at 09:27:07AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
> > The running state is a subset of runnable state which means that running
> > can't be set if runnable (weight) is cleared. There are corner cases
> > where the current sched_entity has already been dequeued but cfs_rq->curr
> > has not been updated yet and still points to the dequeued sched_entity.
> > If ___update_load_avg is called at that time, weight will be 0 and running
> > will be set, which is not possible.
> >
> > This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
> > The current sched_entity has been dequeued so se->on_rq is cleared and
> > cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
> > cleared when picking the idle thread). Because the cfs_rq becomes idle,
> > idle_balance() is called and ends up calling update_blocked_averages()
> > with these wrong running and runnable states.
> >
> > Add a test in ___update_load_avg to correct the running state in this case.
>
> Cute, however did you find that?

Hmm,.. could you give a little more detail?

Because if ->on_rq=0, we'll have done dequeue_task() which will have
done update_curr() with ->on_rq, weight and ->running consistently.

Then the above, inconsistent update should not happen, because delta=0.
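Peter's delta=0 argument rests on the early exit at the top of ___update_load_avg(): elapsed time is measured in ~1us (1024ns) units, and if no full unit has passed since sa->last_update_time the function returns before touching any of the sums. The following is a simplified sketch of that check, abbreviated from kernel/sched/fair.c of this era (the accumulation step itself is elided):

/*
 * Simplified sketch of the early exit in ___update_load_avg();
 * abbreviated, not a verbatim copy of the kernel sources.
 */
static __always_inline int
___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
		   unsigned long weight, int running, struct cfs_rq *cfs_rq)
{
	u64 delta;

	delta = now - sa->last_update_time;

	/* 1024ns is used as the unit of measurement, i.e. roughly 1us */
	delta >>= 10;
	if (!delta)
		return 0;	/* same rq clock instant: nothing accrues */

	sa->last_update_time += delta << 10;

	/* ... accrue load_sum/util_sum according to weight and running ... */
	return 1;
}

So an inconsistent weight/running pair only matters once 'now' has moved on by at least 1us since the dequeue, which is exactly what Vincent measures below.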
On 4 July 2017 at 10:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 04, 2017 at 09:27:07AM +0200, Peter Zijlstra wrote:
>> On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
>> > The running state is a subset of runnable state which means that running
>> > can't be set if runnable (weight) is cleared. There are corner cases
>> > where the current sched_entity has already been dequeued but cfs_rq->curr
>> > has not been updated yet and still points to the dequeued sched_entity.
>> > If ___update_load_avg is called at that time, weight will be 0 and running
>> > will be set, which is not possible.
>> >
>> > This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
>> > The current sched_entity has been dequeued so se->on_rq is cleared and
>> > cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
>> > cleared when picking the idle thread). Because the cfs_rq becomes idle,
>> > idle_balance() is called and ends up calling update_blocked_averages()
>> > with these wrong running and runnable states.
>> >
>> > Add a test in ___update_load_avg to correct the running state in this case.
>>
>> Cute, however did you find that?
>
> Hmm,.. could you give a little more detail?
>
> Because if ->on_rq=0, we'll have done dequeue_task() which will have
> done update_curr() with ->on_rq, weight and ->running consistently.
>
> Then the above, inconsistent update should not happen, because delta=0.

In fact, the delta between dequeue_entity_load_avg() and
update_blocked_averages() is not 0 on my platform (hikey) but can be
longer than 60us (at the lowest frequency with only 1 task group level).
On Tue, Jul 04, 2017 at 11:12:34AM +0200, Vincent Guittot wrote:
> On 4 July 2017 at 10:34, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Jul 04, 2017 at 09:27:07AM +0200, Peter Zijlstra wrote:
> >> On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
> >> > The running state is a subset of runnable state which means that running
> >> > can't be set if runnable (weight) is cleared. There are corner cases
> >> > where the current sched_entity has already been dequeued but cfs_rq->curr
> >> > has not been updated yet and still points to the dequeued sched_entity.
> >> > If ___update_load_avg is called at that time, weight will be 0 and running
> >> > will be set, which is not possible.
> >> >
> >> > This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
> >> > The current sched_entity has been dequeued so se->on_rq is cleared and
> >> > cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
> >> > cleared when picking the idle thread). Because the cfs_rq becomes idle,
> >> > idle_balance() is called and ends up calling update_blocked_averages()
> >> > with these wrong running and runnable states.
> >> >
> >> > Add a test in ___update_load_avg to correct the running state in this case.
> >>
> >> Cute, however did you find that?
> >
> > Hmm,.. could you give a little more detail?
> >
> > Because if ->on_rq=0, we'll have done dequeue_task() which will have
> > done update_curr() with ->on_rq, weight and ->running consistently.
> >
> > Then the above, inconsistent update should not happen, because delta=0.
>
> In fact, the delta between dequeue_entity_load_avg() and
> update_blocked_averages() is not 0 on my platform (hikey) but can be
> longer than 60us (at the lowest frequency with only 1 task group level).

But but but, how can that happen? Should it not all be under the same
rq->lock and thus have only a single update_rq_clock() and thus be at
the same 'instant'?
On 4 July 2017 at 11:44, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Jul 04, 2017 at 11:12:34AM +0200, Vincent Guittot wrote:
>> On 4 July 2017 at 10:34, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Jul 04, 2017 at 09:27:07AM +0200, Peter Zijlstra wrote:
>> >> On Sat, Jul 01, 2017 at 07:06:13AM +0200, Vincent Guittot wrote:
>> >> > The running state is a subset of runnable state which means that running
>> >> > can't be set if runnable (weight) is cleared. There are corner cases
>> >> > where the current sched_entity has already been dequeued but cfs_rq->curr
>> >> > has not been updated yet and still points to the dequeued sched_entity.
>> >> > If ___update_load_avg is called at that time, weight will be 0 and running
>> >> > will be set, which is not possible.
>> >> >
>> >> > This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
>> >> > The current sched_entity has been dequeued so se->on_rq is cleared and
>> >> > cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
>> >> > cleared when picking the idle thread). Because the cfs_rq becomes idle,
>> >> > idle_balance() is called and ends up calling update_blocked_averages()
>> >> > with these wrong running and runnable states.
>> >> >
>> >> > Add a test in ___update_load_avg to correct the running state in this case.
>> >>
>> >> Cute, however did you find that?
>> >
>> > Hmm,.. could you give a little more detail?
>> >
>> > Because if ->on_rq=0, we'll have done dequeue_task() which will have
>> > done update_curr() with ->on_rq, weight and ->running consistently.
>> >
>> > Then the above, inconsistent update should not happen, because delta=0.
>>
>> In fact, the delta between dequeue_entity_load_avg() and
>> update_blocked_averages() is not 0 on my platform (hikey) but can be
>> longer than 60us (at the lowest frequency with only 1 task group level).
>
> But but but, how can that happen? Should it not all be under the same
> rq->lock and thus have only a single update_rq_clock() and thus be at
> the same 'instant'?

idle_balance() unlocks rq->lock before calling update_blocked_averages(),
and update_blocked_averages() starts by calling update_rq_clock().
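This is the window in which the stale cfs_rq->curr can be observed: the rq lock is dropped, then update_blocked_averages() re-reads the rq clock, so the PELT update sees a later 'now' than the dequeue did. A rough sketch of that path, abbreviated from kernel/sched/fair.c of this era (the exact locking helper names may differ slightly between versions):

/* Abbreviated sketch, not a verbatim copy of the kernel sources. */
static void update_blocked_averages(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq_flags rf;

	rq_lock_irqsave(rq, &rf);
	update_rq_clock(rq);		/* a new, later clock instant */
	/* ... walk the cfs_rq list and run the PELT updates ... */
	rq_unlock_irqrestore(rq, &rf);
}

static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
{
	int this_cpu = this_rq->cpu;

	/* ... */
	rq_unpin_lock(this_rq, rf);
	raw_spin_unlock(&this_rq->lock);	/* rq->lock dropped here */

	update_blocked_averages(this_cpu);	/* cfs_rq->curr still set */

	/* ... load balancing, then rq->lock is re-taken ... */
	return 0;
}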
On Tue, Jul 04, 2017 at 11:57:12AM +0200, Vincent Guittot wrote:
> On 4 July 2017 at 11:44, Peter Zijlstra <peterz@infradead.org> wrote:
> > But but but, how can that happen? Should it not all be under the same
> > rq->lock and thus have only a single update_rq_clock() and thus be at
> > the same 'instant'?
>
> idle_balance() unlocks rq->lock before calling update_blocked_averages(),
> and update_blocked_averages() starts by calling update_rq_clock().

Ah indeed. Might want to clarify that point.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 008c514..bc36a75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2968,6 +2968,18 @@ ___update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	sa->last_update_time += delta << 10;
 
 	/*
+	 * running is a subset of runnable (weight) so running can't be set if
+	 * runnable is clear. But there are some corner cases where the current
+	 * se has been already dequeued but cfs_rq->curr still points to it.
+	 * This means that weight will be 0 but not running for a sched_entity
+	 * but also for a cfs_rq if the latter becomes idle. As an example,
+	 * this happens during idle_balance() which calls
+	 * update_blocked_averages()
+	 */
+	if (!weight)
+		running = 0;
+
+	/*
 	 * Now we know we crossed measurement unit boundaries. The *_avg
 	 * accrues by two steps:
 	 *
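To make the effect of the added test concrete, here is a small stand-alone model (plain user-space C with made-up numbers, not kernel code) of the asymmetry it prevents: when a 60us window is accounted with weight == 0 but running == 1, utilization keeps accruing while load does not, which matches the behaviour Vincent observed.

#include <stdio.h>

/*
 * Toy model of a single accounting step (illustrative only: the constants,
 * scaling and decay of real PELT are deliberately left out).
 */
struct toy_avg {
	unsigned long load_sum;	/* accrues while runnable (weight != 0) */
	unsigned long util_sum;	/* accrues while running */
};

static void toy_update(struct toy_avg *sa, unsigned long delta_us,
		       unsigned long weight, int running, int apply_fix)
{
	/* The test added by the patch: running implies runnable. */
	if (apply_fix && !weight)
		running = 0;

	if (weight)
		sa->load_sum += delta_us;
	if (running)
		sa->util_sum += delta_us;
}

int main(void)
{
	struct toy_avg buggy = { 0, 0 }, fixed = { 0, 0 };

	/*
	 * 60us elapse between the dequeue and update_blocked_averages()
	 * while cfs_rq->curr still points to the dequeued se: the update
	 * is fed weight == 0 together with running == 1.
	 */
	toy_update(&buggy, 60, 0, 1, 0);
	toy_update(&fixed, 60, 0, 1, 1);

	printf("without fix: load_sum=%lu util_sum=%lu\n",
	       buggy.load_sum, buggy.util_sum);
	printf("with fix:    load_sum=%lu util_sum=%lu\n",
	       fixed.load_sum, fixed.util_sum);
	return 0;
}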
The running state is a subset of runnable state which means that running
can't be set if runnable (weight) is cleared. There are corner cases
where the current sched_entity has already been dequeued but cfs_rq->curr
has not been updated yet and still points to the dequeued sched_entity.
If ___update_load_avg is called at that time, weight will be 0 and running
will be set, which is not possible.

This case happens during pick_next_task_fair() when a cfs_rq becomes idle.
The current sched_entity has been dequeued so se->on_rq is cleared and
cfs_rq->weight is null. But cfs_rq->curr still points to se (it will be
cleared when picking the idle thread). Because the cfs_rq becomes idle,
idle_balance() is called and ends up calling update_blocked_averages()
with these wrong running and runnable states.

Add a test in ___update_load_avg to correct the running state in this case.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

--
2.7.4