Message ID | 20161019132957.GA7509@e105550-lin.cambridge.arm.com |
---|---|
State | New |
On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
>> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
>> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > > So aside from funny BIOSes, this should also show up when creating
>> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
>> > > > think.
>> > >
>> > > The problem is also that the load of the tg->se[cpu] that represents
>> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
>> > > alloc_fair_sched_group
>> > >     for_each_possible_cpu(i) {
>> > >         init_entity_runnable_average(se);
>> > >             sa->load_avg = scale_load_down(se->load.weight);
>> > >
>> > > Initializing sa->load_avg to 1024 for a newly created task makes
>> > > sense as we don't know yet what will be its real load but I'm not sure
>> > > that we have to do the same for se that represents a task group. This
>> > > load should be initialized to 0 and it will increase when task will be
>> > > moved/attached into task group
>> >
>> > Yes, I think that makes sense, not sure how horrible that is with the
>>
>> That should not be that bad because this initial value is only useful for
>> the few dozens of ms that follow the creation of the task group
>
> IMHO, it doesn't make much sense to initialize empty containers, which
> group sched_entities really are, to 1024. It is meant to represent what
> is in it, and at creation it is empty, so in my opinion initializing it
> to zero makes sense.
>
>> > current state of things, but after your propagate patch, that
>> > reinstates the interactivity hack that should work for sure.
>
> It actually works on mainline/tip as well.
>
> As I see it, the fundamental problem is keeping group entities up to
> date. Because the load_weight and hence se->avg.load_avg of each per-cpu
> group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> all cpus (tg->load_avg), including those that might be empty and
> therefore not enqueued, we must ensure that they are updated some other
> way. Most naturally as part of update_blocked_averages().
>
> To guarantee that, it basically boils down to making sure:
> Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> leaf_cfs_rq_list.
>
> We can do that in different ways: 1) Add all cfs_rqs to the
> leaf_cfs_rq_list at task group creation, or 2) initialize group
> sched_entity contributions to zero and make sure that they are added to
> leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> on it.
>
> Vincent's patch below gives us the second option.
>
>>  kernel/sched/fair.c | 9 ++++++++-
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8b03fb5..89776ac 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
>>  	 * will definitely be update (after enqueue).
>>  	 */
>>  	sa->period_contrib = 1023;
>> -	sa->load_avg = scale_load_down(se->load.weight);
>> +	/*
>> +	 * Tasks are initialized with full load to be seen as heavy task until
>> +	 * they get a chance to stabilize to their real load level.
>> +	 * group entity are initialized with null load to reflect the fact that
>> +	 * nothing has been attached yet to the task group.
>> +	 */
>> +	if (entity_is_task(se))
>> +		sa->load_avg = scale_load_down(se->load.weight);
>>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>>  	/*
>>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>
> I would suggest adding a comment somewhere stating that we need to keep
> group cfs_rqs up to date:
>
> -----
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index abb3763dff69..2b820d489be0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
>  		if (throttled_hierarchy(cfs_rq))
>  			continue;
>
> +		/*
> +		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> +		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> +		 * are updated correctly.
> +		 */

As discussed on IRC, the point is that even if the leaf cfs_rq is
added to the leaf_cfs_rq_list, it doesn't ensure that it will be
updated correctly for unplugged CPUs

>  		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
>  			update_tg_load_avg(cfs_rq, 0);
>  	}
> -----
>
> I did a couple of simple tests on tip/sched/core to test whether
> Vincent's fix works even without reflecting group load/util in the group
> hierarchy:
>
> Juno (2xA57+4xA53)
>
> tip:
> grouped hog(1) alone: 2841
> non-grouped hogs(6) alone: 40830
> grouped hog(1): 218
> non-grouped hogs(6): 40580
>
> tip+vg:
> grouped hog alone: 2849
> non-grouped hogs(6) alone: 40831
> grouped hog: 2363
> non-grouped hogs: 38418
>
> See script below for details, but we basically see that the grouped task
> is not getting its 'fair' share on tip, while it does with Vincent's
> patch.
>
> To summarize, I think Vincent's patch makes sense and works :-) More
> testing is needed of course to see if there are other problems.
>
> -----
>
> # Create 100 task groups:
> for i in `seq 1 100`;
> do
>     cgcreate -g cpu:/root/test$i
> done
>
> NCPUS=$(grep -c ^processor /proc/cpuinfo)
>
> # Run single cpu hog inside task group on first cpu _alone_:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench --test=cpu \
>     --num-threads=1 --max-time=5 --max-requests=1000000 run | \
>     awk '{if ($4=="events:") {print "grouped hog(1) alone: " $5}}'
>
> # Run cpu hogs outside task group _alone_:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
>     --max-requests=1000000 run | awk '{if ($4=="events:") \
>     {print "non-grouped hogs('$NCPUS') alone: " $5}}'
>
> # Run cpu hogs outside task group:
> sysbench --test=cpu --num-threads=$NCPUS --max-time=10 \
>     --max-requests=1000000 run | awk '{if ($4=="events:") \
>     {print "non-grouped hogs('$NCPUS'): " $5}}' &
>
> # Run single cpu hog inside task group on first cpu:
> cgexec -g cpu:/root/test100 taskset 0x01 sysbench \
>     --test=cpu --num-threads=1 --max-time=5 \
>     --max-requests=1000000 run | awk '{if ($4=="events:") \
>     {print "grouped hog(1): " $5}}'
>
> wait
>
> # Delete task groups:
> for i in `seq 1 100`;
> do
>     cgdelete -g cpu:/root/test$i
> done
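
One way to read the figures quoted above is to compare how many events the grouped hog completes under contention against what it completes alone; both grouped runs use --max-time=5, so the counts should be roughly comparable. The sketch below only redoes that arithmetic on the reported numbers; the helper name retained_pct is made up for illustration and is not part of the original test script.

```shell
# Percentage of the "alone" event count that the grouped hog still
# reaches when the non-grouped hogs run at the same time.
# retained_pct is a hypothetical helper, not part of the test script.
retained_pct() {
    awk -v contended="$1" -v alone="$2" \
        'BEGIN { printf "%.1f\n", 100 * contended / alone }'
}

# Event counts taken from the Juno results reported in the thread.
echo "tip:    $(retained_pct 218 2841)%"
echo "tip+vg: $(retained_pct 2363 2849)%"
```

Read this way, the grouped hog keeps under a tenth of its standalone throughput on tip and most of it with the patch applied, which is in line with the "fair share" remark in the quoted mail.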
On Wed, Oct 19, 2016 at 07:41:36PM +0200, Vincent Guittot wrote:
> On 19 October 2016 at 15:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, Oct 18, 2016 at 01:56:51PM +0200, Vincent Guittot wrote:
> >> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> >> > On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >> > > On 18 October 2016 at 11:07, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > > So aside from funny BIOSes, this should also show up when creating
> >> > > > cgroups when you have offlined a few CPUs, which is far more common I'd
> >> > > > think.
> >> > >
> >> > > The problem is also that the load of the tg->se[cpu] that represents
> >> > > the tg->cfs_rq[cpu] is initialized to 1024 in:
> >> > > alloc_fair_sched_group
> >> > >     for_each_possible_cpu(i) {
> >> > >         init_entity_runnable_average(se);
> >> > >             sa->load_avg = scale_load_down(se->load.weight);
> >> > >
> >> > > Initializing sa->load_avg to 1024 for a newly created task makes
> >> > > sense as we don't know yet what will be its real load but I'm not sure
> >> > > that we have to do the same for se that represents a task group. This
> >> > > load should be initialized to 0 and it will increase when task will be
> >> > > moved/attached into task group
> >> >
> >> > Yes, I think that makes sense, not sure how horrible that is with the
> >>
> >> That should not be that bad because this initial value is only useful for
> >> the few dozens of ms that follow the creation of the task group
> >
> > IMHO, it doesn't make much sense to initialize empty containers, which
> > group sched_entities really are, to 1024. It is meant to represent what
> > is in it, and at creation it is empty, so in my opinion initializing it
> > to zero makes sense.
> >
> >> > current state of things, but after your propagate patch, that
> >> > reinstates the interactivity hack that should work for sure.
> >
> > It actually works on mainline/tip as well.
> >
> > As I see it, the fundamental problem is keeping group entities up to
> > date. Because the load_weight and hence se->avg.load_avg of each per-cpu
> > group sched_entity depends on the group cfs_rq->tg_load_avg_contrib for
> > all cpus (tg->load_avg), including those that might be empty and
> > therefore not enqueued, we must ensure that they are updated some other
> > way. Most naturally as part of update_blocked_averages().
> >
> > To guarantee that, it basically boils down to making sure:
> > Any cfs_rq with a non-zero tg_load_avg_contrib must be on the
> > leaf_cfs_rq_list.
> >
> > We can do that in different ways: 1) Add all cfs_rqs to the
> > leaf_cfs_rq_list at task group creation, or 2) initialize group
> > sched_entity contributions to zero and make sure that they are added to
> > leaf_cfs_rq_list as soon as a sched_entity (task or group) is enqueued
> > on it.
> >
> > Vincent's patch below gives us the second option.
> >
> >>  kernel/sched/fair.c | 9 ++++++++-
> >>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 8b03fb5..89776ac 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
> >>  	 * will definitely be update (after enqueue).
> >>  	 */
> >>  	sa->period_contrib = 1023;
> >> -	sa->load_avg = scale_load_down(se->load.weight);
> >> +	/*
> >> +	 * Tasks are initialized with full load to be seen as heavy task until
> >> +	 * they get a chance to stabilize to their real load level.
> >> +	 * group entity are initialized with null load to reflect the fact that
> >> +	 * nothing has been attached yet to the task group.
> >> +	 */
> >> +	if (entity_is_task(se))
> >> +		sa->load_avg = scale_load_down(se->load.weight);
> >>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >>  	/*
> >>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
> >
> > I would suggest adding a comment somewhere stating that we need to keep
> > group cfs_rqs up to date:
> >
> > -----
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index abb3763dff69..2b820d489be0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
> >  		if (throttled_hierarchy(cfs_rq))
> >  			continue;
> >
> > +		/*
> > +		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
> > +		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
> > +		 * are updated correctly.
> > +		 */
>
> As discussed on IRC, the point is that even if the leaf cfs_rq is
> added to the leaf_cfs_rq_list, it doesn't ensure that it will be
> updated correctly for unplugged CPUs

Agreed. We have to ensure that tg_load_avg_contrib is zeroed for leaf
cfs_rqs belonging to unplugged cpus. And if we modify the above to say
leaf_cfs_rq_list of an online cpu, then we should be covered I think.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index abb3763dff69..2b820d489be0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6641,6 +6641,11 @@ static void update_blocked_averages(int cpu)
 		if (throttled_hierarchy(cfs_rq))
 			continue;
 
+		/*
+		 * Note that _any_ leaf cfs_rq with a non-zero tg_load_avg_contrib
+		 * _must_ be on the leaf_cfs_rq_list to ensure that group shares
+		 * are updated correctly.
+		 */
 		if (update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq, true))
 			update_tg_load_avg(cfs_rq, 0);
 	}
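
To make the invariant in that comment concrete, here is a rough toy model of why a contribution that is never updated hurts the group. It assumes a deliberately simplified rule in which a group entity's weight on a CPU is the group's shares scaled by that CPU's fraction of the summed per-CPU contributions. The function names, the formula, and the numbers (one busy CPU, five idle ones, 1024 per contribution) are illustrative assumptions, not the kernel's actual share computation.

```shell
# Toy model only: names and formula are hypothetical simplifications,
# not the kernel's share computation.

# Weight of a group entity on one CPU: the group's shares scaled by that
# CPU's fraction of the summed per-CPU contributions.
group_se_weight() {
    tg_shares=$1      # shares configured for the task group
    local_contrib=$2  # contribution of the CPU running the task
    tg_load_avg=$3    # sum of contributions over all CPUs
    echo $(( tg_shares * local_contrib / tg_load_avg ))
}

# Sum one busy CPU's contribution with those left behind on idle CPUs.
tg_load_sum() {
    busy=$1; idle_cpus=$2; stale=$3
    echo $(( busy + idle_cpus * stale ))
}

# One busy CPU (contribution 1024) plus five idle CPUs.
stale_sum=$(tg_load_sum 1024 5 1024)   # idle contributions never updated
clean_sum=$(tg_load_sum 1024 5 0)      # idle contributions are zero

echo "weight with stale contributions:  $(group_se_weight 1024 1024 "$stale_sum")"
echo "weight with zeroed contributions: $(group_se_weight 1024 1024 "$clean_sum")"
```

Under these assumptions the stale contributions cut the group entity's weight to about a sixth of what it gets once the idle CPUs contribute zero, which is the kind of skew the comment is meant to guard against.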