Message ID: 1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
State:      New
On Thu, 08 Dec, at 05:56:54PM, Vincent Guittot wrote:
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on a fork-intensive use case like hackbench,
> where tasks block quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs which have a similar runnable load but a
> lower load_avg.
>
> When the runnable_load_avg values of 2 CPUs are close, we now take into
> account the amount of blocked load as a 2nd selection factor. There are
> now 3 zones for the runnable_load of the rq:
> - [0 .. (runnable_load - imbalance)] : select the new rq, which has
>   significantly less runnable_load
> - ](runnable_load - imbalance) .. (runnable_load + imbalance)[ : the
>   runnable loads are close, so we use load_avg to choose between the 2 rqs
> - [(runnable_load + imbalance) .. ULONG_MAX] : keep the current rq, which
>   has significantly less runnable_load
>
> The scale factor that is currently used for comparing runnable_load
> doesn't work well with small values. As an example, the use of a scaling
> factor fails as soon as this_runnable_load == 0, because we always select
> the local rq even if min_runnable_load is only 1, which doesn't really
> make sense because they are just the same. So instead of a scaling factor,
> we use an absolute margin for runnable_load to detect CPUs with a similar
> runnable_load, and we keep using the scaling factor for blocked load.
>
> For a use case like hackbench, this enables the scheduler to select
> different CPUs during the fork sequence and to spread tasks across the
> system.
>
> Tests have been done on a Hikey board (ARM based octo cores) for several
> kernels. The results below give the min, max, avg and stdev values of 18
> runs with each configuration.
>
> The v4.8+patches configuration also includes the change below, which is
> part of the proposal made by Peter to ensure that the clock is up to date
> when the forked task is attached to the rq.
>
> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
>          __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>  #endif
>      rq = __task_rq_lock(p, &rf);
> +    update_rq_clock(rq);
>      post_init_entity_util_avg(&p->se);
>
>      activate_task(rq, p, 0);
>
> hackbench -P -g 1
>
>        ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
> min    0.049         0.050         0.051      0.048
> avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
> max    0.066         0.068         0.070      0.063
> stdev  +/-9%         +/-9%         +/-8%      +/-9%
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 38 insertions(+), 10 deletions(-)

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>

Peter, Ingo, when you pick this up, would you also consider adding the
following tag, which links to an email describing the problem this patch
solves and the performance test results when it is applied?

Link: https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk
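As a side note on the failure mode described in the changelog above, here is a
minimal userspace sketch (not part of the patch; it assumes imbalance_pct == 125,
the usual scheduler-domain default) showing why the old scaling-factor comparison
always keeps the local rq once this_runnable_load reaches 0:

#include <stdio.h>

/* Old-style comparison: keep the local rq when its scaled load is smaller. */
static int old_keep_local(unsigned long this_load, unsigned long min_load)
{
        int imbalance = 100 + (125 - 100) / 2;  /* 112 with imbalance_pct == 125 */

        return 100 * this_load < imbalance * min_load;
}

int main(void)
{
        /*
         * this_load == 0 and min_load == 1 are essentially the same load, yet
         * 0 < 112 * 1 is always true, so the local rq is always chosen.
         */
        printf("keep local: %d\n", old_keep_local(0, 1));

        /*
         * With the absolute margin introduced by the patch (e.g. 256), 1 is
         * not greater than 0 + 256, so the two rqs are treated as close and
         * avg_load breaks the tie instead.
         */
        return 0;
}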
On 9 December 2016 at 16:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Dec 08, 2016 at 05:56:54PM +0100, Vincent Guittot wrote:
>> @@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>          }
>>
>>          /* Adjust by relative CPU capacity of the group */
>> -        avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
>> +        avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
>> +                    group->sgc->capacity;
>> +        runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
>> +                    group->sgc->capacity;
>>
>>          if (local_group) {
>> -            this_load = avg_load;
>> +            this_runnable_load = runnable_load;
>> +            this_avg_load = avg_load;
>>              this_spare = max_spare_cap;
>>          } else {
>> -            if (avg_load < min_load) {
>> -                min_load = avg_load;
>> +            if (min_runnable_load > (runnable_load + imbalance)) {
>> +                /*
>> +                 * The runnable load is significantly smaller
>> +                 * so we can pick this new cpu
>> +                 */
>> +                min_runnable_load = runnable_load;
>> +                min_avg_load = avg_load;
>> +                idlest = group;
>> +            } else if ((runnable_load < (min_runnable_load + imbalance)) &&
>> +                    (100*min_avg_load > imbalance_scale*avg_load)) {
>> +                /*
>> +                 * The runnable loads are close so we take
>> +                 * into account blocked load through avg_load
>> +                 * which is blocked + runnable load
>> +                 */
>> +                min_avg_load = avg_load;
>>                  idlest = group;
>>              }
>>
>> @@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>>          goto skip_spare;
>>
>>      if (this_spare > task_util(p) / 2 &&
>> -        imbalance*this_spare > 100*most_spare)
>> +        imbalance_scale*this_spare > 100*most_spare)
>>          return NULL;
>>      else if (most_spare > task_util(p) / 2)
>>          return most_spare_sg;
>>
>>  skip_spare:
>> -    if (!idlest || 100*this_load < imbalance*min_load)
>> +    if (!idlest ||
>> +        (min_runnable_load > (this_runnable_load + imbalance)) ||
>> +        ((this_runnable_load < (min_runnable_load + imbalance)) &&
>> +            (100*this_avg_load < imbalance_scale*min_avg_load)))
>>          return NULL;
>>      return idlest;
>>  }
>
> I did the below on top for readability.

Changes look good to me.

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5469,17 +5469,16 @@ find_idlest_group(struct sched_domain *s
>              if (min_runnable_load > (runnable_load + imbalance)) {
>                  /*
>                   * The runnable load is significantly smaller
> -                 * so we can pick this new cpu
> +                 * so we can pick this new cpu
>                   */
>                  min_runnable_load = runnable_load;
>                  min_avg_load = avg_load;
>                  idlest = group;
>              } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> -                    (100*min_avg_load > imbalance_scale*avg_load)) {
> +                   (100*min_avg_load > imbalance_scale*avg_load)) {
>                  /*
> -                 * The runnable loads are close so we take
> -                 * into account blocked load through avg_load
> -                 * which is blocked + runnable load
> +                 * The runnable loads are close so take the
> +                 * blocked load into account through avg_load.
>                  */
>                  min_avg_load = avg_load;
>                  idlest = group;
> @@ -5509,15 +5508,21 @@ find_idlest_group(struct sched_domain *s
>      if (this_spare > task_util(p) / 2 &&
>          imbalance_scale*this_spare > 100*most_spare)
>          return NULL;
> -    else if (most_spare > task_util(p) / 2)
> +
> +    if (most_spare > task_util(p) / 2)
>          return most_spare_sg;
>
>  skip_spare:
> -    if (!idlest ||
> -        (min_runnable_load > (this_runnable_load + imbalance)) ||
> -        ((this_runnable_load < (min_runnable_load + imbalance)) &&
> -            (100*this_avg_load < imbalance_scale*min_avg_load)))
> +    if (!idlest)
> +        return NULL;
> +
> +    if (min_runnable_load > (this_runnable_load + imbalance))
>          return NULL;
> +
> +    if ((this_runnable_load < (min_runnable_load + imbalance)) &&
> +        (100*this_avg_load < imbalance_scale*min_avg_load))
> +        return NULL;
> +
>      return idlest;
>  }
>
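For readers following along outside the kernel tree, the restructured skip_spare
tail can be read as a small predicate. The helper below is only an illustration
(the name and standalone form are invented; the kernel code keeps the checks
inline), showing the same early-return logic that Peter's cleanup exposes:

#include <stdbool.h>
#include <stdio.h>

/* Return true when the local group should be kept (find_idlest_group() returns NULL). */
static bool keep_local_group(unsigned long this_runnable_load,
                             unsigned long this_avg_load,
                             unsigned long min_runnable_load,
                             unsigned long min_avg_load,
                             unsigned long imbalance,
                             int imbalance_scale)
{
        /* The best remote group is significantly busier than the local one. */
        if (min_runnable_load > this_runnable_load + imbalance)
                return true;

        /* Runnable loads are close and the local blocked+runnable load is lower. */
        if (this_runnable_load < min_runnable_load + imbalance &&
            100 * this_avg_load < imbalance_scale * min_avg_load)
                return true;

        return false;
}

int main(void)
{
        /* Local runnable 100 vs remote 500 with margin 256: keep local. */
        printf("%d\n", keep_local_group(100, 300, 500, 400, 256, 112));
        return 0;
}

Splitting the compound condition into separate early returns is exactly the
readability gain of the cleanup: each zone check reads on its own line.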
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1da846b..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5405,16 +5405,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
     struct sched_group *idlest = NULL, *group = sd->groups;
     struct sched_group *most_spare_sg = NULL;
-    unsigned long min_load = ULONG_MAX, this_load = 0;
+    unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+    unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
     unsigned long most_spare = 0, this_spare = 0;
     int load_idx = sd->forkexec_idx;
-    int imbalance = 100 + (sd->imbalance_pct-100)/2;
+    int imbalance_scale = 100 + (sd->imbalance_pct-100)/2;
+    unsigned long imbalance = scale_load_down(NICE_0_LOAD) *
+                (sd->imbalance_pct-100) / 100;
 
     if (sd_flag & SD_BALANCE_WAKE)
         load_idx = sd->wake_idx;
 
     do {
-        unsigned long load, avg_load, spare_cap, max_spare_cap;
+        unsigned long load, avg_load, runnable_load;
+        unsigned long spare_cap, max_spare_cap;
         int local_group;
         int i;
 
@@ -5431,6 +5435,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
          * the group containing the CPU with most spare capacity.
          */
         avg_load = 0;
+        runnable_load = 0;
         max_spare_cap = 0;
 
         for_each_cpu(i, sched_group_cpus(group)) {
@@ -5440,7 +5445,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
             else
                 load = target_load(i, load_idx);
 
-            avg_load += load;
+            runnable_load += load;
+
+            avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 
             spare_cap = capacity_spare_wake(i, p);
 
@@ -5449,14 +5456,32 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
         }
 
         /* Adjust by relative CPU capacity of the group */
-        avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+        avg_load = (avg_load * SCHED_CAPACITY_SCALE) /
+                    group->sgc->capacity;
+        runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+                    group->sgc->capacity;
 
         if (local_group) {
-            this_load = avg_load;
+            this_runnable_load = runnable_load;
+            this_avg_load = avg_load;
             this_spare = max_spare_cap;
         } else {
-            if (avg_load < min_load) {
-                min_load = avg_load;
+            if (min_runnable_load > (runnable_load + imbalance)) {
+                /*
+                 * The runnable load is significantly smaller
+                 * so we can pick this new cpu
+                 */
+                min_runnable_load = runnable_load;
+                min_avg_load = avg_load;
+                idlest = group;
+            } else if ((runnable_load < (min_runnable_load + imbalance)) &&
+                    (100*min_avg_load > imbalance_scale*avg_load)) {
+                /*
+                 * The runnable loads are close so we take
+                 * into account blocked load through avg_load
+                 * which is blocked + runnable load
+                 */
+                min_avg_load = avg_load;
                 idlest = group;
             }
 
@@ -5480,13 +5505,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
         goto skip_spare;
 
     if (this_spare > task_util(p) / 2 &&
-        imbalance*this_spare > 100*most_spare)
+        imbalance_scale*this_spare > 100*most_spare)
         return NULL;
     else if (most_spare > task_util(p) / 2)
         return most_spare_sg;
 
 skip_spare:
-    if (!idlest || 100*this_load < imbalance*min_load)
+    if (!idlest ||
+        (min_runnable_load > (this_runnable_load + imbalance)) ||
+        ((this_runnable_load < (min_runnable_load + imbalance)) &&
+            (100*this_avg_load < imbalance_scale*min_avg_load)))
         return NULL;
     return idlest;
 }
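To put numbers on the two thresholds introduced by the patch, here is a small
userspace sketch (illustrative only; it assumes imbalance_pct == 125 and
scale_load_down(NICE_0_LOAD) == 1024, which correspond to common configurations):

#include <stdio.h>

int main(void)
{
        unsigned long nice_0_load = 1024;   /* assumed scale_load_down(NICE_0_LOAD) */
        int imbalance_pct = 125;            /* assumed typical sd->imbalance_pct */

        /* Relative margin kept for the spare-capacity and avg_load checks. */
        int imbalance_scale = 100 + (imbalance_pct - 100) / 2;

        /* New absolute margin used for comparing runnable_load. */
        unsigned long imbalance = nice_0_load * (imbalance_pct - 100) / 100;

        printf("imbalance_scale = %d\n", imbalance_scale);   /* 112 */
        printf("imbalance       = %lu\n", imbalance);        /* 256 */
        return 0;
}

So, under these assumptions, runnable loads within 256 (a quarter of a nice-0
task's weight) of each other count as close, while the 112 scaling factor is
kept for the spare-capacity and blocked-load comparisons.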
find_idlest_group() only compares the runnable_load_avg when looking for the
least loaded group. But on a fork-intensive use case like hackbench, where
tasks block quickly after the fork, this can lead to selecting the same CPU
instead of other CPUs which have a similar runnable load but a lower load_avg.

When the runnable_load_avg values of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There are now
3 zones for the runnable_load of the rq:
- [0 .. (runnable_load - imbalance)] : select the new rq, which has
  significantly less runnable_load
- ](runnable_load - imbalance) .. (runnable_load + imbalance)[ : the runnable
  loads are close, so we use load_avg to choose between the 2 rqs
- [(runnable_load + imbalance) .. ULONG_MAX] : keep the current rq, which has
  significantly less runnable_load

The scale factor that is currently used for comparing runnable_load doesn't
work well with small values. As an example, the use of a scaling factor fails
as soon as this_runnable_load == 0, because we always select the local rq even
if min_runnable_load is only 1, which doesn't really make sense because they
are just the same. So instead of a scaling factor, we use an absolute margin
for runnable_load to detect CPUs with a similar runnable_load, and we keep
using the scaling factor for blocked load.

For a use case like hackbench, this enables the scheduler to select different
CPUs during the fork sequence and to spread tasks across the system.

Tests have been done on a Hikey board (ARM based octo cores) for several
kernels. The results below give the min, max, avg and stdev values of 18 runs
with each configuration.

The v4.8+patches configuration also includes the change below, which is part
of the proposal made by Peter to ensure that the clock is up to date when the
forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
         __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
     rq = __task_rq_lock(p, &rf);
+    update_rq_clock(rq);
     post_init_entity_util_avg(&p->se);

     activate_task(rq, p, 0);

hackbench -P -g 1

       ea86cb4b7621  7dc603c9028e  v4.8       v4.8+patches
min    0.049         0.050         0.051      0.048
avg    0.057         0.057(0%)     0.057(0%)  0.055(+5%)
max    0.066         0.068         0.070      0.063
stdev  +/-9%         +/-9%         +/-8%      +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 38 insertions(+), 10 deletions(-)

-- 
2.7.4
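Purely as a reading aid, the three zones described in the changelog can be
written as a tiny classifier. This is not kernel code and the names are
invented; the margin value 256 in the example corresponds to the assumptions
spelled out earlier (imbalance_pct == 125, NICE_0_LOAD scaled down to 1024):

#include <stdio.h>

enum zone { PICK_NEW_RQ, COMPARE_AVG_LOAD, KEEP_CURRENT_RQ };

/*
 * Classify a candidate group's runnable_load against the current minimum,
 * using the absolute margin introduced by the patch.
 */
static enum zone classify(unsigned long runnable_load,
                          unsigned long min_runnable_load,
                          unsigned long imbalance)
{
        if (min_runnable_load > runnable_load + imbalance)
                return PICK_NEW_RQ;        /* significantly less loaded */
        if (runnable_load < min_runnable_load + imbalance)
                return COMPARE_AVG_LOAD;   /* close: let blocked load decide */
        return KEEP_CURRENT_RQ;            /* significantly more loaded */
}

int main(void)
{
        /* With imbalance == 256: 100 vs 500 picks the new rq, 400 vs 500 falls back to avg_load. */
        printf("%d %d\n", classify(100, 500, 256), classify(400, 500, 256));
        return 0;
}

In the middle zone the patch compares avg_load, which includes the blocked
load, so two groups with equally small runnable load are separated by how much
recently-slept load they still carry.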