| Message ID | 20160928131309.GA5483@vingu-laptop |
|---|---|
| State | New |
On 28/09/16 14:13, Vincent Guittot wrote:
> Le Wednesday 28 Sep 2016 à 05:27:54 (-0700), Vincent Guittot a écrit :
>> On 28 September 2016 at 04:31, Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:

[...]

> IIUC the problem raised by Matt: he sees a regression because we now
> remove during the dequeue the exact same load as during the enqueue, so
> cfs_rq->runnable_load_avg is null and we select a cfs_rq that might
> already have a lot of blocked hackbench threads.

This is my understanding as well.

> The fact that runnable_load_avg is null when the cfs_rq doesn't have a
> runnable task is quite correct and we should keep it. But when we look
> for the idlest group, we have to take into account the blocked threads.
>
> That's what I have tried to do below

[...]

> +			/*
> +			 * In case that we have same runnable load (especially null
> +			 * runnable load), we select the group with smallest blocked
> +			 * load
> +			 */
> +			min_avg_load = avg_load;
> +			min_runnable_load = runnable_load;

Setting 'min_runnable_load' wouldn't be necessary here.

> 			idlest = group;
> 		}
> +
> 	} while (group = group->next, group != sd->groups);
>
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest || 100*this_load < imbalance*min_runnable_load)
> 		return NULL;
> 	return idlest;

On the Hikey board (ARM64) (2 clusters, 4 cpus each, so MC and DIE), the
first f_i_g (on DIE) is still based on rbl_load. So if the first hackbench
task (the one spawning all the worker tasks) runs on cluster1 and a former
worker p_X has already blocked, f_i_g returns cluster2; if p_X is still
running, it returns idlest=NULL and we continue with cluster1 for the
second f_i_g on MC.

The additional 'else if' condition doesn't seem to help much because of
occurrences where an idle cpu (which never took a worker) still has a
small value of rbl_load (shouldn't actually happen, weighted_cpuload()
should be 0), so it is never chosen. It can even have a negative impact in
the case where an idle cpu (which never took a worker) is not chosen
because its load (cfs->avg.load_avg) hasn't been updated for a long time,
so another cpu with rbl_load = 0 and a smaller load is used (even though a
lot of workers were already placed on it).

There are also episodes where we 'pack' workers onto the cpu which is
initially picked in f_i_c (on DIE) because (100*this_load <
imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this
for !sd->child?

[...]
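The check questioned at the end compares the local group's load against the best remote group's load scaled by the domain's imbalance threshold, so the remote group must be noticeably less loaded before the task leaves the local group. A minimal, self-contained sketch of that arithmetic, assuming the common default imbalance_pct of 125 (the load values are made up):

```c
#include <stdio.h>

int main(void)
{
	unsigned long imbalance_pct = 125;	/* typical sd->imbalance_pct default */
	unsigned long imbalance = 100 + (imbalance_pct - 100) / 2;	/* = 112 */
	unsigned long this_load = 1024;		/* local group load */
	unsigned long min_load = 950;		/* best remote group load */

	/*
	 * find_idlest_group() returns NULL (the task stays in the local group)
	 * unless the best remote group is at least ~12% less loaded. Here
	 * 100*1024 < 112*950 holds, so the local group is kept and workers can
	 * end up packed on the initially picked cpu, as described above.
	 */
	if (100 * this_load < imbalance * min_load)
		printf("stay local (f_i_g returns NULL)\n");
	else
		printf("move to the idlest remote group\n");

	return 0;
}
```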
On 29 September 2016 at 18:15, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 28/09/16 14:13, Vincent Guittot wrote:
>> Le Wednesday 28 Sep 2016 à 05:27:54 (-0700), Vincent Guittot a écrit :
>>> On 28 September 2016 at 04:31, Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>>> On 28/09/16 12:19, Peter Zijlstra wrote:
>>>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
>>>>>> On 28/09/16 11:14, Peter Zijlstra wrote:
>>>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
>
> [...]
>
>> IIUC the problem raised by Matt: he sees a regression because we now
>> remove during the dequeue the exact same load as during the enqueue, so
>> cfs_rq->runnable_load_avg is null and we select a cfs_rq that might
>> already have a lot of blocked hackbench threads.
>
> This is my understanding as well.
>
>> The fact that runnable_load_avg is null when the cfs_rq doesn't have a
>> runnable task is quite correct and we should keep it. But when we look
>> for the idlest group, we have to take into account the blocked threads.
>>
>> That's what I have tried to do below
>
> [...]
>
>> +			/*
>> +			 * In case that we have same runnable load (especially null
>> +			 * runnable load), we select the group with smallest blocked
>> +			 * load
>> +			 */
>> +			min_avg_load = avg_load;
>> +			min_runnable_load = runnable_load;
>
> Setting 'min_runnable_load' wouldn't be necessary here.

fair enough

>
>> 			idlest = group;
>> 		}
>> +
>> 	} while (group = group->next, group != sd->groups);
>>
>> -	if (!idlest || 100*this_load < imbalance*min_load)
>> +	if (!idlest || 100*this_load < imbalance*min_runnable_load)
>> 		return NULL;
>> 	return idlest;
>
> On the Hikey board (ARM64) (2 clusters, 4 cpus each, so MC and DIE), the
> first f_i_g (on DIE) is still based on rbl_load. So if the first hackbench
> task (the one spawning all the worker tasks) runs on cluster1 and a former
> worker p_X has already blocked, f_i_g returns cluster2; if p_X is still
> running, it returns idlest=NULL and we continue with cluster1 for the
> second f_i_g on MC.
>
> The additional 'else if' condition doesn't seem to help much because of
> occurrences where an idle cpu (which never took a worker) still has a
> small value of rbl_load (shouldn't actually happen, weighted_cpuload()
> should be 0), so it is never chosen. It can even have a negative impact in
> the case where an idle cpu (which never took a worker) is not chosen
> because its load (cfs->avg.load_avg) hasn't been updated for a long time,
> so another cpu with rbl_load = 0 and a smaller load is used (even though a
> lot of workers were already placed on it).

So the 'else if' part is there to take care of the regression raised by
Matt, where runnable_load_avg is null because the workers are blocked and
the same cpu is selected.

This can be extended with a threshold in order to include small
differences that come from computation rounding.

>
> There are also episodes where we 'pack' workers onto the cpu which is
> initially picked in f_i_c (on DIE) because (100*this_load <
> imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this
> for !sd->child?

This threshold is there to filter out any small variations that are not
relevant.

I'm going to extend the use of cfs_rq_load_avg() in all conditions so we
take into account blocked load everywhere instead of only when
runnable_load_avg is null.

>
> [...]
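A minimal sketch of the kind of rounding tolerance mentioned above; the helper name and the margin value are illustrative assumptions, not part of any posted patch:

```c
/*
 * Sketch only: a tolerance-based comparison that could replace the strict
 * (runnable_load == min_runnable_load) tie-break, so that tiny differences
 * left over from capacity scaling and PELT rounding still fall through to
 * the blocked-load (avg_load) comparison. Helper name and margin are made
 * up for illustration.
 */
static inline bool runnable_load_about_equal(unsigned long a, unsigned long b)
{
	unsigned long margin = 2;	/* absorb small rounding noise */

	return (a > b ? a - b : b - a) <= margin;
}
```

The second branch of the tie-break would then test
`runnable_load_about_equal(runnable_load, min_runnable_load) && avg_load < min_avg_load`
instead of requiring exact equality.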
```diff
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b3c47..702915e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,7 +5353,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_load = 0;
+	unsigned long min_avg_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5361,7 +5362,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5375,6 +5376,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5383,21 +5385,35 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_load = runnable_load;
+		} else if (runnable_load < min_runnable_load) {
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load == min_runnable_load) && (avg_load < min_avg_load)) {
+			/*
+			 * In case that we have same runnable load (especially null
+			 * runnable load), we select the group with smallest blocked
+			 * load
+			 */
+			min_avg_load = avg_load;
+			min_runnable_load = runnable_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_load < imbalance*min_runnable_load)
 		return NULL;
 	return idlest;
 }
```
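For context, the two load signals the patch keeps separate are roughly the following helpers, as they looked in kernel/sched/fair.c around v4.8 (worth double-checking against the tree the patch applies to):

```c
/* Load of the cfs_rq including the decaying contribution of blocked tasks. */
static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
{
	return cfs_rq->avg.load_avg;
}

/*
 * Load of the runnable tasks only; this is what drops to 0 once all the
 * hackbench workers have blocked, which is why a comparison based solely
 * on it can no longer tell the candidate groups apart.
 */
static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq)
{
	return cfs_rq->runnable_load_avg;
}
```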