Message ID: 1544803317-7610-2-git-send-email-vincent.guittot@linaro.org
State:      New
Series:     sched/fair: some fixes for asym_packing
On 14/12/2018 16:01, Vincent Guittot wrote:
> When check_asym_packing() is triggered, the imbalance is set to:
>
>   busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
>
> busiest_stat.avg_load also comes from a division and the final rounding
> can make imbalance slightly lower than the weighted load of the cfs_rq.
> But this is enough to skip the rq in find_busiest_queue() and prevents asym
> migration from happening.
>
> Add 1 to the avg_load to make sure that the targeted cpu will not be
> skipped unexpectedly.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ca46964..c215f7a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	/* Adjust by relative CPU capacity of the group */
>  	sgs->group_capacity = group->sgc->capacity;
>  	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> +	/*
> +	 * Prevent division rounding from making the computation of imbalance
> +	 * slightly lower than the original value, which would then prevent
> +	 * the rq from being selected as the busiest queue.
> +	 */
> +	sgs->avg_load += 1;

I've tried investigating this off-by-one issue by running lmbench, but it
seems to be a gnarly one.

The adventure starts in update_sg_lb_stats() where we compute
sgs->avg_load:

	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

Then we drop by find_busiest_group() and call check_asym_packing() which,
if we do need to pack, does:

	env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE);

And finally the one check that seems to be failing, and that you're
addressing with a +1, is in find_busiest_queue():

	if (rq->nr_running == 1 && wl > env->imbalance &&
	    !check_cpu_capacity(rq, env->sd))
		continue;

Now, running lmbench on my HiKey960 hacked up to use asym packing, I get
an example where there's a task that should be migrated via asym packing
but isn't:

	# This is a DIE-level balance
	update_sg_lb_stats():
		cpu=0 load=1
		cpu=1 load=1023
		cpu=2 load=0
		cpu=3 load=0

		avg_load=568

	# Busiest group is [0-3]
	find_busiest_group(): check_asym_packing():
		env->imbalance = 1022

	find_busiest_queue():
		cpu=0 wl=1
		cpu=1 wl=1023
		cpu=2 wl=0
		cpu=3 wl=0

Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance,
so we skip it in find_busiest_queue().

Now, I thought it was due to IRQ/rt pressure - the 1840 group capacity seen
here isn't the maximum group capacity for the LITTLE cluster, it should be
(463 * 4) == 1852. I got curious and threw random group capacity values
into that equation while keeping the same avg_load [1], and it gets wild:
you have a signal that oscillates between 1024 and 1014!

[1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba

Adding +1 to avg_load doesn't solve those pesky rounding errors that get
worse as group_capacity grows, but it reverses the trend: the signal
oscillates above 1024.

So in the end a +1 *could* "fix" this, but
 a) It'd make more sense to have it in the check_asym_packing() code
 b) It's ugly and it makes me feel like this piece of code is flimsy AF.

>
>  	if (sgs->sum_nr_running)
>  		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
>
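As an aside for the reader, the oscillation described above can be reproduced
with a small standalone program. This is only a sketch of the same integer
arithmetic, not the contents of the linked gist; the group_load value and the
capacity sweep below are made-up numbers:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024UL

    /* Same rounding as the kernel's DIV_ROUND_CLOSEST() for unsigned operands */
    static unsigned long div_round_closest(unsigned long x, unsigned long d)
    {
        return (x + d / 2) / d;
    }

    int main(void)
    {
        unsigned long group_load = 1024;    /* hypothetical busiest group load */
        unsigned long cap;

        for (cap = 1024; cap <= 4096; cap += 7) {    /* arbitrary capacity sweep */
            /* update_sg_lb_stats(): truncating division */
            unsigned long avg_load = group_load * SCHED_CAPACITY_SCALE / cap;
            /* check_asym_packing(): multiply back and round to closest */
            unsigned long imbalance = div_round_closest(avg_load * cap,
                                                        SCHED_CAPACITY_SCALE);

            printf("capacity=%4lu avg_load=%4lu imbalance=%4lu error=%ld\n",
                   cap, avg_load, imbalance,
                   (long)group_load - (long)imbalance);
        }
        return 0;
    }

The computed imbalance drifts further below group_load as group_capacity
grows, which is the error the +1 is meant to counter.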
On Tue, 18 Dec 2018 at 18:32, Valentin Schneider <valentin.schneider@arm.com> wrote:
>
> On 14/12/2018 16:01, Vincent Guittot wrote:
> > When check_asym_packing() is triggered, the imbalance is set to:
> >
> >   busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
> >
> > busiest_stat.avg_load also comes from a division and the final rounding
> > can make imbalance slightly lower than the weighted load of the cfs_rq.
> > But this is enough to skip the rq in find_busiest_queue() and prevents asym
> > migration from happening.
> >
> > Add 1 to the avg_load to make sure that the targeted cpu will not be
> > skipped unexpectedly.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ca46964..c215f7a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  	/* Adjust by relative CPU capacity of the group */
> >  	sgs->group_capacity = group->sgc->capacity;
> >  	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> > +	/*
> > +	 * Prevent division rounding from making the computation of imbalance
> > +	 * slightly lower than the original value, which would then prevent
> > +	 * the rq from being selected as the busiest queue.
> > +	 */
> > +	sgs->avg_load += 1;
>
> I've tried investigating this off-by-one issue by running lmbench, but it
> seems to be a gnarly one.
>
> The adventure starts in update_sg_lb_stats() where we compute
> sgs->avg_load:
>
> 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
>
> Then we drop by find_busiest_group() and call check_asym_packing() which,
> if we do need to pack, does:
>
> 	env->imbalance = DIV_ROUND_CLOSEST(
> 		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
> 		SCHED_CAPACITY_SCALE);
>
> And finally the one check that seems to be failing, and that you're
> addressing with a +1, is in find_busiest_queue():
>
> 	if (rq->nr_running == 1 && wl > env->imbalance &&
> 	    !check_cpu_capacity(rq, env->sd))
> 		continue;
>
> Now, running lmbench on my HiKey960 hacked up to use asym packing, I get
> an example where there's a task that should be migrated via asym packing
> but isn't:
>
> 	# This is a DIE-level balance
> 	update_sg_lb_stats():
> 		cpu=0 load=1
> 		cpu=1 load=1023
> 		cpu=2 load=0
> 		cpu=3 load=0
>
> 		avg_load=568
>
> 	# Busiest group is [0-3]
> 	find_busiest_group(): check_asym_packing():
> 		env->imbalance = 1022
>
> 	find_busiest_queue():
> 		cpu=0 wl=1
> 		cpu=1 wl=1023
> 		cpu=2 wl=0
> 		cpu=3 wl=0
>
> Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance,
> so we skip it in find_busiest_queue().
>
> Now, I thought it was due to IRQ/rt pressure - the 1840 group capacity seen
> here isn't the maximum group capacity for the LITTLE cluster, it should be
> (463 * 4) == 1852. I got curious and threw random group capacity values
> into that equation while keeping the same avg_load [1], and it gets wild:
> you have a signal that oscillates between 1024 and 1014!

hmm ... this does not seem to be related to $subject

>
> [1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba
>
> Adding +1 to avg_load doesn't solve those pesky rounding errors that get
> worse as group_capacity grows, but it reverses the trend: the signal
> oscillates above 1024.
>
> So in the end a +1 *could* "fix" this, but
>  a) It'd make more sense to have it in the check_asym_packing() code
>  b) It's ugly and it makes me feel like this piece of code is flimsy AF.

This is another UC; asym packing is used at SMT level for now and we don't
face this kind of problem there. It has also been tested on a DynamiQ
configuration, which is similar to SMT: 1 CPU per sched_group. The legacy
b.L one was not the main target, although it can be.

The rounding error is there and should be fixed, and your UC is another
case, for legacy b.L, that can be treated in another patch.

> >
> >  	if (sgs->sum_nr_running)
> >  		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
> >
On 19/12/2018 08:32, Vincent Guittot wrote:
[...]
> This is another UC; asym packing is used at SMT level for now and we don't
> face this kind of problem there. It has also been tested on a DynamiQ
> configuration, which is similar to SMT: 1 CPU per sched_group. The legacy
> b.L one was not the main target, although it can be.
>
> The rounding error is there and should be fixed, and your UC is another
> case, for legacy b.L, that can be treated in another patch.
>

I used that setup out of convenience for myself, but AFAICT that use-case
just stresses that issue.

In the end we still have an imbalance value that can vary with only slight
group_capacity changes. And that's true even if that group contains a
single CPU, as the imbalance value computed by check_asym_packing() can
vary by +/-1 with very tiny capacity changes (rt/IRQ pressure). That means
that sometimes you're off by one and don't do the packing because some CPU
was pressured, but some other time you might do it because that CPU was
*more* pressured. It doesn't make sense.

In the end the problem is that in update_sg_lb_stats() we do:

	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

and in check_asym_packing() we do:

	env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE);

So we end up with something like:

	            group_load * SCHED_CAPACITY_SCALE * group_capacity
	imbalance = --------------------------------------------------
	                   group_capacity * SCHED_CAPACITY_SCALE

Which you'd hope to reduce down to:

	imbalance = group_load

but you get division errors all over the place. So actually, the fix we'd
want would be:

-----8<-----
@@ -8463,9 +8463,7 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
 	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
 		return 0;
 
-	env->imbalance = DIV_ROUND_CLOSEST(
-		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
-		SCHED_CAPACITY_SCALE);
+	env->imbalance = sds->busiest_stat.group_load;
 
 	return 1;
 }
----->8-----
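To spell out where those division errors come from (an illustrative
expansion of the reasoning above, writing L for group_load, S for
SCHED_CAPACITY_SCALE and C for group_capacity):

	avg_load  = floor(L * S / C) = (L * S - r) / C,   where r = (L * S) mod C, 0 <= r < C
	imbalance = round(avg_load * C / S)
	          = round((L * S - r) / S)
	          = round(L - r / S)

So the computed imbalance can undershoot group_load by up to roughly C / S,
i.e. group_capacity / SCHED_CAPACITY_SCALE, which is why the error grows
with group_capacity. Adding +1 to avg_load adds C / S back before the
rounding, which is why it flips the trend so that the result lands at or
above group_load instead.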
On Wed, 19 Dec 2018 at 12:58, Valentin Schneider <valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 08:32, Vincent Guittot wrote:
> [...]
> > This is another UC; asym packing is used at SMT level for now and we don't
> > face this kind of problem there. It has also been tested on a DynamiQ
> > configuration, which is similar to SMT: 1 CPU per sched_group. The legacy
> > b.L one was not the main target, although it can be.
> >
> > The rounding error is there and should be fixed, and your UC is another
> > case, for legacy b.L, that can be treated in another patch.
> >
>
> I used that setup out of convenience for myself, but AFAICT that use-case
> just stresses that issue.

After looking at your UC in detail, your problem comes from the wl=1 for
cpu0 whereas there is no running task. But wl!=0 without a running task
should not be possible since fdf5f3 ("sched/fair: Disable LB_BIAS by
default").

This explains the 1023 instead of 1022.

>
> In the end we still have an imbalance value that can vary with only slight
> group_capacity changes. And that's true even if that group contains a
> single CPU, as the imbalance value computed by check_asym_packing() can
> vary by +/-1 with very tiny capacity changes (rt/IRQ pressure). That means
> that sometimes you're off by one and don't do the packing because some CPU
> was pressured, but some other time you might do it because that CPU was
> *more* pressured. It doesn't make sense.
>
> In the end the problem is that in update_sg_lb_stats() we do:
>
> 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
>
> and in check_asym_packing() we do:
>
> 	env->imbalance = DIV_ROUND_CLOSEST(
> 		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
> 		SCHED_CAPACITY_SCALE);
>
> So we end up with something like:
>
> 	            group_load * SCHED_CAPACITY_SCALE * group_capacity
> 	imbalance = --------------------------------------------------
> 	                   group_capacity * SCHED_CAPACITY_SCALE
>
> Which you'd hope to reduce down to:
>
> 	imbalance = group_load
>
> but you get division errors all over the place. So actually, the fix we'd
> want would be:
>
> -----8<-----
> @@ -8463,9 +8463,7 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
>  	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
>  		return 0;
>
> -	env->imbalance = DIV_ROUND_CLOSEST(
> -		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
> -		SCHED_CAPACITY_SCALE);
> +	env->imbalance = sds->busiest_stat.group_load;
>
>  	return 1;
>  }
> ----->8-----
On 19/12/2018 13:39, Vincent Guittot wrote:
[...]
>> I used that setup out of convenience for myself, but AFAICT that use-case
>> just stresses that issue.
>
> After looking at your UC in detail, your problem comes from the wl=1 for
> cpu0 whereas there is no running task. But wl!=0 without a running task
> should not be possible since fdf5f3 ("sched/fair: Disable LB_BIAS by
> default").
>
> This explains the 1023 instead of 1022.
>

True, I had a look at the trace and there doesn't seem to be any running
task on that CPU. That's a separate matter however - the rounding issues
can happen regardless of the wl values.
On Wed, 19 Dec 2018 at 15:59, Valentin Schneider <valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 13:39, Vincent Guittot wrote:
> [...]
> >> I used that setup out of convenience for myself, but AFAICT that use-case
> >> just stresses that issue.
> >
> > After looking at your UC in detail, your problem comes from the wl=1 for
> > cpu0 whereas there is no running task. But wl!=0 without a running task
> > should not be possible since fdf5f3 ("sched/fair: Disable LB_BIAS by
> > default").
> >
> > This explains the 1023 instead of 1022.
> >
>
> True, I had a look at the trace and there doesn't seem to be any running
> task on that CPU. That's a separate matter however - the rounding issues
> can happen regardless of the wl values.

But it means that the rounding fix +1 works and your problem comes from
something else.
On 19/12/2018 15:05, Vincent Guittot wrote:
[...]
>> True, I had a look at the trace and there doesn't seem to be any running
>> task on that CPU. That's a separate matter however - the rounding issues
>> can happen regardless of the wl values.
>
> But it means that the rounding fix +1 works and your problem comes from
> something else.

Oh yes, I never said it didn't work - I was doing some investigation on
the reason as to why we'd need this fix, because it wasn't explicit from
the commit message.

The rounding errors are countered by the +1, yes, but I'd rather remove
the errors altogether and go for the snippet I suggested in my previous
reply.
On Wed, 19 Dec 2018 at 16:11, Valentin Schneider <valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 15:05, Vincent Guittot wrote:
> [...]
> >> True, I had a look at the trace and there doesn't seem to be any running
> >> task on that CPU. That's a separate matter however - the rounding issues
> >> can happen regardless of the wl values.
> >
> > But it means that the rounding fix +1 works and your problem comes from
> > something else.
>
> Oh yes, I never said it didn't work - I was doing some investigation on
> the reason as to why we'd need this fix, because it wasn't explicit from
> the commit message.
>
> The rounding errors are countered by the +1, yes, but I'd rather remove
> the errors altogether and go for the snippet I suggested in my previous
> reply.

Except that you don't always want to migrate all of the group load.
I prefer keeping the current algorithm and fixing it for now. Trying "new"
things can come in a 2nd step.
On 19/12/2018 15:20, Vincent Guittot wrote:
[...]
>> Oh yes, I never said it didn't work - I was doing some investigation on
>> the reason as to why we'd need this fix, because it wasn't explicit from
>> the commit message.
>>
>> The rounding errors are countered by the +1, yes, but I'd rather remove
>> the errors altogether and go for the snippet I suggested in my previous
>> reply.
>
> Except that you don't always want to migrate all of the group load.
> I prefer keeping the current algorithm and fixing it for now. Trying "new"
> things can come in a 2nd step.

We already set the imbalance as the whole group load, we just introduce
rounding errors in between. As I've already said:

in update_sg_lb_stats() we do:

	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;

and in check_asym_packing() we do:

	env->imbalance = DIV_ROUND_CLOSEST(
		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
		SCHED_CAPACITY_SCALE);

So we end up with something like:

	            group_load * SCHED_CAPACITY_SCALE * group_capacity
	imbalance = --------------------------------------------------
	                   group_capacity * SCHED_CAPACITY_SCALE

Which we could reduce down to:

	imbalance = group_load

and not get any rounding errors.
On Wed, 19 Dec 2018 at 16:30, Valentin Schneider <valentin.schneider@arm.com> wrote:
>
> On 19/12/2018 15:20, Vincent Guittot wrote:
> [...]
> >> Oh yes, I never said it didn't work - I was doing some investigation on
> >> the reason as to why we'd need this fix, because it wasn't explicit from
> >> the commit message.
> >>
> >> The rounding errors are countered by the +1, yes, but I'd rather remove
> >> the errors altogether and go for the snippet I suggested in my previous
> >> reply.
> >
> > Except that you don't always want to migrate all of the group load.
> > I prefer keeping the current algorithm and fixing it for now. Trying "new"
> > things can come in a 2nd step.
>
> We already set the imbalance as the whole group load, we just introduce
> rounding errors in between. As I've already said:
>
> in update_sg_lb_stats() we do:
>
> 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
>
> and in check_asym_packing() we do:
>
> 	env->imbalance = DIV_ROUND_CLOSEST(
> 		sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
> 		SCHED_CAPACITY_SCALE);
>
> So we end up with something like:
>
> 	            group_load * SCHED_CAPACITY_SCALE * group_capacity
> 	imbalance = --------------------------------------------------
> 	                   group_capacity * SCHED_CAPACITY_SCALE
>
> Which we could reduce down to:
>
> 	imbalance = group_load

Ah yes, I thought the nr_running was involved, but it's not the case.
This looks better indeed.

>
> and not get any rounding errors.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ca46964..c215f7a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
+	/*
+	 * Prevent division rounding from making the computation of imbalance
+	 * slightly lower than the original value, which would then prevent
+	 * the rq from being selected as the busiest queue.
+	 */
+	sgs->avg_load += 1;
 
 	if (sgs->sum_nr_running)
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
When check_asym_packing() is triggered, the imbalance is set to:

  busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE

busiest_stat.avg_load also comes from a division and the final rounding
can make imbalance slightly lower than the weighted load of the cfs_rq.
But this is enough to skip the rq in find_busiest_queue() and prevents asym
migration from happening.

Add 1 to the avg_load to make sure that the targeted cpu will not be
skipped unexpectedly.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

--
2.7.4
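For illustration only (the numbers are hypothetical, picked to be in the
same ballpark as the trace discussed in the review thread): with
group_load = 1024, group_capacity = 1843 and a busiest-rq weighted load
wl = 1023,

	without the fix:
		avg_load  = 1024 * 1024 / 1843                  = 568   (568.9 truncated)
		imbalance = DIV_ROUND_CLOSEST(568 * 1843, 1024) = 1022
		-> wl (1023) > imbalance (1022), so the rq is skipped in find_busiest_queue()

	with the +1:
		avg_load  = 568 + 1                             = 569
		imbalance = DIV_ROUND_CLOSEST(569 * 1843, 1024) = 1024
		-> wl (1023) <= imbalance (1024), so asym migration can happen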