Message ID | 1571405198-27570-5-git-send-email-vincent.guittot@linaro.org |
---|---|
State | Accepted |
Commit | 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912 |
Headers | show |
Series | sched/fair: rework the CFS load balance | expand |
On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote: > The load_balance algorithm contains some heuristics which have become > meaningless since the rework of the scheduler's metrics like the > introduction of PELT. > > Furthermore, load is an ill-suited metric for solving certain task > placement imbalance scenarios. For instance, in the presence of idle CPUs, > we should simply try to get at least one task per CPU, whereas the current > load-based algorithm can actually leave idle CPUs alone simply because the > load is somewhat balanced. The current algorithm ends up creating virtual > and meaningless value like the avg_load_per_task or tweaks the state of a > group to make it overloaded whereas it's not, in order to try to migrate > tasks. > I do not think this is necessarily 100% true. With both the previous load-balancing behaviour and the apparent behaviour of this patch, it's still possible to pull two communicating tasks apart and across NUMA domains when utilisation is low. Specifically, a load difference of less than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate task to level out load. So, load might be ill-suited for some cases but that does not make it completely useless either. The type of behaviour can be seen by running netperf via mmtests (configuration file configs/config-network-netperf-unbound) on a NUMA machine and noting that the local vs remote NUMA hinting faults are roughly 50%. I had prototyped some fixes around this that took imbalance_pct into account but it was too special-cased and was not a universal win. If I was reviewing my own patch I would have naked it on the "you added a special-case hack into the load balancer for one load". I didn't get back to it before getting cc'd on this series. > load_balance should better qualify the imbalance of the group and clearly > define what has to be moved to fix this imbalance. > > The type of sched_group has been extended to better reflect the type of > imbalance. We now have : > group_has_spare > group_fully_busy > group_misfit_task > group_asym_packing > group_imbalanced > group_overloaded > > Based on the type of sched_group, load_balance now sets what it wants to > move in order to fix the imbalance. It can be some load as before but also > some utilization, a number of task or a type of task: > migrate_task > migrate_util > migrate_load > migrate_misfit > > This new load_balance algorithm fixes several pending wrong tasks > placement: > - the 1 task per CPU case with asymmetric system > - the case of cfs task preempted by other class > - the case of tasks not evenly spread on groups with spare capacity > On the last one, spreading tasks evenly across NUMA domains is not necessarily a good idea. If I have 2 tasks running on a 2-socket machine with 24 logical CPUs per socket, it should not automatically mean that one task should move cross-node and I have definitely observed this happening. It's probably bad in terms of locality no matter what but it's especially bad if the 2 tasks happened to be communicating because then load balancing will pull apart the tasks while wake_affine will push them together (and potentially NUMA balancing as well). Note that this also applies for some IO workloads because, depending on the filesystem, the task may be communicating with workqueues (XFS) or a kernel thread (ext4 with jbd2). > Also the load balance decisions have been consolidated in the 3 functions > below after removing the few bypasses and hacks of the current code: > - update_sd_pick_busiest() select the busiest sched_group. > - find_busiest_group() checks if there is an imbalance between local and > busiest group. > - calculate_imbalance() decides what have to be moved. > > Finally, the now unused field total_running of struct sd_lb_stats has been > removed. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > --- > kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------ > 1 file changed, 402 insertions(+), 209 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index e004841..5ae5281 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; > > enum fbq_type { regular, remote, all }; > > +/* > + * group_type describes the group of CPUs at the moment of the load balance. > + * The enum is ordered by pulling priority, with the group with lowest priority > + * first so the groupe_type can be simply compared when selecting the busiest > + * group. see update_sd_pick_busiest(). > + */ s/groupe_type/group_type/ > enum group_type { > - group_other = 0, > + group_has_spare = 0, > + group_fully_busy, > group_misfit_task, > + group_asym_packing, > group_imbalanced, > - group_overloaded, > + group_overloaded > +}; > + While not your fault, it would be nice to comment on the meaning of each group type. From a glance, it's not obvious to me why a misfit task should be a high priority to move a task than a fully_busy (but not overloaded) group given that moving the misfit task might make a group overloaded. > +enum migration_type { > + migrate_load = 0, > + migrate_util, > + migrate_task, > + migrate_misfit > }; > Could do with a comment explaining what migration_type is for because the name is unhelpful. I *think* at a glance it's related to what sort of imbalance is being addressed which is partially addressed by the group_type. That understanding may change as I continue reading the series but now I have to figure it out which means it'll be forgotten again in 6 months. > #define LBF_ALL_PINNED 0x01 > @@ -7105,7 +7120,7 @@ struct lb_env { > unsigned int loop_max; > > enum fbq_type fbq_type; > - enum group_type src_grp_type; > + enum migration_type migration_type; > struct list_head tasks; > }; > > @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env) > static const unsigned int sched_nr_migrate_break = 32; > > /* > - * detach_tasks() -- tries to detach up to imbalance runnable load from > + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from > * busiest_rq, as part of a balancing operation within domain "sd". > * > * Returns number of detached tasks if successful and 0 otherwise. > @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32; > static int detach_tasks(struct lb_env *env) > { > struct list_head *tasks = &env->src_rq->cfs_tasks; > + unsigned long util, load; > struct task_struct *p; > - unsigned long load; > int detached = 0; > > lockdep_assert_held(&env->src_rq->lock); > @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env) > if (!can_migrate_task(p, env)) > goto next; > > - load = task_h_load(p); > + switch (env->migration_type) { > + case migrate_load: > + load = task_h_load(p); > > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed) > - goto next; > + if (sched_feat(LB_MIN) && > + load < 16 && !env->sd->nr_balance_failed) > + goto next; > > - if ((load / 2) > env->imbalance) > - goto next; > + if ((load / 2) > env->imbalance) > + goto next; > + > + env->imbalance -= load; > + break; > + > + case migrate_util: > + util = task_util_est(p); > + > + if (util > env->imbalance) > + goto next; > + > + env->imbalance -= util; > + break; > + > + case migrate_task: > + env->imbalance--; > + break; > + > + case migrate_misfit: > + load = task_h_load(p); > + > + /* > + * load of misfit task might decrease a bit since it has > + * been recorded. Be conservative in the condition. > + */ > + if (load / 2 < env->imbalance) > + goto next; > + > + env->imbalance = 0; > + break; > + } > So, no problem with this but it brings up another point. migration_type also determines what env->imbalance means (e.g. load, utilisation, nr_running etc). That was true before your patch too but now it's much more explicit, which is nice, but could do with a comment. > detach_task(p, env); > list_add(&p->se.group_node, &env->tasks); > > detached++; > - env->imbalance -= load; > > #ifdef CONFIG_PREEMPTION > /* > @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env) > > /* > * We only want to steal up to the prescribed amount of > - * runnable load. > + * load/util/tasks. > */ > if (env->imbalance <= 0) > break; > @@ -7661,7 +7708,6 @@ struct sg_lb_stats { > unsigned int idle_cpus; > unsigned int group_weight; > enum group_type group_type; > - int group_no_capacity; > unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ > unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ > #ifdef CONFIG_NUMA_BALANCING Glad to see group_no_capacity go away, that had some "interesting" treatment in update_sd_lb_stats. > @@ -7677,10 +7723,10 @@ struct sg_lb_stats { > struct sd_lb_stats { > struct sched_group *busiest; /* Busiest group in this sd */ > struct sched_group *local; /* Local group in this sd */ > - unsigned long total_running; > unsigned long total_load; /* Total load of all groups in sd */ > unsigned long total_capacity; /* Total capacity of all groups in sd */ > unsigned long avg_load; /* Average load across all groups in sd */ > + unsigned int prefer_sibling; /* tasks should go to sibling first */ > > struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ > struct sg_lb_stats local_stat; /* Statistics of the local group */ > @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) > /* > * Skimp on the clearing to avoid duplicate work. We can avoid clearing > * local_stat because update_sg_lb_stats() does a full clear/assignment. > - * We must however clear busiest_stat::avg_load because > - * update_sd_pick_busiest() reads this before assignment. > + * We must however set busiest_stat::group_type and > + * busiest_stat::idle_cpus to the worst busiest group because > + * update_sd_pick_busiest() reads these before assignment. > */ > *sds = (struct sd_lb_stats){ > .busiest = NULL, > .local = NULL, > - .total_running = 0UL, > .total_load = 0UL, > .total_capacity = 0UL, > .busiest_stat = { > - .avg_load = 0UL, > - .sum_h_nr_running = 0, > - .group_type = group_other, > + .idle_cpus = UINT_MAX, > + .group_type = group_has_spare, > }, > }; > } > @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) > } > > static inline enum > -group_type group_classify(struct sched_group *group, > +group_type group_classify(struct lb_env *env, > + struct sched_group *group, > struct sg_lb_stats *sgs) > { > - if (sgs->group_no_capacity) > + if (group_is_overloaded(env, sgs)) > return group_overloaded; > > if (sg_imbalanced(group)) > return group_imbalanced; > > + if (sgs->group_asym_packing) > + return group_asym_packing; > + > if (sgs->group_misfit_task_load) > return group_misfit_task; > > - return group_other; > + if (!group_has_capacity(env, sgs)) > + return group_fully_busy; > + > + return group_has_spare; > } > > static bool update_nohz_stats(struct rq *rq, bool force) > @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, > struct sg_lb_stats *sgs, > int *sg_status) > { > - int i, nr_running; > + int i, nr_running, local_group; > > memset(sgs, 0, sizeof(*sgs)); > > + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); > + > for_each_cpu_and(i, sched_group_span(group), env->cpus) { > struct rq *rq = cpu_rq(i); > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, > /* > * No need to call idle_cpu() if nr_running is not 0 > */ > - if (!nr_running && idle_cpu(i)) > + if (!nr_running && idle_cpu(i)) { > sgs->idle_cpus++; > + /* Idle cpu can't have misfit task */ > + continue; > + } > + > + if (local_group) > + continue; > > + /* Check for a misfit task on the cpu */ > if (env->sd->flags & SD_ASYM_CPUCAPACITY && > sgs->group_misfit_task_load < rq->misfit_task_load) { > sgs->group_misfit_task_load = rq->misfit_task_load; So.... why exactly do we not care about misfit tasks on CPUs in the local group? I'm not saying you're wrong because you have a clear idea on how misfit tasks should be treated but it's very non-obvious just from the code. > <SNIP> > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, > if (sgs->group_type < busiest->group_type) > return false; > > - if (sgs->avg_load <= busiest->avg_load) > - return false; > - > - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) > - goto asym_packing; > - > /* > - * Candidate sg has no more than one task per CPU and > - * has higher per-CPU capacity. Migrating tasks to less > - * capable CPUs may harm throughput. Maximize throughput, > - * power/energy consequences are not considered. > + * The candidate and the current busiest group are the same type of > + * group. Let check which one is the busiest according to the type. > */ > - if (sgs->sum_h_nr_running <= sgs->group_weight && > - group_smaller_min_cpu_capacity(sds->local, sg)) > - return false; > > - /* > - * If we have more than one misfit sg go with the biggest misfit. > - */ > - if (sgs->group_type == group_misfit_task && > - sgs->group_misfit_task_load < busiest->group_misfit_task_load) > + switch (sgs->group_type) { > + case group_overloaded: > + /* Select the overloaded group with highest avg_load. */ > + if (sgs->avg_load <= busiest->avg_load) > + return false; > + break; > + > + case group_imbalanced: > + /* > + * Select the 1st imbalanced group as we don't have any way to > + * choose one more than another. > + */ > return false; > > -asym_packing: > - /* This is the busiest node in its class. */ > - if (!(env->sd->flags & SD_ASYM_PACKING)) > - return true; > + case group_asym_packing: > + /* Prefer to move from lowest priority CPU's work */ > + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) > + return false; > + break; > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking sched_asym_prefer. > <SNIP> > + case group_fully_busy: > + /* > + * Select the fully busy group with highest avg_load. In > + * theory, there is no need to pull task from such kind of > + * group because tasks have all compute capacity that they need > + * but we can still improve the overall throughput by reducing > + * contention when accessing shared HW resources. > + * > + * XXX for now avg_load is not computed and always 0 so we > + * select the 1st one. > + */ > + if (sgs->avg_load <= busiest->avg_load) > + return false; > + break; > + With the exception that if we are balancing between NUMA domains and they were communicating tasks that we've now pulled them apart. That might increase the CPU resources available at the cost of increased remote memory access cost. > + case group_has_spare: > + /* > + * Select not overloaded group with lowest number of > + * idle cpus. We could also compare the spare capacity > + * which is more stable but it can end up that the > + * group has less spare capacity but finally more idle > + * cpus which means less opportunity to pull tasks. > + */ > + if (sgs->idle_cpus >= busiest->idle_cpus) > + return false; > + break; > } > > - return false; > + /* > + * Candidate sg has no more than one task per CPU and has higher > + * per-CPU capacity. Migrating tasks to less capable CPUs may harm > + * throughput. Maximize throughput, power/energy consequences are not > + * considered. > + */ > + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) && > + (sgs->group_type <= group_fully_busy) && > + (group_smaller_min_cpu_capacity(sds->local, sg))) > + return false; > + > + return true; > } > > #ifdef CONFIG_NUMA_BALANCING > @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) > * @env: The load balancing environment. > * @sds: variable to hold the statistics for this sched_domain. > */ > + Spurious whitespace change. > static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) > { > struct sched_domain *child = env->sd->child; > struct sched_group *sg = env->sd->groups; > struct sg_lb_stats *local = &sds->local_stat; > struct sg_lb_stats tmp_sgs; > - bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > int sg_status = 0; > > #ifdef CONFIG_NO_HZ_COMMON > @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > if (local_group) > goto next_group; > > - /* > - * In case the child domain prefers tasks go to siblings > - * first, lower the sg capacity so that we'll try > - * and move all the excess tasks away. We lower the capacity > - * of a group only if the local group has the capacity to fit > - * these excess tasks. The extra check prevents the case where > - * you always pull from the heaviest group when it is already > - * under-utilized (possible with a large weight task outweighs > - * the tasks on the system). > - */ > - if (prefer_sibling && sds->local && > - group_has_capacity(env, local) && > - (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { > - sgs->group_no_capacity = 1; > - sgs->group_type = group_classify(sg, sgs); > - } > > if (update_sd_pick_busiest(env, sds, sg, sgs)) { > sds->busiest = sg; > @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > next_group: > /* Now, start updating sd_lb_stats */ > - sds->total_running += sgs->sum_h_nr_running; > sds->total_load += sgs->group_load; > sds->total_capacity += sgs->group_capacity; > > sg = sg->next; > } while (sg != env->sd->groups); > > + /* Tag domain that child domain prefers tasks go to siblings first */ > + sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > + > #ifdef CONFIG_NO_HZ_COMMON > if ((env->flags & LBF_NOHZ_AGAIN) && > cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) { > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > */ > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) > { > - unsigned long max_pull, load_above_capacity = ~0UL; > struct sg_lb_stats *local, *busiest; > > local = &sds->local_stat; > busiest = &sds->busiest_stat; > > - if (busiest->group_asym_packing) { > - env->imbalance = busiest->group_load; > + if (busiest->group_type == group_misfit_task) { > + /* Set imbalance to allow misfit task to be balanced. */ > + env->migration_type = migrate_misfit; > + env->imbalance = busiest->group_misfit_task_load; > + return; > + } > + > + if (busiest->group_type == group_asym_packing) { > + /* > + * In case of asym capacity, we will try to migrate all load to > + * the preferred CPU. > + */ > + env->migration_type = migrate_task; > + env->imbalance = busiest->sum_h_nr_running; > + return; > + } > + > + if (busiest->group_type == group_imbalanced) { > + /* > + * In the group_imb case we cannot rely on group-wide averages > + * to ensure CPU-load equilibrium, try to move any task to fix > + * the imbalance. The next load balance will take care of > + * balancing back the system. > + */ > + env->migration_type = migrate_task; > + env->imbalance = 1; > return; > } > > /* > - * Avg load of busiest sg can be less and avg load of local sg can > - * be greater than avg load across all sgs of sd because avg load > - * factors in sg capacity and sgs with smaller group_type are > - * skipped when updating the busiest sg: > + * Try to use spare capacity of local group without overloading it or > + * emptying busiest > */ > - if (busiest->group_type != group_misfit_task && > - (busiest->avg_load <= sds->avg_load || > - local->avg_load >= sds->avg_load)) { > - env->imbalance = 0; > + if (local->group_type == group_has_spare) { > + if (busiest->group_type > group_fully_busy) { > + /* > + * If busiest is overloaded, try to fill spare > + * capacity. This might end up creating spare capacity > + * in busiest or busiest still being overloaded but > + * there is no simple way to directly compute the > + * amount of load to migrate in order to balance the > + * system. > + */ busiest may not be overloaded, it may be imbalanced. Maybe the distinction is irrelevant though. > + env->migration_type = migrate_util; > + env->imbalance = max(local->group_capacity, local->group_util) - > + local->group_util; > + > + /* > + * In some case, the group's utilization is max or even > + * higher than capacity because of migrations but the > + * local CPU is (newly) idle. There is at least one > + * waiting task in this overloaded busiest group. Let > + * try to pull it. > + */ > + if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) { > + env->migration_type = migrate_task; > + env->imbalance = 1; > + } > + Not convinced this is a good thing to do across NUMA domains. If it was tied to the group being definitely overloaded then I could see the logic. > + return; > + } > + > + if (busiest->group_weight == 1 || sds->prefer_sibling) { > + unsigned int nr_diff = busiest->sum_h_nr_running; > + /* > + * When prefer sibling, evenly spread running tasks on > + * groups. > + */ > + env->migration_type = migrate_task; > + lsub_positive(&nr_diff, local->sum_h_nr_running); > + env->imbalance = nr_diff >> 1; > + return; > + } Comment is slightly misleading given that it's not just preferring siblings but for when balancing across single CPU domains. > + > + /* > + * If there is no overload, we just want to even the number of > + * idle cpus. > + */ > + env->migration_type = migrate_task; > + env->imbalance = max_t(long, 0, (local->idle_cpus - > + busiest->idle_cpus) >> 1); > return; > } Why do we want an even number of idle CPUs unconditionally? This goes back to the NUMA domain case. 2 communicating tasks running on a 2-socket system should not be pulled apart just to have 1 task running on each socket. I didn't see anything obvious after this point but I also am getting a bit on the fried side trying to hold this entire patch in my head and got hung up on the NUMA domain balancing in particular. -- Mel Gorman SUSE Labs
On 30/10/2019 16:45, Mel Gorman wrote: > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote: >> The load_balance algorithm contains some heuristics which have become >> meaningless since the rework of the scheduler's metrics like the >> introduction of PELT. >> >> Furthermore, load is an ill-suited metric for solving certain task >> placement imbalance scenarios. For instance, in the presence of idle CPUs, >> we should simply try to get at least one task per CPU, whereas the current >> load-based algorithm can actually leave idle CPUs alone simply because the >> load is somewhat balanced. The current algorithm ends up creating virtual >> and meaningless value like the avg_load_per_task or tweaks the state of a >> group to make it overloaded whereas it's not, in order to try to migrate >> tasks. >> > > I do not think this is necessarily 100% true. With both the previous > load-balancing behaviour and the apparent behaviour of this patch, it's > still possible to pull two communicating tasks apart and across NUMA > domains when utilisation is low. Specifically, a load difference of less > than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate > task to level out load. > > So, load might be ill-suited for some cases but that does not make it > completely useless either. > For sure! What we're trying to get to is that load is much less relevant at low task count than at high task count. A pathological case for asymmetric systems (big.LITTLE et al) is if you spawn as many tasks as you have CPUs and they're not all spread out straight from wakeup (so you have at least one rq coscheduling 2 tasks), the load balancer might never solve that imbalance and leave a CPU idle for the duration of the workload. The problem there is that the current LB tries to balance avg_load, which is screwy when asymmetric capacities are involved. Take the h960 for instance, which is 4+4 big.LITTLE (LITTLEs are 462 capacity, bigs 1024). Say you end up with 3 tasks on the LITTLEs (one is idle), and 5 tasks on the bigs (none idle). The tasks are CPU hogs, so they won't go through new wakeups. You have something like 3072 load on the LITTLEs vs 5120 on the bigs. The thing is, a group's avg_load scales inversely with capacity: sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / sgs->group_capacity; That means the LITTLEs group ends up with 1702 avg_load, whereas the bigs group gets 1280 - the LITTLEs group is seen as more "busy" despite having an idle CPU. This kind of imbalance usually ends up in fix_small_imbalance() territory, which is random at best. >> @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, >> /* >> * No need to call idle_cpu() if nr_running is not 0 >> */ >> - if (!nr_running && idle_cpu(i)) >> + if (!nr_running && idle_cpu(i)) { >> sgs->idle_cpus++; >> + /* Idle cpu can't have misfit task */ >> + continue; >> + } >> + >> + if (local_group) >> + continue; >> >> + /* Check for a misfit task on the cpu */ >> if (env->sd->flags & SD_ASYM_CPUCAPACITY && >> sgs->group_misfit_task_load < rq->misfit_task_load) { >> sgs->group_misfit_task_load = rq->misfit_task_load; > > So.... why exactly do we not care about misfit tasks on CPUs in the > local group? I'm not saying you're wrong because you have a clear idea > on how misfit tasks should be treated but it's very non-obvious just > from the code. > Misfit tasks need to be migrated away to CPUs of higher capacity. We let that happen through the usual load-balance pull - there's little point in pulling from ourselves. > I didn't see anything obvious after this point but I also am getting a > bit on the fried side trying to hold this entire patch in my head and > got hung up on the NUMA domain balancing in particular. > I share the feeling, I've done it by chunks but need to have another go at it.
On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote: > > The load_balance algorithm contains some heuristics which have become > > meaningless since the rework of the scheduler's metrics like the > > introduction of PELT. > > > > Furthermore, load is an ill-suited metric for solving certain task > > placement imbalance scenarios. For instance, in the presence of idle CPUs, > > we should simply try to get at least one task per CPU, whereas the current > > load-based algorithm can actually leave idle CPUs alone simply because the > > load is somewhat balanced. The current algorithm ends up creating virtual > > and meaningless value like the avg_load_per_task or tweaks the state of a > > group to make it overloaded whereas it's not, in order to try to migrate > > tasks. > > > > I do not think this is necessarily 100% true. With both the previous > load-balancing behaviour and the apparent behaviour of this patch, it's > still possible to pull two communicating tasks apart and across NUMA > domains when utilisation is low. Specifically, a load difference of less > than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate > task to level out load. > > So, load might be ill-suited for some cases but that does not make it > completely useless either. I fully agree and we keep using it in some cases. The goal is only to not use it when it is obviously the wrong metric to be used > > The type of behaviour can be seen by running netperf via mmtests > (configuration file configs/config-network-netperf-unbound) on a NUMA > machine and noting that the local vs remote NUMA hinting faults are roughly > 50%. I had prototyped some fixes around this that took imbalance_pct into > account but it was too special-cased and was not a universal win. If > I was reviewing my own patch I would have naked it on the "you added a > special-case hack into the load balancer for one load". I didn't get back > to it before getting cc'd on this series. > > > load_balance should better qualify the imbalance of the group and clearly > > define what has to be moved to fix this imbalance. > > > > The type of sched_group has been extended to better reflect the type of > > imbalance. We now have : > > group_has_spare > > group_fully_busy > > group_misfit_task > > group_asym_packing > > group_imbalanced > > group_overloaded > > > > Based on the type of sched_group, load_balance now sets what it wants to > > move in order to fix the imbalance. It can be some load as before but also > > some utilization, a number of task or a type of task: > > migrate_task > > migrate_util > > migrate_load > > migrate_misfit > > > > This new load_balance algorithm fixes several pending wrong tasks > > placement: > > - the 1 task per CPU case with asymmetric system > > - the case of cfs task preempted by other class > > - the case of tasks not evenly spread on groups with spare capacity > > > > On the last one, spreading tasks evenly across NUMA domains is not > necessarily a good idea. If I have 2 tasks running on a 2-socket machine > with 24 logical CPUs per socket, it should not automatically mean that > one task should move cross-node and I have definitely observed this > happening. It's probably bad in terms of locality no matter what but it's > especially bad if the 2 tasks happened to be communicating because then > load balancing will pull apart the tasks while wake_affine will push > them together (and potentially NUMA balancing as well). Note that this > also applies for some IO workloads because, depending on the filesystem, > the task may be communicating with workqueues (XFS) or a kernel thread > (ext4 with jbd2). This rework doesn't touch the NUMA_BALANCING part and NUMA balancing still gives guidances with fbq_classify_group/queue. But the latter could also take advantage of the new type of group. For example, what I did in the fix for find_idlest_group : checking numa_preferred_nid when the group has capacity and keep the task on preferred node if possible. Similar behavior could also be beneficial in periodic load_balance case. > > > Also the load balance decisions have been consolidated in the 3 functions > > below after removing the few bypasses and hacks of the current code: > > - update_sd_pick_busiest() select the busiest sched_group. > > - find_busiest_group() checks if there is an imbalance between local and > > busiest group. > > - calculate_imbalance() decides what have to be moved. > > > > Finally, the now unused field total_running of struct sd_lb_stats has been > > removed. > > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > --- > > kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------ > > 1 file changed, 402 insertions(+), 209 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index e004841..5ae5281 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; > > > > enum fbq_type { regular, remote, all }; > > > > +/* > > + * group_type describes the group of CPUs at the moment of the load balance. > > + * The enum is ordered by pulling priority, with the group with lowest priority > > + * first so the groupe_type can be simply compared when selecting the busiest > > + * group. see update_sd_pick_busiest(). > > + */ > > s/groupe_type/group_type/ > > > enum group_type { > > - group_other = 0, > > + group_has_spare = 0, > > + group_fully_busy, > > group_misfit_task, > > + group_asym_packing, > > group_imbalanced, > > - group_overloaded, > > + group_overloaded > > +}; > > + > > While not your fault, it would be nice to comment on the meaning of each > group type. From a glance, it's not obvious to me why a misfit task should > be a high priority to move a task than a fully_busy (but not overloaded) > group given that moving the misfit task might make a group overloaded. > > > +enum migration_type { > > + migrate_load = 0, > > + migrate_util, > > + migrate_task, > > + migrate_misfit > > }; > > > > Could do with a comment explaining what migration_type is for because > the name is unhelpful. I *think* at a glance it's related to what sort > of imbalance is being addressed which is partially addressed by the > group_type. That understanding may change as I continue reading the series > but now I have to figure it out which means it'll be forgotten again in > 6 months. > > > #define LBF_ALL_PINNED 0x01 > > @@ -7105,7 +7120,7 @@ struct lb_env { > > unsigned int loop_max; > > > > enum fbq_type fbq_type; > > - enum group_type src_grp_type; > > + enum migration_type migration_type; > > struct list_head tasks; > > }; > > > > @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env) > > static const unsigned int sched_nr_migrate_break = 32; > > > > /* > > - * detach_tasks() -- tries to detach up to imbalance runnable load from > > + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from > > * busiest_rq, as part of a balancing operation within domain "sd". > > * > > * Returns number of detached tasks if successful and 0 otherwise. > > @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32; > > static int detach_tasks(struct lb_env *env) > > { > > struct list_head *tasks = &env->src_rq->cfs_tasks; > > + unsigned long util, load; > > struct task_struct *p; > > - unsigned long load; > > int detached = 0; > > > > lockdep_assert_held(&env->src_rq->lock); > > @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env) > > if (!can_migrate_task(p, env)) > > goto next; > > > > - load = task_h_load(p); > > + switch (env->migration_type) { > > + case migrate_load: > > + load = task_h_load(p); > > > > - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed) > > - goto next; > > + if (sched_feat(LB_MIN) && > > + load < 16 && !env->sd->nr_balance_failed) > > + goto next; > > > > - if ((load / 2) > env->imbalance) > > - goto next; > > + if ((load / 2) > env->imbalance) > > + goto next; > > + > > + env->imbalance -= load; > > + break; > > + > > + case migrate_util: > > + util = task_util_est(p); > > + > > + if (util > env->imbalance) > > + goto next; > > + > > + env->imbalance -= util; > > + break; > > + > > + case migrate_task: > > + env->imbalance--; > > + break; > > + > > + case migrate_misfit: > > + load = task_h_load(p); > > + > > + /* > > + * load of misfit task might decrease a bit since it has > > + * been recorded. Be conservative in the condition. > > + */ > > + if (load / 2 < env->imbalance) > > + goto next; > > + > > + env->imbalance = 0; > > + break; > > + } > > > > So, no problem with this but it brings up another point. migration_type > also determines what env->imbalance means (e.g. load, utilisation, > nr_running etc). That was true before your patch too but now it's much > more explicit, which is nice, but could do with a comment. > > > detach_task(p, env); > > list_add(&p->se.group_node, &env->tasks); > > > > detached++; > > - env->imbalance -= load; > > > > #ifdef CONFIG_PREEMPTION > > /* > > @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env) > > > > /* > > * We only want to steal up to the prescribed amount of > > - * runnable load. > > + * load/util/tasks. > > */ > > if (env->imbalance <= 0) > > break; > > @@ -7661,7 +7708,6 @@ struct sg_lb_stats { > > unsigned int idle_cpus; > > unsigned int group_weight; > > enum group_type group_type; > > - int group_no_capacity; > > unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ > > unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ > > #ifdef CONFIG_NUMA_BALANCING > > Glad to see group_no_capacity go away, that had some "interesting" > treatment in update_sd_lb_stats. > > > @@ -7677,10 +7723,10 @@ struct sg_lb_stats { > > struct sd_lb_stats { > > struct sched_group *busiest; /* Busiest group in this sd */ > > struct sched_group *local; /* Local group in this sd */ > > - unsigned long total_running; > > unsigned long total_load; /* Total load of all groups in sd */ > > unsigned long total_capacity; /* Total capacity of all groups in sd */ > > unsigned long avg_load; /* Average load across all groups in sd */ > > + unsigned int prefer_sibling; /* tasks should go to sibling first */ > > > > struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ > > struct sg_lb_stats local_stat; /* Statistics of the local group */ > > @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) > > /* > > * Skimp on the clearing to avoid duplicate work. We can avoid clearing > > * local_stat because update_sg_lb_stats() does a full clear/assignment. > > - * We must however clear busiest_stat::avg_load because > > - * update_sd_pick_busiest() reads this before assignment. > > + * We must however set busiest_stat::group_type and > > + * busiest_stat::idle_cpus to the worst busiest group because > > + * update_sd_pick_busiest() reads these before assignment. > > */ > > *sds = (struct sd_lb_stats){ > > .busiest = NULL, > > .local = NULL, > > - .total_running = 0UL, > > .total_load = 0UL, > > .total_capacity = 0UL, > > .busiest_stat = { > > - .avg_load = 0UL, > > - .sum_h_nr_running = 0, > > - .group_type = group_other, > > + .idle_cpus = UINT_MAX, > > + .group_type = group_has_spare, > > }, > > }; > > } > > @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) > > } > > > > static inline enum > > -group_type group_classify(struct sched_group *group, > > +group_type group_classify(struct lb_env *env, > > + struct sched_group *group, > > struct sg_lb_stats *sgs) > > { > > - if (sgs->group_no_capacity) > > + if (group_is_overloaded(env, sgs)) > > return group_overloaded; > > > > if (sg_imbalanced(group)) > > return group_imbalanced; > > > > + if (sgs->group_asym_packing) > > + return group_asym_packing; > > + > > if (sgs->group_misfit_task_load) > > return group_misfit_task; > > > > - return group_other; > > + if (!group_has_capacity(env, sgs)) > > + return group_fully_busy; > > + > > + return group_has_spare; > > } > > > > static bool update_nohz_stats(struct rq *rq, bool force) > > @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, > > struct sg_lb_stats *sgs, > > int *sg_status) > > { > > - int i, nr_running; > > + int i, nr_running, local_group; > > > > memset(sgs, 0, sizeof(*sgs)); > > > > + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); > > + > > for_each_cpu_and(i, sched_group_span(group), env->cpus) { > > struct rq *rq = cpu_rq(i); > > > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, > > /* > > * No need to call idle_cpu() if nr_running is not 0 > > */ > > - if (!nr_running && idle_cpu(i)) > > + if (!nr_running && idle_cpu(i)) { > > sgs->idle_cpus++; > > + /* Idle cpu can't have misfit task */ > > + continue; > > + } > > + > > + if (local_group) > > + continue; > > > > + /* Check for a misfit task on the cpu */ > > if (env->sd->flags & SD_ASYM_CPUCAPACITY && > > sgs->group_misfit_task_load < rq->misfit_task_load) { > > sgs->group_misfit_task_load = rq->misfit_task_load; > > So.... why exactly do we not care about misfit tasks on CPUs in the > local group? I'm not saying you're wrong because you have a clear idea > on how misfit tasks should be treated but it's very non-obvious just > from the code. local_group can't do anything with local misfit tasks so it doesn't give any additional information compared to overloaded, fully_busy or has_spare > > > <SNIP> > > > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, > > if (sgs->group_type < busiest->group_type) > > return false; > > > > - if (sgs->avg_load <= busiest->avg_load) > > - return false; > > - > > - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) > > - goto asym_packing; > > - > > /* > > - * Candidate sg has no more than one task per CPU and > > - * has higher per-CPU capacity. Migrating tasks to less > > - * capable CPUs may harm throughput. Maximize throughput, > > - * power/energy consequences are not considered. > > + * The candidate and the current busiest group are the same type of > > + * group. Let check which one is the busiest according to the type. > > */ > > - if (sgs->sum_h_nr_running <= sgs->group_weight && > > - group_smaller_min_cpu_capacity(sds->local, sg)) > > - return false; > > > > - /* > > - * If we have more than one misfit sg go with the biggest misfit. > > - */ > > - if (sgs->group_type == group_misfit_task && > > - sgs->group_misfit_task_load < busiest->group_misfit_task_load) > > + switch (sgs->group_type) { > > + case group_overloaded: > > + /* Select the overloaded group with highest avg_load. */ > > + if (sgs->avg_load <= busiest->avg_load) > > + return false; > > + break; > > + > > + case group_imbalanced: > > + /* > > + * Select the 1st imbalanced group as we don't have any way to > > + * choose one more than another. > > + */ > > return false; > > > > -asym_packing: > > - /* This is the busiest node in its class. */ > > - if (!(env->sd->flags & SD_ASYM_PACKING)) > > - return true; > > + case group_asym_packing: > > + /* Prefer to move from lowest priority CPU's work */ > > + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) > > + return false; > > + break; > > > > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking > sched_asym_prefer. the test is done when collecting group's statistic in update_sg_lb_stats() /* Check if dst cpu is idle and preferred to this group */ if (env->sd->flags & SD_ASYM_PACKING && env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running && sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { sgs->group_asym_packing = 1; } Then the group type group_asym_packing is only set if sgs->group_asym_packing has been set > > > <SNIP> > > + case group_fully_busy: > > + /* > > + * Select the fully busy group with highest avg_load. In > > + * theory, there is no need to pull task from such kind of > > + * group because tasks have all compute capacity that they need > > + * but we can still improve the overall throughput by reducing > > + * contention when accessing shared HW resources. > > + * > > + * XXX for now avg_load is not computed and always 0 so we > > + * select the 1st one. > > + */ > > + if (sgs->avg_load <= busiest->avg_load) > > + return false; > > + break; > > + > > With the exception that if we are balancing between NUMA domains and they > were communicating tasks that we've now pulled them apart. That might > increase the CPU resources available at the cost of increased remote > memory access cost. I expect the numa classification to help and skip those runqueue > > > + case group_has_spare: > > + /* > > + * Select not overloaded group with lowest number of > > + * idle cpus. We could also compare the spare capacity > > + * which is more stable but it can end up that the > > + * group has less spare capacity but finally more idle > > + * cpus which means less opportunity to pull tasks. > > + */ > > + if (sgs->idle_cpus >= busiest->idle_cpus) > > + return false; > > + break; > > } > > > > - return false; > > + /* > > + * Candidate sg has no more than one task per CPU and has higher > > + * per-CPU capacity. Migrating tasks to less capable CPUs may harm > > + * throughput. Maximize throughput, power/energy consequences are not > > + * considered. > > + */ > > + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) && > > + (sgs->group_type <= group_fully_busy) && > > + (group_smaller_min_cpu_capacity(sds->local, sg))) > > + return false; > > + > > + return true; > > } > > > > #ifdef CONFIG_NUMA_BALANCING > > @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) > > * @env: The load balancing environment. > > * @sds: variable to hold the statistics for this sched_domain. > > */ > > + > > Spurious whitespace change. > > > static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) > > { > > struct sched_domain *child = env->sd->child; > > struct sched_group *sg = env->sd->groups; > > struct sg_lb_stats *local = &sds->local_stat; > > struct sg_lb_stats tmp_sgs; > > - bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > > int sg_status = 0; > > > > #ifdef CONFIG_NO_HZ_COMMON > > @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > if (local_group) > > goto next_group; > > > > - /* > > - * In case the child domain prefers tasks go to siblings > > - * first, lower the sg capacity so that we'll try > > - * and move all the excess tasks away. We lower the capacity > > - * of a group only if the local group has the capacity to fit > > - * these excess tasks. The extra check prevents the case where > > - * you always pull from the heaviest group when it is already > > - * under-utilized (possible with a large weight task outweighs > > - * the tasks on the system). > > - */ > > - if (prefer_sibling && sds->local && > > - group_has_capacity(env, local) && > > - (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { > > - sgs->group_no_capacity = 1; > > - sgs->group_type = group_classify(sg, sgs); > > - } > > > > if (update_sd_pick_busiest(env, sds, sg, sgs)) { > > sds->busiest = sg; > > @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > > next_group: > > /* Now, start updating sd_lb_stats */ > > - sds->total_running += sgs->sum_h_nr_running; > > sds->total_load += sgs->group_load; > > sds->total_capacity += sgs->group_capacity; > > > > sg = sg->next; > > } while (sg != env->sd->groups); > > > > + /* Tag domain that child domain prefers tasks go to siblings first */ > > + sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; > > + > > #ifdef CONFIG_NO_HZ_COMMON > > if ((env->flags & LBF_NOHZ_AGAIN) && > > cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) { > > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > */ > > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) > > { > > - unsigned long max_pull, load_above_capacity = ~0UL; > > struct sg_lb_stats *local, *busiest; > > > > local = &sds->local_stat; > > busiest = &sds->busiest_stat; > > > > - if (busiest->group_asym_packing) { > > - env->imbalance = busiest->group_load; > > + if (busiest->group_type == group_misfit_task) { > > + /* Set imbalance to allow misfit task to be balanced. */ > > + env->migration_type = migrate_misfit; > > + env->imbalance = busiest->group_misfit_task_load; > > + return; > > + } > > + > > + if (busiest->group_type == group_asym_packing) { > > + /* > > + * In case of asym capacity, we will try to migrate all load to > > + * the preferred CPU. > > + */ > > + env->migration_type = migrate_task; > > + env->imbalance = busiest->sum_h_nr_running; > > + return; > > + } > > + > > + if (busiest->group_type == group_imbalanced) { > > + /* > > + * In the group_imb case we cannot rely on group-wide averages > > + * to ensure CPU-load equilibrium, try to move any task to fix > > + * the imbalance. The next load balance will take care of > > + * balancing back the system. > > + */ > > + env->migration_type = migrate_task; > > + env->imbalance = 1; > > return; > > } > > > > /* > > - * Avg load of busiest sg can be less and avg load of local sg can > > - * be greater than avg load across all sgs of sd because avg load > > - * factors in sg capacity and sgs with smaller group_type are > > - * skipped when updating the busiest sg: > > + * Try to use spare capacity of local group without overloading it or > > + * emptying busiest > > */ > > - if (busiest->group_type != group_misfit_task && > > - (busiest->avg_load <= sds->avg_load || > > - local->avg_load >= sds->avg_load)) { > > - env->imbalance = 0; > > + if (local->group_type == group_has_spare) { > > + if (busiest->group_type > group_fully_busy) { > > + /* > > + * If busiest is overloaded, try to fill spare > > + * capacity. This might end up creating spare capacity > > + * in busiest or busiest still being overloaded but > > + * there is no simple way to directly compute the > > + * amount of load to migrate in order to balance the > > + * system. > > + */ > > busiest may not be overloaded, it may be imbalanced. Maybe the > distinction is irrelevant though. the case busiest->group_type == group_imbalanced has already been handled earlier int he function > > > + env->migration_type = migrate_util; > > + env->imbalance = max(local->group_capacity, local->group_util) - > > + local->group_util; > > + > > + /* > > + * In some case, the group's utilization is max or even > > + * higher than capacity because of migrations but the > > + * local CPU is (newly) idle. There is at least one > > + * waiting task in this overloaded busiest group. Let > > + * try to pull it. > > + */ > > + if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) { > > + env->migration_type = migrate_task; > > + env->imbalance = 1; > > + } > > + > > Not convinced this is a good thing to do across NUMA domains. If it was > tied to the group being definitely overloaded then I could see the logic. > > > + return; > > + } > > + > > + if (busiest->group_weight == 1 || sds->prefer_sibling) { > > + unsigned int nr_diff = busiest->sum_h_nr_running; > > + /* > > + * When prefer sibling, evenly spread running tasks on > > + * groups. > > + */ > > + env->migration_type = migrate_task; > > + lsub_positive(&nr_diff, local->sum_h_nr_running); > > + env->imbalance = nr_diff >> 1; > > + return; > > + } > > Comment is slightly misleading given that it's not just preferring > siblings but for when balancing across single CPU domains. > > > + > > + /* > > + * If there is no overload, we just want to even the number of > > + * idle cpus. > > + */ > > + env->migration_type = migrate_task; > > + env->imbalance = max_t(long, 0, (local->idle_cpus - > > + busiest->idle_cpus) >> 1); > > return; > > } > > Why do we want an even number of idle CPUs unconditionally? This goes back > to the NUMA domain case. 2 communicating tasks running on a 2-socket system > should not be pulled apart just to have 1 task running on each socket. > > I didn't see anything obvious after this point but I also am getting a > bit on the fried side trying to hold this entire patch in my head and > got hung up on the NUMA domain balancing in particular. > > -- > Mel Gorman > SUSE Labs
On Thu, Oct 31, 2019 at 10:09:17AM +0100, Vincent Guittot wrote: > On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote: > > > The load_balance algorithm contains some heuristics which have become > > > meaningless since the rework of the scheduler's metrics like the > > > introduction of PELT. > > > > > > Furthermore, load is an ill-suited metric for solving certain task > > > placement imbalance scenarios. For instance, in the presence of idle CPUs, > > > we should simply try to get at least one task per CPU, whereas the current > > > load-based algorithm can actually leave idle CPUs alone simply because the > > > load is somewhat balanced. The current algorithm ends up creating virtual > > > and meaningless value like the avg_load_per_task or tweaks the state of a > > > group to make it overloaded whereas it's not, in order to try to migrate > > > tasks. > > > > > > > I do not think this is necessarily 100% true. With both the previous > > load-balancing behaviour and the apparent behaviour of this patch, it's > > still possible to pull two communicating tasks apart and across NUMA > > domains when utilisation is low. Specifically, a load difference of less > > than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate > > task to level out load. > > > > So, load might be ill-suited for some cases but that does not make it > > completely useless either. > > I fully agree and we keep using it in some cases. > The goal is only to not use it when it is obviously the wrong metric to be used > Understood, ideally it's be explicit why each metric (task/util/load) is used each time for future reference and why it's the best for a given situation. It's not a requirement for the series as the scheduler does not have a perfect history of explaining itself. > > > > The type of behaviour can be seen by running netperf via mmtests > > (configuration file configs/config-network-netperf-unbound) on a NUMA > > machine and noting that the local vs remote NUMA hinting faults are roughly > > 50%. I had prototyped some fixes around this that took imbalance_pct into > > account but it was too special-cased and was not a universal win. If > > I was reviewing my own patch I would have naked it on the "you added a > > special-case hack into the load balancer for one load". I didn't get back > > to it before getting cc'd on this series. > > > > > load_balance should better qualify the imbalance of the group and clearly > > > define what has to be moved to fix this imbalance. > > > > > > The type of sched_group has been extended to better reflect the type of > > > imbalance. We now have : > > > group_has_spare > > > group_fully_busy > > > group_misfit_task > > > group_asym_packing > > > group_imbalanced > > > group_overloaded > > > > > > Based on the type of sched_group, load_balance now sets what it wants to > > > move in order to fix the imbalance. It can be some load as before but also > > > some utilization, a number of task or a type of task: > > > migrate_task > > > migrate_util > > > migrate_load > > > migrate_misfit > > > > > > This new load_balance algorithm fixes several pending wrong tasks > > > placement: > > > - the 1 task per CPU case with asymmetric system > > > - the case of cfs task preempted by other class > > > - the case of tasks not evenly spread on groups with spare capacity > > > > > > > On the last one, spreading tasks evenly across NUMA domains is not > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine > > with 24 logical CPUs per socket, it should not automatically mean that > > one task should move cross-node and I have definitely observed this > > happening. It's probably bad in terms of locality no matter what but it's > > especially bad if the 2 tasks happened to be communicating because then > > load balancing will pull apart the tasks while wake_affine will push > > them together (and potentially NUMA balancing as well). Note that this > > also applies for some IO workloads because, depending on the filesystem, > > the task may be communicating with workqueues (XFS) or a kernel thread > > (ext4 with jbd2). > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing > still gives guidances with fbq_classify_group/queue. I know the NUMA_BALANCING part is not touched, I'm talking about load balancing across SD_NUMA domains which happens independently of NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to override the load balancer when it moves tasks away from the preferred node. > But the latter could also take advantage of the new type of group. For > example, what I did in the fix for find_idlest_group : checking > numa_preferred_nid when the group has capacity and keep the task on > preferred node if possible. Similar behavior could also be beneficial > in periodic load_balance case. > And this is the catch -- numa_preferred_nid is not guaranteed to be set at all. NUMA balancing might be disabled, the task may not have been running long enough to pick a preferred NID or NUMA balancing might be unable to pick a preferred NID. The decision to avoid unnecessary migrations across NUMA domains should be made independently of NUMA balancing. The netperf configuration from mmtests is great at illustrating the point because it'll also say what the average local/remote access ratio is. 2 communicating tasks running on an otherwise idle NUMA machine should not have the load balancer move the server to one node and the client to another. Historically, we might have accounted for this with imbalance_pct which makes sense for load and was special cased in some places, but it does not make sense to use imbalance_pct for nr_running. Either way, I think balancing across SD_NUMA should have explicit logic to take into account that there is an additional cost outside of the scheduler when a task moves cross-node. Even if it's a case that this series does not tackle the problem now, it'd be good to leave a TODO comment behind noting that SD_NUMA may need to be special cased. > > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, > > > /* > > > * No need to call idle_cpu() if nr_running is not 0 > > > */ > > > - if (!nr_running && idle_cpu(i)) > > > + if (!nr_running && idle_cpu(i)) { > > > sgs->idle_cpus++; > > > + /* Idle cpu can't have misfit task */ > > > + continue; > > > + } > > > + > > > + if (local_group) > > > + continue; > > > > > > + /* Check for a misfit task on the cpu */ > > > if (env->sd->flags & SD_ASYM_CPUCAPACITY && > > > sgs->group_misfit_task_load < rq->misfit_task_load) { > > > sgs->group_misfit_task_load = rq->misfit_task_load; > > > > So.... why exactly do we not care about misfit tasks on CPUs in the > > local group? I'm not saying you're wrong because you have a clear idea > > on how misfit tasks should be treated but it's very non-obvious just > > from the code. > > local_group can't do anything with local misfit tasks so it doesn't > give any additional information compared to overloaded, fully_busy or > has_spare > Ok, that's very clear and now I'm feeling a bit stupid because I should have spotted that. It really could do with a comment so somebody else does not try "fixing" it :( > > > <SNIP> > > > > > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, > > > if (sgs->group_type < busiest->group_type) > > > return false; > > > > > > - if (sgs->avg_load <= busiest->avg_load) > > > - return false; > > > - > > > - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) > > > - goto asym_packing; > > > - > > > /* > > > - * Candidate sg has no more than one task per CPU and > > > - * has higher per-CPU capacity. Migrating tasks to less > > > - * capable CPUs may harm throughput. Maximize throughput, > > > - * power/energy consequences are not considered. > > > + * The candidate and the current busiest group are the same type of > > > + * group. Let check which one is the busiest according to the type. > > > */ > > > - if (sgs->sum_h_nr_running <= sgs->group_weight && > > > - group_smaller_min_cpu_capacity(sds->local, sg)) > > > - return false; > > > > > > - /* > > > - * If we have more than one misfit sg go with the biggest misfit. > > > - */ > > > - if (sgs->group_type == group_misfit_task && > > > - sgs->group_misfit_task_load < busiest->group_misfit_task_load) > > > + switch (sgs->group_type) { > > > + case group_overloaded: > > > + /* Select the overloaded group with highest avg_load. */ > > > + if (sgs->avg_load <= busiest->avg_load) > > > + return false; > > > + break; > > > + > > > + case group_imbalanced: > > > + /* > > > + * Select the 1st imbalanced group as we don't have any way to > > > + * choose one more than another. > > > + */ > > > return false; > > > > > > -asym_packing: > > > - /* This is the busiest node in its class. */ > > > - if (!(env->sd->flags & SD_ASYM_PACKING)) > > > - return true; > > > + case group_asym_packing: > > > + /* Prefer to move from lowest priority CPU's work */ > > > + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) > > > + return false; > > > + break; > > > > > > > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking > > sched_asym_prefer. > > the test is done when collecting group's statistic in update_sg_lb_stats() > /* Check if dst cpu is idle and preferred to this group */ > if (env->sd->flags & SD_ASYM_PACKING && > env->idle != CPU_NOT_IDLE && > sgs->sum_h_nr_running && > sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { > sgs->group_asym_packing = 1; > } > > Then the group type group_asym_packing is only set if > sgs->group_asym_packing has been set > Yeah, sorry. I need to get ASYM_PACKING clearer in my head. > > > > > <SNIP> > > > + case group_fully_busy: > > > + /* > > > + * Select the fully busy group with highest avg_load. In > > > + * theory, there is no need to pull task from such kind of > > > + * group because tasks have all compute capacity that they need > > > + * but we can still improve the overall throughput by reducing > > > + * contention when accessing shared HW resources. > > > + * > > > + * XXX for now avg_load is not computed and always 0 so we > > > + * select the 1st one. > > > + */ > > > + if (sgs->avg_load <= busiest->avg_load) > > > + return false; > > > + break; > > > + > > > > With the exception that if we are balancing between NUMA domains and they > > were communicating tasks that we've now pulled them apart. That might > > increase the CPU resources available at the cost of increased remote > > memory access cost. > > I expect the numa classification to help and skip those runqueue > It might but the "canary in the mine" is netperf. A basic pair should not be pulled apart. > > > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > */ > > > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) > > > { > > > - unsigned long max_pull, load_above_capacity = ~0UL; > > > struct sg_lb_stats *local, *busiest; > > > > > > local = &sds->local_stat; > > > busiest = &sds->busiest_stat; > > > > > > - if (busiest->group_asym_packing) { > > > - env->imbalance = busiest->group_load; > > > + if (busiest->group_type == group_misfit_task) { > > > + /* Set imbalance to allow misfit task to be balanced. */ > > > + env->migration_type = migrate_misfit; > > > + env->imbalance = busiest->group_misfit_task_load; > > > + return; > > > + } > > > + > > > + if (busiest->group_type == group_asym_packing) { > > > + /* > > > + * In case of asym capacity, we will try to migrate all load to > > > + * the preferred CPU. > > > + */ > > > + env->migration_type = migrate_task; > > > + env->imbalance = busiest->sum_h_nr_running; > > > + return; > > > + } > > > + > > > + if (busiest->group_type == group_imbalanced) { > > > + /* > > > + * In the group_imb case we cannot rely on group-wide averages > > > + * to ensure CPU-load equilibrium, try to move any task to fix > > > + * the imbalance. The next load balance will take care of > > > + * balancing back the system. > > > + */ > > > + env->migration_type = migrate_task; > > > + env->imbalance = 1; > > > return; > > > } > > > > > > /* > > > - * Avg load of busiest sg can be less and avg load of local sg can > > > - * be greater than avg load across all sgs of sd because avg load > > > - * factors in sg capacity and sgs with smaller group_type are > > > - * skipped when updating the busiest sg: > > > + * Try to use spare capacity of local group without overloading it or > > > + * emptying busiest > > > */ > > > - if (busiest->group_type != group_misfit_task && > > > - (busiest->avg_load <= sds->avg_load || > > > - local->avg_load >= sds->avg_load)) { > > > - env->imbalance = 0; > > > + if (local->group_type == group_has_spare) { > > > + if (busiest->group_type > group_fully_busy) { > > > + /* > > > + * If busiest is overloaded, try to fill spare > > > + * capacity. This might end up creating spare capacity > > > + * in busiest or busiest still being overloaded but > > > + * there is no simple way to directly compute the > > > + * amount of load to migrate in order to balance the > > > + * system. > > > + */ > > > > busiest may not be overloaded, it may be imbalanced. Maybe the > > distinction is irrelevant though. > > the case busiest->group_type == group_imbalanced has already been > handled earlier int he function > Bah, of course. -- Mel Gorman SUSE Labs
On Thu, 31 Oct 2019 at 11:15, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Thu, Oct 31, 2019 at 10:09:17AM +0100, Vincent Guittot wrote: > > On Wed, 30 Oct 2019 at 16:45, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > > > On Fri, Oct 18, 2019 at 03:26:31PM +0200, Vincent Guittot wrote: > > > > The load_balance algorithm contains some heuristics which have become > > > > meaningless since the rework of the scheduler's metrics like the > > > > introduction of PELT. > > > > > > > > Furthermore, load is an ill-suited metric for solving certain task > > > > placement imbalance scenarios. For instance, in the presence of idle CPUs, > > > > we should simply try to get at least one task per CPU, whereas the current > > > > load-based algorithm can actually leave idle CPUs alone simply because the > > > > load is somewhat balanced. The current algorithm ends up creating virtual > > > > and meaningless value like the avg_load_per_task or tweaks the state of a > > > > group to make it overloaded whereas it's not, in order to try to migrate > > > > tasks. > > > > > > > > > > I do not think this is necessarily 100% true. With both the previous > > > load-balancing behaviour and the apparent behaviour of this patch, it's > > > still possible to pull two communicating tasks apart and across NUMA > > > domains when utilisation is low. Specifically, a load difference of less > > > than SCHED_CAPACITY_SCALE between NUMA codes can be enough too migrate > > > task to level out load. > > > > > > So, load might be ill-suited for some cases but that does not make it > > > completely useless either. > > > > I fully agree and we keep using it in some cases. > > The goal is only to not use it when it is obviously the wrong metric to be used > > > > Understood, ideally it's be explicit why each metric (task/util/load) > is used each time for future reference and why it's the best for a given > situation. It's not a requirement for the series as the scheduler does > not have a perfect history of explaining itself. > > > > > > > The type of behaviour can be seen by running netperf via mmtests > > > (configuration file configs/config-network-netperf-unbound) on a NUMA > > > machine and noting that the local vs remote NUMA hinting faults are roughly > > > 50%. I had prototyped some fixes around this that took imbalance_pct into > > > account but it was too special-cased and was not a universal win. If > > > I was reviewing my own patch I would have naked it on the "you added a > > > special-case hack into the load balancer for one load". I didn't get back > > > to it before getting cc'd on this series. > > > > > > > load_balance should better qualify the imbalance of the group and clearly > > > > define what has to be moved to fix this imbalance. > > > > > > > > The type of sched_group has been extended to better reflect the type of > > > > imbalance. We now have : > > > > group_has_spare > > > > group_fully_busy > > > > group_misfit_task > > > > group_asym_packing > > > > group_imbalanced > > > > group_overloaded > > > > > > > > Based on the type of sched_group, load_balance now sets what it wants to > > > > move in order to fix the imbalance. It can be some load as before but also > > > > some utilization, a number of task or a type of task: > > > > migrate_task > > > > migrate_util > > > > migrate_load > > > > migrate_misfit > > > > > > > > This new load_balance algorithm fixes several pending wrong tasks > > > > placement: > > > > - the 1 task per CPU case with asymmetric system > > > > - the case of cfs task preempted by other class > > > > - the case of tasks not evenly spread on groups with spare capacity > > > > > > > > > > On the last one, spreading tasks evenly across NUMA domains is not > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine > > > with 24 logical CPUs per socket, it should not automatically mean that > > > one task should move cross-node and I have definitely observed this > > > happening. It's probably bad in terms of locality no matter what but it's > > > especially bad if the 2 tasks happened to be communicating because then > > > load balancing will pull apart the tasks while wake_affine will push > > > them together (and potentially NUMA balancing as well). Note that this > > > also applies for some IO workloads because, depending on the filesystem, > > > the task may be communicating with workqueues (XFS) or a kernel thread > > > (ext4 with jbd2). > > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing > > still gives guidances with fbq_classify_group/queue. > > I know the NUMA_BALANCING part is not touched, I'm talking about load > balancing across SD_NUMA domains which happens independently of > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to > override the load balancer when it moves tasks away from the preferred > node. Yes. this patchset relies on this override for now to prevent moving task away. I agree that additional patches are probably needed to improve load balance at NUMA level and I expect that this rework will make it simpler to add. I just wanted to get the output of some real use cases before defining more numa level specific conditions. Some want to spread on there numa nodes but other want to keep everything together. The preferred node and fbq_classify_group was the only sensible metrics to me when he wrote this patchset but changes can be added if they make sense. > > > But the latter could also take advantage of the new type of group. For > > example, what I did in the fix for find_idlest_group : checking > > numa_preferred_nid when the group has capacity and keep the task on > > preferred node if possible. Similar behavior could also be beneficial > > in periodic load_balance case. > > > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at > all. NUMA balancing might be disabled, the task may not have been running > long enough to pick a preferred NID or NUMA balancing might be unable to > pick a preferred NID. The decision to avoid unnecessary migrations across > NUMA domains should be made independently of NUMA balancing. The netperf > configuration from mmtests is great at illustrating the point because it'll > also say what the average local/remote access ratio is. 2 communicating > tasks running on an otherwise idle NUMA machine should not have the load > balancer move the server to one node and the client to another. I'm going to make it a try on my setup to see the results > > Historically, we might have accounted for this with imbalance_pct which > makes sense for load and was special cased in some places, but it does > not make sense to use imbalance_pct for nr_running. Either way, I think > balancing across SD_NUMA should have explicit logic to take into account > that there is an additional cost outside of the scheduler when a task > moves cross-node. > > Even if it's a case that this series does not tackle the problem now, > it'd be good to leave a TODO comment behind noting that SD_NUMA may need > to be special cased. > > > > > @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, > > > > /* > > > > * No need to call idle_cpu() if nr_running is not 0 > > > > */ > > > > - if (!nr_running && idle_cpu(i)) > > > > + if (!nr_running && idle_cpu(i)) { > > > > sgs->idle_cpus++; > > > > + /* Idle cpu can't have misfit task */ > > > > + continue; > > > > + } > > > > + > > > > + if (local_group) > > > > + continue; > > > > > > > > + /* Check for a misfit task on the cpu */ > > > > if (env->sd->flags & SD_ASYM_CPUCAPACITY && > > > > sgs->group_misfit_task_load < rq->misfit_task_load) { > > > > sgs->group_misfit_task_load = rq->misfit_task_load; > > > > > > So.... why exactly do we not care about misfit tasks on CPUs in the > > > local group? I'm not saying you're wrong because you have a clear idea > > > on how misfit tasks should be treated but it's very non-obvious just > > > from the code. > > > > local_group can't do anything with local misfit tasks so it doesn't > > give any additional information compared to overloaded, fully_busy or > > has_spare > > > > Ok, that's very clear and now I'm feeling a bit stupid because I should > have spotted that. It really could do with a comment so somebody else > does not try "fixing" it :( > > > > > <SNIP> > > > > > > > > @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, > > > > if (sgs->group_type < busiest->group_type) > > > > return false; > > > > > > > > - if (sgs->avg_load <= busiest->avg_load) > > > > - return false; > > > > - > > > > - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) > > > > - goto asym_packing; > > > > - > > > > /* > > > > - * Candidate sg has no more than one task per CPU and > > > > - * has higher per-CPU capacity. Migrating tasks to less > > > > - * capable CPUs may harm throughput. Maximize throughput, > > > > - * power/energy consequences are not considered. > > > > + * The candidate and the current busiest group are the same type of > > > > + * group. Let check which one is the busiest according to the type. > > > > */ > > > > - if (sgs->sum_h_nr_running <= sgs->group_weight && > > > > - group_smaller_min_cpu_capacity(sds->local, sg)) > > > > - return false; > > > > > > > > - /* > > > > - * If we have more than one misfit sg go with the biggest misfit. > > > > - */ > > > > - if (sgs->group_type == group_misfit_task && > > > > - sgs->group_misfit_task_load < busiest->group_misfit_task_load) > > > > + switch (sgs->group_type) { > > > > + case group_overloaded: > > > > + /* Select the overloaded group with highest avg_load. */ > > > > + if (sgs->avg_load <= busiest->avg_load) > > > > + return false; > > > > + break; > > > > + > > > > + case group_imbalanced: > > > > + /* > > > > + * Select the 1st imbalanced group as we don't have any way to > > > > + * choose one more than another. > > > > + */ > > > > return false; > > > > > > > > -asym_packing: > > > > - /* This is the busiest node in its class. */ > > > > - if (!(env->sd->flags & SD_ASYM_PACKING)) > > > > - return true; > > > > + case group_asym_packing: > > > > + /* Prefer to move from lowest priority CPU's work */ > > > > + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) > > > > + return false; > > > > + break; > > > > > > > > > > Again, I'm not seeing what prevents a !SD_ASYM_PACKING domain checking > > > sched_asym_prefer. > > > > the test is done when collecting group's statistic in update_sg_lb_stats() > > /* Check if dst cpu is idle and preferred to this group */ > > if (env->sd->flags & SD_ASYM_PACKING && > > env->idle != CPU_NOT_IDLE && > > sgs->sum_h_nr_running && > > sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { > > sgs->group_asym_packing = 1; > > } > > > > Then the group type group_asym_packing is only set if > > sgs->group_asym_packing has been set > > > > Yeah, sorry. I need to get ASYM_PACKING clearer in my head. > > > > > > > > <SNIP> > > > > + case group_fully_busy: > > > > + /* > > > > + * Select the fully busy group with highest avg_load. In > > > > + * theory, there is no need to pull task from such kind of > > > > + * group because tasks have all compute capacity that they need > > > > + * but we can still improve the overall throughput by reducing > > > > + * contention when accessing shared HW resources. > > > > + * > > > > + * XXX for now avg_load is not computed and always 0 so we > > > > + * select the 1st one. > > > > + */ > > > > + if (sgs->avg_load <= busiest->avg_load) > > > > + return false; > > > > + break; > > > > + > > > > > > With the exception that if we are balancing between NUMA domains and they > > > were communicating tasks that we've now pulled them apart. That might > > > increase the CPU resources available at the cost of increased remote > > > memory access cost. > > > > I expect the numa classification to help and skip those runqueue > > > > It might but the "canary in the mine" is netperf. A basic pair should > not be pulled apart. > > > > > @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd > > > > */ > > > > static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) > > > > { > > > > - unsigned long max_pull, load_above_capacity = ~0UL; > > > > struct sg_lb_stats *local, *busiest; > > > > > > > > local = &sds->local_stat; > > > > busiest = &sds->busiest_stat; > > > > > > > > - if (busiest->group_asym_packing) { > > > > - env->imbalance = busiest->group_load; > > > > + if (busiest->group_type == group_misfit_task) { > > > > + /* Set imbalance to allow misfit task to be balanced. */ > > > > + env->migration_type = migrate_misfit; > > > > + env->imbalance = busiest->group_misfit_task_load; > > > > + return; > > > > + } > > > > + > > > > + if (busiest->group_type == group_asym_packing) { > > > > + /* > > > > + * In case of asym capacity, we will try to migrate all load to > > > > + * the preferred CPU. > > > > + */ > > > > + env->migration_type = migrate_task; > > > > + env->imbalance = busiest->sum_h_nr_running; > > > > + return; > > > > + } > > > > + > > > > + if (busiest->group_type == group_imbalanced) { > > > > + /* > > > > + * In the group_imb case we cannot rely on group-wide averages > > > > + * to ensure CPU-load equilibrium, try to move any task to fix > > > > + * the imbalance. The next load balance will take care of > > > > + * balancing back the system. > > > > + */ > > > > + env->migration_type = migrate_task; > > > > + env->imbalance = 1; > > > > return; > > > > } > > > > > > > > /* > > > > - * Avg load of busiest sg can be less and avg load of local sg can > > > > - * be greater than avg load across all sgs of sd because avg load > > > > - * factors in sg capacity and sgs with smaller group_type are > > > > - * skipped when updating the busiest sg: > > > > + * Try to use spare capacity of local group without overloading it or > > > > + * emptying busiest > > > > */ > > > > - if (busiest->group_type != group_misfit_task && > > > > - (busiest->avg_load <= sds->avg_load || > > > > - local->avg_load >= sds->avg_load)) { > > > > - env->imbalance = 0; > > > > + if (local->group_type == group_has_spare) { > > > > + if (busiest->group_type > group_fully_busy) { > > > > + /* > > > > + * If busiest is overloaded, try to fill spare > > > > + * capacity. This might end up creating spare capacity > > > > + * in busiest or busiest still being overloaded but > > > > + * there is no simple way to directly compute the > > > > + * amount of load to migrate in order to balance the > > > > + * system. > > > > + */ > > > > > > busiest may not be overloaded, it may be imbalanced. Maybe the > > > distinction is irrelevant though. > > > > the case busiest->group_type == group_imbalanced has already been > > handled earlier int he function > > > > Bah, of course. > > -- > Mel Gorman > SUSE Labs
On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote: > > > > On the last one, spreading tasks evenly across NUMA domains is not > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine > > > > with 24 logical CPUs per socket, it should not automatically mean that > > > > one task should move cross-node and I have definitely observed this > > > > happening. It's probably bad in terms of locality no matter what but it's > > > > especially bad if the 2 tasks happened to be communicating because then > > > > load balancing will pull apart the tasks while wake_affine will push > > > > them together (and potentially NUMA balancing as well). Note that this > > > > also applies for some IO workloads because, depending on the filesystem, > > > > the task may be communicating with workqueues (XFS) or a kernel thread > > > > (ext4 with jbd2). > > > > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing > > > still gives guidances with fbq_classify_group/queue. > > > > I know the NUMA_BALANCING part is not touched, I'm talking about load > > balancing across SD_NUMA domains which happens independently of > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to > > override the load balancer when it moves tasks away from the preferred > > node. > > Yes. this patchset relies on this override for now to prevent moving task away. Fair enough, netperf hits the corner case where it does not work but that is also true without your series. > I agree that additional patches are probably needed to improve load > balance at NUMA level and I expect that this rework will make it > simpler to add. > I just wanted to get the output of some real use cases before defining > more numa level specific conditions. Some want to spread on there numa > nodes but other want to keep everything together. The preferred node > and fbq_classify_group was the only sensible metrics to me when he > wrote this patchset but changes can be added if they make sense. > That's fair. While it was possible to address the case before your series, it was a hatchet job. If the changelog simply notes that some special casing may still be required for SD_NUMA but it's outside the scope of the series, then I'd be happy. At least there is a good chance then if there is follow-up work that it won't be interpreted as an attempt to reintroduce hacky heuristics. > > > > > But the latter could also take advantage of the new type of group. For > > > example, what I did in the fix for find_idlest_group : checking > > > numa_preferred_nid when the group has capacity and keep the task on > > > preferred node if possible. Similar behavior could also be beneficial > > > in periodic load_balance case. > > > > > > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at > > all. NUMA balancing might be disabled, the task may not have been running > > long enough to pick a preferred NID or NUMA balancing might be unable to > > pick a preferred NID. The decision to avoid unnecessary migrations across > > NUMA domains should be made independently of NUMA balancing. The netperf > > configuration from mmtests is great at illustrating the point because it'll > > also say what the average local/remote access ratio is. 2 communicating > > tasks running on an otherwise idle NUMA machine should not have the load > > balancer move the server to one node and the client to another. > > I'm going to make it a try on my setup to see the results > Thanks. -- Mel Gorman SUSE Labs
Le Thursday 31 Oct 2019 à 11:40:20 (+0000), Mel Gorman a écrit : > On Thu, Oct 31, 2019 at 12:13:09PM +0100, Vincent Guittot wrote: > > > > > On the last one, spreading tasks evenly across NUMA domains is not > > > > > necessarily a good idea. If I have 2 tasks running on a 2-socket machine > > > > > with 24 logical CPUs per socket, it should not automatically mean that > > > > > one task should move cross-node and I have definitely observed this > > > > > happening. It's probably bad in terms of locality no matter what but it's > > > > > especially bad if the 2 tasks happened to be communicating because then > > > > > load balancing will pull apart the tasks while wake_affine will push > > > > > them together (and potentially NUMA balancing as well). Note that this > > > > > also applies for some IO workloads because, depending on the filesystem, > > > > > the task may be communicating with workqueues (XFS) or a kernel thread > > > > > (ext4 with jbd2). > > > > > > > > This rework doesn't touch the NUMA_BALANCING part and NUMA balancing > > > > still gives guidances with fbq_classify_group/queue. > > > > > > I know the NUMA_BALANCING part is not touched, I'm talking about load > > > balancing across SD_NUMA domains which happens independently of > > > NUMA_BALANCING. In fact, there is logic in NUMA_BALANCING that tries to > > > override the load balancer when it moves tasks away from the preferred > > > node. > > > > Yes. this patchset relies on this override for now to prevent moving task away. > > Fair enough, netperf hits the corner case where it does not work but > that is also true without your series. I run mmtest/netperf test on my setup. It's a mix of small positive or negative differences (see below) netperf-udp 5.3-rc2 5.3-rc2 tip +rwk+fix Hmean send-64 95.06 ( 0.00%) 94.12 * -0.99%* Hmean send-128 191.71 ( 0.00%) 189.94 * -0.93%* Hmean send-256 379.05 ( 0.00%) 370.96 * -2.14%* Hmean send-1024 1485.24 ( 0.00%) 1476.64 * -0.58%* Hmean send-2048 2894.80 ( 0.00%) 2887.00 * -0.27%* Hmean send-3312 4580.27 ( 0.00%) 4555.91 * -0.53%* Hmean send-4096 5592.99 ( 0.00%) 5517.31 * -1.35%* Hmean send-8192 9117.00 ( 0.00%) 9497.06 * 4.17%* Hmean send-16384 15824.59 ( 0.00%) 15824.30 * -0.00%* Hmean recv-64 95.06 ( 0.00%) 94.08 * -1.04%* Hmean recv-128 191.68 ( 0.00%) 189.89 * -0.93%* Hmean recv-256 378.94 ( 0.00%) 370.87 * -2.13%* Hmean recv-1024 1485.24 ( 0.00%) 1476.20 * -0.61%* Hmean recv-2048 2893.52 ( 0.00%) 2885.25 * -0.29%* Hmean recv-3312 4580.27 ( 0.00%) 4553.48 * -0.58%* Hmean recv-4096 5592.99 ( 0.00%) 5517.27 * -1.35%* Hmean recv-8192 9115.69 ( 0.00%) 9495.69 * 4.17%* Hmean recv-16384 15824.36 ( 0.00%) 15818.36 * -0.04%* Stddev send-64 0.15 ( 0.00%) 1.17 (-688.29%) Stddev send-128 1.56 ( 0.00%) 1.15 ( 25.96%) Stddev send-256 4.20 ( 0.00%) 5.27 ( -25.63%) Stddev send-1024 20.11 ( 0.00%) 5.68 ( 71.74%) Stddev send-2048 11.06 ( 0.00%) 21.74 ( -96.50%) Stddev send-3312 61.10 ( 0.00%) 48.03 ( 21.39%) Stddev send-4096 71.84 ( 0.00%) 31.99 ( 55.46%) Stddev send-8192 165.14 ( 0.00%) 159.99 ( 3.12%) Stddev send-16384 81.30 ( 0.00%) 188.65 (-132.05%) Stddev recv-64 0.15 ( 0.00%) 1.15 (-673.42%) Stddev recv-128 1.58 ( 0.00%) 1.14 ( 28.27%) Stddev recv-256 4.29 ( 0.00%) 5.19 ( -21.05%) Stddev recv-1024 20.11 ( 0.00%) 5.70 ( 71.67%) Stddev recv-2048 10.43 ( 0.00%) 21.41 (-105.22%) Stddev recv-3312 61.10 ( 0.00%) 46.92 ( 23.20%) Stddev recv-4096 71.84 ( 0.00%) 31.97 ( 55.50%) Stddev recv-8192 163.90 ( 0.00%) 160.88 ( 1.84%) Stddev recv-16384 81.41 ( 0.00%) 187.01 (-129.71%) 5.3-rc2 5.3-rc2 tip +rwk+fix Duration User 38.90 39.13 Duration System 1311.29 1311.10 Duration Elapsed 1892.82 1892.86 netperf-tcp 5.3-rc2 5.3-rc2 tip +rwk+fix Hmean 64 871.30 ( 0.00%) 860.90 * -1.19%* Hmean 128 1689.39 ( 0.00%) 1679.31 * -0.60%* Hmean 256 3199.59 ( 0.00%) 3241.98 * 1.32%* Hmean 1024 9390.47 ( 0.00%) 9268.47 * -1.30%* Hmean 2048 13373.95 ( 0.00%) 13395.61 * 0.16%* Hmean 3312 16701.30 ( 0.00%) 17165.96 * 2.78%* Hmean 4096 15831.03 ( 0.00%) 15544.66 * -1.81%* Hmean 8192 19720.01 ( 0.00%) 20188.60 * 2.38%* Hmean 16384 23925.90 ( 0.00%) 23914.50 * -0.05%* Stddev 64 7.38 ( 0.00%) 4.23 ( 42.67%) Stddev 128 11.62 ( 0.00%) 10.13 ( 12.85%) Stddev 256 34.33 ( 0.00%) 7.94 ( 76.88%) Stddev 1024 35.61 ( 0.00%) 116.34 (-226.66%) Stddev 2048 285.30 ( 0.00%) 80.50 ( 71.78%) Stddev 3312 304.74 ( 0.00%) 449.08 ( -47.36%) Stddev 4096 668.11 ( 0.00%) 569.30 ( 14.79%) Stddev 8192 733.23 ( 0.00%) 944.38 ( -28.80%) Stddev 16384 553.03 ( 0.00%) 299.44 ( 45.86%) 5.3-rc2 5.3-rc2 tip +rwk+fix Duration User 138.05 140.95 Duration System 1210.60 1208.45 Duration Elapsed 1352.86 1352.90 > > > I agree that additional patches are probably needed to improve load > > balance at NUMA level and I expect that this rework will make it > > simpler to add. > > I just wanted to get the output of some real use cases before defining > > more numa level specific conditions. Some want to spread on there numa > > nodes but other want to keep everything together. The preferred node > > and fbq_classify_group was the only sensible metrics to me when he > > wrote this patchset but changes can be added if they make sense. > > > > That's fair. While it was possible to address the case before your > series, it was a hatchet job. If the changelog simply notes that some > special casing may still be required for SD_NUMA but it's outside the > scope of the series, then I'd be happy. At least there is a good chance > then if there is follow-up work that it won't be interpreted as an > attempt to reintroduce hacky heuristics. > Would the additional comment make sense for you about work to be done for SD_NUMA ? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0ad4b21..7e4cb65 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all }; * group. see update_sd_pick_busiest(). */ enum group_type { + /* + * The group has spare capacity that can be used to process more work. + */ group_has_spare = 0, + /* + * The group is fully used and the tasks don't compete for more CPU + * cycles. Nevetheless, some tasks might wait before running. + */ group_fully_busy, + /* + * One task doesn't fit with CPU's capacity and must be migrated on a + * more powerful CPU. + */ group_misfit_task, + /* + * One local CPU with higher capacity is available and task should be + * migrated on it instead on current CPU. + */ group_asym_packing, + /* + * The tasks affinity prevents the scheduler to balance the load across + * the system. + */ group_imbalanced, + /* + * The CPU is overloaded and can't provide expected CPU cycles to all + * tasks. + */ group_overloaded }; @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s /* * Try to use spare capacity of local group without overloading it or - * emptying busiest + * emptying busiest. + * XXX Spreading tasks across numa nodes is not always the best policy + * and special cares should be taken for SD_NUMA domain level before + * spreading the tasks. For now, load_balance() fully relies on + * NUMA_BALANCING and fbq_classify_group/rq to overide the decision. */ if (local->group_type == group_has_spare) { if (busiest->group_type > group_fully_busy) { -- 2.7.4 > > > > > > > > But the latter could also take advantage of the new type of group. For > > > > example, what I did in the fix for find_idlest_group : checking > > > > numa_preferred_nid when the group has capacity and keep the task on > > > > preferred node if possible. Similar behavior could also be beneficial > > > > in periodic load_balance case. > > > > > > > > > > And this is the catch -- numa_preferred_nid is not guaranteed to be set at > > > all. NUMA balancing might be disabled, the task may not have been running > > > long enough to pick a preferred NID or NUMA balancing might be unable to > > > pick a preferred NID. The decision to avoid unnecessary migrations across > > > NUMA domains should be made independently of NUMA balancing. The netperf > > > configuration from mmtests is great at illustrating the point because it'll > > > also say what the average local/remote access ratio is. 2 communicating > > > tasks running on an otherwise idle NUMA machine should not have the load > > > balancer move the server to one node and the client to another. > > > > I'm going to make it a try on my setup to see the results > > > > Thanks. > > -- > Mel Gorman > SUSE Labs
On Fri, Nov 08, 2019 at 05:35:01PM +0100, Vincent Guittot wrote: > > Fair enough, netperf hits the corner case where it does not work but > > that is also true without your series. > > I run mmtest/netperf test on my setup. It's a mix of small positive or > negative differences (see below) > > <SNIP> > > netperf-tcp > 5.3-rc2 5.3-rc2 > tip +rwk+fix > Hmean 64 871.30 ( 0.00%) 860.90 * -1.19%* > Hmean 128 1689.39 ( 0.00%) 1679.31 * -0.60%* > Hmean 256 3199.59 ( 0.00%) 3241.98 * 1.32%* > Hmean 1024 9390.47 ( 0.00%) 9268.47 * -1.30%* > Hmean 2048 13373.95 ( 0.00%) 13395.61 * 0.16%* > Hmean 3312 16701.30 ( 0.00%) 17165.96 * 2.78%* > Hmean 4096 15831.03 ( 0.00%) 15544.66 * -1.81%* > Hmean 8192 19720.01 ( 0.00%) 20188.60 * 2.38%* > Hmean 16384 23925.90 ( 0.00%) 23914.50 * -0.05%* > Stddev 64 7.38 ( 0.00%) 4.23 ( 42.67%) > Stddev 128 11.62 ( 0.00%) 10.13 ( 12.85%) > Stddev 256 34.33 ( 0.00%) 7.94 ( 76.88%) > Stddev 1024 35.61 ( 0.00%) 116.34 (-226.66%) > Stddev 2048 285.30 ( 0.00%) 80.50 ( 71.78%) > Stddev 3312 304.74 ( 0.00%) 449.08 ( -47.36%) > Stddev 4096 668.11 ( 0.00%) 569.30 ( 14.79%) > Stddev 8192 733.23 ( 0.00%) 944.38 ( -28.80%) > Stddev 16384 553.03 ( 0.00%) 299.44 ( 45.86%) > > 5.3-rc2 5.3-rc2 > tip +rwk+fix > Duration User 138.05 140.95 > Duration System 1210.60 1208.45 > Duration Elapsed 1352.86 1352.90 > This roughly matches what I've seen. The interesting part to me for netperf is the next section of the report that reports the locality of numa hints. With netperf on a 2-socket machine, it's generally around 50% as the client/server are pulled apart. Because netperf is not heavily memory bound, it doesn't have much impact on the overall performance but it's good at catching the cross-node migrations. > > > > > I agree that additional patches are probably needed to improve load > > > balance at NUMA level and I expect that this rework will make it > > > simpler to add. > > > I just wanted to get the output of some real use cases before defining > > > more numa level specific conditions. Some want to spread on there numa > > > nodes but other want to keep everything together. The preferred node > > > and fbq_classify_group was the only sensible metrics to me when he > > > wrote this patchset but changes can be added if they make sense. > > > > > > > That's fair. While it was possible to address the case before your > > series, it was a hatchet job. If the changelog simply notes that some > > special casing may still be required for SD_NUMA but it's outside the > > scope of the series, then I'd be happy. At least there is a good chance > > then if there is follow-up work that it won't be interpreted as an > > attempt to reintroduce hacky heuristics. > > > > Would the additional comment make sense for you about work to be done > for SD_NUMA ? > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 0ad4b21..7e4cb65 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all }; > * group. see update_sd_pick_busiest(). > */ > enum group_type { > + /* > + * The group has spare capacity that can be used to process more work. > + */ > group_has_spare = 0, > + /* > + * The group is fully used and the tasks don't compete for more CPU > + * cycles. Nevetheless, some tasks might wait before running. > + */ > group_fully_busy, > + /* > + * One task doesn't fit with CPU's capacity and must be migrated on a > + * more powerful CPU. > + */ > group_misfit_task, > + /* > + * One local CPU with higher capacity is available and task should be > + * migrated on it instead on current CPU. > + */ > group_asym_packing, > + /* > + * The tasks affinity prevents the scheduler to balance the load across > + * the system. > + */ > group_imbalanced, > + /* > + * The CPU is overloaded and can't provide expected CPU cycles to all > + * tasks. > + */ > group_overloaded > }; Looks good. > > @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > /* > * Try to use spare capacity of local group without overloading it or > - * emptying busiest > + * emptying busiest. > + * XXX Spreading tasks across numa nodes is not always the best policy > + * and special cares should be taken for SD_NUMA domain level before > + * spreading the tasks. For now, load_balance() fully relies on > + * NUMA_BALANCING and fbq_classify_group/rq to overide the decision. > */ > if (local->group_type == group_has_spare) { > if (busiest->group_type > group_fully_busy) { Perfect. Any patch in that are can then update the comment and it should be semi-obvious to the next reviewer that it's expected. Thanks Vincent. -- Mel Gorman SUSE Labs
Le Friday 08 Nov 2019 à 18:37:30 (+0000), Mel Gorman a écrit : > On Fri, Nov 08, 2019 at 05:35:01PM +0100, Vincent Guittot wrote: > > > Fair enough, netperf hits the corner case where it does not work but > > > that is also true without your series. > > > > I run mmtest/netperf test on my setup. It's a mix of small positive or > > negative differences (see below) > > > > <SNIP> > > > > netperf-tcp > > 5.3-rc2 5.3-rc2 > > tip +rwk+fix > > Hmean 64 871.30 ( 0.00%) 860.90 * -1.19%* > > Hmean 128 1689.39 ( 0.00%) 1679.31 * -0.60%* > > Hmean 256 3199.59 ( 0.00%) 3241.98 * 1.32%* > > Hmean 1024 9390.47 ( 0.00%) 9268.47 * -1.30%* > > Hmean 2048 13373.95 ( 0.00%) 13395.61 * 0.16%* > > Hmean 3312 16701.30 ( 0.00%) 17165.96 * 2.78%* > > Hmean 4096 15831.03 ( 0.00%) 15544.66 * -1.81%* > > Hmean 8192 19720.01 ( 0.00%) 20188.60 * 2.38%* > > Hmean 16384 23925.90 ( 0.00%) 23914.50 * -0.05%* > > Stddev 64 7.38 ( 0.00%) 4.23 ( 42.67%) > > Stddev 128 11.62 ( 0.00%) 10.13 ( 12.85%) > > Stddev 256 34.33 ( 0.00%) 7.94 ( 76.88%) > > Stddev 1024 35.61 ( 0.00%) 116.34 (-226.66%) > > Stddev 2048 285.30 ( 0.00%) 80.50 ( 71.78%) > > Stddev 3312 304.74 ( 0.00%) 449.08 ( -47.36%) > > Stddev 4096 668.11 ( 0.00%) 569.30 ( 14.79%) > > Stddev 8192 733.23 ( 0.00%) 944.38 ( -28.80%) > > Stddev 16384 553.03 ( 0.00%) 299.44 ( 45.86%) > > > > 5.3-rc2 5.3-rc2 > > tip +rwk+fix > > Duration User 138.05 140.95 > > Duration System 1210.60 1208.45 > > Duration Elapsed 1352.86 1352.90 > > > > This roughly matches what I've seen. The interesting part to me for > netperf is the next section of the report that reports the locality of > numa hints. With netperf on a 2-socket machine, it's generally around > 50% as the client/server are pulled apart. Because netperf is not > heavily memory bound, it doesn't have much impact on the overall > performance but it's good at catching the cross-node migrations. Ok. I didn't want to make my reply too long. I have put them below for the netperf-tcp results: 5.3-rc2 5.3-rc2 tip +rwk+fix Ops NUMA alloc hit 60077762.00 60387907.00 Ops NUMA alloc miss 0.00 0.00 Ops NUMA interleave hit 0.00 0.00 Ops NUMA alloc local 60077571.00 60387798.00 Ops NUMA base-page range updates 5948.00 17223.00 Ops NUMA PTE updates 5948.00 17223.00 Ops NUMA PMD updates 0.00 0.00 Ops NUMA hint faults 4639.00 14050.00 Ops NUMA hint local faults % 2073.00 6515.00 Ops NUMA hint local percent 44.69 46.37 Ops NUMA pages migrated 1528.00 4306.00 Ops AutoNUMA cost 23.27 70.45 > > > > > > > > I agree that additional patches are probably needed to improve load > > > > balance at NUMA level and I expect that this rework will make it > > > > simpler to add. > > > > I just wanted to get the output of some real use cases before defining > > > > more numa level specific conditions. Some want to spread on there numa > > > > nodes but other want to keep everything together. The preferred node > > > > and fbq_classify_group was the only sensible metrics to me when he > > > > wrote this patchset but changes can be added if they make sense. > > > > > > > > > > That's fair. While it was possible to address the case before your > > > series, it was a hatchet job. If the changelog simply notes that some > > > special casing may still be required for SD_NUMA but it's outside the > > > scope of the series, then I'd be happy. At least there is a good chance > > > then if there is follow-up work that it won't be interpreted as an > > > attempt to reintroduce hacky heuristics. > > > > > > > Would the additional comment make sense for you about work to be done > > for SD_NUMA ? > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 0ad4b21..7e4cb65 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -6960,11 +6960,34 @@ enum fbq_type { regular, remote, all }; > > * group. see update_sd_pick_busiest(). > > */ > > enum group_type { > > + /* > > + * The group has spare capacity that can be used to process more work. > > + */ > > group_has_spare = 0, > > + /* > > + * The group is fully used and the tasks don't compete for more CPU > > + * cycles. Nevetheless, some tasks might wait before running. > > + */ > > group_fully_busy, > > + /* > > + * One task doesn't fit with CPU's capacity and must be migrated on a > > + * more powerful CPU. > > + */ > > group_misfit_task, > > + /* > > + * One local CPU with higher capacity is available and task should be > > + * migrated on it instead on current CPU. > > + */ > > group_asym_packing, > > + /* > > + * The tasks affinity prevents the scheduler to balance the load across > > + * the system. > > + */ > > group_imbalanced, > > + /* > > + * The CPU is overloaded and can't provide expected CPU cycles to all > > + * tasks. > > + */ > > group_overloaded > > }; > > Looks good. > > > > > @@ -8563,7 +8586,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s > > > > /* > > * Try to use spare capacity of local group without overloading it or > > - * emptying busiest > > + * emptying busiest. > > + * XXX Spreading tasks across numa nodes is not always the best policy > > + * and special cares should be taken for SD_NUMA domain level before > > + * spreading the tasks. For now, load_balance() fully relies on > > + * NUMA_BALANCING and fbq_classify_group/rq to overide the decision. > > */ > > if (local->group_type == group_has_spare) { > > if (busiest->group_type > group_fully_busy) { > > Perfect. Any patch in that are can then update the comment and it > should be semi-obvious to the next reviewer that it's expected. > > Thanks Vincent. > > -- > Mel Gorman > SUSE Labs
On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote: > > This roughly matches what I've seen. The interesting part to me for > > netperf is the next section of the report that reports the locality of > > numa hints. With netperf on a 2-socket machine, it's generally around > > 50% as the client/server are pulled apart. Because netperf is not > > heavily memory bound, it doesn't have much impact on the overall > > performance but it's good at catching the cross-node migrations. > > Ok. I didn't want to make my reply too long. I have put them below for > the netperf-tcp results: > 5.3-rc2 5.3-rc2 > tip +rwk+fix > Ops NUMA alloc hit 60077762.00 60387907.00 > Ops NUMA alloc miss 0.00 0.00 > Ops NUMA interleave hit 0.00 0.00 > Ops NUMA alloc local 60077571.00 60387798.00 > Ops NUMA base-page range updates 5948.00 17223.00 > Ops NUMA PTE updates 5948.00 17223.00 > Ops NUMA PMD updates 0.00 0.00 > Ops NUMA hint faults 4639.00 14050.00 > Ops NUMA hint local faults % 2073.00 6515.00 > Ops NUMA hint local percent 44.69 46.37 > Ops NUMA pages migrated 1528.00 4306.00 > Ops AutoNUMA cost 23.27 70.45 > Thanks -- it was "NUMA hint local percent" I was interested in and the 46.37% local hinting faults is likely indicative of the client/server being load balanced across SD_NUMA domains without NUMA Balancing being aggressive enough to fix it. At least I know I am not just seriously unlucky or testing magical machines! -- Mel Gorman SUSE Labs
On Tue, 12 Nov 2019 at 16:06, Mel Gorman <mgorman@techsingularity.net> wrote: > > On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote: > > > This roughly matches what I've seen. The interesting part to me for > > > netperf is the next section of the report that reports the locality of > > > numa hints. With netperf on a 2-socket machine, it's generally around > > > 50% as the client/server are pulled apart. Because netperf is not > > > heavily memory bound, it doesn't have much impact on the overall > > > performance but it's good at catching the cross-node migrations. > > > > Ok. I didn't want to make my reply too long. I have put them below for > > the netperf-tcp results: > > 5.3-rc2 5.3-rc2 > > tip +rwk+fix > > Ops NUMA alloc hit 60077762.00 60387907.00 > > Ops NUMA alloc miss 0.00 0.00 > > Ops NUMA interleave hit 0.00 0.00 > > Ops NUMA alloc local 60077571.00 60387798.00 > > Ops NUMA base-page range updates 5948.00 17223.00 > > Ops NUMA PTE updates 5948.00 17223.00 > > Ops NUMA PMD updates 0.00 0.00 > > Ops NUMA hint faults 4639.00 14050.00 > > Ops NUMA hint local faults % 2073.00 6515.00 > > Ops NUMA hint local percent 44.69 46.37 > > Ops NUMA pages migrated 1528.00 4306.00 > > Ops AutoNUMA cost 23.27 70.45 > > > > Thanks -- it was "NUMA hint local percent" I was interested in and the > 46.37% local hinting faults is likely indicative of the client/server > being load balanced across SD_NUMA domains without NUMA Balancing being > aggressive enough to fix it. At least I know I am not just seriously > unlucky or testing magical machines! I agree that the collaboration between load balanced across SD_NUMA level and NUMA balancing should be improved It's also interesting to notice that the patchset doesn't seem to do worse than the baseline: 46.37% vs 44.69% Vincent > > -- > Mel Gorman > SUSE Labs
On Tue, Nov 12, 2019 at 04:40:20PM +0100, Vincent Guittot wrote: > On Tue, 12 Nov 2019 at 16:06, Mel Gorman <mgorman@techsingularity.net> wrote: > > > > On Tue, Nov 12, 2019 at 11:58:30AM +0100, Vincent Guittot wrote: > > > > This roughly matches what I've seen. The interesting part to me for > > > > netperf is the next section of the report that reports the locality of > > > > numa hints. With netperf on a 2-socket machine, it's generally around > > > > 50% as the client/server are pulled apart. Because netperf is not > > > > heavily memory bound, it doesn't have much impact on the overall > > > > performance but it's good at catching the cross-node migrations. > > > > > > Ok. I didn't want to make my reply too long. I have put them below for > > > the netperf-tcp results: > > > 5.3-rc2 5.3-rc2 > > > tip +rwk+fix > > > Ops NUMA alloc hit 60077762.00 60387907.00 > > > Ops NUMA alloc miss 0.00 0.00 > > > Ops NUMA interleave hit 0.00 0.00 > > > Ops NUMA alloc local 60077571.00 60387798.00 > > > Ops NUMA base-page range updates 5948.00 17223.00 > > > Ops NUMA PTE updates 5948.00 17223.00 > > > Ops NUMA PMD updates 0.00 0.00 > > > Ops NUMA hint faults 4639.00 14050.00 > > > Ops NUMA hint local faults % 2073.00 6515.00 > > > Ops NUMA hint local percent 44.69 46.37 > > > Ops NUMA pages migrated 1528.00 4306.00 > > > Ops AutoNUMA cost 23.27 70.45 > > > > > > > Thanks -- it was "NUMA hint local percent" I was interested in and the > > 46.37% local hinting faults is likely indicative of the client/server > > being load balanced across SD_NUMA domains without NUMA Balancing being > > aggressive enough to fix it. At least I know I am not just seriously > > unlucky or testing magical machines! > > I agree that the collaboration between load balanced across SD_NUMA > level and NUMA balancing should be improved > > It's also interesting to notice that the patchset doesn't seem to do > worse than the baseline: 46.37% vs 44.69% > Yes, I should have highlighted that. The series appears to improve a number of areas while being performance neutral with respect to SD_NUMA. If this turns out to be wrong in some case, it should be semi-obvious even if the locality looks ok. It'll be a headline regression with increased NUMA pte scanning and increased frequency of migrations indicating that NUMA balancing is taken excessive corrective action. I'll know it when I see it :P -- Mel Gorman SUSE Labs
* Mel Gorman <mgorman@techsingularity.net> wrote: > s/groupe_type/group_type/ > > > enum group_type { > > - group_other = 0, > > + group_has_spare = 0, > > + group_fully_busy, > > group_misfit_task, > > + group_asym_packing, > > group_imbalanced, > > - group_overloaded, > > + group_overloaded > > +}; > > + > > While not your fault, it would be nice to comment on the meaning of each > group type. From a glance, it's not obvious to me why a misfit task should > be a high priority to move a task than a fully_busy (but not overloaded) > group given that moving the misfit task might make a group overloaded. This part of your feedback should now be addressed in the scheduler tree via: a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level > > +enum migration_type { > > + migrate_load = 0, > > + migrate_util, > > + migrate_task, > > + migrate_misfit > > }; > > > > Could do with a comment explaining what migration_type is for because > the name is unhelpful. I *think* at a glance it's related to what sort > of imbalance is being addressed which is partially addressed by the > group_type. That understanding may change as I continue reading the series > but now I have to figure it out which means it'll be forgotten again in > 6 months. Agreed. Vincent, is any patch brewing here, or should I take a stab? Thanks, Ingo
On Mon, 18 Nov 2019 at 14:50, Ingo Molnar <mingo@kernel.org> wrote: > > > * Mel Gorman <mgorman@techsingularity.net> wrote: > > > s/groupe_type/group_type/ > > > > > enum group_type { > > > - group_other = 0, > > > + group_has_spare = 0, > > > + group_fully_busy, > > > group_misfit_task, > > > + group_asym_packing, > > > group_imbalanced, > > > - group_overloaded, > > > + group_overloaded > > > +}; > > > + > > > > While not your fault, it would be nice to comment on the meaning of each > > group type. From a glance, it's not obvious to me why a misfit task should > > be a high priority to move a task than a fully_busy (but not overloaded) > > group given that moving the misfit task might make a group overloaded. > > This part of your feedback should now be addressed in the scheduler tree > via: > > a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level > > > > +enum migration_type { > > > + migrate_load = 0, > > > + migrate_util, > > > + migrate_task, > > > + migrate_misfit > > > }; > > > > > > > Could do with a comment explaining what migration_type is for because > > the name is unhelpful. I *think* at a glance it's related to what sort > > of imbalance is being addressed which is partially addressed by the > > group_type. That understanding may change as I continue reading the series > > but now I have to figure it out which means it'll be forgotten again in > > 6 months. > > Agreed. Vincent, is any patch brewing here, or should I take a stab? > No I haven't patch under preparation for this So you can go ahead Thanks, Vincent > Thanks, > > Ingo
On Mon, Nov 18, 2019 at 02:50:17PM +0100, Ingo Molnar wrote: > > * Mel Gorman <mgorman@techsingularity.net> wrote: > > > s/groupe_type/group_type/ > > > > > enum group_type { > > > - group_other = 0, > > > + group_has_spare = 0, > > > + group_fully_busy, > > > group_misfit_task, > > > + group_asym_packing, > > > group_imbalanced, > > > - group_overloaded, > > > + group_overloaded > > > +}; > > > + > > > > While not your fault, it would be nice to comment on the meaning of each > > group type. From a glance, it's not obvious to me why a misfit task should > > be a high priority to move a task than a fully_busy (but not overloaded) > > group given that moving the misfit task might make a group overloaded. > > This part of your feedback should now be addressed in the scheduler tree > via: > > a9723389cc75: sched/fair: Add comments for group_type and balancing at SD_NUMA level > While I can't see that commit ID yet, the discussed version of the patch was fine by me. -- Mel Gorman SUSE Labs
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e004841..5ae5281 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; +/* + * group_type describes the group of CPUs at the moment of the load balance. + * The enum is ordered by pulling priority, with the group with lowest priority + * first so the groupe_type can be simply compared when selecting the busiest + * group. see update_sd_pick_busiest(). + */ enum group_type { - group_other = 0, + group_has_spare = 0, + group_fully_busy, group_misfit_task, + group_asym_packing, group_imbalanced, - group_overloaded, + group_overloaded +}; + +enum migration_type { + migrate_load = 0, + migrate_util, + migrate_task, + migrate_misfit }; #define LBF_ALL_PINNED 0x01 @@ -7105,7 +7120,7 @@ struct lb_env { unsigned int loop_max; enum fbq_type fbq_type; - enum group_type src_grp_type; + enum migration_type migration_type; struct list_head tasks; }; @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env) static const unsigned int sched_nr_migrate_break = 32; /* - * detach_tasks() -- tries to detach up to imbalance runnable load from + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from * busiest_rq, as part of a balancing operation within domain "sd". * * Returns number of detached tasks if successful and 0 otherwise. @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32; static int detach_tasks(struct lb_env *env) { struct list_head *tasks = &env->src_rq->cfs_tasks; + unsigned long util, load; struct task_struct *p; - unsigned long load; int detached = 0; lockdep_assert_held(&env->src_rq->lock); @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env) if (!can_migrate_task(p, env)) goto next; - load = task_h_load(p); + switch (env->migration_type) { + case migrate_load: + load = task_h_load(p); - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed) - goto next; + if (sched_feat(LB_MIN) && + load < 16 && !env->sd->nr_balance_failed) + goto next; - if ((load / 2) > env->imbalance) - goto next; + if ((load / 2) > env->imbalance) + goto next; + + env->imbalance -= load; + break; + + case migrate_util: + util = task_util_est(p); + + if (util > env->imbalance) + goto next; + + env->imbalance -= util; + break; + + case migrate_task: + env->imbalance--; + break; + + case migrate_misfit: + load = task_h_load(p); + + /* + * load of misfit task might decrease a bit since it has + * been recorded. Be conservative in the condition. + */ + if (load / 2 < env->imbalance) + goto next; + + env->imbalance = 0; + break; + } detach_task(p, env); list_add(&p->se.group_node, &env->tasks); detached++; - env->imbalance -= load; #ifdef CONFIG_PREEMPTION /* @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env) /* * We only want to steal up to the prescribed amount of - * runnable load. + * load/util/tasks. */ if (env->imbalance <= 0) break; @@ -7661,7 +7708,6 @@ struct sg_lb_stats { unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_no_capacity; unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING @@ -7677,10 +7723,10 @@ struct sg_lb_stats { struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ - unsigned long total_running; unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */ + unsigned int prefer_sibling; /* tasks should go to sibling first */ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ struct sg_lb_stats local_stat; /* Statistics of the local group */ @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) /* * Skimp on the clearing to avoid duplicate work. We can avoid clearing * local_stat because update_sg_lb_stats() does a full clear/assignment. - * We must however clear busiest_stat::avg_load because - * update_sd_pick_busiest() reads this before assignment. + * We must however set busiest_stat::group_type and + * busiest_stat::idle_cpus to the worst busiest group because + * update_sd_pick_busiest() reads these before assignment. */ *sds = (struct sd_lb_stats){ .busiest = NULL, .local = NULL, - .total_running = 0UL, .total_load = 0UL, .total_capacity = 0UL, .busiest_stat = { - .avg_load = 0UL, - .sum_h_nr_running = 0, - .group_type = group_other, + .idle_cpus = UINT_MAX, + .group_type = group_has_spare, }, }; } @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) } static inline enum -group_type group_classify(struct sched_group *group, +group_type group_classify(struct lb_env *env, + struct sched_group *group, struct sg_lb_stats *sgs) { - if (sgs->group_no_capacity) + if (group_is_overloaded(env, sgs)) return group_overloaded; if (sg_imbalanced(group)) return group_imbalanced; + if (sgs->group_asym_packing) + return group_asym_packing; + if (sgs->group_misfit_task_load) return group_misfit_task; - return group_other; + if (!group_has_capacity(env, sgs)) + return group_fully_busy; + + return group_has_spare; } static bool update_nohz_stats(struct rq *rq, bool force) @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct sg_lb_stats *sgs, int *sg_status) { - int i, nr_running; + int i, nr_running, local_group; memset(sgs, 0, sizeof(*sgs)); + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); + for_each_cpu_and(i, sched_group_span(group), env->cpus) { struct rq *rq = cpu_rq(i); @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, /* * No need to call idle_cpu() if nr_running is not 0 */ - if (!nr_running && idle_cpu(i)) + if (!nr_running && idle_cpu(i)) { sgs->idle_cpus++; + /* Idle cpu can't have misfit task */ + continue; + } + + if (local_group) + continue; + /* Check for a misfit task on the cpu */ if (env->sd->flags & SD_ASYM_CPUCAPACITY && sgs->group_misfit_task_load < rq->misfit_task_load) { sgs->group_misfit_task_load = rq->misfit_task_load; @@ -8032,14 +8093,24 @@ static inline void update_sg_lb_stats(struct lb_env *env, } } - /* Adjust by relative CPU capacity of the group */ + /* Check if dst cpu is idle and preferred to this group */ + if (env->sd->flags & SD_ASYM_PACKING && + env->idle != CPU_NOT_IDLE && + sgs->sum_h_nr_running && + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { + sgs->group_asym_packing = 1; + } + sgs->group_capacity = group->sgc->capacity; - sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; sgs->group_weight = group->group_weight; - sgs->group_no_capacity = group_is_overloaded(env, sgs); - sgs->group_type = group_classify(group, sgs); + sgs->group_type = group_classify(env, group, sgs); + + /* Computing avg_load makes sense only when group is overloaded */ + if (sgs->group_type == group_overloaded) + sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / + sgs->group_capacity; } /** @@ -8062,6 +8133,10 @@ static bool update_sd_pick_busiest(struct lb_env *env, { struct sg_lb_stats *busiest = &sds->busiest_stat; + /* Make sure that there is at least one task to pull */ + if (!sgs->sum_h_nr_running) + return false; + /* * Don't try to pull misfit tasks we can't help. * We can use max_capacity here as reduction in capacity on some @@ -8070,7 +8145,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, */ if (sgs->group_type == group_misfit_task && (!group_smaller_max_cpu_capacity(sg, sds->local) || - !group_has_capacity(env, &sds->local_stat))) + sds->local_stat.group_type != group_has_spare)) return false; if (sgs->group_type > busiest->group_type) @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->group_type < busiest->group_type) return false; - if (sgs->avg_load <= busiest->avg_load) - return false; - - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) - goto asym_packing; - /* - * Candidate sg has no more than one task per CPU and - * has higher per-CPU capacity. Migrating tasks to less - * capable CPUs may harm throughput. Maximize throughput, - * power/energy consequences are not considered. + * The candidate and the current busiest group are the same type of + * group. Let check which one is the busiest according to the type. */ - if (sgs->sum_h_nr_running <= sgs->group_weight && - group_smaller_min_cpu_capacity(sds->local, sg)) - return false; - /* - * If we have more than one misfit sg go with the biggest misfit. - */ - if (sgs->group_type == group_misfit_task && - sgs->group_misfit_task_load < busiest->group_misfit_task_load) + switch (sgs->group_type) { + case group_overloaded: + /* Select the overloaded group with highest avg_load. */ + if (sgs->avg_load <= busiest->avg_load) + return false; + break; + + case group_imbalanced: + /* + * Select the 1st imbalanced group as we don't have any way to + * choose one more than another. + */ return false; -asym_packing: - /* This is the busiest node in its class. */ - if (!(env->sd->flags & SD_ASYM_PACKING)) - return true; + case group_asym_packing: + /* Prefer to move from lowest priority CPU's work */ + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) + return false; + break; - /* No ASYM_PACKING if target CPU is already busy */ - if (env->idle == CPU_NOT_IDLE) - return true; - /* - * ASYM_PACKING needs to move all the work to the highest - * prority CPUs in the group, therefore mark all groups - * of lower priority than ourself as busy. - * - * This is primarily intended to used at the sibling level. Some - * cores like POWER7 prefer to use lower numbered SMT threads. In the - * case of POWER7, it can move to lower SMT modes only when higher - * threads are idle. When in lower SMT modes, the threads will - * perform better since they share less core resources. Hence when we - * have idle threads, we want them to be the higher ones. - */ - if (sgs->sum_h_nr_running && - sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { - sgs->group_asym_packing = 1; - if (!sds->busiest) - return true; + case group_misfit_task: + /* + * If we have more than one misfit sg go with the biggest + * misfit. + */ + if (sgs->group_misfit_task_load < busiest->group_misfit_task_load) + return false; + break; - /* Prefer to move from lowest priority CPU's work */ - if (sched_asym_prefer(sds->busiest->asym_prefer_cpu, - sg->asym_prefer_cpu)) - return true; + case group_fully_busy: + /* + * Select the fully busy group with highest avg_load. In + * theory, there is no need to pull task from such kind of + * group because tasks have all compute capacity that they need + * but we can still improve the overall throughput by reducing + * contention when accessing shared HW resources. + * + * XXX for now avg_load is not computed and always 0 so we + * select the 1st one. + */ + if (sgs->avg_load <= busiest->avg_load) + return false; + break; + + case group_has_spare: + /* + * Select not overloaded group with lowest number of + * idle cpus. We could also compare the spare capacity + * which is more stable but it can end up that the + * group has less spare capacity but finally more idle + * cpus which means less opportunity to pull tasks. + */ + if (sgs->idle_cpus >= busiest->idle_cpus) + return false; + break; } - return false; + /* + * Candidate sg has no more than one task per CPU and has higher + * per-CPU capacity. Migrating tasks to less capable CPUs may harm + * throughput. Maximize throughput, power/energy consequences are not + * considered. + */ + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) && + (sgs->group_type <= group_fully_busy) && + (group_smaller_min_cpu_capacity(sds->local, sg))) + return false; + + return true; } #ifdef CONFIG_NUMA_BALANCING @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) * @env: The load balancing environment. * @sds: variable to hold the statistics for this sched_domain. */ + static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) { struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; - bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; int sg_status = 0; #ifdef CONFIG_NO_HZ_COMMON @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (local_group) goto next_group; - /* - * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity so that we'll try - * and move all the excess tasks away. We lower the capacity - * of a group only if the local group has the capacity to fit - * these excess tasks. The extra check prevents the case where - * you always pull from the heaviest group when it is already - * under-utilized (possible with a large weight task outweighs - * the tasks on the system). - */ - if (prefer_sibling && sds->local && - group_has_capacity(env, local) && - (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { - sgs->group_no_capacity = 1; - sgs->group_type = group_classify(sg, sgs); - } if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd next_group: /* Now, start updating sd_lb_stats */ - sds->total_running += sgs->sum_h_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; sg = sg->next; } while (sg != env->sd->groups); + /* Tag domain that child domain prefers tasks go to siblings first */ + sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; + #ifdef CONFIG_NO_HZ_COMMON if ((env->flags & LBF_NOHZ_AGAIN) && cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) { @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd */ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { - unsigned long max_pull, load_above_capacity = ~0UL; struct sg_lb_stats *local, *busiest; local = &sds->local_stat; busiest = &sds->busiest_stat; - if (busiest->group_asym_packing) { - env->imbalance = busiest->group_load; + if (busiest->group_type == group_misfit_task) { + /* Set imbalance to allow misfit task to be balanced. */ + env->migration_type = migrate_misfit; + env->imbalance = busiest->group_misfit_task_load; + return; + } + + if (busiest->group_type == group_asym_packing) { + /* + * In case of asym capacity, we will try to migrate all load to + * the preferred CPU. + */ + env->migration_type = migrate_task; + env->imbalance = busiest->sum_h_nr_running; + return; + } + + if (busiest->group_type == group_imbalanced) { + /* + * In the group_imb case we cannot rely on group-wide averages + * to ensure CPU-load equilibrium, try to move any task to fix + * the imbalance. The next load balance will take care of + * balancing back the system. + */ + env->migration_type = migrate_task; + env->imbalance = 1; return; } /* - * Avg load of busiest sg can be less and avg load of local sg can - * be greater than avg load across all sgs of sd because avg load - * factors in sg capacity and sgs with smaller group_type are - * skipped when updating the busiest sg: + * Try to use spare capacity of local group without overloading it or + * emptying busiest */ - if (busiest->group_type != group_misfit_task && - (busiest->avg_load <= sds->avg_load || - local->avg_load >= sds->avg_load)) { - env->imbalance = 0; + if (local->group_type == group_has_spare) { + if (busiest->group_type > group_fully_busy) { + /* + * If busiest is overloaded, try to fill spare + * capacity. This might end up creating spare capacity + * in busiest or busiest still being overloaded but + * there is no simple way to directly compute the + * amount of load to migrate in order to balance the + * system. + */ + env->migration_type = migrate_util; + env->imbalance = max(local->group_capacity, local->group_util) - + local->group_util; + + /* + * In some case, the group's utilization is max or even + * higher than capacity because of migrations but the + * local CPU is (newly) idle. There is at least one + * waiting task in this overloaded busiest group. Let + * try to pull it. + */ + if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) { + env->migration_type = migrate_task; + env->imbalance = 1; + } + + return; + } + + if (busiest->group_weight == 1 || sds->prefer_sibling) { + unsigned int nr_diff = busiest->sum_h_nr_running; + /* + * When prefer sibling, evenly spread running tasks on + * groups. + */ + env->migration_type = migrate_task; + lsub_positive(&nr_diff, local->sum_h_nr_running); + env->imbalance = nr_diff >> 1; + return; + } + + /* + * If there is no overload, we just want to even the number of + * idle cpus. + */ + env->migration_type = migrate_task; + env->imbalance = max_t(long, 0, (local->idle_cpus - + busiest->idle_cpus) >> 1); return; } /* - * If there aren't any idle CPUs, avoid creating some. + * Local is fully busy but has to take more load to relieve the + * busiest group */ - if (busiest->group_type == group_overloaded && - local->group_type == group_overloaded) { - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE; - if (load_above_capacity > busiest->group_capacity) { - load_above_capacity -= busiest->group_capacity; - load_above_capacity *= scale_load_down(NICE_0_LOAD); - load_above_capacity /= busiest->group_capacity; - } else - load_above_capacity = ~0UL; + if (local->group_type < group_overloaded) { + /* + * Local will become overloaded so the avg_load metrics are + * finally needed. + */ + + local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) / + local->group_capacity; + + sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) / + sds->total_capacity; } /* - * We're trying to get all the CPUs to the average_load, so we don't - * want to push ourselves above the average load, nor do we wish to - * reduce the max loaded CPU below the average load. At the same time, - * we also don't want to reduce the group load below the group - * capacity. Thus we look for the minimum possible imbalance. + * Both group are or will become overloaded and we're trying to get all + * the CPUs to the average_load, so we don't want to push ourselves + * above the average load, nor do we wish to reduce the max loaded CPU + * below the average load. At the same time, we also don't want to + * reduce the group load below the group capacity. Thus we look for + * the minimum possible imbalance. */ - max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity); - - /* How much load to actually move to equalise the imbalance */ + env->migration_type = migrate_load; env->imbalance = min( - max_pull * busiest->group_capacity, + (busiest->avg_load - sds->avg_load) * busiest->group_capacity, (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; - - /* Boost imbalance to allow misfit task to be balanced. */ - if (busiest->group_type == group_misfit_task) { - env->imbalance = max_t(long, env->imbalance, - busiest->group_misfit_task_load); - } - } /******* find_busiest_group() helpers end here *********************/ +/* + * Decision matrix according to the local and busiest group type + * + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded + * has_spare nr_idle balanced N/A N/A balanced balanced + * fully_busy nr_idle nr_idle N/A N/A balanced balanced + * misfit_task force N/A N/A N/A force force + * asym_packing force force N/A N/A force force + * imbalanced force force N/A N/A force force + * overloaded force force N/A N/A force avg_load + * + * N/A : Not Applicable because already filtered while updating + * statistics. + * balanced : The system is balanced for these 2 groups. + * force : Calculate the imbalance as load migration is probably needed. + * avg_load : Only if imbalance is significant enough. + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite + * different in groups. + */ + /** * find_busiest_group - Returns the busiest group within the sched_domain * if there is an imbalance. @@ -8370,17 +8529,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat; - /* ASYM feature bypasses nice load balance check */ - if (busiest->group_asym_packing) - goto force_balance; - /* There is no busy sibling group to pull tasks from */ - if (!sds.busiest || busiest->sum_h_nr_running == 0) + if (!sds.busiest) goto out_balanced; - /* XXX broken for overlapping NUMA groups */ - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) - / sds.total_capacity; + /* Misfit tasks should be dealt with regardless of the avg load */ + if (busiest->group_type == group_misfit_task) + goto force_balance; + + /* ASYM feature bypasses nice load balance check */ + if (busiest->group_type == group_asym_packing) + goto force_balance; /* * If the busiest group is imbalanced the below checks don't @@ -8391,55 +8550,64 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance; /* - * When dst_cpu is idle, prevent SMP nice and/or asymmetric group - * capacities from resulting in underutilization due to avg_load. - */ - if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) && - busiest->group_no_capacity) - goto force_balance; - - /* Misfit tasks should be dealt with regardless of the avg load */ - if (busiest->group_type == group_misfit_task) - goto force_balance; - - /* * If the local group is busier than the selected busiest group * don't try and pull any tasks. */ - if (local->avg_load >= busiest->avg_load) + if (local->group_type > busiest->group_type) goto out_balanced; /* - * Don't pull any tasks if this group is already above the domain - * average load. + * When groups are overloaded, use the avg_load to ensure fairness + * between tasks. */ - if (local->avg_load >= sds.avg_load) - goto out_balanced; + if (local->group_type == group_overloaded) { + /* + * If the local group is more loaded than the selected + * busiest group don't try and pull any tasks. + */ + if (local->avg_load >= busiest->avg_load) + goto out_balanced; + + /* XXX broken for overlapping NUMA groups */ + sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) / + sds.total_capacity; - if (env->idle == CPU_IDLE) { /* - * This CPU is idle. If the busiest group is not overloaded - * and there is no imbalance between this and busiest group - * wrt idle CPUs, it is balanced. The imbalance becomes - * significant if the diff is greater than 1 otherwise we - * might end up to just move the imbalance on another group + * Don't pull any tasks if this group is already above the + * domain average load. */ - if ((busiest->group_type != group_overloaded) && - (local->idle_cpus <= (busiest->idle_cpus + 1))) + if (local->avg_load >= sds.avg_load) goto out_balanced; - } else { + /* - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use - * imbalance_pct to be conservative. + * If the busiest group is more loaded, use imbalance_pct to be + * conservative. */ if (100 * busiest->avg_load <= env->sd->imbalance_pct * local->avg_load) goto out_balanced; } + /* Try to move all excess tasks to child's sibling domain */ + if (sds.prefer_sibling && local->group_type == group_has_spare && + busiest->sum_h_nr_running > local->sum_h_nr_running + 1) + goto force_balance; + + if (busiest->group_type != group_overloaded && + (env->idle == CPU_NOT_IDLE || + local->idle_cpus <= (busiest->idle_cpus + 1))) + /* + * If the busiest group is not overloaded + * and there is no imbalance between this and busiest group + * wrt idle CPUs, it is balanced. The imbalance + * becomes significant if the diff is greater than 1 otherwise + * we might end up to just move the imbalance on another + * group. + */ + goto out_balanced; + force_balance: /* Looks like there is an imbalance. Compute it */ - env->src_grp_type = busiest->group_type; calculate_imbalance(env, &sds); return env->imbalance ? sds.busiest : NULL; @@ -8455,11 +8623,13 @@ static struct rq *find_busiest_queue(struct lb_env *env, struct sched_group *group) { struct rq *busiest = NULL, *rq; - unsigned long busiest_load = 0, busiest_capacity = 1; + unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1; + unsigned int busiest_nr = 0; int i; for_each_cpu_and(i, sched_group_span(group), env->cpus) { - unsigned long capacity, load; + unsigned long capacity, load, util; + unsigned int nr_running; enum fbq_type rt; rq = cpu_rq(i); @@ -8487,20 +8657,8 @@ static struct rq *find_busiest_queue(struct lb_env *env, if (rt > env->fbq_type) continue; - /* - * For ASYM_CPUCAPACITY domains with misfit tasks we simply - * seek the "biggest" misfit task. - */ - if (env->src_grp_type == group_misfit_task) { - if (rq->misfit_task_load > busiest_load) { - busiest_load = rq->misfit_task_load; - busiest = rq; - } - - continue; - } - capacity = capacity_of(i); + nr_running = rq->cfs.h_nr_running; /* * For ASYM_CPUCAPACITY domains, don't pick a CPU that could @@ -8510,35 +8668,70 @@ static struct rq *find_busiest_queue(struct lb_env *env, */ if (env->sd->flags & SD_ASYM_CPUCAPACITY && capacity_of(env->dst_cpu) < capacity && - rq->nr_running == 1) + nr_running == 1) continue; - load = cpu_runnable_load(rq); + switch (env->migration_type) { + case migrate_load: + /* + * When comparing with load imbalance, use + * cpu_runnable_load() which is not scaled with the CPU + * capacity. + */ + load = cpu_runnable_load(rq); - /* - * When comparing with imbalance, use cpu_runnable_load() - * which is not scaled with the CPU capacity. - */ + if (nr_running == 1 && load > env->imbalance && + !check_cpu_capacity(rq, env->sd)) + break; - if (rq->nr_running == 1 && load > env->imbalance && - !check_cpu_capacity(rq, env->sd)) - continue; + /* + * For the load comparisons with the other CPU's, + * consider the cpu_runnable_load() scaled with the CPU + * capacity, so that the load can be moved away from + * the CPU that is potentially running at a lower + * capacity. + * + * Thus we're looking for max(load_i / capacity_i), + * crosswise multiplication to rid ourselves of the + * division works out to: + * load_i * capacity_j > load_j * capacity_i; + * where j is our previous maximum. + */ + if (load * busiest_capacity > busiest_load * capacity) { + busiest_load = load; + busiest_capacity = capacity; + busiest = rq; + } + break; + + case migrate_util: + util = cpu_util(cpu_of(rq)); + + if (busiest_util < util) { + busiest_util = util; + busiest = rq; + } + break; + + case migrate_task: + if (busiest_nr < nr_running) { + busiest_nr = nr_running; + busiest = rq; + } + break; + + case migrate_misfit: + /* + * For ASYM_CPUCAPACITY domains with misfit tasks we + * simply seek the "biggest" misfit task. + */ + if (rq->misfit_task_load > busiest_load) { + busiest_load = rq->misfit_task_load; + busiest = rq; + } + + break; - /* - * For the load comparisons with the other CPU's, consider - * the cpu_runnable_load() scaled with the CPU capacity, so - * that the load can be moved away from the CPU that is - * potentially running at a lower capacity. - * - * Thus we're looking for max(load_i / capacity_i), crosswise - * multiplication to rid ourselves of the division works out - * to: load_i * capacity_j > load_j * capacity_i; where j is - * our previous maximum. - */ - if (load * busiest_capacity > busiest_load * capacity) { - busiest_load = load; - busiest_capacity = capacity; - busiest = rq; } } @@ -8584,7 +8777,7 @@ voluntary_active_balance(struct lb_env *env) return 1; } - if (env->src_grp_type == group_misfit_task) + if (env->migration_type == migrate_misfit) return 1; return 0;
The load_balance algorithm contains some heuristics which have become meaningless since the rework of the scheduler's metrics like the introduction of PELT. Furthermore, load is an ill-suited metric for solving certain task placement imbalance scenarios. For instance, in the presence of idle CPUs, we should simply try to get at least one task per CPU, whereas the current load-based algorithm can actually leave idle CPUs alone simply because the load is somewhat balanced. The current algorithm ends up creating virtual and meaningless value like the avg_load_per_task or tweaks the state of a group to make it overloaded whereas it's not, in order to try to migrate tasks. load_balance should better qualify the imbalance of the group and clearly define what has to be moved to fix this imbalance. The type of sched_group has been extended to better reflect the type of imbalance. We now have : group_has_spare group_fully_busy group_misfit_task group_asym_packing group_imbalanced group_overloaded Based on the type of sched_group, load_balance now sets what it wants to move in order to fix the imbalance. It can be some load as before but also some utilization, a number of task or a type of task: migrate_task migrate_util migrate_load migrate_misfit This new load_balance algorithm fixes several pending wrong tasks placement: - the 1 task per CPU case with asymmetric system - the case of cfs task preempted by other class - the case of tasks not evenly spread on groups with spare capacity Also the load balance decisions have been consolidated in the 3 functions below after removing the few bypasses and hacks of the current code: - update_sd_pick_busiest() select the busiest sched_group. - find_busiest_group() checks if there is an imbalance between local and busiest group. - calculate_imbalance() decides what have to be moved. Finally, the now unused field total_running of struct sd_lb_stats has been removed. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 402 insertions(+), 209 deletions(-) -- 2.7.4