diff mbox series

[v4,01/11] sched/fair: clean up asym packing

Message ID 1571405198-27570-2-git-send-email-vincent.guittot@linaro.org
State Accepted
Commit 490ba971d8b498ba3a47999ab94c6a0d1830ad41
Headers show
Series sched/fair: rework the CFS load balance | expand

Commit Message

Vincent Guittot Oct. 18, 2019, 1:26 p.m. UTC
Clean up asym packing to follow the default load balance behavior:
- classify the group by creating a group_asym_packing field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it.

We don't need to test twice same conditions anymore to detect asym packing
and we consolidate the calculation of imbalance in calculate_imbalance().

There is no functional changes.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Rik van Riel <riel@surriel.com>

---
 kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------
 1 file changed, 16 insertions(+), 47 deletions(-)

-- 
2.7.4

Comments

Mel Gorman Oct. 30, 2019, 2:51 p.m. UTC | #1
On Fri, Oct 18, 2019 at 03:26:28PM +0200, Vincent Guittot wrote:
> Clean up asym packing to follow the default load balance behavior:

> - classify the group by creating a group_asym_packing field.

> - calculate the imbalance in calculate_imbalance() instead of bypassing it.

> 

> We don't need to test twice same conditions anymore to detect asym packing

> and we consolidate the calculation of imbalance in calculate_imbalance().

> 

> There is no functional changes.

> 

> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> Acked-by: Rik van Riel <riel@surriel.com>

> ---

>  kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------

>  1 file changed, 16 insertions(+), 47 deletions(-)

> 

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> index 1f0a5e1..617145c 100644

> --- a/kernel/sched/fair.c

> +++ b/kernel/sched/fair.c

> @@ -7675,6 +7675,7 @@ struct sg_lb_stats {

>  	unsigned int group_weight;

>  	enum group_type group_type;

>  	int group_no_capacity;

> +	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */

>  	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */

>  #ifdef CONFIG_NUMA_BALANCING

>  	unsigned int nr_numa_running;

> @@ -8129,9 +8130,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,

>  	 * ASYM_PACKING needs to move all the work to the highest

>  	 * prority CPUs in the group, therefore mark all groups

>  	 * of lower priority than ourself as busy.

> +	 *

> +	 * This is primarily intended to used at the sibling level.  Some

> +	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the

> +	 * case of POWER7, it can move to lower SMT modes only when higher

> +	 * threads are idle.  When in lower SMT modes, the threads will

> +	 * perform better since they share less core resources.  Hence when we

> +	 * have idle threads, we want them to be the higher ones.

>  	 */

>  	if (sgs->sum_nr_running &&

>  	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {

> +		sgs->group_asym_packing = 1;

>  		if (!sds->busiest)

>  			return true;

>  


(I did not read any of the earlier implementations of this series, maybe
this was discussed already in which case, sorry for the noise)

Are you *sure* this is not a functional change?

Asym packing is a twisty maze of headaches and I'm not familiar enough
with it to be 100% certain without spending a lot of time on this. Even
spotting how Power7 ends up using asym packing with lower-numbered SMT
threads is a bit of a challenge.  Specifically, it relies on the scheduler
domain SD_ASYM_PACKING flag for SMT domains to use the weak implementation
of arch_asym_cpu_priority which by defaults favours the lower-numbered CPU.

The check_asym_packing implementation checks that flag but I can't see
the equiavlent type of check here. update_sd_pick_busiest  could be called
for domains that span NUMA or basically any domain that does not specify
SD_ASYM_PACKING and end up favouring a lower-numbered CPU (or whatever
arch_asym_cpu_priority returns in the case of x86 which has a different
idea for favoured CPUs).

sched_asym_prefer appears to be a function that is very easy to use
incorrectly. Should it take env and check the SD flags first?

-- 
Mel Gorman
SUSE Labs
Vincent Guittot Oct. 30, 2019, 4:03 p.m. UTC | #2
On Wed, 30 Oct 2019 at 15:51, Mel Gorman <mgorman@techsingularity.net> wrote:
>

> On Fri, Oct 18, 2019 at 03:26:28PM +0200, Vincent Guittot wrote:

> > Clean up asym packing to follow the default load balance behavior:

> > - classify the group by creating a group_asym_packing field.

> > - calculate the imbalance in calculate_imbalance() instead of bypassing it.

> >

> > We don't need to test twice same conditions anymore to detect asym packing

> > and we consolidate the calculation of imbalance in calculate_imbalance().

> >

> > There is no functional changes.

> >

> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

> > Acked-by: Rik van Riel <riel@surriel.com>

> > ---

> >  kernel/sched/fair.c | 63 ++++++++++++++---------------------------------------

> >  1 file changed, 16 insertions(+), 47 deletions(-)

> >

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c

> > index 1f0a5e1..617145c 100644

> > --- a/kernel/sched/fair.c

> > +++ b/kernel/sched/fair.c

> > @@ -7675,6 +7675,7 @@ struct sg_lb_stats {

> >       unsigned int group_weight;

> >       enum group_type group_type;

> >       int group_no_capacity;

> > +     unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */

> >       unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */

> >  #ifdef CONFIG_NUMA_BALANCING

> >       unsigned int nr_numa_running;

> > @@ -8129,9 +8130,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,

> >        * ASYM_PACKING needs to move all the work to the highest

> >        * prority CPUs in the group, therefore mark all groups

> >        * of lower priority than ourself as busy.

> > +      *

> > +      * This is primarily intended to used at the sibling level.  Some

> > +      * cores like POWER7 prefer to use lower numbered SMT threads.  In the

> > +      * case of POWER7, it can move to lower SMT modes only when higher

> > +      * threads are idle.  When in lower SMT modes, the threads will

> > +      * perform better since they share less core resources.  Hence when we

> > +      * have idle threads, we want them to be the higher ones.

> >        */

> >       if (sgs->sum_nr_running &&

> >           sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {

> > +             sgs->group_asym_packing = 1;

> >               if (!sds->busiest)

> >                       return true;

> >

>

> (I did not read any of the earlier implementations of this series, maybe

> this was discussed already in which case, sorry for the noise)

>

> Are you *sure* this is not a functional change?

>

> Asym packing is a twisty maze of headaches and I'm not familiar enough

> with it to be 100% certain without spending a lot of time on this. Even

> spotting how Power7 ends up using asym packing with lower-numbered SMT

> threads is a bit of a challenge.  Specifically, it relies on the scheduler

> domain SD_ASYM_PACKING flag for SMT domains to use the weak implementation

> of arch_asym_cpu_priority which by defaults favours the lower-numbered CPU.

>

> The check_asym_packing implementation checks that flag but I can't see

> the equiavlent type of check here. update_sd_pick_busiest  could be called


The checks of  SD_ASYM_PACKING and CPU_NOT_IDLE are already above in
the function but out of the patch.
In fact this part of update_sd_pick_busiest is already dedicated to
asym_packing.

What I'm doing is that instead of checking asym_packing in
update_sd_pick_busiest and then rechecking the same thing in
find_busiest_group, I save the check result and reuse it

Also patch 04 moves further this code

> for domains that span NUMA or basically any domain that does not specify

> SD_ASYM_PACKING and end up favouring a lower-numbered CPU (or whatever

> arch_asym_cpu_priority returns in the case of x86 which has a different

> idea for favoured CPUs).

>

> sched_asym_prefer appears to be a function that is very easy to use

> incorrectly. Should it take env and check the SD flags first?

>

> --

> Mel Gorman

> SUSE Labs
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f0a5e1..617145c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7675,6 +7675,7 @@  struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
 	unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
@@ -8129,9 +8130,17 @@  static bool update_sd_pick_busiest(struct lb_env *env,
 	 * ASYM_PACKING needs to move all the work to the highest
 	 * prority CPUs in the group, therefore mark all groups
 	 * of lower priority than ourself as busy.
+	 *
+	 * This is primarily intended to used at the sibling level.  Some
+	 * cores like POWER7 prefer to use lower numbered SMT threads.  In the
+	 * case of POWER7, it can move to lower SMT modes only when higher
+	 * threads are idle.  When in lower SMT modes, the threads will
+	 * perform better since they share less core resources.  Hence when we
+	 * have idle threads, we want them to be the higher ones.
 	 */
 	if (sgs->sum_nr_running &&
 	    sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) {
+		sgs->group_asym_packing = 1;
 		if (!sds->busiest)
 			return true;
 
@@ -8273,51 +8282,6 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 }
 
 /**
- * check_asym_packing - Check to see if the group is packed into the
- *			sched domain.
- *
- * This is primarily intended to used at the sibling level.  Some
- * cores like POWER7 prefer to use lower numbered SMT threads.  In the
- * case of POWER7, it can move to lower SMT modes only when higher
- * threads are idle.  When in lower SMT modes, the threads will
- * perform better since they share less core resources.  Hence when we
- * have idle threads, we want them to be the higher ones.
- *
- * This packing function is run on idle threads.  It checks to see if
- * the busiest CPU in this domain (core in the P7 case) has a higher
- * CPU number than the packing function is being run on.  Here we are
- * assuming lower CPU number will be equivalent to lower a SMT thread
- * number.
- *
- * Return: 1 when packing is required and a task should be moved to
- * this CPU.  The amount of the imbalance is returned in env->imbalance.
- *
- * @env: The load balancing environment.
- * @sds: Statistics of the sched_domain which is to be packed
- */
-static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	int busiest_cpu;
-
-	if (!(env->sd->flags & SD_ASYM_PACKING))
-		return 0;
-
-	if (env->idle == CPU_NOT_IDLE)
-		return 0;
-
-	if (!sds->busiest)
-		return 0;
-
-	busiest_cpu = sds->busiest->asym_prefer_cpu;
-	if (sched_asym_prefer(busiest_cpu, env->dst_cpu))
-		return 0;
-
-	env->imbalance = sds->busiest_stat.group_load;
-
-	return 1;
-}
-
-/**
  * fix_small_imbalance - Calculate the minor imbalance that exists
  *			amongst the groups of a sched_domain, during
  *			load balancing.
@@ -8401,6 +8365,11 @@  static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->group_asym_packing) {
+		env->imbalance = busiest->group_load;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -8505,8 +8474,8 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	busiest = &sds.busiest_stat;
 
 	/* ASYM feature bypasses nice load balance check */
-	if (check_asym_packing(env, &sds))
-		return sds.busiest;
+	if (busiest->group_asym_packing)
+		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || busiest->sum_nr_running == 0)