
[04/11] sched: unify imbalance bias for target group

Message ID 1393293054-11378-5-git-send-email-alex.shi@linaro.org
State New

Commit Message

Alex Shi Feb. 25, 2014, 1:50 a.m. UTC
The old code already considers the bias in source_load()/target_load(), but
still uses imbalance_pct as a final check when picking the idlest/busiest
group. That is redundant: if we bias the load in source_load()/target_load(),
we had better not apply imbalance_pct again.

With the cpu_load array removed, it is a good time to unify the target bias
handling. So remove imbalance_pct from the final check and apply the bias
directly in target_load().
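
Concretely, in find_idlest_group() the final check changes from

	if (!idlest || 100*this_load < imbalance*min_load)

to

	if (!idlest || this_load < min_load)

with the imbalance_pct bias folded into target_load() instead (see the patch
below).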

In wake_affine(), since wake_idx is 0 on all architectures, the current logic
simply prefers the current cpu, so we keep that behaviour and just replace
target_load()/source_load() with weighted_cpuload() to make the meaning more
exact. Thanks to Morten for the reminder!

Signed-off-by: Alex Shi <alex.shi@linaro.org>
---
 kernel/sched/fair.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

Comments

Alex Shi Feb. 26, 2014, 3:16 p.m. UTC | #1
> So this patch is weird..
> 
> So the original bias in the source/target load is purely based on actual
> load figures. It only pulls-down/pulls-up resp. the long term avg with a
> shorter term average; iow. it allows the source to decrease faster and
> the target to increase faster, giving a natural inertia (ie. a
> resistance to movement).
> 
> Therefore this gives rise to a conservative imbalance.
> 
> Then at the end we use the imbalance_pct thing as a normal hysteresis
> control to avoid the rapid state switching associated with a single
> control point system.

Peter, thanks for the response and the detailed explanations! :)

Yes, a fixed bias cannot replace the current bias.
When we talk about inertia, we usually mean the previous or long-term value
here, but source_load/target_load doesn't prefer a long-term or short-term
load; it just takes the min or max of the two. So I can't see any meaning in
it other than a source/target bias. And the long-term load is a decayed load
built from load history, not a real actual load.
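
For reference, roughly what the two helpers look like at this point in the
series, after the cpu_load[] array removal (a sketch; the exact code may
differ slightly):

	static unsigned long source_load(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(cpu);

		if (!sched_feat(LB_BIAS))
			return total;

		/* the source may only look lighter than it is */
		return min(rq->cpu_load, total);
	}

	static unsigned long target_load(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long total = weighted_cpuload(cpu);

		if (!sched_feat(LB_BIAS))
			return total;

		/* the target may only look heavier than it is */
		return max(rq->cpu_load, total);
	}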

And in the current logic, if a cpu's load is constant over a period, then
source_load/target_load loses its 'resistance' effect on balancing.
Considering the migration cost, rq locking and potential cpu cache misses,
isn't some bias still needed here?

Another problem is that we bias the load twice in the busy_idx scenario: once
in source_load/target_load, and again with imbalance_pct in
find_busiest_group(). I can't figure out the reason for that. :(
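
To spell out the two places (condensed from the current code, so a sketch
rather than the exact lines):

	/* 1) in update_sg_lb_stats(): per-cpu min/max bias */
	if (local_group)
		load = target_load(i);	/* max(rq->cpu_load, weighted_cpuload) */
	else
		load = source_load(i);	/* min(rq->cpu_load, weighted_cpuload) */

	/* 2) in find_busiest_group(): imbalance_pct on top of that */
	if (100 * busiest->avg_load <= env->sd->imbalance_pct * local->avg_load)
		goto out_balanced;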

So rather than picking between the long-term and short-term load somewhat
arbitrarily, maybe it's better to use a fixed bias, as the current NUMA
balancing and newidle/wake balancing already do.

> 
> 
> You completely wreck that, you also don't give a coherent model back.
> 
> 
> The movement of imbalance_pct into target_load() doesn't make sense to
> me either; it's an (expensive) no-op afaict. Seeing how:
> 
>   100 * source_load() < imb_pct * target_load()
> 
> is very much equal to:
> 
>   source_load() < (imb_pct * target_load()) / 100;
> 
> Except you get to do that div all over the place.

It is my fault. Will change it back.
> 
> It also completely muddles the fact that its a normal hysteresis
> control. Not a load bias. A fixed bias can never replace the inertial
> control we had; it doesn't make sense as a replacement.

I know a fixed bias may not be the best approach, but sorry, I cannot figure
out a better one. Would you like to give some suggestions?
> 
> Not to mention you seem to ignore all concerns wrt the use of longer
> term averages for the bigger domains.

For bigger domains, I gave up the aggressive bias idea (less bias on larger
groups) in the V2 version due to Morten's concern. And I have asked for more
comments, but I don't see any more detailed concerns now. :(
> 
> Now I'm all for removing code; and so far the numbers aren't bad; but I
> don't like the complete muddle you make of things at all.
> 

Sorry, do you mean that removing load_idx is fine, just without this fixed
bias? Or do you have another suggestion here?
Alex Shi March 2, 2014, 1:44 a.m. UTC | #2
Would you like to give more comments? :)

Thanks!

On 02/26/2014 11:16 PM, Alex Shi wrote:
>> >Now I'm all for removing code; and so far the numbers aren't bad; but I
>> >don't like the complete muddle you make of things at all.
>> >
> Sorry, do you mean that removing load_idx is fine, just without this fixed
> bias? Or do you have another suggestion here?

Alex Shi March 12, 2014, 10:36 a.m. UTC | #3
On 02/26/2014 11:16 PM, Alex Shi wrote:
>> > So this patch is weird..
>> > 
>> > So the original bias in the source/target load is purely based on actual
>> > load figures. It only pulls-down/pulls-up resp. the long term avg with a
>> > shorter term average; iow. it allows the source to decrease faster and
>> > the target to increase faster, giving a natural inertia (ie. a
>> > resistance to movement).
>> > 
>> > Therefore this gives rise to a conservative imbalance.
>> > 
>> > Then at the end we use the imbalance_pct thing as a normal hysteresis
>> > control to avoid the rapid state switching associated with a single
>> > control point system.
> Peter, thanks for the response and the detailed explanations! :)
> 
> Yes, a fixed bias cannot replace the current bias.
> When we talk about inertia, we usually mean the previous or long-term value
> here, but source_load/target_load doesn't prefer a long-term or short-term
> load; it just takes the min or max of the two. So I can't see any meaning in
> it other than a source/target bias. And the long-term load is a decayed load
> built from load history, not a real actual load.
> 
> And in the current logic, if a cpu's load is constant over a period, then
> source_load/target_load loses its 'resistance' effect on balancing.
> Considering the migration cost, rq locking and potential cpu cache misses,
> isn't some bias still needed here?
> 
> Another problem is that we bias the load twice in the busy_idx scenario: once
> in source_load/target_load, and again with imbalance_pct in
> find_busiest_group(). I can't figure out the reason for that. :(
> 
> So rather than picking between the long-term and short-term load somewhat
> arbitrarily, maybe it's better to use a fixed bias, as the current NUMA
> balancing and newidle/wake balancing already do.
> 
> 


Maybe I didn't explain the cpu_load issue clearly, so forgive my verbose
explanation again.

Of the 5 cpu load indexes, only busy_idx and idle_idx are non-zero, so only
those two use a long-term load value.

The other indexes, wake_idx, forkexec_idx and newidle_idx, are all zero. They
use imbalance_pct as a fixed bias *only*, just as NUMA balancing does.

As to busy_idx:
We consider both the cpu load history and the src/dst bias, but we are wrong
to mix them together. Taking the long/short-term load into account is not
related to bias: the long-term load is already handled by the runnable load
average, and the bias value should be kept separate and based on the task
migration cost between cpus/groups.
Because we mix them together, the ridiculous thing is that when every cpu's
load is continuously stable, the long-term and short-term loads are the same;
then the bias loses its meaning, and any minimal imbalance may cause
unnecessary task movement. To prevent that, we have to apply imbalance_pct
again in find_busiest_group(), but that clearly over-biases in the normal
case, and it is even worse when there is bursty load in the system.
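
A made-up example of that degenerate case (the numbers are only
illustrative):

	/*
	 * Fully stable load, so rq->cpu_load == weighted_cpuload:
	 * local cpu at 1024, remote cpu at 1040.
	 *
	 *   target_load(local)  = max(1024, 1024) = 1024
	 *   source_load(remote) = min(1040, 1040) = 1040
	 *
	 * The min/max bias contributes nothing, and a 16-unit difference
	 * already looks like an imbalance.  Only the extra check in
	 * find_busiest_group() saves us, e.g. with imbalance_pct = 125:
	 *
	 *   100 * 1040 <= 125 * 1024  ->  104000 <= 128000  ->  out_balanced
	 */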

As to idle_idx: it does not use imbalance_pct at all.
Since the short-term load is zero, the effect is easy to see: we pretend to
have the long-term load when the cpu is in the dst group, or zero load when
the cpu is in the src group. But from a maximum-performance point of view it
is better to balance tasks onto idle cpus, so we should move tasks to the dst
group unless the gain is outweighed by the task migration cost; that is what
imbalance_pct is for. Pretending we already have some load in the dst group
rejects exactly the incoming load we are pretending to have. It also prefers
to move tasks to a cpu that has been idle for a long time, which costs both
performance and latency, since waking from a deep C-state is slow.
Anyway, for idle cpu load balancing, since we are working on moving cpuidle
awareness into the scheduler, this problem will be reconsidered there; we do
not need to care much about it now.
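
For the record, that is also why the update_sg_lb_stats() hunk in the patch
below only biases the local group when we are not idle balancing; roughly:

	/* after this patch, in update_sg_lb_stats() */
	if (local_group && env->idle != CPU_IDLE)
		load = target_load(i, bias);	/* bias derived from imbalance_pct */
	else
		load = source_load(i);		/* idle balance: no pretended load */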


Based on the above reasons, I believe mixing the long-term load with the
task-migration bias is a mistake. And I admit imbalance_pct needs more tuning
or even a rework, but it is not a bad start; at least it is now used
consistently everywhere in the balancing code.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df9c8b5..d7093ee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,7 +1016,7 @@  bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 
 static unsigned long weighted_cpuload(const int cpu);
 static unsigned long source_load(int cpu);
-static unsigned long target_load(int cpu);
+static unsigned long target_load(int cpu, int imbalance_pct);
 static unsigned long power_of(int cpu);
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
 
@@ -3977,7 +3977,7 @@  static unsigned long source_load(int cpu)
  * Return a high guess at the load of a migration-target cpu weighted
  * according to the scheduling class and "nice" value.
  */
-static unsigned long target_load(int cpu)
+static unsigned long target_load(int cpu, int imbalance_pct)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long total = weighted_cpuload(cpu);
@@ -3985,6 +3985,11 @@  static unsigned long target_load(int cpu)
 	if (!sched_feat(LB_BIAS))
 		return total;
 
+	/*
+	 * Bias target load with imbalance_pct.
+	 */
+	total = total * imbalance_pct / 100;
+
 	return max(rq->cpu_load, total);
 }
 
@@ -4200,8 +4205,8 @@  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	this_cpu  = smp_processor_id();
 	prev_cpu  = task_cpu(p);
-	load	  = source_load(prev_cpu);
-	this_load = target_load(this_cpu);
+	load	  = weighted_cpuload(prev_cpu);
+	this_load = weighted_cpuload(this_cpu);
 
 	/*
 	 * If sync wakeup then subtract the (maximum possible)
@@ -4257,7 +4262,7 @@  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 
 	if (balanced ||
 	    (this_load <= load &&
-	     this_load + target_load(prev_cpu) <= tl_per_task)) {
+		     this_load + weighted_cpuload(prev_cpu) <= tl_per_task)) {
 		/*
 		 * This domain has SD_WAKE_AFFINE and
 		 * p is cache cold in this domain, and
@@ -4303,7 +4308,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			if (local_group)
 				load = source_load(i);
 			else
-				load = target_load(i);
+				load = target_load(i, imbalance);
 
 			avg_load += load;
 		}
@@ -4319,7 +4324,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || this_load < min_load)
 		return NULL;
 	return idlest;
 }
@@ -5745,6 +5750,7 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 {
 	unsigned long load;
 	int i;
+	int bias = 100 + (env->sd->imbalance_pct - 100) / 2;
 
 	memset(sgs, 0, sizeof(*sgs));
 
@@ -5752,8 +5758,8 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		struct rq *rq = cpu_rq(i);
 
 		/* Bias balancing toward cpus of our domain */
-		if (local_group)
-			load = target_load(i);
+		if (local_group && env->idle != CPU_IDLE)
+			load = target_load(i, bias);
 		else
 			load = source_load(i);
 
@@ -6193,14 +6199,6 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 		if ((local->idle_cpus < busiest->idle_cpus) &&
 		    busiest->sum_nr_running <= busiest->group_weight)
 			goto out_balanced;
-	} else {
-		/*
-		 * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
-		 * imbalance_pct to be conservative.
-		 */
-		if (100 * busiest->avg_load <=
-				env->sd->imbalance_pct * local->avg_load)
-			goto out_balanced;
 	}
 
 force_balance: