
sched/fair: Do not decay new task load on first enqueue

Message ID 20161010173440.GA28945@linaro.org
State New

Commit Message

Vincent Guittot Oct. 10, 2016, 5:34 p.m. UTC
On Monday 10 Oct 2016 at 14:29:28 (+0200), Vincent Guittot wrote:
> On 10 October 2016 at 12:01, Matt Fleming <matt@codeblueprint.co.uk> wrote:

> > On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:

> >>

> >> The difference between this patch and Peterz's is your patch have a

> >> delta since activate_task()->enqueue_task() does do update_rq_clock(),

> >> so why don't have the delta will cause low cpu machines (4 or 8) to

> >> regress against your another reply in this thread?

> >

> > Both my patch and Peter's patch cause issues with low cpu machines. In

> > <20161004201105.GP16071@codeblueprint.co.uk> I said,

> >

> >  "This patch causes some low cpu machines (4 or 8) to regress. It turns

> >   out they regress with my patch too."

> >

> > Have I misunderstood your question?

> >

> > I ran out of time to investigate this last week, though I did try all

> > proposed patches, including Vincent's, and none of them produced wins

> > across the board.

> 

> I have tried to reproduce your issue on my target, a hikey board (ARM

> based octo core), but I failed to see a regression with commit

> 7dc603c9028e. Nevertheless, I can see tasks not being well spread

> during fork as you mentioned. So I have studied the spreading issue

> during fork a bit more last week and I have a new version of my

> proposed patch that I'm going to send soon. With this patch, I can see

> a good spread of tasks during the fork sequence and some kind of perf

> improvement, even if it's a bit difficult to tell as the variance is

> quite high with the hackbench test, so it's mainly an improvement in

> the repeatability of the result.

>


Subject: [PATCH] sched: use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking for
the idlest group. But on a fork-intensive use case like hackbench, where
tasks block quickly after the fork, this can lead to selecting the same CPU
whereas other CPUs, which have a similar runnable load but a lower load_avg,
could be chosen instead.

When the runnable_load_avg values of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor.

For a use case like hackbench, this enables the scheduler to select different
CPUs during the fork sequence and to spread tasks across the system.

Tests have been done on a Hikey board (ARM based octo core) for several
kernels. The results below give min, max, avg and stdev values of 18 runs
with each configuration.

The v4.8+patches configuration also includes the change below, which is part of the
proposal made by Peter to ensure that the clock is up to date when the
forked task is attached to the rq.

@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 	post_init_entity_util_avg(&p->se);
 
 	activate_task(rq, p, 0);

hackbench -P -g 1 

       ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
min    0.049         0.050         0.051       0.048
avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
max    0.066         0.068         0.070       0.063
stdev  +/-9%         +/-9%         +/-8%       +/-9%

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

-- 
2.7.4

Comments

Vincent Guittot Oct. 11, 2016, 1:14 p.m. UTC | #1
On 11 October 2016 at 12:24, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Mon, 10 Oct, at 07:34:40PM, Vincent Guittot wrote:

>>

>> Subject: [PATCH] sched: use load_avg for selecting idlest group

>>

>> find_idlest_group() only compares the runnable_load_avg when looking for

>> the idlest group. But on a fork-intensive use case like hackbench, where

>> tasks block quickly after the fork, this can lead to selecting the same CPU

>> whereas other CPUs, which have a similar runnable load but a lower load_avg,

>> could be chosen instead.

>>

>> When the runnable_load_avg of 2 CPUs are close, we now take into account

>> the amount of blocked load as a 2nd selection factor.

>>

>> For a use case like hackbench, this enables the scheduler to select different

>> CPUs during the fork sequence and to spread tasks across the system.

>>

>> Tests have been done on a Hikey board (ARM based octo core) for several

>> kernels. The results below give min, max, avg and stdev values of 18 runs

>> with each configuration.

>>

>> The v4.8+patches configuration also includes the change below, which is part of the

>> proposal made by Peter to ensure that the clock is up to date when the

>> forked task is attached to the rq.

>>

>> @@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)

>>       __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));

>>  #endif

>>       rq = __task_rq_lock(p, &rf);

>> +     update_rq_clock(rq);

>>       post_init_entity_util_avg(&p->se);

>>

>>       activate_task(rq, p, 0);

>>

>> hackbench -P -g 1

>>

>>        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches

>> min    0.049         0.050         0.051       0.048

>> avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)

>> max    0.066         0.068         0.070       0.063

>> stdev  +/-9%         +/-9%         +/-8%       +/-9%

>>

>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

>> ---

>>  kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------

>>  1 file changed, 32 insertions(+), 8 deletions(-)

>

> This patch looks pretty good to me and this 2-socket 48-cpu Xeon

> (domain0 SMT, domain1 MC, domain2 NUMA) shows a few nice performance

> improvements, and no regressions for various combinations of hackbench

> sockets/pipes and group numbers.


Good

>

> But on a 2-socket 8-cpu Xeon (domain0 MC, domain1 DIE) running,

>

>   perf stat --null -r 25 -- hackbench -pipe 30 process 1000


I'm going to have a look at this use case on my hikey board, made of 8
CPUs (2 clusters of quad cores) (domain0 MC, domain1 DIE), which seems to
have a similar topology.

>

> I see a regression,

>

>   baseline: 2.41228

>   patched : 2.64528 (-9.7%)


Just to be sure: by baseline, do you mean v4.8?

>

> Even though the spread of tasks during fork[0] is improved,

>

>   baseline CV: 0.478%

>   patched CV : 0.042%

>

> Clearly the spread wasn't *that* bad to begin with on this machine for

> this workload. I consider the baseline spread to be pretty well

> distributed. Some other factor must be at play.

>

> Patched runqueue latencies are higher (max9* are percentiles),

>

>   baseline: mean: 615932.69 max90: 75272.00 max95: 175985.00 max99: 5884778.00 max: 1694084747.00

>   patched: mean : 882026.28 max90: 92015.00 max95: 291760.00 max99: 7590167.00 max: 1841154776.00

>

> And there are more migrations of hackbench tasks,

>

>   baseline: total: 5390 cross-MC: 3810 cross-DIE: 1580

>   patched : total: 7222 cross-MC: 4591 cross-DIE: 2631

>                  (+34.0%)       (+20.5%)        (+66.5%)

>

> That's a lot more costly cross-DIE migrations. I think this patch is

> along the right lines, but there's something fishy happening on this

> box.

>

> [0] - Fork task placement spread measurement:

>

>       cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \

>         sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'


nice command to evaluate spread
Vincent Guittot Oct. 12, 2016, 7:41 a.m. UTC | #2
On 11 October 2016 at 20:57, Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Tue, 11 Oct, at 03:14:47PM, Vincent Guittot wrote:

>> >

>> > I see a regression,

>> >

>> >   baseline: 2.41228

>> >   patched : 2.64528 (-9.7%)

>>

>> Just to be sure; By baseline you mean v4.8 ?

>

> Baseline is actually tip/sched/core commit 447976ef4fd0

> ("sched/irqtime: Consolidate irqtime flushing code") but I could try

> out v4.8 instead if you'd prefer that.


ok. In fact, I have noticed another regression with tip/sched/core and
hackbench while looking at yours.
I have bisected it to:
10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

hackbench -P -g 1

       v4.8        tip/sched/core  tip/sched/core + revert of 10e2f1acd010 and 1b568f0aabf2
min    0.051       0.052           0.049
avg    0.057(0%)   0.062(-7%)      0.056(+1%)
max    0.070       0.073           0.067
stdev  +/-8%       +/-10%          +/-9%

The issue seems to be that it prevents some migrations at wake-up at
the end of the hackbench test, so the last tasks compete for the
same CPU whereas other CPUs are idle in the same MC domain. I haven't
had time to look more deeply at which part of the patch causes the regression yet.

 >

>> >       cat /tmp/trace.$1 | grep -E "wakeup_new.*comm=hackbench" | \

>> >         sed -e 's/.*target_cpu=//' | sort | uniq -c | awk '{print $1}'

>>

>> nice command to evaluate spread

>

> Thanks!
Matt Fleming Oct. 18, 2016, 10:29 a.m. UTC | #3
On Tue, 11 Oct, at 11:24:53AM, Matt Fleming wrote:
> 

> That's a lot more costly cross-DIE migrations. I think this patch is

> along the right lines, but there's something fishy happening on this

> box.


I wasn't really able to track down why this machine regressed, and the
patch shows enough improvement for other boxes that I think it's worth
sticking in tip/sched/core.

I saw a number of double digit gains for hackbench with this patch.
Here's a 4-cpu box that shows improvements in arithmetic mean and some
nice improvements to the stddev,

hackbench-process-pipes
                            4.8.0                 4.8.0
                          vanilla  find-idle-group-v1r1
Amean    1       3.2580 (  0.00%)      3.0607 (  6.06%)
Amean    3       8.0387 (  0.00%)      8.2027 ( -2.04%)
Amean    5      12.0507 (  0.00%)     12.1703 ( -0.99%)
Amean    7      16.0027 (  0.00%)     16.4567 ( -2.84%)
Amean    12     26.9030 (  0.00%)     21.7767 ( 19.05%)
Amean    16     33.4127 (  0.00%)     28.9877 ( 13.24%)
Stddev   1       0.1383 (  0.00%)      0.1787 (-29.22%)
Stddev   3       0.1297 (  0.00%)      0.1677 (-29.29%)
Stddev   5       0.5495 (  0.00%)      0.4549 ( 17.22%)
Stddev   7       0.7401 (  0.00%)      0.6161 ( 16.75%)
Stddev   12      2.3187 (  0.00%)      0.7102 ( 69.37%)
Stddev   16      4.0734 (  0.00%)      2.9364 ( 27.91%)

And a 60-cpu box with similar improvements,

hackbench-process-sockets
                             4.8.0                 4.8.0
                           vanilla  find-idle-group-v1r1
Amean    1       32.8810 (  0.00%)      2.0590 ( 93.74%)
Amean    4       34.8663 (  0.00%)      3.3177 ( 90.48%)
Amean    7       34.1243 (  0.00%)      4.3323 ( 87.30%)
Amean    12      36.8373 (  0.00%)      5.3650 ( 85.44%)
Amean    21      29.0333 (  0.00%)      6.0280 ( 79.24%)
Amean    30      27.2807 (  0.00%)      7.2653 ( 73.37%)
Amean    48      23.2453 (  0.00%)      9.9810 ( 57.06%)
Amean    79      29.6660 (  0.00%)     16.6680 ( 43.81%)
Amean    110     41.3680 (  0.00%)     23.1787 ( 43.97%)
Amean    141     45.3720 (  0.00%)     29.4130 ( 35.17%)
Amean    172     45.6403 (  0.00%)     34.3783 ( 24.68%)
Amean    203     47.6120 (  0.00%)     40.0063 ( 15.97%)
Amean    234     50.1703 (  0.00%)     43.4503 ( 13.39%)
Amean    240     50.1110 (  0.00%)     44.4660 ( 11.26%)
Stddev   1        0.9714 (  0.00%)      0.6617 ( 31.88%)
Stddev   4        0.4459 (  0.00%)      0.2943 ( 34.01%)
Stddev   7        0.3091 (  0.00%)      0.2722 ( 11.94%)
Stddev   12       0.4603 (  0.00%)      0.1424 ( 69.05%)
Stddev   21       1.3647 (  0.00%)      0.1566 ( 88.52%)
Stddev   30       4.1277 (  0.00%)      0.0481 ( 98.84%)
Stddev   48       1.3689 (  0.00%)      0.4968 ( 63.71%)
Stddev   79       2.0772 (  0.00%)      0.5309 ( 74.44%)
Stddev   110      4.5655 (  0.00%)      0.1846 ( 95.96%)
Stddev   141      4.2637 (  0.00%)      0.6675 ( 84.34%)
Stddev   172      1.5741 (  0.00%)      1.0242 ( 34.93%)
Stddev   203      1.0973 (  0.00%)      1.4043 (-27.98%)
Stddev   234      2.2048 (  0.00%)      0.8709 ( 60.50%)
Stddev   240      1.2849 (  0.00%)      0.2688 ( 79.08%)
Peter Zijlstra Oct. 18, 2016, 11:09 a.m. UTC | #4
On Wed, Oct 12, 2016 at 09:41:36AM +0200, Vincent Guittot wrote:

> ok. In fact, I have noticed another regression with tip/sched/core and

> hackbench while looking at yours.

> I have bisected it to:

> 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

> 

> hackbench -P -g 1

> 

>        v4.8        tip/sched/core  tip/sched/core + revert of 10e2f1acd010 and 1b568f0aabf2

> min    0.051       0.052           0.049

> avg    0.057(0%)   0.062(-7%)      0.056(+1%)

> max    0.070       0.073           0.067

> stdev  +/-8%       +/-10%          +/-9%

> 

> The issue seems to be that it prevents some migrations at wake-up at

> the end of the hackbench test, so the last tasks compete for the

> same CPU whereas other CPUs are idle in the same MC domain. I haven't

> had time to look more deeply at which part of the patch causes the regression yet.


So select_idle_cpu(), which does the LLC wide CPU scan is now throttled
by a comparison between avg_cost and avg_idle; where avg_cost is a
historical measure of how costly it was to scan the entire LLC domain
and avg_idle is our current idle time guestimate (also a historical
average).

The problem was that a number of workloads were spending quite a lot of
time here scanning CPUs while they could be doing useful work (esp.
since newer parts have silly amounts of CPUs per LLC).

The toggle is a heuristic with a random number in.. we could see if
there's anything better we can do. I know some people take the toggle
out entirely, but that will regress other workloads.
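
A minimal sketch of the shape of that throttle, assuming kernel types; the
helper name and the exact fuzz divisor are illustrative rather than a
verbatim copy of select_idle_cpu():

#include <linux/types.h>

/*
 * Illustrative sketch of the throttle described above: skip the LLC-wide
 * scan when its historical cost looks too expensive compared to this rq's
 * average idle time. The 512 divisor stands in for the "large fuzz factor"
 * Peter mentions.
 */
static inline bool llc_scan_too_expensive(u64 avg_idle, u64 avg_scan_cost)
{
	/* only scan if we expect to stay idle long enough to pay for it */
	return (avg_idle / 512) < avg_scan_cost;
}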
Peter Zijlstra Oct. 18, 2016, 11:10 a.m. UTC | #5
On Tue, Oct 18, 2016 at 11:29:37AM +0100, Matt Fleming wrote:
> On Tue, 11 Oct, at 11:24:53AM, Matt Fleming wrote:

> > 

> > That's a lot more costly cross-DIE migrations. I think this patch is

> > along the right lines, but there's something fishy happening on this

> > box.

> 

> I wasn't really able to track down why this machine regressed, and the

> patch shows enough improvement for other boxes that I think it's worth

> sticking in tip/sched/core.

> 


I'm entirely lost as to which patch we're talking about by now ;-)
Matt Fleming Oct. 18, 2016, 11:29 a.m. UTC | #6
On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:
> 

> I'm entirely lost as to which patch we're talking about by now ;-)


Heh, this one from Vincent,

  https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org
Peter Zijlstra Oct. 18, 2016, 12:15 p.m. UTC | #7
On Tue, Oct 18, 2016 at 12:29:57PM +0100, Matt Fleming wrote:
> On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:

> > 

> > I'm entirely lost as to which patch we're talking about by now ;-)

> 

> Heh, this one from Vincent,

> 

>   https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org


Ah, right.

Seems like a sensible thing to do, and I suppose I should go finish my
(and yours) update_rq_clock() patches that supersede the patch referred
to in that thing and is depended upon.


It might make sense to have helper functions to evaluate those
conditions, because currently there's two instances of each, once in the
branch selection and then again (but inverted, we miss the == case fwiw)
in the return NULL case.
Vincent Guittot Oct. 18, 2016, 3:19 p.m. UTC | #8
On 18 October 2016 at 13:09, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 12, 2016 at 09:41:36AM +0200, Vincent Guittot wrote:

>

>> ok. In fact, I have noticed another regression with tip/sched/core and

>> hackbench while looking at yours.

>> I have bisected it to:

>> 10e2f1acd0 ("sched/core: Rewrite and improve select_idle_siblings")

>>

>> hackbench -P -g 1

>>

>>        v4.8        tip/sched/core  tip/sched/core + revert of 10e2f1acd010 and 1b568f0aabf2

>> min    0.051       0.052           0.049

>> avg    0.057(0%)   0.062(-7%)      0.056(+1%)

>> max    0.070       0.073           0.067

>> stdev  +/-8%       +/-10%          +/-9%

>>

>> The issue seems to be that it prevents some migrations at wake-up at

>> the end of the hackbench test, so the last tasks compete for the

>> same CPU whereas other CPUs are idle in the same MC domain. I haven't

>> had time to look more deeply at which part of the patch causes the regression yet.

>

> So select_idle_cpu(), which does the LLC wide CPU scan is now throttled

> by a comparison between avg_cost and avg_idle; where avg_cost is a

> historical measure of how costly it was to scan the entire LLC domain

> and avg_idle is our current idle time guestimate (also a historical

> average).

>

> The problem was that a number of workloads were spending quite a lot of

> time here scanning CPUs while they could be doing useful work (esp.

> since newer parts have silly amounts of CPUs per LLC).


Makes sense.

>

> The toggle is a heuristic with a random number in.. we could see if

> there's anything better we can do. I know some people take the toggle

> out entirely, but that will regress other workloads.


ok. So removing the toggle fixes the problem in this test case too.

Maybe we can take sd_llc_size into account in the toggle.
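
One possible shape of that idea, purely as an illustration and not a patch
proposed in this thread: let the cutoff scale with the number of CPUs sharing
the LLC, so a wide LLC (where the full scan is expensive) is throttled harder
than a small one. The helper name and the exact scaling are assumptions.

#include <linux/types.h>

/*
 * Illustrative only: budget the scan cost against avg_idle per CPU in the
 * LLC instead of using a fixed divisor.
 */
static inline bool llc_scan_too_expensive_scaled(u64 avg_idle, u64 avg_scan_cost,
						 unsigned int sd_llc_size)
{
	/* larger LLCs need more idle time to justify a full scan */
	return avg_idle < avg_scan_cost * sd_llc_size;
}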
Vincent Guittot Oct. 19, 2016, 6:38 a.m. UTC | #9
On 18 October 2016 at 14:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Oct 18, 2016 at 12:29:57PM +0100, Matt Fleming wrote:

>> On Tue, 18 Oct, at 01:10:17PM, Peter Zijlstra wrote:

>> >

>> > I'm entirely lost as to which patch we're talking about by now ;-)

>>

>> Heh, this one from Vincent,

>>

>>   https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org

>

> Ah, right.

>

> Seems like a sensible thing to do, and I suppose I should go finish my

> (and yours) update_rq_clock() patches that supersede the patch referred

> to in that thing and is depended upon.

>

>

> It might make sense to have helper functions to evaluate those


The main issue is the number of parameters used in these conditions,
which makes a helper function not really more readable.

> conditions, because currently there's two instances of each, once in the

> branch selection and then again (but inverted, we miss the == case fwiw)


Not sure I catch the comment about inverted and missing the == case.
The test splits runnable_load_avg into 3 ranges:

[0 .. (min_runnable_load - imbalance)]: use
runnable_load_avg/this_runnable_load, which is significantly smaller

](min_runnable_load - imbalance) .. (min_runnable_load + imbalance)[ (open
interval): min_runnable_load and runnable_load_avg/this_runnable_load are
close, so we compare min_avg_load with avg_load/this_avg_load to choose

[(min_runnable_load + imbalance) .. ULONG_MAX]: use min_runnable_load

The condition is used when we look for the best other group in the
sched_domain and to compare the local group with this best other
group.

> in the return NULL case.

>

>

>
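
For reference, the three ranges can be written out as a small classification
helper. This is an illustrative sketch of the logic described above, not code
from the patch; the patch open-codes the same comparisons both for candidate
groups and, with the this_* values, for the local group.

/*
 * Illustrative helper mirroring the three ranges above; not part of the
 * patch. Returns how a group's runnable load compares to the current
 * minimum, given the imbalance margin.
 */
enum idlest_range {
	CLEARLY_LIGHTER,	/* runnable_load < min_runnable_load - imbalance */
	CLOSE_COMPARE_AVG,	/* close: decide on blocked load via avg_load */
	NOT_LIGHTER,		/* runnable_load >= min_runnable_load + imbalance */
};

static enum idlest_range
classify_runnable_load(unsigned long runnable_load,
		       unsigned long min_runnable_load,
		       unsigned long imbalance)
{
	if (min_runnable_load > runnable_load + imbalance)
		return CLEARLY_LIGHTER;
	if (runnable_load < min_runnable_load + imbalance)
		return CLOSE_COMPARE_AVG;
	return NOT_LIGHTER;
}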
Peter Zijlstra Oct. 19, 2016, 9:53 a.m. UTC | #10
On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:
> > It might make sense to have helper functions to evaluate those

> 

> The main issue is the number of parameters used in these conditions,

> which makes a helper function not really more readable.


Fair enough I suppose..

> > conditions, because currently there's two instances of each, once in the

> > branch selection and then again (but inverted, we miss the == case fwiw)

> 

> Not sure I catch the comment about inverted and missing the == case.


Oh, you're right. Which proves my point on this not being entirely
readable :/

Initially I thought the things were of the form:

	else if (ineq) {
	}

	...

	if (... || !ineq || ...)


Which would get you things like: a<b, !a<b := a>b, and leave a==b
undefined. But if I put them along side one another like:


+		} else if (min_runnable_load > (runnable_load + imbalance)) {

+	                  (min_runnable_load > (this_runnable_load + imbalance)) ||


+		} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+				(100*min_avg_load > sd->imbalance_pct*avg_load)) {

+	             ((this_runnable_load < (min_runnable_load + imbalance)) &&
+			        (100*min_avg_load > sd->imbalance_pct*this_avg_load)))

We can see this is not in fact the case.

Blergh, I also cannot see a pretty way to increase readability here,
because while they have the same general shape, there's this small
variation with this_*.

A well..
Vincent Guittot Nov. 9, 2016, 4:53 p.m. UTC | #11
On 19 October 2016 at 11:53, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Oct 19, 2016 at 08:38:12AM +0200, Vincent Guittot wrote:

>> > It might make sense to have helper functions to evaluate those

>>

>> The main issue is the number of parameters used in these conditions,

>> which makes a helper function not really more readable.

>

> Fair enough I suppose..

>

>> > conditions, because currently there's two instances of each, once in the

>> > branch selection and then again (but inverted, we miss the == case fwiw)

>>

>> Not sure I catch the comment about inverted and missing the == case.

>

> Oh, you're right. Which proves my point on this not being entirely

> readable :/


Hi Peter,

I'm not sure of the status of this patch.
It seems difficult to get more readable code.
Do you want me to add more comments about when we enter each state, or
is the current version enough?

Vincent

>

> Initially I thought the things were of the form:

>

>         else if (ineq) {

>         }

>

>         ...

>

>         if (... || !ineq || ...)

>

>

> Which would get you things like: a<b, !a<b := a>b, and leave a==b

> undefined. But if I put them along side one another like:

>

>

> +               } else if (min_runnable_load > (runnable_load + imbalance)) {

>

> +                         (min_runnable_load > (this_runnable_load + imbalance)) ||

>

>

> +               } else if ((runnable_load < (min_runnable_load + imbalance)) &&

> +                               (100*min_avg_load > sd->imbalance_pct*avg_load)) {

>

> +                    ((this_runnable_load < (min_runnable_load + imbalance)) &&

> +                               (100*min_avg_load > sd->imbalance_pct*this_avg_load)))

>

> We can see this is not in fact the case.

>

> Blergh, I also cannot see a pretty way to increase readability here,

> because while they have the same general shape, there's this small

> variation with this_*.

>

> A well..

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..628b00b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5166,15 +5166,16 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = (scale_load_down(NICE_0_LOAD)*(sd->imbalance_pct-100))/100;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5188,6 +5189,7 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5196,21 +5198,43 @@  find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
+		} else if (min_runnable_load > (runnable_load + imbalance)) {
+			/*
+			 * The runnable load is significantly smaller so we
+			 * can pick this new cpu
+			 */
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+				(100*min_avg_load > sd->imbalance_pct*avg_load)) {
+			/*
+			 * The runnable loads are close so we take into account
+			 * blocked load through avg_load which is blocked +
+			 * runnable load
+			 */
+			min_avg_load = avg_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest ||
+	    (min_runnable_load > (this_runnable_load + imbalance)) ||
+	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
+			(100*min_avg_load > sd->imbalance_pct*this_avg_load)))
 		return NULL;
 	return idlest;
 }