From patchwork Mon Oct 10 17:34:40 2016
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 77450
Date: Mon, 10 Oct 2016 19:34:40 +0200
From: Vincent Guittot
To: Matt Fleming
Cc: Wanpeng Li, Peter Zijlstra, Ingo Molnar, "linux-kernel@vger.kernel.org",
 Mike Galbraith, Yuyang Du, Dietmar Eggemann
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
Message-ID: <20161010173440.GA28945@linaro.org>
References: <20160923115808.2330-1-matt@codeblueprint.co.uk>
 <20160928101422.GR5016@twins.programming.kicks-ass.net>
 <20160928193731.GD16071@codeblueprint.co.uk>
 <20161010100107.GZ16071@codeblueprint.co.uk>
X-Mailing-List: linux-kernel@vger.kernel.org

On Monday 10 Oct 2016 at 14:29:28 (+0200), Vincent Guittot wrote:
> On 10 October 2016 at 12:01, Matt Fleming wrote:
> > On Sun, 09 Oct, at 11:39:27AM, Wanpeng Li wrote:
> >>
> >> The difference between this patch and Peterz's is that your patch has a
> >> delta, since activate_task()->enqueue_task() does do update_rq_clock(),
> >> so why would not having the delta cause low cpu machines (4 or 8) to
> >> regress, per your other reply in this thread?
> >
> > Both my patch and Peter's patch cause issues with low cpu machines. In
> > <20161004201105.GP16071@codeblueprint.co.uk> I said,
> >
> > "This patch causes some low cpu machines (4 or 8) to regress. It turns
> > out they regress with my patch too."
> >
> > Have I misunderstood your question?
> >
> > I ran out of time to investigate this last week, though I did try all
> > proposed patches, including Vincent's, and none of them produced wins
> > across the board.
>
> I have tried to reproduce your issue on my target, a Hikey board (ARM
> based octo cores), but I failed to see a regression with commit
> 7dc603c9028e. Nevertheless, I can see tasks not being spread well
> during fork, as you mentioned. So I studied the spreading issue during
> fork a bit more last week, and I have a new version of my proposed
> patch that I'm going to send soon. With this patch, I can see a good
> spread of tasks during the fork sequence and some perf improvement,
> even if it is a bit difficult to quantify because the variance of the
> hackbench test is quite high, so it is mainly an improvement in the
> repeatability of the result.

Subject: [PATCH] sched: use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking
for the idlest group. But on a fork-intensive use case like hackbench,
where tasks are blocked quickly after the fork, this can lead to
selecting the same CPU whereas other CPUs, which have a similar
runnable load but a lower load_avg, could be chosen instead.

When the runnable_load_avg values of 2 CPUs are close, we now take
into account the amount of blocked load as a 2nd selection factor.
For a use case like hackbench, this enables the scheduler to select
different CPUs during the fork sequence and to spread tasks across
the system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernels. The table below gives the min, max, avg and stdev
values of 18 runs with each configuration.

The v4.8+patches configuration also includes the change below, which
is part of the proposal made by Peter to ensure that the clock is up
to date when the forked task is attached to the rq:
@@ -2568,6 +2568,7 @@ void wake_up_new_task(struct task_struct *p)
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 	rq = __task_rq_lock(p, &rf);
+	update_rq_clock(rq);
 	post_init_entity_util_avg(&p->se);
 	activate_task(rq, p, 0);

hackbench -P -g 1

        ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
 min    0.049         0.050         0.051       0.048
 avg    0.057         0.057 (0%)    0.057 (0%)  0.055 (+5%)
 max    0.066         0.068         0.070       0.063
 stdev  +/-9%         +/-9%         +/-8%       +/-9%

Signed-off-by: Vincent Guittot
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

--
2.7.4

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 039de34..628b00b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5166,15 +5166,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_runnable_load = 0;
+	unsigned long min_avg_load = ULONG_MAX, this_avg_load = 0;
 	int load_idx = sd->forkexec_idx;
-	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	unsigned long imbalance = (scale_load_down(NICE_0_LOAD)*(sd->imbalance_pct-100))/100;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5188,6 +5189,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5196,21 +5198,43 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) /
+					group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_runnable_load = runnable_load;
+			this_avg_load = avg_load;
+		} else if (min_runnable_load > (runnable_load + imbalance)) {
+			/*
+			 * The runnable load is significantly smaller so we
+			 * can pick this new cpu
+			 */
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load < (min_runnable_load + imbalance)) &&
+			   (100*min_avg_load > sd->imbalance_pct*avg_load)) {
+			/*
+			 * The runnable loads are close so we take into account
+			 * blocked load through avg_load, which is blocked +
+			 * runnable load
+			 */
+			min_avg_load = avg_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest ||
+	    (min_runnable_load > (this_runnable_load + imbalance)) ||
+	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
+	     (100*min_avg_load > sd->imbalance_pct*this_avg_load)))
 		return NULL;
 	return idlest;
 }
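
[Editor's note] For readers who want to experiment with the selection rule
outside the kernel, the sketch below reproduces its shape in plain C: prefer
the group whose runnable load is smaller by more than an absolute imbalance
margin, and when the runnable loads are within that margin, fall back to
comparing avg_load (runnable + blocked load) against the imbalance_pct ratio.
The struct, the margin value and the sample loads are illustrative
assumptions, not kernel code; only the comparison logic mirrors the diff
above.

/*
 * Standalone illustration (not kernel code) of the group selection rule
 * added above: pick the group whose runnable load is smaller by more than
 * an absolute imbalance margin; when runnable loads are within the margin,
 * prefer the group with the smaller avg_load (runnable + blocked load).
 * Names, the margin and the sample values are made up for the example.
 */
#include <stdio.h>

struct group_load {
	const char *name;
	unsigned long runnable_load;	/* load of currently runnable tasks */
	unsigned long avg_load;		/* runnable + blocked load */
};

/* Mirrors scale_load_down(NICE_0_LOAD) * (imbalance_pct - 100) / 100 */
#define IMBALANCE_MARGIN	256UL	/* 1024 * 25 / 100 with imbalance_pct == 125 */
#define IMBALANCE_PCT		125UL

static const struct group_load *
pick_idlest(const struct group_load *groups, int nr)
{
	const struct group_load *idlest = NULL;
	unsigned long min_runnable_load = ~0UL, min_avg_load = ~0UL;

	for (int i = 0; i < nr; i++) {
		const struct group_load *g = &groups[i];

		if (min_runnable_load > g->runnable_load + IMBALANCE_MARGIN) {
			/* Clearly less runnable load: take this group. */
			min_runnable_load = g->runnable_load;
			min_avg_load = g->avg_load;
			idlest = g;
		} else if (g->runnable_load < min_runnable_load + IMBALANCE_MARGIN &&
			   100 * min_avg_load > IMBALANCE_PCT * g->avg_load) {
			/* Runnable loads are close: prefer less blocked load. */
			min_avg_load = g->avg_load;
			idlest = g;
		}
	}
	return idlest;
}

int main(void)
{
	/* Two remote groups with equal runnable load but different blocked load. */
	const struct group_load groups[] = {
		{ "group0", 1024, 2048 },
		{ "group1", 1024, 1024 },
	};
	const struct group_load *g = pick_idlest(groups, 2);

	printf("idlest: %s\n", g ? g->name : "none");	/* prints "group1" */
	return 0;
}

Built with any C99 compiler, this picks group1: the runnable loads tie within
the margin, so the lower blocked load decides, which is the behaviour the
changelog describes for the fork-heavy hackbench case.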