From patchwork Mon Dec 5 09:27:36 2016
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 86522
Date: Mon, 5 Dec 2016 10:27:36 +0100
From: Vincent Guittot
To: Matt Fleming
Cc: Brendan Gregg, Peter Zijlstra, Ingo Molnar, LKML,
	Morten.Rasmussen@arm.com, dietmar.eggemann@arm.com,
	kernellwp@gmail.com, yuyang.du@intel.com,
	umgwanakikbuti@gmail.com, Mel Gorman
Subject: Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group
Message-ID: <20161205092735.GA9161@linaro.org>
References: <1480088073-11642-1-git-send-email-vincent.guittot@linaro.org>
 <1480088073-11642-3-git-send-email-vincent.guittot@linaro.org>
 <20161203214707.GI20785@codeblueprint.co.uk>
In-Reply-To: <20161203214707.GI20785@codeblueprint.co.uk>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Mailing-List: linux-kernel@vger.kernel.org

On Saturday 03 Dec 2016 at 21:47:07
(+0000), Matt Fleming wrote:
> On Fri, 02 Dec, at 07:31:04PM, Brendan Gregg wrote:
> >
> > For background, is this from the "A decade of wasted cores" paper's
> > patches?
>
> No, this patch fixes an issue I originally reported here,
>
>   https://lkml.kernel.org/r/20160923115808.2330-1-matt@codeblueprint.co.uk
>
> Essentially, if you have an idle or partially-idle system and a
> workload that consists of fork()'ing a bunch of tasks, where each of
> those tasks immediately sleeps waiting for some wakeup, then those
> tasks aren't spread across all idle CPUs very well.
>
> We saw this issue when running hackbench with a small loop count, such
> that the actual benchmark setup (fork()'ing) is where the majority of
> the runtime is spent.
>
> In that scenario, there's a large potential/blocked load, but
> essentially no runnable load, and the balance-on-fork scheduler code
> only cares about runnable load without Vincent's patch applied.
>
> The closest thing I can find in the "A decade of wasted cores" paper
> is "The Overload-on-Wakeup bug", but I don't think that's the issue
> here since,
>
>  a) We're balancing on fork, not wakeup
>  b) The balance-on-fork code balances across nodes OK
>
> > What's the expected typical gain? Thanks,
>
> The results are still coming back from the SUSE performance test grid
> but they do show that this patch is mainly a win for multi-socket
> machines with more than 8 cores or thereabouts.
>
> [ Vincent, I'll follow up to your PATCH 1/2 with the results that are
>   specifically for that patch ]
>
> Assuming a fork-intensive or fork-dominated workload, and a
> multi-socket machine, such as this 2-socket NUMA box with 12 cores
> per socket and HT enabled (48 cpus), we saw a very clear win between
> +10% and +15% for processes communicating via pipes,
>
>   (1) tip-sched        = tip/sched/core branch
>   (2) fix-fig-for-fork = (1) + PATCH 1/2
>   (3) fix-sig          = (1) + (2) + PATCH 2/2
>
> hackbench-process-pipes
>                       4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                       tip-sched      fix-fig-for-fork               fix-sig
> Amean    1      0.0717 (  0.00%)      0.0696 (  2.99%)      0.0730 ( -1.79%)
> Amean    4      0.1244 (  0.00%)      0.1200 (  3.56%)      0.1190 (  4.36%)
> Amean    7      0.1891 (  0.00%)      0.1937 ( -2.42%)      0.1831 (  3.17%)
> Amean    12     0.2964 (  0.00%)      0.3116 ( -5.11%)      0.2784 (  6.07%)
> Amean    21     0.4011 (  0.00%)      0.4090 ( -1.96%)      0.3574 ( 10.90%)
> Amean    30     0.4944 (  0.00%)      0.4654 (  5.87%)      0.4171 ( 15.63%)
> Amean    48     0.6113 (  0.00%)      0.6309 ( -3.20%)      0.5331 ( 12.78%)
> Amean    79     0.8616 (  0.00%)      0.8706 ( -1.04%)      0.7710 ( 10.51%)
> Amean    110    1.1304 (  0.00%)      1.2211 ( -8.02%)      1.0163 ( 10.10%)
> Amean    141    1.3754 (  0.00%)      1.4279 ( -3.81%)      1.2803 (  6.92%)
> Amean    172    1.6217 (  0.00%)      1.7367 ( -7.09%)      1.5363 (  5.27%)
> Amean    192    1.7809 (  0.00%)      2.0199 (-13.42%)      1.7129 (  3.82%)
>
> Things look even better when using threads and pipes, with wins
> between 11% and 29% when looking at results outside of the noise,
>
> hackbench-thread-pipes
>                       4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                       tip-sched      fix-fig-for-fork               fix-sig
> Amean    1      0.0736 (  0.00%)      0.0794 ( -7.96%)      0.0779 ( -5.83%)
> Amean    4      0.1709 (  0.00%)      0.1690 (  1.09%)      0.1663 (  2.68%)
> Amean    7      0.2836 (  0.00%)      0.3080 ( -8.61%)      0.2640 (  6.90%)
> Amean    12     0.4393 (  0.00%)      0.4843 (-10.24%)      0.4090 (  6.89%)
> Amean    21     0.5821 (  0.00%)      0.6369 ( -9.40%)      0.5126 ( 11.95%)
> Amean    30     0.6557 (  0.00%)      0.6459 (  1.50%)      0.5711 ( 12.90%)
> Amean    48     0.7924 (  0.00%)      0.7760 (  2.07%)      0.6286 ( 20.68%)
> Amean    79     1.0534 (  0.00%)      1.0551 ( -0.16%)      0.8481 ( 19.49%)
> Amean    110    1.5286 (  0.00%)      1.4504 (  5.11%)      1.1121 ( 27.24%)
> Amean    141    1.9507 (  0.00%)      1.7790 (  8.80%)      1.3804 ( 29.23%)
> Amean    172    2.2261 (  0.00%)      2.3330 ( -4.80%)      1.6336 ( 26.62%)
> Amean    192    2.3753 (  0.00%)      2.3307 (  1.88%)      1.8246 ( 23.19%)
>
> Somewhat surprisingly, I can see improvements for UMA machines with
> fewer cores when the workload heavily saturates the machine and the
> workload isn't dominated by fork. Such heavy saturation isn't super
> realistic, but still interesting. I haven't dug into why these results
> occurred, but I am happy things didn't instead fall off a cliff.
>
> Here's a 4-cpu UMA box showing some improvement at the higher end,
>
> hackbench-process-pipes
>                       4.9.0-rc6             4.9.0-rc6             4.9.0-rc6
>                       tip-sched      fix-fig-for-fork               fix-sig
> Amean    1      3.5060 (  0.00%)      3.5747 ( -1.96%)      3.5117 ( -0.16%)
> Amean    3      7.7113 (  0.00%)      7.8160 ( -1.36%)      7.7747 ( -0.82%)
> Amean    5     11.4453 (  0.00%)     11.5710 ( -1.10%)     11.3870 (  0.51%)
> Amean    7     15.3147 (  0.00%)     15.9420 ( -4.10%)     15.8450 ( -3.46%)
> Amean    12    25.5110 (  0.00%)     24.3410 (  4.59%)     22.6717 ( 11.13%)
> Amean    16    32.3010 (  0.00%)     28.5897 ( 11.49%)     25.7473 ( 20.29%)

Hi Matt,

Thanks for the results.

During the review, Morten pointed out that the test condition
(100*this_avg_load < imbalance_scale*min_avg_load) makes more sense
than (100*min_avg_load > imbalance_scale*this_avg_load), but I see
lower performance with that change. Could you run tests with the
change below on top of the patchset?
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8d1ae7..0129fbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5514,7 +5514,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	if (!idlest ||
 	    (min_runnable_load > (this_runnable_load + imbalance)) ||
 	    ((this_runnable_load < (min_runnable_load + imbalance)) &&
-	     (100*min_avg_load > imbalance_scale*this_avg_load)))
+	     (100*this_avg_load < imbalance_scale*min_avg_load)))
 		return NULL;
 	return idlest;
 }
--
2.7.4