From patchwork Fri Oct 18 13:26:28 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176831 Delivered-To: patch@linaro.org Received: from localhost.localdomain (91-160-61-128.subs.proxad.net.
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.43 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:44 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 01/11] sched/fair: clean up asym packing Date: Fri, 18 Oct 2019 15:26:28 +0200 Message-Id: <1571405198-27570-2-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Clean up asym packing to follow the default load balance behavior: - classify the group by creating a group_asym_packing field. - calculate the imbalance in calculate_imbalance() instead of bypassing it. We don't need to test twice same conditions anymore to detect asym packing and we consolidate the calculation of imbalance in calculate_imbalance(). There is no functional changes. Signed-off-by: Vincent Guittot Acked-by: Rik van Riel --- kernel/sched/fair.c | 63 ++++++++++++++--------------------------------------- 1 file changed, 16 insertions(+), 47 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 1f0a5e1..617145c 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7675,6 +7675,7 @@ struct sg_lb_stats { unsigned int group_weight; enum group_type group_type; int group_no_capacity; + unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; @@ -8129,9 +8130,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, * ASYM_PACKING needs to move all the work to the highest * prority CPUs in the group, therefore mark all groups * of lower priority than ourself as busy. + * + * This is primarily intended to used at the sibling level. Some + * cores like POWER7 prefer to use lower numbered SMT threads. In the + * case of POWER7, it can move to lower SMT modes only when higher + * threads are idle. When in lower SMT modes, the threads will + * perform better since they share less core resources. Hence when we + * have idle threads, we want them to be the higher ones. */ if (sgs->sum_nr_running && sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { + sgs->group_asym_packing = 1; if (!sds->busiest) return true; @@ -8273,51 +8282,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } /** - * check_asym_packing - Check to see if the group is packed into the - * sched domain. - * - * This is primarily intended to used at the sibling level. Some - * cores like POWER7 prefer to use lower numbered SMT threads. In the - * case of POWER7, it can move to lower SMT modes only when higher - * threads are idle. When in lower SMT modes, the threads will - * perform better since they share less core resources. Hence when we - * have idle threads, we want them to be the higher ones. - * - * This packing function is run on idle threads. 
It checks to see if - * the busiest CPU in this domain (core in the P7 case) has a higher - * CPU number than the packing function is being run on. Here we are - * assuming lower CPU number will be equivalent to lower a SMT thread - * number. - * - * Return: 1 when packing is required and a task should be moved to - * this CPU. The amount of the imbalance is returned in env->imbalance. - * - * @env: The load balancing environment. - * @sds: Statistics of the sched_domain which is to be packed - */ -static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) -{ - int busiest_cpu; - - if (!(env->sd->flags & SD_ASYM_PACKING)) - return 0; - - if (env->idle == CPU_NOT_IDLE) - return 0; - - if (!sds->busiest) - return 0; - - busiest_cpu = sds->busiest->asym_prefer_cpu; - if (sched_asym_prefer(busiest_cpu, env->dst_cpu)) - return 0; - - env->imbalance = sds->busiest_stat.group_load; - - return 1; -} - -/** * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during * load balancing. @@ -8401,6 +8365,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s local = &sds->local_stat; busiest = &sds->busiest_stat; + if (busiest->group_asym_packing) { + env->imbalance = busiest->group_load; + return; + } + if (busiest->group_type == group_imbalanced) { /* * In the group_imb case we cannot rely on group-wide averages @@ -8505,8 +8474,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env) busiest = &sds.busiest_stat; /* ASYM feature bypasses nice load balance check */ - if (check_asym_packing(env, &sds)) - return sds.busiest; + if (busiest->group_asym_packing) + goto force_balance; /* There is no busy sibling group to pull tasks from */ if (!sds.busiest || busiest->sum_nr_running == 0) From patchwork Fri Oct 18 13:26:29 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176832 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861414ill; Fri, 18 Oct 2019 06:26:58 -0700 (PDT) X-Google-Smtp-Source: APXvYqw0RZ88UHQL2ns2uR92UVzr0YSmAvjMiHSBr8JMn0+52yBTYgTurpVwh6JkppS0kEAUs320 X-Received: by 2002:a17:906:1342:: with SMTP id x2mr8607549ejb.304.1571405218794; Fri, 18 Oct 2019 06:26:58 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405218; cv=none; d=google.com; s=arc-20160816; b=c70wiX8MOYevGKCBvs+5hqEyNsH1bvoV1XyCt6PHTJyTNAdMN8YpuBryjmVY2tdNVF YPJ4xb2lvm3g9lAefmTez1D5AHMlK/sGzTe0pkrEW75Eclz3Oz9bnAAMX3km/2fzZaUw VP3fHggYlMKCMxIzz4lFNQI5r98xQsRWDRtBqZKqxzm0BPOaszcEeiiz2zAanZ9T2BHE AlNl6ygNPrAlZe30rHplE0he7iCv3p38v7aivJOM/I2zOOfHMONkrmqXO++lfEUrIVNs 3b6hFjFJTcqr9g1XsnWlqwyS2JP1QLYXrw5RDv2qOEfzsfy26Whkg8/zAkkIxzrl/OEW WH3Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=NQ/eLJSNkXW8LLvqvhxjiDQg9h5mHrnthUFHJ+Kv0ak=; b=cnqEDW18NXgGBdaH064mmF39wUK9a6YZPLLpQAXG5zNOLHyrCNppGw3g1K9XaK61My 9Y2at4zjJmXSFjoY7Kt0KTa3d4RJnQxVJue7h6/ZEVcRDzbV3LQhiIauP6ZIKDlJt/FH J/KrWgt7IzIs5EjnVNzdlms2r02INZJl46mM2BD7yDHM/uOUKzBjZ0+N6nE4NPFG3rP0 INm0Sa08yY+J0Q6b87kOXO7bJ3DH0CwriBUE02z144B/mlAa53iClvjpjWks+pZg1M5x Vb1pQu6K3WkfqUT8RYufFlCT+oe4jJCTQCTcO6swnC5TIetIR/yHBpwMldyHY5BqtpTl tHOg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=xLuEtlbm; spf=pass (google.com: best guess record for 
domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: from localhost.localdomain (91-160-61-128.subs.proxad.net.
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:45 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 02/11] sched/fair: rename sum_nr_running to sum_h_nr_running Date: Fri, 18 Oct 2019 15:26:29 +0200 Message-Id: <1571405198-27570-3-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Rename sum_nr_running to sum_h_nr_running because it effectively tracks cfs->h_nr_running so we can use sum_nr_running to track rq->nr_running when needed. There is no functional changes. Signed-off-by: Vincent Guittot Acked-by: Rik van Riel Reviewed-by: Valentin Schneider --- kernel/sched/fair.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) -- 2.7.4 Acked-by: Mel Gorman diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 617145c..9a2aceb 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7670,7 +7670,7 @@ struct sg_lb_stats { unsigned long load_per_task; unsigned long group_capacity; unsigned long group_util; /* Total utilization of the group */ - unsigned int sum_nr_running; /* Nr tasks running in the group */ + unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; @@ -7715,7 +7715,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .total_capacity = 0UL, .busiest_stat = { .avg_load = 0UL, - .sum_nr_running = 0, + .sum_h_nr_running = 0, .group_type = group_other, }, }; @@ -7906,7 +7906,7 @@ static inline int sg_imbalanced(struct sched_group *group) static inline bool group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running < sgs->group_weight) + if (sgs->sum_h_nr_running < sgs->group_weight) return true; if ((sgs->group_capacity * 100) > @@ -7927,7 +7927,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) static inline bool group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running <= sgs->group_weight) + if (sgs->sum_h_nr_running <= sgs->group_weight) return false; if ((sgs->group_capacity * 100) < @@ -8019,7 +8019,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_load += cpu_runnable_load(rq); sgs->group_util += cpu_util(i); - sgs->sum_nr_running += rq->cfs.h_nr_running; + sgs->sum_h_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; if (nr_running > 1) @@ -8049,8 +8049,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_capacity = group->sgc->capacity; sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; - if (sgs->sum_nr_running) - sgs->load_per_task = sgs->group_load / sgs->sum_nr_running; + if (sgs->sum_h_nr_running) + sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running; sgs->group_weight = group->group_weight; @@ -8107,7 +8107,7 @@ static bool 
update_sd_pick_busiest(struct lb_env *env, * capable CPUs may harm throughput. Maximize throughput, * power/energy consequences are not considered. */ - if (sgs->sum_nr_running <= sgs->group_weight && + if (sgs->sum_h_nr_running <= sgs->group_weight && group_smaller_min_cpu_capacity(sds->local, sg)) return false; @@ -8138,7 +8138,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * perform better since they share less core resources. Hence when we * have idle threads, we want them to be the higher ones. */ - if (sgs->sum_nr_running && + if (sgs->sum_h_nr_running && sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { sgs->group_asym_packing = 1; if (!sds->busiest) @@ -8156,9 +8156,9 @@ static bool update_sd_pick_busiest(struct lb_env *env, #ifdef CONFIG_NUMA_BALANCING static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running > sgs->nr_numa_running) + if (sgs->sum_h_nr_running > sgs->nr_numa_running) return regular; - if (sgs->sum_nr_running > sgs->nr_preferred_running) + if (sgs->sum_h_nr_running > sgs->nr_preferred_running) return remote; return all; } @@ -8233,7 +8233,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd */ if (prefer_sibling && sds->local && group_has_capacity(env, local) && - (sgs->sum_nr_running > local->sum_nr_running + 1)) { + (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { sgs->group_no_capacity = 1; sgs->group_type = group_classify(sg, sgs); } @@ -8245,7 +8245,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd next_group: /* Now, start updating sd_lb_stats */ - sds->total_running += sgs->sum_nr_running; + sds->total_running += sgs->sum_h_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; @@ -8299,7 +8299,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) local = &sds->local_stat; busiest = &sds->busiest_stat; - if (!local->sum_nr_running) + if (!local->sum_h_nr_running) local->load_per_task = cpu_avg_load_per_task(env->dst_cpu); else if (busiest->load_per_task > local->load_per_task) imbn = 1; @@ -8397,7 +8397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE; + load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE; if (load_above_capacity > busiest->group_capacity) { load_above_capacity -= busiest->group_capacity; load_above_capacity *= scale_load_down(NICE_0_LOAD); @@ -8478,7 +8478,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance; /* There is no busy sibling group to pull tasks from */ - if (!sds.busiest || busiest->sum_nr_running == 0) + if (!sds.busiest || busiest->sum_h_nr_running == 0) goto out_balanced; /* XXX broken for overlapping NUMA groups */ From patchwork Fri Oct 18 13:26:30 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176834 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861441ill; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) X-Google-Smtp-Source: APXvYqxDUDWNyrRMLakaPH1rWE6SS6Td2H8DrzvJ4ZEEEeEna8SAB7/WsT8YGUWuacG81OwTxcwM X-Received: by 2002:a17:906:cc90:: with SMTP id oq16mr8482650ejb.322.1571405219746; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) ARC-Seal: i=1; 
D0nZ86pd7B4nRXQd9ElobbJC8qn7qLDQD885VuL+KNIq41o1wPZN3uqikJLu7q7YNk8F bnreuN1An0jWlLtGNlzpVzE5MoFC/ed3rARbw8HbvKZ8FDkk9LcoahdrbKjlEIT/NUC8 aWlg== X-Gm-Message-State: APjAAAUrP9wx+/l4e5+UektKAadOONhDSfax5YZBaSht5hzYfsi2hGML d87iU1L0ZwsYZJ7oEs363cbUuR3x9fE= X-Received: by 2002:adf:de85:: with SMTP id w5mr7613678wrl.278.1571405209399; Fri, 18 Oct 2019 06:26:49 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. [91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.47 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:48 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 03/11] sched/fair: remove meaningless imbalance calculation Date: Fri, 18 Oct 2019 15:26:30 +0200 Message-Id: <1571405198-27570-4-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org clean up load_balance and remove meaningless calculation and fields before adding new algorithm. Signed-off-by: Vincent Guittot Acked-by: Rik van Riel --- kernel/sched/fair.c | 105 +--------------------------------------------------- 1 file changed, 1 insertion(+), 104 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9a2aceb..e004841 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5390,18 +5390,6 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; } -static unsigned long cpu_avg_load_per_task(int cpu) -{ - struct rq *rq = cpu_rq(cpu); - unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running); - unsigned long load_avg = cpu_runnable_load(rq); - - if (nr_running) - return load_avg / nr_running; - - return 0; -} - static void record_wakee(struct task_struct *p) { /* @@ -7667,7 +7655,6 @@ static unsigned long task_h_load(struct task_struct *p) struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */ - unsigned long load_per_task; unsigned long group_capacity; unsigned long group_util; /* Total utilization of the group */ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ @@ -8049,9 +8036,6 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_capacity = group->sgc->capacity; sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; - if (sgs->sum_h_nr_running) - sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running; - sgs->group_weight = group->group_weight; sgs->group_no_capacity = group_is_overloaded(env, sgs); @@ -8282,76 +8266,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } /** - * fix_small_imbalance - Calculate the minor imbalance that exists - * amongst the groups of a sched_domain, during - * load balancing. - * @env: The load balancing environment. - * @sds: Statistics of the sched_domain whose imbalance is to be calculated. 
- */ -static inline -void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) -{ - unsigned long tmp, capa_now = 0, capa_move = 0; - unsigned int imbn = 2; - unsigned long scaled_busy_load_per_task; - struct sg_lb_stats *local, *busiest; - - local = &sds->local_stat; - busiest = &sds->busiest_stat; - - if (!local->sum_h_nr_running) - local->load_per_task = cpu_avg_load_per_task(env->dst_cpu); - else if (busiest->load_per_task > local->load_per_task) - imbn = 1; - - scaled_busy_load_per_task = - (busiest->load_per_task * SCHED_CAPACITY_SCALE) / - busiest->group_capacity; - - if (busiest->avg_load + scaled_busy_load_per_task >= - local->avg_load + (scaled_busy_load_per_task * imbn)) { - env->imbalance = busiest->load_per_task; - return; - } - - /* - * OK, we don't have enough imbalance to justify moving tasks, - * however we may be able to increase total CPU capacity used by - * moving them. - */ - - capa_now += busiest->group_capacity * - min(busiest->load_per_task, busiest->avg_load); - capa_now += local->group_capacity * - min(local->load_per_task, local->avg_load); - capa_now /= SCHED_CAPACITY_SCALE; - - /* Amount of load we'd subtract */ - if (busiest->avg_load > scaled_busy_load_per_task) { - capa_move += busiest->group_capacity * - min(busiest->load_per_task, - busiest->avg_load - scaled_busy_load_per_task); - } - - /* Amount of load we'd add */ - if (busiest->avg_load * busiest->group_capacity < - busiest->load_per_task * SCHED_CAPACITY_SCALE) { - tmp = (busiest->avg_load * busiest->group_capacity) / - local->group_capacity; - } else { - tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) / - local->group_capacity; - } - capa_move += local->group_capacity * - min(local->load_per_task, local->avg_load + tmp); - capa_move /= SCHED_CAPACITY_SCALE; - - /* Move if we gain throughput */ - if (capa_move > capa_now) - env->imbalance = busiest->load_per_task; -} - -/** * calculate_imbalance - Calculate the amount of imbalance present within the * groups of a given sched_domain during load balance. * @env: load balance environment @@ -8370,15 +8284,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s return; } - if (busiest->group_type == group_imbalanced) { - /* - * In the group_imb case we cannot rely on group-wide averages - * to ensure CPU-load equilibrium, look at wider averages. 
XXX - */ - busiest->load_per_task = - min(busiest->load_per_task, sds->avg_load); - } - /* * Avg load of busiest sg can be less and avg load of local sg can * be greater than avg load across all sgs of sd because avg load @@ -8389,7 +8294,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s (busiest->avg_load <= sds->avg_load || local->avg_load >= sds->avg_load)) { env->imbalance = 0; - return fix_small_imbalance(env, sds); + return; } /* @@ -8427,14 +8332,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s busiest->group_misfit_task_load); } - /* - * if *imbalance is less than the average load per runnable task - * there is no guarantee that any tasks will be moved so we'll have - * a think about bumping its value to force at least one task to be - * moved - */ - if (env->imbalance < busiest->load_per_task) - return fix_small_imbalance(env, sds); } /******* find_busiest_group() helpers end here *********************/ From patchwork Fri Oct 18 13:26:31 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176841 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp862022ill; Fri, 18 Oct 2019 06:27:28 -0700 (PDT) X-Google-Smtp-Source: APXvYqwMwVKaQm5OkCoNCBcG9ZMK7+dnIx4tZVI0oinXsMD/g+N+NXLcUQzDWMy+wmLxqXXNQKKa X-Received: by 2002:a17:906:3016:: with SMTP id 22mr8357181ejz.227.1571405247828; Fri, 18 Oct 2019 06:27:27 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405247; cv=none; d=google.com; s=arc-20160816; b=lRTYx/im8rtcqh7LT5ixVrMBK6scACEJIKFnnqmUUSpBlm+zN3Og1Fwk4J/hmE8L0k okfkuLtti85TYpzX6whSPUF/gcARsmt96376ZeDgjIoyElHvpUmpWt7xhxCAiJBhhX5D GdScyH4Q2TK+Fkv/DD5J4dgzrZstUjMsGVhbzLZmpCv2DsBf26nudgbUwmQOQrnFuRk4 BQEnxKgP0+z3drttk6DzXo+8Z8yvtDVs1UdawcEGDAImpI+t67PqMFg2agzNvM7YD13k dm/jHnhOaio4NR5zD0b3KsuNoBABzGPXpdvj7kRYNC7T9l0QbeImpz80QAIhDIqD8psG iJlw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=DD+OOZJusUQze5RChHgFxDyxwgAbUDvFr0pebvimYqI=; b=ZmwPvMjUHKRiLoATSpG4uEdxCAP1B9jllXjJqVxw2VRWvMtLpxZc82IYFw6n/UTNm9 vkkyIaFABuFi76o7XQCP1pDFYN5mtzF/NWqwUkrqKhTD9DxIC3DPtDq7Lt+Cbggvxtzy 6jGsFmy1OhqBBS/pB20kIZ5FQ4wdxao/ESPw6dVg67+MBM6kJDiCMhMcuPZI9KXcue05 cX4Nb+qd7rU4XC9eOP8i1eOO0lQX3wg29A02erTi45pLguMiRc2VJgjfn6JqmWEbD+dj yBsLmvu6Sl0PVAlkyRiizSozhDW8m9+YgwwZMzcIBRNtTpgeP3CRavn6hv3/un8Xy2FO d05g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b="IE5Dm6k/"; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id q10si3850345eda.293.2019.10.18.06.27.27; Fri, 18 Oct 2019 06:27:27 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net.
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.49 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:49 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 04/11] sched/fair: rework load_balance Date: Fri, 18 Oct 2019 15:26:31 +0200 Message-Id: <1571405198-27570-5-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The load_balance algorithm contains some heuristics which have become meaningless since the rework of the scheduler's metrics like the introduction of PELT. Furthermore, load is an ill-suited metric for solving certain task placement imbalance scenarios. For instance, in the presence of idle CPUs, we should simply try to get at least one task per CPU, whereas the current load-based algorithm can actually leave idle CPUs alone simply because the load is somewhat balanced. The current algorithm ends up creating virtual and meaningless value like the avg_load_per_task or tweaks the state of a group to make it overloaded whereas it's not, in order to try to migrate tasks. load_balance should better qualify the imbalance of the group and clearly define what has to be moved to fix this imbalance. The type of sched_group has been extended to better reflect the type of imbalance. We now have : group_has_spare group_fully_busy group_misfit_task group_asym_packing group_imbalanced group_overloaded Based on the type of sched_group, load_balance now sets what it wants to move in order to fix the imbalance. It can be some load as before but also some utilization, a number of task or a type of task: migrate_task migrate_util migrate_load migrate_misfit This new load_balance algorithm fixes several pending wrong tasks placement: - the 1 task per CPU case with asymmetric system - the case of cfs task preempted by other class - the case of tasks not evenly spread on groups with spare capacity Also the load balance decisions have been consolidated in the 3 functions below after removing the few bypasses and hacks of the current code: - update_sd_pick_busiest() select the busiest sched_group. - find_busiest_group() checks if there is an imbalance between local and busiest group. - calculate_imbalance() decides what have to be moved. Finally, the now unused field total_running of struct sd_lb_stats has been removed. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 611 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 402 insertions(+), 209 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e004841..5ae5281 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7068,11 +7068,26 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; +/* + * group_type describes the group of CPUs at the moment of the load balance. 
+ * The enum is ordered by pulling priority, with the group with lowest priority + * first so the groupe_type can be simply compared when selecting the busiest + * group. see update_sd_pick_busiest(). + */ enum group_type { - group_other = 0, + group_has_spare = 0, + group_fully_busy, group_misfit_task, + group_asym_packing, group_imbalanced, - group_overloaded, + group_overloaded +}; + +enum migration_type { + migrate_load = 0, + migrate_util, + migrate_task, + migrate_misfit }; #define LBF_ALL_PINNED 0x01 @@ -7105,7 +7120,7 @@ struct lb_env { unsigned int loop_max; enum fbq_type fbq_type; - enum group_type src_grp_type; + enum migration_type migration_type; struct list_head tasks; }; @@ -7328,7 +7343,7 @@ static struct task_struct *detach_one_task(struct lb_env *env) static const unsigned int sched_nr_migrate_break = 32; /* - * detach_tasks() -- tries to detach up to imbalance runnable load from + * detach_tasks() -- tries to detach up to imbalance load/util/tasks from * busiest_rq, as part of a balancing operation within domain "sd". * * Returns number of detached tasks if successful and 0 otherwise. @@ -7336,8 +7351,8 @@ static const unsigned int sched_nr_migrate_break = 32; static int detach_tasks(struct lb_env *env) { struct list_head *tasks = &env->src_rq->cfs_tasks; + unsigned long util, load; struct task_struct *p; - unsigned long load; int detached = 0; lockdep_assert_held(&env->src_rq->lock); @@ -7370,19 +7385,51 @@ static int detach_tasks(struct lb_env *env) if (!can_migrate_task(p, env)) goto next; - load = task_h_load(p); + switch (env->migration_type) { + case migrate_load: + load = task_h_load(p); - if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed) - goto next; + if (sched_feat(LB_MIN) && + load < 16 && !env->sd->nr_balance_failed) + goto next; - if ((load / 2) > env->imbalance) - goto next; + if ((load / 2) > env->imbalance) + goto next; + + env->imbalance -= load; + break; + + case migrate_util: + util = task_util_est(p); + + if (util > env->imbalance) + goto next; + + env->imbalance -= util; + break; + + case migrate_task: + env->imbalance--; + break; + + case migrate_misfit: + load = task_h_load(p); + + /* + * load of misfit task might decrease a bit since it has + * been recorded. Be conservative in the condition. + */ + if (load / 2 < env->imbalance) + goto next; + + env->imbalance = 0; + break; + } detach_task(p, env); list_add(&p->se.group_node, &env->tasks); detached++; - env->imbalance -= load; #ifdef CONFIG_PREEMPTION /* @@ -7396,7 +7443,7 @@ static int detach_tasks(struct lb_env *env) /* * We only want to steal up to the prescribed amount of - * runnable load. + * load/util/tasks. 
*/ if (env->imbalance <= 0) break; @@ -7661,7 +7708,6 @@ struct sg_lb_stats { unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_no_capacity; unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING @@ -7677,10 +7723,10 @@ struct sg_lb_stats { struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ - unsigned long total_running; unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */ + unsigned int prefer_sibling; /* tasks should go to sibling first */ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ struct sg_lb_stats local_stat; /* Statistics of the local group */ @@ -7691,19 +7737,18 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) /* * Skimp on the clearing to avoid duplicate work. We can avoid clearing * local_stat because update_sg_lb_stats() does a full clear/assignment. - * We must however clear busiest_stat::avg_load because - * update_sd_pick_busiest() reads this before assignment. + * We must however set busiest_stat::group_type and + * busiest_stat::idle_cpus to the worst busiest group because + * update_sd_pick_busiest() reads these before assignment. */ *sds = (struct sd_lb_stats){ .busiest = NULL, .local = NULL, - .total_running = 0UL, .total_load = 0UL, .total_capacity = 0UL, .busiest_stat = { - .avg_load = 0UL, - .sum_h_nr_running = 0, - .group_type = group_other, + .idle_cpus = UINT_MAX, + .group_type = group_has_spare, }, }; } @@ -7945,19 +7990,26 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) } static inline enum -group_type group_classify(struct sched_group *group, +group_type group_classify(struct lb_env *env, + struct sched_group *group, struct sg_lb_stats *sgs) { - if (sgs->group_no_capacity) + if (group_is_overloaded(env, sgs)) return group_overloaded; if (sg_imbalanced(group)) return group_imbalanced; + if (sgs->group_asym_packing) + return group_asym_packing; + if (sgs->group_misfit_task_load) return group_misfit_task; - return group_other; + if (!group_has_capacity(env, sgs)) + return group_fully_busy; + + return group_has_spare; } static bool update_nohz_stats(struct rq *rq, bool force) @@ -7994,10 +8046,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct sg_lb_stats *sgs, int *sg_status) { - int i, nr_running; + int i, nr_running, local_group; memset(sgs, 0, sizeof(*sgs)); + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); + for_each_cpu_and(i, sched_group_span(group), env->cpus) { struct rq *rq = cpu_rq(i); @@ -8022,9 +8076,16 @@ static inline void update_sg_lb_stats(struct lb_env *env, /* * No need to call idle_cpu() if nr_running is not 0 */ - if (!nr_running && idle_cpu(i)) + if (!nr_running && idle_cpu(i)) { sgs->idle_cpus++; + /* Idle cpu can't have misfit task */ + continue; + } + + if (local_group) + continue; + /* Check for a misfit task on the cpu */ if (env->sd->flags & SD_ASYM_CPUCAPACITY && sgs->group_misfit_task_load < rq->misfit_task_load) { sgs->group_misfit_task_load = rq->misfit_task_load; @@ -8032,14 +8093,24 @@ static inline void update_sg_lb_stats(struct lb_env *env, } } - /* Adjust by relative CPU capacity of the group */ + /* Check 
if dst cpu is idle and preferred to this group */ + if (env->sd->flags & SD_ASYM_PACKING && + env->idle != CPU_NOT_IDLE && + sgs->sum_h_nr_running && + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { + sgs->group_asym_packing = 1; + } + sgs->group_capacity = group->sgc->capacity; - sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; sgs->group_weight = group->group_weight; - sgs->group_no_capacity = group_is_overloaded(env, sgs); - sgs->group_type = group_classify(group, sgs); + sgs->group_type = group_classify(env, group, sgs); + + /* Computing avg_load makes sense only when group is overloaded */ + if (sgs->group_type == group_overloaded) + sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / + sgs->group_capacity; } /** @@ -8062,6 +8133,10 @@ static bool update_sd_pick_busiest(struct lb_env *env, { struct sg_lb_stats *busiest = &sds->busiest_stat; + /* Make sure that there is at least one task to pull */ + if (!sgs->sum_h_nr_running) + return false; + /* * Don't try to pull misfit tasks we can't help. * We can use max_capacity here as reduction in capacity on some @@ -8070,7 +8145,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, */ if (sgs->group_type == group_misfit_task && (!group_smaller_max_cpu_capacity(sg, sds->local) || - !group_has_capacity(env, &sds->local_stat))) + sds->local_stat.group_type != group_has_spare)) return false; if (sgs->group_type > busiest->group_type) @@ -8079,62 +8154,80 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->group_type < busiest->group_type) return false; - if (sgs->avg_load <= busiest->avg_load) - return false; - - if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) - goto asym_packing; - /* - * Candidate sg has no more than one task per CPU and - * has higher per-CPU capacity. Migrating tasks to less - * capable CPUs may harm throughput. Maximize throughput, - * power/energy consequences are not considered. + * The candidate and the current busiest group are the same type of + * group. Let check which one is the busiest according to the type. */ - if (sgs->sum_h_nr_running <= sgs->group_weight && - group_smaller_min_cpu_capacity(sds->local, sg)) - return false; - /* - * If we have more than one misfit sg go with the biggest misfit. - */ - if (sgs->group_type == group_misfit_task && - sgs->group_misfit_task_load < busiest->group_misfit_task_load) + switch (sgs->group_type) { + case group_overloaded: + /* Select the overloaded group with highest avg_load. */ + if (sgs->avg_load <= busiest->avg_load) + return false; + break; + + case group_imbalanced: + /* + * Select the 1st imbalanced group as we don't have any way to + * choose one more than another. + */ return false; -asym_packing: - /* This is the busiest node in its class. */ - if (!(env->sd->flags & SD_ASYM_PACKING)) - return true; + case group_asym_packing: + /* Prefer to move from lowest priority CPU's work */ + if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) + return false; + break; - /* No ASYM_PACKING if target CPU is already busy */ - if (env->idle == CPU_NOT_IDLE) - return true; - /* - * ASYM_PACKING needs to move all the work to the highest - * prority CPUs in the group, therefore mark all groups - * of lower priority than ourself as busy. - * - * This is primarily intended to used at the sibling level. Some - * cores like POWER7 prefer to use lower numbered SMT threads. In the - * case of POWER7, it can move to lower SMT modes only when higher - * threads are idle. 
When in lower SMT modes, the threads will - * perform better since they share less core resources. Hence when we - * have idle threads, we want them to be the higher ones. - */ - if (sgs->sum_h_nr_running && - sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { - sgs->group_asym_packing = 1; - if (!sds->busiest) - return true; + case group_misfit_task: + /* + * If we have more than one misfit sg go with the biggest + * misfit. + */ + if (sgs->group_misfit_task_load < busiest->group_misfit_task_load) + return false; + break; - /* Prefer to move from lowest priority CPU's work */ - if (sched_asym_prefer(sds->busiest->asym_prefer_cpu, - sg->asym_prefer_cpu)) - return true; + case group_fully_busy: + /* + * Select the fully busy group with highest avg_load. In + * theory, there is no need to pull task from such kind of + * group because tasks have all compute capacity that they need + * but we can still improve the overall throughput by reducing + * contention when accessing shared HW resources. + * + * XXX for now avg_load is not computed and always 0 so we + * select the 1st one. + */ + if (sgs->avg_load <= busiest->avg_load) + return false; + break; + + case group_has_spare: + /* + * Select not overloaded group with lowest number of + * idle cpus. We could also compare the spare capacity + * which is more stable but it can end up that the + * group has less spare capacity but finally more idle + * cpus which means less opportunity to pull tasks. + */ + if (sgs->idle_cpus >= busiest->idle_cpus) + return false; + break; } - return false; + /* + * Candidate sg has no more than one task per CPU and has higher + * per-CPU capacity. Migrating tasks to less capable CPUs may harm + * throughput. Maximize throughput, power/energy consequences are not + * considered. + */ + if ((env->sd->flags & SD_ASYM_CPUCAPACITY) && + (sgs->group_type <= group_fully_busy) && + (group_smaller_min_cpu_capacity(sds->local, sg))) + return false; + + return true; } #ifdef CONFIG_NUMA_BALANCING @@ -8172,13 +8265,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) * @env: The load balancing environment. * @sds: variable to hold the statistics for this sched_domain. */ + static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) { struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; - bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; int sg_status = 0; #ifdef CONFIG_NO_HZ_COMMON @@ -8205,22 +8298,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (local_group) goto next_group; - /* - * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity so that we'll try - * and move all the excess tasks away. We lower the capacity - * of a group only if the local group has the capacity to fit - * these excess tasks. The extra check prevents the case where - * you always pull from the heaviest group when it is already - * under-utilized (possible with a large weight task outweighs - * the tasks on the system). 
- */ - if (prefer_sibling && sds->local && - group_has_capacity(env, local) && - (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { - sgs->group_no_capacity = 1; - sgs->group_type = group_classify(sg, sgs); - } if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -8229,13 +8306,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd next_group: /* Now, start updating sd_lb_stats */ - sds->total_running += sgs->sum_h_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; sg = sg->next; } while (sg != env->sd->groups); + /* Tag domain that child domain prefers tasks go to siblings first */ + sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; + #ifdef CONFIG_NO_HZ_COMMON if ((env->flags & LBF_NOHZ_AGAIN) && cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) { @@ -8273,69 +8352,149 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd */ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { - unsigned long max_pull, load_above_capacity = ~0UL; struct sg_lb_stats *local, *busiest; local = &sds->local_stat; busiest = &sds->busiest_stat; - if (busiest->group_asym_packing) { - env->imbalance = busiest->group_load; + if (busiest->group_type == group_misfit_task) { + /* Set imbalance to allow misfit task to be balanced. */ + env->migration_type = migrate_misfit; + env->imbalance = busiest->group_misfit_task_load; + return; + } + + if (busiest->group_type == group_asym_packing) { + /* + * In case of asym capacity, we will try to migrate all load to + * the preferred CPU. + */ + env->migration_type = migrate_task; + env->imbalance = busiest->sum_h_nr_running; + return; + } + + if (busiest->group_type == group_imbalanced) { + /* + * In the group_imb case we cannot rely on group-wide averages + * to ensure CPU-load equilibrium, try to move any task to fix + * the imbalance. The next load balance will take care of + * balancing back the system. + */ + env->migration_type = migrate_task; + env->imbalance = 1; return; } /* - * Avg load of busiest sg can be less and avg load of local sg can - * be greater than avg load across all sgs of sd because avg load - * factors in sg capacity and sgs with smaller group_type are - * skipped when updating the busiest sg: + * Try to use spare capacity of local group without overloading it or + * emptying busiest */ - if (busiest->group_type != group_misfit_task && - (busiest->avg_load <= sds->avg_load || - local->avg_load >= sds->avg_load)) { - env->imbalance = 0; + if (local->group_type == group_has_spare) { + if (busiest->group_type > group_fully_busy) { + /* + * If busiest is overloaded, try to fill spare + * capacity. This might end up creating spare capacity + * in busiest or busiest still being overloaded but + * there is no simple way to directly compute the + * amount of load to migrate in order to balance the + * system. + */ + env->migration_type = migrate_util; + env->imbalance = max(local->group_capacity, local->group_util) - + local->group_util; + + /* + * In some case, the group's utilization is max or even + * higher than capacity because of migrations but the + * local CPU is (newly) idle. There is at least one + * waiting task in this overloaded busiest group. Let + * try to pull it. 
+ */ + if (env->idle != CPU_NOT_IDLE && env->imbalance == 0) { + env->migration_type = migrate_task; + env->imbalance = 1; + } + + return; + } + + if (busiest->group_weight == 1 || sds->prefer_sibling) { + unsigned int nr_diff = busiest->sum_h_nr_running; + /* + * When prefer sibling, evenly spread running tasks on + * groups. + */ + env->migration_type = migrate_task; + lsub_positive(&nr_diff, local->sum_h_nr_running); + env->imbalance = nr_diff >> 1; + return; + } + + /* + * If there is no overload, we just want to even the number of + * idle cpus. + */ + env->migration_type = migrate_task; + env->imbalance = max_t(long, 0, (local->idle_cpus - + busiest->idle_cpus) >> 1); return; } /* - * If there aren't any idle CPUs, avoid creating some. + * Local is fully busy but has to take more load to relieve the + * busiest group */ - if (busiest->group_type == group_overloaded && - local->group_type == group_overloaded) { - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE; - if (load_above_capacity > busiest->group_capacity) { - load_above_capacity -= busiest->group_capacity; - load_above_capacity *= scale_load_down(NICE_0_LOAD); - load_above_capacity /= busiest->group_capacity; - } else - load_above_capacity = ~0UL; + if (local->group_type < group_overloaded) { + /* + * Local will become overloaded so the avg_load metrics are + * finally needed. + */ + + local->avg_load = (local->group_load * SCHED_CAPACITY_SCALE) / + local->group_capacity; + + sds->avg_load = (sds->total_load * SCHED_CAPACITY_SCALE) / + sds->total_capacity; } /* - * We're trying to get all the CPUs to the average_load, so we don't - * want to push ourselves above the average load, nor do we wish to - * reduce the max loaded CPU below the average load. At the same time, - * we also don't want to reduce the group load below the group - * capacity. Thus we look for the minimum possible imbalance. + * Both group are or will become overloaded and we're trying to get all + * the CPUs to the average_load, so we don't want to push ourselves + * above the average load, nor do we wish to reduce the max loaded CPU + * below the average load. At the same time, we also don't want to + * reduce the group load below the group capacity. Thus we look for + * the minimum possible imbalance. */ - max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity); - - /* How much load to actually move to equalise the imbalance */ + env->migration_type = migrate_load; env->imbalance = min( - max_pull * busiest->group_capacity, + (busiest->avg_load - sds->avg_load) * busiest->group_capacity, (sds->avg_load - local->avg_load) * local->group_capacity ) / SCHED_CAPACITY_SCALE; - - /* Boost imbalance to allow misfit task to be balanced. */ - if (busiest->group_type == group_misfit_task) { - env->imbalance = max_t(long, env->imbalance, - busiest->group_misfit_task_load); - } - } /******* find_busiest_group() helpers end here *********************/ +/* + * Decision matrix according to the local and busiest group type + * + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded + * has_spare nr_idle balanced N/A N/A balanced balanced + * fully_busy nr_idle nr_idle N/A N/A balanced balanced + * misfit_task force N/A N/A N/A force force + * asym_packing force force N/A N/A force force + * imbalanced force force N/A N/A force force + * overloaded force force N/A N/A force avg_load + * + * N/A : Not Applicable because already filtered while updating + * statistics. 
+ * balanced : The system is balanced for these 2 groups. + * force : Calculate the imbalance as load migration is probably needed. + * avg_load : Only if imbalance is significant enough. + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite + * different in groups. + */ + /** * find_busiest_group - Returns the busiest group within the sched_domain * if there is an imbalance. @@ -8370,17 +8529,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat; - /* ASYM feature bypasses nice load balance check */ - if (busiest->group_asym_packing) - goto force_balance; - /* There is no busy sibling group to pull tasks from */ - if (!sds.busiest || busiest->sum_h_nr_running == 0) + if (!sds.busiest) goto out_balanced; - /* XXX broken for overlapping NUMA groups */ - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) - / sds.total_capacity; + /* Misfit tasks should be dealt with regardless of the avg load */ + if (busiest->group_type == group_misfit_task) + goto force_balance; + + /* ASYM feature bypasses nice load balance check */ + if (busiest->group_type == group_asym_packing) + goto force_balance; /* * If the busiest group is imbalanced the below checks don't @@ -8391,55 +8550,64 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance; /* - * When dst_cpu is idle, prevent SMP nice and/or asymmetric group - * capacities from resulting in underutilization due to avg_load. - */ - if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) && - busiest->group_no_capacity) - goto force_balance; - - /* Misfit tasks should be dealt with regardless of the avg load */ - if (busiest->group_type == group_misfit_task) - goto force_balance; - - /* * If the local group is busier than the selected busiest group * don't try and pull any tasks. */ - if (local->avg_load >= busiest->avg_load) + if (local->group_type > busiest->group_type) goto out_balanced; /* - * Don't pull any tasks if this group is already above the domain - * average load. + * When groups are overloaded, use the avg_load to ensure fairness + * between tasks. */ - if (local->avg_load >= sds.avg_load) - goto out_balanced; + if (local->group_type == group_overloaded) { + /* + * If the local group is more loaded than the selected + * busiest group don't try and pull any tasks. + */ + if (local->avg_load >= busiest->avg_load) + goto out_balanced; + + /* XXX broken for overlapping NUMA groups */ + sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) / + sds.total_capacity; - if (env->idle == CPU_IDLE) { /* - * This CPU is idle. If the busiest group is not overloaded - * and there is no imbalance between this and busiest group - * wrt idle CPUs, it is balanced. The imbalance becomes - * significant if the diff is greater than 1 otherwise we - * might end up to just move the imbalance on another group + * Don't pull any tasks if this group is already above the + * domain average load. */ - if ((busiest->group_type != group_overloaded) && - (local->idle_cpus <= (busiest->idle_cpus + 1))) + if (local->avg_load >= sds.avg_load) goto out_balanced; - } else { + /* - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use - * imbalance_pct to be conservative. + * If the busiest group is more loaded, use imbalance_pct to be + * conservative. 
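As a rough standalone sketch of the imbalance_pct guard applied just below (the value 117 is assumed here as the usual default for non-SMT domains and is not stated in this patch), the check reads as: the busiest group must carry more than imbalance_pct percent of the local group's average load before balancing is attempted:

#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative version of:
 *     100 * busiest->avg_load <= imbalance_pct * local->avg_load
 * With an assumed imbalance_pct of 117, busiest must be more than
 * 17% heavier than local, on average load, to be worth balancing.
 */
static bool worth_balancing(unsigned long busiest_avg_load,
			    unsigned long local_avg_load,
			    unsigned int imbalance_pct)
{
	return 100 * busiest_avg_load > imbalance_pct * local_avg_load;
}

int main(void)
{
	/* 15% heavier than local: stays "balanced" with a 117 threshold */
	printf("%d\n", worth_balancing(1150, 1000, 117));
	/* 20% heavier than local: clears the threshold */
	printf("%d\n", worth_balancing(1200, 1000, 117));

	return 0;
}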
*/ if (100 * busiest->avg_load <= env->sd->imbalance_pct * local->avg_load) goto out_balanced; } + /* Try to move all excess tasks to child's sibling domain */ + if (sds.prefer_sibling && local->group_type == group_has_spare && + busiest->sum_h_nr_running > local->sum_h_nr_running + 1) + goto force_balance; + + if (busiest->group_type != group_overloaded && + (env->idle == CPU_NOT_IDLE || + local->idle_cpus <= (busiest->idle_cpus + 1))) + /* + * If the busiest group is not overloaded + * and there is no imbalance between this and busiest group + * wrt idle CPUs, it is balanced. The imbalance + * becomes significant if the diff is greater than 1 otherwise + * we might end up to just move the imbalance on another + * group. + */ + goto out_balanced; + force_balance: /* Looks like there is an imbalance. Compute it */ - env->src_grp_type = busiest->group_type; calculate_imbalance(env, &sds); return env->imbalance ? sds.busiest : NULL; @@ -8455,11 +8623,13 @@ static struct rq *find_busiest_queue(struct lb_env *env, struct sched_group *group) { struct rq *busiest = NULL, *rq; - unsigned long busiest_load = 0, busiest_capacity = 1; + unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1; + unsigned int busiest_nr = 0; int i; for_each_cpu_and(i, sched_group_span(group), env->cpus) { - unsigned long capacity, load; + unsigned long capacity, load, util; + unsigned int nr_running; enum fbq_type rt; rq = cpu_rq(i); @@ -8487,20 +8657,8 @@ static struct rq *find_busiest_queue(struct lb_env *env, if (rt > env->fbq_type) continue; - /* - * For ASYM_CPUCAPACITY domains with misfit tasks we simply - * seek the "biggest" misfit task. - */ - if (env->src_grp_type == group_misfit_task) { - if (rq->misfit_task_load > busiest_load) { - busiest_load = rq->misfit_task_load; - busiest = rq; - } - - continue; - } - capacity = capacity_of(i); + nr_running = rq->cfs.h_nr_running; /* * For ASYM_CPUCAPACITY domains, don't pick a CPU that could @@ -8510,35 +8668,70 @@ static struct rq *find_busiest_queue(struct lb_env *env, */ if (env->sd->flags & SD_ASYM_CPUCAPACITY && capacity_of(env->dst_cpu) < capacity && - rq->nr_running == 1) + nr_running == 1) continue; - load = cpu_runnable_load(rq); + switch (env->migration_type) { + case migrate_load: + /* + * When comparing with load imbalance, use + * cpu_runnable_load() which is not scaled with the CPU + * capacity. + */ + load = cpu_runnable_load(rq); - /* - * When comparing with imbalance, use cpu_runnable_load() - * which is not scaled with the CPU capacity. - */ + if (nr_running == 1 && load > env->imbalance && + !check_cpu_capacity(rq, env->sd)) + break; - if (rq->nr_running == 1 && load > env->imbalance && - !check_cpu_capacity(rq, env->sd)) - continue; + /* + * For the load comparisons with the other CPU's, + * consider the cpu_runnable_load() scaled with the CPU + * capacity, so that the load can be moved away from + * the CPU that is potentially running at a lower + * capacity. + * + * Thus we're looking for max(load_i / capacity_i), + * crosswise multiplication to rid ourselves of the + * division works out to: + * load_i * capacity_j > load_j * capacity_i; + * where j is our previous maximum. 
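The cross-multiplication trick spelled out in the comment above can be exercised with a small standalone program; the runqueue samples are invented and only the comparison mirrors the code:

#include <stdio.h>

/*
 * Find max(load_i / capacity_i) without a division: keep the current
 * maximum as a (load, capacity) pair and test
 *     load_i * busiest_capacity > busiest_load * capacity_i
 * exactly as described above.
 */
struct cpu_sample { unsigned long load, capacity; };

int main(void)
{
	struct cpu_sample rq[] = {
		{ .load = 800,  .capacity = 1024 },
		{ .load = 300,  .capacity = 256  },	/* highest load/capacity ratio */
		{ .load = 1000, .capacity = 1024 },
	};
	unsigned long busiest_load = 0, busiest_capacity = 1;
	int i, busiest = -1;

	for (i = 0; i < 3; i++) {
		if (rq[i].load * busiest_capacity > busiest_load * rq[i].capacity) {
			busiest_load = rq[i].load;
			busiest_capacity = rq[i].capacity;
			busiest = i;
		}
	}

	printf("busiest CPU index: %d\n", busiest);	/* prints 1 */

	return 0;
}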
+ */ + if (load * busiest_capacity > busiest_load * capacity) { + busiest_load = load; + busiest_capacity = capacity; + busiest = rq; + } + break; + + case migrate_util: + util = cpu_util(cpu_of(rq)); + + if (busiest_util < util) { + busiest_util = util; + busiest = rq; + } + break; + + case migrate_task: + if (busiest_nr < nr_running) { + busiest_nr = nr_running; + busiest = rq; + } + break; + + case migrate_misfit: + /* + * For ASYM_CPUCAPACITY domains with misfit tasks we + * simply seek the "biggest" misfit task. + */ + if (rq->misfit_task_load > busiest_load) { + busiest_load = rq->misfit_task_load; + busiest = rq; + } + + break; - /* - * For the load comparisons with the other CPU's, consider - * the cpu_runnable_load() scaled with the CPU capacity, so - * that the load can be moved away from the CPU that is - * potentially running at a lower capacity. - * - * Thus we're looking for max(load_i / capacity_i), crosswise - * multiplication to rid ourselves of the division works out - * to: load_i * capacity_j > load_j * capacity_i; where j is - * our previous maximum. - */ - if (load * busiest_capacity > busiest_load * capacity) { - busiest_load = load; - busiest_capacity = capacity; - busiest = rq; } } @@ -8584,7 +8777,7 @@ voluntary_active_balance(struct lb_env *env) return 1; } - if (env->src_grp_type == group_misfit_task) + if (env->migration_type == migrate_misfit) return 1; return 0; From patchwork Fri Oct 18 13:26:32 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176833 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861431ill; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) X-Google-Smtp-Source: APXvYqzhpqJCcxPl7M3jqkMU+FegnTUzuXTnVYGUU4dpE1hXcbecMn+MeR8XT670ES5Y/Gnf7oCM X-Received: by 2002:a17:906:3e50:: with SMTP id t16mr8717856eji.177.1571405219240; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405219; cv=none; d=google.com; s=arc-20160816; b=wN0vNVG8Z/WZoUfl/my9IO4Anyq65i8LLcwjGI+QTZ8+9pqr/9Qo92G4J6ovmDYchb 9dPVm6pWDRVTo3AUEHx/cULVAUVkgNoUYBj7hVEmGB4X+tiiB0Sa4/ESPEgkhVvPhxFI NK8WBJQnL5iYCdhaQBMrqv8z7iV7fkIGj4mH0D08EYNDgSNtaoLwlQ0+QtKZuVuHjkXB rP/ndIWr8ah3l0lvyhQVclSOo0MfSdBju01D+G15kAs8uLVGdzi3B+IrvY7y8kOeKym8 50w9oCA3/Gt6uDNo2jyt+Lxx65C9RGK5LMywh3/rA9R3F+huLx6FZSpBxnkig7Bw6/ez /oOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=ACgHRAYDPePD5iyvNwVwA2/U+1dhCaKONZj4VM6vt/4=; b=VGLCtMA7LVSYNcLtdSysb+d3qj/Plvh99VA9UNPQHsYJF132t9rAPAJHPbiSCghKTl tg2PkMAoG8ZEBB+Nf2kCP4BY8PRyPYTzfTKAPcojOevNA4qxkZvwMevUW5w47ny90Us2 Y7CfGAAa8PSRVr1POWxJPUzOKllS1Z86EcSRbH7r27DG6+uCxXA/EAT4nQw/QJhmK7Lp tPstKv23+hGhhIUBlYcfYAowG8SIplX6Sbh5ZJinIo3iHrzqraP2sCk5iLK+cQ2NIgaL /nlLI1DBH5IOWxHMON6YnlIdEF9dop5PDVHqkoF2zfwFngU8n+bQPktbGSv7pLKH7PBp N6jQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=DYFQYDZv; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id q4si3371475eji.152.2019.10.18.06.26.59; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=DYFQYDZv; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2410312AbfJRN04 (ORCPT + 26 others); Fri, 18 Oct 2019 09:26:56 -0400 Received: from mail-wm1-f68.google.com ([209.85.128.68]:54991 "EHLO mail-wm1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2410303AbfJRN0y (ORCPT ); Fri, 18 Oct 2019 09:26:54 -0400 Received: by mail-wm1-f68.google.com with SMTP id p7so6192727wmp.4 for ; Fri, 18 Oct 2019 06:26:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=ACgHRAYDPePD5iyvNwVwA2/U+1dhCaKONZj4VM6vt/4=; b=DYFQYDZvBUzF7NyIwx1hcaaajr/pskthAizC7tbjYaD3UxnadPZvspq6RYLl5JDuQV lEKI6PkORzSPdHHl36BLJS1oBce2zZRRSuXXLSBZjx6XW4AJUgpDve/Kv6JFayHEN0jw XQNRSq+JsGYsfTLa5Fj2czaP7Vyqdmt4+bXOhYWdLe4Cs7TXEWqfq6kliKchFPHDPb7v R13iliv+7RV44FvjcF3/GVZ+aFGIiQyN/t+BuAED+EJZIFPUTR8sTrxqlwJmb5JhDyGd 2hITiLWee/Ztfvq41wfS0NSoBlallPH6QnLMIozQ+HyxHqmYFE/5IYKI91fn2sYsdNIE MoRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=ACgHRAYDPePD5iyvNwVwA2/U+1dhCaKONZj4VM6vt/4=; b=PkRAW0jdRNZn1vVubR0wUmtUEQEtTtxyKZ6nHLABRBh4fKKmFQKxkE7Ugy0dAhIqR8 9S2gxtiaLduomyuBylkxZmaKryfMjiBI3rxzNfj09nMKseZI5tJSXyr7SBzBG2f3mhol d+7wjMcGMfJdZwA2ZcaVSSU5/EkCv6/lgs8nEqVtW76Br+r6/RCfExc7T9UkmHT77M65 FZMB8LqqZi6idsIIYI8vu/eHGeRMtfmOSQw+2CWSWWIbHt4uC8UQnFSDqGoZaYt6OJ+t oeA+cNKIutS8ubswjvbbtdd1fnLtCHIwCf2HLyg7NEpTu8vR711ZbXzn5EeoiUvBfQ6m p9FQ== X-Gm-Message-State: APjAAAWiAfvdLFSn3w6fBe9AE/KgrJBO+GeiyTHzfBLFTrKvjw4vbmAL BrO+jAI7nKBvLeTMYAn2D/l1K8DMbX8= X-Received: by 2002:a1c:1bc5:: with SMTP id b188mr8035064wmb.88.1571405212870; Fri, 18 Oct 2019 06:26:52 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.51 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:51 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 05/11] sched/fair: use rq->nr_running when balancing load Date: Fri, 18 Oct 2019 15:26:32 +0200 Message-Id: <1571405198-27570-6-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org cfs load_balance only takes care of CFS tasks whereas CPUs can be used by other scheduling class. Typically, a CFS task preempted by a RT or deadline task will not get a chance to be pulled on another CPU because the load_balance doesn't take into account tasks from other classes. Add sum of nr_running in the statistics and use it to detect such situation. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 5ae5281..e09fe12b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7704,6 +7704,7 @@ struct sg_lb_stats { unsigned long group_load; /* Total load over the CPUs of the group */ unsigned long group_capacity; unsigned long group_util; /* Total utilization of the group */ + unsigned int sum_nr_running; /* Nr of tasks running in the group */ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */ unsigned int idle_cpus; unsigned int group_weight; @@ -7938,7 +7939,7 @@ static inline int sg_imbalanced(struct sched_group *group) static inline bool group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_h_nr_running < sgs->group_weight) + if (sgs->sum_nr_running < sgs->group_weight) return true; if ((sgs->group_capacity * 100) > @@ -7959,7 +7960,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) static inline bool group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_h_nr_running <= sgs->group_weight) + if (sgs->sum_nr_running <= sgs->group_weight) return false; if ((sgs->group_capacity * 100) < @@ -8063,6 +8064,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->sum_h_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; + sgs->sum_nr_running += nr_running; + if (nr_running > 1) *sg_status |= SG_OVERLOAD; @@ -8420,13 +8423,13 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s } if (busiest->group_weight == 1 || sds->prefer_sibling) { - unsigned int nr_diff = busiest->sum_h_nr_running; + unsigned int nr_diff = busiest->sum_nr_running; /* * When prefer sibling, evenly spread running tasks on * groups. 
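As a hedged aside on why the patch switches from sum_h_nr_running to sum_nr_running, the toy model below shows the effect; the types and numbers are invented, and the real group_is_overloaded() additionally weighs capacity against utilization:

#include <stdbool.h>
#include <stdio.h>

/*
 * nr_running counts every runnable task (CFS + RT + DL) while
 * h_nr_running counts CFS tasks only. A CFS task preempted by an RT
 * hog is therefore invisible to a group_weight comparison based on
 * h_nr_running, but visible once all classes are counted.
 */
struct toy_rq { unsigned int nr_running, cfs_h_nr_running; };

static bool has_more_tasks_than_cpus(unsigned int sum_nr_running,
				     unsigned int group_weight)
{
	return sum_nr_running > group_weight;
}

int main(void)
{
	/* single-CPU group: one RT hog plus one starved CFS task */
	struct toy_rq rq = { .nr_running = 2, .cfs_h_nr_running = 1 };
	unsigned int group_weight = 1;

	printf("counting CFS only:    overloaded=%d\n",
	       has_more_tasks_than_cpus(rq.cfs_h_nr_running, group_weight));
	printf("counting all classes: overloaded=%d\n",
	       has_more_tasks_than_cpus(rq.nr_running, group_weight));

	return 0;
}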
*/ env->migration_type = migrate_task; - lsub_positive(&nr_diff, local->sum_h_nr_running); + lsub_positive(&nr_diff, local->sum_nr_running); env->imbalance = nr_diff >> 1; return; } @@ -8590,7 +8593,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) /* Try to move all excess tasks to child's sibling domain */ if (sds.prefer_sibling && local->group_type == group_has_spare && - busiest->sum_h_nr_running > local->sum_h_nr_running + 1) + busiest->sum_nr_running > local->sum_nr_running + 1) goto force_balance; if (busiest->group_type != group_overloaded && From patchwork Fri Oct 18 13:26:33 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176840 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861892ill; Fri, 18 Oct 2019 06:27:21 -0700 (PDT) X-Google-Smtp-Source: APXvYqzPrb6S8H2iic3JdpOF9xujrlIyT4+ZPRWNOT7IwszACop01w2R8jzb3hiGYEIRsIWF+5MM X-Received: by 2002:a50:e40c:: with SMTP id d12mr9401054edm.256.1571405241193; Fri, 18 Oct 2019 06:27:21 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405241; cv=none; d=google.com; s=arc-20160816; b=mi9KNVswU0N/RKWpEcHBQQ6NmItBt5+m1Q7OjzDpk3241bdz4Z+V0KnJfWxyfN6ajw 3ktq27N0fslO9ZRNXQBrDEgx3/qFdAUCBfR/Jksr++9qOE4q+I5zHPFakf1iKvydxI01 UH/PligXeKdRkqS7tPgM0pvfjIfAQMaYA18wpWQX8FDBIvSbmXWSG1GpahhsqxINsihZ AJtGe4CHZX8GrFzmOrxnpdIEwM1HGzqjrqiPH5L0Rr5gGVkHUB5xnUc6KSm8gmhVcIde b7KbNlML/bfEV3OvNGKslasVm+oz9d/auRc3TqAc5JVE7c+a3D8uv3YAxm8bEOywvP/P csyA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=B8mzip3+tA6BeidSYtlNqITD3DSe2ulF+ZWQ0m84V5I=; b=mUsaz8Y8JZ9piFuq0l0UbL3iEcSvualEBJGKYepNGuij4b2kB1a5edHRVEUGxeb8eX Z8MIZYHQ/jlPOZHfKITS8G8iANmjczSLa62YVzQrbi9uXkafm34qhN9te3b61iO6nGdt av7lU1Xx4PmU5scL2U2l+ZEnSPZCOgpeNcL1KOcvyh2ys0ohUWUkQVJGHWfZO3oj4Vl3 9h9yebs2Q1IssuzUfxh8Rgirjx9kwQuDoMqPQmSfqXtfG9Mrs7YMFyAnApTc6golR23F fYM/wKwh00RkTDhbW6R8K509OVLBBggzz7vZaFsVpshUotSP7bPkRfpheczNPBq4LIQC j1oQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=zCufCuWZ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id u27si3475775ejb.172.2019.10.18.06.27.21; Fri, 18 Oct 2019 06:27:21 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=zCufCuWZ; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2634111AbfJRN1R (ORCPT + 26 others); Fri, 18 Oct 2019 09:27:17 -0400 Received: from mail-wm1-f65.google.com ([209.85.128.65]:50538 "EHLO mail-wm1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2410304AbfJRN04 (ORCPT ); Fri, 18 Oct 2019 09:26:56 -0400 Received: by mail-wm1-f65.google.com with SMTP id 5so6213746wmg.0 for ; Fri, 18 Oct 2019 06:26:55 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=B8mzip3+tA6BeidSYtlNqITD3DSe2ulF+ZWQ0m84V5I=; b=zCufCuWZkU+wU1JLp+asFBiOQ0bhsNUSe83udjSDa+1I4WD37Kt+NZ4m96hUm3/5Jb Jr3073a+WBS2tL510iWr/zKwahzlpBpAhfnNB6Cde4OyNbdtc9FmseNsczNReToG4Pfj 0cdDn3rXzUF9oddIsjG48uQ9uP0ULEdpLG8pXv0rp2cKf5i0qJ1v9BsbawcCxQyNVVDs 3LOSV89LrKWe7vXv3klu5cYLdOiS+c2uDqYMWE9exaUfEYZQaLaUm5FJSKhWBlI8Pu43 fuj87npoKTK6jP1UmuA51tv8fTBWCHvfP5+8yuuFXvN3tys6002i7UXuRe8Qm3fmQing 2G1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=B8mzip3+tA6BeidSYtlNqITD3DSe2ulF+ZWQ0m84V5I=; b=sEBOebNnFfqYUeboprbpinq5pQnPiWLMKqGJVeFgoPOOsQK2RfuR3NgzJ3vY6432fE xrQqz0YX/ekscndgq/cGDN5TTIfpFhjUM12X0o1DReUbgCEriew5TGiHEFSeQ+wx99CA RPv+Zkgw0rDxl73QP+VEv6kBszsmRDLEZ5ic+IObyyx42NbBqqZus7EComRN4F+9ms/s hiZyKGVXDL08D0pkprEJj0ZZzOeLfihiYBcOmHNqnazToS7arWlxA9DZ15NVIcKqnRlY c2iEgWpFMYX0jKlHPUNtGR3ofynYnLwWArt0v9FVPog8c7ZOXfxVq9NELwiHkJZZ3zlo VvZg== X-Gm-Message-State: APjAAAUBqLPP6WK6m7Euq8KId/IN9wOMCNqY3lG9GiyTPXVyZP8cBMDF vaLXDuk/buXKWiduQovkbqbuysY5Bjo= X-Received: by 2002:a1c:9d4c:: with SMTP id g73mr8094349wme.92.1571405214681; Fri, 18 Oct 2019 06:26:54 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.52 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:53 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 06/11] sched/fair: use load instead of runnable load in load_balance Date: Fri, 18 Oct 2019 15:26:33 +0200 Message-Id: <1571405198-27570-7-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org runnable load has been introduced to take into account the case where blocked load biases the load balance decision which was selecting underutilized group with huge blocked load whereas other groups were overloaded. The load is now only used when groups are overloaded. In this case, it's worth being conservative and taking into account the sleeping tasks that might wakeup on the cpu. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e09fe12b..9ac2264 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5385,6 +5385,11 @@ static unsigned long cpu_runnable_load(struct rq *rq) return cfs_rq_runnable_load_avg(&rq->cfs); } +static unsigned long cpu_load(struct rq *rq) +{ + return cfs_rq_load_avg(&rq->cfs); +} + static unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; @@ -8059,7 +8064,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false)) env->flags |= LBF_NOHZ_AGAIN; - sgs->group_load += cpu_runnable_load(rq); + sgs->group_load += cpu_load(rq); sgs->group_util += cpu_util(i); sgs->sum_h_nr_running += rq->cfs.h_nr_running; @@ -8517,7 +8522,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) init_sd_lb_stats(&sds); /* - * Compute the various statistics relavent for load balancing at + * Compute the various statistics relevant for load balancing at * this level. */ update_sd_lb_stats(env, &sds); @@ -8677,11 +8682,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, switch (env->migration_type) { case migrate_load: /* - * When comparing with load imbalance, use - * cpu_runnable_load() which is not scaled with the CPU - * capacity. + * When comparing with load imbalance, use cpu_load() + * which is not scaled with the CPU capacity. */ - load = cpu_runnable_load(rq); + load = cpu_load(rq); if (nr_running == 1 && load > env->imbalance && !check_cpu_capacity(rq, env->sd)) @@ -8689,10 +8693,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, /* * For the load comparisons with the other CPU's, - * consider the cpu_runnable_load() scaled with the CPU - * capacity, so that the load can be moved away from - * the CPU that is potentially running at a lower - * capacity. 
+ * consider the cpu_load() scaled with the CPU + * capacity, so that the load can be moved away + * from the CPU that is potentially running at a + * lower capacity. * * Thus we're looking for max(load_i / capacity_i), * crosswise multiplication to rid ourselves of the From patchwork Fri Oct 18 13:26:34 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176835 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861589ill; Fri, 18 Oct 2019 06:27:07 -0700 (PDT) X-Google-Smtp-Source: APXvYqw/hPN0ZttGPsnrThbm7y1K5LVUMMahd0jE90gZLuYcrDYhwhoeZ5ArpEISGtp5qh6X9+Mg X-Received: by 2002:a17:906:309b:: with SMTP id 27mr8630279ejv.243.1571405226901; Fri, 18 Oct 2019 06:27:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405226; cv=none; d=google.com; s=arc-20160816; b=FRCyAFPYGfjYdTUThWIXfwb1wL3dH7y0658UrR4yFr76aRWzWWnV1NgWNk6YoCRVuE LuioBxBE+RUYv1qP6ml3UTiw9s2rQYBVF4Zg+CnJ3KQ5RUkkModa2eWHCRUaOnM9K+j1 ITDSMct9Cq9bhgdJlGJFFpWR8VrIhNqqKNelDjdhkRXLxApXJyLqwKewaYFAA693l92m agRjlUu+2+F1dXwKgwdnooWRtl0sEmvs7hkdh77BHCbBbA4NQsg/1LBTAmiRc0y1n6Ou jHP3lykJz6z9evwJAANNC15kYWyeY4gdi/tE9ikS+g+oEB0N9NXKxXn/SKTFn3azWEQr 0WnA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=U+L99VOS6caLFt+CoaTBRiqHtekpuXrbr4pGYmwWDiE=; b=xbr0blwi8h40Oniggl5JcYm5y6GEYRnaR1OtHmtf4UKMW516XLwlLJoZJtWDlVWwV9 ntih+6uE1WcIRqS4Dp0MYxuI0qaA3HPw/KKtJoI7ZWhwb4Lec4gyg6vo5heJYtFuaTcb uCQrcql8DkWGL1bRtbuegW4YRWGl6chhlKVB+2kNqmpgaYxzFKE5vKSGxC0IPdkTxz5z Zpp2IeoczsLqN1kjUTl0YW9+q6SHX5BZA6EHrSE0BV58o2SXG4YOhKKp1h1Sug3ccp5l zLGlr9lhi65qweqTvIpPF6k8/mcfkk5UDCiLVz24+T6kmO6FLZUlcdRc1+flH0fx8XyI 2+yw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=YexuFZ4f; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id o32si3789756edc.306.2019.10.18.06.27.06; Fri, 18 Oct 2019 06:27:06 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=YexuFZ4f; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2410335AbfJRN1B (ORCPT + 26 others); Fri, 18 Oct 2019 09:27:01 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:56134 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2410292AbfJRN1A (ORCPT ); Fri, 18 Oct 2019 09:27:00 -0400 Received: by mail-wm1-f66.google.com with SMTP id a6so6169546wma.5 for ; Fri, 18 Oct 2019 06:26:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=U+L99VOS6caLFt+CoaTBRiqHtekpuXrbr4pGYmwWDiE=; b=YexuFZ4fjlY6AevXkLYwKDu06ECkTRx8vSIFlXx6hDq+1v+NVL84yLK2F9YZbhhFxG FXfR5Ko8czlD+FeLKLYoY9Ve/rxWNtsqruSpX0mZBJMXp9pe1oiSc8KPR6wy2SJCuJCH wQxXCJh4t7J5njM9BWDMVuGFItJGfZd7xOiUmzwaRnnw9UqkPPJJw8lqPBXbLmzoAlZw AQvBqx+pW3YfnOkARBIxpcZCNQfSq6co36YadtT5BFN09zMaMkqojXMIQRX+Lboka8Fw mfBRYyX6DfVIUqkxTx/bmo/08rM/gz1lGdpa9xv2YPg0j1ANnF0zUAWThZIhrYkXklhH UIcw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=U+L99VOS6caLFt+CoaTBRiqHtekpuXrbr4pGYmwWDiE=; b=JDt6yMQ0j4UONOwZqL384Ghcm3lD+Pqk80jn315ymPy5ZyWBo1laG2IEhJ5f7BqL73 I9oXSNlrujML96/dJEZxBOOU3DmvnWY9vWOdpF5e3heMeSYXSZw7hcwTGYP6cDMjdNBs AP3OEyRCD5b7MpyxVoXHHFEaYZsA4quzSVHClNKe4d4B//uktrKAwJ8AWVhW08Pxkimz yIIvdo996axB0ljEtl1R8fVGQ6W0zwZmhA4YuRCYDumOTSaAn203lDcCsnPWWZUr1/M/ 7h9Oz49wFEb4DgO2qoPu0NrXz0LrcxmPRWTkCks+Mb5PkFIZ1w3hHCo5BWTnlKrPhUqR M6fQ== X-Gm-Message-State: APjAAAWwHni/5VdNRGm07BZPxZd6fCrHrTvHNfzT8bTIws4eDKD3S3QG 18vrHSyePVD1r0wmEu6MmrI30j9W+Dk= X-Received: by 2002:a05:600c:23cc:: with SMTP id p12mr3592277wmb.163.1571405216635; Fri, 18 Oct 2019 06:26:56 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.54 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:55 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 07/11] sched/fair: evenly spread tasks when not overloaded Date: Fri, 18 Oct 2019 15:26:34 +0200 Message-Id: <1571405198-27570-8-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org When there is only 1 cpu per group, using the idle cpus to evenly spread tasks doesn't make sense and nr_running is a better metrics. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9ac2264..9b8e20d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8601,18 +8601,34 @@ static struct sched_group *find_busiest_group(struct lb_env *env) busiest->sum_nr_running > local->sum_nr_running + 1) goto force_balance; - if (busiest->group_type != group_overloaded && - (env->idle == CPU_NOT_IDLE || - local->idle_cpus <= (busiest->idle_cpus + 1))) - /* - * If the busiest group is not overloaded - * and there is no imbalance between this and busiest group - * wrt idle CPUs, it is balanced. The imbalance - * becomes significant if the diff is greater than 1 otherwise - * we might end up to just move the imbalance on another - * group. - */ - goto out_balanced; + if (busiest->group_type != group_overloaded) { + if (env->idle == CPU_NOT_IDLE) + /* + * If the busiest group is not overloaded (and as a + * result the local one too) but this cpu is already + * busy, let another idle cpu try to pull task. + */ + goto out_balanced; + + if (busiest->group_weight > 1 && + local->idle_cpus <= (busiest->idle_cpus + 1)) + /* + * If the busiest group is not overloaded + * and there is no imbalance between this and busiest + * group wrt idle CPUs, it is balanced. The imbalance + * becomes significant if the diff is greater than 1 + * otherwise we might end up to just move the imbalance + * on another group. Of course this applies only if + * there is more than 1 CPU per group. + */ + goto out_balanced; + + if (busiest->sum_h_nr_running == 1) + /* + * busiest doesn't have any tasks waiting to run + */ + goto out_balanced; + } force_balance: /* Looks like there is an imbalance. 
Compute it */ From patchwork Fri Oct 18 13:26:35 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176839 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861882ill; Fri, 18 Oct 2019 06:27:20 -0700 (PDT) X-Google-Smtp-Source: APXvYqxI5PN7KbJqh2IfuuQCtI/JdF8xQDobQtrRmgi8ukqSi9xJSAvMx6oXcN+vT6Op0tarrJ3K X-Received: by 2002:aa7:d687:: with SMTP id d7mr9639129edr.143.1571405240809; Fri, 18 Oct 2019 06:27:20 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405240; cv=none; d=google.com; s=arc-20160816; b=ZQlmTiC0H5wTWbv1WGDh54AqbLi0OmUgJ8uv6M/172y1LsDc2oT/qpuxBhcfzfPJ4d wQ7uInuGhxR6uME0eAgMrhk6eZKp2zKP8Qp3+TttyQuAVEgsSgk9N1kEwEczDSfcIGTJ CIzdQtAROCvqC5m127bDXp7oKaUgzMTRtUNagQwnXSdwNU/JP+xn1mmPvxDajjkCpxXN 5byDvG7z8tfkO8Ro9DW2XW2fd1J+O4T7fHmXSQ/ui6xmaVbgi7Y1IddY+VTi3lNV2M2J szedwxp3S1a3K3fRZDftcd5nQAQyaanH1PB5KxozGtHpxthKk7lpkAYQW/CJzzH6zgt/ YiVQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=+8jnwsi/OmvMHtm1PwUS69bLbdlbxHcNftrW5yRwhLg=; b=Pd935AxYxQ1/l2/XXL2HGuGe/4BjxTncPwAH16jCq9Gea958Su1+Dp2PT5g8l0oXGA VE/fXpzdPI9ZGBM7jrKnxcH/V6I5WOlBufytx+Ns9TBLXPPIzAmDZnUbixcwhF8XIaOt Gkgn2CQV6pQk8KXz6giCqfJvcj8Vpew21tmGzinEQ/wC8XKfS/9oUG19assxHqAdceQS QQYVj89//0wzv5ogIWOuqgCtNhcA/ohDCTcOFU/F7uf1SQjoSV8Dp2QFjPRFrCOpi6xf XwSf5Y+ZfQKhOSZRfMYxniymLIkHq/F5Rmf4tPkJSa5/RUURwRbf+MYjLqmuicLaX08b HjPQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=FjJZTlYA; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id u27si3475775ejb.172.2019.10.18.06.27.20; Fri, 18 Oct 2019 06:27:20 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=FjJZTlYA; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2634100AbfJRN1Q (ORCPT + 26 others); Fri, 18 Oct 2019 09:27:16 -0400 Received: from mail-wr1-f65.google.com ([209.85.221.65]:45104 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2410324AbfJRN1A (ORCPT ); Fri, 18 Oct 2019 09:27:00 -0400 Received: by mail-wr1-f65.google.com with SMTP id q13so1296910wrs.12 for ; Fri, 18 Oct 2019 06:26:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=+8jnwsi/OmvMHtm1PwUS69bLbdlbxHcNftrW5yRwhLg=; b=FjJZTlYA8PpPgWv6KZYWHaxO2XAZZBU9r+ZOPZQwTUDI0H8meXrJbKqMRqJpbPJ1VI iZcv2qeW2BZ4CM3PAxjJUZSdAQnB2vrBsZZsqiAeqCOFvDDLe5aKyhxztqq0WGmaV7Ki F1/solils4dofYwDL1hPx5ZcpkayypL3pbIBN2bJhqV3sn9rlHxyBERi8i27egDPh25y d0V54tpUCmEeafhIzyX025Xpb/vvRzFMsYWgs5mDo02hfJXjUR1mGo963Qz5zFVp143U 2XZguQgVtZZUWRUemHyMQs1EyHMN4aq080amxvz42hv017cexxLGEt71iKZdTvAaec1c OXWQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=+8jnwsi/OmvMHtm1PwUS69bLbdlbxHcNftrW5yRwhLg=; b=jymF2IL14LatddNJKQ4krIoEoksE7e/5tUNG28Sswzh/15xoNQq8jyOWB7mom/P/D7 Owl2TJziWUfl3lWyzJFJIlXwvBnu+F1hIx7OgoWF90x1BUQ/ZU+MmrM6n13sOWyu0tKU brL2h5F626i9Aqaqftr1ZvwHNHnxYQAsp9Ot0s1UcTUXvAmEwkM1KfH9SDJ5y+QI3dm5 qlUp0qiFtNrvOAAOp30MO3OBeR24UvSmGAQAdf5eXLc35rjd/TXe69+HVGuzcU2tpVhn RPQsx3gNLLdX/r1gvEZbbzXnqduf0j/xWexMWWbnqJrD8/3HsKxs7hIT5w7tikTSE6+4 OA0A== X-Gm-Message-State: APjAAAWvaxSoB5GmpLKEDktyak4EN5Bkf/X36POe8iIMqezEPbDdIlka 7d8viuKlMA+p84l86MbcxGnsv+Rs/hY= X-Received: by 2002:a5d:408f:: with SMTP id o15mr7115548wrp.139.1571405218501; Fri, 18 Oct 2019 06:26:58 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.56 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:57 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 08/11] sched/fair: use utilization to select misfit task Date: Fri, 18 Oct 2019 15:26:35 +0200 Message-Id: <1571405198-27570-9-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org utilization is used to detect a misfit task but the load is then used to select the task on the CPU which can lead to select a small task with high weight instead of the task that triggered the misfit migration. Check that task can't fit the CPU's capacity when selecting the misfit task instead of using the load. Signed-off-by: Vincent Guittot Acked-by: Valentin Schneider --- kernel/sched/fair.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9b8e20d..670856d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7418,13 +7418,8 @@ static int detach_tasks(struct lb_env *env) break; case migrate_misfit: - load = task_h_load(p); - - /* - * load of misfit task might decrease a bit since it has - * been recorded. Be conservative in the condition. - */ - if (load / 2 < env->imbalance) + /* This is not a misfit task */ + if (task_fits_capacity(p, capacity_of(env->src_cpu))) goto next; env->imbalance = 0; @@ -8368,7 +8363,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s if (busiest->group_type == group_misfit_task) { /* Set imbalance to allow misfit task to be balanced. 
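A rough standalone sketch of the fits-capacity test this patch switches to for picking the misfit task; the 1280/1024 headroom (utilization has to stay below roughly 80% of the CPU's capacity) is an assumption made for the example and is not taken from the patch text:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy version of "does this task's utilization fit this CPU?".
 * An assumed margin of 1280/1024 is used, i.e. the task fits only if
 * its utilization stays below about 80% of the CPU capacity.
 */
static bool toy_task_fits(unsigned long task_util, unsigned long cpu_capacity)
{
	return task_util * 1280 < cpu_capacity * 1024;
}

int main(void)
{
	unsigned long little_cap = 446, big_cap = 1024;	/* invented capacities */
	unsigned long task_util = 600;			/* invented misfit task */

	printf("fits LITTLE CPU: %d\n", toy_task_fits(task_util, little_cap));
	printf("fits big CPU:    %d\n", toy_task_fits(task_util, big_cap));

	return 0;
}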
*/ env->migration_type = migrate_misfit; - env->imbalance = busiest->group_misfit_task_load; + env->imbalance = 1; return; } From patchwork Fri Oct 18 13:26:36 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176836 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861606ill; Fri, 18 Oct 2019 06:27:07 -0700 (PDT) X-Google-Smtp-Source: APXvYqy+Uu4zUsZQpSuH6HftuBQsehQnNEcW4LY63TinWYgLwjenGdpEGBhPgRYEFh+sU+LfGFDr X-Received: by 2002:aa7:cfd4:: with SMTP id r20mr9595977edy.268.1571405227388; Fri, 18 Oct 2019 06:27:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405227; cv=none; d=google.com; s=arc-20160816; b=HwdyCjM53c0bgHwjQWiQuow1kU5nt8tpV8Za9zWlM0M4SYOOncvZpt051GQIJEyE+P syhVAep5QxD4lKKmKPE7ZpzYRB5u2JN7Xq3uC+Pksb8yLrpmTyst02UVmy2JhRhpdmBa iNBEcgBL6yw7BNjqlDJVuGmB17dRJcd6n5h2DT12ZRmvrzchQMiZZGT6a0INklwAiRx0 9zAuq3+96ORUB4BZFXXSise/W1XdEm/zfrlm7mGuUrb+C74hBZzcop9QL/CS4RQwxd/B H5JOlCIpxG5bpGvtRGua3zcUaOBjL6PCmPnAL1aM8vUaV57GuVm/w+DB72w+A+KBeeru KQpg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=P/1EXP59iD2tSSDxQFHqr6RgR861BrswyMXiJ6kbQ8A=; b=BKJR8I8sbtZHiw7MGdN4AQye9pMTsEF8e9rAdJoY6XmjFn8fsaj5aFA3S8E9B15TVU lLXQk2QyJBgO/aFAT4LJuIqalA4BEaz7bJqWCdgcxMu4dkHWHAQQvrFW9oSAbgs2GAdS Z9X97XJkBQj+ccHb7eqzAlA4SMslsol0DiWiSupySJHHXtOIa3hSLQxi+TMc5y8Tht9U 6fyY5zffuyWDAN3foYH+DaKr2ADxaq/1MxAdwR7IEw/Wm2lspt+yBXb1tp/fIMblItmK ZkKdEHTc8PF9d15kbTxembP1JABhAceR4WkzlHlm3YVjug3NVaPWNegOAiMt5UwFHxhy zNoQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=awr1kDiw; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id o32si3789756edc.306.2019.10.18.06.27.07; Fri, 18 Oct 2019 06:27:07 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=awr1kDiw; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2439202AbfJRN1E (ORCPT + 26 others); Fri, 18 Oct 2019 09:27:04 -0400 Received: from mail-wr1-f66.google.com ([209.85.221.66]:38852 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2410333AbfJRN1C (ORCPT ); Fri, 18 Oct 2019 09:27:02 -0400 Received: by mail-wr1-f66.google.com with SMTP id o15so5843108wru.5 for ; Fri, 18 Oct 2019 06:27:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=P/1EXP59iD2tSSDxQFHqr6RgR861BrswyMXiJ6kbQ8A=; b=awr1kDiwCy9ZAcOnBjuEJIflo+CIkuihpEYfhcMLLrMyJoP11CB0U3qlXJ/49XTNEd Z8uBraX991qplCmKIWdripBXmIPizYl9pPYMhMqMK3Yeemr6Fnkg2RcgfXct37PHHLrF 9pPTEYsCdKguRHaEjs3eXNTbdWnnzxO0B6kuLAjhywYonAYaFf4rlvfm9ZthekjCabiG G7zNYo5niBXc1G0zYbT1xyZfqIjBLozVhUzdnzICJsFF2A5jV5BwHc317kAyv5qjQ/pJ u5BOQtoFnqCloxfpx/tF4Lnawlp7alDFKCIsNugb4q5VgYlx9SNnJe7me4WxPJgJgad3 zKKA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=P/1EXP59iD2tSSDxQFHqr6RgR861BrswyMXiJ6kbQ8A=; b=a2J/buDtiFbQ2TTCJdeJvqSEbZBk7T+XnciGAWjUixsQnh4U3/i+vtjrEc8kQA5v3Z PG+RYFs62A1PvoSPz1kjt9nb5P0V92gZoZiKaOXAS3DzFf5WSgRZe+jETiMe/7fDVo9u amdG9RcKuubjU/zAjKJ07tGM9FhnwejMEu7lR4o6R2F81zDxJ3ELt5XXGM/ryFdJN4sQ lXTOniyGgMUwesVJPMJroIT0NYKi4Led0HW/1w95ashFWDnI8P8rHjscDNUI6behAB5u kbC+z+AZt+FeMlHjtKEle6sNvDE3UQmkLZ5EQguK3/Cv+hEXVSVtufUaXoVvnHZQLetn OHfg== X-Gm-Message-State: APjAAAX840PCpsIEERZhzJmCEi8pD1xDYvAQNoOEcBO9GvuWihMNavLe pe0rjhIDp3ka2OcbmKqseM9HJukb4sI= X-Received: by 2002:adf:8123:: with SMTP id 32mr8255555wrm.300.1571405220321; Fri, 18 Oct 2019 06:27:00 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.26.58 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:26:59 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 09/11] sched/fair: use load instead of runnable load in wakeup path Date: Fri, 18 Oct 2019 15:26:36 +0200 Message-Id: <1571405198-27570-10-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org runnable load has been introduced to take into account the case where blocked load biases the wake up path which may end to select an overloaded CPU with a large number of runnable tasks instead of an underutilized CPU with a huge blocked load. Tha wake up path now starts to looks for idle CPUs before comparing runnable load and it's worth aligning the wake up path with the load_balance. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 670856d..6203e71 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1475,7 +1475,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page, group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4; } -static unsigned long cpu_runnable_load(struct rq *rq); +static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq); + +static unsigned long cpu_runnable_load(struct rq *rq) +{ + return cfs_rq_runnable_load_avg(&rq->cfs); +} /* Cached statistics for all CPUs within a node */ struct numa_stats { @@ -5380,11 +5385,6 @@ static int sched_idle_cpu(int cpu) rq->nr_running); } -static unsigned long cpu_runnable_load(struct rq *rq) -{ - return cfs_rq_runnable_load_avg(&rq->cfs); -} - static unsigned long cpu_load(struct rq *rq) { return cfs_rq_load_avg(&rq->cfs); @@ -5485,7 +5485,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p, s64 this_eff_load, prev_eff_load; unsigned long task_load; - this_eff_load = cpu_runnable_load(cpu_rq(this_cpu)); + this_eff_load = cpu_load(cpu_rq(this_cpu)); if (sync) { unsigned long current_load = task_h_load(current); @@ -5503,7 +5503,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p, this_eff_load *= 100; this_eff_load *= capacity_of(prev_cpu); - prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu)); + prev_eff_load = cpu_load(cpu_rq(prev_cpu)); prev_eff_load -= task_load; if (sched_feat(WA_BIAS)) prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2; @@ -5591,7 +5591,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, max_spare_cap = 0; for_each_cpu(i, sched_group_span(group)) { - load = cpu_runnable_load(cpu_rq(i)); + load = cpu_load(cpu_rq(i)); runnable_load += load; avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); @@ -5732,7 +5732,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this continue; } - load = cpu_runnable_load(cpu_rq(i)); + load = 
cpu_load(cpu_rq(i)); if (load < min_load) { min_load = load; least_loaded_cpu = i; From patchwork Fri Oct 18 13:26:37 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176837 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861723ill; Fri, 18 Oct 2019 06:27:12 -0700 (PDT) X-Google-Smtp-Source: APXvYqyVyLojxTGXHbQQO5lYQzPjB7dmqjOWvDAKhnOl4xjgaVVW8ATdkJ0eGMkhAHesXjl8a1zj X-Received: by 2002:a05:6402:19bd:: with SMTP id o29mr9478402edz.42.1571405232292; Fri, 18 Oct 2019 06:27:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405232; cv=none; d=google.com; s=arc-20160816; b=SUkI20MULkI7jevUEuyvL6iI9/ppGc43AFOVR68plxXvXtj8IyItvqbckn4hC9R917 coTnigZ43WgaACEG1NCYLOqlnIsRDDQT3O7YFq91owLaBF7dZ5+u8BSGCMwcQz0y66I2 Vwgvf2BxbusyvySIHOiNd88oOKH6kfEzuxrR6u1g+3odaGabTX0I8tCwmBpwMrmFKwS7 NODvKazKSQ5Z7ih/9ft35s1JyXtQkD2KcM/YfWHKx8peLRml98wt96+GpOoXfVp7VGnb CDbDWl6fBbYuPKDbX4L77UfbS8qFtMLwrBBEpjCe5Lr3zpq4TwPUMyP1HK24fKQXoCc4 oqdw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=rF17NfzVuoPQDjrV4B8MSfUDRkdA/P46N9yA+eCB77U=; b=SvgKpA3hbhn+XZl83WzotEtpd53YbWi2qieaDhtHjeRP3GTDiPgV5+NvVzThhNKmA5 kAPn+UrmNN+fg+x3w71BFq1FJvwbpUyVkV5j86eY2NTpQGMMjw6rPnOev/C8/H2jTKso 7YZYG+GOr6hTdzDAl2brHgFv0lgBVj71GBbIb2s5thCPuauLvDGIFybpxy2r0NvK2O15 Bn8w7LCXCLSLRc13M6XvWnzpL/wHPcKSv2LHDdHhYR8SE6lWxWPObdvDtnGy+yD4Mnkq 6Tx8+arkWxfTxlnHtudU5NSoLasmwdVQfftnSP/J8dX9cGw/mOoNotcnEMAO2Lea69D4 qaqQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=G0oDhG1t; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. 
[209.132.180.67]) by mx.google.com with ESMTP id x62si4011677ede.352.2019.10.18.06.27.12; Fri, 18 Oct 2019 06:27:12 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=G0oDhG1t; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2442845AbfJRN1H (ORCPT + 26 others); Fri, 18 Oct 2019 09:27:07 -0400 Received: from mail-wm1-f65.google.com ([209.85.128.65]:55007 "EHLO mail-wm1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2438919AbfJRN1E (ORCPT ); Fri, 18 Oct 2019 09:27:04 -0400 Received: by mail-wm1-f65.google.com with SMTP id p7so6193216wmp.4 for ; Fri, 18 Oct 2019 06:27:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=rF17NfzVuoPQDjrV4B8MSfUDRkdA/P46N9yA+eCB77U=; b=G0oDhG1tp21hqgoc0Wftnssc9hxoXtNaWmw+rX4yhbE0TfCUsLCUi2lUibmsO/dO+R ZAeKslXiKdUvQNF5+/tWf8z9YDwDNh6iP0kqNAH1W7mjk3OIIDkR25vTWoJeiXryGQz6 CiTA0HSae02uTYsH1CtxaxiWGKU6dikwaGiWNLbOlebv82W4MPTw/9aOJKRbnidMVfeA Edeo7C2SnFT5Q3TH25FB1bmiIn4sAIFVK0vMDtEm3krbAKip3CQ9qFFs0X7/Il3cQIrx 65F2uNoLKYHNl6GoBI/NfYUfqLUT+Mb/g1cOXMKKY6xumD9nWHZhvz1+B80kCITNjZRm ysYg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=rF17NfzVuoPQDjrV4B8MSfUDRkdA/P46N9yA+eCB77U=; b=WhrzCnH9Uzd9PTCSlpWmeaRs+6XZCBCzoW4qm1a6nAnaMWYEnQB59BC4PPMFPXTT0M VlEM/2ZW570TEfRGRbZfzrIJ06Lk1SX6HFXHaX4QQv83xC2KtBgDh1UW1bMdnns6WNRd Phpq6zMvGb7eqWWJ691lMuYZJb8JUT/PeSIFX6Ir7daPci0PBjrHesSRzV1QsLVITYQC s3NH114Y/PtJwFtpBfn5ePzuda1pZcWp72etLjbtatC49sqURXuSSSEv4f16a0X0Z4FY oslzxadSRrLBfaQEF6ojeDhOzB7JW9UBNqREaDSY0I6KaAs07zEuzLti6c0lJPc3+Mwv u3/A== X-Gm-Message-State: APjAAAVW5ilrALn/gTZsmZREvgOWYTe9vS3UMNcNMy/tmqKXRRC8DwIo QFlFsQQUNS9a53AUvZThx9MJvYLO2KY= X-Received: by 2002:a1c:9c0c:: with SMTP id f12mr1836748wme.133.1571405222144; Fri, 18 Oct 2019 06:27:02 -0700 (PDT) Received: from localhost.localdomain (91-160-61-128.subs.proxad.net. 
[91.160.61.128]) by smtp.gmail.com with ESMTPSA id p15sm5870123wrs.94.2019.10.18.06.27.00 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 18 Oct 2019 06:27:00 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 10/11] sched/fair: optimize find_idlest_group Date: Fri, 18 Oct 2019 15:26:37 +0200 Message-Id: <1571405198-27570-11-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org find_idlest_group() now reads CPU's load_avg in 2 different ways. Consolidate the function to read and use load_avg only once and simplify the algorithm to only look for the group with lowest load_avg. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 50 ++++++++++++++------------------------------------ 1 file changed, 14 insertions(+), 36 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6203e71..ed1800d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5560,16 +5560,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, { struct sched_group *idlest = NULL, *group = sd->groups; struct sched_group *most_spare_sg = NULL; - unsigned long min_runnable_load = ULONG_MAX; - unsigned long this_runnable_load = ULONG_MAX; - unsigned long min_avg_load = ULONG_MAX, this_avg_load = ULONG_MAX; + unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX; unsigned long most_spare = 0, this_spare = 0; int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; unsigned long imbalance = scale_load_down(NICE_0_LOAD) * (sd->imbalance_pct-100) / 100; do { - unsigned long load, avg_load, runnable_load; + unsigned long load; unsigned long spare_cap, max_spare_cap; int local_group; int i; @@ -5586,15 +5584,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * Tally up the load of all CPUs in the group and find * the group containing the CPU with most spare capacity. 
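For illustration, the simplified selection that this patch converges on can be modelled as below; the group data, the capacity scaling and the imbalance allowance are all invented stand-ins:

#include <limits.h>
#include <stdio.h>

/*
 * Toy model of the reworked flow: sum the per-CPU load over each
 * group, scale by the group capacity, then leave the local group only
 * when some other group is lighter by more than "imbalance".
 */
struct toy_group {
	const char *name;
	unsigned long load;	/* sum of per-CPU load in the group */
	unsigned long capacity;
	int local;
};

int main(void)
{
	struct toy_group groups[] = {
		{ .name = "local",  .load = 900, .capacity = 1024, .local = 1 },
		{ .name = "remote", .load = 700, .capacity = 1024, .local = 0 },
	};
	unsigned long imbalance = 128;	/* stands in for NICE_0_LOAD *
					 * (imbalance_pct - 100) / 100 */
	unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX;
	int i, idlest = -1;

	for (i = 0; i < 2; i++) {
		/* adjust by relative group capacity, as the hunk does */
		unsigned long load = groups[i].load * 1024 / groups[i].capacity;

		if (groups[i].local)
			this_load = load;
		else if (load < min_load) {
			min_load = load;
			idlest = i;
		}
	}

	if (idlest >= 0 && min_load + imbalance < this_load)
		printf("place the waking task in %s\n", groups[idlest].name);
	else
		printf("keep the waking task in the local group\n");

	return 0;
}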
*/ - avg_load = 0; - runnable_load = 0; + load = 0; max_spare_cap = 0; for_each_cpu(i, sched_group_span(group)) { - load = cpu_load(cpu_rq(i)); - runnable_load += load; - - avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); + load += cpu_load(cpu_rq(i)); spare_cap = capacity_spare_without(i, p); @@ -5603,31 +5597,15 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, } /* Adjust by relative CPU capacity of the group */ - avg_load = (avg_load * SCHED_CAPACITY_SCALE) / - group->sgc->capacity; - runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / + load = (load * SCHED_CAPACITY_SCALE) / group->sgc->capacity; if (local_group) { - this_runnable_load = runnable_load; - this_avg_load = avg_load; + this_load = load; this_spare = max_spare_cap; } else { - if (min_runnable_load > (runnable_load + imbalance)) { - /* - * The runnable load is significantly smaller - * so we can pick this new CPU: - */ - min_runnable_load = runnable_load; - min_avg_load = avg_load; - idlest = group; - } else if ((runnable_load < (min_runnable_load + imbalance)) && - (100*min_avg_load > imbalance_scale*avg_load)) { - /* - * The runnable loads are close so take the - * blocked load into account through avg_load: - */ - min_avg_load = avg_load; + if (load < min_load) { + min_load = load; idlest = group; } @@ -5668,18 +5646,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, * local domain to be very lightly loaded relative to the remote * domains but "imbalance" skews the comparison making remote CPUs * look much more favourable. When considering cross-domain, add - * imbalance to the runnable load on the remote node and consider - * staying local. + * imbalance to the load on the remote node and consider staying + * local. */ if ((sd->flags & SD_NUMA) && - min_runnable_load + imbalance >= this_runnable_load) + min_load + imbalance >= this_load) return NULL; - if (min_runnable_load > (this_runnable_load + imbalance)) + if (min_load >= this_load + imbalance) return NULL; - if ((this_runnable_load < (min_runnable_load + imbalance)) && - (100*this_avg_load < imbalance_scale*min_avg_load)) + if ((this_load < (min_load + imbalance)) && + (100*this_load < imbalance_scale*min_load)) return NULL; return idlest; From patchwork Fri Oct 18 13:26:38 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176838 Delivered-To: patch@linaro.org Received: by 2002:a92:7e96:0:0:0:0:0 with SMTP id q22csp861733ill; Fri, 18 Oct 2019 06:27:12 -0700 (PDT) X-Google-Smtp-Source: APXvYqwDxX21N7bLAr2ar3OyRzBMjW16xGm+rjG3sjUlzk8k4PzCLVJ18GvgpqK8TC2L5QE6ya5o X-Received: by 2002:aa7:d758:: with SMTP id a24mr9816276eds.194.1571405232853; Fri, 18 Oct 2019 06:27:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571405232; cv=none; d=google.com; s=arc-20160816; b=DS6kg92IoTGQnfas77fVdrft+Apm0asHmT+9ZSVEd38imAF5DFxxVbEoZFrZcZATn3 5+JVZWDo5dUdizluzSMLqMZujv+icBcf3budp0PhiSW71S3YLR8Ss2ewEpF9vNzhrROA vOoRwLK5Xd9WCnqTXmXHpL5Agcjrh82r29WYYzzip0AJw+Q+a5D52mILm2YT1nn3eEs1 Urm9d3dtnsaEzg4HF5kvRdPfaAa6wPEG/ape+8PHYx7ge+gP8Ybq3Z6m/DlrsXoOJDiq kOdrwglxGIkjB7wnWL+oO3t7V9jEFAOCu4943ISeTHWe2L3RLFVGiwO+RkwE8M580kkZ U4og== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=YYd+mUbPd3e33Spb/1ehWk7tN88cH6YOPs4EY/LiqPc=; b=yTLvc6t/0TE9lt98mrwJ3Js3EwKXQ9EYzYl0dtlA+F4lhyWN0gk6bKkNgbwon1xXUL 
From patchwork Fri Oct 18 13:26:38 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 176838 Delivered-To: patch@linaro.org From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: pauld@redhat.com, valentin.schneider@arm.com, srikar@linux.vnet.ibm.com, quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, hdanton@sina.com, parth@linux.ibm.com, riel@surriel.com, Vincent Guittot Subject: [PATCH v4 11/11] sched/fair: rework find_idlest_group Date: Fri, 18 Oct 2019 15:26:38 +0200 Message-Id: <1571405198-27570-12-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> References: <1571405198-27570-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The slow wake-up path computes per-sched_group statistics to select the idlest group, which is quite similar to what load_balance() does when selecting the busiest group. Rework find_idlest_group() to classify the sched_groups and select the idlest one following the same steps as load_balance(). Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 384 ++++++++++++++++++++++++++++++++++------------------ 1 file changed, 256 insertions(+), 128 deletions(-) -- 2.7.4 Reviewed-by: Valentin Schneider diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index ed1800d..fbaafae 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5541,127 +5541,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, return target; } -static unsigned long cpu_util_without(int cpu, struct task_struct *p); - -static unsigned long capacity_spare_without(int cpu, struct task_struct *p) -{ - return max_t(long, capacity_of(cpu) - cpu_util_without(cpu, p), 0); -} - -/* - * find_idlest_group finds and returns the least busy CPU group within the - * domain. - * - * Assumes p is allowed on at least one CPU in sd. - */ static struct sched_group * find_idlest_group(struct sched_domain *sd, struct task_struct *p, - int this_cpu, int sd_flag) -{ - struct sched_group *idlest = NULL, *group = sd->groups; - struct sched_group *most_spare_sg = NULL; - unsigned long min_load = ULONG_MAX, this_load = ULONG_MAX; - unsigned long most_spare = 0, this_spare = 0; - int imbalance_scale = 100 + (sd->imbalance_pct-100)/2; - unsigned long imbalance = scale_load_down(NICE_0_LOAD) * - (sd->imbalance_pct-100) / 100; - - do { - unsigned long load; - unsigned long spare_cap, max_spare_cap; - int local_group; - int i; - - /* Skip over this group if it has no CPUs allowed */ - if (!cpumask_intersects(sched_group_span(group), - p->cpus_ptr)) - continue; - - local_group = cpumask_test_cpu(this_cpu, - sched_group_span(group)); - - /* - * Tally up the load of all CPUs in the group and find - * the group containing the CPU with most spare capacity.
- */ - load = 0; - max_spare_cap = 0; - - for_each_cpu(i, sched_group_span(group)) { - load += cpu_load(cpu_rq(i)); - - spare_cap = capacity_spare_without(i, p); - - if (spare_cap > max_spare_cap) - max_spare_cap = spare_cap; - } - - /* Adjust by relative CPU capacity of the group */ - load = (load * SCHED_CAPACITY_SCALE) / - group->sgc->capacity; - - if (local_group) { - this_load = load; - this_spare = max_spare_cap; - } else { - if (load < min_load) { - min_load = load; - idlest = group; - } - - if (most_spare < max_spare_cap) { - most_spare = max_spare_cap; - most_spare_sg = group; - } - } - } while (group = group->next, group != sd->groups); - - /* - * The cross-over point between using spare capacity or least load - * is too conservative for high utilization tasks on partially - * utilized systems if we require spare_capacity > task_util(p), - * so we allow for some task stuffing by using - * spare_capacity > task_util(p)/2. - * - * Spare capacity can't be used for fork because the utilization has - * not been set yet, we must first select a rq to compute the initial - * utilization. - */ - if (sd_flag & SD_BALANCE_FORK) - goto skip_spare; - - if (this_spare > task_util(p) / 2 && - imbalance_scale*this_spare > 100*most_spare) - return NULL; - - if (most_spare > task_util(p) / 2) - return most_spare_sg; - -skip_spare: - if (!idlest) - return NULL; - - /* - * When comparing groups across NUMA domains, it's possible for the - * local domain to be very lightly loaded relative to the remote - * domains but "imbalance" skews the comparison making remote CPUs - * look much more favourable. When considering cross-domain, add - * imbalance to the load on the remote node and consider staying - * local. - */ - if ((sd->flags & SD_NUMA) && - min_load + imbalance >= this_load) - return NULL; - - if (min_load >= this_load + imbalance) - return NULL; - - if ((this_load < (min_load + imbalance)) && - (100*this_load < imbalance_scale*min_load)) - return NULL; - - return idlest; -} + int this_cpu, int sd_flag); /* * find_idlest_group_cpu - find the idlest CPU among the CPUs in the group. @@ -5734,7 +5616,7 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p return prev_cpu; /* - * We need task's util for capacity_spare_without, sync it up to + * We need task's util for cpu_util_without, sync it up to * prev_cpu's last_update_time. */ if (!(sd_flag & SD_BALANCE_FORK)) @@ -7915,13 +7797,13 @@ static inline int sg_imbalanced(struct sched_group *group) * any benefit for the load balance. */ static inline bool -group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) +group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs) { if (sgs->sum_nr_running < sgs->group_weight) return true; if ((sgs->group_capacity * 100) > - (sgs->group_util * env->sd->imbalance_pct)) + (sgs->group_util * imbalance_pct)) return true; return false; @@ -7936,13 +7818,13 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) * false. 
*/ static inline bool -group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) +group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs) { if (sgs->sum_nr_running <= sgs->group_weight) return false; if ((sgs->group_capacity * 100) < - (sgs->group_util * env->sd->imbalance_pct) + (sgs->group_util * imbalance_pct)) return true; return false; @@ -7969,11 +7851,11 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) } static inline enum -group_type group_classify(struct lb_env *env, +group_type group_classify(unsigned int imbalance_pct, struct sched_group *group, struct sg_lb_stats *sgs) { - if (group_is_overloaded(env, sgs)) + if (group_is_overloaded(imbalance_pct, sgs)) return group_overloaded; if (sg_imbalanced(group)) @@ -7985,7 +7867,7 @@ group_type group_classify(struct lb_env *env, if (sgs->group_misfit_task_load) return group_misfit_task; - if (!group_has_capacity(env, sgs)) + if (!group_has_capacity(imbalance_pct, sgs)) return group_fully_busy; return group_has_spare; @@ -8086,7 +7968,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_weight = group->group_weight; - sgs->group_type = group_classify(env, group, sgs); + sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs); /* Computing avg_load makes sense only when group is overloaded */ if (sgs->group_type == group_overloaded) @@ -8241,6 +8123,252 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) } #endif /* CONFIG_NUMA_BALANCING */ + +struct sg_lb_stats; + +/* + * update_sg_wakeup_stats - Update sched_group's statistics for wakeup. + * @sd: The sched_domain level to look for the idlest group. + * @group: sched_group whose statistics are to be updated. + * @sgs: variable to hold the statistics for this group. + */ +static inline void update_sg_wakeup_stats(struct sched_domain *sd, + struct sched_group *group, + struct sg_lb_stats *sgs, + struct task_struct *p) +{ + int i, nr_running; + + memset(sgs, 0, sizeof(*sgs)); + + for_each_cpu(i, sched_group_span(group)) { + struct rq *rq = cpu_rq(i); + + sgs->group_load += cpu_load(rq); + sgs->group_util += cpu_util_without(i, p); + sgs->sum_h_nr_running += rq->cfs.h_nr_running; + + nr_running = rq->nr_running; + sgs->sum_nr_running += nr_running; + + /* + * No need to call idle_cpu() if nr_running is not 0 + */ + if (!nr_running && idle_cpu(i)) + sgs->idle_cpus++; + + + } + + /* Check if task fits in the group */ + if (sd->flags & SD_ASYM_CPUCAPACITY && + !task_fits_capacity(p, group->sgc->max_capacity)) { + sgs->group_misfit_task_load = 1; + } + + sgs->group_capacity = group->sgc->capacity; + + sgs->group_type = group_classify(sd->imbalance_pct, group, sgs); + + /* + * Computing avg_load makes sense only when group is fully busy or + * overloaded + */ + if (sgs->group_type < group_fully_busy) + sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) / + sgs->group_capacity; +} + +static bool update_pick_idlest(struct sched_group *idlest, + struct sg_lb_stats *idlest_sgs, + struct sched_group *group, + struct sg_lb_stats *sgs) +{ + if (sgs->group_type < idlest_sgs->group_type) + return true; + + if (sgs->group_type > idlest_sgs->group_type) + return false; + + /* + * The candidate and the current idlest group are the same type of + * group. Let's check which one is the idlest according to the type. + */ + + switch (sgs->group_type) { + case group_overloaded: + case group_fully_busy: + /* Select the group with lowest avg_load.
*/ + if (idlest_sgs->avg_load <= sgs->avg_load) + return false; + break; + + case group_imbalanced: + case group_asym_packing: + /* Those types are not used in the slow wakeup path */ + return false; + + case group_misfit_task: + /* Select group with the highest max capacity */ + if (idlest->sgc->max_capacity >= group->sgc->max_capacity) + return false; + break; + + case group_has_spare: + /* Select group with most idle CPUs */ + if (idlest_sgs->idle_cpus >= sgs->idle_cpus) + return false; + break; + } + + return true; +} + +/* + * find_idlest_group finds and returns the least busy CPU group within the + * domain. + * + * Assumes p is allowed on at least one CPU in sd. + */ +static struct sched_group * +find_idlest_group(struct sched_domain *sd, struct task_struct *p, + int this_cpu, int sd_flag) +{ + struct sched_group *idlest = NULL, *local = NULL, *group = sd->groups; + struct sg_lb_stats local_sgs, tmp_sgs; + struct sg_lb_stats *sgs; + unsigned long imbalance; + struct sg_lb_stats idlest_sgs = { + .avg_load = UINT_MAX, + .group_type = group_overloaded, + }; + + imbalance = scale_load_down(NICE_0_LOAD) * + (sd->imbalance_pct-100) / 100; + + do { + int local_group; + + /* Skip over this group if it has no CPUs allowed */ + if (!cpumask_intersects(sched_group_span(group), + p->cpus_ptr)) + continue; + + local_group = cpumask_test_cpu(this_cpu, + sched_group_span(group)); + + if (local_group) { + sgs = &local_sgs; + local = group; + } else { + sgs = &tmp_sgs; + } + + update_sg_wakeup_stats(sd, group, sgs, p); + + if (!local_group && update_pick_idlest(idlest, &idlest_sgs, group, sgs)) { + idlest = group; + idlest_sgs = *sgs; + } + + } while (group = group->next, group != sd->groups); + + + /* There is no idlest group to push tasks to */ + if (!idlest) + return NULL; + + /* + * If the local group is idler than the selected idlest group + * don't try and push the task. + */ + if (local_sgs.group_type < idlest_sgs.group_type) + return NULL; + + /* + * If the local group is busier than the selected idlest group + * try and push the task. + */ + if (local_sgs.group_type > idlest_sgs.group_type) + return idlest; + + switch (local_sgs.group_type) { + case group_overloaded: + case group_fully_busy: + /* + * When comparing groups across NUMA domains, it's possible for + * the local domain to be very lightly loaded relative to the + * remote domains but "imbalance" skews the comparison making + * remote CPUs look much more favourable. When considering + * cross-domain, add imbalance to the load on the remote node + * and consider staying local. + */ + + if ((sd->flags & SD_NUMA) && + ((idlest_sgs.avg_load + imbalance) >= local_sgs.avg_load)) + return NULL; + + /* + * If the local group is less loaded than the selected + * idlest group don't try and push any tasks. 
+ */ + if (idlest_sgs.avg_load >= (local_sgs.avg_load + imbalance)) + return NULL; + + if (100 * local_sgs.avg_load <= sd->imbalance_pct * idlest_sgs.avg_load) + return NULL; + break; + + case group_imbalanced: + case group_asym_packing: + /* Those types are not used in the slow wakeup path */ + return NULL; + + case group_misfit_task: + /* Select group with the highest max capacity */ + if (local->sgc->max_capacity >= idlest->sgc->max_capacity) + return NULL; + break; + + case group_has_spare: + if (sd->flags & SD_NUMA) { +#ifdef CONFIG_NUMA_BALANCING + int idlest_cpu; + /* + * If there is spare capacity at NUMA, try to select + * the preferred node + */ + if (cpu_to_node(this_cpu) == p->numa_preferred_nid) + return NULL; + + idlest_cpu = cpumask_first(sched_group_span(idlest)); + if (cpu_to_node(idlest_cpu) == p->numa_preferred_nid) + return idlest; +#endif + /* + * Otherwise, keep the task on this node to stay close to + * its wakeup source and improve locality. If there is + * a real need for migration, periodic load balance will + * take care of it. + */ + if (local_sgs.idle_cpus) + return NULL; + } + + /* + * Select the group with the highest number of idle CPUs. We could + * also compare the utilization, which is more stable, but a group + * may have less spare capacity yet more idle CPUs, which means + * more opportunities to run a task. + */ + if (local_sgs.idle_cpus >= idlest_sgs.idle_cpus) + return NULL; + break; + } + + return idlest; +} + /** * update_sd_lb_stats - Update sched_domain's statistics for load balancing. * @env: The load balancing environment.