From patchwork Thu Nov 16 14:09:57 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Vincent Guittot <vincent.guittot@linaro.org>
X-Patchwork-Id: 119043
Delivered-To: patch@linaro.org
Received: by 10.140.22.164 with SMTP id 33csp5636905qgn;
 Thu, 16 Nov 2017 06:10:20 -0800 (PST)
X-Google-Smtp-Source: AGs4zMYhfSrEfrp4Ft4O9WkODydyLeSZevHMtzM8sjwfG1b9DoHNKWqhiremFU+uNsx5LflaIwzZ
X-Received: by 10.98.33.8 with SMTP id h8mr1985176pfh.160.1510841420236;
 Thu, 16 Nov 2017 06:10:20 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1510841420; cv=none;
 d=google.com; s=arc-20160816;
 b=xMElyhnnKzwbgVmN2aKmtKkfk3OJr32AMgd6Rqb4ptK+8kwFQRYoorguQ/UcIzVYcG
 V14wxwjRAGsf6kEeXiq8aYCDA+Xt/eOXt07CHNeww1Jaw+5Cc6EGumeQTRxEIJE82vOx
 Z58K4aX2aSnAiUChshMR5I1p2C1MpnAa0E32+wpZVzvwnUWBlhiy9PPsIp90gQBSI489
 +I/uBsBDksECF+1A9ArpvBZlMpfk9vjkwKnF4ZpZGaCbZY3toxuI9sWwSuiuFSlFfJEP
 y/UBf10p6AAkTSWe6iloPomsfki/xD/syfJlX7Wv7gwwhG4mzr90s6+4gzJA2T4jd3+A
 MDSQ==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816; 
 h=list-id:precedence:sender:references:in-reply-to:message-id:date
 :subject:cc:to:from:dkim-signature:arc-authentication-results;
 bh=5T4gr1XzPzdG3AAObkptY00mkyhTwQBhuYtTrkCTHOQ=;
 b=GWfcFVd63l4xIFOyV4WOU7z4UlPxLI3G7ByEoZx2NxXbye3bkuIRzTya6PgWBO2l1n
 V5aZonc7Zk31ltDs6ibOUIYkniGhpkTy5dKtbeNuvSyHNB0PEQyaV6dOp//NqwXZ4qwH
 EpzXm1Xw/Eb15C08MUk5KxVC67cKVmdIiueLwG12WUPVuHLQhtT+RRDChYK7LW6EquYB
 Djb5t6BOOA5yWVqaZH2kwOBLp4Nk8a+/gzLMUc23EjypKmHVZInGrG6kkGmc04qNXyI8
 4tupXT8TsdzOFz1kKZy43QHJJkVZzMi6xzQXymPrk1JnxLDymLtlp4YQ2g/FnHzfpUBM
 G7fQ==
ARC-Authentication-Results: i=1; mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=cZdITkTo;
 spf=pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id s194si942410pgc.0.2017.11.16.06.10.17; 
 Thu, 16 Nov 2017 06:10:20 -0800 (PST)
Received-SPF: pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=cZdITkTo;
 spf=pass (google.com: best guess record for domain of
 linux-kernel-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-kernel-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S935255AbdKPOKQ (ORCPT <rfc822;dan.rue@linaro.org> + 28 others);
 Thu, 16 Nov 2017 09:10:16 -0500
Received: from mail-wm0-f66.google.com ([74.125.82.66]:43904 "EHLO
 mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S932257AbdKPOKG (ORCPT
 <rfc822;linux-kernel@vger.kernel.org>);
 Thu, 16 Nov 2017 09:10:06 -0500
Received: by mail-wm0-f66.google.com with SMTP id x63so260769wmf.2
 for <linux-kernel@vger.kernel.org>;
 Thu, 16 Nov 2017 06:10:05 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=from:to:cc:subject:date:message-id:in-reply-to:references;
 bh=5T4gr1XzPzdG3AAObkptY00mkyhTwQBhuYtTrkCTHOQ=;
 b=cZdITkToZznsu1zauhHl2UwT/MHkgerZHJXiItfQMU1zl1w0qsQQibU9mXFU4BgxJR
 FBv0A1QGZC+KPcJHlI8ZlicFsM1RiNVHqs6bJp9ZY93FkO2JsbbdW65cQsjHmvVrpsih
 I5jZnROBsOWHj3xtHuwlCzEyL9KwvkWLyQUCY=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references;
 bh=5T4gr1XzPzdG3AAObkptY00mkyhTwQBhuYtTrkCTHOQ=;
 b=caNcQIDX806ToWVmHQ0w1DORKhSTymOJIyl3Ir1xxiCZ7Zi8UiEkAzAn9YwbrJWlU6
 6vU/m4vMLpOtAZNt9BXi1pcKRor2MZUDLVTUiR3rz0WkVqV9a8AKG7aVp8ZgEJBURIbL
 ycvnHry970Ku1T7/ri9RVwmJmZKSe9Vk8MrCduya6VEZdCvJmO3QlRkTOfkPrpRY3SCB
 0M5uvIPjV0+W9xigJmvTY/IuXvvCE9wJbnk4zC4zRtppxOkYhMNHDSIO4R312+24HiBb
 grxRhZ/GtpSn+QAqqIjoIksbCOlEBTtTV89ajiaKYhtiL8J+pnmJmrj/ktNg7W3ix20/
 yafQ==
X-Gm-Message-State: AJaThX6FpdDXjdf6U5HCWXzBa8hni+rVef6oFMtvpubKK3oh91wk2lKX
 prt+wQHJ71SykQguWLeg4X1jNg==
X-Received: by 10.223.178.26 with SMTP id u26mr1662171wra.239.1510841405087; 
 Thu, 16 Nov 2017 06:10:05 -0800 (PST)
Received: from localhost.localdomain ([2a01:e0a:f:6020:4f8:17cf:f041:b383])
 by smtp.gmail.com with ESMTPSA id
 10sm2165326wml.27.2017.11.16.06.10.03
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
 Thu, 16 Nov 2017 06:10:04 -0800 (PST)
From: Vincent Guittot <vincent.guittot@linaro.org>
To: peterz@infradead.org, linux-kernel@vger.kernel.org
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
 Yuyang Du <yuyang.du@intel.com>,
 Ingo Molnar <mingo@kernel.org>, Mike Galbraith <efault@gmx.de>,
 Chris Mason <clm@fb.com>, Linus Torvalds <torvalds@linux-foundation.org>,
 Dietmar Eggemann <dietmar.eggemann@arm.com>,
 Josef Bacik <josef@toxicpanda.com>,
 Ben Segall <bsegall@google.com>, Paul Turner <pjt@google.com>,
 Tejun Heo <tj@kernel.org>, Morten Rasmussen <morten.rasmussen@arm.com>
Subject: [PATCH v3] sched: Update runnable propagation rule
Date: Thu, 16 Nov 2017 15:09:57 +0100
Message-Id: <1510841397-29119-1-git-send-email-vincent.guittot@linaro.org>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <20171031150108.mxihmewrugufwulq@hirez.programming.kicks-ass.net>
References: <20171031150108.mxihmewrugufwulq@hirez.programming.kicks-ass.net>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Unlike running, the runnable part can't be directly propagated through
the hierarchy when we migrate a task. The main reason is that runnable
time can be shared with other sched_entities that stay on the rq and
this runnable time will also remain on prev cfs_rq and must not be
removed.

Instead, we can estimate what the new runnable of the prev cfs_rq
should be and check that this estimate stays in a possible range. The
prop_runnable_sum is a good estimate when adding runnable_sum but
fails most often when we remove it. For removal, we can use the formula
below instead:

  gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

which assumes that tasks are equally runnable. This is not true, but it
is easy to compute.

Besides these estimates, we have several simple rules that help us filter
out wrong ones:

 - ge->avg.runnable_sum <= LOAD_AVG_MAX
 - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum rescaled by the CPU capacity)
 - ge->avg.runnable_sum can't increase when we detach a task
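
As an illustration only, the estimate and the three rules above can be
modelled outside the kernel as below. This is a minimal sketch, not the
code of the patch: estimate_runnable_sum(), min64(), max64() and all
parameter names are hypothetical, plain integers stand in for the
sched_avg fields, and running_sum is assumed to be already rescaled by
the CPU capacity. LOAD_AVG_MAX is copied from the kernel's PELT tables.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value a PELT sum can reach (kernel constant). */
#define LOAD_AVG_MAX 47742

/* Hypothetical helpers, stand-ins for the kernel's min()/max(). */
static int64_t min64(int64_t a, int64_t b) { return a < b ? a : b; }
static int64_t max64(int64_t a, int64_t b) { return a > b ? a : b; }

/*
 * Hypothetical standalone model of the propagation rule.
 *
 *  prop:         pending runnable_sum change (>= 0 attach, < 0 detach)
 *  se_sum:       current unweighted runnable sum of the group entity
 *  grq_load_sum: weighted load_sum of the group cfs_rq
 *  grq_weight:   load.weight of the group cfs_rq (0 if it is empty)
 *  running_sum:  running sum of the group entity, already rescaled
 */
static int64_t estimate_runnable_sum(int64_t prop, int64_t se_sum,
				     int64_t grq_load_sum, int64_t grq_weight,
				     int64_t running_sum)
{
	int64_t runnable_sum;

	assert(se_sum >= 0 && grq_weight >= 0);

	if (prop >= 0) {
		/* Attach: add the runnable time, clip at LOAD_AVG_MAX. */
		runnable_sum = min64(prop + se_sum, LOAD_AVG_MAX);
	} else {
		/* Detach: assume all remaining tasks are equally runnable. */
		int64_t est = grq_weight ? grq_load_sum / grq_weight : 0;

		/* A detach must never increase the runnable sum. */
		runnable_sum = min64(se_sum, est);
	}

	/* runnable can never be lower than running. */
	return max64(runnable_sum, running_sum);
}
```

For example, attaching 10000 to an entity that already holds 40000 is
clipped to LOAD_AVG_MAX, and a detach on an empty group rq falls back to
running_sum.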

Cc: Yuyang Du <yuyang.du@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Chris Mason <clm@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Paul Turner <pjt@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171019150442.GA25025@linaro.org
---

Hi Peter,

I have rebased the patch, updated the 2 comments that were unclear, and
fixed the computation of running_sum by using arch_scale_cpu_capacity()
instead of a right shift by SCHED_CAPACITY_SHIFT.
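
To illustrate why the shift was wrong on CPUs whose capacity is below
1024, here is a minimal standalone sketch. It assumes, as the comment in
the patch states, that util_sum is accumulated scaled by the CPU
capacity. running_sum_shift() and running_sum_capacity() are
hypothetical names, and the capacity is passed in directly rather than
read through arch_scale_cpu_capacity().

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

/* Previous computation: always divides by 1024, whatever the CPU. */
static uint64_t running_sum_shift(uint64_t util_sum)
{
	return util_sum >> SCHED_CAPACITY_SHIFT;
}

/* Computation used now: divide by the capacity of the CPU itself. */
static uint64_t running_sum_capacity(uint64_t util_sum, uint64_t capacity)
{
	assert(capacity > 0);
	return util_sum / capacity;
}
```

With 40000 units of running time accumulated on a CPU of capacity 512,
util_sum is 40000 * 512. Dividing by the capacity gives back 40000,
while the shift gives 20000 and so under-estimates the lower bound of
runnable_sum. Both agree when the capacity is 1024.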

 kernel/sched/fair.c | 101 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 72 insertions(+), 29 deletions(-)

-- 
2.7.4

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0989676..05eabb2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3413,9 +3413,9 @@ void set_task_rq_fair(struct sched_entity *se,
  * _IFF_ we look at the pure running and runnable sums. Because they
  * represent the very same entity, just at different points in the hierarchy.
  *
- *
- * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
- * simply copies the running sum over.
+ * Per the above update_tg_cfs_util() is trivial and simply copies the running
+ * sum over (but still wrong, because the group entity and group rq do not have
+ * their PELT windows aligned).
  *
  * However, update_tg_cfs_runnable() is more complex. So we have:
  *
@@ -3424,11 +3424,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3447,27 +3447,28 @@ void set_task_rq_fair(struct sched_entity *se,
  * to (shortly) return to us. This only works by keeping the weights as
  * integral part of the sum. We therefore cannot decompose as per (3).
  *
- * OK, so what then?
+ * Another reason this doesn't work is that runnable isn't a 0-sum entity.
+ * Imagine a rq with 2 tasks that each are runnable 2/3 of the time. Then the
+ * rq itself is runnable anywhere between 2/3 and 1 depending on how the
+ * runnable sections of these tasks overlap (or not). If they were to perfectly
+ * align the rq as a whole would be runnable 2/3 of the time. If however we
+ * always have at least 1 runnable task, the rq as a whole is always runnable.
  *
+ * So we'll have to approximate.. :/
  *
- * Another way to look at things is:
+ * Given the constraint:
  *
- *   grq->avg.load_avg = \Sum se->avg.load_avg
+ *   ge->avg.running_sum <= ge->avg.runnable_sum <= LOAD_AVG_MAX
  *
- * Therefore, per (2):
+ * We can construct a rule that adds runnable to a rq by assuming minimal
+ * overlap.
  *
- *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
+ * On removal, we'll assume each task is equally runnable; which yields:
  *
- * And the very thing we're propagating is a change in that sum (someone
- * joined/left). So we can easily know the runnable change, which would be, per
- * (2) the already tracked se->load_avg divided by the corresponding
- * se->weight.
+ *   grq->avg.runnable_sum = grq->avg.load_sum / grq->load.weight
  *
- * Basically (4) but in differential form:
+ * XXX: only do this for the part of runnable > running ?
  *
- *   d(runnable_avg) += se->avg.load_avg / se->load.weight
- *								   (5)
- *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)
  */
 
 static inline void
@@ -3479,6 +3480,14 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 	if (!delta)
 		return;
 
+	/*
+	 * The relation between sum and avg is:
+	 *
+	 *   sum = avg * (LOAD_AVG_MAX - 1024 + sa->period_contrib)
+	 *
+	 * however, the PELT windows are not aligned between grq and gse.
+	 */
+
 	/* Set new sched_entity's utilization */
 	se->avg.util_avg = gcfs_rq->avg.util_avg;
 	se->avg.util_sum = se->avg.util_avg * LOAD_AVG_MAX;
@@ -3491,33 +3500,67 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq
 static inline void
 update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se, struct cfs_rq *gcfs_rq)
 {
-	long runnable_sum = gcfs_rq->prop_runnable_sum;
-	long runnable_load_avg, load_avg;
-	s64 runnable_load_sum, load_sum;
+	long delta_avg, running_sum, runnable_sum = gcfs_rq->prop_runnable_sum;
+	unsigned long runnable_load_avg, load_avg;
+	u64 runnable_load_sum, load_sum = 0;
+	s64 delta_sum;
 
 	if (!runnable_sum)
 		return;
 
 	gcfs_rq->prop_runnable_sum = 0;
 
+	if (runnable_sum >= 0) {
+		/*
+		 * Add runnable; clip at LOAD_AVG_MAX. Reflects that until
+		 * the CPU is saturated running == runnable.
+		 */
+		runnable_sum += se->avg.load_sum;
+		runnable_sum = min(runnable_sum, (long)LOAD_AVG_MAX);
+	} else {
+		/*
+		 * Estimate the new unweighted runnable_sum of the gcfs_rq by
+		 * assuming all tasks are equally runnable.
+		 */
+		if (scale_load_down(gcfs_rq->load.weight)) {
+			load_sum = div_s64(gcfs_rq->avg.load_sum,
+				scale_load_down(gcfs_rq->load.weight));
+		}
+
+		/* But make sure to not inflate se's runnable */
+		runnable_sum = min(se->avg.load_sum, load_sum);
+	}
+
+	/*
+	 * runnable_sum can't be lower than running_sum.
+	 * As running_sum is scaled with CPU capacity whereas runnable_sum
+	 * is not, we rescale running_sum first.
+	 */
+	running_sum = se->avg.util_sum / arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
+	runnable_sum = max(runnable_sum, running_sum);
+
 	load_sum = (s64)se_weight(se) * runnable_sum;
 	load_avg = div_s64(load_sum, LOAD_AVG_MAX);
 
-	add_positive(&se->avg.load_sum, runnable_sum);
-	add_positive(&se->avg.load_avg, load_avg);
+	delta_sum = load_sum - (s64)se_weight(se) * se->avg.load_sum;
+	delta_avg = load_avg - se->avg.load_avg;
 
-	add_positive(&cfs_rq->avg.load_avg, load_avg);
-	add_positive(&cfs_rq->avg.load_sum, load_sum);
+	se->avg.load_sum = runnable_sum;
+	se->avg.load_avg = load_avg;
+	add_positive(&cfs_rq->avg.load_avg, delta_avg);
+	add_positive(&cfs_rq->avg.load_sum, delta_sum);
 
 	runnable_load_sum = (s64)se_runnable(se) * runnable_sum;
 	runnable_load_avg = div_s64(runnable_load_sum, LOAD_AVG_MAX);
+	delta_sum = runnable_load_sum - se_weight(se) * se->avg.runnable_load_sum;
+	delta_avg = runnable_load_avg - se->avg.runnable_load_avg;
 
-	add_positive(&se->avg.runnable_load_sum, runnable_sum);
-	add_positive(&se->avg.runnable_load_avg, runnable_load_avg);
+	se->avg.runnable_load_sum = runnable_sum;
+	se->avg.runnable_load_avg = runnable_load_avg;
 
 	if (se->on_rq) {
-		add_positive(&cfs_rq->avg.runnable_load_avg, runnable_load_avg);
-		add_positive(&cfs_rq->avg.runnable_load_sum, runnable_load_sum);
+		add_positive(&cfs_rq->avg.runnable_load_avg, delta_avg);
+		add_positive(&cfs_rq->avg.runnable_load_sum, delta_sum);
 	}
 }