From patchwork Wed Sep 28 13:13:09 2016
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 77082
Date: Wed, 28 Sep 2016 06:13:09 -0700
From: Vincent Guittot
To: Dietmar Eggemann
Cc: Peter Zijlstra, Matt Fleming, Ingo Molnar, linux-kernel, Mike Galbraith, Yuyang Du
Subject: Re: [PATCH] sched/fair: Do not decay new task load on first enqueue
Message-ID: <20160928131309.GA5483@vingu-laptop>
References: <20160923115808.2330-1-matt@codeblueprint.co.uk>
 <20160928101422.GR5016@twins.programming.kicks-ass.net>
 <20160928111912.GU5016@twins.programming.kicks-ass.net>
 <93981a94-8aff-edc9-b964-9f31656ad577@arm.com>
On Wednesday 28 Sep 2016 at 05:27:54 (-0700), Vincent Guittot wrote:
> On 28 September 2016 at 04:31, Dietmar Eggemann wrote:
> > On 28/09/16 12:19, Peter Zijlstra wrote:
> >> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote:
> >>> On 28/09/16 11:14, Peter Zijlstra wrote:
> >>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
> >
> > [...]
> >
> >>> Not sure what you mean by 'after fixing' but the se is initialized with
> >>> a possibly stale 'now' value in post_init_entity_util_avg()->
> >>> attach_entity_load_avg() before the clock is updated in
> >>> activate_task()->enqueue_task().
> >>
> >> I meant that after I fix the above issue of calling post_init with a
> >> stale clock. So the + update_rq_clock(rq) in the patch.
> >
> > OK.
> >
> > [...]
> >
> >>>> While staring at this, I don't think we can still hit
> >>>> vruntime_normalized() with a new task, so I _think_ we can remove that
> >>>> !se->sum_exec_runtime clause there (and rejoice), no?
> >>>
> >>> I'm afraid that with accurate timing we will get the same situation that
> >>> we add and subtract the same amount of load (probably 1024 now and not
> >>> 1002 (or less)) to/from cfs_rq->runnable_load_avg for the initial (fork)
> >>> hackbench run.
> >>> After all, it's 'runnable' based.
> >>
> >> The idea was that since we now update rq clock before post_init and then
> >> leave it be, both post_init and enqueue see the exact same timestamp,
> >> and the delta is 0, resulting in no aging.
> >>
> >> Or did I fail to make that happen?
> >
> > No, but IMHO what Matt wants is ageing for the hackbench tasks at the end
> > of their fork phase so there is a tiny amount of
> > cfs_rq->runnable_load_avg left on cpuX after the fork related dequeue so
> > the (load-based) fork-balancer chooses cpuY for the next hackbench task.
> > That's why he wanted to avoid the __update_load_avg(se) on enqueue (thus
> > adding 1024 to cfs_rq->runnable_load_avg) and do the ageing only on
> > dequeue (removing <1024 from cfs_rq->runnable_load_avg).
>
> Wanting cfs_rq->runnable_load_avg to be non-null when nothing is
> runnable on the cfs_rq seems a bit odd.
> We would be better off taking cfs_rq->avg.load_avg or
> cfs_rq->avg.util_avg into account in find_idlest_group() in this case.

IIUC the problem raised by Matt: he sees a regression because we now remove,
at dequeue, exactly the same load that was added at enqueue, so
cfs_rq->runnable_load_avg is null and we may select a cfs_rq that already has
a lot of blocked hackbench threads. The fact that runnable_load_avg is null
when the cfs_rq has no runnable task is correct and we should keep that
behaviour. But when we look for the idlest group, we have to take the blocked
threads into account.
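To make that concrete, here is a minimal user-space sketch (illustrative
only, not kernel code: the struct, helper and numbers are made up for the
example) of why a selection based purely on runnable load cannot tell a CPU
whose hackbench threads have all just blocked apart from a truly idle CPU,
whereas a tie-break on the blocked (load_avg-style) contribution can:

#include <stdio.h>

/*
 * Toy per-cpu accounting mirroring the distinction between
 * runnable_load_avg (runnable tasks only, drops to 0 on dequeue)
 * and avg.load_avg (still carries the blocked contribution).
 */
struct toy_cfs_rq {
	unsigned long runnable_load_avg;
	unsigned long load_avg;
};

/* Pick the cpu with the smallest runnable load, breaking ties on blocked load. */
static int pick_idlest(const struct toy_cfs_rq *rq, int nr_cpu)
{
	unsigned long best_runnable = ~0UL, best_blocked = ~0UL;
	int i, best = 0;

	for (i = 0; i < nr_cpu; i++) {
		if (rq[i].runnable_load_avg < best_runnable ||
		    (rq[i].runnable_load_avg == best_runnable &&
		     rq[i].load_avg < best_blocked)) {
			best_runnable = rq[i].runnable_load_avg;
			best_blocked = rq[i].load_avg;
			best = i;
		}
	}
	return best;
}

int main(void)
{
	/* cpu0: 10 freshly blocked hackbench threads; cpu1: really idle. */
	struct toy_cfs_rq rq[2] = {
		{ .runnable_load_avg = 0, .load_avg = 10 * 1024 },
		{ .runnable_load_avg = 0, .load_avg = 0 },
	};

	/* Both cpus have a null runnable load; only the tie-break picks cpu1. */
	printf("idlest cpu: %d\n", pick_idlest(rq, 2));
	return 0;
}

Looking only at runnable load, both cpus report 0 and the balancer can keep
stacking forked tasks on cpu0; the blocked-load tie-break is what steers the
next task to cpu1.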
That's what I have tried to do below:

---
 kernel/sched/fair.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b3c47..702915e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5353,7 +5353,8 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
-	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long min_runnable_load = ULONG_MAX, this_load = 0;
+	unsigned long min_avg_load = ULONG_MAX;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5361,7 +5362,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, runnable_load;
 		int local_group;
 		int i;
 
@@ -5375,6 +5376,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		runnable_load = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5383,21 +5385,35 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 			else
 				load = target_load(i, load_idx);
 
-			avg_load += load;
+			runnable_load += load;
+
+			avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
 		avg_load = (avg_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
+		runnable_load = (runnable_load * SCHED_CAPACITY_SCALE) / group->sgc->capacity;
 
 		if (local_group) {
-			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
+			this_load = runnable_load;
+		} else if (runnable_load < min_runnable_load) {
+			min_runnable_load = runnable_load;
+			min_avg_load = avg_load;
+			idlest = group;
+		} else if ((runnable_load == min_runnable_load) && (avg_load < min_avg_load)) {
+			/*
+			 * In case we have the same runnable load (especially a null
+			 * runnable load), select the group with the smallest blocked
+			 * load.
+			 */
+			min_avg_load = avg_load;
+			min_runnable_load = runnable_load;
 			idlest = group;
 		}
+
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest || 100*this_load < imbalance*min_runnable_load)
 		return NULL;
 	return idlest;
 }
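For reference on the final check (numbers illustrative, assuming the common
sd->imbalance_pct of 125 at this domain level): imbalance = 100 + (125 - 100)/2
= 112, so we still return NULL and stay in the local group unless
100 * this_load >= 112 * min_runnable_load, i.e. unless the remote group's
scaled runnable load is roughly 11% or more below the local one. The only
change to that condition is that it now compares runnable loads instead of
avg_load.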