From patchwork Fri Jul 19 07:58:21 2019
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 169231
From: Vincent Guittot
To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Cc: quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, pauld@redhat.com, Vincent Guittot
Subject: [PATCH 1/5] sched/fair: clean up asym packing
Date: Fri, 19 Jul 2019 09:58:21 +0200
Message-Id: <1563523105-24673-2-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>
References: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>

Clean up asym packing to follow the default load-balance behavior:
- classify the group by creating a group_asym_capacity field.
- calculate the imbalance in calculate_imbalance() instead of bypassing it. We don't need to test twice same conditions anymore to detect asym packing and we consolidate the calculation of imbalance in calculate_imbalance(). There is no functional changes. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 63 ++++++++++++++--------------------------------------- 1 file changed, 16 insertions(+), 47 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index fff5632..7a530fd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7654,6 +7654,7 @@ struct sg_lb_stats { unsigned int group_weight; enum group_type group_type; int group_no_capacity; + int group_asym_capacity; unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; @@ -8110,9 +8111,17 @@ static bool update_sd_pick_busiest(struct lb_env *env, * ASYM_PACKING needs to move all the work to the highest * prority CPUs in the group, therefore mark all groups * of lower priority than ourself as busy. + * + * This is primarily intended to used at the sibling level. Some + * cores like POWER7 prefer to use lower numbered SMT threads. In the + * case of POWER7, it can move to lower SMT modes only when higher + * threads are idle. When in lower SMT modes, the threads will + * perform better since they share less core resources. Hence when we + * have idle threads, we want them to be the higher ones. */ if (sgs->sum_nr_running && sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { + sgs->group_asym_capacity = 1; if (!sds->busiest) return true; @@ -8254,51 +8263,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } /** - * check_asym_packing - Check to see if the group is packed into the - * sched domain. - * - * This is primarily intended to used at the sibling level. Some - * cores like POWER7 prefer to use lower numbered SMT threads. In the - * case of POWER7, it can move to lower SMT modes only when higher - * threads are idle. When in lower SMT modes, the threads will - * perform better since they share less core resources. Hence when we - * have idle threads, we want them to be the higher ones. - * - * This packing function is run on idle threads. It checks to see if - * the busiest CPU in this domain (core in the P7 case) has a higher - * CPU number than the packing function is being run on. Here we are - * assuming lower CPU number will be equivalent to lower a SMT thread - * number. - * - * Return: 1 when packing is required and a task should be moved to - * this CPU. The amount of the imbalance is returned in env->imbalance. - * - * @env: The load balancing environment. - * @sds: Statistics of the sched_domain which is to be packed - */ -static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds) -{ - int busiest_cpu; - - if (!(env->sd->flags & SD_ASYM_PACKING)) - return 0; - - if (env->idle == CPU_NOT_IDLE) - return 0; - - if (!sds->busiest) - return 0; - - busiest_cpu = sds->busiest->asym_prefer_cpu; - if (sched_asym_prefer(busiest_cpu, env->dst_cpu)) - return 0; - - env->imbalance = sds->busiest_stat.group_load; - - return 1; -} - -/** * fix_small_imbalance - Calculate the minor imbalance that exists * amongst the groups of a sched_domain, during * load balancing. 
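/*
 * [Editor's illustration -- not part of the patch.] A minimal, self-contained
 * sketch of the flow this patch introduces, using simplified stand-in types
 * (sketch_stats and the *_sketch helpers are hypothetical, not kernel code):
 * the ASYM_PACKING condition is recorded once as group_asym_capacity while
 * the busiest group is being classified, and calculate_imbalance() consumes
 * that flag instead of re-testing the same conditions in check_asym_packing().
 */
struct sketch_stats {
	unsigned long group_load;	/* load of the candidate group */
	int group_asym_capacity;	/* dst_cpu has higher asym priority */
};

/* classification side: done while updating/picking the busiest group */
static void sketch_classify(struct sketch_stats *sgs, unsigned int nr_running,
			    int dst_cpu_preferred)
{
	if (nr_running && dst_cpu_preferred)
		sgs->group_asym_capacity = 1;
}

/* imbalance side: move the whole group load toward the preferred CPU */
static unsigned long sketch_imbalance(const struct sketch_stats *sgs)
{
	if (sgs->group_asym_capacity)
		return sgs->group_load;

	return 0;	/* fall through to the default imbalance calculation */
}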
@@ -8382,6 +8346,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->group_asym_capacity) {
+		env->imbalance = busiest->group_load;
+		return;
+	}
+
 	if (busiest->group_type == group_imbalanced) {
 		/*
 		 * In the group_imb case we cannot rely on group-wide averages
@@ -8486,8 +8455,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	busiest = &sds.busiest_stat;
 
 	/* ASYM feature bypasses nice load balance check */
-	if (check_asym_packing(env, &sds))
-		return sds.busiest;
+	if (busiest->group_asym_capacity)
+		goto force_balance;
 
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest || busiest->sum_nr_running == 0)

From patchwork Fri Jul 19 07:58:22 2019
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 169227
From: Vincent Guittot
To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Cc: quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, pauld@redhat.com, Vincent Guittot
Subject: [PATCH 2/5] sched/fair: rename sum_nr_running to sum_h_nr_running
Date: Fri, 19 Jul 2019 09:58:22 +0200
Message-Id: <1563523105-24673-3-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>
References: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>

sum_nr_running will track rq->nr_running tasks and sum_h_nr_running will track
cfs->h_nr_running, so we can use both to detect when other scheduling classes are
running and preempt CFS. There is no functional changes. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7a530fd..67f0acd 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7650,6 +7650,7 @@ struct sg_lb_stats { unsigned long group_capacity; unsigned long group_util; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ + unsigned int sum_h_nr_running; /* Nr tasks running in the group */ unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; @@ -7695,6 +7696,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) .busiest_stat = { .avg_load = 0UL, .sum_nr_running = 0, + .sum_h_nr_running = 0, .group_type = group_other, }, }; @@ -7885,7 +7887,7 @@ static inline int sg_imbalanced(struct sched_group *group) static inline bool group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running < sgs->group_weight) + if (sgs->sum_h_nr_running < sgs->group_weight) return true; if ((sgs->group_capacity * 100) > @@ -7906,7 +7908,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) static inline bool group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running <= sgs->group_weight) + if (sgs->sum_h_nr_running <= sgs->group_weight) return false; if ((sgs->group_capacity * 100) < @@ -8000,6 +8002,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_load += cpu_runnable_load(rq); sgs->group_util += cpu_util(i); + sgs->sum_h_nr_running += rq->cfs.h_nr_running; sgs->sum_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; @@ -8030,8 +8033,8 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_capacity = group->sgc->capacity; sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; - if (sgs->sum_nr_running) - sgs->load_per_task = sgs->group_load / sgs->sum_nr_running; + if (sgs->sum_h_nr_running) + sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running; sgs->group_weight = group->group_weight; @@ -8088,7 +8091,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * capable CPUs may harm throughput. Maximize throughput, * power/energy consequences are not considered. */ - if (sgs->sum_nr_running <= sgs->group_weight && + if (sgs->sum_h_nr_running <= sgs->group_weight && group_smaller_min_cpu_capacity(sds->local, sg)) return false; @@ -8119,7 +8122,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * perform better since they share less core resources. Hence when we * have idle threads, we want them to be the higher ones. 
*/ - if (sgs->sum_nr_running && + if (sgs->sum_h_nr_running && sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { sgs->group_asym_capacity = 1; if (!sds->busiest) @@ -8137,9 +8140,9 @@ static bool update_sd_pick_busiest(struct lb_env *env, #ifdef CONFIG_NUMA_BALANCING static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs) { - if (sgs->sum_nr_running > sgs->nr_numa_running) + if (sgs->sum_h_nr_running > sgs->nr_numa_running) return regular; - if (sgs->sum_nr_running > sgs->nr_preferred_running) + if (sgs->sum_h_nr_running > sgs->nr_preferred_running) return remote; return all; } @@ -8214,7 +8217,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd */ if (prefer_sibling && sds->local && group_has_capacity(env, local) && - (sgs->sum_nr_running > local->sum_nr_running + 1)) { + (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { sgs->group_no_capacity = 1; sgs->group_type = group_classify(sg, sgs); } @@ -8226,7 +8229,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd next_group: /* Now, start updating sd_lb_stats */ - sds->total_running += sgs->sum_nr_running; + sds->total_running += sgs->sum_h_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; @@ -8280,7 +8283,7 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) local = &sds->local_stat; busiest = &sds->busiest_stat; - if (!local->sum_nr_running) + if (!local->sum_h_nr_running) local->load_per_task = cpu_avg_load_per_task(env->dst_cpu); else if (busiest->load_per_task > local->load_per_task) imbn = 1; @@ -8378,7 +8381,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s */ if (busiest->group_type == group_overloaded && local->group_type == group_overloaded) { - load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE; + load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE; if (load_above_capacity > busiest->group_capacity) { load_above_capacity -= busiest->group_capacity; load_above_capacity *= scale_load_down(NICE_0_LOAD); @@ -8459,7 +8462,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) goto force_balance; /* There is no busy sibling group to pull tasks from */ - if (!sds.busiest || busiest->sum_nr_running == 0) + if (!sds.busiest || busiest->sum_h_nr_running == 0) goto out_balanced; /* XXX broken for overlapping NUMA groups */ @@ -8781,7 +8784,7 @@ static int load_balance(int this_cpu, struct rq *this_rq, env.src_rq = busiest; ld_moved = 0; - if (busiest->nr_running > 1) { + if (busiest->cfs.h_nr_running > 1) { /* * Attempt to move tasks. 
If find_busiest_group has found * an imbalance but busiest->nr_running <= 1, the group is

From patchwork Fri Jul 19 07:58:23 2019
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 169228
From: Vincent Guittot
To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org
Cc: quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, pauld@redhat.com, Vincent Guittot
Subject: [PATCH 3/5] sched/fair: rework load_balance
Date: Fri, 19 Jul 2019 09:58:23 +0200
Message-Id: <1563523105-24673-4-git-send-email-vincent.guittot@linaro.org>
In-Reply-To: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>
References: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org>

The load_balance algorithm contains some heuristics which have become meaningless since the rework of the metrics and the introduction of PELT.
Furthermore, it is sometimes difficult to fix wrong scheduling decisions because everything is based on load, whereas some imbalances are not related to load at all. The current algorithm ends up creating virtual and meaningless values like avg_load_per_task, or tweaking the state of a group to make it look overloaded when it is not, just so that tasks get migrated. load_balance should better qualify the imbalance of the group and define clearly what has to be moved to fix this imbalance.

The type of sched_group has been extended to better reflect the type of imbalance. We now have:
  group_has_spare
  group_fully_busy
  group_misfit_task
  group_asym_capacity
  group_imbalanced
  group_overloaded

Based on the type of sched_group, load_balance now sets what it wants to move in order to fix the imbalance. It can be some load as before, but also some utilization, a number of tasks or a type of task:
  migrate_task
  migrate_util
  migrate_load
  migrate_misfit

This new load_balance algorithm fixes several pending cases of wrong task placement:
- the 1-task-per-CPU case on asymmetric systems
- the case of cfs tasks preempted by tasks of another scheduling class
- the case of tasks not evenly spread on groups with spare capacity

The load balance decisions have been gathered in 3 functions:
- update_sd_pick_busiest() selects the busiest sched_group.
- find_busiest_group() checks if there is an imbalance between the local and the busiest group.
- calculate_imbalance() decides what has to be moved.

Signed-off-by: Vincent Guittot
---
 kernel/sched/fair.c | 539 ++++++++++++++++++++++++++++------------------------
 1 file changed, 289 insertions(+), 250 deletions(-)

-- 
2.7.4

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 67f0acd..472959df 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -3771,7 +3771,7 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq) return; } - rq->misfit_task_load = task_h_load(p); + rq->misfit_task_load = task_util_est(p); } #else /* CONFIG_SMP */ @@ -5376,18 +5376,6 @@ static unsigned long capacity_of(int cpu) return cpu_rq(cpu)->cpu_capacity; } -static unsigned long cpu_avg_load_per_task(int cpu) -{ - struct rq *rq = cpu_rq(cpu); - unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running); - unsigned long load_avg = cpu_runnable_load(rq); - - if (nr_running) - return load_avg / nr_running; - - return 0; -} - static void record_wakee(struct task_struct *p) { /* @@ -7060,12 +7048,21 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10; enum fbq_type { regular, remote, all }; enum group_type { - group_other = 0, + group_has_spare = 0, + group_fully_busy, group_misfit_task, + group_asym_capacity, group_imbalanced, group_overloaded, }; +enum group_migration { + migrate_task = 0, + migrate_util, + migrate_load, + migrate_misfit, +}; + #define LBF_ALL_PINNED 0x01 #define LBF_NEED_BREAK 0x02 #define LBF_DST_PINNED 0x04 @@ -7096,7 +7093,7 @@ struct lb_env { unsigned int loop_max; enum fbq_type fbq_type; - enum group_type src_grp_type; + enum group_migration src_grp_type; struct list_head tasks; }; @@ -7328,7 +7325,6 @@ static int detach_tasks(struct lb_env *env) { struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; - unsigned long load; int detached = 0; lockdep_assert_held(&env->src_rq->lock); @@ -7361,19 +7357,46 @@ static int detach_tasks(struct lb_env *env) if (!can_migrate_task(p, env)) goto next; - load = task_h_load(p); + if (env->src_grp_type == migrate_load) { + unsigned long load = task_h_load(p); - if (sched_feat(LB_MIN) && load < 16 &&
!env->sd->nr_balance_failed) - goto next; + if (sched_feat(LB_MIN) && + load < 16 && !env->sd->nr_balance_failed) + goto next; + + if ((load / 2) > env->imbalance) + goto next; + + env->imbalance -= load; + } else if (env->src_grp_type == migrate_util) { + unsigned long util = task_util_est(p); + + if (util > env->imbalance) + goto next; + + env->imbalance -= util; + } else if (env->src_grp_type == migrate_misfit) { + unsigned long util = task_util_est(p); + + /* + * utilization of misfit task might decrease a bit + * since it has been recorded. Be conservative in the + * condition. + */ + if (2*util < env->imbalance) + goto next; + + env->imbalance = 0; + } else { + /* Migrate task */ + env->imbalance--; + } - if ((load / 2) > env->imbalance) - goto next; detach_task(p, env); list_add(&p->se.group_node, &env->tasks); detached++; - env->imbalance -= load; #ifdef CONFIG_PREEMPT /* @@ -7646,7 +7669,6 @@ static unsigned long task_h_load(struct task_struct *p) struct sg_lb_stats { unsigned long avg_load; /*Avg load across the CPUs of the group */ unsigned long group_load; /* Total load over the CPUs of the group */ - unsigned long load_per_task; unsigned long group_capacity; unsigned long group_util; /* Total utilization of the group */ unsigned int sum_nr_running; /* Nr tasks running in the group */ @@ -7654,8 +7676,7 @@ struct sg_lb_stats { unsigned int idle_cpus; unsigned int group_weight; enum group_type group_type; - int group_no_capacity; - int group_asym_capacity; + unsigned int group_asym_capacity; /* tasks should be move to preferred cpu */ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */ #ifdef CONFIG_NUMA_BALANCING unsigned int nr_numa_running; @@ -7670,10 +7691,10 @@ struct sg_lb_stats { struct sd_lb_stats { struct sched_group *busiest; /* Busiest group in this sd */ struct sched_group *local; /* Local group in this sd */ - unsigned long total_running; unsigned long total_load; /* Total load of all groups in sd */ unsigned long total_capacity; /* Total capacity of all groups in sd */ unsigned long avg_load; /* Average load across all groups in sd */ + unsigned int prefer_sibling; /* tasks should go to sibling first */ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */ struct sg_lb_stats local_stat; /* Statistics of the local group */ @@ -7690,14 +7711,14 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) *sds = (struct sd_lb_stats){ .busiest = NULL, .local = NULL, - .total_running = 0UL, .total_load = 0UL, .total_capacity = 0UL, .busiest_stat = { .avg_load = 0UL, .sum_nr_running = 0, .sum_h_nr_running = 0, - .group_type = group_other, + .idle_cpus = UINT_MAX, + .group_type = group_has_spare, }, }; } @@ -7887,7 +7908,7 @@ static inline int sg_imbalanced(struct sched_group *group) static inline bool group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_h_nr_running < sgs->group_weight) + if (sgs->sum_nr_running < sgs->group_weight) return true; if ((sgs->group_capacity * 100) > @@ -7908,7 +7929,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs) static inline bool group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs) { - if (sgs->sum_h_nr_running <= sgs->group_weight) + if (sgs->sum_nr_running <= sgs->group_weight) return false; if ((sgs->group_capacity * 100) < @@ -7941,10 +7962,11 @@ group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref) } static inline enum -group_type group_classify(struct sched_group *group, +group_type 
group_classify(struct lb_env *env, + struct sched_group *group, struct sg_lb_stats *sgs) { - if (sgs->group_no_capacity) + if (group_is_overloaded(env, sgs)) return group_overloaded; if (sg_imbalanced(group)) @@ -7953,7 +7975,13 @@ group_type group_classify(struct sched_group *group, if (sgs->group_misfit_task_load) return group_misfit_task; - return group_other; + if (sgs->group_asym_capacity) + return group_asym_capacity; + + if (group_has_capacity(env, sgs)) + return group_has_spare; + + return group_fully_busy; } static bool update_nohz_stats(struct rq *rq, bool force) @@ -7990,10 +8018,12 @@ static inline void update_sg_lb_stats(struct lb_env *env, struct sg_lb_stats *sgs, int *sg_status) { - int i, nr_running; + int i, nr_running, local_group; memset(sgs, 0, sizeof(*sgs)); + local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group)); + for_each_cpu_and(i, sched_group_span(group), env->cpus) { struct rq *rq = cpu_rq(i); @@ -8003,9 +8033,9 @@ static inline void update_sg_lb_stats(struct lb_env *env, sgs->group_load += cpu_runnable_load(rq); sgs->group_util += cpu_util(i); sgs->sum_h_nr_running += rq->cfs.h_nr_running; - sgs->sum_nr_running += rq->cfs.h_nr_running; - nr_running = rq->nr_running; + sgs->sum_nr_running += nr_running; + if (nr_running > 1) *sg_status |= SG_OVERLOAD; @@ -8022,6 +8052,10 @@ static inline void update_sg_lb_stats(struct lb_env *env, if (!nr_running && idle_cpu(i)) sgs->idle_cpus++; + if (local_group) + continue; + + /* Check for a misfit task on the cpu */ if (env->sd->flags & SD_ASYM_CPUCAPACITY && sgs->group_misfit_task_load < rq->misfit_task_load) { sgs->group_misfit_task_load = rq->misfit_task_load; @@ -8029,17 +8063,24 @@ static inline void update_sg_lb_stats(struct lb_env *env, } } - /* Adjust by relative CPU capacity of the group */ - sgs->group_capacity = group->sgc->capacity; - sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity; + /* Check if dst cpu is idle and preferred to this group */ + if (env->sd->flags & SD_ASYM_PACKING && + env->idle != CPU_NOT_IDLE && + sgs->sum_h_nr_running && + sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) { + sgs->group_asym_capacity = 1; + } - if (sgs->sum_h_nr_running) - sgs->load_per_task = sgs->group_load / sgs->sum_h_nr_running; + sgs->group_capacity = group->sgc->capacity; sgs->group_weight = group->group_weight; - sgs->group_no_capacity = group_is_overloaded(env, sgs); - sgs->group_type = group_classify(group, sgs); + sgs->group_type = group_classify(env, group, sgs); + + /* Computing avg_load makes sense only when group is overloaded */ + if (sgs->group_type != group_overloaded) + sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / + sgs->group_capacity; } /** @@ -8070,7 +8111,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, */ if (sgs->group_type == group_misfit_task && (!group_smaller_max_cpu_capacity(sg, sds->local) || - !group_has_capacity(env, &sds->local_stat))) + sds->local_stat.group_type != group_has_spare)) return false; if (sgs->group_type > busiest->group_type) @@ -8079,11 +8120,18 @@ static bool update_sd_pick_busiest(struct lb_env *env, if (sgs->group_type < busiest->group_type) return false; - if (sgs->avg_load <= busiest->avg_load) + /* Select the overloaded group with highest avg_load */ + if (sgs->group_type == group_overloaded && + sgs->avg_load <= busiest->avg_load) + return false; + + /* Prefer to move from lowest priority CPU's work */ + if (sgs->group_type == group_asym_capacity && sds->busiest && + 
sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu)) return false; if (!(env->sd->flags & SD_ASYM_CPUCAPACITY)) - goto asym_packing; + goto spare_capacity; /* * Candidate sg has no more than one task per CPU and @@ -8091,7 +8139,7 @@ static bool update_sd_pick_busiest(struct lb_env *env, * capable CPUs may harm throughput. Maximize throughput, * power/energy consequences are not considered. */ - if (sgs->sum_h_nr_running <= sgs->group_weight && + if (sgs->group_type <= group_fully_busy && group_smaller_min_cpu_capacity(sds->local, sg)) return false; @@ -8102,39 +8150,32 @@ static bool update_sd_pick_busiest(struct lb_env *env, sgs->group_misfit_task_load < busiest->group_misfit_task_load) return false; -asym_packing: - /* This is the busiest node in its class. */ - if (!(env->sd->flags & SD_ASYM_PACKING)) - return true; - - /* No ASYM_PACKING if target CPU is already busy */ - if (env->idle == CPU_NOT_IDLE) - return true; +spare_capacity: /* - * ASYM_PACKING needs to move all the work to the highest - * prority CPUs in the group, therefore mark all groups - * of lower priority than ourself as busy. - * - * This is primarily intended to used at the sibling level. Some - * cores like POWER7 prefer to use lower numbered SMT threads. In the - * case of POWER7, it can move to lower SMT modes only when higher - * threads are idle. When in lower SMT modes, the threads will - * perform better since they share less core resources. Hence when we - * have idle threads, we want them to be the higher ones. - */ - if (sgs->sum_h_nr_running && - sched_asym_prefer(env->dst_cpu, sg->asym_prefer_cpu)) { - sgs->group_asym_capacity = 1; - if (!sds->busiest) - return true; + * Select not overloaded group with lowest number of idle cpus. + * We could also compare the spare capacity which is more stable + * but it can end up that the group has less spare capacity but + * finally more idle cpus which means less opportunity to pull + * tasks. + */ + if (sgs->group_type == group_has_spare && + sgs->idle_cpus > busiest->idle_cpus) + return false; - /* Prefer to move from lowest priority CPU's work */ - if (sched_asym_prefer(sds->busiest->asym_prefer_cpu, - sg->asym_prefer_cpu)) - return true; - } + /* + * Select the fully busy group with highest avg_load. + * In theory, there is no need to pull task from such kind of group + * because tasks have all compute capacity that they need but we can + * still improve the overall throughput by reducing contention + * when accessing shared HW resources. + * XXX for now avg_load is not computed and always 0 so we select the + * 1st one. + */ + if (sgs->group_type == group_fully_busy && + sgs->avg_load <= busiest->avg_load) + return false; - return false; + return true; } #ifdef CONFIG_NUMA_BALANCING @@ -8172,13 +8213,13 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq) * @env: The load balancing environment. * @sds: variable to hold the statistics for this sched_domain. 
*/ + static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds) { struct sched_domain *child = env->sd->child; struct sched_group *sg = env->sd->groups; struct sg_lb_stats *local = &sds->local_stat; struct sg_lb_stats tmp_sgs; - bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING; int sg_status = 0; #ifdef CONFIG_NO_HZ_COMMON @@ -8205,22 +8246,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd if (local_group) goto next_group; - /* - * In case the child domain prefers tasks go to siblings - * first, lower the sg capacity so that we'll try - * and move all the excess tasks away. We lower the capacity - * of a group only if the local group has the capacity to fit - * these excess tasks. The extra check prevents the case where - * you always pull from the heaviest group when it is already - * under-utilized (possible with a large weight task outweighs - * the tasks on the system). - */ - if (prefer_sibling && sds->local && - group_has_capacity(env, local) && - (sgs->sum_h_nr_running > local->sum_h_nr_running + 1)) { - sgs->group_no_capacity = 1; - sgs->group_type = group_classify(sg, sgs); - } if (update_sd_pick_busiest(env, sds, sg, sgs)) { sds->busiest = sg; @@ -8229,13 +8254,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd next_group: /* Now, start updating sd_lb_stats */ - sds->total_running += sgs->sum_h_nr_running; sds->total_load += sgs->group_load; sds->total_capacity += sgs->group_capacity; sg = sg->next; } while (sg != env->sd->groups); + /* Tag domain that child domain prefers tasks go to siblings first */ + sds->prefer_sibling = child && child->flags & SD_PREFER_SIBLING; + #ifdef CONFIG_NO_HZ_COMMON if ((env->flags & LBF_NOHZ_AGAIN) && cpumask_subset(nohz.idle_cpus_mask, sched_domain_span(env->sd))) { @@ -8266,76 +8293,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd } /** - * fix_small_imbalance - Calculate the minor imbalance that exists - * amongst the groups of a sched_domain, during - * load balancing. - * @env: The load balancing environment. - * @sds: Statistics of the sched_domain whose imbalance is to be calculated. - */ -static inline -void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) -{ - unsigned long tmp, capa_now = 0, capa_move = 0; - unsigned int imbn = 2; - unsigned long scaled_busy_load_per_task; - struct sg_lb_stats *local, *busiest; - - local = &sds->local_stat; - busiest = &sds->busiest_stat; - - if (!local->sum_h_nr_running) - local->load_per_task = cpu_avg_load_per_task(env->dst_cpu); - else if (busiest->load_per_task > local->load_per_task) - imbn = 1; - - scaled_busy_load_per_task = - (busiest->load_per_task * SCHED_CAPACITY_SCALE) / - busiest->group_capacity; - - if (busiest->avg_load + scaled_busy_load_per_task >= - local->avg_load + (scaled_busy_load_per_task * imbn)) { - env->imbalance = busiest->load_per_task; - return; - } - - /* - * OK, we don't have enough imbalance to justify moving tasks, - * however we may be able to increase total CPU capacity used by - * moving them. 
- */ - - capa_now += busiest->group_capacity * - min(busiest->load_per_task, busiest->avg_load); - capa_now += local->group_capacity * - min(local->load_per_task, local->avg_load); - capa_now /= SCHED_CAPACITY_SCALE; - - /* Amount of load we'd subtract */ - if (busiest->avg_load > scaled_busy_load_per_task) { - capa_move += busiest->group_capacity * - min(busiest->load_per_task, - busiest->avg_load - scaled_busy_load_per_task); - } - - /* Amount of load we'd add */ - if (busiest->avg_load * busiest->group_capacity < - busiest->load_per_task * SCHED_CAPACITY_SCALE) { - tmp = (busiest->avg_load * busiest->group_capacity) / - local->group_capacity; - } else { - tmp = (busiest->load_per_task * SCHED_CAPACITY_SCALE) / - local->group_capacity; - } - capa_move += local->group_capacity * - min(local->load_per_task, local->avg_load + tmp); - capa_move /= SCHED_CAPACITY_SCALE; - - /* Move if we gain throughput */ - if (capa_move > capa_now) - env->imbalance = busiest->load_per_task; -} - -/** * calculate_imbalance - Calculate the amount of imbalance present within the * groups of a given sched_domain during load balance. * @env: load balance environment @@ -8343,13 +8300,17 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds) */ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds) { - unsigned long max_pull, load_above_capacity = ~0UL; struct sg_lb_stats *local, *busiest; local = &sds->local_stat; busiest = &sds->busiest_stat; - if (busiest->group_asym_capacity) { + if (busiest->group_type == group_asym_capacity) { + /* + * In case of asym capacity, we will try to migrate all load + * to the preferred CPU + */ + env->src_grp_type = migrate_load; env->imbalance = busiest->group_load; return; } @@ -8357,72 +8318,115 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s if (busiest->group_type == group_imbalanced) { /* * In the group_imb case we cannot rely on group-wide averages - * to ensure CPU-load equilibrium, look at wider averages. XXX + * to ensure CPU-load equilibrium, try to move any task to fix + * the imbalance. The next load balance will take care of + * balancing back the system. */ - busiest->load_per_task = - min(busiest->load_per_task, sds->avg_load); + env->src_grp_type = migrate_task; + env->imbalance = 1; + return; } - /* - * Avg load of busiest sg can be less and avg load of local sg can - * be greater than avg load across all sgs of sd because avg load - * factors in sg capacity and sgs with smaller group_type are - * skipped when updating the busiest sg: - */ - if (busiest->group_type != group_misfit_task && - (busiest->avg_load <= sds->avg_load || - local->avg_load >= sds->avg_load)) { - env->imbalance = 0; - return fix_small_imbalance(env, sds); + if (busiest->group_type == group_misfit_task) { + /* Set imbalance to allow misfit task to be balanced. */ + env->src_grp_type = migrate_misfit; + env->imbalance = busiest->group_misfit_task_load; + return; } /* - * If there aren't any idle CPUs, avoid creating some. 
+ * Try to use spare capacity of local group without overloading it or + * emptying busiest */ - if (busiest->group_type == group_overloaded && - local->group_type == group_overloaded) { - load_above_capacity = busiest->sum_h_nr_running * SCHED_CAPACITY_SCALE; - if (load_above_capacity > busiest->group_capacity) { - load_above_capacity -= busiest->group_capacity; - load_above_capacity *= scale_load_down(NICE_0_LOAD); - load_above_capacity /= busiest->group_capacity; - } else - load_above_capacity = ~0UL; + if (local->group_type == group_has_spare) { + long imbalance; + + /* + * If there is no overload, we just want to even the number of + * idle cpus. + */ + env->src_grp_type = migrate_task; + imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1); + + if (sds->prefer_sibling) + /* + * When prefer sibling, evenly spread running tasks on + * groups. + */ + imbalance = (busiest->sum_nr_running - local->sum_nr_running) >> 1; + + if (busiest->group_type > group_fully_busy) { + /* + * If busiest is overloaded, try to fill spare + * capacity. This might end up creating spare capacity + * in busiest or busiest still being overloaded but + * there is no simple way to directly compute the + * amount of load to migrate in order to balance the + * system. + */ + env->src_grp_type = migrate_util; + imbalance = max(local->group_capacity, local->group_util) - + local->group_util; + } + + env->imbalance = imbalance; + return; } /* - * We're trying to get all the CPUs to the average_load, so we don't - * want to push ourselves above the average load, nor do we wish to - * reduce the max loaded CPU below the average load. At the same time, - * we also don't want to reduce the group load below the group - * capacity. Thus we look for the minimum possible imbalance. + * Local is fully busy but have to take more load to relieve the + * busiest group */ - max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity); + if (local->group_type < group_overloaded) { + /* + * Local will become overvloaded so the avg_load metrics are + * finally needed + */ - /* How much load to actually move to equalise the imbalance */ - env->imbalance = min( - max_pull * busiest->group_capacity, - (sds->avg_load - local->avg_load) * local->group_capacity - ) / SCHED_CAPACITY_SCALE; + local->avg_load = (local->group_load*SCHED_CAPACITY_SCALE) + / local->group_capacity; - /* Boost imbalance to allow misfit task to be balanced. */ - if (busiest->group_type == group_misfit_task) { - env->imbalance = max_t(long, env->imbalance, - busiest->group_misfit_task_load); + sds->avg_load = (SCHED_CAPACITY_SCALE * sds->total_load) + / sds->total_capacity; } /* - * if *imbalance is less than the average load per runnable task - * there is no guarantee that any tasks will be moved so we'll have - * a think about bumping its value to force at least one task to be - * moved + * Both group are or will become overloaded and we're trying to get + * all the CPUs to the average_load, so we don't want to push + * ourselves above the average load, nor do we wish to reduce the + * max loaded CPU below the average load. At the same time, we also + * don't want to reduce the group load below the group capacity. + * Thus we look for the minimum possible imbalance. 
*/ - if (env->imbalance < busiest->load_per_task) - return fix_small_imbalance(env, sds); + env->src_grp_type = migrate_load; + env->imbalance = min( + (busiest->avg_load - sds->avg_load) * busiest->group_capacity, + (sds->avg_load - local->avg_load) * local->group_capacity + ) / SCHED_CAPACITY_SCALE; } /******* find_busiest_group() helpers end here *********************/ +/* + * Decision matrix according to the local and busiest group state + * + * busiest \ local has_spare fully_busy misfit asym imbalanced overloaded + * has_spare nr_idle balanced N/A N/A balanced balanced + * fully_busy nr_idle nr_idle N/A N/A balanced balanced + * misfit_task force N/A N/A N/A force force + * asym_capacity force force N/A N/A force force + * imbalanced force force N/A N/A force force + * overloaded force force N/A N/A force avg_load + * + * N/A : Not Applicable because already filtered while updating + * statistics. + * balanced : The system is balanced for these 2 groups. + * force : Calculate the imbalance as load migration is probably needed. + * avg_load : Only if imbalance is significant enough. + * nr_idle : dst_cpu is not busy and the number of idle cpus is quite + * different in groups. + */ + /** * find_busiest_group - Returns the busiest group within the sched_domain * if there is an imbalance. @@ -8457,17 +8461,13 @@ static struct sched_group *find_busiest_group(struct lb_env *env) local = &sds.local_stat; busiest = &sds.busiest_stat; - /* ASYM feature bypasses nice load balance check */ - if (busiest->group_asym_capacity) - goto force_balance; - /* There is no busy sibling group to pull tasks from */ if (!sds.busiest || busiest->sum_h_nr_running == 0) goto out_balanced; - /* XXX broken for overlapping NUMA groups */ - sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) - / sds.total_capacity; + /* ASYM feature bypasses nice load balance check */ + if (busiest->group_type == group_asym_capacity) + goto force_balance; /* * If the busiest group is imbalanced the below checks don't @@ -8477,14 +8477,6 @@ static struct sched_group *find_busiest_group(struct lb_env *env) if (busiest->group_type == group_imbalanced) goto force_balance; - /* - * When dst_cpu is idle, prevent SMP nice and/or asymmetric group - * capacities from resulting in underutilization due to avg_load. - */ - if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) && - busiest->group_no_capacity) - goto force_balance; - /* Misfit tasks should be dealt with regardless of the avg load */ if (busiest->group_type == group_misfit_task) goto force_balance; @@ -8493,44 +8485,68 @@ static struct sched_group *find_busiest_group(struct lb_env *env) * If the local group is busier than the selected busiest group * don't try and pull any tasks. */ - if (local->avg_load >= busiest->avg_load) + if (local->group_type > busiest->group_type) goto out_balanced; /* - * Don't pull any tasks if this group is already above the domain - * average load. + * When groups are overloaded, use the avg_load to ensure fairness + * between tasks. */ - if (local->avg_load >= sds.avg_load) - goto out_balanced; + if (local->group_type == group_overloaded) { + /* + * If the local group is more loaded than the selected + * busiest group don't try and pull any tasks. + */ + if (local->avg_load >= busiest->avg_load) + goto out_balanced; + + /* XXX broken for overlapping NUMA groups */ + sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load) + / sds.total_capacity; - if (env->idle == CPU_IDLE) { /* - * This CPU is idle. 
If the busiest group is not overloaded - * and there is no imbalance between this and busiest group - * wrt idle CPUs, it is balanced. The imbalance becomes - * significant if the diff is greater than 1 otherwise we - * might end up to just move the imbalance on another group + * Don't pull any tasks if this group is already above the + * domain average load. */ - if ((busiest->group_type != group_overloaded) && - (local->idle_cpus <= (busiest->idle_cpus + 1))) + if (local->avg_load >= sds.avg_load) goto out_balanced; - } else { + /* - * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use - * imbalance_pct to be conservative. + * If the busiest group is more loaded, use imbalance_pct to be + * conservative. */ if (100 * busiest->avg_load <= env->sd->imbalance_pct * local->avg_load) goto out_balanced; + } + /* Try to move all excess tasks to child's sibling domain */ + if (sds.prefer_sibling && local->group_type == group_has_spare && + busiest->sum_nr_running > local->sum_nr_running + 1) + goto force_balance; + + if (busiest->group_type != group_overloaded && + (env->idle == CPU_NOT_IDLE || + local->idle_cpus <= (busiest->idle_cpus + 1))) + /* + * If the busiest group is not overloaded + * and there is no imbalance between this and busiest group + * wrt idle CPUs, it is balanced. The imbalance + * becomes significant if the diff is greater than 1 otherwise + * we might end up to just move the imbalance on another + * group. + */ + goto out_balanced; + force_balance: /* Looks like there is an imbalance. Compute it */ - env->src_grp_type = busiest->group_type; calculate_imbalance(env, &sds); + return env->imbalance ? sds.busiest : NULL; out_balanced: + env->imbalance = 0; return NULL; } @@ -8542,11 +8558,13 @@ static struct rq *find_busiest_queue(struct lb_env *env, struct sched_group *group) { struct rq *busiest = NULL, *rq; - unsigned long busiest_load = 0, busiest_capacity = 1; + unsigned long busiest_util = 0, busiest_load = 0, busiest_capacity = 1; + unsigned int busiest_nr = 0; int i; for_each_cpu_and(i, sched_group_span(group), env->cpus) { - unsigned long capacity, load; + unsigned long capacity, load, util; + unsigned int nr_running; enum fbq_type rt; rq = cpu_rq(i); @@ -8578,7 +8596,7 @@ static struct rq *find_busiest_queue(struct lb_env *env, * For ASYM_CPUCAPACITY domains with misfit tasks we simply * seek the "biggest" misfit task. */ - if (env->src_grp_type == group_misfit_task) { + if (env->src_grp_type == migrate_misfit) { if (rq->misfit_task_load > busiest_load) { busiest_load = rq->misfit_task_load; busiest = rq; @@ -8600,12 +8618,33 @@ static struct rq *find_busiest_queue(struct lb_env *env, rq->nr_running == 1) continue; - load = cpu_runnable_load(rq); + if (env->src_grp_type == migrate_task) { + nr_running = rq->cfs.h_nr_running; + + if (busiest_nr < nr_running) { + busiest_nr = nr_running; + busiest = rq; + } + + continue; + } + + if (env->src_grp_type == migrate_util) { + util = cpu_util(cpu_of(rq)); + + if (busiest_util < util) { + busiest_util = util; + busiest = rq; + } + + continue; + } /* - * When comparing with imbalance, use cpu_runnable_load() + * When comparing with load imbalance, use weighted_cpuload() * which is not scaled with the CPU capacity. 
 	 */
+	load = cpu_runnable_load(rq);
 	if (rq->nr_running == 1 && load > env->imbalance &&
 	    !check_cpu_capacity(rq, env->sd))
 		continue;
@@ -8671,7 +8710,7 @@ voluntary_active_balance(env)
 		return 1;
 	}
 
-	if (env->src_grp_type == group_misfit_task)
+	if (env->src_grp_type == migrate_misfit)
 		return 1;
 
 	return 0;

From patchwork Fri Jul 19 07:58:24 2019
X-Patchwork-Submitter: Vincent Guittot
X-Patchwork-Id: 169230
[209.132.180.67]) by mx.google.com with ESMTP id 6si736870pgt.13.2019.07.19.00.59.14; Fri, 19 Jul 2019 00:59:14 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=X0e9c5ge; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727177AbfGSH7D (ORCPT + 29 others); Fri, 19 Jul 2019 03:59:03 -0400 Received: from mail-wr1-f65.google.com ([209.85.221.65]:41378 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726036AbfGSH7A (ORCPT ); Fri, 19 Jul 2019 03:59:00 -0400 Received: by mail-wr1-f65.google.com with SMTP id c2so28071356wrm.8 for ; Fri, 19 Jul 2019 00:58:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=uMqEORZKZxack2XqfX2Ct1qRRq3fXFeaXwXevbx4NJQ=; b=X0e9c5geHASXobnY4+VsSDlwg7F3sr1FwU2iX6VKMRFsANCoSWuJpkdXFixx774oAK xo02I+gragJZZQ+h6i7CfDlIZIGgtxoUXANdvwIsk5/RpVQD+mDtVIjyb/6UzHxOJYXV E3+t0bvkjBc9vK1ImBLBwuCXcehZt2Akp82PoYliQbh3O0Ta77Bl7ah5zcIPRm33sOcA 1oYfjgzHO7TmSVvEz66vFrT4QWJfMiV03U8cAfaEOI5kPE+TuMT1GI3OAvXuWlzYwxsW JQEtjt3TwZ1efSdLKuE19RaGZRltsF4Mq5UsFnRm1OYC6X+HDXc7ip0ZZ80ASqxOFUPN w8sg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=uMqEORZKZxack2XqfX2Ct1qRRq3fXFeaXwXevbx4NJQ=; b=oKaIJaPglx7TQbuRCW0I6t3O338UCrX4AEYKHLD0jkWnUe1/BkyYXB85sbwtIBzPBU QPtnUFzKl1QNoEfRksCReXvwuAivJLKeWH5As2lPJzLoAs0AJjTYh0I5X3p87RmddMo9 pM5Dl2a70gEr+yx0Bvx8cwAV2z/9rpw1mGwDESlsCvtVezijunr3/uGQbmpl0O21ctdo IF3VZXgKWsVYkZAKta42OtbOFFluyp/PVDlMHS7rzBXN32lOnM/hG1522ss6Zqcqw5dz bdJxNT/TeTnbY08/cPb1uOyTh51axSQ0YGshz4ZIqTkKHedjXKxWJvcei1cAHK2KpZpr /oTw== X-Gm-Message-State: APjAAAVR4SRpUw+YhRwzZ2v6I2LlHIOygm/Nn64WPEsbFRH0o9Z/a7l2 Tf1gKuCZl4u9/UgoKpncfyN8S3a8DF8= X-Received: by 2002:adf:f286:: with SMTP id k6mr45104371wro.320.1563523138168; Fri, 19 Jul 2019 00:58:58 -0700 (PDT) Received: from localhost.localdomain ([2a01:e0a:f:6020:484b:32fe:1cf4:f69b]) by smtp.gmail.com with ESMTPSA id c1sm58673826wrh.1.2019.07.19.00.58.57 (version=TLS1_2 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Fri, 19 Jul 2019 00:58:57 -0700 (PDT) From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, pauld@redhat.com, Vincent Guittot Subject: [PATCH 4/5] sched/fair: use load instead of runnable load Date: Fri, 19 Jul 2019 09:58:24 +0200 Message-Id: <1563523105-24673-5-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org> References: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org runnable load has been introduced to take into account the case where blocked load biases the load balance decision which was selecting underutilized group with huge 
blocked load whereas other groups were overloaded. The load is now only used when groups are overloaded. In this case, it's worth being conservative and taking into account the sleeping tasks that might wakeup on the cpu. Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 472959df..c221713 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5371,6 +5371,11 @@ static unsigned long cpu_runnable_load(struct rq *rq) return cfs_rq_runnable_load_avg(&rq->cfs); } +static unsigned long cpu_load(struct rq *rq) +{ + return cfs_rq_load_avg(&rq->cfs); +} + static unsigned long capacity_of(int cpu) { return cpu_rq(cpu)->cpu_capacity; @@ -5466,7 +5471,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p, s64 this_eff_load, prev_eff_load; unsigned long task_load; - this_eff_load = cpu_runnable_load(cpu_rq(this_cpu)); + this_eff_load = cpu_load(cpu_rq(this_cpu)); if (sync) { unsigned long current_load = task_h_load(current); @@ -5484,7 +5489,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p, this_eff_load *= 100; this_eff_load *= capacity_of(prev_cpu); - prev_eff_load = cpu_runnable_load(cpu_rq(prev_cpu)); + prev_eff_load = cpu_load(cpu_rq(prev_cpu)); prev_eff_load -= task_load; if (sched_feat(WA_BIAS)) prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2; @@ -5572,7 +5577,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, max_spare_cap = 0; for_each_cpu(i, sched_group_span(group)) { - load = cpu_runnable_load(cpu_rq(i)); + load = cpu_load(cpu_rq(i)); runnable_load += load; avg_load += cfs_rq_load_avg(&cpu_rq(i)->cfs); @@ -5708,7 +5713,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this shallowest_idle_cpu = i; } } else if (shallowest_idle_cpu == -1) { - load = cpu_runnable_load(cpu_rq(i)); + load = cpu_load(cpu_rq(i)); if (load < min_load) { min_load = load; least_loaded_cpu = i; @@ -8030,7 +8035,7 @@ static inline void update_sg_lb_stats(struct lb_env *env, if ((env->flags & LBF_NOHZ_STATS) && update_nohz_stats(rq, false)) env->flags |= LBF_NOHZ_AGAIN; - sgs->group_load += cpu_runnable_load(rq); + sgs->group_load += cpu_load(rq); sgs->group_util += cpu_util(i); sgs->sum_h_nr_running += rq->cfs.h_nr_running; nr_running = rq->nr_running; @@ -8446,7 +8451,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env) init_sd_lb_stats(&sds); /* - * Compute the various statistics relavent for load balancing at + * Compute the various statistics relevant for load balancing at * this level. */ update_sd_lb_stats(env, &sds); @@ -8641,10 +8646,10 @@ static struct rq *find_busiest_queue(struct lb_env *env, } /* - * When comparing with load imbalance, use weighted_cpuload() + * When comparing with load imbalance, use cpu_load() * which is not scaled with the CPU capacity. 
*/ - load = cpu_runnable_load(rq); + load = cpu_load(rq); if (rq->nr_running == 1 && load > env->imbalance && !check_cpu_capacity(rq, env->sd))
From patchwork Fri Jul 19 07:58:25 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vincent Guittot X-Patchwork-Id: 169229
From: Vincent Guittot To: linux-kernel@vger.kernel.org, mingo@redhat.com, peterz@infradead.org Cc: quentin.perret@arm.com, dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com, pauld@redhat.com, Vincent Guittot Subject: [PATCH 5/5] sched/fair: evenly spread tasks when not overloaded Date: Fri, 19 Jul 2019 09:58:25 +0200 Message-Id: <1563523105-24673-6-git-send-email-vincent.guittot@linaro.org> X-Mailer: git-send-email 2.7.4 In-Reply-To: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org> References: <1563523105-24673-1-git-send-email-vincent.guittot@linaro.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org
When there is only one CPU per group, using the number of idle CPUs to evenly spread tasks doesn't make sense; nr_running is a better metric.
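To make the rationale concrete, here is a small stand-alone sketch (plain user-space C, not kernel code; the structure and helper names are invented for this illustration) comparing the two ways of sizing the imbalance for two single-CPU groups, one running 3 tasks and the other 1:

/*
 * Illustrative sketch only. It contrasts "even out idle CPUs" with
 * "even out runnable tasks" for groups that contain a single CPU each.
 */
#include <stdio.h>

struct group_stats {
	unsigned int group_weight;	/* number of CPUs in the group */
	unsigned int idle_cpus;		/* idle CPUs in the group */
	unsigned int sum_nr_running;	/* runnable tasks in the group */
};

/* Tasks the local group should pull from the busiest one. */
static long tasks_to_pull(const struct group_stats *busiest,
			  const struct group_stats *local,
			  int use_nr_running)
{
	long diff;

	if (use_nr_running)
		/* Spread runnable tasks evenly across the groups. */
		diff = (long)busiest->sum_nr_running - (long)local->sum_nr_running;
	else
		/* Even out the number of idle CPUs instead. */
		diff = (long)local->idle_cpus - (long)busiest->idle_cpus;

	return diff > 0 ? diff / 2 : 0;
}

int main(void)
{
	/* Two single-CPU groups: 3 runnable tasks vs 1, no CPU idle. */
	struct group_stats busiest = { .group_weight = 1, .idle_cpus = 0, .sum_nr_running = 3 };
	struct group_stats local   = { .group_weight = 1, .idle_cpus = 0, .sum_nr_running = 1 };
	int by_nr = busiest.group_weight == 1;	/* the condition the hunk below adds */

	printf("by idle CPUs  : pull %ld task(s)\n", tasks_to_pull(&busiest, &local, 0));
	printf("by nr_running : pull %ld task(s)\n", tasks_to_pull(&busiest, &local, by_nr));
	return 0;
}

With one CPU per group the idle-CPU difference is at most 1 and is 0 here, so that heuristic reports nothing to do, while counting runnable tasks asks to pull one task and evens the groups out; this is what the calculate_imbalance() hunk below switches to when busiest->group_weight == 1.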
Signed-off-by: Vincent Guittot --- kernel/sched/fair.c | 42 +++++++++++++++++++++++++++++------------- 1 file changed, 29 insertions(+), 13 deletions(-) -- 2.7.4 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c221713..a60ddef 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8353,7 +8353,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s env->src_grp_type = migrate_task; imbalance = max_t(long, 0, (local->idle_cpus - busiest->idle_cpus) >> 1); - if (sds->prefer_sibling) + if (sds->prefer_sibling || busiest->group_weight == 1) /* * When prefer sibling, evenly spread running tasks on * groups. @@ -8531,18 +8531,34 @@ static struct sched_group *find_busiest_group(struct lb_env *env) busiest->sum_nr_running > local->sum_nr_running + 1) goto force_balance; - if (busiest->group_type != group_overloaded && - (env->idle == CPU_NOT_IDLE || - local->idle_cpus <= (busiest->idle_cpus + 1))) - /* - * If the busiest group is not overloaded - * and there is no imbalance between this and busiest group - * wrt idle CPUs, it is balanced. The imbalance - * becomes significant if the diff is greater than 1 otherwise - * we might end up to just move the imbalance on another - * group. - */ - goto out_balanced; + if (busiest->group_type != group_overloaded) { + if (env->idle == CPU_NOT_IDLE) + /* + * If the busiest group is not overloaded (and as a + * result the local one too) but this cpu is already + * busy, let another idle cpu try to pull task. + */ + goto out_balanced; + + if (busiest->group_weight > 1 && + local->idle_cpus <= (busiest->idle_cpus + 1)) + /* + * If the busiest group is not overloaded + * and there is no imbalance between this and busiest + * group wrt idle CPUs, it is balanced. The imbalance + * becomes significant if the diff is greater than 1 + * otherwise we might end up to just move the imbalance + * on another group. Of course this applies only if + * there is more than 1 CPU per group. + */ + goto out_balanced; + + if (busiest->sum_nr_running == 1) + /* + * busiest doesn't have any tasks waiting to run + */ + goto out_balanced; + } force_balance: /* Looks like there is an imbalance. Compute it */