From patchwork Thu Feb  4 16:30:49 2016
X-Patchwork-Submitter: Juri Lelli
X-Patchwork-Id: 61208
X-Mailing-List: linux-kernel@vger.kernel.org
Date: Thu, 4 Feb 2016 16:30:49 +0000
From: Juri Lelli
To: Steven Rostedt
Cc: Peter Zijlstra, Ingo Molnar, LKML, Clark Williams, John Kacur,
 Daniel Bristot de Oliveira, Juri Lelli
Subject: Re: [BUG] Corrupted SCHED_DEADLINE bandwidth with cpusets
Message-ID: <20160204163049.GE29586@e106622-lin>
References: <20160203135550.5f95ecb2@gandalf.local.home>
 <20160204095448.GE12132@e106622-lin>
 <20160204120412.GA29586@e106622-lin>
 <20160204122745.GC29586@e106622-lin>
In-Reply-To: <20160204122745.GC29586@e106622-lin>

Hi Steve,

On 04/02/16 12:27, Juri Lelli wrote:
> On 04/02/16 12:04, Juri Lelli wrote:
> > On 04/02/16 09:54, Juri Lelli wrote:
> > > Hi Steve,
> > >
> > > first of all, thanks a lot for your detailed report; if only all
> > > bug reports were like this.. :)
> > >
> > > On 03/02/16 13:55, Steven Rostedt wrote:
> > > >
> > > > [...]
> > > >
> > >
> > > Right. I think this is the same thing that happens after hotplug;
> > > IIRC the code paths are actually the same. The problem is that
> > > hotplug and cpuset reconfiguration operations are destructive
> > > w.r.t. root_domains, so we lose bandwidth information when they
> > > happen: we only store cumulative bandwidth information in the
> > > root_domain, while the information about which task belongs to
> > > which cpuset lives in cpuset data structures.
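
(To make the above concrete: the accounting I'm talking about is the
per-root_domain dl_bw bookkeeping in kernel/sched/sched.h. From memory,
and slightly simplified, it looks like the sketch below; note that
nothing links total_bw back to the tasks that contributed to it, which
is why we can't simply recompute it once the root_domain has been
rebuilt from scratch:

	struct dl_bw {
		raw_spinlock_t lock;
		u64 bw;		/* max usable bandwidth, -1 if unlimited */
		u64 total_bw;	/* sum of the dl_bw of admitted DL tasks */
	};

	/* admission path, called with dl_b->lock held */
	static inline void __dl_add(struct dl_bw *dl_b, u64 tsk_bw)
	{
		dl_b->total_bw += tsk_bw;
	}

After a destructive operation the new root_domain starts over with
total_bw == 0, while the admitted DL tasks are still around.)
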
> > > I tried to fix this a while back, but my attempt was broken: I
> > > failed to get the locking right and, even though it seemed to fix
> > > the issue for me, it was prone to race conditions. You might still
> > > want to have a look at it for reference:
> > > https://lkml.org/lkml/2015/9/2/162
> > >
> > > >
> > > > [...]
> > > >
> > >
> > > It's good that we can recover, but yes, that's still a bug :/.
> > >
> > > I'll try to see if my broken patch makes what you are seeing
> > > apparently disappear, so that we can at least confirm that we are
> > > seeing the same problem; you could do the same if you want, I
> > > pushed that here
> > >
> >
> > No, it doesn't solve this :/. I placed the restoring code in the
> > hotplug workfn, so updates generated by toggling sched_load_balance
> > don't get caught, of course. But this at least tells us that we
> > should solve this somewhere else.
> >
>
> Well, if I call an unlocked version of my cpuset_hotplug_update_rd()
> from kernel/cpuset.c:update_flag(), the issue seems to go away. But we
> end up overcommitting the default null domain (try toggling
> sched_load_balance multiple times). I updated the branch, but I still
> think we should solve this differently.
>

I've actually changed this approach a bit, and things seem better here.
Could you please give the patch below a try? (You can also fetch the
same branch.)
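
In case it helps with reproducing this, what I'm running here is a
minimal DL task along these lines (just a sketch: error handling
trimmed, x86_64 syscall number assumed, needs a >= 3.14 kernel; the
file name dl_test.c is only illustrative):

	/* dl_test.c - minimal SCHED_DEADLINE task, ~10% bandwidth.
	 * Build with: gcc -o dl_test dl_test.c
	 */
	#define _GNU_SOURCE
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef __NR_sched_setattr
	#define __NR_sched_setattr 314	/* x86_64 */
	#endif

	#define SCHED_DEADLINE 6

	struct sched_attr {
		uint32_t size;
		uint32_t sched_policy;
		uint64_t sched_flags;
		int32_t  sched_nice;
		uint32_t sched_priority;
		/* SCHED_DEADLINE parameters, in nanoseconds */
		uint64_t sched_runtime;
		uint64_t sched_deadline;
		uint64_t sched_period;
	};

	int main(void)
	{
		struct sched_attr attr = {
			.size		= sizeof(attr),
			.sched_policy	= SCHED_DEADLINE,
			.sched_runtime	=  10 * 1000 * 1000,	/*  10ms */
			.sched_deadline	= 100 * 1000 * 1000,	/* 100ms */
			.sched_period	= 100 * 1000 * 1000,	/* 100ms */
		};

		/* pid 0 == current task, flags == 0 */
		if (syscall(__NR_sched_setattr, 0, &attr, 0)) {
			perror("sched_setattr"); /* EBUSY: admission failed */
			return 1;
		}

		/* Stay admitted (i.e. accounted in dl_bw) until killed. */
		for (;;)
			pause();
	}

Start a few instances, flip cpuset.sched_load_balance in the root
cpuset (under /sys/fs/cgroup/cpuset on most setups) between 0 and 1 a
couple of times, then check whether a further instance gets admitted or
rejected as you'd expect from the bandwidth actually in use.
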
Thanks,

- Juri

--->8---
From c45d255859a2978a350bae39ead52f4dd11ab767 Mon Sep 17 00:00:00 2001
From: Juri Lelli
Date: Tue, 28 Jul 2015 11:55:51 +0100
Subject: [PATCH] sched/{cpuset,core}: restore root_domain status across
 destructive ops

Hotplug and sched_domains update operations are destructive w.r.t. data
associated with cpusets; in this case we care about root_domains.
SCHED_DEADLINE puts bandwidth information regarding admitted tasks on
root_domains, information that is gone when a hotplug or update
operation happens. Also, it is not currently possible to tell which
task(s) the allocated bandwidth belongs to, as this link is lost after
sched_setscheduler() succeeds.

This patch forces a rebuild of the allocated bandwidth information at
root_domain level after partition_sched_domains() is done. It also
ensures that we don't leave stale information in def_root_domain when
it becomes empty (since it is never freed).

Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Li Zefan
Cc: cgroups@vger.kernel.org
Reported-by: Wanpeng Li
Reported-by: Steven Rostedt
Signed-off-by: Juri Lelli
---
 include/linux/sched.h |  2 ++
 kernel/cpuset.c       | 39 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/core.c   | 28 ++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+)

-- 
2.7.0

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..5f9eeb4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2241,6 +2241,8 @@ extern int cpuset_cpumask_can_shrink(const struct cpumask *cur,
 				     const struct cpumask *trial);
 extern int task_can_attach(struct task_struct *p,
 			   const struct cpumask *cs_cpus_allowed);
+void sched_restore_dl_bw(struct task_struct *task,
+			 const struct cpumask *new_mask);
 #ifdef CONFIG_SMP
 extern void do_set_cpus_allowed(struct task_struct *p,
 				const struct cpumask *new_mask);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3e945fc..57078f0 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -785,6 +785,44 @@ done:
 	return ndoms;
 }
 
+/**
+ * update_tasks_rd - Update tasks' root_domain status.
+ * @cs: the cpuset to which each task's root_domain belongs
+ *
+ * Iterate through each task of @cs, updating the state of its related
+ * root_domain.
+ */
+static void update_tasks_rd(struct cpuset *cs)
+{
+	struct css_task_iter it;
+	struct task_struct *task;
+
+	css_task_iter_start(&cs->css, &it);
+	while ((task = css_task_iter_next(&it)))
+		sched_restore_dl_bw(task, cs->effective_cpus);
+	css_task_iter_end(&it);
+}
+
+static void cpuset_update_rd(void)
+{
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	lockdep_assert_held(&cpuset_mutex);
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (!css_tryget_online(&cs->css))
+			continue;
+		rcu_read_unlock();
+
+		update_tasks_rd(cs);
+
+		rcu_read_lock();
+		css_put(&cs->css);
+	}
+	rcu_read_unlock();
+}
+
 /*
  * Rebuild scheduler domains.
  *
@@ -818,6 +856,7 @@ static void rebuild_sched_domains_locked(void)
 
 	/* Have scheduler rebuild the domains */
 	partition_sched_domains(ndoms, doms, attr);
+	cpuset_update_rd();
 out:
 	put_online_cpus();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ce7a8..f9558f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2277,6 +2277,23 @@ static inline int dl_bw_cpus(int i)
 }
 #endif
 
+void sched_restore_dl_bw(struct task_struct *task,
+			 const struct cpumask *new_mask)
+{
+	struct dl_bw *dl_b;
+	unsigned long flags;
+
+	if (!task_has_dl_policy(task))
+		return;
+
+	rcu_read_lock_sched();
+	dl_b = dl_bw_of(cpumask_any(new_mask));
+	raw_spin_lock_irqsave(&dl_b->lock, flags);
+	dl_b->total_bw += task->dl.dl_bw;
+	raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+	rcu_read_unlock_sched();
+}
+
 /*
  * We must be sure that accepting a new task (or allowing changing the
  * parameters of an existing one) is consistent with the bandwidth
@@ -5636,6 +5653,17 @@ static void rq_attach_root(struct rq *rq, struct root_domain *rd)
 
 		cpumask_clear_cpu(rq->cpu, old_rd->span);
 
+		if (old_rd == &def_root_domain &&
+		    cpumask_empty(old_rd->span)) {
+			/*
+			 * def_root_domain is never freed, so we have to
+			 * clean it up when it becomes empty.
+			 */
+			raw_spin_lock(&old_rd->dl_bw.lock);
+			old_rd->dl_bw.total_bw = 0;
+			raw_spin_unlock(&old_rd->dl_bw.lock);
+		}
+
 		/*
 		 * If we dont want to free the old_rd yet then
 		 * set old_rd to NULL to skip the freeing later
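
P.S.: to see why the def_root_domain cleanup in the last hunk matters,
recall how admission control consumes total_bw; from memory the test is
roughly this (simplified):

	/*
	 * Would admitting new_bw (replacing old_bw) overcommit the
	 * root_domain's capacity (bw per CPU times #CPUs)?
	 */
	static inline bool __dl_overflow(struct dl_bw *dl_b, int cpus,
					 u64 old_bw, u64 new_bw)
	{
		return dl_b->bw != -1 &&
		       dl_b->bw * cpus < dl_b->total_bw - old_bw + new_bw;
	}

Since def_root_domain is never freed, without that cleanup any total_bw
left behind there would be counted again the next time CPUs fall back
to it, which is exactly the overcommit I was seeing when toggling
sched_load_balance multiple times.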