From patchwork Tue Feb 25 14:15:15 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Juri Lelli
X-Patchwork-Id: 25297
Date: Tue, 25 Feb 2014 15:15:15 +0100
From: Juri Lelli
To: tkhai@yandex.ru
Cc: Peter Zijlstra, "linux-kernel@vger.kernel.org", Steven Rostedt, Ingo Molnar
Subject: Re: [RFC] sched/deadline: Prevent rt_time growth to infinity
Message-Id: <20140225151515.617714e2f2cd6c558531ba61@gmail.com>
In-Reply-To: <5307F5DB.3000705@yandex.ru>
References: <230991392848160@web13m.yandex.ru>
 <20140221103715.GP9987@twins.programming.kicks-ass.net>
 <20140221173641.a060b3d6c0993c21e77f29c2@gmail.com>
 <5307F5DB.3000705@yandex.ru>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, 22 Feb 2014 04:56:59 +0400
Kirill Tkhai wrote:

> On 21.02.2014 20:36, Juri Lelli wrote:
> > On Fri, 21 Feb 2014 11:37:15 +0100
> > Peter Zijlstra wrote:
> >
> >> On Thu, Feb 20, 2014 at 02:16:00AM +0400, Kirill Tkhai wrote:
> >>> Since deadline tasks share rt bandwidth, we must care about
> >>> bandwidth timer set. Otherwise rt_time may grow up to infinity
> >>> in update_curr_dl(), if there are no other available RT tasks
> >>> on top level bandwidth.
> >>>
> >>> I'm going to decide the problem the way below. Almost untested
> >>> because of I skipped almost all of recent patches which have to
> >>> be applied from lkml.
> >>>
> >>> Please say, if I skipped anything in idea. Maybe better put
> >>> start_top_rt_bandwidth() into set_curr_task_dl()?
> >>
> >> How about we only increment rt_time when there's an RT bandwidth timer
> >> active?
> >>
> >>
> >> ---
> >> --- a/kernel/sched/rt.c
> >> +++ b/kernel/sched/rt.c
> >> @@ -568,6 +568,12 @@ static inline struct rt_bandwidth *sched
> >>  
> >>  #endif /* CONFIG_RT_GROUP_SCHED */
> >>  
> >> +bool sched_rt_bandwidth_active(struct rt_rq *rt_rq)
> >> +{
> >> +	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
> >> +	return hrtimer_active(&rt_b->rt_period_timer);
> >> +}
> >> +
> >>  #ifdef CONFIG_SMP
> >>  /*
> >>   * We ran out of runtime, see if we can borrow some from our neighbours.
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -587,6 +587,8 @@ int dl_runtime_exceeded(struct rq *rq, s
> >>  	return 1;
> >>  }
> >>  
> >> +extern bool sched_rt_bandwidth_active(struct rt_rq *rt_rq);
> >> +
> >>  /*
> >>   * Update the current task's runtime statistics (provided it is still
> >>   * a -deadline task and has not been removed from the dl_rq).
> >> @@ -650,11 +652,13 @@ static void update_curr_dl(struct rq *rq
> >>  		struct rt_rq *rt_rq = &rq->rt;
> >>  
> >>  		raw_spin_lock(&rt_rq->rt_runtime_lock);
> >> -		rt_rq->rt_time += delta_exec;
> >>  		/*
> >>  		 * We'll let actual RT tasks worry about the overflow here, we
> >> -		 * have our own CBS to keep us inline -- see above.
> >> +		 * have our own CBS to keep us inline; only account when RT
> >> +		 * bandwidth is relevant.
> >>  		 */
> >> +		if (sched_rt_bandwidth_active(rt_rq))
> >> +			rt_rq->rt_time += delta_exec;
> >>  		raw_spin_unlock(&rt_rq->rt_runtime_lock);
> >>  	}
> >>  }
> >
> > So, I ran some tests with the above and I'd like to share with you what
> > I've found.
> > You can find here a trace-cmd trace that should be fed
> > to kernelshark to be able to understand what follows (or feel free to
> > reproduce the same scenario :)):
> > http://retis.sssup.it/~jlelli/traces/trace_rt_time.dat
> >
> > Here you have a DL task (4/10) and a while(1) RT task, both running
> > inside a rt_bw of 0.5. The RT task is activated 500ms after DL. As I
> > filtered in sched_rt_period_timer(), you can search for time instants
> > when the rt_bw is replenished. It is evident that the first time after
> > rt timer is activated back (search for start_bandwidth_timer), we can
> > eat some bw from FAIR tasks (if any). This is due to the fact that we
> > reset rt_bw budget at this time, start incrementing rt_time for both DL
> > and RT tasks, throttle RT tasks when rt_time > runtime, but, since DL
> > tasks actually execute inside their own server, they don't care about
> > rt_bw. Good news is that steady state is ok: keeping track of overruns
> > we are able to stop eating bw from other guys.
> >
> > My thoughts:
> >
> >  - Peter's patch is an easy fix to Kirill's problem (RT tasks were
> >    throttled too early);
> >  - something to add to this solution could be to pre-calculate bw of
> >    ready DL tasks and subtract it from rt_bw at replenishment time, but
> >    it sounds quite awkward, pessimistic, and I'm not sure it is gonna
> >    work;
> >  - we are stealing bw from best-effort tasks, and just at the beginning
> >    of the transition, is it really a problem?
> >     - I mean, if you want guarantees make your tasks DL! :);
> >  - in the long run we are gonna have RT tasks scheduled inside CBS
> >    servers, and all this will be properly fixed up.
> >
> > Comments?
> >
> > BTW, rt timer activation/deactivation should probably be fixed for
> > !RT_GROUP_SCHED with something like this:
> >
> > ---
> >  kernel/sched/rt.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index 6161de8..274f992 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -86,12 +86,12 @@ void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
> >  	raw_spin_lock_init(&rt_rq->rt_runtime_lock);
> >  }
> >  
> > -#ifdef CONFIG_RT_GROUP_SCHED
> >  static void destroy_rt_bandwidth(struct rt_bandwidth *rt_b)
> >  {
> >  	hrtimer_cancel(&rt_b->rt_period_timer);
> >  }
> >  
> > +#ifdef CONFIG_RT_GROUP_SCHED
> >  #define rt_entity_is_task(rt_se) (!(rt_se)->my_q)
> >  
> >  static inline struct task_struct *rt_task_of(struct sched_rt_entity *rt_se)
> > @@ -1017,8 +1017,12 @@ inc_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
> >  	start_rt_bandwidth(&def_rt_bandwidth);
> >  }
> >  
> > -static inline
> > -void dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq) {}
> > +static void
> > +dec_rt_group(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
> > +{
> > +	if (!rt_rq->rt_nr_running)
> > +		destroy_rt_bandwidth(&def_rt_bandwidth);
> > +}
> >  
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >  
>
> It looks with both patches applied, we may get into a situation,
> when all CPU time is shared between RT and DL tasks:
>
> rt_runtime = n
> rt_period = 2n
>
> | RT working, DL sleeping  | DL working, RT sleeping      |
> -----------------------------------------------------------
> | (1)     duration = n     | (2)     duration = n         | (repeat)
> |--------------------------|------------------------------|
> | (rt_bw timer is running) | (rt_bw timer is not running) |
>
> No time for fair tasks at all.

Ok, this situation is pathological. DL bandwidth is guaranteed at
admission control, while RT isn't. In this case RT tasks are doomed by
construction. Still you'd like to let FAIR tasks execute :).
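
The arithmetic of the diagram above can be sketched in user space. This is only an illustrative model, not kernel code: the `phase` and `period_result` types and the `timer_only_rule()` and `kirill_scenario()` helpers are hypothetical names introduced here, and the model assumes the "account DL time only while the rt timer is active" rule from Peter's patch, applied to a single rt_period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Who occupies the CPU during a phase (hypothetical model types). */
enum runner { RUN_RT, RUN_DL, RUN_IDLE };

struct phase {
	enum runner who;	/* class running in this phase */
	uint64_t len;		/* phase duration */
	bool timer_active;	/* is the rt period timer running? */
};

struct period_result {
	uint64_t accounted;	/* what would end up in rt_time */
	uint64_t consumed;	/* CPU time really used by RT + DL */
	uint64_t fair_left;	/* what is left for FAIR tasks */
};

/*
 * Timer-only rule: RT execution is always charged to rt_time, DL
 * execution is charged only while the rt period timer is active.
 */
static struct period_result
timer_only_rule(const struct phase *ph, size_t nr, uint64_t rt_period)
{
	struct period_result r = { 0, 0, 0 };

	for (size_t i = 0; i < nr; i++) {
		if (ph[i].who == RUN_IDLE)
			continue;
		r.consumed += ph[i].len;
		if (ph[i].who == RUN_RT || ph[i].timer_active)
			r.accounted += ph[i].len;
	}
	assert(r.consumed <= rt_period);
	r.fair_left = rt_period - r.consumed;
	return r;
}

/* The two phases of the diagram: rt_runtime = n, rt_period = 2n. */
static struct period_result kirill_scenario(uint64_t n)
{
	const struct phase ph[] = {
		{ RUN_RT, n, true },	/* (1) RT working, timer running */
		{ RUN_DL, n, false },	/* (2) DL working, timer stopped */
	};

	return timer_only_rule(ph, 2, 2 * n);
}
```

With n = 500, kirill_scenario() reports accounted = 500 (exactly rt_runtime, so no overrun is ever seen), consumed = 1000 and fair_left = 0, which is the "no time for fair tasks" case.
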
I argue for a slightly different solution in what follows; what do you
think?

Thanks,

- Juri

>From e44fe2eef34433a7799cfc153f467f7c62813596 Mon Sep 17 00:00:00 2001
From: Juri Lelli
Date: Fri, 21 Feb 2014 11:37:15 +0100
Subject: [PATCH] sched/deadline: Prevent rt_time growth to infinity

Kirill Tkhai noted:
  Since deadline tasks share rt bandwidth, we must care about
  bandwidth timer set. Otherwise rt_time may grow up to infinity
  in update_curr_dl(), if there are no other available RT tasks
  on top level bandwidth.

RT tasks were in fact throttled right after they got enqueued, and never
executed again (rt_time never again went below rt_runtime).

Peter then proposed to accrue DL execution on rt_time only when the rt
timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this solves
Kirill's problem, it has a drawback.

Indeed, Kirill noted again:
  It looks we may get into a situation, when all CPU time is shared
  between RT and DL tasks:

  rt_runtime = n
  rt_period = 2n

  | RT working, DL sleeping  | DL working, RT sleeping      |
  -----------------------------------------------------------
  | (1)     duration = n     | (2)     duration = n         | (repeat)
  |--------------------------|------------------------------|
  | (rt_bw timer is running) | (rt_bw timer is not running) |

  No time for fair tasks at all.

While this can happen during the first period, if rq is always backlogged,
RT tasks won't have the opportunity to execute anymore: rt_time reached
rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
throttled since rt timer didn't fire, replenishment is from now on eaten up
by DL tasks that accrue their execution on rt_time (while rt timer is
active - we have an RT task waiting for replenishment). FAIR tasks are
not touched after this first period. Ok, this is not ideal, and the
situation is even worse!

What is described above (the nice case) practically never happens in
reality, where your rt timer is not aligned to task periods, tasks are in
general not periodic, etc. Long story short, you always risk overloading
your system.

This patch is based on Peter's idea, but exploits an additional fact: if
you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).

This cures both problems:

 - no matter how many DL instances in the past, you'll have an rt_time
   slightly above rt_runtime when an RT task is enqueued, and from that
   point on (after the first replenishment), the task will normally
   execute;

 - you can still eat up all bandwidth during the first period, but not
   anymore after that, remember that DL execution will increment
   rt_time till the upper limit is reached.

The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.

Signed-off-by: Juri Lelli
---
 kernel/sched/deadline.c | 8 ++++++--
 kernel/sched/rt.c       | 8 ++++++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 15cbc17..f59d774 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -564,6 +564,8 @@ int dl_runtime_exceeded(struct rq *rq, struct sched_dl_entity *dl_se)
 	return 1;
 }
 
+extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
+
 /*
  * Update the current task's runtime statistics (provided it is still
  * a -deadline task and has not been removed from the dl_rq).
@@ -627,11 +629,13 @@ static void update_curr_dl(struct rq *rq)
 		struct rt_rq *rt_rq = &rq->rt;
 
 		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_time += delta_exec;
 		/*
 		 * We'll let actual RT tasks worry about the overflow here, we
-		 * have our own CBS to keep us inline -- see above.
+		 * have our own CBS to keep us inline; only account when RT
+		 * bandwidth is relevant.
 		 */
+		if (sched_rt_bandwidth_account(rt_rq))
+			rt_rq->rt_time += delta_exec;
 		raw_spin_unlock(&rt_rq->rt_runtime_lock);
 	}
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7dba25a..7f372e1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -538,6 +538,14 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)
 
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
+{
+	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
+
+	return (hrtimer_active(&rt_b->rt_period_timer) ||
+		rt_rq->rt_time < rt_b->rt_runtime);
+}
+
 #ifdef CONFIG_SMP
 /*
  * We ran out of runtime, see if we can borrow some from our neighbours.
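
The bound claimed in the changelog (rt_time ends up only slightly above rt_runtime when no rt timer is active) can be checked with a small user-space model. This is a sketch under stated assumptions, not the kernel implementation: `struct model_rt_rq` and the `model_*` functions are hypothetical stand-ins, `timer_active` replaces hrtimer_active(&rt_b->rt_period_timer), and a single `rt_runtime` field replaces the rt_b->rt_runtime lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the fields the patch reads. */
struct model_rt_rq {
	uint64_t rt_time;	/* accumulated execution time */
	uint64_t rt_runtime;	/* budget per rt_period */
	bool timer_active;	/* models hrtimer_active() on the period timer */
};

/* Same condition as sched_rt_bandwidth_account() in the patch. */
static bool model_bandwidth_account(const struct model_rt_rq *rt_rq)
{
	return rt_rq->timer_active || rt_rq->rt_time < rt_rq->rt_runtime;
}

/* Models the rt_time accounting step of update_curr_dl(). */
static void model_update_curr_dl(struct model_rt_rq *rt_rq, uint64_t delta_exec)
{
	if (model_bandwidth_account(rt_rq))
		rt_rq->rt_time += delta_exec;
}

/*
 * Run nr_updates DL accounting steps of delta_exec each, with no RT
 * task enqueued (rt timer never active), and return the final rt_time.
 */
static uint64_t model_run_dl(uint64_t rt_runtime, uint64_t delta_exec,
			     unsigned int nr_updates)
{
	struct model_rt_rq rq = {
		.rt_time = 0,
		.rt_runtime = rt_runtime,
		.timer_active = false,
	};

	assert(delta_exec > 0);
	for (unsigned int i = 0; i < nr_updates; i++)
		model_update_curr_dl(&rq, delta_exec);
	return rq.rt_time;
}
```

For example, model_run_dl(950, 100, 1000) returns 1000: accounting stops at the first value at or above rt_runtime, so in this model rt_time stays below rt_runtime + delta_exec no matter how long DL tasks run alone.
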