diff mbox series

[RFC,v1,8/8] sched/deadline: make bandwidth enforcement scale-invariant

Message ID 20170705085905.6558-9-juri.lelli@arm.com
State New
Headers show
Series [RFC,v1,1/8] sched/cpufreq_schedutil: make use of DEADLINE utilization signal | expand

Commit Message

Juri Lelli July 5, 2017, 8:59 a.m. UTC
Apply frequency and cpu scale-invariance correction factor to bandwidth
enforcement (similar to what we already do to fair utilization tracking).

Each delta_exec gets scaled considering current frequency and maximum
cpu capacity; which means that the reservation runtime parameter (that
need to be specified profiling the task execution at max frequency on
biggest capacity core) gets thus scaled accordingly.

Signed-off-by: Juri Lelli <juri.lelli@arm.com>

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Claudio Scordino <claudio@evidence.eu.com>
---
 kernel/sched/deadline.c | 26 ++++++++++++++++++++++----
 kernel/sched/fair.c     |  2 --
 kernel/sched/sched.h    |  2 ++
 3 files changed, 24 insertions(+), 6 deletions(-)

-- 
2.11.0

Comments

Juri Lelli July 19, 2017, 11:16 a.m. UTC | #1
On 19/07/17 13:00, Peter Zijlstra wrote:
> On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:

> > On 19/07/17 09:21, Peter Zijlstra wrote:

> > > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:

> > > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)

> > > >  	if (unlikely(dl_entity_is_special(dl_se)))

> > > >  		return;

> > > >  

> > > > -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))

> > > > -		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);

> > > > -	dl_se->runtime -= delta_exec;

> > > > +	/*

> > > > +	 * For tasks that participate in GRUB, we implement GRUB-PA: the

> > > > +	 * spare reclaimed bandwidth is used to clock down frequency.

> > > > +	 *

> > > > +	 * For the others, we still need to scale reservation parameters

> > > > +	 * according to current frequency and CPU maximum capacity.

> > > > +	 */

> > > > +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {

> > > > +		scaled_delta_exec = grub_reclaim(delta_exec,

> > > > +						 rq,

> > > > +						 &curr->dl);

> > > > +	} else {

> > > > +		unsigned long scale_freq = arch_scale_freq_capacity(cpu);

> > > > +		unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> > > > +

> > > > +		scaled_delta_exec = cap_scale(delta_exec, scale_freq);

> > > > +		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);

> > > > +	}

> > > > +

> > > > +	dl_se->runtime -= scaled_delta_exec;

> > > >  

> > > 

> > > This I don't get... 

> > 

> > 

> > Considering that we use GRUB's active utilization to drive clock

> > frequency selection, rationale is that GRUB tasks don't need any special

> > scaling, as their delta_exec is already scaled according to GRUB rules.

> > OTOH, normal tasks need to have their runtime (delta_exec) explicitly

> > scaled considering current frequency (and CPU max capacity), otherwise

> > they are going to receive less runtime than granted at AC, when

> > frequency is reduced.

> 

> I don't think that quite works out. Given that the frequency selection

> will never quite end up at exactly the same fraction (if the hardware

> listens to your requests at all).

> 


It's an approximation yes (how big it depends on the granularity of the
available frequencies). But, for the !GRUB tasks it should be OK, as we
always select a frequency (among the available ones) bigger than the
current active utilization.

Also, for platforms/archs that don't redefine arch_scale_* this is not
used. In case they are defined instead the assumption is that either hw
listens to requests or scaling factors can be derived in some other ways
(avgs?).

> Also, by not scaling the GRUB stuff, don't you run the risk of

> attempting to hand out more idle time than there actually is?


The way I understand it is that for GRUB tasks we always scale
considering the "correct" factor. Then frequency could be higher, but
this spare idle time will be reclaimed by other GRUB tasks.
luca abeni July 25, 2017, 7:03 a.m. UTC | #2
Hi Peter,

On Mon, 24 Jul 2017 18:43:49 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jul 19, 2017 at 12:16:24PM +0100, Juri Lelli wrote:

> > On 19/07/17 13:00, Peter Zijlstra wrote:  

> > > On Wed, Jul 19, 2017 at 10:20:29AM +0100, Juri Lelli wrote:  

> > > > On 19/07/17 09:21, Peter Zijlstra wrote:  

> > > > > On Wed, Jul 05, 2017 at 09:59:05AM +0100, Juri Lelli wrote:  

> > > > > > @@ -1156,9 +1157,26 @@ static void update_curr_dl(struct rq *rq)

> > > > > >  	if (unlikely(dl_entity_is_special(dl_se)))

> > > > > >  		return;

> > > > > >  

> > > > > > -	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))

> > > > > > -		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);

> > > > > > -	dl_se->runtime -= delta_exec;

> > > > > > +	/*

> > > > > > +	 * For tasks that participate in GRUB, we implement GRUB-PA: the

> > > > > > +	 * spare reclaimed bandwidth is used to clock down frequency.

> > > > > > +	 *

> > > > > > +	 * For the others, we still need to scale reservation parameters

> > > > > > +	 * according to current frequency and CPU maximum capacity.

> > > > > > +	 */

> > > > > > +	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {

> > > > > > +		scaled_delta_exec = grub_reclaim(delta_exec,

> > > > > > +						 rq,

> > > > > > +						 &curr->dl);

> > > > > > +	} else {

> > > > > > +		unsigned long scale_freq = arch_scale_freq_capacity(cpu);

> > > > > > +		unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> > > > > > +

> > > > > > +		scaled_delta_exec = cap_scale(delta_exec, scale_freq);

> > > > > > +		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);

> > > > > > +	}

> > > > > > +

> > > > > > +	dl_se->runtime -= scaled_delta_exec;

> > > > > >    

> > > > > 

> > > > > This I don't get...   

> > > > 

> > > > 

> > > > Considering that we use GRUB's active utilization to drive clock

> > > > frequency selection, rationale is that GRUB tasks don't need any special

> > > > scaling, as their delta_exec is already scaled according to GRUB rules.

> > > > OTOH, normal tasks need to have their runtime (delta_exec) explicitly

> > > > scaled considering current frequency (and CPU max capacity), otherwise

> > > > they are going to receive less runtime than granted at AC, when

> > > > frequency is reduced.  

> > > 

> > > I don't think that quite works out. Given that the frequency selection

> > > will never quite end up at exactly the same fraction (if the hardware

> > > listens to your requests at all).

> > >   

> > 

> > It's an approximation yes (how big it depends on the granularity of the

> > available frequencies). But, for the !GRUB tasks it should be OK, as we

> > always select a frequency (among the available ones) bigger than the

> > current active utilization.

> > 

> > Also, for platforms/archs that don't redefine arch_scale_* this is not

> > used. In case they are defined instead the assumption is that either hw

> > listens to requests or scaling factors can be derived in some other ways

> > (avgs?).

> >   

> > > Also, by not scaling the GRUB stuff, don't you run the risk of

> > > attempting to hand out more idle time than there actually is?  

> > 

> > The way I understand it is that for GRUB tasks we always scale

> > considering the "correct" factor. Then frequency could be higher, but

> > this spare idle time will be reclaimed by other GRUB tasks.  

> 

> I'm still confused..

> 

> So GRUB does:

> 

> 	dq = Uact -dt

> 

> right?


Right. This is what the original (single processor) GRUB did. And this
was used by the "GRUB-PA" algorithm:
https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf

(basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU
frequency based on Uact)


> Now, you do DVFS using that same Uact. If we lower the clock, we need

> more time, so would we then not end up with something like:

> 

> 	dq = 1/Uact -dt


Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism used to
give more runtime to the task... Since Uact is < 1, doing
	dq = - Uact * dt
means that we decrease the current runtime by a smaller amount of time.
And so we end up giving more runtime to the task: instead of giving 
dl_runtime every dl_period, we give "dl_runtime / Uact" every
dl_period... And since the CPU is slower (by a ratio Uact), this is
equivalent to giving dl_runtime at the maximum CPU speed / frequency
(at least, in theory :).


> After all; our budget assignment is such that we're able to complete

> our work at max freq. Therefore, when we lower the frequency, we'll have

> to increase budget pro rata, otherwise we'll not complete our work and

> badness happens.


Right. But instead of increasing dl_runtime, GRUB-PA decreases the
amount of time accounted to the current runtime.


> Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need

> double the time to complete.

> 

> Now, if we fold these two together, you'd get:

> 

> 	dq = Uact/Uact -dt = -dt


Not sure why " / Uact"... According to the GRUB-PA algorithm, you just
do
	dq = - Uact * dt = -0.5dt
and you end up giving the CPU to the task for 2 * dl_runtime every
dl_period (as expected)

> Because, after all, if we lowered the clock to consume all idle time,

> there's no further idle time to reclaim.


The idea of GRUB-PA is that you do not change dl_runtime... So, the
task is still "nominally reserved" dl_runtime every dl_period (in
this example, 1/2*dl_period every dl_period)... It is the reclaiming
mechanism that allows the task to execute for dl_runtime/Uact every
dl_period (in this example, for dl_period every dl_period, so it
reclaims 1/2dl_period every dl_period).


> Now, of course, our DVFS assignment isn't as precise nor as

> deterministic as this, so we'll get a slightly different ratio, lets

> call that Udvfs.


This is where GRUB-PA starts to have issues... :)
The implicit assumption is (I think) that is the DVFS mechanism cannot
assign exactly the requested frequency then it makes a "conservative"
assignment (assigning a frequency higher than the requested one).


> So would then not GRUB change into something like:

> 

> 	dq = Uact/Udvfs -dt

> 

> Allowing it to only reclaim that idle time that exists because our DVFS

> level is strictly higher than required?


I think GRUB should still do "dq = -Uact * dt", trying to reclaim all
the idle CPU time (up to the configured limit, of course).



				Luca

> 

> This way, on our 1 GHz part, with Uact=.5 but Udvfs=.6, we'll allow it

> to reclaim just the additional 100Mhz of idle time.

> 

> 

> Or am I completely off the rails now?
Peter Zijlstra July 25, 2017, 1:51 p.m. UTC | #3
On Tue, Jul 25, 2017 at 09:03:08AM +0200, Luca Abeni wrote:

> > I'm still confused..

> > 

> > So GRUB does:

> > 

> > 	dq = Uact -dt

> > 

> > right?

> 

> Right. This is what the original (single processor) GRUB did. And this

> was used by the "GRUB-PA" algorithm:

> https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf

> 

> (basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU

> frequency based on Uact)

> 

> 

> > Now, you do DVFS using that same Uact. If we lower the clock, we need

> > more time, so would we then not end up with something like:

> > 

> > 	dq = 1/Uact -dt

> 

> Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism used to

> give more runtime to the task... Since Uact is < 1, doing

> 	dq = - Uact * dt

> means that we decrease the current runtime by a smaller amount of time.

> And so we end up giving more runtime to the task: instead of giving 

> dl_runtime every dl_period, we give "dl_runtime / Uact" every

> dl_period... And since the CPU is slower (by a ratio Uact), this is

> equivalent to giving dl_runtime at the maximum CPU speed / frequency

> (at least, in theory :).

> 

> 

> > After all; our budget assignment is such that we're able to complete

> > our work at max freq. Therefore, when we lower the frequency, we'll have

> > to increase budget pro rata, otherwise we'll not complete our work and

> > badness happens.

> 

> Right. But instead of increasing dl_runtime, GRUB-PA decreases the

> amount of time accounted to the current runtime.

> 

> 

> > Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need

> > double the time to complete.

> > 

> > Now, if we fold these two together, you'd get:

> > 

> > 	dq = Uact/Uact -dt = -dt

> 

> Not sure why " / Uact"... According to the GRUB-PA algorithm, you just

> do

> 	dq = - Uact * dt = -0.5dt

> and you end up giving the CPU to the task for 2 * dl_runtime every

> dl_period (as expected)


Yeah, I seem to have gone off the rails there... Bah I'm terminally
confused now. Let me try and get my brain the right way up.
luca abeni July 26, 2017, 1:50 p.m. UTC | #4
On Tue, 25 Jul 2017 15:51:05 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Jul 25, 2017 at 09:03:08AM +0200, Luca Abeni wrote:

> 

> > > I'm still confused..

> > > 

> > > So GRUB does:

> > > 

> > > 	dq = Uact -dt

> > > 

> > > right?  

> > 

> > Right. This is what the original (single processor) GRUB did. And

> > this was used by the "GRUB-PA" algorithm:

> > https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf

> > 

> > (basically, GRUB-PA uses GRUB for reclaiming, and scales the CPU

> > frequency based on Uact)

> > 

> >   

> > > Now, you do DVFS using that same Uact. If we lower the clock, we

> > > need more time, so would we then not end up with something like:

> > > 

> > > 	dq = 1/Uact -dt  

> > 

> > Well, in the GRUB-PA algorithm GRUB reclaiming is the mechanism

> > used to give more runtime to the task... Since Uact is < 1, doing

> > 	dq = - Uact * dt

> > means that we decrease the current runtime by a smaller amount of

> > time. And so we end up giving more runtime to the task: instead of

> > giving dl_runtime every dl_period, we give "dl_runtime / Uact" every

> > dl_period... And since the CPU is slower (by a ratio Uact), this is

> > equivalent to giving dl_runtime at the maximum CPU speed / frequency

> > (at least, in theory :).

> > 

> >   

> > > After all; our budget assignment is such that we're able to

> > > complete our work at max freq. Therefore, when we lower the

> > > frequency, we'll have to increase budget pro rata, otherwise

> > > we'll not complete our work and badness happens.  

> > 

> > Right. But instead of increasing dl_runtime, GRUB-PA decreases the

> > amount of time accounted to the current runtime.

> > 

> >   

> > > Say we have a 1 Ghz part and Uact=0.5 we'd select 500 Mhz and need

> > > double the time to complete.

> > > 

> > > Now, if we fold these two together, you'd get:

> > > 

> > > 	dq = Uact/Uact -dt = -dt  

> > 

> > Not sure why " / Uact"... According to the GRUB-PA algorithm, you

> > just do

> > 	dq = - Uact * dt = -0.5dt

> > and you end up giving the CPU to the task for 2 * dl_runtime every

> > dl_period (as expected)  

> 

> Yeah, I seem to have gone off the rails there... Bah I'm terminally

> confused now. Let me try and get my brain the right way up.


This stuff always confuses me too... :)

The parts that gives me more headaches is how to combine GRUB-PA with
non-reclaiming tasks, and how to cope with "real world issues" (such as
an actual DVFS frequency different from the theoretical one, GRUB
reclaiming less than 100% of the CPU time, etc...)

Anyway, Claudio is running some experiments with this patchset,
measuring power saving and missed deadlines for various sets of
periodic real-time tasks... We hope to present the results at RTLWS.


			Luca
diff mbox series

Patch

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 4ec82fa56b04..5bed1a2fae06 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1122,7 +1122,8 @@  static void update_curr_dl(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
 	struct sched_dl_entity *dl_se = &curr->dl;
-	u64 delta_exec;
+	u64 delta_exec, scaled_delta_exec;
+	int cpu = cpu_of(rq);
 
 	if (!dl_task(curr) || !on_dl_rq(dl_se))
 		return;
@@ -1156,9 +1157,26 @@  static void update_curr_dl(struct rq *rq)
 	if (unlikely(dl_entity_is_special(dl_se)))
 		return;
 
-	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq, &curr->dl);
-	dl_se->runtime -= delta_exec;
+	/*
+	 * For tasks that participate in GRUB, we implement GRUB-PA: the
+	 * spare reclaimed bandwidth is used to clock down frequency.
+	 *
+	 * For the others, we still need to scale reservation parameters
+	 * according to current frequency and CPU maximum capacity.
+	 */
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+		scaled_delta_exec = grub_reclaim(delta_exec,
+						 rq,
+						 &curr->dl);
+	} else {
+		unsigned long scale_freq = arch_scale_freq_capacity(cpu);
+		unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+		scaled_delta_exec = cap_scale(delta_exec, scale_freq);
+		scaled_delta_exec = cap_scale(scaled_delta_exec, scale_cpu);
+	}
+
+	dl_se->runtime -= scaled_delta_exec;
 
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 97bd5aae285d..a6e8a4f5786f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2844,8 +2844,6 @@  static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
 	return c1 + c2 + c3;
 }
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * Accumulate the three separate parts of the sum; d1 the remainder
  * of the last (incomplete) period, d2 the span of full periods and d3
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 178f4a5df2fa..688185761eaf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -155,6 +155,8 @@  static inline int task_has_dl_policy(struct task_struct *p)
 	return dl_policy(p->policy);
 }
 
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
 /*
  * !! For sched_setattr_nocheck() (kernel) only !!
  *