
[1/2] sched/deadline: add per rq tracking of admitted bandwidth

Message ID 20160211171012.GS11415@e106622-lin
State New

Commit Message

Juri Lelli Feb. 11, 2016, 5:10 p.m. UTC
On 11/02/16 09:25, Steven Rostedt wrote:
> On Thu, 11 Feb 2016 14:05:45 +0100
> luca abeni <luca.abeni@unitn.it> wrote:
> 
> > Well, I never used the rq utilization to re-build the root_domain
> > utilization (and I never played with root domains too much... :)...
> > So, I do not really know. Maybe the code should do:
> > 	raw_spin_lock(&rq->lock);
> > 	raw_spin_lock(&cpu_rq(cpu)->lock);
> 
> Of course you want to use double_rq_lock() here instead.
> 

Right. Is something like this completely out of the question/broken?

I tested it briefly with Steve's test and I don't see the warning
anymore (sched_debug looks good as well), but my confidence is still
pretty low. :(

--->8---

From 9713e12bc682ca364e62f9d69bcd44598c50a8a9 Mon Sep 17 00:00:00 2001
From: Juri Lelli <juri.lelli@arm.com>
Date: Thu, 11 Feb 2016 16:55:49 +0000
Subject: [PATCH] fixup! sched/deadline: add per rq tracking of admitted
 bandwidth

Signed-off-by: Juri Lelli <juri.lelli@arm.com>
---
 include/linux/init_task.h |  1 +
 include/linux/sched.h     |  1 +
 kernel/sched/core.c       |  6 +++++-
 kernel/sched/deadline.c   | 26 +++++++++++++++++++++++++-
 4 files changed, 32 insertions(+), 2 deletions(-)

-- 
2.7.0

Comments

Juri Lelli Feb. 12, 2016, 5:19 p.m. UTC | #1
On 12/02/16 18:05, Peter Zijlstra wrote:
> On Thu, Feb 11, 2016 at 05:10:12PM +0000, Juri Lelli wrote:
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 6368f43..1eccecf 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> 
> > +static void swap_task_ac_bw(struct task_struct *p,
> > +			    struct rq *from,
> > +			    struct rq *to)
> > +{
> > +	unsigned long flags;
> > +
> > +	lockdep_assert_held(&p->pi_lock);
> > +	local_irq_save(flags);
> > +	double_rq_lock(from, to);
> > +	__dl_sub_ac(from, p->dl.dl_bw);
> > +	__dl_add_ac(to, p->dl.dl_bw);
> > +	double_rq_unlock(from, to);
> > +	local_irq_restore(flags);
> > +}
> 
> > +static void migrate_task_rq_dl(struct task_struct *p)
> > +{
> > +	if (p->fallback_cpu != -1)
> > +		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
> > +}
> 
> This patch scares me.
> 

Yeah, same here. However, I haven't yet found a different way to fix
this and wanted some help :).

> Now, my brain is having an awfully hard time trying to re-engage after
> flu, but this looks very wrong.
> 
> So we call sched_class::migrate_task_rq() from set_task_cpu(), and we
> call set_task_cpu() while potentially holding rq::lock's (try
> push_dl_task() for kicks).
> 
> Sure, you play horrible games with fallback_cpu, but those games are
> just that, horrible.
> 

Right. I'm counting on fallback_cpu so that we don't call
swap_task_ac_bw() (and take the rq locks) during push/pull migrations.
I was actually thinking that we could have a non-locked version of the
swap and call that from migrate_task_rq_dl during push/pull; something
like the sketch below. But that is most probably even more horrible.
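
Completely untested sketch (the __swap_task_ac_bw name is made up; it
assumes the caller already holds both rq->lock's):

static void __swap_task_ac_bw(struct task_struct *p,
			      struct rq *from,
			      struct rq *to)
{
	/* bandwidth follows the task; both rq locks must already be held */
	lockdep_assert_held(&from->lock);
	lockdep_assert_held(&to->lock);

	__dl_sub_ac(from, p->dl.dl_bw);
	__dl_add_ac(to, p->dl.dl_bw);
}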

> So your initial patch migrates the bandwidth along when a runnable task
> gets moved about, this hack seems to be mostly about waking up. The
> 'normal' accounting is done on enqueue/dequeue, while here you use the
> migration hook.
> 

The problem is that I don't do anything in enqueue/dequeue (apart from
when cpuset migrates us while still on_rq), and I think we don't want
to do anything there, as a task's dl_bw should remain in ac_bw when it
goes to sleep, etc. This is the static view of admitted bandwidth: we
want to save/restore the admitted bandwidth in the root_domain also
when tasks are sleeping/blocked.
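
Just to make that "static view" explicit, the helpers from 1/2 boil
down to something like the following (sketch only; exact placement of
the ac_bw field is assumed):

static inline void __dl_add_ac(struct rq *rq, u64 dl_bw)
{
	/* bandwidth admitted on this rq, whether the task is runnable or not */
	rq->dl.ac_bw += dl_bw;
}

static inline void __dl_sub_ac(struct rq *rq, u64 dl_bw)
{
	rq->dl.ac_bw -= dl_bw;
}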

> Having two separate means of accounting this also feels more fragile
> than one would want.
> 
> Let me think a bit about this.
> 

I was looking at sending out a v2 with this as an RFC. I guess it's
better if I wait :).

Thanks!

Best,

- Juri
Juri Lelli Feb. 25, 2016, 10:07 a.m. UTC | #2
Hi Peter,

On 24/02/16 20:17, Peter Zijlstra wrote:
> On Fri, Feb 12, 2016 at 06:05:30PM +0100, Peter Zijlstra wrote:
> > Having two separate means of accounting this also feels more fragile
> > than one would want.
> > 
> > Let me think a bit about this.
> 
> I think there's a fundamental problem that makes the whole notion of
> per-rq accounting 'impossible'.
>
> On hot-unplug we only migrate runnable tasks, all blocked tasks remain
> on the dead cpu. This would very much include their bandwidth
> requirements.
> 
> This means that between a hot-unplug and the moment that _all_ those
> blocked tasks have ran at least once, the sum of online bandwidth
> doesn't match and we can get into admission trouble (same for GRUB,
> which can also use per-rq bw like this).
> 
> The main problem is that there is no real way to find blocked tasks;
> currently the only way is to iterate _all_ tasks and filter on
> task_cpu().
> 
> We could of course add a blocked tree/list for deadline tasks, to
> explicitly keep track of all these; this would allow migrating blocked
> tasks on hotplug and avoid the real ugly I think. But I've not tried
> yet.
> 

Argh, this makes a lot of sense to me. I've actually pondered a
tree/list solution, but then decided to try the cumulative approach
because it looked nicer. But it contains holes, I'm afraid. As Luca
already said, GRUB shouldn't have these problems, though.

I'll try and see what introducing a list of blocked/throttled deadline
tasks means, considering also the interaction with cpusets and such.
Maybe it's simpler than it seems.
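
Something along these lines, maybe (very rough sketch, all names made
up): keep blocked -deadline tasks on a per-rq list, so they can be
found and migrated away at hot-unplug time.

/* new fields: struct list_head blocked_list in struct dl_rq,
 *             struct list_head blocked_node in struct sched_dl_entity */

static void track_blocked_dl(struct rq *rq, struct task_struct *p)
{
	/* called from dequeue_task_dl() when the task goes to sleep */
	list_add(&p->dl.blocked_node, &rq->dl.blocked_list);
}

static void untrack_blocked_dl(struct rq *rq, struct task_struct *p)
{
	/* called from enqueue_task_dl() at wakeup time */
	list_del_init(&p->dl.blocked_node);
}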

I'm not sure this will come anytime soon, unfortunately. I'm almost 100%
on the sched-freq/schedutil discussion these days.

Anyway, do you also think that what we want, in order to solve the root
domain issue, is something based on rq_online/offline and per-rq
information? Everything else that I tried or thought of was broken or
more horrible. :-/
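
I mean something like the following (sketch only; locking on the
root_domain's dl_bw is omitted and the helper names are made up):

/* fold a runqueue's admitted bandwidth into/out of the root_domain
 * total when the rq goes online/offline for that root_domain;
 * would be called from rq_online_dl()/rq_offline_dl() */
static void dl_ac_bw_online(struct rq *rq)
{
	__dl_add(&rq->rd->dl_bw, rq->dl.ac_bw);
}

static void dl_ac_bw_offline(struct rq *rq)
{
	__dl_clear(&rq->rd->dl_bw, rq->dl.ac_bw);
}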

Best,

- Juri

Patch

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index f2cb8d4..c582f9d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -199,6 +199,7 @@  extern struct task_group root_task_group;
 	.policy		= SCHED_NORMAL,					\
 	.cpus_allowed	= CPU_MASK_ALL,					\
 	.nr_cpus_allowed= NR_CPUS,					\
+	.fallback_cpu	= -1,						\
 	.mm		= NULL,						\
 	.active_mm	= &init_mm,					\
 	.restart_block = {						\
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..a6fc95c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,6 +1401,7 @@  struct task_struct {
 	struct task_struct *last_wakee;
 
 	int wake_cpu;
+	int fallback_cpu;
 #endif
 	int on_rq;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fb9246..4e4bc41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1442,7 +1442,9 @@  static int select_fallback_rq(int cpu, struct task_struct *p)
 				continue;
 			if (!cpu_active(dest_cpu))
 				continue;
-			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+			if (cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p))) {
+				p->fallback_cpu = dest_cpu;
 				return dest_cpu;
+			}
 		}
 	}
@@ -1490,6 +1491,7 @@  out:
 		}
 	}
 
+	p->fallback_cpu = dest_cpu;
 	return dest_cpu;
 }
 
@@ -1954,6 +1956,7 @@  try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
 		set_task_cpu(p, cpu);
+		p->fallback_cpu = -1;
 	}
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6368f43..1eccecf 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1043,6 +1043,21 @@  static void yield_task_dl(struct rq *rq)
 
 #ifdef CONFIG_SMP
 
+static void swap_task_ac_bw(struct task_struct *p,
+			    struct rq *from,
+			    struct rq *to)
+{
+	unsigned long flags;
+
+	lockdep_assert_held(&p->pi_lock);
+	local_irq_save(flags);
+	double_rq_lock(from, to);
+	__dl_sub_ac(from, p->dl.dl_bw);
+	__dl_add_ac(to, p->dl.dl_bw);
+	double_rq_unlock(from, to);
+	local_irq_restore(flags);
+}
+
 static int find_later_rq(struct task_struct *task);
 
 static int
@@ -1077,8 +1092,10 @@  select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
 		if (target != -1 &&
 				(dl_time_before(p->dl.deadline,
 					cpu_rq(target)->dl.earliest_dl.curr) ||
-				(cpu_rq(target)->dl.dl_nr_running == 0)))
+				(cpu_rq(target)->dl.dl_nr_running == 0))) {
 			cpu = target;
+			swap_task_ac_bw(p, rq, cpu_rq(target));
+		}
 	}
 	rcu_read_unlock();
 
@@ -1807,6 +1824,12 @@  static void prio_changed_dl(struct rq *rq, struct task_struct *p,
 		switched_to_dl(rq, p);
 }
 
+static void migrate_task_rq_dl(struct task_struct *p)
+{
+	if (p->fallback_cpu != -1)
+		swap_task_ac_bw(p, task_rq(p), cpu_rq(p->fallback_cpu));
+}
+
 const struct sched_class dl_sched_class = {
 	.next			= &rt_sched_class,
 	.enqueue_task		= enqueue_task_dl,
@@ -1820,6 +1843,7 @@  const struct sched_class dl_sched_class = {
 
 #ifdef CONFIG_SMP
 	.select_task_rq		= select_task_rq_dl,
+	.migrate_task_rq	= migrate_task_rq_dl,
 	.set_cpus_allowed       = set_cpus_allowed_dl,
 	.rq_online              = rq_online_dl,
 	.rq_offline             = rq_offline_dl,