[0/7] sched/deadline: fix cpusets bandwidth accounting

Message ID	1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org
Headers	Delivered-To: patch@linaro.org
	Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
	From: Mathieu Poirier <mathieu.poirier@linaro.org>
	To: mingo@redhat.com, peterz@infradead.org
	Cc: tj@kernel.org, vbabka@suse.cz, lizefan@huawei.com, akpm@linux-foundation.org, weiyongjun1@huawei.com, juri.lelli@arm.com, rostedt@goodmis.org, claudio@evidence.eu.com, luca.abeni@santannapisa.it, bristot@redhat.com, linux-kernel@vger.kernel.org, mathieu.poirier@linaro.org
	Subject: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
	Date: Wed, 16 Aug 2017 15:20:36 -0600
	Message-Id: <1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org>
	Sender: linux-kernel-owner@vger.kernel.org
	Precedence: bulk

Mathieu Poirier Aug. 16, 2017, 9:20 p.m. UTC

This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
operations.  When CPUhotplug and some CPUset manipulations take place, root
domains are destroyed and new ones created, losing at the same time the DL
accounting pertaining to utilisation.

An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
rq_offline() methods, something that highlighted a problem with sleeping
DL tasks. The email thread that followed envisioned creating a list of
sleeping tasks to circle through when recomputing DL accounting.

In this set the problem is addressed by relying on the existing list of
tasks (sleeping or not) already maintained by CPUsets.  When CPUset or
CPUhotplug operations have completed we circle through the list of tasks
maintained by each CPUset looking for DL tasks.  When a DL task is found
its utilisation is added to the root domain it pertains to by way of its
runqueue.
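
The walk described above can be illustrated with a small user-space
sketch.  It is not code from the patch set: the structures root_domain,
rq, task and cpuset are simplified, hypothetical stand-ins for their
kernel namesakes, and the function name rebuild_dl_accounting() is made up
for illustration.  It assumes every task records the CPU of the runqueue
it sits on (running or sleeping) and that utilisation is kept as a
fixed-point number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct root_domain { uint64_t total_bw; };      /* summed DL utilisation */
struct rq { struct root_domain *rd; };          /* one runqueue per CPU */
struct task {
	int is_dl;          /* non-zero for a SCHED_DEADLINE task */
	uint64_t dl_bw;     /* task utilisation, fixed point */
	int cpu;            /* CPU whose runqueue holds the task */
};
struct cpuset { const struct task *tasks; size_t nr_tasks; };

/*
 * Reset the accounting of every root domain, then walk the task list of
 * every cpuset and add each DL task's utilisation to the root domain
 * reached through the task's runqueue.  Sleeping and running tasks are
 * treated identically.
 */
static void rebuild_dl_accounting(struct root_domain *rds, size_t nr_rds,
				  const struct rq *rqs,
				  const struct cpuset *sets, size_t nr_sets)
{
	assert(rds != NULL && rqs != NULL && sets != NULL);

	for (size_t i = 0; i < nr_rds; i++)
		rds[i].total_bw = 0;

	for (size_t s = 0; s < nr_sets; s++) {
		for (size_t t = 0; t < sets[s].nr_tasks; t++) {
			const struct task *p = &sets[s].tasks[t];

			if (!p->is_dl)
				continue;
			rqs[p->cpu].rd->total_bw += p->dl_bw;
		}
	}
}
```

The cost is linear in the number of tasks attached to the cpusets, which
is the time/complexity trade-off discussed in the next paragraph.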

The advantage of proceeding this way is that recomputing of DL accounting
is done the same way for both active and inactive tasks, along with
guaranteeing that DL accounting for tasks ends up in the correct root
domain regardless of the CPUset topology.  The disadvantage is that
circling through all the tasks in a CPUset can be time consuming.  The
counterargument is that both CPUset and CPUhotplug operations are time
consuming in the first place.

OPEN ISSUE:

Regardless of how we proceed (using the existing CPUset list or new ones)
we need to deal with DL tasks that span more than one root domain,
something that will typically happen after a CPUset operation.  For
example, if we split the available CPUs on a system into two CPUsets and
then turn off the 'sched_load_balance' flag on the parent CPUset, DL tasks
in the parent CPUset will end up spanning two root domains.

One way to deal with this is to prevent CPUset operations from happening
when such a condition is detected, as enacted in this set.  Although
simple, this approach feels brittle and akin to a "whack-a-mole" game.  A
better and more reliable approach would be to teach the DL scheduler to
deal with tasks that span multiple root domains, a serious and substantial
undertaking.

I am sending this as a starting point for discussion.  I would be grateful
if you could take the time to comment on the approach and most importantly
provide input on how to deal with the open issue underlined above.

Many thanks,
Mathieu

[1]. https://lkml.org/lkml/2016/2/3/966 
[2]. https://marc.info/?l=linux-kernel&m=145493552607388&w=2  

Mathieu Poirier (7):
  sched/topology: Adding function partition_sched_domains_locked()
  cpuset: Rebuild root domain deadline accounting information
  sched/deadline: Keep new DL task within root domain's boundary
  cgroup: Constrain 'sched_load_balance' flag when DL tasks are present
  cgroup: Concentrate DL related validation code in one place
  cgroup: Constrain the addition of CPUs to a new CPUset
  sched/core: Don't change the affinity of DL tasks

 include/linux/sched.h          |   3 +
 include/linux/sched/deadline.h |   8 ++
 include/linux/sched/topology.h |   9 ++
 kernel/cgroup/cpuset.c         | 186 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/core.c            |  10 +--
 kernel/sched/deadline.c        |  47 ++++++++++-
 kernel/sched/sched.h           |   3 -
 kernel/sched/topology.c        |  31 +++++--
 8 files changed, 272 insertions(+), 25 deletions(-)

-- 
2.7.4

luca abeni Aug. 22, 2017, 12:21 p.m. UTC | #1

Hi Mathieu,

On Wed, 16 Aug 2017 15:20:36 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> operations.  When CPUhotplug and some CPUset manipulation take place root
> domains are destroyed and new ones created, losing at the same time DL
> accounting pertaining to utilisation.


Thanks for looking at this longstanding issue! I am just back from
vacation; I'll try your patches in the next few days.
Do you have some kind of script for reproducing the issue
automatically? (I see that in the original email Steven described how
to reproduce it manually; I just wonder if anyone has already scripted
the test).

> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
> rq_offline() methods, something that highlighted a problem with sleeping
> DL tasks. The email thread that followed envisioned creating a list of
> sleeping tasks to circle through when recomputing DL accounting.
> 
> In this set the problem is addressed by relying on existing list of tasks
> (sleeping or not) already maintained by CPUsets. When CPUset or 
> CPUhotplug operations have completed we circle through the list of tasks
> maintained by each CPUset looking for DL tasks.  When a DL task is found
> its utilisation is added to the root domain it pertains to by way of its
> runqueue.
> 
> The advantage of proceeding this way is that recomputing of DL accounting
> is done the same way for both active and inactive tasks, along with
> guaranteeing that DL accounting for tasks end up in the correct root
> domain regardless of the CPUset topology.  The disadvantage is that
> circling through all the tasks in a CPUset can be time consuming.  The
> counter argument is that both CPUset and CPUhotplug operations are time
> consuming in the first place.


I do not know the cpuset code very well, but I agree that your approach
looks better than creating an additional list for blocked deadline
tasks.


> OPEN ISSUE:
> 
> Regardless of how we proceed (using existing CPUset list or new ones) we
> need to deal with DL tasks that span more than one root domain,  something
> that will typically happen after a CPUset operation.  For example, if we
> split the number of available CPUs on a system in two CPUsets and then turn
> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> parent CPUset will end up spanning two root domains.
> 
> One way to deal with this is to prevent CPUset operations from happening
> when such condition is detected, as enacted in this set.


I think this is the simplest (if not only?) solution if we want to use
gEDF in each root domain.

> Although simple
> this approach feels brittle and akin to a "whack-a-mole" game.  A better
> and more reliable approach would be to teach the DL scheduler to deal with
> tasks that span multiple root domains, a serious and substantial
> undertaking.
> 
> I am sending this as a starting point for discussion.  I would be grateful
> if you could take the time to comment on the approach and most importantly
> provide input on how to deal with the open issue underlined above.


I suspect that if we want to guarantee bounded tardiness then we have to
go for a solution similar to the one suggested by Tommaso some time ago
(if I remember correctly):

if we want to create some "second level cpusets" inside a "parent
cpuset", allowing deadline tasks to be placed inside both the "parent
cpuset" and the "second level cpusets", then we have to subtract the
maximum utilizations of the "second level cpusets" from the utilization
of the "parent cpuset".

I am not sure how difficult it can be to implement this...


If, instead, we want to guarantee that all the deadlines are respected,
then we need to have a look at Brandenburg's paper on arbitrary
affinities:
https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf


			Thanks,
				Luca

Mathieu Poirier Aug. 23, 2017, 7:47 p.m. UTC | #2

On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> Hi Mathieu,


Good day to you,

>

> On Wed, 16 Aug 2017 15:20:36 -0600

> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

>

>> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

>> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

>> operations.  When CPUhotplug and some CUPset manipulation take place root

>> domains are destroyed and new ones created, loosing at the same time DL

>> accounting pertaining to utilisation.

>

> Thanks for looking at this longstanding issue! I am just back from

> vacations; in the next days I'll try your patches.

> Do you have some kind of scripts for reproducing the issue

> automatically? (I see that in the original email Steven described how

> to reproduce it manually; I just wonder if anyone already scripted the

> test).


I didn't bother scripting it since it is so easy to do.  I'm eager to
see how things work out on your end.

>

>> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and

>> rq_offline() methods, something that highlighted a problem with sleeping

>> DL tasks. The email thread that followed envisioned creating a list of

>> sleeping tasks to circle through when recomputing DL accounting.

>>

>> In this set the problem is addressed by relying on existing list of tasks

>> (sleeping or not) already maintained by CPUsets. When CPUset or

>> CPUhotplug operations have completed we circle through the list of tasks

>> maintained by each CPUset looking for DL tasks.  When a DL task is found

>> its utilisation is added to the root domain it pertains to by way of its

>> runqueue.

>>

>> The advantage of proceeding this way is that recomputing of DL accounting

>> is done the same way for both active and inactive tasks, along with

>> guaranteeing that DL accounting for tasks end up in the correct root

>> domain regardless of the CPUset topology.  The disadvantage is that

>> circling through all the tasks in a CPUset can be time consuming.  The

>> counter argument is that both CPUset and CPUhotplug operations are time

>> consuming in the first place.

>

> I do not know the cpuset code too much, but I agree that your approach

> looks better than creating an additional list for blocked deadline

> tasks.

>

>

>> OPEN ISSUE:

>>

>> Regardless of how we proceed (using existing CPUset list or new ones) we

>> need to deal with DL tasks that span more than one root domain,  something

>> that will typically happen after a CPUset operation.  For example, if we

>> split the number of available CPUs on a system in two CPUsets and then turn

>> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

>> parent CPUset will end up spanning two root domains.

>>

>> One way to deal with this is to prevent CPUset operations from happening

>> when such condition is detected, as enacted in this set.

>

> I think this is the simplest (if not only?) solution if we want to use

> gEDF in each root domain.


Global Earliest Deadline First?  Is my interpretation correct?

>

>> Although simple

>> this approach feels brittle and akin to a "whack-a-mole" game.  A better

>> and more reliable approach would be to teach the DL scheduler to deal with

>> tasks that span multiple root domains, a serious and substantial

>> undertaking.

>>

>> I am sending this as a starting point for discussion.  I would be grateful

>> if you could take the time to comment on the approach and most importantly

>> provide input on how to deal with the open issue underlined above.

>

> I suspect that if we want to guarantee bounded tardiness then we have to

> go for a solution similar to the one suggested by Tommaso some time ago

> (if I remember well):

>

> if we want to create some "second level cpusets" inside a "parent

> cpuset", allowing deadline tasks to be placed inside both the "parent

> cpuset" and the "second level cpusets", then we have to subtract the

> "second level cpusets" maximum utilizations from the "parent cpuset"

> utilization.

>

> I am not sure how difficult it can be to implement this...


Humm...  I am missing some context here.  Nonetheless the approach I
was contemplating was to apply the current mathematics to all the
root domains reachable from a task's p->cpus_allowed mask.  As such we'd
have the same acceptance test, but repeated for more than one root
domain.  The time needed to do that can be an issue, but the real problem
I see is related to the current DL code.  It is geared around a single
root domain and changing that means meddling in a lot of places.  I had a
prototype that was beginning to address that but decided to gather
people's opinions before getting in too deep.

>

>

> If, instead, we want to allow to guarantee the respect of all the

> deadlines, then we need to have a look at Brandenburg's paper on

> arbitrary affinities:

> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf

>


Ouch, that's an extended read...

>

>                         Thanks,

>                                 Luca

luca abeni Aug. 24, 2017, 7:53 a.m. UTC | #3

On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

> >> operations.  When CPUhotplug and some CUPset manipulation take place root

> >> domains are destroyed and new ones created, loosing at the same time DL

> >> accounting pertaining to utilisation.  

> >

> > Thanks for looking at this longstanding issue! I am just back from

> > vacations; in the next days I'll try your patches.

> > Do you have some kind of scripts for reproducing the issue

> > automatically? (I see that in the original email Steven described how

> > to reproduce it manually; I just wonder if anyone already scripted the

> > test).  

> 

> I didn't bother scripting it since it is so easy to do.  I'm eager to

> see how things work out on your end.


Ok, so I'll try to reproduce the issue manually as described in Steven's
original email; I'll run some tests as soon as I finish with some stuff
that accumulated during my vacation.

[...]
> >> OPEN ISSUE:

> >>

> >> Regardless of how we proceed (using existing CPUset list or new ones) we

> >> need to deal with DL tasks that span more than one root domain,  something

> >> that will typically happen after a CPUset operation.  For example, if we

> >> split the number of available CPUs on a system in two CPUsets and then turn

> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

> >> parent CPUset will end up spanning two root domains.

> >>

> >> One way to deal with this is to prevent CPUset operations from happening

> >> when such condition is detected, as enacted in this set.  

> >

> > I think this is the simplest (if not only?) solution if we want to use

> > gEDF in each root domain.  

> 

> Global Earliest Deadline First?  Is my interpretation correct?


Right. As far as I understand, the original SCHED_DEADLINE design is to
partition the CPUs into disjoint sets, and then use global EDF scheduling
on each one of those sets (this guarantees bounded tardiness, and if
you run some additional admission tests in user space you can also
guarantee that every deadline is strictly respected).


> >> Although simple

> >> this approach feels brittle and akin to a "whack-a-mole" game.  A better

> >> and more reliable approach would be to teach the DL scheduler to deal with

> >> tasks that span multiple root domains, a serious and substantial

> >> undertaking.

> >>

> >> I am sending this as a starting point for discussion.  I would be grateful

> >> if you could take the time to comment on the approach and most importantly

> >> provide input on how to deal with the open issue underlined above.  

> >

> > I suspect that if we want to guarantee bounded tardiness then we have to

> > go for a solution similar to the one suggested by Tommaso some time ago

> > (if I remember well):

> >

> > if we want to create some "second level cpusets" inside a "parent

> > cpuset", allowing deadline tasks to be placed inside both the "parent

> > cpuset" and the "second level cpusets", then we have to subtract the

> > "second level cpusets" maximum utilizations from the "parent cpuset"

> > utilization.

> >

> > I am not sure how difficult it can be to implement this...  

> 

> Humm...  I am missing some context here.


Or maybe I misunderstood the issue you were seeing (I am no expert on
cpusets). Is it related to hierarchies of cpusets (with one cpuset
contained inside another one)?
Can you describe how to reproduce the problematic situation?

> Nonetheless the approach I
> was contemplating was to repeat the current mathematics to all the
> root domains accessible from a p->cpus_allowed's flag.


I think in the original SCHED_DEADLINE design there should be only one
root domain compatible with the task's affinity... If this does not
happen, I suspect it is a bug (Juri, can you confirm?).

My understanding is that with SCHED_DEADLINE cpusets should be used to
partition the system's CPUs into disjoint sets (and I think there is one
root domain for each one of those disjoint sets). And the task affinity
mask should correspond to the CPUs composing the set in which the
task is executing.


> As such we'd
> have the same acceptance test but repeated to more than one root
> domain.  To do that time can be an issue but the real problem I see is
> related to the current DL code.  It is geared around a single root
> domain and changing that means meddling in a lot of places.  I had a
> prototype that was beginning to address that but decided to gather
> people's opinion before getting in too deep.


I still do not fully understand this (I got the impression that this is
related to hierarchies of cpusets, but I am not sure if this
understanding is correct). Maybe an example would help me to understand.



			Thanks,
				Luca

Juri Lelli Aug. 24, 2017, 8:29 a.m. UTC | #4

Hi,

On 24/08/17 09:53, Luca Abeni wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600

> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> > >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

> > >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

> > >> operations.  When CPUhotplug and some CUPset manipulation take place root

> > >> domains are destroyed and new ones created, loosing at the same time DL

> > >> accounting pertaining to utilisation.  

> > >

> > > Thanks for looking at this longstanding issue! I am just back from

> > > vacations; in the next days I'll try your patches.

> > > Do you have some kind of scripts for reproducing the issue

> > > automatically? (I see that in the original email Steven described how

> > > to reproduce it manually; I just wonder if anyone already scripted the

> > > test).  

> > 

> > I didn't bother scripting it since it is so easy to do.  I'm eager to

> > see how things work out on your end.

> 

> Ok, so I'll try to reproduce the issue manually as described in Steven's

> original email; I'll run some tests as soon as I finish with some stuff

> that accumulated during vacations.

> 


I have to apologize, as I suspect I won't have much time to
properly review this set before LPC. :(
I'll try my best to have a look though.

[...]

> > Nonetheless the approach I
> > was contemplating was to repeat the current mathematics to all the
> > root domains accessible from a p->cpus_allowed's flag.
> 
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
> 
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
> 


Correct. No overlapping cpusets are allowed, and a task's affinity can't
be restricted to a subset of the cpuset's root domain cpus.
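
The invariant stated here (a DL task's affinity must be compatible with
exactly one root domain) can be checked with a few lines of user-space C.
This is only an illustrative sketch: CPU sets are modelled as 32-bit
masks, and count_spanned_domains() is a hypothetical helper, not a kernel
function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many root domains have at least one CPU in common with a
 * task's affinity mask.  Under the design described in this thread a DL
 * task should yield exactly 1; anything larger means the task spans
 * several root domains.
 */
static size_t count_spanned_domains(const uint32_t *spans, size_t n,
				    uint32_t cpus_allowed)
{
	size_t hits = 0;

	assert(spans != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (spans[i] & cpus_allowed)
			hits++;
	return hits;
}
```

With two root domains covering CPUs 0-1 (mask 0x3) and CPUs 2-3 (mask
0xc), an affinity of 0xc touches one domain while an affinity of 0xf
touches both.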

[...]

Thanks,

- Juri

Mathieu Poirier Aug. 24, 2017, 8:32 p.m. UTC | #5

On 24 August 2017 at 01:53, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600

> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

>> >> operations.  When CPUhotplug and some CUPset manipulation take place root

>> >> domains are destroyed and new ones created, loosing at the same time DL

>> >> accounting pertaining to utilisation.

>> >

>> > Thanks for looking at this longstanding issue! I am just back from

>> > vacations; in the next days I'll try your patches.

>> > Do you have some kind of scripts for reproducing the issue

>> > automatically? (I see that in the original email Steven described how

>> > to reproduce it manually; I just wonder if anyone already scripted the

>> > test).

>>

>> I didn't bother scripting it since it is so easy to do.  I'm eager to

>> see how things work out on your end.

>

> Ok, so I'll try to reproduce the issue manually as described in Steven's

> original email; I'll run some tests as soon as I finish with some stuff

> that accumulated during vacations.

>

> [...]

>> >> OPEN ISSUE:

>> >>

>> >> Regardless of how we proceed (using existing CPUset list or new ones) we

>> >> need to deal with DL tasks that span more than one root domain,  something

>> >> that will typically happen after a CPUset operation.  For example, if we

>> >> split the number of available CPUs on a system in two CPUsets and then turn

>> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

>> >> parent CPUset will end up spanning two root domains.

>> >>

>> >> One way to deal with this is to prevent CPUset operations from happening

>> >> when such condition is detected, as enacted in this set.

>> >

>> > I think this is the simplest (if not only?) solution if we want to use

>> > gEDF in each root domain.

>>

>> Global Earliest Deadline First?  Is my interpretation correct?

>

> Right. As far as I understand, the original SCHED_DEADLINE design is to

> partition the CPUs in disjoint sets, and then use global EDF scheduling

> on each one of those sets (this guarantees bounded tardiness, and if

> you run some additional admission tests in user space you can also

> guarantee the hard respect of every deadline).

>

>

>> >> Although simple

>> >> this approach feels brittle and akin to a "whack-a-mole" game.  A better

>> >> and more reliable approach would be to teach the DL scheduler to deal with

>> >> tasks that span multiple root domains, a serious and substantial

>> >> undertaking.

>> >>

>> >> I am sending this as a starting point for discussion.  I would be grateful

>> >> if you could take the time to comment on the approach and most importantly

>> >> provide input on how to deal with the open issue underlined above.

>> >

>> > I suspect that if we want to guarantee bounded tardiness then we have to

>> > go for a solution similar to the one suggested by Tommaso some time ago

>> > (if I remember well):

>> >

>> > if we want to create some "second level cpusets" inside a "parent

>> > cpuset", allowing deadline tasks to be placed inside both the "parent

>> > cpuset" and the "second level cpusets", then we have to subtract the

>> > "second level cpusets" maximum utilizations from the "parent cpuset"

>> > utilization.

>> >

>> > I am not sure how difficult it can be to implement this...

>>

>> Humm...  I am missing some context here.

>

> Or maybe I misunderstood the issue you were seeing (I am no expert on

> cpusets). Is it related to hierarchies of cpusets (with one cpuset

> contained inside another one)?


Having spent a lot of time in the CPUset code, I can understand the confusion.

CPUset allows creating a hierarchy of sets, _seemingly_ creating
overlapping root domains.  Fortunately that isn't the case -
overlapping CPUsets are morphed together to create non-overlapping
root domains.  The magic happens in rebuild_sched_domains_locked() [1]
where generate_sched_domains() [2] transforms any CPUset topology into
disjoint domains.
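
To give a feel for what "morphing" overlapping sets into disjoint domains
means, here is a minimal user-space sketch.  It is not the algorithm used
by generate_sched_domains(), which also has to honour the
'sched_load_balance' flag and the cpuset hierarchy; it only merges CPU
masks that share a CPU until none overlap.  The function name
merge_overlapping() and the 32-bit mask representation are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge, in place, every pair of CPU masks that share at least one CPU.
 * On return the first 'n' entries (n being the return value) are pairwise
 * disjoint and cover the same CPUs as the input.
 */
static size_t merge_overlapping(uint32_t *masks, size_t n)
{
	size_t i = 0;

	assert(masks != NULL || n == 0);

	while (i < n) {
		int merged = 0;

		for (size_t j = i + 1; j < n; j++) {
			if (masks[i] & masks[j]) {
				masks[i] |= masks[j];     /* absorb set j */
				masks[j] = masks[n - 1];  /* drop slot j */
				n--;
				merged = 1;
				break;  /* rescan with the grown mask */
			}
		}
		if (!merged)
			i++;
	}
	return n;
}
```

For instance the masks 0x3 (CPUs 0-1), 0x6 (CPUs 1-2) and 0x8 (CPU 3)
collapse into two disjoint domains, 0x7 and 0x8, whereas 0x3 and 0xc are
already disjoint and are left untouched.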

> Can you describe how to reproduce the problematic situation?


Let's start with a 4 CPU system (in this case the Q401c Dragon board)
where patches 1/7 and 2/7 have been applied to a vanilla kernel.  I'm
also using Juri's tools [3,4] as described in Steve's email [5].

root@linaro-developer:/home/linaro# uname -a
Linux linaro-developer 4.13.0-rc5-00012-g98bf1310205e #149 SMP PREEMPT Thu Aug 24 13:12:39 MDT 2017 aarch64 GNU/Linux
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# cat /sys/devices/system/cpu/online
0-3
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[2]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
root@linaro-developer:/home/linaro#

This checks out as expected.  Now let's create 2 CPUsets and make sure
new root domains are created by setting the 'sched_load_balance' flag
to '0' on the default CPUset.

root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@linaro-developer:/sys/fs/cgroup/cpuset#

At this time runqueue0 and runqueue1 point to root domain A while
runqueue2 and runqueue3 point to root domain B (something that can't
be seen without adding more instrumentation).  Newly created tasks can
roam on all the CPUs available:


root@linaro-developer:/home/linaro# ./burn &
[1] 3973
root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
Cpus_allowed: f
root@linaro-developer:/home/linaro#

The above demonstrates that even though we have two CPUsets, new tasks
belong to the "default" CPUset and as such can use all the available CPUs.
Now let's make task 3973 a DL task:

root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
  dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0                  <------ Problem
  dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0                  <------ Problem
  dl_rq[2]:
  .dl_nr_running                 : 1
  .dl_nr_migratory               : 1
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718        <------ As expected
  dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718        <------ As expected
root@linaro-developer:/home/linaro/jlelli#

When task 3973 was promoted to a DL task it was running on either CPU2
or CPU3.  The acceptance test was done on root domain B and the task
utilisation added as expected.  But as pointed out above task 3973 can
still be scheduled on CPU0 and CPU1 and that is a problem since the
utilisation hasn't been added there as well.  The task is now spread
over two root domains rather than a single one, as currently expected
by the DL code (note that there are many ways to reproduce this
situation).
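
For readers wondering where the numbers 996147 and 943718 in the
/proc/sched_debug output come from, the sketch below reproduces the
arithmetic in user space.  It assumes utilisation is stored as a fixed
point ratio with 20 fractional bits (runtime << 20 / period), and that the
per-CPU limit derives from the default 950000/1000000 runtime/period
setting; both assumptions match the values shown above but the code is an
illustration, not the kernel implementation.  The helper dl_overflow() is
likewise a simplified model of an acceptance test against one root domain.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT 20  /* assumed number of fractional bits */

/* Fixed-point utilisation: runtime / period scaled by 2^BW_SHIFT. */
static uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	assert(period > 0);
	return (runtime << BW_SHIFT) / period;
}

/*
 * Non-zero when adding new_bw to a root domain that already accounts for
 * total_bw would exceed the per-CPU limit multiplied by the number of
 * CPUs in that root domain.
 */
static int dl_overflow(uint64_t bw_per_cpu, unsigned int cpus,
		       uint64_t total_bw, uint64_t new_bw)
{
	return bw_per_cpu * cpus < total_bw + new_bw;
}
```

to_ratio(1000000, 950000) gives 996147 (the .dl_bw->bw value) and
to_ratio(1000000, 900000) gives 943718 (the .dl_bw->total_bw value after
promoting task 3973).  Root domain B holds two CPUs, so its limit is
2 * 996147 = 1992294 and the task is accepted there; root domain A never
sees the request, which is the problem described above.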

In its current form the patchset prevents specific operations from
being carried out if we recognise that a task could end up spanning
more than a single root domain.  But that will break as soon as we
find a new way to create a DL task that spans multiple domains (and I
may not have caught them all either).

Another way to fix this is to do an acceptance test on all the root
domains of a task.  So above we'd run the acceptance test on root
domains A and B before promoting the task.  Of course we'd also have to
add the utilisation of that task to both root domains.  Although simple,
it goes to the core of the DL scheduler and touches pretty much every
aspect of it, something I'm reluctant to embark on.
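
As a rough illustration of this second option, the sketch below runs the
acceptance test on every root domain reachable from a task's affinity
mask and only commits the utilisation if all of them accept it.  The
struct rd_bw and the function dl_admit_all() are hypothetical and heavily
simplified (32-bit CPU masks, no locking, no removal path); the real DL
code would need far more than this, which is precisely the concern raised
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-root-domain bandwidth bookkeeping. */
struct rd_bw {
	uint32_t span;       /* CPUs belonging to the root domain */
	unsigned int cpus;   /* number of CPUs in 'span' */
	uint64_t bw;         /* per-CPU utilisation limit */
	uint64_t total_bw;   /* utilisation already admitted */
};

/*
 * All-or-nothing admission: returns 0 and charges new_bw to every root
 * domain intersecting cpus_allowed, or returns -1 and changes nothing if
 * any of those root domains would overflow.
 */
static int dl_admit_all(struct rd_bw *rds, size_t n,
			uint32_t cpus_allowed, uint64_t new_bw)
{
	assert(rds != NULL || n == 0);

	/* Pass 1: test every root domain the task can run in. */
	for (size_t i = 0; i < n; i++) {
		if (!(rds[i].span & cpus_allowed))
			continue;
		if (rds[i].total_bw + new_bw > rds[i].bw * rds[i].cpus)
			return -1;
	}

	/* Pass 2: every test passed, commit the utilisation. */
	for (size_t i = 0; i < n; i++)
		if (rds[i].span & cpus_allowed)
			rds[i].total_bw += new_bw;

	return 0;
}
```

Splitting the work into a test pass and a commit pass avoids having to
roll back a partially applied update when a later root domain refuses the
task.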

[1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
[2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
[3]. https://github.com/jlelli/tests.git
[4]. https://github.com/jlelli/schedtool-dl.git
[5]. https://lkml.org/lkml/2016/2/3/966

>

>> Nonetheless the approach I

>> was contemplating was to repeat the current mathematics to all the

>> root domains accessible from a p->cpus_allowed's flag.

>

> I think in the original SCHED_DEADLINE design there should be only one

> root domain compatible with the task's affinity... If this does not

> happen, I suspect it is a bug (Juri, can you confirm?).

>

> My understanding is that with SCHED_DEADLINE cpusets should be used to

> partition the system's CPUs in disjoint sets (and I think there is one

> root domain for each one of those disjoint sets). And the task affinity

> mask should correspond with the CPUs composing the set in which the

> task is executing.

>

>

>> As such we'd

>> have the same acceptance test but repeated to more than one root

>> domain.  To do that time can be an issue but the real problem I see is

>> related to the current DL code.  It is geared around a single root

>> domain and changing that means meddling in a lot of places.  I had a

>> prototype that was beginning to address that but decided to gather

>> people's opinion before getting in too deep.

>

> I still do not fully understand this (I got the impression that this is

> related to hierarchies of cpusets, but I am not sure if this

> understanding is correct). Maybe an example would help me to understand.


The above should say it all - please get back to me if I haven't
expressed myself clearly.

>

>

>

>                         Thanks,

>                                 Luca

luca abeni Aug. 25, 2017, 6:02 a.m. UTC | #6

Hi Mathieu,

On Thu, 24 Aug 2017 14:32:20 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
[...]
> >> > if we want to create some "second level cpusets" inside a "parent

> >> > cpuset", allowing deadline tasks to be placed inside both the

> >> > "parent cpuset" and the "second level cpusets", then we have to

> >> > subtract the "second level cpusets" maximum utilizations from

> >> > the "parent cpuset" utilization.

> >> >

> >> > I am not sure how difficult it can be to implement this...  

> >>

> >> Humm...  I am missing some context here.  

> >

> > Or maybe I misunderstood the issue you were seeing (I am no expert

> > on cpusets). Is it related to hierarchies of cpusets (with one

> > cpuset contained inside another one)?  

> 

> Having spent a lot of time in the CPUset code, I can understand the

> confusion.

> 

> CPUset allows to create a hierarchy of sets, _seemingly_ creating

> overlapping root domains.  Fortunately that isn't the case -

> overlapping CPUsets are morphed together to create non-overlapping

> root domains.  The magic happens in rebuild_sched_domains_locked() [1]

> where generate_sched_domains() [2] transforms any CPUset topology into

> disjoint domains.


Ok; thanks for explaining
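
The idea of morphing overlapping CPUsets into disjoint domains can be illustrated with a small user-space sketch. This is only a model of the principle and not the logic of generate_sched_domains(), which also takes flags such as sched_load_balance into account; CPU sets are represented as plain bitmasks and merge_overlapping() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Merge overlapping CPU masks in place until the remaining masks are
 * pairwise disjoint.  Bit i set in a mask means CPU i belongs to it.
 * Returns the number of masks left at the front of the array.
 */
static size_t merge_overlapping(uint64_t *masks, size_t n)
{
	bool merged;

	assert(masks != NULL || n == 0);

	do {
		merged = false;
		for (size_t i = 0; i < n; i++) {
			size_t j = i + 1;

			while (j < n) {
				if (masks[i] & masks[j]) {
					/* Fold j into i, move the last mask into slot j. */
					masks[i] |= masks[j];
					masks[j] = masks[--n];
					merged = true;
				} else {
					j++;
				}
			}
		}
	} while (merged); /* a merge may create new overlaps, so repeat */

	return n;
}
```

With the masks 0x3 (CPUs 0-1), 0x6 (CPUs 1-2) and 0x18 (CPUs 3-4), the first two overlap on CPU1 and collapse into 0x7, leaving two disjoint sets.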

[...]
> root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@linaro-developer:/sys/fs/cgroup/cpuset#

> 

> At this time runqueue0 and runqueue1 point to root domain A while

> runqueue2 and runqueue3 point to root domain B (something that can't

> be seen without adding more instrumentation).


Ok; up to here, everything is clear to me ;-)

> Newly created tasks can  roam on all the CPUs available:

> 

> 

> root@linaro-developer:/home/linaro# ./burn &
> [1] 3973
> root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
> Cpus_allowed: f
> root@linaro-developer:/home/linaro#


This happens because the task is in neither set1 nor set2, right? I
_think_ (but I am not sure; I did not design this part of
SCHED_DEADLINE) that the original idea was that in this situation
SCHED_DEADLINE tasks can only be in set1 or in set2 (SCHED_DEADLINE
tasks are not allowed to be in the "default" CPUset, in this setup).
Is this what one of your later patches enforces?


> The above demonstrate that even if we have two CPUsets new task belong

> to the "default" CPUset and as such can use all the available CPUs.


I still have a doubt (probably showing all my ignorance about
CPUsets :)... In this situation, we have 3 CPUsets: "default",
set1, and set2... Is each one of these CPUsets associated with a
root domain (so, we have 3 root domains)? Or are only set1 and set2
associated with a root domain?


> Now let's make task 3973 a DL task:

> 

> root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
> root@linaro-developer:/home/linaro# grep dl /proc/sched_debug

>   dl_rq[0]:

>   .dl_nr_running                 : 0

>   .dl_nr_migratory               : 0

>   .dl_bw->bw                     : 996147

>   .dl_bw->total_bw               : 0                  <------ Problem


Ok; I think I understand the problem, now...


>   dl_rq[3]:

>   .dl_nr_running                 : 0

>   .dl_nr_migratory               : 0

>   .dl_bw->bw                     : 996147

>   .dl_bw->total_bw               : 943718        <------ As expected

> root@linaro-developer:/home/linaro/jlelli#

> 

> When task 3973 was promoted to a DL task it was running on either CPU2

> or CPU3.  The acceptance test was done on root domain B and the task

> utilisation added as expected.  But as pointed out above task 3973 can

> still be scheduled on CPU0 and CPU1 and that is a problem since the

> utilisation hasn't been added there as well.  The task is now spread

> over two root domains rather than a single one, as currently expected

> by the DL code (note that there are many ways to reproduce this

> situation).
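
The figures in the sched_debug output above can be checked by hand. The following is a minimal sketch, assuming the bandwidth values are runtime/period ratios stored as fixed-point numbers with 20 fractional bits; ratio_fixed() and BW_SHIFT_MODEL are illustrative names, not identifiers quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT_MODEL 20 /* assumed number of fractional bits */

/* Fixed-point runtime/period ratio, truncated towards zero. */
static uint64_t ratio_fixed(uint64_t runtime, uint64_t period)
{
	assert(period != 0);
	return (runtime << BW_SHIFT_MODEL) / period;
}
```

Under that assumption, ratio_fixed(900000, 1000000) gives 943718, matching the total_bw reported for dl_rq[3] after the schedtool call, and ratio_fixed(950000, 1000000) gives 996147, matching the bw column, which would correspond to a 95% cap.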


I think this is a bug, and the only reasonable solution is to allow the
task to become SCHED_DEADLINE only if it is in set1 or set2 (so, only if
its affinity mask coincides exactly with all of the CPUs of the root
domain where the task utilization is added).
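
The rule suggested here can be written as a simple predicate. This is a user-space sketch with CPU sets modelled as bitmasks; cpumask_model_t and dl_affinity_ok() are hypothetical names and do not come from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPU sets modelled as bitmasks: bit i set means CPU i is in the set. */
typedef uint64_t cpumask_model_t;

/*
 * A task may become SCHED_DEADLINE only if its affinity mask is exactly
 * the span of the root domain that will account for its utilisation.
 */
static bool dl_affinity_ok(cpumask_model_t task_mask, cpumask_model_t rd_span)
{
	assert(rd_span != 0);
	return task_mask == rd_span;
}
```

In the example from this thread, task 3973 has Cpus_allowed set to f while root domain B covers CPU2 and CPU3 (mask 0xc), so the predicate would reject the promotion until the task is moved into set2.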


> In its current form the patchset prevents specific operations from

> being carried out if we recognise that a task could end up spanning

> more than a single root domain.


Good. I think this is the right way to go.


> But that will break as soon as we

> find a new way to create a DL task that spans multiple domains (and I

> may not have caught them all either).


So, we need to fix that too ;-)


> Another way to fix this is to do an acceptance test on all the root

> domain of a task.


I think we need to understand what the intended behaviour of
SCHED_DEADLINE is in this situation... My understanding is that
SCHED_DEADLINE is designed to do global EDF scheduling inside an
"isolated" CPUset; a SCHED_DEADLINE task spanning multiple domains would
break some SCHED_DEADLINE properties (from the scheduling theory
point of view) in some interesting ways...

I am not saying we should not do this, but I believe that allowing
tasks to span multiple domains requires some redesign of the admission
test and migration mechanisms in SCHED_DEADLINE.

I think this is related to the "generic affinities" issue that Peter
mentioned some time ago.


> So above we'd run the acceptance test on root

> domain A and B before promoting the task.  Of course we'd also have to

> add the utilisation of that task to both root domain.  Although simple

> it goes at the core of the DL scheduler and touches pretty much every

> aspect of it, something I'm reluctant to embark on.


I see... So, the "default" CPUset does not have any root domain
associated with it? If it had, we could just subtract the maximum
utilizations of set1 and set2 from it when creating the root domains of
set1 and set2.



			Thanks,
				Luca

> 

> [1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
> [2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
> [3]. https://github.com/jlelli/tests.git
> [4]. https://github.com/jlelli/schedtool-dl.git
> [5]. https://lkml.org/lkml/2016/2/3/966

> 

> >  

> >> Nonetheless the approach I

> >> was contemplating was to repeat the current mathematics to all the

> >> root domains accessible from a p->cpus_allowed's flag.  

> >

> > I think in the original SCHED_DEADLINE design there should be only

> > one root domain compatible with the task's affinity... If this does

> > not happen, I suspect it is a bug (Juri, can you confirm?).

> >

> > My understanding is that with SCHED_DEADLINE cpusets should be used

> > to partition the system's CPUs in disjoint sets (and I think there

> > is one root domain for each one of those disjoint sets). And the

> > task affinity mask should correspond with the CPUs composing the

> > set in which the task is executing.

> >

> >  

> >> As such we'd

> >> have the same acceptance test but repeated to more than one root

> >> domain.  To do that time can be an issue but the real problem I

> >> see is related to the current DL code.  It is geared around a

> >> single root domain and changing that means meddling in a lot of

> >> places.  I had a prototype that was beginning to address that but

> >> decided to gather people's opinion before getting in too deep.  

> >

> > I still do not fully understand this (I got the impression that

> > this is related to hierarchies of cpusets, but I am not sure if this

> > understanding is correct). Maybe an example would help me to

> > understand.  

> 

> The above should say it all - please get back to me if I haven't

> expressed myself clearly.

> 

> >

> >

> >

> >                         Thanks,

> >                                 Luca

luca abeni Aug. 25, 2017, 9:52 a.m. UTC | #7

On Fri, 25 Aug 2017 08:02:43 +0200
luca abeni <luca.abeni@santannapisa.it> wrote:
[...]
> > The above demonstrate that even if we have two CPUsets new task belong

> > to the "default" CPUset and as such can use all the available CPUs.  

> 

> I still have a doubt (probably showing all my ignorance about

> CPUsets :)... In this situation, we have 3 CPUsets: "default",

> set1, and set2... Is everyone of these CPUsets associated to a

> root domain (so, we have 3 root domains)? Or only set1 and set2 are

> associated to a root domain?


Ok, after reading (and hopefully understanding better :) the code, I
think this question was kind of silly... There are only 2 root domains,
corresponding to set1 and set2 (right?).

[...]

> > So above we'd run the acceptance test on root

> > domain A and B before promoting the task.  Of course we'd also have to

> > add the utilisation of that task to both root domain.  Although simple

> > it goes at the core of the DL scheduler and touches pretty much every

> > aspect of it, something I'm reluctant to embark on.  

> 

> I see... So, the "default" CPUset does not have any root domain

> associated to it? If it had, we could just subtract the maximum

> utilizations of set1 and set2 to it when creating the root domains of

> set1 and set2.

...
So, this idea of mine made no sense.

I think the correct solution is what you implemented in your patchset
(if I understand it correctly).

If we want to have tasks spanning multiple root domains, many more
changes in the code are needed... I am wondering if it would make more
sense to track utilizations per runqueue (instead of per root domain):
- when a task tries to become SCHED_DEADLINE, we count how many CPUs are
  in its affinity mask. Let's call this number "n"
- then, we add u / n (where "u" is the task's utilization) to the
  utilization of every runqueue that is in its affinity mask, and we
  check if all the sums are below the schedulability bound

For tasks spanning one single root domain, this should be equivalent to
the current admission test. Moreover, this check should ensure that no
root domain can ever be overloaded (even if tasks span multiple
domains).
But I do not know the locking implications of this idea... I suspect
it will not scale :(
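
The two steps above can be sketched in user space as follows. This is only an illustration of the arithmetic: the four-CPU limit, the bitmask representation and the names mask_weight() and admit_per_rq() are assumptions, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4 /* size of the modelled system */

/* Number of CPUs set in the low NR_CPUS_MODEL bits of mask. */
static unsigned int mask_weight(uint32_t mask)
{
	unsigned int n = 0;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (mask & (1u << cpu))
			n++;
	return n;
}

/*
 * Per-runqueue admission: charge u / n to every runqueue in the task's
 * affinity mask, provided no runqueue would exceed the bound.  All
 * values use the same fixed-point scale.  Nothing changes on failure.
 */
static bool admit_per_rq(uint64_t rq_bw[NR_CPUS_MODEL], uint32_t mask,
			 uint64_t u, uint64_t bound)
{
	unsigned int n = mask_weight(mask);
	uint64_t share;

	assert(n > 0);
	share = u / n;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if ((mask & (1u << cpu)) && rq_bw[cpu] + share > bound)
			return false;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (mask & (1u << cpu))
			rq_bw[cpu] += share;

	return true;
}
```

The integer division truncates, so up to n - 1 units of utilisation are dropped per task. A real implementation would have to decide how to round.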



				Luca

luca abeni Aug. 25, 2017, 2:37 p.m. UTC | #8

Hi Mathieu,

On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote:

> > Hi Mathieu,  

> 

> Good day to you,

> 

> >

> > On Wed, 16 Aug 2017 15:20:36 -0600

> > Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

> >  

> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

> >> operations.  When CPUhotplug and some CPUset manipulation take place root

> >> domains are destroyed and new ones created, losing at the same time DL

> >> accounting pertaining to utilisation.  

> >

> > Thanks for looking at this longstanding issue! I am just back from

> > vacations; in the next days I'll try your patches.

> > Do you have some kind of scripts for reproducing the issue

> > automatically? (I see that in the original email Steven described how

> > to reproduce it manually; I just wonder if anyone already scripted the

> > test).  

> 

> I didn't bother scripting it since it is so easy to do.  I'm eager to

> see how things work out on your end.

I ran some tests with your patchset, and I confirm that it fixes the
issue originally pointed out by Steven.

But I still need to run some more tests (I'll continue on Monday).

I think I found an issue by:
1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset,
   CPUs 2 and 3 in the second one) and setting sched_load_balance to 0
2) starting a task in one of the two cpusets, and making it
   SCHED_DEADLINE <--- up to here, everything looks fine
3) setting sched_load_balance to 1 <--- At this point, I think there is
   a bug: the system has only one root domain, and the task utilization
   is summed to it... But the task affinity mask is still the one of
   the "old root domain" that was associated with the cpuset where the
   task is executing.

I still need to run some experiments about this.

				Thanks,
					Luca

Mathieu Poirier Aug. 25, 2017, 7:53 p.m. UTC | #9

On 25 August 2017 at 03:52, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> On Fri, 25 Aug 2017 08:02:43 +0200

> luca abeni <luca.abeni@santannapisa.it> wrote:

> [...]

>> > The above demonstrate that even if we have two CPUsets new task belong

>> > to the "default" CPUset and as such can use all the available CPUs.

>>

>> I still have a doubt (probably showing all my ignorance about

>> CPUsets :)... In this situation, we have 3 CPUsets: "default",

>> set1, and set2... Is everyone of these CPUsets associated to a

>> root domain (so, we have 3 root domains)? Or only set1 and set2 are

>> associated to a root domain?

>

> Ok, after reading (and hopefully understanding better :) the code, I

> think this question was kind of silly... There are only 2 root domains,

> corresponding to set1 and set2 (right?).


Correct - although there is a default CPUset there isn't a default root domain.

>

> [...]

>

>> > So above we'd run the acceptance test on root

>> > domain A and B before promoting the task.  Of course we'd also have to

>> > add the utilisation of that task to both root domain.  Although simple

>> > it goes at the core of the DL scheduler and touches pretty much every

>> > aspect of it, something I'm reluctant to embark on.

>>

>> I see... So, the "default" CPUset does not have any root domain

>> associated to it? If it had, we could just subtract the maximum

>> utilizations of set1 and set2 to it when creating the root domains of

>> set1 and set2.

> ...

> So, this idea of mine had no sense.

>

> I think the correct solution is what you implemented in your patchset

> (if I understand it correctly).

>

> If we want to have task spanning multiple root domains, many more

> changes in the code are needed... I am wondering if it would make more

> sense to track utilizations per runqueue (instead of per root domain):

> - when a task tries to become SCHED_DEADLINE, we count how many CPUs are

>   in its affinity mask. Let's call "n" this number

> - then, we sum u / n (where "u" is the task's utilization) to the

>   utilization of every runqueue that is in its affinity mask, and we

>   check if all the sums are below the schedulability bound

>

> For tasks spanning one single root domain, this should be equivalent to

> the current admission test. Moreover, this check should ensure that no

> root domain can be ever overloaded (even if tasks span multiple

> domains).


This is an idea worth exploring.

> But I do not know the locking implications for this idea... I suspect

> it will not scale :(


Right, scaling could be a problem - we'd have to prototype it and see
how bad things get.  We _may_ be able to figure something out with RCU
trickery.

As I mentioned in a previous email, I toyed with the idea of extending
the DL code to support more than one root domain.  Maybe it is time to
go back to it, finish the admission test and publish just that part...
At least we would have code to comment on.

Regardless of the avenue we choose to go with I think we could use my
current solution as a stepping stone while we figure out what we
really want to do.  At least it would be an improvement on the current
situation.

>

>

>

>                                 Luca

Mathieu Poirier Aug. 25, 2017, 8:29 p.m. UTC | #10

On 25 August 2017 at 08:37, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> Hi Mathieu,

>

> On Wed, 23 Aug 2017 13:47:13 -0600

> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

>

>> On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote:

>> > Hi Mathieu,

>>

>> Good day to you,

>>

>> >

>> > On Wed, 16 Aug 2017 15:20:36 -0600

>> > Mathieu Poirier <mathieu.poirier@linaro.org> wrote:

>> >

>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]

>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug

>> >> operations.  When CPUhotplug and some CPUset manipulation take place root

>> >> domains are destroyed and new ones created, losing at the same time DL

>> >> accounting pertaining to utilisation.

>> >

>> > Thanks for looking at this longstanding issue! I am just back from

>> > vacations; in the next days I'll try your patches.

>> > Do you have some kind of scripts for reproducing the issue

>> > automatically? (I see that in the original email Steven described how

>> > to reproduce it manually; I just wonder if anyone already scripted the

>> > test).

>>

>> I didn't bother scripting it since it is so easy to do.  I'm eager to

>> see how things work out on your end.

>

> I ran some tests with your patchset, and I confirm that it fixes the

> issue originally pointed out by Steven.

>


Good, at least it's a start.

> But I still need to run some more tests (I'll continue on Monday).

>

> I think I found an issue by:

> 1) creating two disjoint cpusets (CPUs 0 and 1 in the first cpuset,

>    CPUs 2 and 3 in the second one) and setting sched_load_balance to 0

> 2) starting a task in one of the two cpusets, and making it

>    SCHED_DEADLINE <--- up to here, everything looks fine

> 3) setting sched_load_balance to 1 <--- At this point, I think there is

>    a bug: the system has only one root domain, and the task utilization

>    is summed to it... But the task affinity mask is still the one of

>    the "old root domain" that was associated with the cpuset where the

>    task is executing.


I can reproduce the problem on my side as well.

This is how CPUset works and the expected behaviour.  For normal tasks
it isn't a problem but I agree with you that for DL tasks, we need to
address this.

>

> I still need to run some experiments about this.


Thanks for the time,
Mathieu

>

>

>

>                                 Thanks,

>                                         Luca

Mathieu Poirier Aug. 25, 2017, 8:35 p.m. UTC | #11

On 25 August 2017 at 03:52, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> On Fri, 25 Aug 2017 08:02:43 +0200

> luca abeni <luca.abeni@santannapisa.it> wrote:

> [...]

>> > The above demonstrate that even if we have two CPUsets new task belong

>> > to the "default" CPUset and as such can use all the available CPUs.

>>

>> I still have a doubt (probably showing all my ignorance about

>> CPUsets :)... In this situation, we have 3 CPUsets: "default",

>> set1, and set2... Is everyone of these CPUsets associated to a

>> root domain (so, we have 3 root domains)? Or only set1 and set2 are

>> associated to a root domain?

>

> Ok, after reading (and hopefully understanding better :) the code, I

> think this question was kind of silly... There are only 2 root domains,

> corresponding to set1 and set2 (right?).


For this scenario yes, you are correct.

>

> [...]

>

>> > So above we'd run the acceptance test on root

>> > domain A and B before promoting the task.  Of course we'd also have to

>> > add the utilisation of that task to both root domain.  Although simple

>> > it goes at the core of the DL scheduler and touches pretty much every

>> > aspect of it, something I'm reluctant to embark on.

>>

>> I see... So, the "default" CPUset does not have any root domain

>> associated to it? If it had, we could just subtract the maximum

>> utilizations of set1 and set2 to it when creating the root domains of

>> set1 and set2.

> ...

> So, this idea of mine had no sense.

>

> I think the correct solution is what you implemented in your patchset

> (if I understand it correctly).

>

> If we want to have task spanning multiple root domains, many more

> changes in the code are needed... I am wondering if it would make more

> sense to track utilizations per runqueue (instead of per root domain):

> - when a task tries to become SCHED_DEADLINE, we count how many CPUs are

>   in its affinity mask. Let's call "n" this number

> - then, we sum u / n (where "u" is the task's utilization) to the

>   utilization of every runqueue that is in its affinity mask, and we

>   check if all the sums are below the schedulability bound

>

> For tasks spanning one single root domain, this should be equivalent to

> the current admission test. Moreover, this check should ensure that no

> root domain can be ever overloaded (even if tasks span multiple

> domains).

> But I do not know the locking implications for this idea... I suspect

> it will not scale :(

>

>

>

>                                 Luca

Peter Zijlstra Oct. 11, 2017, 4:02 p.m. UTC | #12

On Wed, Aug 16, 2017 at 03:20:36PM -0600, Mathieu Poirier wrote:

> In this set the problem is addressed by relying on existing list of tasks

> (sleeping or not) already maintained by CPUsets. 


Right, that's a much saner approach :-)

> OPEN ISSUE:

> 

> Regardless of how we proceed (using existing CPUset list or new ones) we

> need to deal with DL tasks that span more than one root domain,  something

> that will typically happen after a CPUset operation.  For example, if we

> split the number of available CPUs on a system in two CPUsets and then turn

> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

> parent CPUset will end up spanning two root domains.

> 

> One way to deal with this is to prevent CPUset operations from happening

> when such condition is detected, as enacted in this set.  Although simple

> this approach feels brittle and akin to a "whack-a-mole" game.  A better

> and more reliable approach would be to teach the DL scheduler to deal with

> tasks that span multiple root domains, a serious and substantial

> undertaking.

> 

> I am sending this as a starting point for discussion.  I would be grateful

> if you could take the time to comment on the approach and most importantly

> provide input on how to deal with the open issue underlined above.


Right, so teaching DEADLINE about arbitrary affinities is 'interesting'.

Although the rules proposed by Tommaso, if found sufficient, would
greatly simplify things. Also the online semi-partition approach to SMP
could help with that.

But yes, that's fairly massive surgery. For now I think we'll have to
live and accept the limitations. So failing the various cpuset
operations when they violate rules seems fine. Relaxing rules is always
easier than tightening them (later).

One 'series' you might be interested in when respinning these is:

  https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net

By doing synchronous domain rebuild we lose a bunch of funnies.

Mathieu Poirier Oct. 12, 2017, 4:57 p.m. UTC | #13

On 11 October 2017 at 10:02, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Aug 16, 2017 at 03:20:36PM -0600, Mathieu Poirier wrote:

>

>> In this set the problem is addressed by relying on existing list of tasks

>> (sleeping or not) already maintained by CPUsets.

>

> Right, that's a much saner approach :-)


Luca and Juri had the same opinion so let's continue with that solution.

>

>> OPEN ISSUE:

>>

>> Regardless of how we proceed (using existing CPUset list or new ones) we

>> need to deal with DL tasks that span more than one root domain,  something

>> that will typically happen after a CPUset operation.  For example, if we

>> split the number of available CPUs on a system in two CPUsets and then turn

>> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

>> parent CPUset will end up spanning two root domains.

>>

>> One way to deal with this is to prevent CPUset operations from happening

>> when such condition is detected, as enacted in this set.  Although simple

>> this approach feels brittle and akin to a "whack-a-mole" game.  A better

>> and more reliable approach would be to teach the DL scheduler to deal with

>> tasks that span multiple root domains, a serious and substantial

>> undertaking.

>>

>> I am sending this as a starting point for discussion.  I would be grateful

>> if you could take the time to comment on the approach and most importantly

>> provide input on how to deal with the open issue underlined above.

>

> Right, so teaching DEADLINE about arbitrary affinities is 'interesting'.

>

> Although the rules proposed by Tomasso; if found sufficient; would

> greatly simplify things. Also the online semi-partition approach to SMP

> could help with that.


The "rules" proposed by Tommaso: are you referring to patches or to the
deadline/cgroup extension work that he presented at OSPM?  I'd also be
interested to know more about this "online semi-partition approach to
SMP" you mentioned.  Maybe that's a conversation we could have at the
upcoming RT summit in Prague.

>

> But yes, that's fairly massive surgery. For now I think we'll have to

> live and accept the limitations. So failing the various cpuset

> operations when they violate rules seems fine. Relaxing rules is always

> easier than tightening them (later).


Agreed.

>

> One 'series' you might be interested in when respinning these is:

>

>   https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net

>

> By doing synchronous domain rebuild we loose a bunch of funnies.


Getting rid of the asynchronous nature of the hotplug path would be a
delight - I'll start keeping track of that effort as well.

Thanks for the review,
Mathieu

luca abeni Oct. 13, 2017, 8:04 a.m. UTC | #14

Hi Mathieu,

On Thu, 12 Oct 2017 10:57:09 -0600
Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
[...]
> >> Regardless of how we proceed (using existing CPUset list or new ones) we

> >> need to deal with DL tasks that span more than one root domain,  something

> >> that will typically happen after a CPUset operation.  For example, if we

> >> split the number of available CPUs on a system in two CPUsets and then turn

> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the

> >> parent CPUset will end up spanning two root domains.

> >>

> >> One way to deal with this is to prevent CPUset operations from happening

> >> when such condition is detected, as enacted in this set.  Although simple

> >> this approach feels brittle and akin to a "whack-a-mole" game.  A better

> >> and more reliable approach would be to teach the DL scheduler to deal with

> >> tasks that span multiple root domains, a serious and substantial

> >> undertaking.

> >>

> >> I am sending this as a starting point for discussion.  I would be grateful

> >> if you could take the time to comment on the approach and most importantly

> >> provide input on how to deal with the open issue underlined above.  

> >

> > Right, so teaching DEADLINE about arbitrary affinities is 'interesting'.

> >

> > Although the rules proposed by Tomasso; if found sufficient; would

> > greatly simplify things. Also the online semi-partition approach to SMP

> > could help with that.  

> 

> The "rules" proposed by Tomasso, are you referring to patches or the

> deadline/cgroup extension work that he presented at OSPM?


No, that is an unrelated thing... Tommaso previously proposed some
improvements to the admission control mechanism to take arbitrary
affinities into account.


I think Tommaso's proposal is similar to what I previously proposed in
this thread (to admit a SCHED_DEADLINE task with utilization
u = runtime / period and affinity to N runqueues, we can account u / N
to each one of the runqueues, and check if the sum of the utilizations
on each runqueue is < 1).

As previously noticed by Peter, this might have some scalability issues
(a naive implementation would lock the root domain while iterating over
all the runqueues). A few days ago, I was discussing with Tommaso a
possible solution based on not locking the root domain structure, and
possibly using a roll-back strategy if the status of the root domain
changes while we are updating it. I think in a previous email you
mentioned RCU, which might result in a similar solution.
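
One way to picture a lock-free update with roll-back is the following user-space sketch based on C11 atomics. It is an assumption about what such a scheme could look like, not the design discussed with Tommaso; rq_util[] and reserve_with_rollback() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RQ_MODEL 4 /* size of the modelled system */

/* Per-runqueue utilisation, updated without taking any lock. */
static _Atomic uint64_t rq_util[NR_RQ_MODEL];

/*
 * Optimistically add share to every runqueue in mask.  If one of them
 * ends up above bound, undo the additions made so far and fail.
 */
static bool reserve_with_rollback(uint32_t mask, uint64_t share, uint64_t bound)
{
	assert(mask != 0);

	for (int cpu = 0; cpu < NR_RQ_MODEL; cpu++) {
		uint64_t old;

		if (!(mask & (1u << cpu)))
			continue;

		old = atomic_fetch_add(&rq_util[cpu], share);
		if (old + share > bound) {
			/* Roll back this CPU and every earlier one in the mask. */
			for (int undo = cpu; undo >= 0; undo--)
				if (mask & (1u << undo))
					atomic_fetch_sub(&rq_util[undo], share);
			return false;
		}
	}
	return true;
}
```

A known weakness of this kind of scheme is that a reservation that is about to be rolled back is briefly visible to other CPUs, so a concurrent request that would otherwise fit may be rejected.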

Anyway, I am adding Tommaso in cc so that he can comment more.


> I'd also be

> interested to know more about this "online semi-partition approach to

> SMP" you mentioned.


It is basically an implementation (and extension to arbitrary
affinities) of this work:
http://drops.dagstuhl.de/opus/volltexte/2017/7165/


				Luca

> Maybe that's a conversation we could have at the

> upcoming RT summit in Prague.

> 

> >

> > But yes, that's fairly massive surgery. For now I think we'll have to

> > live and accept the limitations. So failing the various cpuset

> > operations when they violate rules seems fine. Relaxing rules is always

> > easier than tightening them (later).  

> 

> Agreed.

> 

> >

> > One 'series' you might be interested in when respinning these is:

> >

> >   https://lkml.kernel.org/r/20171011094833.pdp4torvotvjdmkt@hirez.programming.kicks-ass.net

> >

> > By doing synchronous domain rebuild we loose a bunch of funnies.  

> 

> Getting rid of the asynchronous nature of the hotplug path would be a

> delight - I'll start keeping track of that effort as well.

> 

> Thanks for the review,

> Mathieu
