
sched/deadline: Add sched_dl documentation

Message ID 1390214440-2711-1-git-send-email-juri.lelli@gmail.com
State New

Commit Message

Juri Lelli Jan. 20, 2014, 10:40 a.m. UTC
From: Dario Faggioli <raistlin@linux.it>

Add to Documentation/scheduler/ some hints about the design
choices, the usage and the possible future developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.

Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: mingo@redhat.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rob@landley.net
Cc: rostedt@goodmis.org
Cc: tglx@linutronix.de
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/scheduler/00-INDEX           |    2 +
 Documentation/scheduler/sched-deadline.txt |  188 ++++++++++++++++++++++++++++
 kernel/sched/deadline.c                    |    3 +-
 3 files changed, 192 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/scheduler/sched-deadline.txt

Comments

Juri Lelli Jan. 20, 2014, 12:15 p.m. UTC | #1
On 01/20/2014 12:24 PM, Henrik Austad wrote:
> On Mon, Jan 20, 2014 at 11:40:40AM +0100, Juri Lelli wrote:
>> From: Dario Faggioli <raistlin@linux.it>
>>
>> Add to Documentation/scheduler/ some hints about the design
>> choices, the usage and the possible future developments of the
>> sched_dl scheduling class and of the SCHED_DEADLINE policy.
>>
>> Cc: bruce.ashfield@windriver.com
>> Cc: claudio@evidence.eu.com
>> Cc: darren@dvhart.com
>> Cc: dhaval.giani@gmail.com
>> Cc: fchecconi@gmail.com
>> Cc: fweisbec@gmail.com
>> Cc: harald.gustafsson@ericsson.com
>> Cc: hgu1972@gmail.com
>> Cc: insop.song@gmail.com
>> Cc: jkacur@redhat.com
>> Cc: johan.eker@ericsson.com
>> Cc: liming.wang@windriver.com
>> Cc: luca.abeni@unitn.it
>> Cc: michael@amarulasolutions.com
>> Cc: mingo@redhat.com
>> Cc: nicola.manica@disi.unitn.it
>> Cc: oleg@redhat.com
>> Cc: paulmck@linux.vnet.ibm.com
>> Cc: p.faure@akatech.ch
>> Cc: rob@landley.net
>> Cc: rostedt@goodmis.org
>> Cc: tglx@linutronix.de
>> Cc: tommaso.cucinotta@sssup.it
>> Cc: vincent.guittot@linaro.org
>> Signed-off-by: Dario Faggioli <raistlin@linux.it>
>> Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
>> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>> ---
>>  Documentation/scheduler/00-INDEX           |    2 +
>>  Documentation/scheduler/sched-deadline.txt |  188 ++++++++++++++++++++++++++++
>>  kernel/sched/deadline.c                    |    3 +-
>>  3 files changed, 192 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/scheduler/sched-deadline.txt
>>
>> diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
>> index d2651c4..46702e4 100644
>> --- a/Documentation/scheduler/00-INDEX
>> +++ b/Documentation/scheduler/00-INDEX
>> @@ -10,5 +10,7 @@ sched-nice-design.txt
>>  	- How and why the scheduler's nice levels are implemented.
>>  sched-rt-group.txt
>>  	- real-time group scheduling.
>> +sched-deadline.txt
>> +	- deadline scheduling.
>>  sched-stats.txt
>>  	- information on schedstats (Linux Scheduler Statistics).
>> diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
>> new file mode 100644
>> index 0000000..8980de1
>> --- /dev/null
>> +++ b/Documentation/scheduler/sched-deadline.txt
>> @@ -0,0 +1,188 @@
>> +			  Deadline Task Scheduling
>> +			  ------------------------
>> +
>> +CONTENTS
>> +========
>> +
>> + 0. WARNING
>> + 1. Overview
>> + 2. Task scheduling
>> + 3. Bandwidth management
>> +   3.1 System-wide settings
>> +   3.2 Task interface
>> +   3.3 Default behavior
>> + 4. Tasks CPU affinity
>> +   4.1 SCHED_DEADLINE and cpusets HOWTO
>> + 5. Future plans
>> +
>> +
>> +0. WARNING
>> +==========
>> +
>> + Fiddling with these settings can result in unpredictable or even unstable
>> + system behavior. As for -rt (group) scheduling, it is assumed that root users
>> + know what they're doing.
>> +
>> +
>> +1. Overview
>> +===========
>> +
>> + The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
>> + basically an implementation of the Earliest Deadline First (EDF) scheduling
>> + algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
>> + that makes it possible to isolate the behavior of tasks from one another.
> 
> 
> Why not something along the lines of giving a task a guaranteed slice of 
> the CPU as well as making sure that a task takes no more than a given 
> slice? I.e., making the point that there is a lower as well as an upper
> limit on CPU usage.
> 

I'd keep the term "isolate" in, as it is one of the strong points of having
all this merged in. But we could add something along the lines you suggested:

"that makes it possible to isolate the behavior of tasks between each other.
IOW, isolation means that we can reserve a task a guaranteed percentage of the
CPU and, at the same time, we ensure that the task takes no more than the
percentage reserved."

>> +2. Task scheduling
>> +==================
>> +
>> + The typical -deadline task is composed of a computation phase (instance)
>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>> + duration of such computation is called the task's runtime; the time interval
>> + by which each instance needs to be completed is called the task's relative
>> + deadline. The task's absolute deadline is dynamically calculated as the
>> + time instant a task (or, more properly) activates plus the relative
>> + deadline.
> 
> activates - released?
> 

I'd keep (modifying a bit):

"time instant a task activates plus the relative deadline."

This is probably the nearest thing to what is implemented that we can say
(without entering into the theory too much): a task that "activates" can mean
that it is first released, enqueued, woken up, etc.
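
For example: a task with a relative deadline of 30ms that activates at time
t = 100ms gets the absolute deadline t = 130ms, and that absolute value is
what EDF then compares when picking the next task to run.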

> Since real-time papers from different rt-campus around the academia insist 
> on using *slightly* different terminology, perhaps add a short dictionary 
> for some of the more common terms?
> 
> D: relative deadline, typically N ms after release
> d: absolute deadline, the physical time when a given instance of a job 
>    needs to be completed
> R: relative release time, for periodic tasks, this is typically 'every N 
>    ms'
> r: absolute release time
> C: Worst-case execution time
> 
>    ...you get the idea.
> 
> Perhaps too academic?
> 

Mmm... we don't go too deep into theory (we just refer to the papers below);
wouldn't adding a dictionary only complicate things? I mean, if you add a term
you have to explain its meaning in relation to the task model you are using,
and this means you also have to define the task model and so on. Whoever wants
more details already finds them in the papers below.

>> + The EDF[1] algorithm selects the task with the smallest absolute deadline as
>> + the one to be executed first, while the CBS[2,3] ensures that each task runs
>> + for at most its runtime every period, avoiding any interference between
>> + different tasks (bandwidth isolation).
>> + Thanks to this feature, tasks that do not strictly comply with the
>> + computational model described above can also effectively use the new policy.
>> + IOW, there are no limitations on what kind of task can exploit this new
>> + scheduling discipline, even if it must be said that it is particularly
>> + suited for periodic or sporadic tasks that need guarantees on their
>> + timing behavior, e.g., multimedia, streaming, control applications, etc.
> 
> I assume that ties are broken arbitrarily and that a running task is not 
> preempted for another task with equal deadline. Correct?
> 

Yes: ties are broken arbitrarily, and a running task is not preempted by a
newly activated task with an equal absolute deadline.

> This would be a nice point to include in this doc methinks.
> 
>> + References:
>> +  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
>> +      ming in a hard-real-time environment. Journal of the Association for
>> +      Computing Machinery, 20(1), 1973.
>> +  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
>> +      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
>> +      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
>> +  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
>> +      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
>> +
>> +3. Bandwidth management
>> +=======================
>> +
>> + In order for the -deadline scheduling to be effective and useful, it is
>> + important to have some method to keep the allocation of the available CPU
>> + bandwidth to the tasks under control.
>> + This is usually called "admission control" and if it is not performed at all,
>> + no guarantee can be given on the actual scheduling of the -deadline tasks.
>> +
>> + Since RT-throttling was introduced, each task group has had an associated
>> + bandwidth, calculated as a certain amount of runtime over a period.
>> + Moreover, to make it possible to manipulate such bandwidth, readable/writable
>> + controls have been added to both procfs (for system wide settings) and cgroupfs
>> + (for per-group settings).
>> + Therefore, the same interface is being used for controlling the bandwidth
>> + distribution to -deadline tasks.
>> +
>> + However, more discussion is needed in order to figure out how we want to manage
>> + SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
>> + uses (for now) a less sophisticated, but actually very sensible, mechanism to
>> + ensure that a certain utilization cap is not exceeded in each root_domain.
>> +
>> + Another main difference between deadline bandwidth management and RT-throttling
>> + is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
>> + and thus we don't need a higher-level throttling mechanism to enforce the
>> + desired bandwidth.
>> +
>> +3.1 System-wide settings
>> +------------------------
>> +
>> + The system wide settings are configured under the /proc virtual file system.
>> +
>> + For now the -rt knobs are used for dl admission control and the -deadline
>> + runtime is accounted against the -rt runtime. We realise that this isn't
>> + entirely desirable; however, it is better to have a small interface for now,
>> + and be able to change it easily later. The ideal situation (see 5.) is to run
>> + -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
>> + subset of dl_bw.
>> +
>> + This means that, for a root_domain comprising M CPUs, -deadline tasks
>> + can be created while the sum of their bandwidths stays below:
>> +
>> +   M * (sched_rt_runtime_us / sched_rt_period_us)
>> +
>> + It is also possible to disable this bandwidth management logic, and
>> + thus be free to oversubscribe the system up to any arbitrary level.
>> + This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
>> +
>> +
>> +3.2 Task interface
>> +------------------
>> +
>> + Specifying a periodic/sporadic task that executes for a given amount of
>> + runtime at each instance, and that is scheduled according to the urgency of
>> + its own timing constraints needs, in general, a way of declaring:
>> +  - a (maximum/typical) instance execution time,
>> +  - a minimum interval between consecutive instances,
>> +  - a time constraint by which each instance must be completed.
>> +
>> + Therefore:
>> +  * a new struct sched_attr, containing all the necessary fields, is
>> +    provided;
>> +  * the new scheduling related syscalls that manipulate it, i.e.,
>> +    sched_setattr() and sched_getattr() are implemented.
>> +
>> +
>> +3.3 Default behavior
>> +---------------------
>> +
>> + The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
>> + 95000. With rt_period equal to 1000000, by default, it means that -deadline
>     ^^^^
>     This seems to be 9.5% to me ;)
> 

Argh! s/95000/950000/
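
(With the corrected values, the admission test of section 3.1 gives, e.g. for
a root_domain of 4 CPUs: 4 * (950000 / 1000000) = 3.8, i.e. -deadline tasks
can be created until their bandwidths sum up to 95% of each of the 4 CPUs.)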

>> + tasks can use at most 95%, multiplied by the number of CPUs that compose the
>> + root_domain, for each root_domain.
>> +
>> + A -deadline task cannot fork.
>> +
>> +4. Tasks CPU affinity
>> +=====================
>> +
>> + -deadline tasks cannot have an affinity mask smaller than the entire
>> + root_domain they are created on. However, affinities can be specified
>> + through the cpuset facility (Documentation/cgroups/cpusets.txt).
> 
> Does this mean that sched_deadline is a somewhat global implementation? Or 
> rather, at what point in time will sched_deadline take all cpus in a set 
> into consideration and when will it only look at the current CPU? Where is 
> the line drawn between global and fully partitioned?
> 
> Also, how do you account the budget when a resource holder is boosted in 
> order to release a resource? (IIRC, you use BWI, right?)
> 

Peter already replied about this.

Thanks,

- Juri

>> +4.1 SCHED_DEADLINE and cpusets HOWTO
>> +------------------------------------
>> +
>> + An example of a simple configuration (pin a -deadline task to CPU0)
>> + follows (rt-app is used to create a -deadline task).
>> +
>> + mkdir /dev/cpuset
>> + mount -t cgroup -o cpuset cpuset /dev/cpuset
>> + cd /dev/cpuset
>> + mkdir cpu0
>> + echo 0 > cpu0/cpuset.cpus
>> + echo 0 > cpu0/cpuset.mems
>> + echo 1 > cpuset.cpu_exclusive
>> + echo 0 > cpuset.sched_load_balance
>> + echo 1 > cpu0/cpuset.cpu_exclusive
>> + echo 1 > cpu0/cpuset.mem_exclusive
>> + echo $$ > cpu0/tasks
>> + rt-app -t 100000:10000:d:0 -D5
>> + (it is now actually superfluous to specify the task's affinity)
>> +
>> +5. Future plans
>> +===============
>> +
>> + Still missing:
>> +
>> +  - refinements to deadline inheritance, especially regarding the possibility
>> +    of retaining bandwidth isolation among non-interacting tasks. This is
>> +    being studied from both theoretical and practical points of view, and
>> +    hopefully we should be able to produce some demonstrative code soon;
>> +  - (c)group based bandwidth management, and maybe scheduling;
>> +  - access control for non-root users (and related security concerns to
>> +    address): what is the best way to allow unprivileged use of the mechanisms,
>> +    and how do we prevent non-root users from "cheating" the system?
>> +
>> + As already discussed, we are also planning to merge this work with the EDF
>> + throttling patches [https://lkml.org/lkml/2010/2/23/239], but we are still in
>> + the preliminary phases of the merge and we really seek feedback that would
>> + help us decide on the direction it should take.
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 0de2482..0dd5e09 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
>>   * disrupting the schedulability of the system. Otherwise, we should
>>   * refill the runtime and set the deadline a period in the future,
>>   * because keeping the current (absolute) deadline of the task would
>> - * result in breaking guarantees promised to other tasks.
>> + * result in breaking guarantees promised to other tasks (refer to
>> + * Documentation/scheduler/sched-deadline.txt for more information).
>>   *
>>   * This function returns true if:
>>   *
>> -- 
>> 1.7.9.5
>>
Juri Lelli Jan. 21, 2014, 12:11 p.m. UTC | #2
Ok,

I'd say that we need to re-write parts of the document: we need more consensus
about the terms, and we need to better explain to users what SCHED_DEADLINE
can be used for and the type of guarantees it can provide. I'll try to do this
with Luca, taking into account Henrik's points.

Thanks,

- Juri
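
The discussion below keeps coming back to jobs, activations and blocking
inside a job, so a minimal, untested sketch of the kind of periodic task
being discussed may help in following it (decode_one_frame() is purely
hypothetical):

 #include <time.h>

 extern void decode_one_frame(void);    /* hypothetical job body */

 void periodic_task(long period_ns)
 {
         struct timespec next;

         clock_gettime(CLOCK_MONOTONIC, &next);
         for (;;) {
                 /*
                  * One job (instance) per loop iteration. The job body may
                  * itself block (e.g. on I/O): wake-ups from such blocking
                  * happen *within* the same job, they do not start a new one.
                  */
                 decode_one_frame();

                 /* Job done: sleep until the next periodic activation. */
                 next.tv_nsec += period_ns;
                 while (next.tv_nsec >= 1000000000L) {
                         next.tv_nsec -= 1000000000L;
                         next.tv_sec++;
                 }
                 clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
         }
 }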

On 01/21/2014 12:35 PM, Luca Abeni wrote:
> On 01/21/2014 11:20 AM, Henrik Austad wrote:
>> On Mon, Jan 20, 2014 at 02:39:29PM +0100, Luca Abeni wrote:
>>> Hi all,
>>>
>>> On 01/20/2014 02:16 PM, Henrik Austad wrote:
>>> [...]
>>>>>>> + The typical -deadline task is composed of a computation phase (instance)
>>>>>>> + which is activated on a periodic or sporadic fashion. The expected (maximum)
>>>>>>> + duration of such computation is called the task's runtime; the time interval
>>>>>>> + by which each instance needs to be completed is called the task's relative
>>>>>>> + deadline. The task's absolute deadline is dynamically calculated as the
>>>>>>> + time instant a task (or, more properly) activates plus the relative
>>>>>>> + deadline.
>>>>>>
>>>>>> activates - released?
>>>>>>
>>>>>
>>>>> I'd keep (modifying a bit):
>>>>>
>>>>> "time instant a task activates plus the relative deadline."
>>>>>
>>>>> This is probably the nearest thing to what is implemented that we can say
>>>>> (without entering into the theory too much): a task that "activates" can mean
>>>>> that it is first released, enqueued, woken up, etc.
>>>>
>>>> So, if we look at release (yes, I'm avoiding activates for a little while)
>>>> as the time at the *beginning* of a new period, then, and only then should
>>>> the *absolute* deadline be computed.
>>>>
>>>> If you then move on to use 'activate' as a term for when a task becomes
>>>> eligible to run, then 'release' becomes a subset of 'activate', and you
>>>> should only compute the absolute deadline at that time. Did that make
>>>> sense?
>>
>>> I think things are a little bit complex here, because there are 2 different
>>> "deadlines" we can think about:
>>> - the "jobs deadlines" (the absolute job deadline can be computed at job
>>>    arrival, as the arrival time + the relative deadline). These are generally
>>>    used for performance metrics, to see if a job is respecting its timing
>>>    constraints or not
>>>
>>> - the "scheduling deadlines", which are the ones used by the scheduler to
>>>    schedule the tasks. These are computed at tasks' wake-up time - notice that
>>>    for self-suspending jobs there can be wake-up times that are not job arrival
>>>    times - and are assigned according to the rules described in the CBS paper.
>>>
>>> I think this can easily become very confusing, so I currently have no concrete
>>> proposals for improving the documentation... But I wanted to point out that
>>> things here are more complex than in the "traditional" real-time task model.
>>
>> Traditional real-time as in the current real-time model used in Linux, or
>> traditional as in EDF terminology used by Liu & Layland?
> Sorry, by traditional real-time task model I meant the Liu & Layland one
> (which is similar to what is describe above "... a task is composed of
> a computation phase (instance) which is activated on a periodic or sporadic
> fashion"...). I just wanted to say that things in reality are more complex
> than this.
> 
> 
>>> Maybe a solution could be to simply describe scheduling deadlines (which are
>>> what sched_deadline uses) without going into the details of jobs' deadlines.
>>
>> Huh?
>>
>> We definitely need a short dictionary. In fact, I'd like to have a
>> paragraph describing what deadline driven scheduling is.
>>
>> For instance, I'm getting *Really* confused wrt arrival time - you seem
>> to wrap several types of arrival into the same name, yet treat it
>> differently.
>>
>> - arrival: when a job gets ready to run for the first time
>> - arrival: when a job unblocks on some resource
>>
>> Or did I misunderstand?
> Well, my point was (but I might be wrong) that describing all of these details
> in this document might be confusing, so it might make sense to describe only
> the relevant things (leaving all of the theoretical details to other documents,
> like the papers and technical reports cited in this document).
> You are right: there are different concepts of "arrival" and "arrival time": the
> "job arrival" (when some activity - for example decoding a video frame - starts)
> and the "task wake up". But I think in this document it is not important to mention
> jobs (or task instances) and job arrivals: the only relevant things are task
> wake-ups.
> 
>> So, here is the terminology I'm used to - an attempt to write up something
>> to clear up the terminology and establish common ground. Please
>> edit/elaborate or shoot down as appropriate:
>>
>> """
>> N. Crashcourse in deadline-terminology:
>>
>> In a system, we typically look at a set of tasks. In Linux-kernel
>> terminology, a particular task is normally a thread. When a thread is
>> ready to run, we say that a *job* of that task is running.
> This would be true in the original Liu&Layland model (where a task blocks
> only when a job finishes), but I do not think it is correct in a real system...
> For example: (notice: this discussion might be slightly off-topic, and I do not
> think this should go in the document... I am writing just to clarify my point
> of view)
> - Let's consider a (over simplified) video decoder as an example of task
> - The task periodically read a video frame (from disk or network), decodes it,
>    and displays it
> - So, each job starts when the frame is read, and finishes when the frame is
>    displayed. And jobs are (in this case) activated periodically
> - During the execution of a job, the task might invoke a blocking system call,
>    and block... When it wakes up, it is still in the same job (decoding the same
>    video frame), and not in a different one.
> This is (IMHO) where all the confusion comes from.
> 
>> It is
>> perhaps easiest to grasp this if one thinks only of periodic tasks, i.e. a
>> thread that needs to run for 2ms every 10ms. Normally, we want one job to
>> finish before a new one (of the same task) starts, which implies that the
>> deadline for this task is also 10ms. Once this is clear, expanding one's
>> mind to aperiodic and/or sporadic tasks is easier.
>>
>> * Periodic task: a task that needs to run for a while every N us.
>> * Sporadic task: a task that needs to run for a while at most every N us
>>    (jobs start no closer than N us apart)
>> * Aperiodic task: a task that has no particular period, but once
>>    released, needs to complete before a given deadline.
> This is all correct, but I think it is not needed to understand how sched_deadline
> works.
> 
> 
>> * Set of all deadline-tasks in the system: \tau
>> * One particular task: \tau_i
>> * The j'th job of task i: \tau_{i,j}
>> * The (relative) deadline of task i: D_i
>> * The (periodic, relative) release time of task i: R_i
>> * Required execution time a task's job needs to complete: C_i
>> * Absolute release-time, the time when a new job is ready (when a thread is
>>    woken up for a new period).
>> * The absolute deadline of a job, the actual point in time where a job
>>    needs to be finished. This is what the scheduler looks at when it picks
>>    the next thread to run.
> This is not completely correct... The scheduler does not use this deadline, but
> uses a "scheduling deadline". The difference is, for example, that the scheduling
> deadline is postponed when the task tries to execute for more than the runtime...
> And this is why I think (but I might be wrong) that the concepts above do not need
> to be introduced in this document.
> Notice that if runtime >= C_i and if the sched_deadline period is >= "R_i", then
> the scheduling deadlines and the absolute deadlines you mention above coincide, so
> the task is guaranteed to respect the jobs' absolute deadlines (this is what is
> called "hard schedulability property" in the CBS paper).
> 
>> We can now construct a 3-tuple describing periodic and sporadic tasks:
>>
>>            (C_i, R_i, D_i).
>>
>> These 3 items are what you can use to describe your task to the scheduler.
>> """
>>
>>> So, the document could avoid talking about instances (or jobs), and can say
>>> that a task is guaranteed to receive "runtime" time units every "period" time
>>> units (and these "runtime" time units are available within "deadline" time
>>> units from the beginning of the period). Every time the task wakes up, the
>>> scheduler computes a scheduling deadline consistent with this constraint,
>>> and tasks are scheduled using EDF on these scheduling deadlines.
>>> Every time "runtime" time units are consumed in a period, the scheduling
>>> deadline is postponed by a period.
>>
>> What is wrong with using the CORRECT TERMINOLOGY?
> Well, what I mentioned above could be "correct terminology" :).
> I just avoided talking about the two different kinds of deadlines
> (jobs' deadlines and scheduling deadlines), because this could be confusing.
> And about jobs, in order to avoid talking about self-suspending jobs, and
> having to make a distinction about job arrivals and other task wake ups...
> 
>> People looking at using
>> sched_deadline _need_ to understand what a _deadline_ scheduler is.
> I think the sentence I wrote above gives an idea about what sched_deadline does,
> and how to use the three parameters. It can be extended by clarifying that "using
> EDF on these scheduling deadlines" means "scheduling the task with the smallest
> scheduling deadline".
> 
> 
>> If the only goal of sched_deadline is to act as a bandwidth-gauge, then
>> fine, but *I* want to use sched_deadline for systems that HAVE DEADLINES.
> Ok, so this is a more advanced usage. A section about the "hard schedulability
> property" mentioned above can be added (later in the document, I think) to explain
> how sched_deadline can be used to provide guarantees to real-time tasks (so,
> the real-time task model you mentioned can be described later, and the condition
> I mentioned above can be explained).
> My point is: first explain what sched_deadline is and how it works (something like
> my paragraph above), then later explain what a real-time task is and how sched_deadline
> can be used to properly schedule it.
> 
>> I do NOT want to mess around with mapping deadlines to priorities in order to
>> meet my deadlines.
> I agree, this is not needed.
> 
>> I suspect others would like to use sched_deadline for the same.
> Again, I might be wrong... But once the "runtime time units every period, served
> within a specified deadline" idea is clear, it is easier to understand how to use
> sched_deadline.
> 
> 
>>> This is of course an approximate description, but I think it can give an
>>> intuitive idea of how the scheduler works, and about the meaning of the three
>>> scheduling parameters.
>>
>> Look, I'm not in favour of turning sched_deadline.txt into an academic
>> paper, but it is clear that we need to establish a baseline here, and then
>> we need to include tasks, instances/jobs, deadline, arrival/release and so
>> on.
> See above... IMHO, this is not needed to understand how sched_deadline works.
> But once the sched_deadline behaviour is defined, these concepts can be introduced
> and it can be explained how to use sched_deadline to respect the jobs' deadlines...
> 
> 
> 
> 
> 				Luca
> 
Juri Lelli Jan. 21, 2014, 2:38 p.m. UTC | #3
On 01/21/2014 02:55 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 01:50:41PM +0100, Luca Abeni wrote:
>> On 01/21/2014 01:33 PM, Peter Zijlstra wrote:
> 
>>>> - During the execution of a job, the task might invoke a blocking system call,
>>>>   and block... When it wakes up, it is still in the same job (decoding the same
>>>>   video frame), and not in a different one.
>>>> This is (IMHO) where all the confusion comes from.
>>>
>>> I would strongly urge you not to use that as an example, because it's
>>> dead wrong design. An RT thread (be it RR,FIFO or DL) should _NEVER_ do
>>> blocking IO.
>> Well, but it does happen in reality :)
> 
> Yeah, I know, my point was more about not encouraging people to do this
> by explicitly mentioning it.
> 
>> On the other hand, I agree with you that a hard real-time task should be designed
>> not to do things like this. But SCHED_DEADLINE is flexible enough to be used on
>> many different kinds of tasks (hard real-time, soft real-time, etc...).
> 
> At which point I feel obliged to mention the work Jim did on statistical
> bounded tardiness and a potential future option:
> SCHED_FLAG_DL_AVG_RUNTIME, where we would allow tasks to somewhat exceed
> their runtime budget provided that they meet their budget on average.
> 
> A possible implementation could be to track the unused budget of
> previous instances and keep a decaying sum (such that we're guaranteed
> this unused budget < 2*runtime). And then allow runtime overruns upto
> this limit.
> 
> Another possible extension, one proposed by Ingo, is to demote tasks to
> SCHED_OTHER once they exceed their budget instead of the full block they
> get now -- we could possibly call this SCHED_FLAG_DL_CBS_SOFT or such.
> 
> And of course SCHED_FLAG_DL_CBS_SIGNAL, where the task gets a signal
> delivered if it exceeded the runtime -- I think some of the earlier
> patches had things like this, no?
>

Yes, they both got removed along the way. But we can pick them back up if we
realize they are needed for some scenario.
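
Just to illustrate the decaying-sum idea above (nothing like this exists in
the current code, and every name below is made up), the bookkeeping could
look roughly like:

 #include <stdbool.h>

 typedef unsigned long long u64;        /* stand-in for the kernel type */

 struct dl_avg {
         u64 saved;      /* decayed unused budget, kept below 2 * runtime */
 };

 /* At the end of each instance, decay the sum and add the leftover. */
 static void dl_avg_end_instance(struct dl_avg *a, u64 unused, u64 runtime)
 {
         a->saved = a->saved / 2 + unused;
         if (a->saved > 2 * runtime)     /* enforce the < 2 * runtime bound */
                 a->saved = 2 * runtime;
 }

 /* Allow a runtime overrun only if past savings can pay for it. */
 static bool dl_avg_may_overrun(struct dl_avg *a, u64 overrun)
 {
         if (overrun > a->saved)
                 return false;
         a->saved -= overrun;
         return true;
 }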

>>> On the other subject; I wouldn't actually mind if it grew into a proper
>>> (academic or not) summary of deadline scheduling theory and how it
>>> applies.
>>>
>>> Sure, refer to actual papers for all the proofs and such, but it would
>>> be very good to go over all the bits and pieces that make up the system.
>>>
>>> So cover the periodic, sporadic and aperiodic model like henr_k
>>> suggested, please do cover the job/instance idiom as it is used all over
>>> the place.
>> Ok... My point was that it would be better (IMHO) to first explain how
>> sched_deadline works (and no notion of job/instance, etc is needed for this),
>> and then explain how this applies to the real-time task model (and here, of
>> course all the formal notation can be introduced).
>>
>> Do you think this can be reasonable?
> 
> Sure, I think that's reasonable.
> 
>>> Then also treat schedulability tests and their ramifications, explain
>>> what laxity is, what tardiness is, that GEDF doesn't have 0 tardiness
>>> but does have bounded tardiness.
>>>
>>> Maybe even mention the actual bounds -- but refer to papers for their
>>> proofs.
>>>
>>> Mention CBS and the ramifications, etc.
>> Ok.
>> I guess some of these details can be added incrementally, with additional
>> patches?
> 
> Oh sure, all of this will always be a work in progress anyway ;-)
> 

Ok, we are working on a first update.

Thanks,

- Juri

Patch

diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
index d2651c4..46702e4 100644
--- a/Documentation/scheduler/00-INDEX
+++ b/Documentation/scheduler/00-INDEX
@@ -10,5 +10,7 @@  sched-nice-design.txt
 	- How and why the scheduler's nice levels are implemented.
 sched-rt-group.txt
 	- real-time group scheduling.
+sched-deadline.txt
+	- deadline scheduling.
 sched-stats.txt
 	- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 0000000..8980de1
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,188 @@ 
+			  Deadline Task Scheduling
+			  ------------------------
+
+CONTENTS
+========
+
+ 0. WARNING
+ 1. Overview
+ 2. Task scheduling
+ 3. Bandwidth management
+   3.1 System-wide settings
+   3.2 Task interface
+   3.3 Default behavior
+ 4. Tasks CPU affinity
+   4.1 SCHED_DEADLINE and cpusets HOWTO
+ 5. Future plans
+
+
+0. WARNING
+==========
+
+ Fiddling with these settings can result in unpredictable or even unstable
+ system behavior. As for -rt (group) scheduling, it is assumed that root users
+ know what they're doing.
+
+
+1. Overview
+===========
+
+ The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
+ basically an implementation of the Earliest Deadline First (EDF) scheduling
+ algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
+ that makes it possible to isolate the behavior of tasks from one another.
+
+
+2. Task scheduling
+==================
+
+ The typical -deadline task is composed of a computation phase (instance)
+ which is activated in a periodic or sporadic fashion. The expected
+ (maximum) duration of such a computation is called the task's runtime;
+ the time interval by which each instance needs to be completed is called
+ the task's relative deadline. The task's absolute deadline is dynamically
+ calculated as the time instant a task activates plus the relative
+ deadline.
+
+ The EDF[1] algorithm selects the task with the smallest absolute deadline as
+ the one to be executed first, while the CBS[2,3] ensures that each task runs
+ for at most its runtime every period, avoiding any interference between
+ different tasks (bandwidth isolation).
+ Thanks to this feature, tasks that do not strictly comply with the
+ computational model described above can also effectively use the new policy.
+ IOW, there are no limitations on what kind of task can exploit this new
+ scheduling discipline, even if it must be said that it is particularly
+ suited for periodic or sporadic tasks that need guarantees on their
+ timing behavior, e.g., multimedia, streaming, control applications, etc.
+
+ References:
+  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
+      ming in a hard-real-time environment. Journal of the Association for
+      Computing Machinery, 20(1), 1973.
+  2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard
+      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
+      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
+  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
+      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps
+
+3. Bandwidth management
+=======================
+
+ In order for the -deadline scheduling to be effective and useful, it is
+ important to have some method to keep the allocation of the available CPU
+ bandwidth to the tasks under control.
+ This is usually called "admission control" and if it is not performed at all,
+ no guarantee can be given on the actual scheduling of the -deadline tasks.
+
+ Since RT-throttling was introduced, each task group has had an associated
+ bandwidth, calculated as a certain amount of runtime over a period.
+ Moreover, to make it possible to manipulate such bandwidth, readable/writable
+ controls have been added to both procfs (for system wide settings) and cgroupfs
+ (for per-group settings).
+ Therefore, the same interface is being used for controlling the bandwidth
+ distribution to -deadline tasks.
+
+ However, more discussion is needed in order to figure out how we want to manage
+ SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE
+ uses (for now) a less sophisticated, but actually very sensible, mechanism to
+ ensure that a certain utilization cap is not exceeded in each root_domain.
+
+ Another main difference between deadline bandwidth management and RT-throttling
+ is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
+ and thus we don't need a higher-level throttling mechanism to enforce the
+ desired bandwidth.
+
+3.1 System-wide settings
+------------------------
+
+ The system wide settings are configured under the /proc virtual file system.
+
+ For now the -rt knobs are used for dl admission control and the -deadline
+ runtime is accounted against the -rt runtime. We realise that this isn't
+ entirely desirable; however, it is better to have a small interface for now,
+ and be able to change it easily later. The ideal situation (see 5.) is to run
+ -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct
+ subset of dl_bw.
+
+ This means that, for a root_domain comprising M CPUs, -deadline tasks
+ can be created while the sum of their bandwidths stays below:
+
+   M * (sched_rt_runtime_us / sched_rt_period_us)
+
+ It is also possible to disable this bandwidth management logic, and
+ thus be free to oversubscribe the system up to any arbitrary level.
+ This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
+
+
+3.2 Task interface
+------------------
+
+ Specifying a periodic/sporadic task that executes for a given amount of
+ runtime at each instance, and that is scheduled according to the urgency of
+ its own timing constraints needs, in general, a way of declaring:
+  - a (maximum/typical) instance execution time,
+  - a minimum interval between consecutive instances,
+  - a time constraint by which each instance must be completed.
+
+ Therefore:
+  * a new struct sched_attr, containing all the necessary fields, is
+    provided;
+  * the new scheduling related syscalls that manipulate it, i.e.,
+    sched_setattr() and sched_getattr() are implemented.
+
+
+3.3 Default behavior
+---------------------
+
+ The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
+ 950000. With rt_period equal to 1000000, by default, it means that -deadline
+ tasks can use at most 95%, multiplied by the number of CPUs that compose the
+ root_domain, for each root_domain.
+
+ A -deadline task cannot fork.
+
+4. Tasks CPU affinity
+=====================
+
+ -deadline tasks cannot have an affinity mask smaller than the entire
+ root_domain they are created on. However, affinities can be specified
+ through the cpuset facility (Documentation/cgroups/cpusets.txt).
+
+4.1 SCHED_DEADLINE and cpusets HOWTO
+------------------------------------
+
+ An example of a simple configuration (pin a -deadline task to CPU0)
+ follows (rt-app is used to create a -deadline task).
+
+ mkdir /dev/cpuset
+ mount -t cgroup -o cpuset cpuset /dev/cpuset
+ cd /dev/cpuset
+ mkdir cpu0
+ echo 0 > cpu0/cpuset.cpus
+ echo 0 > cpu0/cpuset.mems
+ echo 1 > cpuset.cpu_exclusive
+ echo 0 > cpuset.sched_load_balance
+ echo 1 > cpu0/cpuset.cpu_exclusive
+ echo 1 > cpu0/cpuset.mem_exclusive
+ echo $$ > cpu0/tasks
+ rt-app -t 100000:10000:d:0 -D5
+ (it is now actually superfluous to specify the task's affinity)
+
+5. Future plans
+===============
+
+ Still missing:
+
+  - refinements to deadline inheritance, especially regarding the possibility
+    of retaining bandwidth isolation among non-interacting tasks. This is
+    being studied from both theoretical and practical points of view, and
+    hopefully we should be able to produce some demonstrative code soon;
+  - (c)group based bandwidth management, and maybe scheduling;
+  - access control for non-root users (and related security concerns to
+    address): what is the best way to allow unprivileged use of the mechanisms,
+    and how do we prevent non-root users from "cheating" the system?
+
+ As already discussed, we are also planning to merge this work with the EDF
+ throttling patches [https://lkml.org/lkml/2010/2/23/239], but we are still in
+ the preliminary phases of the merge and we really seek feedback that would
+ help us decide on the direction it should take.
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0de2482..0dd5e09 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -351,7 +351,8 @@  static void replenish_dl_entity(struct sched_dl_entity *dl_se,
  * disrupting the schedulability of the system. Otherwise, we should
  * refill the runtime and set the deadline a period in the future,
  * because keeping the current (absolute) deadline of the task would
- * result in breaking guarantees promised to other tasks.
+ * result in breaking guarantees promised to other tasks (refer to
+ * Documentation/scheduler/sched-deadline.txt for more information).
  *
  * This function returns true if:
  *
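
The rule that comment describes can be restated in simplified, untested
pseudo-C; this is a paraphrase of the CBS wake-up check from the papers cited
in the document, not the literal kernel code:

 typedef unsigned long long u64;

 struct dl_entity {
         u64 runtime;            /* remaining runtime */
         u64 deadline;           /* current absolute deadline */
         u64 dl_runtime;         /* reserved runtime per period */
         u64 dl_deadline;        /* relative deadline */
         u64 dl_period;          /* reservation period */
 };

 /*
  * On wake-up at time t, keep the current deadline and remaining runtime
  * only if reusing them cannot exceed the reserved bandwidth; otherwise
  * refill the runtime and set the deadline one relative deadline ahead.
  */
 static void cbs_wakeup(struct dl_entity *se, u64 t)
 {
         /*
          * Overflow test: runtime / (deadline - t) > dl_runtime / dl_period,
          * rewritten without divisions.
          */
         if (se->deadline <= t ||
             se->runtime * se->dl_period > (se->deadline - t) * se->dl_runtime) {
                 se->deadline = t + se->dl_deadline;
                 se->runtime  = se->dl_runtime;
         }
 }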