Message ID: 1488292722-19410-1-git-send-email-patrick.bellasi@arm.com
Series: Add capacity capping support to the CPU controller
On 13-Mar 03:08, Joel Fernandes (Google) wrote:
> Hi Patrick,
>
> On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > Currently schedutil enforces a maximum OPP when RT/DL tasks are RUNNABLE.
> > Such a mandatory policy can be made more tunable from userspace, thus
> > allowing for example to define a reasonable max capacity (i.e.
> > frequency) which is required for the execution of a specific RT/DL
> > workload. This will contribute to making the RT class more "friendly" for
> > power/energy sensitive applications.
> >
> > This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> > Whenever a task in these classes is RUNNABLE, the capacity required is
> > defined by the constraints of the control group that task belongs to.
>
> We briefly discussed at Linaro Connect that this works well for
> sporadic RT tasks that run briefly and then sleep for long periods of
> time - so this patch is certainly good, but it's only a partial
> solution to the problem of frequent short-sleepers, and something is
> required to keep the boost active across short non-RUNNABLE periods as
> well. The behavior with many periodic RT tasks is that they sleep for
> short intervals and run for short intervals, periodically. In this case
> removing the clamp (or the boost, as in schedtune v2) on dequeue will
> essentially mean that during a narrow window cpufreq can drop the
> frequency, only to make it go back up again.
>
> Currently for schedtune v2, I am working on prototyping something like
> the following for Android:
> - if an RT task is enqueued, introduce the boost.
> - when the task is dequeued, start a timer for a "minimum deboost delay
>   time" before taking out the boost.
> - if the task is enqueued again before the timer fires, cancel the timer.
>
> I don't think any "fix" to this particular issue should be in the
> schedutil governor; it should be sorted before going to cpufreq itself
> (that is, before making the request). What do you think about this?

My short observations are:

1) For certain RT tasks, which have a quite "predictable" activation
   pattern, we should definitely try to use DEADLINE... which will
   factor out all "boosting potential races", since the bandwidth
   requirements are well defined at task description time.

2) CPU boosting is, at least for the time being, a best-effort feature
   which is introduced mainly for FAIR tasks.

3) Tracking the boost at enqueue/dequeue time matches the design of
   tracking features/properties of the currently RUNNABLE tasks, while
   avoiding the addition of yet another signal to track CPU utilization.

4) The previous point is about "separation of concerns", thus IMHO any
   policy defining how to consume the CPU utilization signal
   (whether it is boosted or not) should be the responsibility of
   schedutil, which eventually does not exclude useful input from the
   scheduler.

5) I understand the usefulness of a scale-down threshold for schedutil
   to reduce the current OPP, while I don't get the point of a scale-up
   threshold. If the system is demanding more capacity and there are no
   HW constraints (e.g. pending changes), then we should go up as soon
   as possible.

Finally, I think we can improve the boosting issues you are having
with RT tasks quite a lot by better refining the schedutil thresholds
implementation.

We already have some patches pending for review:
   https://lkml.org/lkml/2017/3/2/385
which fix some schedutil issues, and we will follow up with others
trying to improve the rate-limiting so as to not compromise
responsiveness.
> Thanks,
> Joel

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi
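For concreteness, here is a minimal sketch of the deboost-timer idea Joel
describes above. It is not the schedtune v2 prototype: all identifiers
(st_boost_*, st_deboost_*) and the delay value are invented for
illustration, and the modern timer_setup() API is used for brevity.

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    #define DEBOOST_DELAY_MS 10 /* assumed tunable hold time */

    static struct timer_list st_deboost_timer;

    static void st_set_boost(void)   { /* apply the RT boost/clamp */ }
    static void st_clear_boost(void) { /* remove the RT boost/clamp */ }

    /* Timer fired: no RT task was re-enqueued within the window. */
    static void st_deboost_timeout(struct timer_list *t)
    {
            st_clear_boost();
    }

    /* RT task enqueued: boost, and cancel any pending deboost. */
    static void st_rt_enqueue(void)
    {
            del_timer(&st_deboost_timer);
            st_set_boost();
    }

    /* RT task dequeued: defer the deboost instead of dropping it now. */
    static void st_rt_dequeue(void)
    {
            mod_timer(&st_deboost_timer,
                      jiffies + msecs_to_jiffies(DEBOOST_DELAY_MS));
    }

    static void st_init(void)
    {
            timer_setup(&st_deboost_timer, st_deboost_timeout, 0);
    }

The point of the hold window is precisely the "narrow window" described
above: a periodic short-sleeper re-enqueues before the timer fires, so the
boost is never dropped between activations.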
On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
> Was: SchedTune: central, scheduler-driven, power-performance control
>
> This series presents a possible alternative design for what has been
> presented in the past as SchedTune. This redesign has been defined to
> address the main concerns and comments collected in the LKML discussion
> [1] as well as at the last LPC [2].
> The aim of this posting is to present a working prototype which implements
> what has been discussed [2] with people like PeterZ, PaulT and TejunH.
>
> The main differences with respect to the previous proposal [1] are:
> 1. Task boosting/capping is now implemented as an extension on top of
>    the existing CGroup CPU controller.
> 2. The previous boosting strategy, based on the inflation of the CPU's
>    utilization, has now been replaced by a simpler yet effective set
>    of capacity constraints.
>
> The proposed approach allows constraining the minimum and maximum capacity
> of a CPU depending on the set of tasks currently RUNNABLE on that CPU.
> The set of active constraints is tracked by the core scheduler, thus they
> apply across all the scheduling classes. The values of the constraints are
> used to clamp the CPU utilization when the schedutil CPUFreq governor
> selects a frequency for that CPU.
>
> This means that the newly proposed approach allows extending the concept
> of task classification to frequency selection, thus allowing informed
> run-times (e.g. Android, ChromeOS, etc.) to efficiently implement
> different optimization policies such as:
>  a) Boosting of important tasks, by enforcing a minimum capacity in the
>     CPUs where they are enqueued for execution.
>  b) Capping of background tasks, by enforcing a maximum capacity.
>  c) Containment of OPPs for RT tasks which cannot easily be switched to
>     the usage of the DL class, but still don't need to run at the maximum
>     frequency.

Do you have any practical examples of that, like for example what exactly
Android is going to use this for?

I gather that there is some experience with the current EAS implementation
there, so I wonder how this work is related to that.

Thanks,
Rafael
On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 13-Mar 03:08, Joel Fernandes (Google) wrote:
[...]
> My short observations are:
>
> 1) For certain RT tasks, which have a quite "predictable" activation
>    pattern, we should definitely try to use DEADLINE... which will
>    factor out all "boosting potential races", since the bandwidth
>    requirements are well defined at task description time.

I don't immediately see how DEADLINE can fix this: when a task is
dequeued after the end of its current runtime, its bandwidth will be
subtracted from the active running bandwidth. This is what drives the
DL part of the capacity request. In this case, we run into the same
issue as with the boost removal on dequeue. Isn't it?

> 4) The previous point is about "separation of concerns", thus IMHO any
>    policy defining how to consume the CPU utilization signal
>    (whether it is boosted or not) should be the responsibility of
>    schedutil, which eventually does not exclude useful input from the
>    scheduler.
>
> 5) I understand the usefulness of a scale-down threshold for schedutil
>    to reduce the current OPP, while I don't get the point of a scale-up
>    threshold. If the system is demanding more capacity and there are no
>    HW constraints (e.g. pending changes), then we should go up as soon
>    as possible.
>
> Finally, I think we can improve the boosting issues you are having
> with RT tasks quite a lot by better refining the schedutil thresholds
> implementation.
>
> We already have some patches pending for review:
>    https://lkml.org/lkml/2017/3/2/385
> which fix some schedutil issues, and we will follow up with others
> trying to improve the rate-limiting so as to not compromise
> responsiveness.

I agree we can try to explore fixing schedutil to do the right thing.

J.
On 15-Mar 12:41, Rafael J. Wysocki wrote:
> On Tuesday, February 28, 2017 02:38:37 PM Patrick Bellasi wrote:
[...]
> > This means that the newly proposed approach allows extending the concept
> > of task classification to frequency selection, thus allowing informed
> > run-times (e.g. Android, ChromeOS, etc.) to efficiently implement
> > different optimization policies such as:
> >  a) Boosting of important tasks, by enforcing a minimum capacity in the
> >     CPUs where they are enqueued for execution.
> >  b) Capping of background tasks, by enforcing a maximum capacity.
> >  c) Containment of OPPs for RT tasks which cannot easily be switched to
> >     the usage of the DL class, but still don't need to run at the maximum
> >     frequency.
>
> Do you have any practical examples of that, like for example what exactly
> Android is going to use this for?

In general, every "informed run-time" usually knows quite a lot about
task requirements and how they impact the user experience.

In Android, for example, tasks are classified depending on their
_current_ role. We can distinguish for example between:

- TOP_APP:    tasks currently affecting the UI, i.e. part of
              the app currently in the foreground
- BACKGROUND: tasks not directly impacting the user experience

Given this information, it can make sense to adopt a different
service/optimization policy for different tasks.
For example, we can be interested in giving maximum responsiveness to
TOP_APP tasks while we still want to be able to save as much energy as
possible for the BACKGROUND tasks.

That's where the proposal in this series (partially) comes in handy.

What we propose is a "standard" interface to collect sensible
information from "informed run-times" which can be used to:

a) classify tasks according to the main optimization goals:
   performance boosting vs energy saving

b) support a more dynamic tuning of kernel-side behaviors, mainly
   OPP selection and task placement

Regarding this last point, this series specifically represents a
proposal for the integration with schedutil. The main usages we are
looking for in Android are:

a) Boosting the OPP selected for certain critical tasks, with the goal
   of speeding up their completion regardless of (potential) energy
   impacts. A kind-of "race-to-idle" policy for certain tasks.

b) Capping the OPP selection for certain non-critical tasks, which is
   a major concern especially for RT tasks in a mobile context, but
   also applies to FAIR tasks representing background activities.

> I gather that there is some experience with the current EAS implementation
> there, so I wonder how this work is related to that.

You're right. We started developing a task boosting strategy a couple
of years ago. The first implementation we did is what is currently in
use by the EAS version used on Pixel smartphones.

Since the beginning, our attitude has always been "mainline first".
However, we found it extremely valuable to prove both the interface's
design and the feature's benefits on real devices. That's why we keep
backporting these bits to different Android kernels.

Google, whose primary representatives are in CC, is also quite focused
on using mainline solutions for their current and future products.
That's why, after the release of the Pixel devices at the end of last
year, we refreshed and posted the proposal on LKML [1] and collected a
first round of valuable feedback at LPC [2].

This posting is an expression of the feedback collected so far, and the
main goals for us are to:
1) validate once more the soundness of a scheduler-driven run-time
   power-performance control which is based on information collected
   from an informed run-time
2) get an agreement on whether the current interface can be considered
   sufficiently "mainline friendly" to have a chance of getting merged
3) rework/refactor what is required if point 2 is not (yet) satisfied

It's worth noticing that these bits are completely independent from
EAS. OPP biasing (i.e. capping/boosting) is a feature which stands by
itself, and it can be quite useful in many different scenarios where
EAS is not used at all. A simple example is making schedutil behave
concurrently like the powersave governor for certain tasks and the
performance governor for other tasks.

As a final remark, this series is going to be a discussion topic at the
upcoming OSPM summit [3]. It would be nice if we could get there with
sufficient knowledge of the main goals and the current status.
However, please let's keep discussing here all the possible concerns
which can be raised about this proposal.

> Thanks,
> Rafael

Cheers Patrick

[1] https://lkml.org/lkml/2016/10/27/503
[2] https://lkml.org/lkml/2016/11/25/342
[3] http://retis.sssup.it/ospm-summit/

--
#include <best/regards.h>

Patrick Bellasi
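To illustrate how an informed run-time might drive the proposed
interface, here is a rough userspace sketch. The attribute names follow
the capacity_{min,max} naming used by the series, but the exact file
names, the mount point and the sample values are assumptions of this
sketch; capacities are expressed on the kernel's [0..1024]
SCHED_CAPACITY_SCALE.

    #include <stdio.h>

    /* Write one clamp value into a (hypothetical) CPU controller file. */
    static int set_clamp(const char *cgroup, const char *attr, int value)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/fs/cgroup/cpu/%s/%s", cgroup, attr);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%d\n", value);
            return fclose(f);
    }

    int main(void)
    {
            /* TOP_APP: guarantee at least ~60% of CPU capacity. */
            set_clamp("top-app", "cpu.capacity_min", 614);
            /* BACKGROUND: never ask for more than ~30% of capacity. */
            set_clamp("background", "cpu.capacity_max", 307);
            return 0;
    }

The run-time would then move tasks between the "top-app" and
"background" groups as their role changes, with schedutil clamping its
frequency requests accordingly.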
Hi Joel,

On 15/03/17 05:59, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 4:40 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
[...]
> > 1) For certain RT tasks, which have a quite "predictable" activation
> >    pattern, we should definitely try to use DEADLINE... which will
> >    factor out all "boosting potential races", since the bandwidth
> >    requirements are well defined at task description time.
>
> I don't immediately see how DEADLINE can fix this: when a task is
> dequeued after the end of its current runtime, its bandwidth will be
> subtracted from the active running bandwidth. This is what drives the
> DL part of the capacity request. In this case, we run into the same
> issue as with the boost removal on dequeue. Isn't it?

Unfortunately, I still have to post the set of patches (based on Luca's
reclaiming set) that introduces driving of clock frequency from
DEADLINE, so I guess everything we discuss about how DEADLINE might
help here might be difficult to understand. :(

I should definitely fix that.

However, trying to quickly summarize how that would work (for those
already somewhat familiar with the reclaiming bits):

 - a task's utilization contribution is accounted for (at rq level) as
   soon as it wakes up for the first time in a new period
 - its contribution is then removed after the 0lag time (or when the
   task gets throttled)
 - frequency transitions are triggered accordingly

So, I don't see why triggering a go-down request after the 0lag time
has expired, and quickly reacting to tasks waking up, would create
problems in your case.

Thanks,

- Juri
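To make the 0lag accounting concrete, here is a simplified sketch of the
arithmetic in plain C. This is not the unposted patches and not the
kernel's types; it just mirrors the rule summarized above. For a
DEADLINE task with remaining runtime q, absolute deadline d and reserved
bandwidth U = dl_runtime/dl_period, the 0lag instant is the time at
which a task running exactly at its reserved bandwidth would have
consumed its remaining runtime, i.e. t_0lag = d - q/U.

    #include <stdint.h>

    struct dl_params {
            uint64_t dl_runtime;  /* reserved runtime per period (ns) */
            uint64_t dl_period;   /* reservation period (ns) */
    };

    /*
     * 0lag instant for a task that blocks with 'runtime_left' ns of
     * budget and absolute deadline 'deadline':
     *   t_0lag = deadline - runtime_left / (dl_runtime / dl_period)
     */
    static uint64_t zerolag_time(const struct dl_params *p,
                                 uint64_t runtime_left, uint64_t deadline)
    {
            return deadline - (runtime_left * p->dl_period) / p->dl_runtime;
    }

The rq-level "active" utilization then gains dl_runtime/dl_period at
wakeup and loses it at t_0lag (or at throttling), and the frequency
requests follow that sum.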
On Wed, Mar 15, 2017 at 7:44 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Joel,
>
> On 15/03/17 05:59, Joel Fernandes wrote:
[...]
> Unfortunately, I still have to post the set of patches (based on Luca's
> reclaiming set) that introduces driving of clock frequency from
> DEADLINE, so I guess everything we discuss about how DEADLINE might
> help here might be difficult to understand. :(
>
> I should definitely fix that.

I fully understand. Sorry to be discussing this too soon here...

> However, trying to quickly summarize how that would work (for those
> already somewhat familiar with the reclaiming bits):
>
>  - a task's utilization contribution is accounted for (at rq level) as
>    soon as it wakes up for the first time in a new period
>  - its contribution is then removed after the 0lag time (or when the
>    task gets throttled)
>  - frequency transitions are triggered accordingly
>
> So, I don't see why triggering a go-down request after the 0lag time
> has expired, and quickly reacting to tasks waking up, would create
> problems in your case.

In my experience, the 'reacting to tasks' bit doesn't work very well.
For short-running periodic tasks, we need to set the frequency to
something and not ramp it down too quickly (for example, runtime 1.5ms
and period 3ms). In this case the 0-lag time would be < 3ms. I guess if
we're going to use the 0-lag time, then we'd need to set the runtime and
period to be higher than those exactly matching the task's? So would we
be assigning the same bandwidth but with R/T instead of r/t (where r, R
are the runtimes and t, T are the periods, and R > r and T > t)?

Thanks,
Joel
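A worked instance of Joel's numbers under the 0lag rule above. This is
an idealized back-of-the-envelope calculation, not taken from the
unposted patches: it assumes the task wakes at t = 0 and consumes its
whole budget before sleeping. Writing q for the remaining runtime, d for
the absolute deadline and U for the reserved bandwidth:

    t_0lag = d - q/U

 - With (r, t) = (1.5ms, 3ms): U = 0.5, d = 3ms; at sleep time q = 0,
   so t_0lag = 3ms.
 - With (R, T) = (3ms, 6ms):   U = 0.5, d = 6ms; after running 1.5ms,
   q = 1.5ms, so t_0lag = 6 - 1.5/0.5 = 3ms.

Either way, the task's 0.5 contribution stays in the rq-level active
utilization for the whole sleep window [1.5ms, 3ms] and is re-added at
the next activation at 3ms, so the window in which cpufreq could drop
the frequency never opens. This suggests that inflating (r, t) to
(R, T) at constant bandwidth would not change the outcome.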
On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
[..]
> > > However, trying to quickly summarize how that would work (for those
> > > already somewhat familiar with the reclaiming bits):
> > >
> > >  - a task's utilization contribution is accounted for (at rq level) as
> > >    soon as it wakes up for the first time in a new period
> > >  - its contribution is then removed after the 0lag time (or when the
> > >    task gets throttled)
> > >  - frequency transitions are triggered accordingly
> > >
> > > So, I don't see why triggering a go-down request after the 0lag time
> > > has expired, and quickly reacting to tasks waking up, would create
> > > problems in your case.
> >
> > In my experience, the 'reacting to tasks' bit doesn't work very well.
>
> Humm.. but in this case we won't be 'reacting', we will be
> 'anticipating' tasks' needs, right?

Are you saying we will start ramping the frequency before the next
activation so that we're ready for it?

If not, it sounds like it will only make the frequency request on the
next activation, when the active bandwidth increases due to the task
waking up. By then the task has already started to run, right?

Thanks,
Joel
On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> On 15-Mar 12:41, Rafael J. Wysocki wrote:
[...]
>> Do you have any practical examples of that, like for example what exactly
>> Android is going to use this for?
>
> In general, every "informed run-time" usually knows quite a lot about
> task requirements and how they impact the user experience.
>
> In Android, for example, tasks are classified depending on their
> _current_ role. We can distinguish for example between:
>
> - TOP_APP:    tasks currently affecting the UI, i.e. part of
>               the app currently in the foreground
> - BACKGROUND: tasks not directly impacting the user experience
>
> Given this information, it can make sense to adopt a different
> service/optimization policy for different tasks.
> For example, we can be interested in giving maximum responsiveness to
> TOP_APP tasks while we still want to be able to save as much energy as
> possible for the BACKGROUND tasks.
>
> That's where the proposal in this series (partially) comes in handy.

A question: Does "responsiveness" translate directly to "capacity" somehow?

Moreover, how exactly is "responsiveness" defined?

> What we propose is a "standard" interface to collect sensible
> information from "informed run-times" which can be used to:
>
> a) classify tasks according to the main optimization goals:
>    performance boosting vs energy saving
>
> b) support a more dynamic tuning of kernel-side behaviors, mainly
>    OPP selection and task placement
>
> Regarding this last point, this series specifically represents a
> proposal for the integration with schedutil. The main usages we are
> looking for in Android are:
>
> a) Boosting the OPP selected for certain critical tasks, with the goal
>    of speeding up their completion regardless of (potential) energy
>    impacts. A kind-of "race-to-idle" policy for certain tasks.

It looks like this could be addressed by adding a "this task should
race to idle" flag too.

> b) Capping the OPP selection for certain non-critical tasks, which is
>    a major concern especially for RT tasks in a mobile context, but
>    also applies to FAIR tasks representing background activities.

Well, is the information on how much CPU capacity to assign to those
tasks really there in user space? What's the source of it if so?

>> I gather that there is some experience with the current EAS implementation
>> there, so I wonder how this work is related to that.
>
> You're right. We started developing a task boosting strategy a couple
> of years ago. The first implementation we did is what is currently in
> use by the EAS version used on Pixel smartphones.
>
> Since the beginning, our attitude has always been "mainline first".
> However, we found it extremely valuable to prove both the interface's
> design and the feature's benefits on real devices. That's why we keep
> backporting these bits to different Android kernels.
>
> Google, whose primary representatives are in CC, is also quite focused
> on using mainline solutions for their current and future products.
> That's why, after the release of the Pixel devices at the end of last
> year, we refreshed and posted the proposal on LKML [1] and collected a
> first round of valuable feedback at LPC [2].

Thanks for the info, but my question was more about how it was related
from the technical angle. IOW, there surely is some experience related
to how user space can deal with energy problems, and I would expect
that experience to be an important factor in designing a kernel
interface for that user space, so I wonder if any particular needs of
the Android user space are addressed here.

I'm not intimately familiar with Android, so I guess I would like to
be educated somewhat on that. :-)

> This posting is an expression of the feedback collected so far, and the
> main goals for us are to:
> 1) validate once more the soundness of a scheduler-driven run-time
>    power-performance control which is based on information collected
>    from an informed run-time
> 2) get an agreement on whether the current interface can be considered
>    sufficiently "mainline friendly" to have a chance of getting merged
> 3) rework/refactor what is required if point 2 is not (yet) satisfied

My definition of "mainline friendly" may be different from someone
else's, but I usually want to know two things:

 1. What problem exactly is at hand.
 2. What alternative ways of addressing it have been considered and
    why the particular one proposed has been chosen over the other ones.

At the moment I don't feel like I have enough information in either
aspect.

For example, if you said "Android wants to do XYZ because of ABC, and
that's how we want to make that possible, and it also could be done in
the other GHJ ways, but they are not attractive and here's why, etc.",
that would help quite a bit from my POV.

> It's worth noticing that these bits are completely independent from
> EAS. OPP biasing (i.e. capping/boosting) is a feature which stands by
> itself, and it can be quite useful in many different scenarios where
> EAS is not used at all. A simple example is making schedutil behave
> concurrently like the powersave governor for certain tasks and the
> performance governor for other tasks.

That's fine in theory, but honestly an interface like this will be a
maintenance burden, and adding it just because it may be useful to
somebody doesn't sound serious enough. IOW, I'd like to be able to say
"This is going to be used by user space X to do A, and that's how etc."
if somebody asks me about that, which honestly I can't at this point.

> As a final remark, this series is going to be a discussion topic at the
> upcoming OSPM summit [3]. It would be nice if we could get there with
> sufficient knowledge of the main goals and the current status.

I'm not sure what you mean here, sorry.

> However, please let's keep discussing here all the possible concerns
> which can be raised about this proposal.

OK

Thanks,
Rafael
Hi Rafael,

On Wed, Mar 15, 2017 at 6:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 15, 2017 at 1:59 PM, Patrick Bellasi
[...]
> A question: Does "responsiveness" translate directly to "capacity" somehow?
>
> Moreover, how exactly is "responsiveness" defined?

Responsiveness is basically how quickly the UI responds to user
interaction after doing its computation, application logic and
rendering. Android apps have two important threads: the main thread (or
UI thread), which does all the work and computation for the app, and a
render thread, which does the rendering and submission of frames to the
display pipeline for further composition and display.

We wish to bias towards performance rather than energy for this work,
since it is front-facing to the user; we don't care much about energy
for these tasks at this point. What's most critical is completion as
quickly as possible, so the user experience doesn't suffer from a
noticeable performance issue.

One metric to define this is "jank", where we drop frames and aren't
able to render on time. One reason this can happen is that the main
thread (UI thread) took longer than expected for some computation.
Whatever the interface, we'd just like to bias the scheduling and
frequency guidance to be more concerned with performance and less with
energy, and use this information for both frequency selection and task
placement.

'What we need' is also app-dependent, since every app has its own main
thread and is free to compute whatever it needs. So Android can't
estimate this - but we do know that this app is user-facing, so in
broad terms the interface is used to say "please don't sacrifice
performance for these top-apps", without accurately defining what those
performance needs really are, because we don't know them.

For the YouTube app, for example, the complexity of the video decoding
and the frame rate are very variable, depending on the encoding scheme
and the video being played. The flushing of frames through the display
pipeline is also variable (the frame rate depends on the video being
decoded), so this work is variable and we can't say for sure, in
definitive terms, how much capacity we need.

What we can do, with Patrick's work, is take the worst case based on
measurements and specify that we need at least this much capacity
regardless of what load tracking thinks we need, and then scale
frequency accordingly. This is the use case for the minimum capacity in
his clamping patch. This is still not perfect in terms of defining
something accurately - we don't even know how much we need - but at
least, in broad terms, we have some way of telling the governor to
maintain at least X capacity.

For the clamping of maximum capacity, there are use cases like the
background tasks Patrick mentioned, but also use cases where we don't
want to run at max frequency even though load tracking thinks we need
to. For example, for foreground camera tasks, we want to provide
sustainable performance without entering thermal throttling, so the
capping will help there.

>> What we propose is a "standard" interface to collect sensible
>> information from "informed run-times" which can be used to:
>>
>> a) classify tasks according to the main optimization goals:
>>    performance boosting vs energy saving
>>
>> b) support a more dynamic tuning of kernel-side behaviors, mainly
>>    OPP selection and task placement
>>
>> Regarding this last point, this series specifically represents a
>> proposal for the integration with schedutil. The main usages we are
>> looking for in Android are:
>>
>> a) Boosting the OPP selected for certain critical tasks, with the goal
>>    of speeding up their completion regardless of (potential) energy
>>    impacts. A kind-of "race-to-idle" policy for certain tasks.
>
> It looks like this could be addressed by adding a "this task should
> race to idle" flag too.

But he said 'kind-of' race-to-idle. Racing to idle all the time, for
example at max frequency, would be wasteful of energy, so although we
don't care about energy much for top-apps, we do care a bit.

>> b) Capping the OPP selection for certain non-critical tasks, which is
>>    a major concern especially for RT tasks in a mobile context, but
>>    also applies to FAIR tasks representing background activities.
>
> Well, is the information on how much CPU capacity to assign to those
> tasks really there in user space? What's the source of it if so?

I believe this is just a matter of tuning and modeling for what is
needed. For example, to prevent thermal throttling as I mentioned, and
also to ensure background activities aren't running at the highest
frequency and consuming excessive energy. Racing to idle at a higher
frequency costs more energy than running slower to idle, since we run
at higher voltages at higher frequencies and the slope of the perf/W
curve is steeper: P = C * V^2 * f, so the higher V component drains
more power quadratically, which is of no use to background tasks. In
fact, in some tests we're just as happy setting them at much lower
frequencies than what load tracking thinks is needed.

>>> I gather that there is some experience with the current EAS implementation
>>> there, so I wonder how this work is related to that.
[...]
> Thanks for the info, but my question was more about how it was related
> from the technical angle. IOW, there surely is some experience related
> to how user space can deal with energy problems, and I would expect
> that experience to be an important factor in designing a kernel
> interface for that user space, so I wonder if any particular needs of
> the Android user space are addressed here.
>
> I'm not intimately familiar with Android, so I guess I would like to
> be educated somewhat on that. :-)

Hope this sheds some light on the Android side of things a bit.

Regards,
Joel
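Joel's power argument can be made concrete by looking at energy per unit
of work rather than instantaneous power (a textbook calculation; the
voltage/frequency pairs below are illustrative, not from any real OPP
table). Retiring W cycles at frequency f takes W/f seconds, so the
dynamic energy is:

    E = P * (W/f) = (C * V^2 * f) * (W/f) = C * V^2 * W

i.e. the energy per cycle scales with V^2 and is independent of f.
Dropping from, say, (2 GHz, 1.1 V) to (1 GHz, 0.8 V) therefore cuts the
dynamic energy of the same work by a factor of (0.8/1.1)^2 ~= 0.53, at
the cost of taking twice as long - exactly the trade-off that is
acceptable for background tasks but not for top-apps.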
On 15/03/17 16:40, Joel Fernandes wrote:
> On Wed, Mar 15, 2017 at 9:24 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> [..]
> > Humm.. but in this case we won't be 'reacting', we will be
> > 'anticipating' tasks' needs, right?
>
> Are you saying we will start ramping the frequency before the next
> activation so that we're ready for it?

I'm saying that there is no need to ramp; simply select the frequency
that is needed for a task (or a set of them).

> If not, it sounds like it will only make the frequency request on the
> next activation, when the active bandwidth increases due to the task
> waking up. By then the task has already started to run, right?

When the task is enqueued back, we select the frequency considering its
bandwidth request (and the bandwidth/utilization of the others). So,
when it actually starts running, it will already have enough capacity
to finish in time.
On 16-Mar 11:16, Juri Lelli wrote:
> On 15/03/17 16:40, Joel Fernandes wrote:
[...]
> > If not, it sounds like it will only make the frequency request on the
> > next activation, when the active bandwidth increases due to the task
> > waking up. By then the task has already started to run, right?
>
> When the task is enqueued back, we select the frequency considering its
> bandwidth request (and the bandwidth/utilization of the others). So,
> when it actually starts running, it will already have enough capacity
> to finish in time.

Here we are factoring out the time required to actually switch to the
required OPP. I think Joel was referring to this time.

That time cannot really be eliminated other than by having faster
OPP-switching HW support. Still, jumping straight to the "optimal" OPP
instead of ramping up is a big improvement.

--
#include <best/regards.h>

Patrick Bellasi
On 16/03/17 12:27, Patrick Bellasi wrote:
> On 16-Mar 11:16, Juri Lelli wrote:
[...]
> > When the task is enqueued back, we select the frequency considering its
> > bandwidth request (and the bandwidth/utilization of the others). So,
> > when it actually starts running, it will already have enough capacity
> > to finish in time.
>
> Here we are factoring out the time required to actually switch to the
> required OPP. I think Joel was referring to this time.

Right. But this is a HW limitation. It seems to be a problem that every
scheduler-driven decision will have to take into account. So, doesn't
it make more sense to let the driver (or the governor shim layer)
introduce some sort of hysteresis on frequency changes, if needed?

> That time cannot really be eliminated other than by having faster
> OPP-switching HW support. Still, jumping straight to the "optimal" OPP
> instead of ramping up is a big improvement.
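For illustration, a minimal sketch of the kind of hysteresis being
suggested at the governor/driver boundary: scale-up requests are applied
immediately, while a scale-down request is granted only once it has
persisted for a hold time. All names and the hold-time value are
invented for this sketch, not taken from schedutil.

    #include <stdint.h>

    #define DOWN_HOLD_NS (2 * 1000 * 1000ULL) /* assumed 2ms hold time */

    struct freq_filter {
            uint64_t cur_freq;      /* last granted frequency */
            uint64_t pending_low;   /* candidate lower frequency */
            uint64_t low_since_ns;  /* when we first saw it */
    };

    static uint64_t filter_next_freq(struct freq_filter *f,
                                     uint64_t req, uint64_t now_ns)
    {
            if (req >= f->cur_freq) {
                    /* Scale up (or hold) immediately. */
                    f->pending_low = 0;
                    f->cur_freq = req;
            } else if (req != f->pending_low) {
                    /* New, lower request: start the hold window. */
                    f->pending_low = req;
                    f->low_since_ns = now_ns;
            } else if (now_ns - f->low_since_ns >= DOWN_HOLD_NS) {
                    /* Lower request persisted long enough: grant it. */
                    f->cur_freq = req;
                    f->pending_low = 0;
            }
            return f->cur_freq;
    }

Keeping this below the governor matches the "separation of concerns"
point made earlier in the thread: the scheduler reports what is
RUNNABLE, and the policy for consuming that signal lives in one place.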
Hello, Patrick.

On Tue, Feb 28, 2017 at 02:38:37PM +0000, Patrick Bellasi wrote:
>  a) Boosting of important tasks, by enforcing a minimum capacity in the
>     CPUs where they are enqueued for execution.
>  b) Capping of background tasks, by enforcing a maximum capacity.
>  c) Containment of OPPs for RT tasks which cannot easily be switched to
>     the usage of the DL class, but still don't need to run at the maximum
>     frequency.

As this is something completely new, I think it'd be a great idea to
give a couple of concrete examples in the head message to help people
understand what it's for.

Thanks.

--
tejun
On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > illustrated per your above points in that it affects both, while in
> > fact it actually modifies another metric, namely util_avg.
>
> I don't see it modifying util_avg in any direct way.

The point is that clamps called 'capacity' are applied to util. So
while you don't modify util directly, you do modify the util signal
(for one consumer).
On 12-Apr 14:22, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > Sorry, I don't get instead what are the "confusing nesting properties"
> > you are referring to?
>
> If a parent group sets min=.2 and max=.8, what are the constraints on
> its child groups for setting their respective min and max?

Currently the logic I'm proposing enforces this:

a) capacity_max can only be reduced,
   because we accept that a child can be further constrained;
   for example:
   - a resource manager allocates a max capacity to an application
   - the application itself knows that some of its children are
     background tasks and they can be further constrained

b) capacity_min can only be increased,
   because we want to inhibit children from affecting overall
   performance; for example:
   - a resource manager allocates a minimum capacity to an application
   - the application itself cannot slow down some of its children
     without risking affecting other (unknown) external entities

> I can't immediately give rules that would make sense.

The second rule is more tricky, but I see it matching better an overall
decomposition scheme where a single resource manager is allocating a
capacity_min to two different entities (A and B) which are independent
but (only it knows) are also cooperating.

Let's think about the Android run-time, which allocates resources to a
system service (entity A) which it knows has to interact with a certain
app (entity B).

The cooperation dependency can be resolved only by the resource
manager, by assigning capacity_min at entity-level CGroups. Thus, an
entity's subgroups should not be allowed to further reduce this
constraint without risking an impact on an (unknown to them) external
entity.

> For instance, allowing a child to lower min would violate the parent
> constraint,

Quite likely we don't want this.

> while allowing a child to increase min would grant the child
> more resources than the parent.

But still within the capacity_max enforced by the parent.

We should always consider the pair (min,max); once a parent has defined
this range, it seems OK to me that children can play freely within that
range. Why should a child group not be allowed to set:

   capacity_min_child = capacity_max_parent
?

> Neither seem like a good thing.

--
#include <best/regards.h>

Patrick Bellasi
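The two rules above amount to "children may only further restrict the
parent's range". A compact way to express them as a validation
predicate - purely illustrative, not code from the posted patches, with
capacities on the usual [0..1024] SCHED_CAPACITY_SCALE:

    #include <stdbool.h>

    struct cap_clamp {
            int min;        /* capacity_min */
            int max;        /* capacity_max */
    };

    /*
     * Rule a): max may only be reduced; rule b): min may only be
     * increased; and the pair must stay well formed. Note that this
     * still permits child.min == parent.max (when child.max is also
     * parent.max), as Patrick argues above.
     */
    static bool child_clamps_valid(const struct cap_clamp *parent,
                                   const struct cap_clamp *child)
    {
            return child->min >= parent->min &&
                   child->max <= parent->max &&
                   child->min <= child->max;
    }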
On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
> On 12-Apr 14:48, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 06:58:33PM +0100, Patrick Bellasi wrote:
> > > > illustrated per your above points in that it affects both, while in
> > > > fact it actually modifies another metric, namely util_avg.
> > >
> > > I don't see it modifying util_avg in any direct way.
> >
> > The point is that clamps called 'capacity' are applied to util. So
> > while you don't modify util directly, you do modify the util signal
> > (for one consumer).
>
> Right, but this consumer (i.e. schedutil) is already translating
> the util_avg into a next_freq (which ultimately is a capacity).
>
> Thus, I don't see a big misfit in that code path to "filter" this
> translation with a capacity clamp.

Still strikes me as odd though.
On 12-Apr 16:34, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 02:27:41PM +0100, Patrick Bellasi wrote:
[...]
> > Right, but this consumer (i.e. schedutil) is already translating
> > the util_avg into a next_freq (which ultimately is a capacity).
> >
> > Thus, I don't see a big misfit in that code path to "filter" this
> > translation with a capacity clamp.
>
> Still strikes me as odd though.

Can you elaborate on the why?

--
#include <best/regards.h>

Patrick Bellasi