Message ID: 1404144343-18720-10-git-send-email-vincent.guittot@linaro.org
State: New
On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>
> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> the utilization of the CPU. This statistic includes all tasks that run on the
> CPU and not only CFS tasks.

But this rq->avg is not the one that is migration aware, right? So why use this?

We already compensate cpu_capacity for !fair tasks, so I don't see why we can't
use the migration aware one (and kill this one as Yuyang keeps proposing) and
compensate with the capacity factor.
On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>>
>> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> the utilization of the CPU. This statistic includes all tasks that run on the
>> CPU and not only CFS tasks.
>
> But this rq->avg is not the one that is migration aware, right? So why
> use this?

Yes, it's not the one that is migration aware.

> We already compensate cpu_capacity for !fair tasks, so I don't see why
> we can't use the migration aware one (and kill this one as Yuyang keeps
> proposing) and compensate with the capacity factor.

The 1st point is that cpu_capacity is compensated by both !fair_tasks and
frequency scaling, and we should not take frequency scaling into account when
detecting overload.

What we have now is the weighted load avg, that is, the sum of the weighted
load of the entities on the run queue. This is not usable to detect overload
because of the weight. An unweighted version of this figure would be more
useful, but it's not as accurate as the one I use, IMHO. The example that was
discussed during the review of the last version has shown some limitations.

With the following schedule pattern from Morten's example:

     | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
  A:   run     rq    run   ------------- sleeping -------------  run
  B:    rq    run     rq    run   ------ sleeping -------------   rq

The scheduler will see the following values:

  Task A unweighted load value is 47%
  Task B unweighted load value is 60%
  The maximum sum of unweighted load is 104%
  rq->avg load is 60%

And the real CPU load is 50%.

So we will take opposite decisions depending on which value is used: the
rq->avg or the sum of unweighted load.

The sum of unweighted load has the main advantage of showing immediately what
the relative impact of adding/removing a task will be. In the example, we can
see that removing task A or B will remove around half the CPU load, but it's
not so good for giving the current utilization of the CPU.

Vincent
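For reference, a minimal sketch of the utilization figure being discussed, derived from the rq-wide average that this patch restores. The helper name cpu_utilization() and the SCHED_CAPACITY_SCALE scaling are illustration-only assumptions, not part of the posted series:

    /*
     * Sketch: CPU utilization from the rq-wide runnable average, scaled to
     * [0..SCHED_CAPACITY_SCALE]. Uses the runnable_avg_sum/runnable_avg_period
     * fields restored by this revert; the +1 avoids a division by zero before
     * the first update.
     */
    static unsigned long cpu_utilization(int cpu)
    {
    	struct sched_avg *avg = &cpu_rq(cpu)->avg;

    	return (avg->runnable_avg_sum * SCHED_CAPACITY_SCALE) /
    	       (avg->runnable_avg_period + 1);
    }

As discussed above, this reflects the time the CPU had at least one runnable task of any class, regardless of how many tasks were queued at once.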
On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>>> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>>>
>>> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>>> the utilization of the CPU. This statistic includes all tasks that run on the
>>> CPU and not only CFS tasks.
>>
>> But this rq->avg is not the one that is migration aware, right? So why
>> use this?
>
> Yes, it's not the one that is migration aware.
>
>> We already compensate cpu_capacity for !fair tasks, so I don't see why
>> we can't use the migration aware one (and kill this one as Yuyang keeps
>> proposing) and compensate with the capacity factor.
>
> The 1st point is that cpu_capacity is compensated by both !fair_tasks and
> frequency scaling, and we should not take frequency scaling into account when
> detecting overload.

dvfs could help? Also, we should not use arch_scale_freq_capacity() for things
like cpufreq-ondemand etc., because for those the compute capacity is still the
max. We should only use it when we hard limit things.

> What we have now is the weighted load avg, that is, the sum of the weighted
> load of the entities on the run queue. This is not usable to detect overload
> because of the weight. An unweighted version of this figure would be more
> useful, but it's not as accurate as the one I use, IMHO. The example that was
> discussed during the review of the last version has shown some limitations.
>
> With the following schedule pattern from Morten's example:
>
>      | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
>   A:   run     rq    run   ------------- sleeping -------------  run
>   B:    rq    run     rq    run   ------ sleeping -------------   rq
>
> The scheduler will see the following values:
>
>   Task A unweighted load value is 47%
>   Task B unweighted load value is 60%
>   The maximum sum of unweighted load is 104%
>   rq->avg load is 60%
>
> And the real CPU load is 50%.
>
> So we will take opposite decisions depending on which value is used: the
> rq->avg or the sum of unweighted load.
>
> The sum of unweighted load has the main advantage of showing immediately what
> the relative impact of adding/removing a task will be. In the example, we can
> see that removing task A or B will remove around half the CPU load, but it's
> not so good for giving the current utilization of the CPU.

In that same discussion ISTR a suggestion about adding avg_running time, as
opposed to the current avg_runnable. The sum of avg_running should be much
more accurate, and still react correctly to migrations.
On Fri, Jul 11, 2014 at 08:51:06AM +0100, Vincent Guittot wrote:
> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
> >>
> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
> >> the utilization of the CPU. This statistic includes all tasks that run on the
> >> CPU and not only CFS tasks.
> >
> > But this rq->avg is not the one that is migration aware, right? So why
> > use this?
>
> Yes, it's not the one that is migration aware.
>
> > We already compensate cpu_capacity for !fair tasks, so I don't see why
> > we can't use the migration aware one (and kill this one as Yuyang keeps
> > proposing) and compensate with the capacity factor.
>
> The 1st point is that cpu_capacity is compensated by both !fair_tasks and
> frequency scaling, and we should not take frequency scaling into account when
> detecting overload.
>
> What we have now is the weighted load avg, that is, the sum of the weighted
> load of the entities on the run queue. This is not usable to detect overload
> because of the weight. An unweighted version of this figure would be more
> useful, but it's not as accurate as the one I use, IMHO.

IMHO there is no perfect utilization metric, but I think it is fundamentally
wrong to use a metric that is migration unaware to make migration decisions. I
mentioned that during the last review as well.

It is like having a very fast controller with a really slow (large delay)
feedback loop. There is a high risk of getting an unstable balance when your
load-balance rate is faster than the feedback delay.

> The example that was discussed during the review of the last version has
> shown some limitations.
>
> With the following schedule pattern from Morten's example:
>
>      | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
>   A:   run     rq    run   ------------- sleeping -------------  run
>   B:    rq    run     rq    run   ------ sleeping -------------   rq
>
> The scheduler will see the following values:
>
>   Task A unweighted load value is 47%
>   Task B unweighted load value is 60%
>   The maximum sum of unweighted load is 104%
>   rq->avg load is 60%
>
> And the real CPU load is 50%.
>
> So we will take opposite decisions depending on which value is used: the
> rq->avg or the sum of unweighted load.
>
> The sum of unweighted load has the main advantage of showing immediately what
> the relative impact of adding/removing a task will be. In the example, we can
> see that removing task A or B will remove around half the CPU load, but it's
> not so good for giving the current utilization of the CPU.

You forgot to mention the issues with rq->avg that were brought up last
time :-)

Here is a load-balancing example: Task A, B, C, and D are all running/runnable
constantly. To avoid decimals we assume the sched tick to have a 9 ms period.
We have four cpus in a single sched_domain.
rq == rq->avg
uw == unweighted tracked load

cpu0:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:      run     rq     rq
B:       rq    run     rq
C:       rq     rq    run
D:       rq     rq     rq    run    run    run    run    run    run
rq:    100%   100%   100%   100%   100%   100%   100%   100%   100%
uw:    400%   400%   400%   100%   100%   100%   100%   100%   100%

cpu1:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:                          run     rq    run     rq    run     rq
B:                           rq    run     rq    run     rq    run
C:
D:
rq:      0%     0%     0%     0%     6%    12%    18%    23%    28%
uw:      0%     0%     0%   200%   200%   200%   200%   200%   200%

cpu2:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C:                          run    run    run    run    run    run
D:
rq:      0%     0%     0%     0%     6%    12%    18%    23%    28%
uw:      0%     0%     0%   100%   100%   100%   100%   100%   100%

cpu3:
      | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms | 3 ms |
A:
B:
C:
D:
rq:      0%     0%     0%     0%     0%     0%     0%     0%     0%
uw:      0%     0%     0%     0%     0%     0%     0%     0%     0%

A periodic load-balance occurs on cpu1 after 9 ms. cpu0 rq->avg indicates
overload. Consequently cpu1 pulls task A and B.

Shortly after (<1 ms) cpu2 does a periodic load-balance. cpu0 rq->avg hasn't
changed, so cpu0 still appears overloaded. cpu2 pulls task C.

Shortly after (<1 ms) cpu3 does a periodic load-balance. cpu0 rq->avg still
indicates overload, so cpu3 tries to pull tasks but fails since there is only
task D left.

9 ms later the sched tick causes periodic load-balances on all the cpus. cpu0
rq->avg still indicates that it has the highest load, since cpu1 rq->avg has
not had time to indicate overload. Consequently cpu1, 2, and 3 will try to
pull from it and fail. The balance will only change once cpu1 rq->avg has
increased enough to indicate overload.

Unweighted load, on the other hand, reflects the load changes instantaneously,
so cpu3 would observe the overload of cpu1 immediately and pull task A or B.

In this example using rq->avg leads to imbalance whereas unweighted load would
not. Correct me if I missed anything.

Coming back to the previous example, I'm not convinced that inflation of the
unweighted load sum when tasks overlap in time is a bad thing. I have
mentioned this before. The average cpu utilization over the 40 ms period is
50%. However, the true compute capacity demand is 200% for the first 15 ms of
the period, 100% for the next 5 ms, and 0% for the remaining 25 ms. The cpu is
actually overloaded for 15 ms every 40 ms. This fact is factored into the
unweighted load, whereas rq->avg would give you the same utilization no matter
whether the tasks overlap or not. Hence unweighted load would give us an
indication that the mix of tasks isn't optimal even if the cpu has spare
cycles.

If you don't care about overlap and latency, the unweighted sum of task
running time (that Peter has proposed a number of times) is a better metric,
IMHO, as long as the cpu isn't fully utilized.

Morten
On 11 July 2014 17:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 11, 2014 at 09:51:06AM +0200, Vincent Guittot wrote:
>> On 10 July 2014 15:16, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, Jun 30, 2014 at 06:05:40PM +0200, Vincent Guittot wrote:
>> >> This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.
>> >>
>> >> We are going to use runnable_avg_sum and runnable_avg_period in order to get
>> >> the utilization of the CPU. This statistic includes all tasks that run on the
>> >> CPU and not only CFS tasks.
>> >
>> > But this rq->avg is not the one that is migration aware, right? So why
>> > use this?
>>
>> Yes, it's not the one that is migration aware.
>>
>> > We already compensate cpu_capacity for !fair tasks, so I don't see why
>> > we can't use the migration aware one (and kill this one as Yuyang keeps
>> > proposing) and compensate with the capacity factor.
>>
>> The 1st point is that cpu_capacity is compensated by both !fair_tasks and
>> frequency scaling, and we should not take frequency scaling into account
>> when detecting overload.
>
> dvfs could help? Also, we should not use arch_scale_freq_capacity() for
> things like cpufreq-ondemand etc., because for those the compute capacity is
> still the max. We should only use it when we hard limit things.

In my mind, arch_scale_cpu_freq was intended to scale the capacity of the CPU
according to the current dvfs operating point. As it's no longer used anywhere
now that we have arch_scale_cpu, we could probably remove it... and see when
it becomes used.

>> What we have now is the weighted load avg, that is, the sum of the weighted
>> load of the entities on the run queue. This is not usable to detect overload
>> because of the weight. An unweighted version of this figure would be more
>> useful, but it's not as accurate as the one I use, IMHO. The example that
>> was discussed during the review of the last version has shown some
>> limitations.
>>
>> With the following schedule pattern from Morten's example:
>>
>>      | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
>>   A:   run     rq    run   ------------- sleeping -------------  run
>>   B:    rq    run     rq    run   ------ sleeping -------------   rq
>>
>> The scheduler will see the following values:
>>
>>   Task A unweighted load value is 47%
>>   Task B unweighted load value is 60%
>>   The maximum sum of unweighted load is 104%
>>   rq->avg load is 60%
>>
>> And the real CPU load is 50%.
>>
>> So we will take opposite decisions depending on which value is used: the
>> rq->avg or the sum of unweighted load.
>>
>> The sum of unweighted load has the main advantage of showing immediately
>> what the relative impact of adding/removing a task will be. In the example,
>> we can see that removing task A or B will remove around half the CPU load,
>> but it's not so good for giving the current utilization of the CPU.
>
> In that same discussion ISTR a suggestion about adding avg_running time, as
> opposed to the current avg_runnable. The sum of avg_running should be much
> more accurate, and still react correctly to migrations.

I haven't looked at it in detail, but I agree that avg_running would be much
more accurate than avg_runnable and should probably fit the requirement. Does
it mean that we could re-add the avg_running (or something similar) that
disappeared during the review of the load avg tracking patchset?
On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> In my mind, arch_scale_cpu_freq was intended to scale the capacity of the CPU
> according to the current dvfs operating point. As it's no longer used
> anywhere now that we have arch_scale_cpu, we could probably remove it... and
> see when it becomes used.

I probably should have written comments when I wrote that code, but it was
meant to be used only where, as described above, we limit things. Ondemand and
such, which will temporarily decrease freq, will ramp it up again on demand,
and therefore lowering the capacity will skew things.

You'll put less load on because it's run slower, and then you'll run it slower
because there's less load on -> cyclic FAIL.

> > In that same discussion ISTR a suggestion about adding avg_running time, as
> > opposed to the current avg_runnable. The sum of avg_running should be much
> > more accurate, and still react correctly to migrations.
>
> I haven't looked at it in detail, but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the requirement. Does
> it mean that we could re-add the avg_running (or something similar) that
> disappeared during the review of the load avg tracking patchset?

Sure, I think we killed it there because there wasn't an actual use for it,
and I'm always in favour of stripping everything to its bare bones, esp. big
and complex things.

And then later, add things back once we have need for it.
On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > In my mind, arch_scale_cpu_freq was intended to scale the capacity of the
> > CPU according to the current dvfs operating point. As it's no longer used
> > anywhere now that we have arch_scale_cpu, we could probably remove it...
> > and see when it becomes used.
>
> I probably should have written comments when I wrote that code, but it was
> meant to be used only where, as described above, we limit things. Ondemand
> and such, which will temporarily decrease freq, will ramp it up again on
> demand, and therefore lowering the capacity will skew things.
>
> You'll put less load on because it's run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.

Agreed. We can't use a frequency scaled compute capacity for all
load-balancing decisions. However, IMHO, it would be useful to know the
current compute capacity in addition to the max compute capacity when
considering energy costs. So we would have something like:

* capacity_max: cpu capacity at the highest frequency.

* capacity_cur: cpu capacity at the current frequency.

* capacity_avail: cpu capacity currently available. Basically capacity_cur
  taking rt, deadline, and irq accounting into account.

capacity_max should probably include rt, deadline, and irq accounting as well.
Or do we need both?

Based on your description arch_scale_freq_capacity() can't be abused to
implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody
seems to implement it. Otherwise we would need something similar to update
capacity_cur (and capacity_avail).

As a side note, we can potentially get into a similar fail cycle already due
to the lack of scale invariance in the entity load tracking.

> > > In that same discussion ISTR a suggestion about adding avg_running time,
> > > as opposed to the current avg_runnable. The sum of avg_running should be
> > > much more accurate, and still react correctly to migrations.
> >
> > I haven't looked at it in detail, but I agree that avg_running would be
> > much more accurate than avg_runnable and should probably fit the
> > requirement. Does it mean that we could re-add the avg_running (or
> > something similar) that disappeared during the review of the load avg
> > tracking patchset?
>
> Sure, I think we killed it there because there wasn't an actual use for it,
> and I'm always in favour of stripping everything to its bare bones, esp. big
> and complex things.
>
> And then later, add things back once we have need for it.

I think it is a useful addition to the set of utilization metrics. I don't
think it is universally more accurate than runnable_avg. Actually quite the
opposite when the cpu is overloaded. But for partially loaded cpus it is very
useful if you don't want to factor in waiting time on the rq.

Morten
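As a rough illustration of how the three values proposed above could relate to each other, here is a sketch. All names and the fixed-point scale factors are hypothetical; none of these helpers exist in the kernel at this point:

    /*
     * Hypothetical per-cpu capacity values following the definitions above.
     * freq_scale and nonfair_scale are assumed fixed-point factors in
     * [0..SCHED_CAPACITY_SCALE].
     */
    struct cpu_capacity {
    	unsigned long max;	/* capacity at the highest frequency */
    	unsigned long cur;	/* capacity at the current frequency */
    	unsigned long avail;	/* cur, minus time taken by rt/dl/irq */
    };

    static void update_cpu_capacity_state(struct cpu_capacity *cap,
    					  unsigned long freq_scale,
    					  unsigned long nonfair_scale)
    {
    	cap->cur   = (cap->max * freq_scale) >> SCHED_CAPACITY_SHIFT;
    	cap->avail = (cap->cur * nonfair_scale) >> SCHED_CAPACITY_SHIFT;
    }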
On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > In my mind, arch_scale_cpu_freq was intended to scale the capacity of the
> > > CPU according to the current dvfs operating point. As it's no longer used
> > > anywhere now that we have arch_scale_cpu, we could probably remove it...
> > > and see when it becomes used.
> >
> > I probably should have written comments when I wrote that code, but it was
> > meant to be used only where, as described above, we limit things. Ondemand
> > and such, which will temporarily decrease freq, will ramp it up again on
> > demand, and therefore lowering the capacity will skew things.
> >
> > You'll put less load on because it's run slower, and then you'll run it
> > slower because there's less load on -> cyclic FAIL.
>
> Agreed. We can't use a frequency scaled compute capacity for all
> load-balancing decisions. However, IMHO, it would be useful to know the
> current compute capacity in addition to the max compute capacity when
> considering energy costs. So we would have something like:
>
> * capacity_max: cpu capacity at the highest frequency.
>
> * capacity_cur: cpu capacity at the current frequency.
>
> * capacity_avail: cpu capacity currently available. Basically capacity_cur
>   taking rt, deadline, and irq accounting into account.
>
> capacity_max should probably include rt, deadline, and irq accounting as
> well. Or do we need both?

I'm struggling to fully grasp your intent. We need DVFS like accounting for
sure, and that means a current freq hook, but I'm not entirely sure how that
relates to capacity.

> Based on your description arch_scale_freq_capacity() can't be abused to
> implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody
> seems to implement it. Otherwise we would need something similar to update
> capacity_cur (and capacity_avail).

Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT
capacity thing for x86 but never finished that. The naive implementation
suffered the same FAIL loop as above, because APERF stops on idle. So when
idle your capacity drops to nothing, leading to no new work, leading to more
idle, etc.

I never got around to fixing that -- adding an idle filter, and ever since
things have somewhat bitrotted.

> As a side note, we can potentially get into a similar fail cycle already
> due to the lack of scale invariance in the entity load tracking.

Yah, I think that got mentioned a long while ago.

> > > > In that same discussion ISTR a suggestion about adding avg_running
> > > > time, as opposed to the current avg_runnable. The sum of avg_running
> > > > should be much more accurate, and still react correctly to migrations.
> > >
> > > I haven't looked at it in detail, but I agree that avg_running would be
> > > much more accurate than avg_runnable and should probably fit the
> > > requirement. Does it mean that we could re-add the avg_running (or
> > > something similar) that disappeared during the review of the load avg
> > > tracking patchset?
> >
> > Sure, I think we killed it there because there wasn't an actual use for
> > it, and I'm always in favour of stripping everything to its bare bones,
> > esp. big and complex things.
> >
> > And then later, add things back once we have need for it.
>
> I think it is a useful addition to the set of utilization metrics. I don't
> think it is universally more accurate than runnable_avg. Actually quite the
> opposite when the cpu is overloaded. But for partially loaded cpus it is
> very useful if you don't want to factor in waiting time on the rq.

Well, different things, different names. Utilization as per the literature is
simply the fraction of CPU time actually used. In that sense running_avg is
about right for that. Our current runnable_avg is entirely different (as I
think we all agree by now).

But yes, for application the tipping point is u == 1; up until that point pure
utilization makes sense, after that our runnable_avg makes more sense.
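A small sketch of that tipping-point idea, assuming both a pure utilization signal (running_avg) and the existing runnable_avg were available as per-cpu values; the helper and both inputs are assumptions for illustration only:

    /*
     * Sketch: below full utilization (u == 1) the pure utilization signal is
     * the meaningful one; once the cpu saturates, runnable_avg keeps growing
     * with queueing and becomes the more useful load indication.
     */
    static unsigned long cpu_load_metric(unsigned long running_avg,
    				     unsigned long runnable_avg,
    				     unsigned long capacity)
    {
    	if (running_avg < capacity)	/* u < 1: not saturated */
    		return running_avg;

    	return runnable_avg;		/* saturated: reflects queueing */
    }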
On Mon, Jul 14, 2014 at 02:20:52PM +0100, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 01:55:29PM +0100, Morten Rasmussen wrote:
> > On Fri, Jul 11, 2014 at 09:12:38PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
> > > > In my mind, arch_scale_cpu_freq was intended to scale the capacity of
> > > > the CPU according to the current dvfs operating point. As it's no
> > > > longer used anywhere now that we have arch_scale_cpu, we could probably
> > > > remove it... and see when it becomes used.
> > >
> > > I probably should have written comments when I wrote that code, but it
> > > was meant to be used only where, as described above, we limit things.
> > > Ondemand and such, which will temporarily decrease freq, will ramp it up
> > > again on demand, and therefore lowering the capacity will skew things.
> > >
> > > You'll put less load on because it's run slower, and then you'll run it
> > > slower because there's less load on -> cyclic FAIL.
> >
> > Agreed. We can't use a frequency scaled compute capacity for all
> > load-balancing decisions. However, IMHO, it would be useful to know the
> > current compute capacity in addition to the max compute capacity when
> > considering energy costs. So we would have something like:
> >
> > * capacity_max: cpu capacity at the highest frequency.
> >
> > * capacity_cur: cpu capacity at the current frequency.
> >
> > * capacity_avail: cpu capacity currently available. Basically capacity_cur
> >   taking rt, deadline, and irq accounting into account.
> >
> > capacity_max should probably include rt, deadline, and irq accounting as
> > well. Or do we need both?
>
> I'm struggling to fully grasp your intent. We need DVFS like accounting for
> sure, and that means a current freq hook, but I'm not entirely sure how that
> relates to capacity.

We can abstract all the factors that affect current compute capacity
(frequency, P-states, big.LITTLE, ...) in the scheduler by having something
like capacity_{cur,avail} to tell us how much capacity a particular cpu has in
its current state. Assuming that we implement scale invariance for entity load
tracking (we are working on that), we can directly compare task utilization
with compute capacity for balancing decisions. For example, we can figure out
how much spare capacity a cpu has in its current state by simply:

  spare_capacity(cpu) = capacity_avail(cpu) - \sum_{t in tasks(cpu)} util(t)

If you put more than spare_capacity(cpu) worth of task utilization on the cpu,
you will cause the cpu (and any affected cpus) to change P-state and
potentially be less energy-efficient.

Does that make any sense?

Instead of dealing with frequencies directly in the scheduler code, we can
abstract it by just having scalable compute capacity.

> > Based on your description arch_scale_freq_capacity() can't be abused to
> > implement capacity_cur (and capacity_avail) unless it is repurposed. Nobody
> > seems to implement it. Otherwise we would need something similar to update
> > capacity_cur (and capacity_avail).
>
> Yeah, I never got around to doing so. I started doing an APERF/MPERF SMT
> capacity thing for x86 but never finished that. The naive implementation
> suffered the same FAIL loop as above, because APERF stops on idle. So when
> idle your capacity drops to nothing, leading to no new work, leading to more
> idle, etc.
>
> I never got around to fixing that -- adding an idle filter, and ever since
> things have somewhat bitrotted.

I see.
> > As a side note, we can potentially get into a similar fail cycle already
> > due to the lack of scale invariance in the entity load tracking.
>
> Yah, I think that got mentioned a long while ago.

It did :-)

> > > > > In that same discussion ISTR a suggestion about adding avg_running
> > > > > time, as opposed to the current avg_runnable. The sum of avg_running
> > > > > should be much more accurate, and still react correctly to
> > > > > migrations.
> > > >
> > > > I haven't looked at it in detail, but I agree that avg_running would
> > > > be much more accurate than avg_runnable and should probably fit the
> > > > requirement. Does it mean that we could re-add the avg_running (or
> > > > something similar) that disappeared during the review of the load avg
> > > > tracking patchset?
> > >
> > > Sure, I think we killed it there because there wasn't an actual use for
> > > it, and I'm always in favour of stripping everything to its bare bones,
> > > esp. big and complex things.
> > >
> > > And then later, add things back once we have need for it.
> >
> > I think it is a useful addition to the set of utilization metrics. I
> > don't think it is universally more accurate than runnable_avg. Actually
> > quite the opposite when the cpu is overloaded. But for partially loaded
> > cpus it is very useful if you don't want to factor in waiting time on
> > the rq.
>
> Well, different things, different names. Utilization as per the literature
> is simply the fraction of CPU time actually used. In that sense running_avg
> is about right for that. Our current runnable_avg is entirely different (as
> I think we all agree by now).
>
> But yes, for application the tipping point is u == 1; up until that point
> pure utilization makes sense, after that our runnable_avg makes more sense.

Agreed.

If you really care about latency/performance you might be interested in
comparing running_avg and runnable_avg even for u < 1. If the
running_avg/runnable_avg ratio is significantly less than one, tasks are
waiting on the rq to be scheduled.
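A sketch of the spare-capacity estimate described above. capacity_avail() and task_utilization() are assumed helpers following the definitions proposed in this thread (nothing like them exists in mainline at this point), and runqueue locking is ignored:

    /* Sketch: remaining capacity at the cpu's current operating point. */
    static unsigned long spare_capacity(int cpu)
    {
    	struct task_struct *p;
    	unsigned long util = 0, avail = capacity_avail(cpu);

    	/* sum the (assumed scale-invariant) utilization of the fair tasks on this cpu */
    	list_for_each_entry(p, &cpu_rq(cpu)->cfs_tasks, se.group_node)
    		util += task_utilization(p);

    	return avail > util ? avail - util : 0;
    }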
On Mon, Jul 14, 2014 at 03:04:35PM +0100, Morten Rasmussen wrote:
> > I'm struggling to fully grasp your intent. We need DVFS like accounting
> > for sure, and that means a current freq hook, but I'm not entirely sure
> > how that relates to capacity.
>
> We can abstract all the factors that affect current compute capacity
> (frequency, P-states, big.LITTLE, ...) in the scheduler by having something
> like capacity_{cur,avail} to tell us how much capacity a particular cpu has
> in its current state. Assuming that we implement scale invariance for entity
> load tracking (we are working on that), we can directly compare task
> utilization with compute capacity for balancing decisions. For example, we
> can figure out how much spare capacity a cpu has in its current state by
> simply:
>
>   spare_capacity(cpu) = capacity_avail(cpu) - \sum_{t in tasks(cpu)} util(t)
>
> If you put more than spare_capacity(cpu) worth of task utilization on the
> cpu, you will cause the cpu (and any affected cpus) to change P-state and
> potentially be less energy-efficient.
>
> Does that make any sense?
>
> Instead of dealing with frequencies directly in the scheduler code, we can
> abstract it by just having scalable compute capacity.

Ah, ok. Same thing then.

> > But yes, for application the tipping point is u == 1; up until that point
> > pure utilization makes sense, after that our runnable_avg makes more sense.
>
> Agreed.
>
> If you really care about latency/performance you might be interested in
> comparing running_avg and runnable_avg even for u < 1. If the
> running_avg/runnable_avg ratio is significantly less than one, tasks are
> waiting on the rq to be scheduled.

Indeed, that gives a measure of queueing.
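A sketch of that queueing measure, assuming a running_avg_sum were tracked next to the existing runnable_avg_sum (no such field exists in mainline at this point; the names just follow the discussion above):

    /*
     * Sketch: ratio of time spent running to time spent runnable, scaled to
     * [0..SCHED_CAPACITY_SCALE]. A result well below the scale means a
     * significant part of the runnable time was spent waiting on the runqueue.
     */
    static unsigned long queueing_ratio(struct sched_avg *sa)
    {
    	if (!sa->runnable_avg_sum)
    		return SCHED_CAPACITY_SCALE;

    	return (sa->running_avg_sum * SCHED_CAPACITY_SCALE) /
    	       sa->runnable_avg_sum;
    }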
[...]

>> In that same discussion ISTR a suggestion about adding avg_running time,
>> as opposed to the current avg_runnable. The sum of avg_running should be
>> much more accurate, and still react correctly to migrations.
>
> I haven't looked at it in detail, but I agree that avg_running would be much
> more accurate than avg_runnable and should probably fit the requirement.
> Does it mean that we could re-add the avg_running (or something similar)
> that disappeared during the review of the load avg tracking patchset?

Are you referring to '[RFC PATCH 14/14] sched: implement usage tracking'
(https://lkml.org/lkml/2012/2/1/769) from Paul Turner?

__update_entity_runnable_avg() has an additional parameter 'running' so that
it can be called for

a) sched_entities in update_entity_load_avg():

   __update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))

b) rq's in update_rq_runnable_avg():

   __update_entity_runnable_avg(..., runnable, runnable);

I can see how it gives us two different signals for a sched_entity, but for a
rq? Am I missing something here?

-- Dietmar

[...]
On 11 July 2014 22:12, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 11, 2014 at 07:39:29PM +0200, Vincent Guittot wrote:
>> In my mind, arch_scale_cpu_freq was intended to scale the capacity of the
>> CPU according to the current dvfs operating point. As it's no longer used
>> anywhere now that we have arch_scale_cpu, we could probably remove it...
>> and see when it becomes used.
>
> I probably should have written comments when I wrote that code, but it was
> meant to be used only where, as described above, we limit things. Ondemand
> and such, which will temporarily decrease freq, will ramp it up again on
> demand, and therefore lowering the capacity will skew things.
>
> You'll put less load on because it's run slower, and then you'll run it
> slower because there's less load on -> cyclic FAIL.
>
>> > In that same discussion ISTR a suggestion about adding avg_running time,
>> > as opposed to the current avg_runnable. The sum of avg_running should be
>> > much more accurate, and still react correctly to migrations.
>>
>> I haven't looked at it in detail, but I agree that avg_running would be
>> much more accurate than avg_runnable and should probably fit the
>> requirement. Does it mean that we could re-add the avg_running (or
>> something similar) that disappeared during the review of the load avg
>> tracking patchset?
>
> Sure, I think we killed it there because there wasn't an actual use for it,
> and I'm always in favour of stripping everything to its bare bones, esp. big
> and complex things.
>
> And then later, add things back once we have need for it.

Ok, I'm going to look at how to add it back, taking into account Yuyang's
current rework of load_avg.
On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

[snip]

> In this example using rq->avg leads to imbalance whereas unweighted load
> would not. Correct me if I missed anything.

You just missed taking into account how the imbalance is computed.

> Coming back to the previous example, I'm not convinced that inflation of the
> unweighted load sum when tasks overlap in time is a bad thing. I have
> mentioned this before. The average cpu utilization over the 40 ms period is
> 50%. However, the true compute capacity demand is 200% for the first 15 ms
> of the period, 100% for the next 5 ms, and 0% for the remaining 25 ms. The
> cpu is actually overloaded for 15 ms every 40 ms. This fact is factored into
> the unweighted load, whereas rq->avg would give you the same utilization no
> matter whether the tasks overlap or not. Hence unweighted load would give us
> an indication that the mix of tasks isn't optimal even if the cpu has spare
> cycles.
>
> If you don't care about overlap and latency, the unweighted sum of task
> running time (that Peter has proposed a number of times) is a better metric,
> IMHO, as long as the cpu isn't fully utilized.
>
> Morten
On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
> On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> [snip]
>
> > In this example using rq->avg leads to imbalance whereas unweighted load
> > would not. Correct me if I missed anything.
>
> You just missed taking into account how the imbalance is computed.

I don't think so. I'm aware that the imbalance is calculated based on the
runnable_load_avg of the cpus. But if you pick the wrong cpus to compare to
begin with, it doesn't matter.

Morten
On 15 July 2014 11:32, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Jul 15, 2014 at 10:27:19AM +0100, Vincent Guittot wrote:
>> On 11 July 2014 18:13, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>
>> [snip]
>>
>> > In this example using rq->avg leads to imbalance whereas unweighted load
>> > would not. Correct me if I missed anything.
>>
>> You just missed taking into account how the imbalance is computed.
>
> I don't think so. I'm aware that the imbalance is calculated based on the
> runnable_load_avg of the cpus. But if you pick the wrong cpus to compare to
> begin with, it doesn't matter.

The average load of your sched_domain is 1 task per CPU, so CPU1 will pull
only 1 task and not 2.
On Mon, Jul 14, 2014 at 06:54:11PM +0100, Dietmar Eggemann wrote:
> __update_entity_runnable_avg() has an additional parameter 'running' so that
> it can be called for
>
> a) sched_entities in update_entity_load_avg():
>
>    __update_entity_runnable_avg(..., se->on_rq, cfs_rq->curr == se))
>
> b) rq's in update_rq_runnable_avg():
>
>    __update_entity_runnable_avg(..., runnable, runnable);
>
> I can see how it gives us two different signals for a sched_entity, but for
> a rq?

For rq, it would be:

   __update_entity_runnable_avg(..., runnable, runnable > 0)

Then, the first one would effectively be CPU ConCurrency (fair and !fair) and
the second one would be whether the CPU has a task running (or roughly CPU
utilization, for fair and !fair) :)

Thanks,
Yuyang
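For context, here is a rough sketch of the distinction being discussed, loosely following the 'running' parameter from the usage-tracking RFC Dietmar references. The field names and the simplified body (decay of the sums is omitted) are illustrative only, not the actual patch:

    /*
     * Sketch only: accumulate two signals over the same periods, one while
     * the entity/rq is runnable (running or waiting) and one only while it
     * is actually running. Decay of the sums (~1/2 every 32 ms) is omitted.
     */
    static void __update_avg_sketch(u64 now, struct sched_avg *sa,
    				    int runnable, int running)
    {
    	u64 delta = now - sa->last_runnable_update;	/* time since last update */

    	sa->last_runnable_update = now;
    	sa->runnable_avg_period += delta;		/* total elapsed time */

    	if (runnable)
    		sa->runnable_avg_sum += delta;		/* running or queued */
    	if (running)
    		sa->running_avg_sum += delta;		/* on the cpu only */
    }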
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742ad88..bdc5cb9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2428,19 +2428,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
 }
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2539,6 +2532,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 					   struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b311de..fb53166 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
@@ -634,6 +632,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)
This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.

We are going to use runnable_avg_sum and runnable_avg_period in order to get
the utilization of the CPU. This statistic includes all tasks that run on the
CPU and not only CFS tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 13 ++++++-------
 kernel/sched/sched.h |  4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)