Message ID | 1448372970-8764-1-git-send-email-vincent.guittot@linaro.org |
---|---|
State | New |
On 25 November 2015 at 10:24, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>
> So I'm leaning towards liking this; however with your previous example
> of 3 cpus and 7 tasks, where CPU0-1 are 'little' and of half the
> capacity as the 'big' CPU2, with 2 tasks on CPU0-1 each and 3 tasks on
> CPU2.
>
> This would result, for CPU0, in a load of 100% wait time + 100% runtime,
> scaling the runtime 50% will get you a total load of 150%.
>
> For CPU2 we get 100% runtime and 200% wait time, no scaling, for a total
> load of 300%.
>
> So the CPU0-1 cluster has a 300% load and the CPU2 'cluster' has a 300%
> load, even though the actual load is not actually equal, CPUs0-1
> combined have the same capacity as CPU2, so it should be 4-4 tasks for
> an equal balance.

With the example above, we have (once everything has reached its
stable value)

With the mainline:
load_avg of CPU0 : 2048 and load_avg of each task should be 1024
load_avg of CPU1 : 2048 and load_avg of each task should be 1024
load_avg of CPU2 : 3072 and load_avg of each task should be 1024

With this patch which now includes the cpu invariance in the calculation
of load_avg
load_avg of CPU0 : 2048 and load_avg of each task should be 1024
load_avg of CPU1 : 2048 and load_avg of each task should be 1024
load_avg of CPU2 : 3072 and load_avg of each task should be 1024

The main difference will be in the time needed to reach these values.
CPU2 will reach 95% of the final value in 136ms whereas the load_avg of
CPU0 and CPU1 should be around 789 at that time and will reach the same
value as CPU2 after an additional 136ms

Regards,
Vincent

>
>
> So I'm not sure the claim of comparable between CPUs stands. Still it is
> an interesting idea and I will consider it more.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
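The ramp-up figures in the reply above (95% after 136ms, about 789 at half capacity) can be sanity-checked with a closed-form model. This is only a sketch under stated assumptions: a continuous approximation of PELT with a 32ms half-life (y^32 = 0.5), a single task that never sleeps or waits, and the patched behaviour modelled as elapsed running time multiplied by the capacity ratio. The function name `ramp` and its parameters are made up for illustration and are not kernel symbols.

```python
HALF_LIFE_MS = 32.0   # PELT: y^32 = 0.5, one period taken as ~1ms (assumption)
MAX_LOAD = 1024.0


def ramp(elapsed_ms, capacity=1.0):
    """Approximate load_avg of an always-running task after elapsed_ms.

    capacity is the fraction of max capacity (1.0 = big CPU at max freq).
    Scaling the elapsed time by capacity mimics the patch; it is not the
    kernel's integer arithmetic.
    """
    accounted_ms = elapsed_ms * capacity
    return MAX_LOAD * (1.0 - 0.5 ** (accounted_ms / HALF_LIFE_MS))


if __name__ == "__main__":
    for cap in (1.0, 0.5):
        print(f"capacity {cap:.1f}: load_avg after 136ms ~ {ramp(136, cap):.0f}")
    # Half capacity needs twice the wall-clock time to reach the same value.
    print(f"capacity 0.5: load_avg after 272ms ~ {ramp(272, 0.5):.0f}")
```

Under these assumptions the full-capacity task sits at roughly 95% of 1024 after 136ms while the half-capacity task is near 789, and the half-capacity task catches up after another 136ms.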
On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
> The current implementation of load tracking invariance scales the load
> tracking value with current frequency and uarch performance (only for
> utilization) of the CPU.
>
> One main result of the current formula is that the figures are capped by
> the current capacity of the CPU. This limitation is the main reason of not
> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
> of load_avg because capping the load can generate erroneous system load
> statistic as described with this example [1]

The reason why we don't want to scale load_avg with regard to uarch
capacity (as we do with util_avg) is explained in
e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.

> Instead of scaling the complete value of PELT algo, we should only scale
> the running time by the current capacity of the CPU. It seems more correct
> to only scale the running time because the non running time of a task
> (sleeping or waiting for a runqueue) is the same whatever the current freq
> and the compute capacity of the CPU.

You seem to imply that we currently scale running, waiting, and sleeping
time. That is not the case. We scale running and waiting time, but not
sleeping time. Whether we should scale waiting time or not is a good
question. The waiting time is affected by the running time of the other
tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
one and not the other. On the other hand, not scaling waiting time would
make tasks that spend a lot of time waiting appear bigger, which could be
an advantage as it would make load-balancing more prone to spread tasks.
A third alternative is to drop the scaling of load_avg completely, but it
is still needed for util_avg as we want util_avg to be invariant to
frequency and uarch scaling.

> Then, one main advantage of this change is that the load of a task can
> reach max value whatever the current freq and the uarch of the CPU on which
> it run. It will just take more time at a lower freq than a max freq or on a
> "little" CPU compared to a "big" one. The load and the utilization stay
> invariant across system so we can still compared them between CPU but with
> a wider range of values.

Just removing scaling of waiting time and applying scaling by current
capacity (including uarch) to the running time will not make load_avg
reach the max value for tasks running alone on a cpu. Since the task
isn't waiting at all (it is alone) all contributions are running time
which is scaled, IIUC, and hence the result is still capped by the
current capacity of the cpu. But that doesn't match your example results
further down if I read them correctly.

The changes made in the code of this patch are quite subtle, but very
important as they change the behaviour of the PELT geometric series
quite a lot. It is much more than just changing whether we scale waiting
time and apply uarch scaling to running time of load_avg or not. I
think we need to understand the math behind this patch to understand how
the PELT metrics are affected because I think this patch changes some of
the fundamentals originally described by Paul and Ben.

Instead of scaling the contribution of each 1024us segment like we
currently do, this patch is essentially warping time and lumps it
together and let it contribute fully but skips decays. It is rather hard
to explain, but the result is that the patch affects both load_avg and
util_avg, and it breaks scale-invariance.

Executive summary: Scaling time (delta) instead of the individual
segment contributions breaks scale-invariance. The net result on
load_avg seems to be nothing apart from slower reaction time.

That is how I see it after having tested it a bit. But I could be getting
it all wrong.
:-/

Much more detail:

Original geometric series:

	\sum (0..n) u_n * y^n

Current geometric series with scale invariance:

	\sum (0..n) u_n * c_n * y^n

In reality we only approximate having the capacity scaling for each
segment as we don't enforce PELT updates for each capacity change due to
frequency scaling.

In this patch scaling is applied to the entire delta since last update
instead of each individual segment. That gives us a very interesting
time warping effect when updates happen less frequently than every 1ms.
On cpus with reduced capacity the delta is reduced and all the math is
done as if less time had passed since last update which introduces an
error with regard to the decay of the series as we segments of time with
zero contribution.

It is probably easier described with an example:

We have one periodic task with a period of 4ms. Busy time per activation
is 1ms at 100% capacity. The task has been running forever (>350ms) and
we consider the load_avg calculations at enqueue/dequeue, which should
be the most common update points for this scenario besides the tick
updates.

task states
   s = sleeping
   R = running (scheduled)

pelt
   d = decay segment (load_avg * y, y^32 = 0.5)
   [0..1024] = segment contribution (including any scaling)
   U = __update_load_avg() is called

f = 100%
        | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task    |   s    |   R    |   s    |   s    |   s    |   R    |
pelt ml |   d    U  1024  U   d    |   d    |   d    U  1024  U
patch   |   d    U  1024  U   d    |   d    |   d    U  1024  U

f = 33%
        | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
task    |   s    |   R    |   R    |   R    |   s    |   R    |
pelt ml |   d    U 341y^2 |  341y  |  341   U   d    U 341y^2 |
patch   |   d    U  1024  |   0    |   0    U   d    U  1024  |

In the first case, f = 100%, at the update after the busy period is
complete we decay load_avg by one period (segment) and add a
contribution of 1024. We are at 100% so it is a full contribution for
this segment both with and without this patch. The task enqueue update
accounts for the sleeping time by decaying load_avg three periods. The
same in both cases. We could say that the contributions of a full cycle
of the task is:

	f_100% cycle = 1024 + decay(4)

If we reduce the capacity to 33%, things look a bit different. In
mainline, the dequeue update after the busy period would decay three
periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
busy segments. The enqueue update decays the load_avg by one segment.
The full cycle contribution becomes:

Mainline:
	f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)

With this patch it is different. At the dequeue update we scale the time
delta instead of the contribution, such that delta = 0.33*delta, so the
calculation is based on only one period (segment) having passed. Hence we
decay by one segment and add 1024, but still set the update point to the
true timestamp so the following update doesn't take the two remaining
segments into account. The enqueue update decays the load_avg by one
segment, just like it does in mainline. The full cycle contribution
becomes:

Patch:
	f_33% cycle = 1024 + decay(2)

This is clearly different from mainline. Not only is the busy
contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
also decay less. The result is an inflation of the load_avg and util_avg
metrics for tasks that run for more than 1ms at a time if
__update_load_avg() isn't called every 1ms.

I did a quick test to confirm this using a single periodic task and
changing the compute capacity.

util_avg
capacity	mainline	patch
1024		~359		~352
512		~340		~534

Execution time went from 1.4ms to 2.8ms per activation without
overloading the cpu.

The fundamental idea in scale invariance is that util_avg should be
comparable between cpu at any capacity as long as none of them are
over-utilized. This isn't preserved by the patch in its current form.

> With this change, we don't have to test if a CPU is overloaded or not in
> order to use one metric (util) or another (load) as all metrics are always
> valid.

I'm not sure what you mean by always valid.
util_avg is still not a meaningful metric for tasks running on
over-utilized cpus, so it can not be used unconditionally. If
util_avg > capacity we still have no clue if the task can fit on a
different cpu with higher capacity.

> I have put below some examples of duration to reach some typical load value
> according to the capacity of the CPU with current implementation
> and with this patch.
>
> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
> 972 (95%)    138ms         not reachable            276ms
> 486 (47.5%)  30ms          138ms                    60ms
> 256 (25%)    13ms          32ms                     26ms

I assume that these are numbers for util_avg and not load_avg as said in
the text above. It confuses me a little bit as you started out by
talking about the lack of uarch scaling of load_avg and propose to
change that, not util_avg.

The equivalent table for load_avg would be something like this:

load_avg (%) max capacity  half capacity(mainline)  half capacity(w/ patch)
972 (95%)    138ms         138ms                    276ms
486 (47.5%)  30ms          30ms                     60ms
256 (25%)    13ms          13ms                     26ms

load_avg does reach max capacity as it is. The patch just makes it
happen at a slower pace, which I'm not sure is a good or bad thing.

> We can see that at half capacity, we need twice the duration of max
> capacity with this patch whereas we have a non linear increase of the
> duration with current implementation.

Is it a problem that the time to reach a certain value is not linear?

It is still somewhat unclear to me why we would want this change. Adding
uarch scaling to load_avg but then modify the geometric series so the
end result is the same except that it now reacts slower at lower
capacities seems a bit strange.

>
> [1] https://lkml.org/lkml/2014/12/18/128
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 28 +++++++++++++---------------
>  1 file changed, 13 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 824aa9f..f2a18e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2560,10 +2560,9 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -	u64 delta, scaled_delta, periods;
> +	u64 delta, periods;
>  	u32 contrib;
> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> -	unsigned long scale_freq, scale_cpu;
> +	unsigned int delta_w, decayed = 0;
>
>  	delta = now - sa->last_update_time;
>  	/*
> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		return 0;
>  	sa->last_update_time = now;
>
> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	if (running) {
> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));

This is where the time warping happens. delta is used to determine the
number of periods (segments) since last update. Scaling this, as opposed
to the contributions for each segment individually, can lead to
disappearing segments.
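The per-cycle arithmetic in Morten's 4ms example (1024 + decay(4) at full capacity, 341*y^2 + 341*y + 341 + decay(4) for mainline at 33%, 1024 + decay(2) with the patch) can be turned into a steady-state value with a few lines of Python. This is a rough sketch, not the kernel's integer implementation: it assumes the series has converged, samples the value right after the dequeue update, and normalises by the limit of the series for an always-running task. The helper `steady_state` is a hypothetical name used only here.

```python
Y = 0.5 ** (1.0 / 32.0)          # per-segment decay factor, y^32 = 0.5
MAX_SUM = 1024.0 / (1.0 - Y)     # limit of the series for an always-running task


def steady_state(contrib, decays_per_cycle):
    """Converged value right after the busy period, in the 0..1024 range.

    Each cycle multiplies the old sum by y^decays_per_cycle and adds
    contrib, so the fixed point is contrib / (1 - y^decays_per_cycle).
    """
    total = contrib / (1.0 - Y ** decays_per_cycle)
    return 1024.0 * total / MAX_SUM


# f = 100%: one full segment, four decays per 4ms cycle.
full = steady_state(1024.0, 4)
# f = 33%, mainline: three scaled segments, still four decays per cycle.
mainline_33 = steady_state(341.0 * (Y ** 2 + Y + 1.0), 4)
# f = 33%, patch: one unscaled segment, only two decays per cycle.
patch_33 = steady_state(1024.0, 2)

if __name__ == "__main__":
    print(f"f=100%           : {full:.0f}")
    print(f"f=33%, mainline  : {mainline_33:.0f}")
    print(f"f=33%, patch     : {patch_33:.0f}")
```

In this simplified model mainline stays close to the full-capacity value while the patched variant ends up at roughly twice that, which is the inflation described in the mail. The absolute numbers are not expected to match the measured util_avg table, which was taken with a different task.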
Hi Morten,

Thanks for the review and sorry for the late reply

On 8 December 2015 at 18:04, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason of not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg because capping the load can generate erroneous system load
>> statistic as described with this example [1]
>
> The reason why we don't want to scale load_avg with regard to uarch
> capacity (as we do with util_avg) is explained in
> e3279a2e6d697e00e74f905851ee7cf532f72b2d as well.
>
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>
> You seem to imply that we currently scale running, waiting, and sleeping
> time. That is not the case. We scale running and waiting time, but not
> sleeping time. Whether we should scale waiting time or not is a good

In fact, I was referring to the same equation as you use below
\sum (0..n) u_n * c_n * y^n to say that the complete value of PELT is
scaled because we scale all u_n which include idle fractions. Note that
this doesn't change anything at the end as u_n is null during idle.

> question. The waiting time is affected by the running time of the other
> tasks on the cfs_rq, so on one hand it seems a bit inconsistent to scale
> one and not the other. On the other hand, not scaling waiting time would
> make tasks that spend a lot of time waiting appear bigger, which could be
> an advantage as it would make load-balancing more prone to spread tasks.
> A third alternative is to drop the scaling of load_avg completely, but

I don't think it's a good idea to limit the usage of load_avg to systems
that are overloaded and, conversely, to limit util to systems that are
not overloaded. The boundary between both states is rarely clear and
you often have part of the system that can be overloaded while the other
part is not.

> it is still needed for util_avg as we want util_avg to be invariant to
> frequency and uarch scaling.
>
>> Then, one main advantage of this change is that the load of a task can
>> reach max value whatever the current freq and the uarch of the CPU on which
>> it run. It will just take more time at a lower freq than a max freq or on a
>> "little" CPU compared to a "big" one. The load and the utilization stay
>> invariant across system so we can still compared them between CPU but with
>> a wider range of values.
>
> Just removing scaling of waiting time and applying scaling by current
> capacity (including uarch) to the running time will not make load_avg
> reach the max value for tasks running alone on a cpu. Since the task
> isn't waiting at all (it is alone) all contributions are running time
> which is scaled, IIUC, and hence the result is still capped by the
> current capacity of the cpu. But that doesn't match your example results
> further down if I read them correctly.

In the current implementation, we scale the full contribution of each
fraction of time in the PELT equation so if the capacity of a CPU can't
be larger than Clocal_max because of frequency scaling and/or uarch, we
have

	\sum (0..n) u_n * c_n * y^n <= Clocal_max * \sum (0..n) u_n * y^n

With the proposed way to take into account the uarch and the current
frequency, we scale the time that elapses before accounting it into a
segment of the equation. As a summary, the delta time is scaled to
reflect the amount of time that would have been used at the max capacity
of the system. So if the frequency is half max freq, the time that will
be accounted will be half the really elapsed time. In parallel, the
duration of the job will be twice as long so we will have the same amount
of time accounted at the end.
With this patch, the PELT equation stays \sum (0..n) u_n * y^n whatever
the uarch and the current frequency. The main benefit is that we can
reach the max value whatever uarch and current freq. The impact of the
uarch and the current frequency is taken into account before the
equation when we are accounting the time into a segment.

>
> The changes made in the code of this patch are quite subtle, but very
> important as they change the behaviour of the PELT geometric series
> quite a lot. It is much more than just changing whether we scale waiting
> time and apply uarch scaling to running time of load_avg or not. I
> think we need to understand the math behind this patch to understand how
> the PELT metrics are affected because I think this patch changes some of
> the fundamentals originally described by Paul and Ben.

As explained above, the PELT equation in itself will be no more impacted
by freq and uarch as their impacts are taken into account outside.

>
> Instead of scaling the contribution of each 1024us segment like we
> currently do, this patch is essentially warping time and lumps it
> together and let it contribute fully but skips decays. It is rather hard
> to explain, but the result is that the patch affects both load_avg and
> util_avg, and it breaks scale-invariance.
>
> Executive summary: Scaling time (delta) instead of the individual
> segment contributions breaks scale-invariance. The net result on
> load_avg seems to be nothing apart from slower reaction time.
>
> That is how I see it after having tested it a bit. But I could be getting
> it all wrong. :-/

For me it's not slowing the reaction time but reflecting more accurately
the real behavior.
Let's take the example of a task with a computation that takes 10ms at
max capacity.
At max capacity, the job will run 10ms and the util value will be 199 as
well as the load_avg.
At half frequency, the job will run 20ms instead of 10ms.
With the current scale-invariance implementation, the util_avg value
will be 180 as well as the load_avg. The load_avg value would have been
360 if we remove all kind of scale-invariance as you proposed above.
With the proposed implementation, the util value will be 199 as well as
the load_avg because we will add the same number of segments.

The PELT implementation is about calculating the load/utilization of a
task/CPU. It uses the time to reflect the amount of work done by a task.
In a system that has the same fixed compute capacity per second for all
cpus, it's fine to only use the time. But when we have different compute
capacity across the system, we have to reflect this difference in the
time that is added to a segment.

>
>
> Much more detail:
>
> Original geometric series:
>
>	\sum (0..n) u_n * y^n
>
> Current geometric series with scale invariance:
>
>	\sum (0..n) u_n * c_n * y^n
>
> In reality we only approximate having the capacity scaling for each
> segment as we don't enforce PELT updates for each capacity change due to
> frequency scaling.
>
> In this patch scaling is applied to the entire delta since last update

we probably don't have the same meaning of the last update but that's
exactly the same as the current implementation

> instead of each individual segment. That gives us a very interesting
> time warping effect when updates happen less frequently than every 1ms.
> On cpus with reduced capacity the delta is reduced and all the math is
> done as if less time had passed since last update which introduces an
> error with regard to the decay of the series as we segments of time with
> zero contribution.

This happens because the compute capacity is lower and the actual work
done during this segment of time is lower too. The duration of the
computation will be longer and at the end, we will have the same amount
of segments of time for the job

>
> It is probably easier described with an example:
>
> We have one periodic task with a period of 4ms. Busy time per activation
> is 1ms at 100% capacity. The task has been running forever (>350ms) and

It's not clear to me why you want a task that was running forever
before the use case? Apart from starting at max value or more precisely
33% of max value when f=33%?

> we consider the load_avg calculations at enqueue/dequeue, which should
> be the most common update points for this scenario besides the tick
> updates.
>
> task states
>    s = sleeping
>    R = running (scheduled)
>
> pelt
>    d = decay segment (load_avg * y, y^32 = 0.5)
>    [0..1024] = segment contribution (including any scaling)
>    U = __update_load_avg() is called
>
> f = 100%
>         | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task    |   s    |   R    |   s    |   s    |   s    |   R    |
> pelt ml |   d    U  1024  U   d    |   d    |   d    U  1024  U
> patch   |   d    U  1024  U   d    |   d    |   d    U  1024  U
>
> f = 33%
>         | 1024us | 1024us | 1024us | 1024us | 1024us | 1024us |
> task    |   s    |   R    |   R    |   R    |   s    |   R    |
> pelt ml |   d    U 341y^2 |  341y  |  341   U   d    U 341y^2 |
> patch   |   d    U  1024  |   0    |   0    U   d    U  1024  |
>
> In the first case, f = 100%, at the update after the busy period is
> complete we decay load_avg by one period (segment) and add a
> contribution of 1024. We are at 100% so it is a full contribution for
> this segment both with and without this patch. The task enqueue update
> accounts for the sleeping time by decaying load_avg three periods. The
> same in both cases. We could say that the contributions of a full cycle
> of the task is:
>
>	f_100% cycle = 1024 + decay(4)
>
> If we reduce the capacity to 33%, things look a bit different. In
> mainline, the dequeue update after the busy period would decay three
> periods and add \sum (i = 2..0) 0.33*1024*y^i to account for the three
> busy segments. The enqueue update decays the load_avg by one segment.
> The full cycle contribution becomes:
>
> Mainline:
>	f_33% cycle = 341*y^2 + 341*y + 341 + decay(4)
>
> With this patch it is different. At the dequeue update we scale the time
> delta instead of the contribution, such that delta = 0.33*delta, so the
> calculation is based on only one period (segment) having passed. Hence we
> decay by one segment and add 1024, but still set the update point to the
> true timestamp so the following update doesn't take the two remaining
> segments into account. The enqueue update decays the load_avg by one
> segment, just like it does in mainline. The full cycle contribution
> becomes:
>
> Patch:
>	f_33% cycle = 1024 + decay(2)
>
> This is clearly different from mainline. Not only is the busy
> contribution higher, 1024 > 341*y^2 + 341*y + 341, since y < 1, but we
> also decay less. The result is an inflation of the load_avg and util_avg

So the busy contribution is exactly the same as fmax whereas it's not
the case with current implementation as you mentioned above.
But the number of "decay" is not the same whereas the current
implementation has the same number of decays. I have to look at how I can
improve the decay accuracy.

> metrics for tasks that run for more than 1ms at a time if
> __update_load_avg() isn't called every 1ms.
>
> I did a quick test to confirm this using a single periodic task and
> changing the compute capacity.
>
> util_avg
> capacity	mainline	patch
> 1024		~359		~352
> 512		~340		~534
>
> Execution time went from 1.4ms to 2.8ms per activation without
> overloading the cpu.

On the other hand, there are some use cases where the proposed util_avg
is more accurate. In fact, this mainly depends on which part, the decay
or the load, is preponderant in the value of util_avg/load_avg.
As soon as the running time is around 100ms, we "saturate" the load_avg
or the util_avg. So a task that runs 50ms every 150ms at max capacity
will be around 695 for both util_avg and load_avg, whereas it will be
around 470 for util_avg and 940 for load_avg at half capacity (due to
uarch) as the duration becomes 100ms. For this example, we have lost the
scale invariance with current implementation. With the proposed changes,
the util_avg and the load_avg would be 750.

>
> The fundamental idea in scale invariance is that util_avg should be
> comparable between cpu at any capacity as long as none of them are
> over-utilized. This isn't preserved by the patch in its current form.
>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>
> I'm not sure what you mean by always valid. util_avg is still not a
> meaningful metric for tasks running on over-utilized cpus, so it can not
> be used unconditionally. If util_avg > capacity we still have no clue if
> the task can fit on a different cpu with higher capacity.

That's one side goal of changing the way the scale invariance is taken
into account in util_avg. Being > current capacity can still be
meaningful

>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
>> 972 (95%)    138ms         not reachable            276ms
>> 486 (47.5%)  30ms          138ms                    60ms
>> 256 (25%)    13ms          32ms                     26ms
>
> I assume that these are numbers for util_avg and not load_avg as said in

It can be both. Half capacity can refer to frequency invariance or uarch
invariance

> the text above. It confuses me a little bit as you started out by
> talking about the lack of uarch scaling of load_avg and propose to
> change that, not util_avg.

The goal is to impact both util_avg and load_avg:
Being able to add the uarch in the calculation of the load_avg to
improve the fairness in presence of cpus with different capacity.
Being able to use the util_avg in a wider time scale.

>
> The equivalent table for load_avg would be something like this:
>
> load_avg (%) max capacity  half capacity(mainline)  half capacity(w/ patch)
> 972 (95%)    138ms         138ms                    276ms
> 486 (47.5%)  30ms          30ms                     60ms
> 256 (25%)    13ms          13ms                     26ms
>
> load_avg does reach max capacity as it is. The patch just makes it
> happen at a slower pace, which I'm not sure is a good or bad thing.
>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch whereas we have a non linear increase of the
>> duration with current implementation.
>
> Is it a problem that the time to reach a certain value is not linear?

This doesn't help in the scale-invariance

Thanks,
Vincent

>
> It is still somewhat unclear to me why we would want this change. Adding
> uarch scaling to load_avg but then modify the geometric series so the
> end result is the same except that it now reacts slower at lower
> capacities seems a bit strange.
>
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 28 +++++++++++++---------------
>>  1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>>  {
>> -	u64 delta, scaled_delta, periods;
>> +	u64 delta, periods;
>>  	u32 contrib;
>> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
>> -	unsigned long scale_freq, scale_cpu;
>> +	unsigned int delta_w, decayed = 0;
>>
>>  	delta = now - sa->last_update_time;
>>  	/*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  		return 0;
>>  	sa->last_update_time = now;
>>
>> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> +	if (running) {
>> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>
> This is where the time warping happens. delta is used to determine the
> number of periods (segments) since last update. Scaling this, as opposed
> to the contributions for each segment individually, can lead to
> disappearing segments.
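The 50ms-every-150ms example in Vincent's reply can be explored with the same kind of fixed-point model. This is a hedged sketch only: it treats one PELT segment as 1ms, samples the value at the end of each busy phase, ignores the kernel's integer arithmetic, and models the patch as "100ms of running at half capacity is accounted as 50ms". The function `peak` and its parameter names are invented for this illustration.

```python
Y = 0.5 ** (1.0 / 32.0)   # per-millisecond decay factor (1ms ~ one segment, assumption)


def peak(accounted_ms, decay_ms_per_cycle, scale=1.0):
    """Converged value at the end of each busy phase, in the 0..1024 range.

    accounted_ms: busy time added to the series per cycle
    decay_ms_per_cycle: total time the old sum is decayed by per cycle
    scale: per-segment contribution scaling (mainline util_avg style)
    """
    return 1024.0 * scale * (1.0 - Y ** accounted_ms) / (1.0 - Y ** decay_ms_per_cycle)


# 50ms of work every 150ms at max capacity.
at_max = peak(50, 150)
# Half capacity: the same work takes 100ms of the 150ms period.
mainline_load = peak(100, 150)             # load_avg, no uarch scaling
mainline_util = peak(100, 150, scale=0.5)  # util_avg, contributions halved
# Patch: 100ms of running is accounted as 50ms; the 50ms of sleep decays as usual.
patched = peak(50, 100)

if __name__ == "__main__":
    print(f"max capacity           : {at_max:.0f}")
    print(f"half, mainline load_avg: {mainline_load:.0f}")
    print(f"half, mainline util_avg: {mainline_util:.0f}")
    print(f"half, patched          : {patched:.0f}")
```

The results land in the same region as the figures quoted in the reply; they are not expected to match exactly, since the sample point and rounding differ from whatever was used for the mail. The model also shows the point both sides make: the patched value is closer to the max-capacity value than mainline util_avg is, yet it is still not identical to it because fewer decays are applied per cycle.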
On 14 December 2015 at 01:26, Yuyang Du <yuyang.du@intel.com> wrote:
> Hi Vincent,
>
> I don't quite catch what this is doing, maybe I need more time
> to ramp up to the gory detail difficult like this.
>
> Do you scale or not scale? You seem removed the scaling, but added it
> after "Remainder of delta accrued against u_0"..

I'm scaling the time before taking it in the pelt algorithm.
My reply to Morten's comment tries to explain more deeply what i'm
trying to achieve

Thanks,
Vincent

>
> Thanks,
> Yuyang
>
> On Tue, Nov 24, 2015 at 02:49:30PM +0100, Vincent Guittot wrote:
>> The current implementation of load tracking invariance scales the load
>> tracking value with current frequency and uarch performance (only for
>> utilization) of the CPU.
>>
>> One main result of the current formula is that the figures are capped by
>> the current capacity of the CPU. This limitation is the main reason of not
>> including the uarch invariance (arch_scale_cpu_capacity) in the calculation
>> of load_avg because capping the load can generate erroneous system load
>> statistic as described with this example [1]
>>
>> Instead of scaling the complete value of PELT algo, we should only scale
>> the running time by the current capacity of the CPU. It seems more correct
>> to only scale the running time because the non running time of a task
>> (sleeping or waiting for a runqueue) is the same whatever the current freq
>> and the compute capacity of the CPU.
>>
>> Then, one main advantage of this change is that the load of a task can
>> reach max value whatever the current freq and the uarch of the CPU on which
>> it run. It will just take more time at a lower freq than a max freq or on a
>> "little" CPU compared to a "big" one. The load and the utilization stay
>> invariant across system so we can still compared them between CPU but with
>> a wider range of values.
>>
>> With this change, we don't have to test if a CPU is overloaded or not in
>> order to use one metric (util) or another (load) as all metrics are always
>> valid.
>>
>> I have put below some examples of duration to reach some typical load value
>> according to the capacity of the CPU with current implementation
>> and with this patch.
>>
>> Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
>> 972 (95%)    138ms         not reachable            276ms
>> 486 (47.5%)  30ms          138ms                    60ms
>> 256 (25%)    13ms          32ms                     26ms
>>
>> We can see that at half capacity, we need twice the duration of max
>> capacity with this patch whereas we have a non linear increase of the
>> duration with current implementation.
>>
>> [1] https://lkml.org/lkml/2014/12/18/128
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 28 +++++++++++++---------------
>>  1 file changed, 13 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 824aa9f..f2a18e1 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2560,10 +2560,9 @@ static __always_inline int
>>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>>  {
>> -	u64 delta, scaled_delta, periods;
>> +	u64 delta, periods;
>>  	u32 contrib;
>> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
>> -	unsigned long scale_freq, scale_cpu;
>> +	unsigned int delta_w, decayed = 0;
>>
>>  	delta = now - sa->last_update_time;
>>  	/*
>> @@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  		return 0;
>>  	sa->last_update_time = now;
>>
>> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
>> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>> +	if (running) {
>> +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
>> +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
>> +	}
>>
>>  	/* delta_w is the amount already accumulated against our next period */
>>  	delta_w = sa->period_contrib;
>> @@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>  		 * period and accrue it.
>>  		 */
>>  		delta_w = 1024 - delta_w;
>> -		scaled_delta_w = cap_scale(delta_w, scale_freq);
>>  		if (weight) {
>> -			sa->load_sum += weight * scaled_delta_w;
>> +			sa->load_sum += weight * delta_w;
>>  			if (cfs_rq) {
>>  				cfs_rq->runnable_load_sum +=
>> -						weight * scaled_delta_w;
>> +						weight * delta_w;
>>  			}
>>  		}
>>  		if (running)
>> -			sa->util_sum += scaled_delta_w * scale_cpu;
>> +			sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
>>
>>  		delta -= delta_w;
>>
>> @@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>
>>  		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
>>  		contrib = __compute_runnable_contrib(periods);
>> -		contrib = cap_scale(contrib, scale_freq);
>>  		if (weight) {
>>  			sa->load_sum += weight * contrib;
>>  			if (cfs_rq)
>>  				cfs_rq->runnable_load_sum += weight * contrib;
>>  		}
>>  		if (running)
>> -			sa->util_sum += contrib * scale_cpu;
>> +			sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
>>  	}
>>
>>  	/* Remainder of delta accrued against u_0` */
>> -	scaled_delta = cap_scale(delta, scale_freq);
>>  	if (weight) {
>> -		sa->load_sum += weight * scaled_delta;
>> +		sa->load_sum += weight * delta;
>>  		if (cfs_rq)
>> -			cfs_rq->runnable_load_sum += weight * scaled_delta;
>> +			cfs_rq->runnable_load_sum += weight * delta;
>>  	}
>>  	if (running)
>> -		sa->util_sum += scaled_delta * scale_cpu;
>> +		sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
>>
>>  	sa->period_contrib += delta;
>>
>> --
>> 1.9.1
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 824aa9f..f2a18e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2560,10 +2560,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2584,8 +2583,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (running) {
+		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
+		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
+	}
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2601,16 +2602,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w << SCHED_CAPACITY_SHIFT;
 
 		delta -= delta_w;
 
@@ -2627,25 +2627,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta << SCHED_CAPACITY_SHIFT;
 
 	sa->period_contrib += delta;
 
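As a rough illustration of the second hunk of the diff, the standalone C
sketch below mirrors how the elapsed time is scaled before it is accounted.
It is not kernel code: pelt_delta() is a hypothetical helper written for this
note, cap_scale() is rewritten as a function with the same arithmetic as the
kernel macro, and the two scale factors are passed in as plain parameters
instead of being read from arch_scale_freq_capacity() and
arch_scale_cpu_capacity().

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* Same arithmetic as the kernel's cap_scale() macro: v * s / 1024. */
static inline uint64_t cap_scale(uint64_t v, uint64_t s)
{
	return (v * s) >> SCHED_CAPACITY_SHIFT;
}

/*
 * Hypothetical helper: the amount of time that would be fed into PELT
 * for an elapsed delta. Only running time is scaled; sleeping or
 * waiting time passes through unchanged.
 */
static uint64_t pelt_delta(uint64_t delta, int running,
			   uint64_t scale_freq, uint64_t scale_cpu)
{
	assert(scale_freq <= SCHED_CAPACITY_SCALE);
	assert(scale_cpu <= SCHED_CAPACITY_SCALE);

	if (running) {
		delta = cap_scale(delta, scale_freq);
		delta = cap_scale(delta, scale_cpu);
	}
	return delta;
}
```

With these definitions, a running task on a CPU at half frequency
(scale_freq = 512, scale_cpu = 1024) accrues half of the elapsed time, while
a task that is not running accrues all of it whatever the two factors are.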
The current implementation of load tracking invariance scales the load
tracking value with the current frequency and uarch performance (only for
utilization) of the CPU.

One main result of the current formula is that the figures are capped by
the current capacity of the CPU. This limitation is the main reason for not
including the uarch invariance (arch_scale_cpu_capacity) in the calculation
of load_avg, because capping the load can generate erroneous system load
statistics, as described in this example [1].

Instead of scaling the complete value of the PELT algorithm, we should only
scale the running time by the current capacity of the CPU. It seems more
correct to only scale the running time because the non-running time of a
task (sleeping or waiting for a runqueue) is the same whatever the current
freq and the compute capacity of the CPU.

Then, one main advantage of this change is that the load of a task can
reach the max value whatever the current freq and the uarch of the CPU on
which it runs. It will just take more time at a lower freq than at max
freq, or on a "little" CPU compared to a "big" one. The load and the
utilization stay invariant across the system, so we can still compare them
between CPUs but with a wider range of values.

With this change, we don't have to test whether a CPU is overloaded or not
in order to use one metric (util) or another (load), as all metrics are
always valid.

I have put below some examples of the duration needed to reach some typical
load values according to the capacity of the CPU, with the current
implementation and with this patch.

Util (%)     max capacity  half capacity (mainline)  half capacity (w/ patch)
972 (95%)    138ms         not reachable             276ms
486 (47.5%)  30ms          138ms                     60ms
256 (25%)    13ms          32ms                      26ms

We can see that at half capacity, we need twice the duration of max
capacity with this patch, whereas we have a non-linear increase of the
duration with the current implementation.

[1] https://lkml.org/lkml/2014/12/18/128

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

--
1.9.1
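
The durations quoted in the changelog can be approximated with a small
userspace model. This is only a sketch under simplifying assumptions, not the
kernel implementation: ms_to_reach() and PELT_Y are names made up for this
note, one PELT period is treated as 1 ms (it is 1024 us in the kernel),
floating point replaces the kernel's integer arithmetic, and the task is
assumed to run all the time starting from zero utilization. PELT_Y is the
per-period decay factor, chosen so that y^32 = 0.5.

```c
#include <assert.h>

/* Decay factor per ~1 ms PELT period, approximating 2^(-1/32). */
#define PELT_Y	0.978572062

/*
 * Hypothetical model: wall-clock milliseconds needed by an always-running
 * task to reach `target` utilization on a CPU whose current capacity is
 * `capacity` out of 1024. Returns -1 if the target is never reached.
 */
static int ms_to_reach(double target, unsigned int capacity, int patched)
{
	double util = 0.0;
	unsigned int pelt_clock = 0;	/* scaled time not yet accounted */
	int ms;

	assert(capacity > 0 && capacity <= 1024);

	for (ms = 1; ms <= 10000; ms++) {
		if (patched) {
			/* Patch: time is scaled, the contribution is not. */
			pelt_clock += capacity;
			while (pelt_clock >= 1024) {
				util = util * PELT_Y + (1.0 - PELT_Y) * 1024.0;
				pelt_clock -= 1024;
			}
		} else {
			/* Mainline: time is not scaled, the contribution is. */
			util = util * PELT_Y + (1.0 - PELT_Y) * capacity;
		}
		if (util >= target)
			return ms;
	}
	return -1;
}
```

Calling ms_to_reach(972, 1024, 0), ms_to_reach(972, 512, 0) and
ms_to_reach(972, 512, 1) gives 138, -1 (not reachable) and 276 in this model,
in line with the first row of the table. The 256 row comes out a millisecond
or two higher than the quoted figures because the model only counts whole
periods.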