Message ID: 20150908165331.GC27098@e105550-lin.cambridge.arm.com
State:      New
On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.

FWIW I count like:

objdump -d defconfig-build/kernel/sched/fair.o |
  awk '/<[^>]*>:/ { p=0 } /<update_blocked_averages>:/ { p=1 } { if (p) print $0 }' |
  cut -d\: -f2- | grep mul | wc -l
On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote: > On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote: > > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote: > > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote: > > > But if we apply the scaling to the weight instead of time, we would only > > > have to apply it once and not three times like it is now? So maybe we > > > can end up with almost the same number of multiplications. > > > > > > We might be loosing bits for low priority task running on cpus at a low > > > frequency though. > > > > Something like the below. We should be saving one multiplication. > > > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > return 0; > > sa->last_update_time = now; > > > > - scale_freq = arch_scale_freq_capacity(NULL, cpu); > > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > > + if (weight || running) > > + scale_freq = arch_scale_freq_capacity(NULL, cpu); > > + if (weight) > > + scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT; > > + if (running) > > + scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu) > > + >> SCHED_CAPACITY_SHIFT; > > > > /* delta_w is the amount already accumulated against our next period */ > > delta_w = sa->period_contrib; > > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > * period and accrue it. > > */ > > delta_w = 1024 - delta_w; > > - scaled_delta_w = cap_scale(delta_w, scale_freq); > > if (weight) { > > - sa->load_sum += weight * scaled_delta_w; > > + sa->load_sum += scaled_weight * delta_w; > > if (cfs_rq) { > > cfs_rq->runnable_load_sum += > > - weight * scaled_delta_w; > > + scaled_weight * delta_w; > > } > > } > > if (running) > > - sa->util_sum += scaled_delta_w * scale_cpu; > > + sa->util_sum += delta_w * scale_freq_cpu; > > > > delta -= delta_w; > > > > Sadly that makes the code worse; I get 14 mul instructions where > previously I had 11. > > What happens is that GCC gets confused and cannot constant propagate the > new variables, so what used to be shifts now end up being actual > multiplications. > > With this, I get back to 11. Can you see what happens on ARM where you > have both functions defined to non constants? We repeated the experiment on arm and arm64 but still with functions defined to constant to compare with your results. The mul instruction count seems to be somewhat compiler version dependent, but consistently show no effect of the patch: arm before after gcc4.9 12 12 gcc4.8 10 10 arm64 before after gcc4.9 11 11 I will get numbers with the arch-functions implemented as well and do hackbench runs to see what happens in terms of performance. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
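To make the precision argument behind this patch concrete, here is a minimal user-space sketch (not the kernel code; the delta, weight and scale_freq values are made up for illustration) comparing the two orders of scaling, i.e. frequency-scaling the small time delta versus frequency-scaling the weight once:

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

/* current ordering: scale the (small) time delta by frequency first */
static uint64_t contrib_scale_time(uint64_t delta, uint64_t weight,
                                   uint64_t scale_freq)
{
        uint64_t scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
        return weight * scaled_delta;
}

/* proposed ordering: scale the (large) weight once, keep delta intact */
static uint64_t contrib_scale_weight(uint64_t delta, uint64_t weight,
                                     uint64_t scale_freq)
{
        uint64_t scaled_weight = (weight * scale_freq) >> SCHED_CAPACITY_SHIFT;
        return scaled_weight * delta;
}

int main(void)
{
        /* 1-unit remainder of a period, nice-0 weight, cpu at half speed */
        uint64_t delta = 1, weight = 1024, scale_freq = 512;

        printf("scale delta first:  %llu\n",
               (unsigned long long)contrib_scale_time(delta, weight, scale_freq));
        printf("scale weight first: %llu\n",
               (unsigned long long)contrib_scale_weight(delta, weight, scale_freq));
        return 0;
}

With these inputs the delta-first variant contributes 0 to load_sum (scaling the 1-unit delta by 512/1024 rounds it down to zero), while the weight-first variant contributes 512, which is the bit loss the patch is trying to avoid.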
Hi Morten, On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote: > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote: > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote: > > > > > > Something like teh below.. > > > > > > Another thing to ponder; the downside of scaled_delta_w is that its > > > fairly likely delta is small and you loose all bits, whereas the weight > > > is likely to be large can could loose a fwe bits without issue. > > > > That issue applies both to load and util. > > > > > > > > That is, in fixed point scaling like this, you want to start with the > > > biggest numbers, not the smallest, otherwise you loose too much. > > > > > > The flip side is of course that now you can share a multiplcation. > > > > But if we apply the scaling to the weight instead of time, we would only > > have to apply it once and not three times like it is now? So maybe we > > can end up with almost the same number of multiplications. > > > > We might be loosing bits for low priority task running on cpus at a low > > frequency though. > > Something like the below. We should be saving one multiplication. > > --- 8< --- > > From: Morten Rasmussen <morten.rasmussen@arm.com> > Date: Tue, 8 Sep 2015 17:15:40 +0100 > Subject: [PATCH] sched/fair: Scale load/util contribution rather than time > > When updating load/util tracking the time delta might be very small (1) > in many cases, scaling it futher down with frequency and cpu invariance > scaling might cause us to loose precision. Instead of scaling time we > can scale the weight of the task for load and the capacity for > utilization. Both weight (>=15) and capacity should be significantly > bigger in most cases. Low priority tasks might still suffer a bit but > worst should be improved, as weight is at least 15 before invariance > scaling. > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> > --- > kernel/sched/fair.c | 38 +++++++++++++++++++------------------- > 1 file changed, 19 insertions(+), 19 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 9301291..d5ee72a 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n) > #error "load tracking assumes 2^10 as unit" > #endif > > -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) > - > /* > * We can represent the historical contribution to runnable average as the > * coefficients of a geometric series. 
To do this we sub-divide our runnable > @@ -2553,10 +2551,10 @@ static __always_inline int > __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > unsigned long weight, int running, struct cfs_rq *cfs_rq) > { > - u64 delta, scaled_delta, periods; > + u64 delta, periods; > u32 contrib; > - unsigned int delta_w, scaled_delta_w, decayed = 0; > - unsigned long scale_freq, scale_cpu; > + unsigned int delta_w, decayed = 0; > + unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0; > > delta = now - sa->last_update_time; > /* > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > return 0; > sa->last_update_time = now; > > - scale_freq = arch_scale_freq_capacity(NULL, cpu); > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > + if (weight || running) > + scale_freq = arch_scale_freq_capacity(NULL, cpu); > + if (weight) > + scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT; > + if (running) > + scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu) > + >> SCHED_CAPACITY_SHIFT; maybe below question is stupid :) Why not calculate the scaled_weight depend on cpu's capacity as well? So like: scaled_weight = weight * scale_freq_cpu. > /* delta_w is the amount already accumulated against our next period */ > delta_w = sa->period_contrib; > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > * period and accrue it. > */ > delta_w = 1024 - delta_w; > - scaled_delta_w = cap_scale(delta_w, scale_freq); > if (weight) { > - sa->load_sum += weight * scaled_delta_w; > + sa->load_sum += scaled_weight * delta_w; > if (cfs_rq) { > cfs_rq->runnable_load_sum += > - weight * scaled_delta_w; > + scaled_weight * delta_w; > } > } > if (running) > - sa->util_sum += scaled_delta_w * scale_cpu; > + sa->util_sum += delta_w * scale_freq_cpu; > > delta -= delta_w; > > @@ -2620,25 +2622,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > /* Efficiently calculate \sum (1..n_period) 1024*y^i */ > contrib = __compute_runnable_contrib(periods); > - contrib = cap_scale(contrib, scale_freq); > if (weight) { > - sa->load_sum += weight * contrib; > + sa->load_sum += scaled_weight * contrib; > if (cfs_rq) > - cfs_rq->runnable_load_sum += weight * contrib; > + cfs_rq->runnable_load_sum += scaled_weight * contrib; > } > if (running) > - sa->util_sum += contrib * scale_cpu; > + sa->util_sum += contrib * scale_freq_cpu; > } > > /* Remainder of delta accrued against u_0` */ > - scaled_delta = cap_scale(delta, scale_freq); > if (weight) { > - sa->load_sum += weight * scaled_delta; > + sa->load_sum += scaled_weight * delta; > if (cfs_rq) > - cfs_rq->runnable_load_sum += weight * scaled_delta; > + cfs_rq->runnable_load_sum += scaled_weight * delta; > } > if (running) > - sa->util_sum += scaled_delta * scale_cpu; > + sa->util_sum += delta * scale_freq_cpu; > > sa->period_contrib += delta; > > -- > 1.9.1 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote: > Hi Morten, > > On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote: > > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote: > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote: > > > > > > > > Something like teh below.. > > > > > > > > Another thing to ponder; the downside of scaled_delta_w is that its > > > > fairly likely delta is small and you loose all bits, whereas the weight > > > > is likely to be large can could loose a fwe bits without issue. > > > > > > That issue applies both to load and util. > > > > > > > > > > > That is, in fixed point scaling like this, you want to start with the > > > > biggest numbers, not the smallest, otherwise you loose too much. > > > > > > > > The flip side is of course that now you can share a multiplcation. > > > > > > But if we apply the scaling to the weight instead of time, we would only > > > have to apply it once and not three times like it is now? So maybe we > > > can end up with almost the same number of multiplications. > > > > > > We might be loosing bits for low priority task running on cpus at a low > > > frequency though. > > > > Something like the below. We should be saving one multiplication. > > > > --- 8< --- > > > > From: Morten Rasmussen <morten.rasmussen@arm.com> > > Date: Tue, 8 Sep 2015 17:15:40 +0100 > > Subject: [PATCH] sched/fair: Scale load/util contribution rather than time > > > > When updating load/util tracking the time delta might be very small (1) > > in many cases, scaling it futher down with frequency and cpu invariance > > scaling might cause us to loose precision. Instead of scaling time we > > can scale the weight of the task for load and the capacity for > > utilization. Both weight (>=15) and capacity should be significantly > > bigger in most cases. Low priority tasks might still suffer a bit but > > worst should be improved, as weight is at least 15 before invariance > > scaling. > > > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> > > --- > > kernel/sched/fair.c | 38 +++++++++++++++++++------------------- > > 1 file changed, 19 insertions(+), 19 deletions(-) > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > index 9301291..d5ee72a 100644 > > --- a/kernel/sched/fair.c > > +++ b/kernel/sched/fair.c > > @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n) > > #error "load tracking assumes 2^10 as unit" > > #endif > > > > -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) > > - > > /* > > * We can represent the historical contribution to runnable average as the > > * coefficients of a geometric series. 
To do this we sub-divide our runnable > > @@ -2553,10 +2551,10 @@ static __always_inline int > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > unsigned long weight, int running, struct cfs_rq *cfs_rq) > > { > > - u64 delta, scaled_delta, periods; > > + u64 delta, periods; > > u32 contrib; > > - unsigned int delta_w, scaled_delta_w, decayed = 0; > > - unsigned long scale_freq, scale_cpu; > > + unsigned int delta_w, decayed = 0; > > + unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0; > > > > delta = now - sa->last_update_time; > > /* > > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > return 0; > > sa->last_update_time = now; > > > > - scale_freq = arch_scale_freq_capacity(NULL, cpu); > > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > > + if (weight || running) > > + scale_freq = arch_scale_freq_capacity(NULL, cpu); > > + if (weight) > > + scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT; > > + if (running) > > + scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu) > > + >> SCHED_CAPACITY_SHIFT; > > maybe below question is stupid :) > > Why not calculate the scaled_weight depend on cpu's capacity as well? > So like: scaled_weight = weight * scale_freq_cpu. IMHO, we should not scale load by cpu capacity since load isn't really comparable to capacity. It is runnable time based (not running time like utilization) and the idea is to used it for balancing when when the system is fully utilized. When the system is fully utilized we can't say anything about the true compute demands of a task, it may get exactly the cpu time it needs or it may need much more. Hence it doesn't really make sense to scale the demand by the capacity of the cpu. Two busy loops on cpus with different cpu capacities should have the load as they have the same compute demands. I mentioned this briefly in the commit message of patch 3 in this series. Makes sense? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
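Morten's runnable-time versus running-time distinction can be illustrated with the converged (steady-state) values implied by the patch above. The sketch below is plain user-space arithmetic, not kernel code, and assumes a nice-0 busy loop that is runnable and running 100% of the time on two cpus of different capacity but the same (maximum) frequency scale:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

/* steady-state averages for a task that is always runnable and running,
 * following the scaling scheme in the patch under discussion */
static void converged(unsigned long weight, unsigned long scale_freq,
                      unsigned long scale_cpu)
{
        /* load: weight scaled by frequency only (runnable-time based) */
        unsigned long load_avg = (weight * scale_freq) >> SCHED_CAPACITY_SHIFT;

        /* util: scaled by frequency and cpu capacity (running-time based) */
        unsigned long util_avg =
                (scale_freq * scale_cpu) >> SCHED_CAPACITY_SHIFT;

        printf("capacity %4lu: load_avg %4lu  util_avg %4lu\n",
               scale_cpu, load_avg, util_avg);
}

int main(void)
{
        converged(1024, 1024, 1024);    /* big cpu */
        converged(1024, 1024, 512);     /* little cpu */
        return 0;
}

Both busy loops end up with load_avg 1024, while util_avg differs (1024 versus 512), matching the argument that load should not be scaled by cpu capacity but utilization should.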
On Fri, Sep 11, 2015 at 11:02:33AM +0100, Morten Rasmussen wrote: > On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote: > > Hi Morten, > > > > On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote: > > > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote: > > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote: > > > > > > > > > > Something like teh below.. > > > > > > > > > > Another thing to ponder; the downside of scaled_delta_w is that its > > > > > fairly likely delta is small and you loose all bits, whereas the weight > > > > > is likely to be large can could loose a fwe bits without issue. > > > > > > > > That issue applies both to load and util. > > > > > > > > > > > > > > That is, in fixed point scaling like this, you want to start with the > > > > > biggest numbers, not the smallest, otherwise you loose too much. > > > > > > > > > > The flip side is of course that now you can share a multiplcation. > > > > > > > > But if we apply the scaling to the weight instead of time, we would only > > > > have to apply it once and not three times like it is now? So maybe we > > > > can end up with almost the same number of multiplications. > > > > > > > > We might be loosing bits for low priority task running on cpus at a low > > > > frequency though. > > > > > > Something like the below. We should be saving one multiplication. > > > > > > --- 8< --- > > > > > > From: Morten Rasmussen <morten.rasmussen@arm.com> > > > Date: Tue, 8 Sep 2015 17:15:40 +0100 > > > Subject: [PATCH] sched/fair: Scale load/util contribution rather than time > > > > > > When updating load/util tracking the time delta might be very small (1) > > > in many cases, scaling it futher down with frequency and cpu invariance > > > scaling might cause us to loose precision. Instead of scaling time we > > > can scale the weight of the task for load and the capacity for > > > utilization. Both weight (>=15) and capacity should be significantly > > > bigger in most cases. Low priority tasks might still suffer a bit but > > > worst should be improved, as weight is at least 15 before invariance > > > scaling. > > > > > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com> > > > --- > > > kernel/sched/fair.c | 38 +++++++++++++++++++------------------- > > > 1 file changed, 19 insertions(+), 19 deletions(-) > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > > > index 9301291..d5ee72a 100644 > > > --- a/kernel/sched/fair.c > > > +++ b/kernel/sched/fair.c > > > @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n) > > > #error "load tracking assumes 2^10 as unit" > > > #endif > > > > > > -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) > > > - > > > /* > > > * We can represent the historical contribution to runnable average as the > > > * coefficients of a geometric series. 
To do this we sub-divide our runnable > > > @@ -2553,10 +2551,10 @@ static __always_inline int > > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > > unsigned long weight, int running, struct cfs_rq *cfs_rq) > > > { > > > - u64 delta, scaled_delta, periods; > > > + u64 delta, periods; > > > u32 contrib; > > > - unsigned int delta_w, scaled_delta_w, decayed = 0; > > > - unsigned long scale_freq, scale_cpu; > > > + unsigned int delta_w, decayed = 0; > > > + unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0; > > > > > > delta = now - sa->last_update_time; > > > /* > > > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa, > > > return 0; > > > sa->last_update_time = now; > > > > > > - scale_freq = arch_scale_freq_capacity(NULL, cpu); > > > - scale_cpu = arch_scale_cpu_capacity(NULL, cpu); > > > + if (weight || running) > > > + scale_freq = arch_scale_freq_capacity(NULL, cpu); > > > + if (weight) > > > + scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT; > > > + if (running) > > > + scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu) > > > + >> SCHED_CAPACITY_SHIFT; > > > > maybe below question is stupid :) > > > > Why not calculate the scaled_weight depend on cpu's capacity as well? > > So like: scaled_weight = weight * scale_freq_cpu. > > IMHO, we should not scale load by cpu capacity since load isn't really > comparable to capacity. It is runnable time based (not running time like > utilization) and the idea is to used it for balancing when when the > system is fully utilized. When the system is fully utilized we can't say > anything about the true compute demands of a task, it may get exactly > the cpu time it needs or it may need much more. Hence it doesn't really > make sense to scale the demand by the capacity of the cpu. Two busy > loops on cpus with different cpu capacities should have the load as they > have the same compute demands. > > I mentioned this briefly in the commit message of patch 3 in this > series. > > Makes sense? Yeah, after your reminding, i recognise load only includes runnable time on rq but not include running time. Thanks, Leo Yan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Sep 09, 2015 at 12:13:10PM +0100, Morten Rasmussen wrote:
> On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > Sadly that makes the code worse; I get 14 mul instructions where
> > previously I had 11.
> >
> > What happens is that GCC gets confused and cannot constant propagate the
> > new variables, so what used to be shifts now end up being actual
> > multiplications.
> >
> > With this, I get back to 11. Can you see what happens on ARM where you
> > have both functions defined to non constants?
>
> We repeated the experiment on arm and arm64 but still with functions
> defined to constant to compare with your results. The mul instruction
> count seems to be somewhat compiler version dependent, but consistently
> show no effect of the patch:
>
> arm      before  after
> gcc4.9     12      12
> gcc4.8     10      10
>
> arm64    before  after
> gcc4.9     11      11
>
> I will get numbers with the arch-functions implemented as well and do
> hackbench runs to see what happens in terms of performance.

I have done some runs with the proposed fixes added:

1. PeterZ's util_sum shift fix (change util_sum).
2. Morten's scaling of weight instead of time (reduce bit loss).
3. PeterZ's unconditional calls to arch*() functions (compiler opt).

To be clear: 2 includes 1, and 3 includes 1 and 2.

Runs were done with the default (#define) implementation of the
arch-functions and with the arch-specific implementation for ARM.

I realized that just looking for 'mul' instructions in
update_blocked_averages() is probably not a fair comparison on ARM, as
it turns out to contain quite a few multiply-accumulate instructions.
So I have included the total count including those too.

Test platforms:

ARM TC2 (A7x3 only)
  perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 200
  #mul:     grep -e mul (in update_blocked_averages())
  #mul_all: grep -e mul -e mla -e mls -e mia (in update_blocked_averages())
  gcc: 4.9.3

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
  perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 15000
  #mul:     grep -e mul (in update_blocked_averages())
  gcc: 4.9.2

Results: perf numbers are the average of three (x10) runs. Raw data is
available further down.

ARM TC2           #mul            #mul_all        perf bench
arch*()           default  arm    default  arm    default  arm
1 shift_fix         10      16      22      36    13.401   13.288
2 scaled_weight     12      14      30      32    13.282   13.238
3 unconditional     12      14      26      32    13.296   13.427

Intel E5-2690     #mul            #mul_all        perf bench
arch*()           default         default         default
1 shift_fix         13                            14.786
2 scaled_weight     18                            15.078
3 unconditional     14                            15.195

Overall it appears that fewer 'mul' instructions don't necessarily mean
a better perf bench score. For ARM, 2 seems the best choice overall,
while 1 is better for Intel. If we want to avoid the bit loss by
scaling weight instead of time, 2 is best for both.

However, all that said, looking at the raw numbers there is a
significant difference between runs of perf --repeat, so we can't
really draw any strong conclusions. It all appears to be in the noise.

I suggest that I spin a v2 of this series and go with scaled_weight to
reduce bit loss. Any objections?

While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
mess?
Raw numbers:

ARM TC2

shift_fix      default_arch  gcc4.9.3
#mul               10
#mul+mla+mls+mia   22
  13.384416727 seconds time elapsed ( +- 0.17% )
  13.431014702 seconds time elapsed ( +- 0.18% )
  13.387434890 seconds time elapsed ( +- 0.15% )

shift_fix      arm_arch      gcc4.9.3
#mul               16
#mul+mla+mls+mia   36
  13.271044081 seconds time elapsed ( +- 0.11% )
  13.310189123 seconds time elapsed ( +- 0.19% )
  13.283594740 seconds time elapsed ( +- 0.12% )

scaled_weight  default_arch  gcc4.9.3
#mul               12
#mul+mla+mls+mia   30
  13.295649553 seconds time elapsed ( +- 0.20% )
  13.271634654 seconds time elapsed ( +- 0.19% )
  13.280081329 seconds time elapsed ( +- 0.14% )

scaled_weight  arm_arch      gcc4.9.3
#mul               14
#mul+mla+mls+mia   32
  13.230659223 seconds time elapsed ( +- 0.15% )
  13.222276527 seconds time elapsed ( +- 0.15% )
  13.260275081 seconds time elapsed ( +- 0.21% )

unconditional  default_arch  gcc4.9.3
#mul               12
#mul+mla+mls+mia   26
  13.274904460 seconds time elapsed ( +- 0.13% )
  13.307853511 seconds time elapsed ( +- 0.15% )
  13.304084844 seconds time elapsed ( +- 0.22% )

unconditional  arm_arch      gcc4.9.3
#mul               14
#mul+mla+mls+mia   32
  13.432878577 seconds time elapsed ( +- 0.13% )
  13.417950552 seconds time elapsed ( +- 0.12% )
  13.431682719 seconds time elapsed ( +- 0.18% )

Intel

shift_fix      default_arch  gcc4.9.2
#mul               13
  14.905815416 seconds time elapsed ( +- 0.61% )
  14.811113694 seconds time elapsed ( +- 0.84% )
  14.639739309 seconds time elapsed ( +- 0.76% )

scaled_weight  default_arch  gcc4.9.2
#mul               18
  15.113275474 seconds time elapsed ( +- 0.64% )
  15.056777680 seconds time elapsed ( +- 0.44% )
  15.064074416 seconds time elapsed ( +- 0.71% )

unconditional  default_arch  gcc4.9.2
#mul               14
  15.105152500 seconds time elapsed ( +- 0.71% )
  15.346405473 seconds time elapsed ( +- 0.81% )
  15.132933523 seconds time elapsed ( +- 0.82% )
On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
> While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
> mess?

I suspect his patch will fail to compile on ARM which uses
SCHED_CAPACITY_* outside of kernel/sched/*.

But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
things I'm not opposed to it.
Yuyang Du <yuyang.du@intel.com> writes: > On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote: >> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote: >> >> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT >> > mess? >> >> I suspect his patch will fail to compile on ARM which uses >> SCHED_CAPACITY_* outside of kernel/sched/*. >> >> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these >> things I'm not opposed to it. > > Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h. > > With this, I think the codes still need some cleanup, and importantly > documentation. > > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high > or low resolution. So we have no reason to have low resolution (10bits) load_avg > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load, > as opposed to now we have load_avg = runnable% * scale_load_down(load). > > We get rid of all scale_load_down() for runnable load average? Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already going to give errors on 32-bit (even with the old code in fact). This should probably be fixed... somehow (dividing by 4 for load_sum on 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on 32-bit might have made sense but would be a weird difference between 32 and 64, and could break userspace anyway, so it's presumably too late for that). 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on 32-bit. > > -- > > Subject: [PATCH] sched/fair: Generalize the load/util averages resolution > definition > > The metric needs certain resolution to determine how much detail we > can look into (or not losing detail by integer rounding), which also > determines the range of the metrics. > > For instance, to increase the resolution of [0, 1] (two levels), one > can multiply 1024 and get [0, 1024] (1025 levels). > > In sched/fair, a few metrics depend on the resolution: load/load_avg, > util_avg, and capacity (frequency adjustment). In order to reduce the > risks to make mistakes relating to resolution/range, we therefore > generalize the resolution by defining a basic resolution constant > number, and then formalize all metrics by depending on the basic > resolution. The basic resolution is 1024 or (1 << 10). Further, one > can recursively apply the basic resolution to increase the final > resolution. > > Pointed out by Ben Segall, NICE_0's weight (visible to user) and load > have independent resolution, but they must be well calibrated. 
> > Signed-off-by: Yuyang Du <yuyang.du@intel.com> > --- > include/linux/sched.h | 9 ++++++--- > kernel/sched/fair.c | 4 ---- > kernel/sched/sched.h | 15 ++++++++++----- > 3 files changed, 16 insertions(+), 12 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index bd38b3e..9b86f79 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -909,10 +909,13 @@ enum cpu_idle_type { > CPU_MAX_IDLE_TYPES > }; > > +# define SCHED_RESOLUTION_SHIFT 10 > +# define SCHED_RESOLUTION_SCALE (1L << SCHED_RESOLUTION_SHIFT) > + > /* > * Increase resolution of cpu_capacity calculations > */ > -#define SCHED_CAPACITY_SHIFT 10 > +#define SCHED_CAPACITY_SHIFT SCHED_RESOLUTION_SHIFT > #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT) > > /* > @@ -1180,8 +1183,8 @@ struct load_weight { > * 1) load_avg factors frequency scaling into the amount of time that a > * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the > * aggregated such weights of all runnable and blocked sched_entities. > - * 2) util_avg factors frequency and cpu scaling into the amount of time > - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE]. > + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time > + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE]. > * For cfs_rq, it is the aggregated such times of all runnable and > * blocked sched_entities. > * The 64 bit load_sum can: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 4df37a4..c61fd8e 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n) > return contrib + runnable_avg_yN_sum[n]; > } > > -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10 > -#error "load tracking assumes 2^10 as unit" > -#endif > - > #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT) > > /* > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index 3845a71..31b4022 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { } > * increased costs. > */ > #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */ > -# define SCHED_LOAD_RESOLUTION 10 > -# define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) > -# define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION) > +# define SCHED_LOAD_SHIFT (SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT) > +# define scale_load(w) ((w) << SCHED_RESOLUTION_SHIFT) > +# define scale_load_down(w) ((w) >> SCHED_RESOLUTION_SHIFT) > #else > -# define SCHED_LOAD_RESOLUTION 0 > +# define SCHED_LOAD_SHIFT (SCHED_RESOLUTION_SHIFT) > # define scale_load(w) (w) > # define scale_load_down(w) (w) > #endif > > -#define SCHED_LOAD_SHIFT (10 + SCHED_LOAD_RESOLUTION) > #define SCHED_LOAD_SCALE (1L << SCHED_LOAD_SHIFT) > > +/* > + * NICE_0's weight (visible to user) and its load (invisible to user) have > + * independent resolution, but they should be well calibrated. We use scale_load() > + * and scale_load_down(w) to convert between them, the following must be true: > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD > + */ > #define NICE_0_LOAD SCHED_LOAD_SCALE > #define NICE_0_SHIFT SCHED_LOAD_SHIFT I still think tying the scale_load shift to be the same as the SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is worse. 
Honestly if I was going to change anything it would be to define NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT. However I'm not sure if calculate_imbalance's use of SCHED_LOAD_SCALE is actually a separate use of 1024*SLR-as-percentage or is basically assuming most tasks are nice-0 or what. It sure /looks/ like it's comparing values with different units - it's doing (nr_running * CONST - group_capacity) and comparing to load, so it looks like both (ie increasing load.weight of everything on your system by X% would change load balancer behavior here). Given that it might make sense to make it clear that capacity units and nice-0-task units have to be the same thing due to load balancer approximations (though they are still entirely separate from the SCHED_LOAD_RESOLUTION multiplier). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote: > > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high > > or low resolution. So we have no reason to have low resolution (10bits) load_avg > > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load, > > as opposed to now we have load_avg = runnable% * scale_load_down(load). > > > > We get rid of all scale_load_down() for runnable load average? > > Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a > 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already > going to give errors on 32-bit (even with the old code in fact). This > should probably be fixed... somehow (dividing by 4 for load_sum on > 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on > 32-bit might have made sense but would be a weird difference between 32 > and 64, and could break userspace anyway, so it's presumably too late > for that). > > 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on > 32-bit. > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load. So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits load resolution. This is ok, because struct load_weight's load is also unsigned long. If overflown, cfs_rq->load.weight will be overflown in the first place. However, after a second thought, this is not quite right. Because load_avg is not necessarily no greater than load, since load_avg has blocked load in it. Although, load_avg is still at the same level as load (converging to be <= load), we may not want the risk to overflow on 32bit. > > +/* > > + * NICE_0's weight (visible to user) and its load (invisible to user) have > > + * independent resolution, but they should be well calibrated. We use scale_load() > > + * and scale_load_down(w) to convert between them, the following must be true: > > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD > > + */ > > #define NICE_0_LOAD SCHED_LOAD_SCALE > > #define NICE_0_SHIFT SCHED_LOAD_SHIFT > > I still think tying the scale_load shift to be the same as the > SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is > worse. Honestly if I was going to change anything it would be to define > NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT. If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get nice-0's load, I don't understand why you want to separate them. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
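The bounds quoted above are easy to check directly. A small user-space sketch, taking LOAD_AVG_MAX = 47742, prio_to_weight[0] = 88761 and MAX_SHARES = 1 << 18 from the kernel being discussed (treat them as assumptions here):

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX   47742ULL        /* maximum decayed geometric sum */
#define NICE_M20_LOAD  88761ULL        /* prio_to_weight[0], nice -20 */
#define MAX_SHARES     (1ULL << 18)    /* cgroup cpu.shares upper bound */

int main(void)
{
        /* worst-case single-entity load_sum, as in Ben's numbers */
        printf("LOAD_AVG_MAX * prio_to_weight[0] = %llu (2^32 = %llu)\n",
               (unsigned long long)(LOAD_AVG_MAX * NICE_M20_LOAD),
               (unsigned long long)(1ULL << 32));
        printf("LOAD_AVG_MAX * MAX_SHARES        = %llu (overflows 32 bits)\n",
               (unsigned long long)(LOAD_AVG_MAX * MAX_SHARES));

        /* Yuyang's count: nice -20 tasks a 32-bit cfs_rq load_avg can hold
         * if the weight carries 10 extra resolution bits */
        printf("2^32 / prio_to_weight[0] / 1024  = %llu\n",
               (unsigned long long)((1ULL << 32) / NICE_M20_LOAD / 1024));
        return 0;
}

This prints 4237627662 (just under 2^32 = 4294967296), roughly 12.5 billion for the MAX_SHARES case, and 47 for the task count, matching the figures in the discussion.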
Yuyang Du <yuyang.du@intel.com> writes: > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote: >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load, >> > as opposed to now we have load_avg = runnable% * scale_load_down(load). >> > >> > We get rid of all scale_load_down() for runnable load average? >> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already >> going to give errors on 32-bit (even with the old code in fact). This >> should probably be fixed... somehow (dividing by 4 for load_sum on >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on >> 32-bit might have made sense but would be a weird difference between 32 >> and 64, and could break userspace anyway, so it's presumably too late >> for that). >> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on >> 32-bit. >> > > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load. > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits > load resolution. This is ok, because struct load_weight's load is also unsigned > long. If overflown, cfs_rq->load.weight will be overflown in the first place. > > However, after a second thought, this is not quite right. Because load_avg is not > necessarily no greater than load, since load_avg has blocked load in it. Although, > load_avg is still at the same level as load (converging to be <= load), we may not > want the risk to overflow on 32bit. Yeah, I missed that load_sum was u64 and only load_avg was long. This means we're fine on 32-bit with no SLR (or more precisely, cfs_rq runnable_load_avg can overflow, but only when cfs_rq load.weight does, so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37 tasks before load.weight overflows, and ~2^31 tasks before runnable_load_avg does, which is obviously fine (and in fact you hit PID_MAX_LIMIT and even if you had the cpu/ram/etc to not fall over). Now, applying SLR to runnable_load_avg would cut this down to ~2^21 tasks running at once or 2^20 with cgroups, which is technically allowed, though it seems utterly implausible (especially since this would have to all be on one cpu). If SLR was increased as peterz asked about, this could be an issue though. All that said, using SLR on load_sum/load_avg as opposed to cfs_rq runnable_load_avg would be fine, as they're limited to only one task/cgroup's weight. Having it SLRed and cfs_rq not would be a little odd, but not impossible. > >> > +/* >> > + * NICE_0's weight (visible to user) and its load (invisible to user) have >> > + * independent resolution, but they should be well calibrated. We use scale_load() >> > + * and scale_load_down(w) to convert between them, the following must be true: >> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD >> > + */ >> > #define NICE_0_LOAD SCHED_LOAD_SCALE >> > #define NICE_0_SHIFT SCHED_LOAD_SHIFT >> >> I still think tying the scale_load shift to be the same as the >> SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is >> worse. Honestly if I was going to change anything it would be to define >> NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT. 
>
> If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get
> nice-0's load, I don't understand why you want to separate them.

SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
used in precisely one place other than the newish util_avg, and as I
mentioned it's not remotely clear what compute_imbalance is doing there.)
On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote: > Yuyang Du <yuyang.du@intel.com> writes: > > > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote: > >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high > >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg > >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load, > >> > as opposed to now we have load_avg = runnable% * scale_load_down(load). > >> > > >> > We get rid of all scale_load_down() for runnable load average? > >> > >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a > >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already > >> going to give errors on 32-bit (even with the old code in fact). This > >> should probably be fixed... somehow (dividing by 4 for load_sum on > >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on > >> 32-bit might have made sense but would be a weird difference between 32 > >> and 64, and could break userspace anyway, so it's presumably too late > >> for that). > >> > >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on > >> 32-bit. > >> > > > > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load. > > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits > > load resolution. This is ok, because struct load_weight's load is also unsigned > > long. If overflown, cfs_rq->load.weight will be overflown in the first place. > > > > However, after a second thought, this is not quite right. Because load_avg is not > > necessarily no greater than load, since load_avg has blocked load in it. Although, > > load_avg is still at the same level as load (converging to be <= load), we may not > > want the risk to overflow on 32bit. This second thought made a mistake (what was wrong with me). load_avg is for sure no greater than load with or without blocked load. With that said, it really does not matter what the following numbers are, 32bit or 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry whether overflow or not, not the load_avg. It is as simple as that. With that, I think we can and should get rid of the scale_load_down() for load_avg. > Yeah, I missed that load_sum was u64 and only load_avg was long. This > means we're fine on 32-bit with no SLR (or more precisely, cfs_rq > runnable_load_avg can overflow, but only when cfs_rq load.weight does, > so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37 > tasks before load.weight overflows, and ~2^31 tasks before > runnable_load_avg does, which is obviously fine (and in fact you hit > PID_MAX_LIMIT and even if you had the cpu/ram/etc to not fall over). > > Now, applying SLR to runnable_load_avg would cut this down to ~2^21 > tasks running at once or 2^20 with cgroups, which is technically > allowed, though it seems utterly implausible (especially since this > would have to all be on one cpu). If SLR was increased as peterz asked > about, this could be an issue though. > > All that said, using SLR on load_sum/load_avg as opposed to cfs_rq > runnable_load_avg would be fine, as they're limited to only one > task/cgroup's weight. Having it SLRed and cfs_rq not would be a > little odd, but not impossible. 
> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get
> > nice-0's load, I don't understand why you want to separate them.
>
> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
> used in precisely one place other than the newish util_avg, and as I
> mentioned it's not remotely clear what compute_imbalance is doing there.)

Yes, it is not clear to me either.

With the above proposal to get rid of scale_load_down() for load_avg, I
think we can now remove SCHED_LOAD_*, rename scale_load() to
user_to_kernel_load(), and rename scale_load_down() to
kernel_to_user_load().

Hmm?
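For reference, a sketch of what the renaming floated here could look like, keeping the existing SCHED_LOAD_RESOLUTION mechanics; the user_to_kernel_load()/kernel_to_user_load() names are only the suggestion above, and the call sites in the comment are hypothetical, not merged code:

#if 0 /* BITS_PER_LONG > 32 -- currently disabled, as in mainline */
# define SCHED_LOAD_RESOLUTION          10
# define user_to_kernel_load(w)         ((w) << SCHED_LOAD_RESOLUTION)
# define kernel_to_user_load(w)         ((w) >> SCHED_LOAD_RESOLUTION)
#else
# define SCHED_LOAD_RESOLUTION          0
# define user_to_kernel_load(w)         (w)
# define kernel_to_user_load(w)         (w)
#endif

/*
 * The conversions would then only appear on the user/kernel boundary,
 * e.g. (hypothetical call sites):
 *
 *      se->load.weight = user_to_kernel_load(prio_to_weight[prio]);
 *      shares_shown_to_user = kernel_to_user_load(tg->shares);
 */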
Yuyang Du <yuyang.du@intel.com> writes: > On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote: >> Yuyang Du <yuyang.du@intel.com> writes: >> >> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote: >> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high >> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg >> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load, >> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load). >> >> > >> >> > We get rid of all scale_load_down() for runnable load average? >> >> >> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a >> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already >> >> going to give errors on 32-bit (even with the old code in fact). This >> >> should probably be fixed... somehow (dividing by 4 for load_sum on >> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on >> >> 32-bit might have made sense but would be a weird difference between 32 >> >> and 64, and could break userspace anyway, so it's presumably too late >> >> for that). >> >> >> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on >> >> 32-bit. >> >> >> > >> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load. >> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits >> > load resolution. This is ok, because struct load_weight's load is also unsigned >> > long. If overflown, cfs_rq->load.weight will be overflown in the first place. >> > >> > However, after a second thought, this is not quite right. Because load_avg is not >> > necessarily no greater than load, since load_avg has blocked load in it. Although, >> > load_avg is still at the same level as load (converging to be <= load), we may not >> > want the risk to overflow on 32bit. > > This second thought made a mistake (what was wrong with me). load_avg is for sure > no greater than load with or without blocked load. > > With that said, it really does not matter what the following numbers are, 32bit or > 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry > whether overflow or not, not the load_avg. It is as simple as that. > > With that, I think we can and should get rid of the scale_load_down() > for load_avg. load_avg yes is bounded by load.weight, but on 64-bit load_sum is only bounded by load.weight * LOAD_AVG_MAX and is the same size as load.weight (as I said below). There's still space for anything reasonable though with 10 bits of SLR. > >> Yeah, I missed that load_sum was u64 and only load_avg was long. This >> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq >> runnable_load_avg can overflow, but only when cfs_rq load.weight does, >> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37 >> tasks before load.weight overflows, and ~2^31 tasks before >> runnable_load_avg does, which is obviously fine (and in fact you hit >> PID_MAX_LIMIT and even if you had the cpu/ram/etc to not fall over). >> >> Now, applying SLR to runnable_load_avg would cut this down to ~2^21 >> tasks running at once or 2^20 with cgroups, which is technically >> allowed, though it seems utterly implausible (especially since this >> would have to all be on one cpu). If SLR was increased as peterz asked >> about, this could be an issue though. 
>> >> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq >> runnable_load_avg would be fine, as they're limited to only one >> task/cgroup's weight. Having it SLRed and cfs_rq not would be a >> little odd, but not impossible. > > >> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get >> > nice-0's load, I don't understand why you want to separate them. >> >> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to >> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is >> used in precisely one place other than the newish util_avg, and as I >> mentioned it's not remotely clear what compute_imbalance is doing theer) > > Yes, it is not clear to me either. > > With the above proposal to get rid of scale_load_down() for load_avg, so I think > now we can remove SCHED_LOAD_*, and rename scale_load() to user_to_kernel_load(), > and raname scale_load_down() to kernel_to_user_load(). > > Hmm? I have no opinion on renaming the scale_load functions, it's certainly reasonable, but the scale_load names seem fine too. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
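Ben's headroom estimate for a per-entity load_sum with 10 bits of extra resolution can be sanity-checked the same way; the constants are again assumptions carried over from the thread:

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX   47742ULL        /* maximum decayed geometric sum */
#define NICE_M20_LOAD  88761ULL        /* prio_to_weight[0], nice -20 */
#define SLR_SHIFT      10              /* extra load resolution bits */

int main(void)
{
        uint64_t max_load_sum = (NICE_M20_LOAD << SLR_SHIFT) * LOAD_AVG_MAX;
        int bits = 0;

        for (uint64_t v = max_load_sum; v; v >>= 1)
                bits++;

        printf("worst-case load_sum = %llu, needs %d of 64 bits\n",
               (unsigned long long)max_load_sum, bits);
        return 0;
}

Even with the extra 10 bits, a single entity's load_sum stays around 2^42, well inside a u64.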
On Wed, Sep 23, 2015 at 09:54:08AM -0700, bsegall@google.com wrote: > > This second thought made a mistake (what was wrong with me). load_avg is for sure > > no greater than load with or without blocked load. > > > > With that said, it really does not matter what the following numbers are, 32bit or > > 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry > > whether overflow or not, not the load_avg. It is as simple as that. > > > > With that, I think we can and should get rid of the scale_load_down() > > for load_avg. > > load_avg yes is bounded by load.weight, but on 64-bit load_sum is only > bounded by load.weight * LOAD_AVG_MAX and is the same size as > load.weight (as I said below). There's still space for anything > reasonable though with 10 bits of SLR. You are absolutely right. > >> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get > >> > nice-0's load, I don't understand why you want to separate them. > >> > >> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to > >> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is > >> used in precisely one place other than the newish util_avg, and as I > >> mentioned it's not remotely clear what compute_imbalance is doing theer) > > > > Yes, it is not clear to me either. > > > > With the above proposal to get rid of scale_load_down() for load_avg, so I think > > now we can remove SCHED_LOAD_*, and rename scale_load() to user_to_kernel_load(), > > and raname scale_load_down() to kernel_to_user_load(). > > > > Hmm? > > I have no opinion on renaming the scale_load functions, it's certainly > reasonable, but the scale_load names seem fine too. Without scale_load_down() in load_avg, it seems they are only used when reading/writing load between user and kernel. I will ponder more, but lets see whether others have opinion. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
> If SLR was increased as peterz asked
> about

Right, so I was under the impression that you (Google) run with it
increased, and in mainline it's currently dead code.

So if it's valuable to you guys, we should fix it in mainline.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9301291..d5ee72a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
 #error "load tracking assumes 2^10 as unit"
 #endif
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2553,10 +2551,10 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
                   unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-        u64 delta, scaled_delta, periods;
+        u64 delta, periods;
         u32 contrib;
-        unsigned int delta_w, scaled_delta_w, decayed = 0;
-        unsigned long scale_freq, scale_cpu;
+        unsigned int delta_w, decayed = 0;
+        unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
 
         delta = now - sa->last_update_time;
         /*
@@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
                 return 0;
         sa->last_update_time = now;
 
-        scale_freq = arch_scale_freq_capacity(NULL, cpu);
-        scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+        if (weight || running)
+                scale_freq = arch_scale_freq_capacity(NULL, cpu);
+        if (weight)
+                scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
+        if (running)
+                scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
+                                                >> SCHED_CAPACITY_SHIFT;
 
         /* delta_w is the amount already accumulated against our next period */
         delta_w = sa->period_contrib;
@@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
                  * period and accrue it.
                  */
                 delta_w = 1024 - delta_w;
-                scaled_delta_w = cap_scale(delta_w, scale_freq);
                 if (weight) {
-                        sa->load_sum += weight * scaled_delta_w;
+                        sa->load_sum += scaled_weight * delta_w;
                         if (cfs_rq) {
                                 cfs_rq->runnable_load_sum +=
-                                                weight * scaled_delta_w;
+                                                scaled_weight * delta_w;
                         }
                 }
                 if (running)
-                        sa->util_sum += scaled_delta_w * scale_cpu;
+                        sa->util_sum += delta_w * scale_freq_cpu;
 
                 delta -= delta_w;
 
@@ -2620,25 +2622,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
                 contrib = __compute_runnable_contrib(periods);
-                contrib = cap_scale(contrib, scale_freq);
                 if (weight) {
-                        sa->load_sum += weight * contrib;
+                        sa->load_sum += scaled_weight * contrib;
                         if (cfs_rq)
-                                cfs_rq->runnable_load_sum += weight * contrib;
+                                cfs_rq->runnable_load_sum += scaled_weight * contrib;
                 }
                 if (running)
-                        sa->util_sum += contrib * scale_cpu;
+                        sa->util_sum += contrib * scale_freq_cpu;
         }
 
         /* Remainder of delta accrued against u_0` */
-        scaled_delta = cap_scale(delta, scale_freq);
         if (weight) {
-                sa->load_sum += weight * scaled_delta;
+                sa->load_sum += scaled_weight * delta;
                 if (cfs_rq)
-                        cfs_rq->runnable_load_sum += weight * scaled_delta;
+                        cfs_rq->runnable_load_sum += scaled_weight * delta;
         }
         if (running)
-                sa->util_sum += scaled_delta * scale_cpu;
+                sa->util_sum += delta * scale_freq_cpu;
 
         sa->period_contrib += delta;