
[5/6] sched/fair: Get rid of scaling utilization by capacity_orig

Message ID 20150908165331.GC27098@e105550-lin.cambridge.arm.com
State New

Commit Message

Morten Rasmussen Sept. 8, 2015, 4:53 p.m. UTC
On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > 
> > Something like the below..
> > 
> > Another thing to ponder; the downside of scaled_delta_w is that it's
> > fairly likely delta is small and you lose all bits, whereas the weight
> > is likely to be large and could lose a few bits without issue.
> 
> That issue applies both to load and util.
> 
> > 
> > That is, in fixed-point scaling like this, you want to start with the
> > biggest numbers, not the smallest, otherwise you lose too much.
> > 
> > The flip side is of course that now you can share a multiplication.
> 
> But if we apply the scaling to the weight instead of time, we would only
> have to apply it once and not three times like it is now? So maybe we
> can end up with almost the same number of multiplications.
> 
> We might be losing bits for low priority tasks running on cpus at a low
> frequency though.

Something like the below. We should be saving one multiplication.

--- 8< ---

From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Tue, 8 Sep 2015 17:15:40 +0100
Subject: [PATCH] sched/fair: Scale load/util contribution rather than time

When updating load/util tracking the time delta might be very small (1)
in many cases; scaling it further down with frequency and cpu invariance
scaling might cause us to lose precision. Instead of scaling time we
can scale the weight of the task for load and the capacity for
utilization. Both weight (>=15) and capacity should be significantly
bigger in most cases. Low priority tasks might still suffer a bit, but
the worst case should be improved, as weight is at least 15 before
invariance scaling.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)
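
The precision argument in the commit message can be made concrete with a
small standalone sketch (illustrative only; the shift of 10, the half-speed
frequency factor of 512 and the nice-0 weight of 1024 are assumed example
values, not taken from the patch):

/*
 * Sketch of the bit-loss argument, not kernel code: scaling a small time
 * delta first can round it down to zero, while scaling the (much larger)
 * weight first keeps useful bits.
 */
#include <stdio.h>

#define SHIFT	10

int main(void)
{
	unsigned long delta = 1, weight = 1024, scale_freq = 512;

	/* old scheme: scale the delta, then multiply by the weight */
	unsigned long scaled_delta = (delta * scale_freq) >> SHIFT;	/* 0: all bits lost */
	unsigned long old_contrib = weight * scaled_delta;		/* 0 */

	/* proposed scheme: scale the weight, then multiply by the delta */
	unsigned long scaled_weight = (weight * scale_freq) >> SHIFT;	/* 512 */
	unsigned long new_contrib = scaled_weight * delta;		/* 512 */

	printf("old=%lu new=%lu\n", old_contrib, new_contrib);
	return 0;
}

With the old order the one-unit delta rounds to zero before the weight is
applied; scaling the weight first preserves the contribution.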

Comments

Peter Zijlstra Sept. 9, 2015, 9:45 a.m. UTC | #1
On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.

FWIW I count like:

objdump -d defconfig-build/kernel/sched/fair.o |
  awk '/<[^>]*>:/ { p=0 }
       /<update_blocked_averages>:/ { p=1 }
       { if (p) print $0 }' |
  cut -d\: -f2- | grep mul | wc -l
Morten Rasmussen Sept. 9, 2015, 11:13 a.m. UTC | #2
On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> 
> > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > But if we apply the scaling to the weight instead of time, we would only
> > > have to apply it once and not three times like it is now? So maybe we
> > > can end up with almost the same number of multiplications.
> > > 
> > > We might be losing bits for low priority tasks running on cpus at a low
> > > frequency though.
> > 
> > Something like the below. We should be saving one multiplication.
> 
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		return 0;
> >  	sa->last_update_time = now;
> >  
> > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	if (weight || running)
> > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	if (weight)
> > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > +	if (running)
> > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > +							>> SCHED_CAPACITY_SHIFT;
> >  
> >  	/* delta_w is the amount already accumulated against our next period */
> >  	delta_w = sa->period_contrib;
> > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		 * period and accrue it.
> >  		 */
> >  		delta_w = 1024 - delta_w;
> > -		scaled_delta_w = cap_scale(delta_w, scale_freq);
> >  		if (weight) {
> > -			sa->load_sum += weight * scaled_delta_w;
> > +			sa->load_sum += scaled_weight * delta_w;
> >  			if (cfs_rq) {
> >  				cfs_rq->runnable_load_sum +=
> > -						weight * scaled_delta_w;
> > +						scaled_weight * delta_w;
> >  			}
> >  		}
> >  		if (running)
> > -			sa->util_sum += scaled_delta_w * scale_cpu;
> > +			sa->util_sum += delta_w * scale_freq_cpu;
> >  
> >  		delta -= delta_w;
> >  
> 
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.
> 
> What happens is that GCC gets confused and cannot constant propagate the
> new variables, so what used to be shifts now end up being actual
> multiplications.
> 
> With this, I get back to 11. Can you see what happens on ARM where you
> have both functions defined to non constants?

We repeated the experiment on arm and arm64, but still with the functions
defined to constants to compare with your results. The mul instruction
count seems to be somewhat compiler-version dependent, but consistently
shows no effect of the patch:

arm	before	after
gcc4.9	12	12
gcc4.8	10	10

arm64	before	after
gcc4.9	11	11

I will get numbers with the arch-functions implemented as well and do
hackbench runs to see what happens in terms of performance.
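
Peter's constant-propagation point can be reduced to a standalone sketch
(hypothetical code: scale() stands in for the default, compile-time constant
arch_scale_freq_capacity(); the real code and instruction counts are in the
messages above):

/*
 * Reduced illustration, not the kernel code: when the scale factor is a
 * compile-time constant 1024, "(v * 1024) >> 10" folds away, but a value
 * that is 0 on one path and computed on another may survive as a real
 * multiply.
 */
#define SHIFT	10

static inline unsigned long scale(void)
{
	return 1UL << SHIFT;	/* stand-in for the default arch function */
}

unsigned long sum_old(unsigned long weight, unsigned int delta)
{
	/* (delta * 1024) >> 10 == delta: the compiler can drop the mul and shift */
	unsigned int scaled_delta = (delta * scale()) >> SHIFT;

	return weight * scaled_delta;
}

unsigned long sum_new(unsigned long weight, unsigned int delta)
{
	unsigned long scaled_weight = 0;

	/* "maybe 0, maybe scaled" can hide the constant from the optimizer */
	if (weight)
		scaled_weight = (weight * scale()) >> SHIFT;

	return scaled_weight * delta;
}
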
Leo Yan Sept. 11, 2015, 7:46 a.m. UTC | #3
Hi Morten,

On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > 
> > > Something like the below..
> > > 
> > > Another thing to ponder; the downside of scaled_delta_w is that it's
> > > fairly likely delta is small and you lose all bits, whereas the weight
> > > is likely to be large and could lose a few bits without issue.
> > 
> > That issue applies both to load and util.
> > 
> > > 
> > > That is, in fixed-point scaling like this, you want to start with the
> > > biggest numbers, not the smallest, otherwise you lose too much.
> > > 
> > > The flip side is of course that now you can share a multiplication.
> > 
> > But if we apply the scaling to the weight instead of time, we would only
> > have to apply it once and not three times like it is now? So maybe we
> > can end up with almost the same number of multiplications.
> > 
> > We might be losing bits for low priority tasks running on cpus at a low
> > frequency though.
> 
> Something like the below. We should be saving one multiplication.
> 
> --- 8< ---
> 
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Date: Tue, 8 Sep 2015 17:15:40 +0100
> Subject: [PATCH] sched/fair: Scale load/util contribution rather than time
> 
> When updating load/util tracking the time delta might be very small (1)
> in many cases; scaling it further down with frequency and cpu invariance
> scaling might cause us to lose precision. Instead of scaling time we
> can scale the weight of the task for load and the capacity for
> utilization. Both weight (>=15) and capacity should be significantly
> bigger in most cases. Low priority tasks might still suffer a bit, but
> the worst case should be improved, as weight is at least 15 before
> invariance scaling.
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9301291..d5ee72a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  #error "load tracking assumes 2^10 as unit"
>  #endif
>  
> -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> -
>  /*
>   * We can represent the historical contribution to runnable average as the
>   * coefficients of a geometric series.  To do this we sub-divide our runnable
> @@ -2553,10 +2551,10 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -	u64 delta, scaled_delta, periods;
> +	u64 delta, periods;
>  	u32 contrib;
> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> -	unsigned long scale_freq, scale_cpu;
> +	unsigned int delta_w, decayed = 0;
> +	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
>  
>  	delta = now - sa->last_update_time;
>  	/*
> @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		return 0;
>  	sa->last_update_time = now;
>  
> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	if (weight || running)
> +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +	if (weight)
> +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> +	if (running)
> +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> +							>> SCHED_CAPACITY_SHIFT;

Maybe the question below is stupid :)

Why not calculate the scaled_weight to depend on the cpu's capacity as well?
So like: scaled_weight = weight * scale_freq_cpu.

>  	/* delta_w is the amount already accumulated against our next period */
>  	delta_w = sa->period_contrib;
> @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		 * period and accrue it.
>  		 */
>  		delta_w = 1024 - delta_w;
> -		scaled_delta_w = cap_scale(delta_w, scale_freq);
>  		if (weight) {
> -			sa->load_sum += weight * scaled_delta_w;
> +			sa->load_sum += scaled_weight * delta_w;
>  			if (cfs_rq) {
>  				cfs_rq->runnable_load_sum +=
> -						weight * scaled_delta_w;
> +						scaled_weight * delta_w;
>  			}
>  		}
>  		if (running)
> -			sa->util_sum += scaled_delta_w * scale_cpu;
> +			sa->util_sum += delta_w * scale_freq_cpu;
>  
>  		delta -= delta_w;
>  
> @@ -2620,25 +2622,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  
>  		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
>  		contrib = __compute_runnable_contrib(periods);
> -		contrib = cap_scale(contrib, scale_freq);
>  		if (weight) {
> -			sa->load_sum += weight * contrib;
> +			sa->load_sum += scaled_weight * contrib;
>  			if (cfs_rq)
> -				cfs_rq->runnable_load_sum += weight * contrib;
> +				cfs_rq->runnable_load_sum += scaled_weight * contrib;
>  		}
>  		if (running)
> -			sa->util_sum += contrib * scale_cpu;
> +			sa->util_sum += contrib * scale_freq_cpu;
>  	}
>  
>  	/* Remainder of delta accrued against u_0` */
> -	scaled_delta = cap_scale(delta, scale_freq);
>  	if (weight) {
> -		sa->load_sum += weight * scaled_delta;
> +		sa->load_sum += scaled_weight * delta;
>  		if (cfs_rq)
> -			cfs_rq->runnable_load_sum += weight * scaled_delta;
> +			cfs_rq->runnable_load_sum += scaled_weight * delta;
>  	}
>  	if (running)
> -		sa->util_sum += scaled_delta * scale_cpu;
> +		sa->util_sum += delta * scale_freq_cpu;
>  
>  	sa->period_contrib += delta;
>  
> -- 
> 1.9.1
> 
Morten Rasmussen Sept. 11, 2015, 10:02 a.m. UTC | #4
On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote:
> Hi Morten,
> 
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > [...]
>
> Maybe the question below is stupid :)
> 
> Why not calculate the scaled_weight to depend on the cpu's capacity as well?
> So like: scaled_weight = weight * scale_freq_cpu.

IMHO, we should not scale load by cpu capacity since load isn't really
comparable to capacity. It is runnable time based (not running time like
utilization) and the idea is to use it for balancing when the system is
fully utilized. When the system is fully utilized we can't say anything
about the true compute demands of a task; it may get exactly the cpu
time it needs or it may need much more. Hence it doesn't really make
sense to scale the demand by the capacity of the cpu. Two busy loops on
cpus with different cpu capacities should have the same load as they
have the same compute demands.

I mentioned this briefly in the commit message of patch 3 in this
series.

Makes sense?
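
The distinction is visible in the patch itself: only the utilization path
picks up the cpu capacity factor. A condensed, self-contained restatement of
that difference (a sketch mirroring the patch, not kernel code):

/*
 * Load contributions are made frequency-invariant only, while utilization
 * contributions are additionally scaled by cpu capacity. Two always-running
 * tasks therefore accumulate the same load on big and little cpus, but
 * different utilization.
 */
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10

void accumulate(unsigned long weight, int running, unsigned int delta,
		unsigned long scale_freq, unsigned long scale_cpu,
		uint64_t *load_sum, uint64_t *util_sum)
{
	if (weight) {
		/* load: frequency scaling only, cpu capacity deliberately left out */
		unsigned long scaled_weight =
			weight * scale_freq >> SCHED_CAPACITY_SHIFT;

		*load_sum += (uint64_t)scaled_weight * delta;
	}
	if (running) {
		/* util: frequency and cpu capacity scaling */
		unsigned long scale_freq_cpu =
			scale_freq * scale_cpu >> SCHED_CAPACITY_SHIFT;

		*util_sum += (uint64_t)delta * scale_freq_cpu;
	}
}
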
Leo Yan Sept. 11, 2015, 2:11 p.m. UTC | #5
On Fri, Sep 11, 2015 at 11:02:33AM +0100, Morten Rasmussen wrote:
> On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote:
> > Hi Morten,
> > 
> > On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > > [...]
> > 
> > Maybe the question below is stupid :)
> > 
> > Why not calculate the scaled_weight to depend on the cpu's capacity as well?
> > So like: scaled_weight = weight * scale_freq_cpu.
> 
> IMHO, we should not scale load by cpu capacity since load isn't really
> comparable to capacity. It is runnable time based (not running time like
> utilization) and the idea is to use it for balancing when the system is
> fully utilized. When the system is fully utilized we can't say anything
> about the true compute demands of a task; it may get exactly the cpu
> time it needs or it may need much more. Hence it doesn't really make
> sense to scale the demand by the capacity of the cpu. Two busy loops on
> cpus with different cpu capacities should have the same load as they
> have the same compute demands.
> 
> I mentioned this briefly in the commit message of patch 3 in this
> series.
> 
> Makes sense?

Yeah, after your reminder, I recognise that load only includes runnable
time on the rq but does not include running time.

Thanks,
Leo Yan
Morten Rasmussen Sept. 11, 2015, 5:22 p.m. UTC | #6
On Wed, Sep 09, 2015 at 12:13:10PM +0100, Morten Rasmussen wrote:
> On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > Sadly that makes the code worse; I get 14 mul instructions where
> > previously I had 11.
> > 
> > What happens is that GCC gets confused and cannot constant propagate the
> > new variables, so what used to be shifts now end up being actual
> > multiplications.
> > 
> > With this, I get back to 11. Can you see what happens on ARM where you
> > have both functions defined to non constants?
> 
> We repeated the experiment on arm and arm64, but still with the functions
> defined to constants to compare with your results. The mul instruction
> count seems to be somewhat compiler-version dependent, but consistently
> shows no effect of the patch:
> 
> arm	before	after
> gcc4.9	12	12
> gcc4.8	10	10
> 
> arm64	before	after
> gcc4.9	11	11
> 
> I will get numbers with the arch-functions implemented as well and do
> hackbench runs to see what happens in terms of performance.

I have done some runs with the proposed fixes added:

1. PeterZ's util_sum shift fix (change util_sum).
2. Morten's scaling of weight instead of time (reduce bit loss).
3. PeterZ's unconditional calls to arch*() functions (compiler opt).

To be clear: 2 includes 1, and 3 includes 1 and 2.

Runs were done with the default (#define) implementation of the
arch-functions and with the arch-specific implementation for ARM.

I realized that just looking for 'mul' instructions in
update_blocked_averages() is probably not a fair comparison on ARM, as it
turns out to use quite a few multiply-accumulate instructions. So I have
included the total count including those too.


Test platforms:

ARM TC2 (A7x3 only)
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 200
#mul: grep -e mul (in update_blocked_averages())
#mul_all: grep -e mul -e mla -e mls -e mia (in update_blocked_averages())
gcc: 4.9.3

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 15000
#mul: grep -e mul (in update_blocked_averages())
gcc: 4.9.2


Results:

perf numbers are average of three (x10) runs. Raw data is available
further down.

ARM TC2		#mul		#mul_all	perf bench
arch*()		default	arm	default	arm	default	arm

1 shift_fix	10	16	22	36	13.401	13.288
2 scaled_weight	12	14	30	32	13.282	13.238
3 unconditional	12	14	26	32	13.296	13.427

Intel E5-2690	#mul		#mul_all	perf bench
arch*()		default		default		default

1 shift_fix	13				14.786
2 scaled_weight	18				15.078
3 unconditional	14				15.195


Overall it appears that fewer 'mul' instructions don't necessarily
mean a better perf bench score. For ARM, 2 seems the best choice
overall, while 1 is better for Intel. If we want to try to avoid the
bit loss by scaling weight instead of time, 2 is best for both.
However, all that said, looking at the raw numbers there is a
significant difference between runs of perf --repeat, so we can't
really draw any strong conclusions. It all appears to be in the noise.

I suggest that I spin a v2 of this series and go with scaled_weight to
reduce bit loss. Any objections?

While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
mess?


Raw numbers:

ARM TC2

shift_fix	default_arch
gcc4.9.3
#mul 10
#mul+mla+mls+mia 22
13.384416727 seconds time elapsed ( +-  0.17% )
13.431014702 seconds time elapsed ( +-  0.18% )
13.387434890 seconds time elapsed ( +-  0.15% )

shift_fix	arm_arch
gcc4.9.3
#mul 16
#mul+mla+mls+mia 36
13.271044081 seconds time elapsed ( +-  0.11% )
13.310189123 seconds time elapsed ( +-  0.19% )
13.283594740 seconds time elapsed ( +-  0.12% )

scaled_weight	default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 30
13.295649553 seconds time elapsed ( +-  0.20% )
13.271634654 seconds time elapsed ( +-  0.19% )
13.280081329 seconds time elapsed ( +-  0.14% )

scaled_weight	arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.230659223 seconds time elapsed ( +-  0.15% )
13.222276527 seconds time elapsed ( +-  0.15% )
13.260275081 seconds time elapsed ( +-  0.21% )

unconditional	default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 26
13.274904460 seconds time elapsed ( +-  0.13% )
13.307853511 seconds time elapsed ( +-  0.15% )
13.304084844 seconds time elapsed ( +-  0.22% )

unconditional	arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.432878577 seconds time elapsed ( +-  0.13% )
13.417950552 seconds time elapsed ( +-  0.12% )
13.431682719 seconds time elapsed ( +-  0.18% )


Intel

shift_fix	default_arch
gcc4.9.2
#mul 13
14.905815416 seconds time elapsed ( +-  0.61% )
14.811113694 seconds time elapsed ( +-  0.84% )
14.639739309 seconds time elapsed ( +-  0.76% )

scaled_weight	default_arch
gcc4.9.2
#mul 18
15.113275474 seconds time elapsed ( +-  0.64% )
15.056777680 seconds time elapsed ( +-  0.44% )
15.064074416 seconds time elapsed ( +-  0.71% )

unconditional	default_arch
gcc4.9.2
#mul 14
15.105152500 seconds time elapsed ( +-  0.71% )
15.346405473 seconds time elapsed ( +-  0.81% )
15.132933523 seconds time elapsed ( +-  0.82% )
Peter Zijlstra Sept. 17, 2015, 10:38 a.m. UTC | #7
On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:

> While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
> mess?

I suspect his patch will fail to compile on ARM which uses
SCHED_CAPACITY_* outside of kernel/sched/*.

But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
things I'm not opposed to it.
Benjamin Segall Sept. 21, 2015, 5:30 p.m. UTC | #8
Yuyang Du <yuyang.du@intel.com> writes:

> On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
>> 
>> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
>> > mess?
>> 
>> I suspect his patch will fail to compile on ARM which uses
>> SCHED_CAPACITY_* outside of kernel/sched/*.
>> 
>> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
>> things I'm not opposed to it.
>
> Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.
>
> With this, I think the code still needs some cleanup and, importantly,
> documentation.
>
> But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
> or low resolution. So we have no reason to have low resolution (10bits) load_avg
> when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
> as opposed to now we have load_avg = runnable% * scale_load_down(load).
>
> We get rid of all scale_load_down() for runnable load average?

Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
going to give errors on 32-bit (even with the old code in fact). This
should probably be fixed... somehow (dividing by 4 for load_sum on
32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
32-bit might have made sense but would be a weird difference between 32
and 64, and could break userspace anyway, so it's presumably too late
for that).

64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
32-bit.
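
The overflow arithmetic above can be checked in a few lines (the constants
are assumptions matching the kernel of the time: LOAD_AVG_MAX == 47742,
prio_to_weight[0] == 88761 and MAX_SHARES == 1 << 18):

#include <stdio.h>

int main(void)
{
	unsigned long long load_avg_max = 47742;	/* LOAD_AVG_MAX */
	unsigned long long w_nice_m20   = 88761;	/* prio_to_weight[0] */
	unsigned long long max_shares   = 1ULL << 18;	/* MAX_SHARES */

	printf("%llu\n", load_avg_max * w_nice_m20);	/* 4237627662: just below 2^32 */
	printf("%llu\n", load_avg_max * max_shares);	/* 12515278848: overflows 32 bits */
	printf("%llu\n", 1ULL << 32);			/* 4294967296 for comparison */
	return 0;
}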

>
> --
>
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>  definition
>
> The metric needs certain resolution to determine how much detail we
> can look into (or not losing detail by integer rounding), which also
> determines the range of the metrics.
>
> For instance, to increase the resolution of [0, 1] (two levels), one
> can multiply 1024 and get [0, 1024] (1025 levels).
>
> In sched/fair, a few metrics depend on the resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risk of making mistakes relating to resolution/range, we therefore
> generalize the resolution by defining a basic resolution constant
> number, and then formalize all metrics by depending on the basic
> resolution. The basic resolution is 1024 or (1 << 10). Further, one
> can recursively apply the basic resolution to increase the final
> resolution.
>
> Pointed out by Ben Segall, NICE_0's weight (visible to user) and load
> have independent resolution, but they must be well calibrated.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/sched.h |  9 ++++++---
>  kernel/sched/fair.c   |  4 ----
>  kernel/sched/sched.h  | 15 ++++++++++-----
>  3 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bd38b3e..9b86f79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -909,10 +909,13 @@ enum cpu_idle_type {
>  	CPU_MAX_IDLE_TYPES
>  };
>  
> +# define SCHED_RESOLUTION_SHIFT	10
> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> +
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> -#define SCHED_CAPACITY_SHIFT	10
> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
>  /*
> @@ -1180,8 +1183,8 @@ struct load_weight {
>   * 1) load_avg factors frequency scaling into the amount of time that a
>   * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
>   * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency and cpu scaling into the amount of time
> - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
> + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
>   * For cfs_rq, it is the aggregated such times of all runnable and
>   * blocked sched_entities.
>   * The 64 bit load_sum can:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..c61fd8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  	return contrib + runnable_avg_yN_sum[n];
>  }
>  
> -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> -#error "load tracking assumes 2^10 as unit"
> -#endif
> -
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3845a71..31b4022 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   * increased costs.
>   */
>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> -# define SCHED_LOAD_RESOLUTION	10
> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>  #else
> -# define SCHED_LOAD_RESOLUTION	0
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
>  #endif
>  
> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>  
> +/*
> + * NICE_0's weight (visible to user) and its load (invisible to user) have
> + * independent resolution, but they should be well calibrated. We use scale_load()
> + * and scale_load_down(w) to convert between them, the following must be true:
> + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> + */
>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT

I still think tying the scale_load shift to be the same as the
SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
worse. Honestly if I was going to change anything it would be to define
NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.

However I'm not sure if calculate_imbalance's use of SCHED_LOAD_SCALE is
actually a separate use of 1024*SLR-as-percentage or is basically
assuming most tasks are nice-0 or what. It sure /looks/ like it's
comparing values with different units - it's doing (nr_running * CONST -
group_capacity) and comparing to load, so it looks like both (ie
increasing load.weight of everything on your system by X% would change
load balancer behavior here).

Given that, it might make sense to make it clear that capacity units and
nice-0-task units have to be the same thing due to load balancer
approximations (though they are still entirely separate from the
SCHED_LOAD_RESOLUTION multiplier).
Yuyang Du Sept. 21, 2015, 11:39 p.m. UTC | #9
On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
> >
> > We get rid of all scale_load_down() for runnable load average?
> 
> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
> going to give errors on 32-bit (even with the old code in fact). This
> should probably be fixed... somehow (dividing by 4 for load_sum on
> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
> 32-bit might have made sense but would be a weird difference between 32
> and 64, and could break userspace anyway, so it's presumably too late
> for that).
> 
> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
> 32-bit.
> 

load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
load resolution. This is ok, because struct load_weight's load is also unsigned
long. If overflown, cfs_rq->load.weight will be overflown in the first place.

However, after a second thought, this is not quite right. Because load_avg is not
necessarily no greater than load, since load_avg has blocked load in it. Although,
load_avg is still at the same level as load (converging to be <= load), we may not
want the risk to overflow on 32bit.

> > +/*
> > + * NICE_0's weight (visible to user) and its load (invisible to user) have
> > + * independent resolution, but they should be well calibrated. We use scale_load()
> > + * and scale_load_down(w) to convert between them, the following must be true:
> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> > + */
> >  #define NICE_0_LOAD		SCHED_LOAD_SCALE
> >  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
> 
> I still think tying the scale_load shift to be the same as the
> SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
> worse. Honestly if I was going to change anything it would be to define
> NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.

If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
nice-0's load, I don't understand why you want to separate them.
Benjamin Segall Sept. 22, 2015, 5:18 p.m. UTC | #10
Yuyang Du <yuyang.du@intel.com> writes:

> On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
>> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
>> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
>> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
>> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
>> >
>> > We get rid of all scale_load_down() for runnable load average?
>> 
>> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> going to give errors on 32-bit (even with the old code in fact). This
>> should probably be fixed... somehow (dividing by 4 for load_sum on
>> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> 32-bit might have made sense but would be a weird difference between 32
>> and 64, and could break userspace anyway, so it's presumably too late
>> for that).
>> 
>> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> 32-bit.
>> 
>
> load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
> So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
> load resolution. This is ok, because struct load_weight's load is also unsigned
> long. If overflown, cfs_rq->load.weight will be overflown in the first place.
>
> However, after a second thought, this is not quite right. Because load_avg is not
> necessarily no greater than load, since load_avg has blocked load in it. Although,
> load_avg is still at the same level as load (converging to be <= load), we may not
> want the risk to overflow on 32bit.

Yeah, I missed that load_sum was u64 and only load_avg was long. This
means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
runnable_load_avg can overflow, but only when cfs_rq load.weight does,
so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
tasks before load.weight overflows, and ~2^31 tasks before
runnable_load_avg does, which is obviously fine (and in fact you'd hit
PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).

Now, applying SLR to runnable_load_avg would cut this down to ~2^21
tasks running at once or 2^20 with cgroups, which is technically
allowed, though it seems utterly implausible (especially since this
would have to all be on one cpu). If SLR was increased as peterz asked
about, this could be an issue though.

All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
runnable_load_avg would be fine, as they're limited to only one
task/cgroup's weight. Having it SLRed and cfs_rq not would be a
little odd, but not impossible.

>
>> > +/*
>> > + * NICE_0's weight (visible to user) and its load (invisible to user) have
>> > + * independent resolution, but they should be well calibrated. We use scale_load()
>> > + * and scale_load_down(w) to convert between them, the following must be true:
>> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
>> > + */
>> >  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>> >  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
>> 
>> I still think tying the scale_load shift to be the same as the
>> SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
>> worse. Honestly if I was going to change anything it would be to define
>> NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.
>
> If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> nice-0's load, I don't understand why you want to separate them.

SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
used in precisely one place other than the newish util_avg, and as I
mentioned it's not remotely clear what calculate_imbalance is doing there)
Yuyang Du Sept. 22, 2015, 11:22 p.m. UTC | #11
On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
> Yuyang Du <yuyang.du@intel.com> writes:
> 
> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
> >> >
> >> > We get rid of all scale_load_down() for runnable load average?
> >> 
> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
> >> going to give errors on 32-bit (even with the old code in fact). This
> >> should probably be fixed... somehow (dividing by 4 for load_sum on
> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
> >> 32-bit might have made sense but would be a weird difference between 32
> >> and 64, and could break userspace anyway, so it's presumably too late
> >> for that).
> >> 
> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
> >> 32-bit.
> >> 
> >
> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
> > load resolution. This is ok, because struct load_weight's load is also unsigned
> > long. If overflown, cfs_rq->load.weight will be overflown in the first place.
> >
> > However, after a second thought, this is not quite right. Because load_avg is not
> > necessarily no greater than load, since load_avg has blocked load in it. Although,
> > load_avg is still at the same level as load (converging to be <= load), we may not
> > want the risk to overflow on 32bit.
 
This second thought made a mistake (what was wrong with me). load_avg is for sure
no greater than load with or without blocked load.

With that said, it really does not matter what the following numbers are, 32bit or
64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
whether overflow or not, not the load_avg. It is as simple as that.

With that, I think we can and should get rid of the scale_load_down() for load_avg.

> Yeah, I missed that load_sum was u64 and only load_avg was long. This
> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
> tasks before load.weight overflows, and ~2^31 tasks before
> runnable_load_avg does, which is obviously fine (and in fact you'd hit
> PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).
> 
> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
> tasks running at once or 2^20 with cgroups, which is technically
> allowed, though it seems utterly implausible (especially since this
> would have to all be on one cpu). If SLR was increased as peterz asked
> about, this could be an issue though.
> 
> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
> runnable_load_avg would be fine, as they're limited to only one
> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
> little odd, but not impossible.
 

> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> > nice-0's load, I don't understand why you want to separate them.
> 
> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
> used in precisely one place other than the newish util_avg, and as I
> mentioned it's not remotely clear what calculate_imbalance is doing there)

Yes, it is not clear to me either.

With the above proposal to get rid of scale_load_down() for load_avg, I think
we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
and rename scale_load_down() to kernel_to_user_load().

Hmm?
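
For illustration, a sketch of what that renaming could look like, mirroring
the scale_load()/scale_load_down() definitions quoted earlier in the thread
(hypothetical; no such patch is posted here):

/* Only the names change; the conversions stay user weight <-> kernel load. */
#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */
# define user_to_kernel_load(w)	((w) << SCHED_RESOLUTION_SHIFT)
# define kernel_to_user_load(w)	((w) >> SCHED_RESOLUTION_SHIFT)
#else
# define user_to_kernel_load(w)	(w)
# define kernel_to_user_load(w)	(w)
#endif
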
Benjamin Segall Sept. 23, 2015, 4:54 p.m. UTC | #12
Yuyang Du <yuyang.du@intel.com> writes:

> On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
>> Yuyang Du <yuyang.du@intel.com> writes:
>> 
>> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
>> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
>> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
>> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
>> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
>> >> >
>> >> > We get rid of all scale_load_down() for runnable load average?
>> >> 
>> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> >> going to give errors on 32-bit (even with the old code in fact). This
>> >> should probably be fixed... somehow (dividing by 4 for load_sum on
>> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> >> 32-bit might have made sense but would be a weird difference between 32
>> >> and 64, and could break userspace anyway, so it's presumably too late
>> >> for that).
>> >> 
>> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> >> 32-bit.
>> >> 
>> >
>> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
>> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
>> > load resolution. This is ok, because struct load_weight's load is also unsigned
>> > long. If overflown, cfs_rq->load.weight will be overflown in the first place.
>> >
>> > However, after a second thought, this is not quite right. Because load_avg is not
>> > necessarily no greater than load, since load_avg has blocked load in it. Although,
>> > load_avg is still at the same level as load (converging to be <= load), we may not
>> > want the risk to overflow on 32bit.
>  
> This second thought made a mistake (what was wrong with me). load_avg is for sure
> no greater than load with or without blocked load.
>
> With that said, it really does not matter what the following numbers are, 32bit or
> 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
> whether overflow or not, not the load_avg. It is as simple as that.
>
> With that, I think we can and should get rid of the scale_load_down()
> for load_avg.

load_avg yes is bounded by load.weight, but on 64-bit load_sum is only
bounded by load.weight * LOAD_AVG_MAX and is the same size as
load.weight (as I said below). There's still space for anything
reasonable though with 10 bits of SLR.

>
>> Yeah, I missed that load_sum was u64 and only load_avg was long. This
>> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
>> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
>> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
>> tasks before load.weight overflows, and ~2^31 tasks before
>> runnable_load_avg does, which is obviously fine (and in fact you'd hit
>> PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).
>> 
>> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
>> tasks running at once or 2^20 with cgroups, which is technically
>> allowed, though it seems utterly implausible (especially since this
>> would have to all be on one cpu). If SLR was increased as peterz asked
>> about, this could be an issue though.
>> 
>> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
>> runnable_load_avg would be fine, as they're limited to only one
>> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
>> little odd, but not impossible.
>  
>
>> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
>> > nice-0's load, I don't understand why you want to separate them.
>> 
>> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
>> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
>> used in precisely one place other than the newish util_avg, and as I
>> mentioned it's not remotely clear what compute_imbalance is doing theer)
>
> Yes, it is not clear to me either.
>
> With the above proposal to get rid of scale_load_down() for load_avg, I think
> we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
> and rename scale_load_down() to kernel_to_user_load().
>
> Hmm?

I have no opinion on renaming the scale_load functions; it's certainly
reasonable, but the scale_load names seem fine too.
Yuyang Du Sept. 24, 2015, 12:22 a.m. UTC | #13
On Wed, Sep 23, 2015 at 09:54:08AM -0700, bsegall@google.com wrote:
> > This second thought made a mistake (what was wrong with me). load_avg is for sure
> > no greater than load with or without blocked load.
> >
> > With that said, it really does not matter what the following numbers are, 32bit or
> > 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
> > whether overflow or not, not the load_avg. It is as simple as that.
> >
> > With that, I think we can and should get rid of the scale_load_down()
> > for load_avg.
> 
> load_avg yes is bounded by load.weight, but on 64-bit load_sum is only
> bounded by load.weight * LOAD_AVG_MAX and is the same size as
> load.weight (as I said below). There's still space for anything
> reasonable though with 10 bits of SLR.
 
You are absolutely right.

> >> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> >> > nice-0's load, I don't understand why you want to separate them.
> >> 
> >> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
> >> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
> >> used in precisely one place other than the newish util_avg, and as I
> > mentioned it's not remotely clear what calculate_imbalance is doing there)
> >
> > Yes, it is not clear to me either.
> >
> > With the above proposal to get rid of scale_load_down() for load_avg, I think
> > we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
> > and rename scale_load_down() to kernel_to_user_load().
> >
> > Hmm?
> 
> I have no opinion on renaming the scale_load functions, it's certainly
> reasonable, but the scale_load names seem fine too.

Without scale_load_down() in load_avg, it seems they are only used when
reading/writing load between user space and the kernel. I will ponder this
more, but let's see whether others have an opinion.
Peter Zijlstra Sept. 30, 2015, 12:52 p.m. UTC | #14
On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
> If SLR was increased as peterz asked
> about

Right, so I was under the impression that you (Google) run with it
increased, and in mainline it's currently dead code.

So if it's valuable to you guys we should fix it in mainline.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9301291..d5ee72a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2519,8 +2519,6 @@  static u32 __compute_runnable_contrib(u64 n)
 #error "load tracking assumes 2^10 as unit"
 #endif
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2553,10 +2551,10 @@  static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
+	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2577,8 +2575,13 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (weight || running)
+		scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	if (weight)
+		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
+	if (running)
+		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
+							>> SCHED_CAPACITY_SHIFT;
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2594,16 +2597,15 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += scaled_weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						scaled_weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w * scale_freq_cpu;
 
 		delta -= delta_w;
 
@@ -2620,25 +2622,23 @@  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * contrib;
+			sa->load_sum += scaled_weight * contrib;
 			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
+				cfs_rq->runnable_load_sum += scaled_weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib * scale_freq_cpu;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += scaled_weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += scaled_weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta * scale_freq_cpu;
 
 	sa->period_contrib += delta;