diff mbox series

[v3] sched/fair: unlink misfit task from cpu overutilized

Message ID 20230113134056.257691-1-vincent.guittot@linaro.org
State Superseded
Headers show
Series [v3] sched/fair: unlink misfit task from cpu overutilized | expand

Commit Message

Vincent Guittot Jan. 13, 2023, 1:40 p.m. UTC
By taking into account uclamp_min, the 1:1 relation between task misfit
and cpu overutilized is no more true as a task with a small util_avg of
may not fit a high capacity cpu because of uclamp_min constraint.

Add a new state in util_fits_cpu() to reflect the case that task would fit
a CPU except for the uclamp_min hint which is a performance requirement.

Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
can use this new value to take additional action to select the best CPU
that doesn't match uclamp_min hint.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Change since v2:
- fix a condition in feec()
- add comments

 kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 83 insertions(+), 25 deletions(-)

Comments

Qais Yousef Jan. 15, 2023, 12:19 a.m. UTC | #1
On 01/13/23 14:40, Vincent Guittot wrote:
> By taking into account uclamp_min, the 1:1 relation between task misfit
> and cpu overutilized is no more true as a task with a small util_avg of
> may not fit a high capacity cpu because of uclamp_min constraint.
> 
> Add a new state in util_fits_cpu() to reflect the case that task would fit
> a CPU except for the uclamp_min hint which is a performance requirement.
> 
> Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> can use this new value to take additional action to select the best CPU
> that doesn't match uclamp_min hint.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> Change since v2:
> - fix a condition in feec()
> - add comments
> 
>  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 83 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e9d906a9bba9..29adb9e27b3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
>  	 *     2. The system is being saturated when we're operating near
>  	 *        max capacity, it doesn't make sense to block overutilized.
>  	 */
> -	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> -	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> +	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);

I think this hunk is what is causing the overutilized issues Kajetan reported
against your patch.

For the big cpu, this expression will always be true. So overutilized will
trigger only for little and medium cores.

I appreciate writing the boolean in a shorter form might appear like less code,
but it makes things harder to read too; the compiler should be good at
simplifying the expression if it can, no?

Shall we leave the original expression as-is since it's easier to reason about?

I think already by 'saving' using another variable we reduced readability and
lead to this error. First line checks if we are the max_capacity which is
the corner case we'd like to avoid and accidentally lost

v1 code was:

+	max_capacity = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
+	uclamp_max_fits = !max_capacity && (uclamp_max <= capacity_orig);
+	fits = fits || uclamp_max_fits;

I think that extra variable was well named to help make it obvious what corner
case we're trying to catch here. Especially it matches the comment above it
explaining this corner case. This auto variable should be free, no?

Can we go back to this form please?

>  	fits = fits || uclamp_max_fits;
>  
>  	/*
> @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
>  	 * handle the case uclamp_min > uclamp_max.
>  	 */
>  	uclamp_min = min(uclamp_min, uclamp_max);
> -	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> -		fits = fits && (uclamp_min <= capacity_orig_thermal);
> +	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> +		return -1;

Here shouldn't this be 

	if (util < uclamp_min) {
		fits = fits && (uclamp_min <= capacity_orig);
		if (fits && (uclamp_min > capacity_orig_thermal))
			return -1;
	}

uclamp_min should fit capacity_orig first then we'd check for the corner case
if thermal pressure is causing it not to fit. And all of this should only
matter if utill < uclamp_min. Otherwise util_avg should be driving force if
uclamp_max is not trumping it.

I need time to update my unit test [1] to catch these error cases as I didn't
see them. In the next version I think it's worth including the changes to
remove the capacity inversion in the patch.

[1] https://github.com/qais-yousef/uclamp_test/blob/master/uclamp_test_thermal_pressure.c


Thanks!

--
Qais Yousef

>  
>  	return fits;
>  }
> @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>  	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
>  	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
>  	unsigned long util = task_util_est(p);
> -	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> +	/*
> +	 * Return true only if the cpu fully fits the task requirements, which
> +	 * include the utilization but also the performance.
> +	 */
> +	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
>  }
>  
>  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
>  	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
>  	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>  
> +	/* Return true only if the utlization doesn't fit its capacity */
>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>  }
>  
> @@ -6925,6 +6929,7 @@ static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	unsigned long task_util, util_min, util_max, best_cap = 0;
> +	int fits, best_fits = 0;
>  	int cpu, best_cpu = -1;
>  	struct cpumask *cpus;
>  
> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  
>  		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>  			continue;
> -		if (util_fits_cpu(task_util, util_min, util_max, cpu))
> +
> +		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> +
> +		/* This CPU fits with all capacity and performance requirements */
> +		if (fits > 0)
>  			return cpu;
> +		/*
> +		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> +		 * for the CPU with highest performance capacity.
> +		 */
> +		else if (fits < 0)
> +			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>  
> -		if (cpu_cap > best_cap) {
> +		/*
> +		 * First, select cpu which fits better (-1 being better than 0).
> +		 * Then, select the one with largest capacity at same level.
> +		 */
> +		if ((fits < best_fits) ||
> +		    ((fits == best_fits) && (cpu_cap > best_cap))) {
>  			best_cap = cpu_cap;
>  			best_cpu = cpu;
> +			best_fits = fits;
>  		}
>  	}
>  
> @@ -6958,7 +6979,11 @@ static inline bool asym_fits_cpu(unsigned long util,
>  				 int cpu)
>  {
>  	if (sched_asym_cpucap_active())
> -		return util_fits_cpu(util, util_min, util_max, cpu);
> +		/*
> +		 * Return true only if the cpu fully fits the task requirements
> +		 * which include the utilization but also the performance.
> +		 */
> +		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
>  
>  	return true;
>  }
> @@ -7325,6 +7350,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
>  	struct root_domain *rd = this_rq()->rd;
>  	int cpu, best_energy_cpu, target = -1;
> +	int prev_fits = -1, best_fits = -1;
> +	unsigned long best_thermal_cap = 0;
> +	unsigned long prev_thermal_cap = 0;
>  	struct sched_domain *sd;
>  	struct perf_domain *pd;
>  	struct energy_env eenv;
> @@ -7360,6 +7388,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  		unsigned long prev_spare_cap = 0;
>  		int max_spare_cap_cpu = -1;
>  		unsigned long base_energy;
> +		int fits, max_fits = -1;
>  
>  		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
>  
> @@ -7412,7 +7441,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  					util_max = max(rq_util_max, p_util_max);
>  				}
>  			}
> -			if (!util_fits_cpu(util, util_min, util_max, cpu))
> +
> +			fits = util_fits_cpu(util, util_min, util_max, cpu);
> +			if (!fits)
>  				continue;
>  
>  			lsub_positive(&cpu_cap, util);
> @@ -7420,7 +7451,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (cpu == prev_cpu) {
>  				/* Always use prev_cpu as a candidate. */
>  				prev_spare_cap = cpu_cap;
> -			} else if (cpu_cap > max_spare_cap) {
> +				prev_fits = fits;
> +			} else if ((fits > max_fits) ||
> +				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
>  				/*
>  				 * Find the CPU with the maximum spare capacity
>  				 * among the remaining CPUs in the performance
> @@ -7428,6 +7461,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  				 */
>  				max_spare_cap = cpu_cap;
>  				max_spare_cap_cpu = cpu;
> +				max_fits = fits;
>  			}
>  		}
>  
> @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (prev_delta < base_energy)
>  				goto unlock;
>  			prev_delta -= base_energy;
> +			prev_thermal_cap = cpu_thermal_cap;
>  			best_delta = min(best_delta, prev_delta);
>  		}
>  
>  		/* Evaluate the energy impact of using max_spare_cap_cpu. */
>  		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> +			/* Current best energy cpu fits better */
> +			if (max_fits < best_fits)
> +				continue;
> +
> +			/*
> +			 * Both don't fit performance (i.e. uclamp_min) but
> +			 * best energy cpu has better performance.
> +			 */
> +			if ((max_fits < 0) &&
> +			    (cpu_thermal_cap <= best_thermal_cap))
> +				continue;
> +
>  			cur_delta = compute_energy(&eenv, pd, cpus, p,
>  						   max_spare_cap_cpu);
>  			/* CPU utilization has changed */
>  			if (cur_delta < base_energy)
>  				goto unlock;
>  			cur_delta -= base_energy;
> -			if (cur_delta < best_delta) {
> -				best_delta = cur_delta;
> -				best_energy_cpu = max_spare_cap_cpu;
> -			}
> +
> +			/*
> +			 * Both fit for the task but best energy cpu has lower
> +			 * energy impact.
> +			 */
> +			if ((max_fits > 0) && (best_fits > 0) &&
> +			    (cur_delta >= best_delta))
> +				continue;
> +
> +			best_delta = cur_delta;
> +			best_energy_cpu = max_spare_cap_cpu;
> +			best_fits = max_fits;
> +			best_thermal_cap = cpu_thermal_cap;
>  		}
>  	}
>  	rcu_read_unlock();
>  
> -	if (best_delta < prev_delta)
> +	if ((best_fits > prev_fits) ||
> +	    ((best_fits > 0) && (best_delta < prev_delta)) ||
> +	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
>  		target = best_energy_cpu;
>  
>  	return target;
> @@ -10259,24 +10317,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 */
>  	update_sd_lb_stats(env, &sds);
>  
> -	if (sched_energy_enabled()) {
> -		struct root_domain *rd = env->dst_rq->rd;
> -
> -		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> -			goto out_balanced;
> -	}
> -
> -	local = &sds.local_stat;
> -	busiest = &sds.busiest_stat;
> -
>  	/* There is no busy sibling group to pull tasks from */
>  	if (!sds.busiest)
>  		goto out_balanced;
>  
> +	busiest = &sds.busiest_stat;
> +
>  	/* Misfit tasks should be dealt with regardless of the avg load */
>  	if (busiest->group_type == group_misfit_task)
>  		goto force_balance;
>  
> +	if (sched_energy_enabled()) {
> +		struct root_domain *rd = env->dst_rq->rd;
> +
> +		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> +			goto out_balanced;
> +	}
> +
>  	/* ASYM feature bypasses nice load balance check */
>  	if (busiest->group_type == group_asym_packing)
>  		goto force_balance;
> @@ -10289,6 +10346,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (busiest->group_type == group_imbalanced)
>  		goto force_balance;
>  
> +	local = &sds.local_stat;
>  	/*
>  	 * If the local group is busier than the selected busiest group
>  	 * don't try and pull any tasks.
> -- 
> 2.34.1
>
Dietmar Eggemann Jan. 16, 2023, 9:07 a.m. UTC | #2
On 13/01/2023 14:40, Vincent Guittot wrote:
> By taking into account uclamp_min, the 1:1 relation between task misfit
> and cpu overutilized is no more true as a task with a small util_avg of

of ?

> may not fit a high capacity cpu because of uclamp_min constraint.
> 
> Add a new state in util_fits_cpu() to reflect the case that task would fit
> a CPU except for the uclamp_min hint which is a performance requirement.
> 
> Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> can use this new value to take additional action to select the best CPU
> that doesn't match uclamp_min hint.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> Change since v2:
> - fix a condition in feec()
> - add comments
> 
>  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 83 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e9d906a9bba9..29adb9e27b3d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
>  	 *     2. The system is being saturated when we're operating near
>  	 *        max capacity, it doesn't make sense to block overutilized.
>  	 */
> -	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> -	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> +	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);

Isn't `uclamp_max <= capacity_orig` always true for `capacity_orig ==
SCHED_CAPACITY_SCALE`?

uclamp_max = [0..SCHED_CAPACITY_SCALE] , capacity_orig =
SCHED_CAPACITY_SCALE

>  	fits = fits || uclamp_max_fits;

Like Qais I don't understand this change. I read the previous discussion
from https://lkml.kernel.org/r/20221208140526.vvmjxlz6akgqyoma@airbuntu
("Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect
capacity inversion").

I assume Qais wanted to force `uclamp_max_fits = 0` for a big CPU
(`capacity_orig = 1024`) and a `uclamp_max = 1024` to not lift `fits`
from 0 to 1.

>  	/*
> @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
>  	 * handle the case uclamp_min > uclamp_max.
>  	 */
>  	uclamp_min = min(uclamp_min, uclamp_max);
> -	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> -		fits = fits && (uclamp_min <= capacity_orig_thermal);
> +	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> +		return -1;
>  	return fits;
>  }
> @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
>  	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
>  	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
>  	unsigned long util = task_util_est(p);
> -	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> +	/*
> +	 * Return true only if the cpu fully fits the task requirements, which
> +	 * include the utilization but also the performance.

Not sure if people get what `performance requirements` mean here? I know
we want to use `performance` rather `bandwidth hint` to describe what
uclamp is. So shouldn't we use `utilization but also uclamp`?

> +	 */
> +	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
>  }
>  
>  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
>  	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
>  	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>  
> +	/* Return true only if the utlization doesn't fit its capacity */

s/utlization/utilization
s/its/CPU ?

>  	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>  }

cpu_overutilized() is the only place where we now only test for
!util_fits_cpu(). The new comment says we only care about utilization
not fitting CPU capacity.

Does this mean the rq uclamp values are not important here and we could
go back to use fits_capacity()?

Not sure since util_fits_cpu() is still coded differently:

  fits = fits_capacity(util, capacity)

  fits = fits || uclamp_max_fits  <-- uclamp_max_fits can turn fits from
                                      0 into 1, so util doesn't fit but
                                      we don't return 1?

That said, I don't understand the current 'uclamp_max_fits = (uclamp_max
<= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE)' in
util_fits_cpu(), like already mentioned.

> @@ -6925,6 +6929,7 @@ static int
>  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  {
>  	unsigned long task_util, util_min, util_max, best_cap = 0;
> +	int fits, best_fits = 0;
>  	int cpu, best_cpu = -1;
>  	struct cpumask *cpus;
>  
> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>  
>  		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>  			continue;
> -		if (util_fits_cpu(task_util, util_min, util_max, cpu))
> +
> +		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> +
> +		/* This CPU fits with all capacity and performance requirements */

In task_fits_cpu() `utilization and performance (better uclamp)
requirements` term was used. I assume it's the same thing here?

> +		if (fits > 0)
>  			return cpu;
> +		/*
> +		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> +		 * for the CPU with highest performance capacity.
                                            ^^^^^^^^^^^^^^^^^^^^

Do we use a new CPU capacity value `performance capacity (1)` here?

Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.

I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
= capacity_orig - arch_scale_thermal_pressure()` when checking whether
to return -1. Shouldn't (1) and (2) be the same?

> +		 */
> +		else if (fits < 0)
> +			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
>  
> -		if (cpu_cap > best_cap) {
> +		/*
> +		 * First, select cpu which fits better (-1 being better than 0).
> +		 * Then, select the one with largest capacity at same level.
> +		 */
> +		if ((fits < best_fits) ||
> +		    ((fits == best_fits) && (cpu_cap > best_cap))) {
>  			best_cap = cpu_cap;
>  			best_cpu = cpu;
> +			best_fits = fits;
>  		}
>  	}
>  

[...]

> @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  			if (prev_delta < base_energy)
>  				goto unlock;
>  			prev_delta -= base_energy;
> +			prev_thermal_cap = cpu_thermal_cap;
>  			best_delta = min(best_delta, prev_delta);
>  		}
>  
>  		/* Evaluate the energy impact of using max_spare_cap_cpu. */
>  		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> +			/* Current best energy cpu fits better */
> +			if (max_fits < best_fits)
> +				continue;
> +
> +			/*
> +			 * Both don't fit performance (i.e. uclamp_min) but
> +			 * best energy cpu has better performance.

IMHO, `performance` stands for `cpu_thermal_cap` which is
`cpu_capacity_orig - thermal pressure`.

I assume `performance` is equal to `performance capacity` used in
select_idle_capacity which uses thermal_load_avg() and not thermal
pressure. Why this difference?

[...]
Vincent Guittot Jan. 16, 2023, 11:08 a.m. UTC | #3
On Sun, 15 Jan 2023 at 01:19, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/13/23 14:40, Vincent Guittot wrote:
> > By taking into account uclamp_min, the 1:1 relation between task misfit
> > and cpu overutilized is no more true as a task with a small util_avg of
> > may not fit a high capacity cpu because of uclamp_min constraint.
> >
> > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > a CPU except for the uclamp_min hint which is a performance requirement.
> >
> > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > can use this new value to take additional action to select the best CPU
> > that doesn't match uclamp_min hint.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > Change since v2:
> > - fix a condition in feec()
> > - add comments
> >
> >  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 83 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e9d906a9bba9..29adb9e27b3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
> >        *     2. The system is being saturated when we're operating near
> >        *        max capacity, it doesn't make sense to block overutilized.
> >        */
> > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
>
> I think this hunk is what is causing the overutilized issues Kajetan reported
> against your patch.

Yeah, I have been to agressive with uclamp_max

>
> For the big cpu, this expression will always be true. So overutilized will
> trigger only for little and medium cores.
>
> I appreciate writing the boolean in a shorter form might appear like less code,
> but it makes things harder to read too; the compiler should be good at
> simplifying the expression if it can, no?
>
> Shall we leave the original expression as-is since it's easier to reason about?
>
> I think already by 'saving' using another variable we reduced readability and
> lead to this error. First line checks if we are the max_capacity which is
> the corner case we'd like to avoid and accidentally lost
>
> v1 code was:
>
> +       max_capacity = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> +       uclamp_max_fits = !max_capacity && (uclamp_max <= capacity_orig);
> +       fits = fits || uclamp_max_fits;
>
> I think that extra variable was well named to help make it obvious what corner
> case we're trying to catch here. Especially it matches the comment above it
> explaining this corner case. This auto variable should be free, no?
>
> Can we go back to this form please?

Yes, I'm going to come back to previous version

>
> >       fits = fits || uclamp_max_fits;
> >
> >       /*
> > @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
> >        * handle the case uclamp_min > uclamp_max.
> >        */
> >       uclamp_min = min(uclamp_min, uclamp_max);
> > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > +             return -1;
>
> Here shouldn't this be
>
>         if (util < uclamp_min) {
>                 fits = fits && (uclamp_min <= capacity_orig);
>                 if (fits && (uclamp_min > capacity_orig_thermal))
>                         return -1;
>         }
>
> uclamp_min should fit capacity_orig first then we'd check for the corner case

I don't get why we should test capacity_orig first ?

case 1:
  capacity_orig = 800
  uclamp_min = 512
  capacity_orig_thermal = 400
  util = 200
util_fits_cpu should return -1

case 2:
  uclamp_min = 900
  capacity_orig = 800
  capacity_orig_thermal = 400
  utili_avg = 200
util_fits_cpu should return -1 whereas with your proposal above it will return 0

> if thermal pressure is causing it not to fit. And all of this should only
> matter if utill < uclamp_min. Otherwise util_avg should be driving force if
> uclamp_max is not trumping it.
>
> I need time to update my unit test [1] to catch these error cases as I didn't
> see them. In the next version I think it's worth including the changes to
> remove the capacity inversion in the patch.
>
> [1] https://github.com/qais-yousef/uclamp_test/blob/master/uclamp_test_thermal_pressure.c
>
>
> Thanks!
>
> --
> Qais Yousef
>
> >
> >       return fits;
> >  }
> > @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> >       unsigned long util = task_util_est(p);
> > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > +     /*
> > +      * Return true only if the cpu fully fits the task requirements, which
> > +      * include the utilization but also the performance.
> > +      */
> > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> >  }
> >
> >  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> >       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> >       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >
> > +     /* Return true only if the utlization doesn't fit its capacity */
> >       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> >  }
> >
> > @@ -6925,6 +6929,7 @@ static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> >       unsigned long task_util, util_min, util_max, best_cap = 0;
> > +     int fits, best_fits = 0;
> >       int cpu, best_cpu = -1;
> >       struct cpumask *cpus;
> >
> > @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >
> >               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> >                       continue;
> > -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > +
> > +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > +
> > +             /* This CPU fits with all capacity and performance requirements */
> > +             if (fits > 0)
> >                       return cpu;
> > +             /*
> > +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > +              * for the CPU with highest performance capacity.
> > +              */
> > +             else if (fits < 0)
> > +                     cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> >
> > -             if (cpu_cap > best_cap) {
> > +             /*
> > +              * First, select cpu which fits better (-1 being better than 0).
> > +              * Then, select the one with largest capacity at same level.
> > +              */
> > +             if ((fits < best_fits) ||
> > +                 ((fits == best_fits) && (cpu_cap > best_cap))) {
> >                       best_cap = cpu_cap;
> >                       best_cpu = cpu;
> > +                     best_fits = fits;
> >               }
> >       }
> >
> > @@ -6958,7 +6979,11 @@ static inline bool asym_fits_cpu(unsigned long util,
> >                                int cpu)
> >  {
> >       if (sched_asym_cpucap_active())
> > -             return util_fits_cpu(util, util_min, util_max, cpu);
> > +             /*
> > +              * Return true only if the cpu fully fits the task requirements
> > +              * which include the utilization but also the performance.
> > +              */
> > +             return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> >
> >       return true;
> >  }
> > @@ -7325,6 +7350,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >       unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> >       struct root_domain *rd = this_rq()->rd;
> >       int cpu, best_energy_cpu, target = -1;
> > +     int prev_fits = -1, best_fits = -1;
> > +     unsigned long best_thermal_cap = 0;
> > +     unsigned long prev_thermal_cap = 0;
> >       struct sched_domain *sd;
> >       struct perf_domain *pd;
> >       struct energy_env eenv;
> > @@ -7360,6 +7388,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >               unsigned long prev_spare_cap = 0;
> >               int max_spare_cap_cpu = -1;
> >               unsigned long base_energy;
> > +             int fits, max_fits = -1;
> >
> >               cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> >
> > @@ -7412,7 +7441,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                                       util_max = max(rq_util_max, p_util_max);
> >                               }
> >                       }
> > -                     if (!util_fits_cpu(util, util_min, util_max, cpu))
> > +
> > +                     fits = util_fits_cpu(util, util_min, util_max, cpu);
> > +                     if (!fits)
> >                               continue;
> >
> >                       lsub_positive(&cpu_cap, util);
> > @@ -7420,7 +7451,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       if (cpu == prev_cpu) {
> >                               /* Always use prev_cpu as a candidate. */
> >                               prev_spare_cap = cpu_cap;
> > -                     } else if (cpu_cap > max_spare_cap) {
> > +                             prev_fits = fits;
> > +                     } else if ((fits > max_fits) ||
> > +                                ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> >                               /*
> >                                * Find the CPU with the maximum spare capacity
> >                                * among the remaining CPUs in the performance
> > @@ -7428,6 +7461,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                                */
> >                               max_spare_cap = cpu_cap;
> >                               max_spare_cap_cpu = cpu;
> > +                             max_fits = fits;
> >                       }
> >               }
> >
> > @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       if (prev_delta < base_energy)
> >                               goto unlock;
> >                       prev_delta -= base_energy;
> > +                     prev_thermal_cap = cpu_thermal_cap;
> >                       best_delta = min(best_delta, prev_delta);
> >               }
> >
> >               /* Evaluate the energy impact of using max_spare_cap_cpu. */
> >               if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > +                     /* Current best energy cpu fits better */
> > +                     if (max_fits < best_fits)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Both don't fit performance (i.e. uclamp_min) but
> > +                      * best energy cpu has better performance.
> > +                      */
> > +                     if ((max_fits < 0) &&
> > +                         (cpu_thermal_cap <= best_thermal_cap))
> > +                             continue;
> > +
> >                       cur_delta = compute_energy(&eenv, pd, cpus, p,
> >                                                  max_spare_cap_cpu);
> >                       /* CPU utilization has changed */
> >                       if (cur_delta < base_energy)
> >                               goto unlock;
> >                       cur_delta -= base_energy;
> > -                     if (cur_delta < best_delta) {
> > -                             best_delta = cur_delta;
> > -                             best_energy_cpu = max_spare_cap_cpu;
> > -                     }
> > +
> > +                     /*
> > +                      * Both fit for the task but best energy cpu has lower
> > +                      * energy impact.
> > +                      */
> > +                     if ((max_fits > 0) && (best_fits > 0) &&
> > +                         (cur_delta >= best_delta))
> > +                             continue;
> > +
> > +                     best_delta = cur_delta;
> > +                     best_energy_cpu = max_spare_cap_cpu;
> > +                     best_fits = max_fits;
> > +                     best_thermal_cap = cpu_thermal_cap;
> >               }
> >       }
> >       rcu_read_unlock();
> >
> > -     if (best_delta < prev_delta)
> > +     if ((best_fits > prev_fits) ||
> > +         ((best_fits > 0) && (best_delta < prev_delta)) ||
> > +         ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> >               target = best_energy_cpu;
> >
> >       return target;
> > @@ -10259,24 +10317,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >        */
> >       update_sd_lb_stats(env, &sds);
> >
> > -     if (sched_energy_enabled()) {
> > -             struct root_domain *rd = env->dst_rq->rd;
> > -
> > -             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > -                     goto out_balanced;
> > -     }
> > -
> > -     local = &sds.local_stat;
> > -     busiest = &sds.busiest_stat;
> > -
> >       /* There is no busy sibling group to pull tasks from */
> >       if (!sds.busiest)
> >               goto out_balanced;
> >
> > +     busiest = &sds.busiest_stat;
> > +
> >       /* Misfit tasks should be dealt with regardless of the avg load */
> >       if (busiest->group_type == group_misfit_task)
> >               goto force_balance;
> >
> > +     if (sched_energy_enabled()) {
> > +             struct root_domain *rd = env->dst_rq->rd;
> > +
> > +             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > +                     goto out_balanced;
> > +     }
> > +
> >       /* ASYM feature bypasses nice load balance check */
> >       if (busiest->group_type == group_asym_packing)
> >               goto force_balance;
> > @@ -10289,6 +10346,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >       if (busiest->group_type == group_imbalanced)
> >               goto force_balance;
> >
> > +     local = &sds.local_stat;
> >       /*
> >        * If the local group is busier than the selected busiest group
> >        * don't try and pull any tasks.
> > --
> > 2.34.1
> >
Vincent Guittot Jan. 16, 2023, 11:23 a.m. UTC | #4
On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 13/01/2023 14:40, Vincent Guittot wrote:
> > By taking into account uclamp_min, the 1:1 relation between task misfit
> > and cpu overutilized is no more true as a task with a small util_avg of
>
> of ?
>
> > may not fit a high capacity cpu because of uclamp_min constraint.
> >
> > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > a CPU except for the uclamp_min hint which is a performance requirement.
> >
> > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > can use this new value to take additional action to select the best CPU
> > that doesn't match uclamp_min hint.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > Change since v2:
> > - fix a condition in feec()
> > - add comments
> >
> >  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
> >  1 file changed, 83 insertions(+), 25 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e9d906a9bba9..29adb9e27b3d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
> >        *     2. The system is being saturated when we're operating near
> >        *        max capacity, it doesn't make sense to block overutilized.
> >        */
> > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
>
> Isn't `uclamp_max <= capacity_orig` always true for `capacity_orig ==
> SCHED_CAPACITY_SCALE`?
>
> uclamp_max = [0..SCHED_CAPACITY_SCALE] , capacity_orig =
> SCHED_CAPACITY_SCALE
>
> >       fits = fits || uclamp_max_fits;
>
> Like Qais I don't understand this change. I read the previous discussion
> from https://lkml.kernel.org/r/20221208140526.vvmjxlz6akgqyoma@airbuntu
> ("Re: [RFC PATCH 3/3] sched/fair: Traverse cpufreq policies to detect
> capacity inversion").
>
> I assume Qais wanted to force `uclamp_max_fits = 0` for a big CPU
> (`capacity_orig = 1024`) and a `uclamp_max = 1024` to not lift `fits`
> from 0 to 1.
>
> >       /*
> > @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
> >        * handle the case uclamp_min > uclamp_max.
> >        */
> >       uclamp_min = min(uclamp_min, uclamp_max);
> > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > +             return -1;
> >       return fits;
> >  }
> > @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> >       unsigned long util = task_util_est(p);
> > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > +     /*
> > +      * Return true only if the cpu fully fits the task requirements, which
> > +      * include the utilization but also the performance.
>
> Not sure if people get what `performance requirements` mean here? I know
> we want to use `performance` rather `bandwidth hint` to describe what
> uclamp is. So shouldn't we use `utilization but also uclamp`?
>
> > +      */
> > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> >  }
> >
> >  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> >       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> >       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >
> > +     /* Return true only if the utlization doesn't fit its capacity */
>
> s/utlization/utilization
> s/its/CPU ?
>
> >       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> >  }
>
> cpu_overutilized() is the only place where we now only test for
> !util_fits_cpu(). The new comment says we only care about utilization
> not fitting CPU capacity.
>
> Does this mean the rq uclamp values are not important here and we could
> go back to use fits_capacity()?
>
> Not sure since util_fits_cpu() is still coded differently:

uclamp_min is not important but uclamp_max still cap the utilization

>
>   fits = fits_capacity(util, capacity)
>
>   fits = fits || uclamp_max_fits  <-- uclamp_max_fits can turn fits from
>                                       0 into 1, so util doesn't fit but
>                                       we don't return 1?
>
> That said, I don't understand the current 'uclamp_max_fits = (uclamp_max
> <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE)' in
> util_fits_cpu(), like already mentioned.
>
> > @@ -6925,6 +6929,7 @@ static int
> >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >  {
> >       unsigned long task_util, util_min, util_max, best_cap = 0;
> > +     int fits, best_fits = 0;
> >       int cpu, best_cpu = -1;
> >       struct cpumask *cpus;
> >
> > @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >
> >               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> >                       continue;
> > -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > +
> > +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > +
> > +             /* This CPU fits with all capacity and performance requirements */
>
> In task_fits_cpu() `utilization and performance (better uclamp)
> requirements` term was used. I assume it's the same thing here?
>
> > +             if (fits > 0)
> >                       return cpu;
> > +             /*
> > +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > +              * for the CPU with highest performance capacity.
>                                             ^^^^^^^^^^^^^^^^^^^^
>
> Do we use a new CPU capacity value `performance capacity (1)` here?
>
> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
>
> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> to return -1. Shouldn't (1) and (2) be the same?

I'm all in favor of both being capacity_orig_of(cpu) -
thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection

>
> > +              */
> > +             else if (fits < 0)
> > +                     cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> >
> > -             if (cpu_cap > best_cap) {
> > +             /*
> > +              * First, select cpu which fits better (-1 being better than 0).
> > +              * Then, select the one with largest capacity at same level.
> > +              */
> > +             if ((fits < best_fits) ||
> > +                 ((fits == best_fits) && (cpu_cap > best_cap))) {
> >                       best_cap = cpu_cap;
> >                       best_cpu = cpu;
> > +                     best_fits = fits;
> >               }
> >       }
> >
>
> [...]
>
> > @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> >                       if (prev_delta < base_energy)
> >                               goto unlock;
> >                       prev_delta -= base_energy;
> > +                     prev_thermal_cap = cpu_thermal_cap;
> >                       best_delta = min(best_delta, prev_delta);
> >               }
> >
> >               /* Evaluate the energy impact of using max_spare_cap_cpu. */
> >               if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > +                     /* Current best energy cpu fits better */
> > +                     if (max_fits < best_fits)
> > +                             continue;
> > +
> > +                     /*
> > +                      * Both don't fit performance (i.e. uclamp_min) but
> > +                      * best energy cpu has better performance.
>
> IMHO, `performance` stands for `cpu_thermal_cap` which is
> `cpu_capacity_orig - thermal pressure`.
>
> I assume `performance` is equal to `performance capacity` used in
> select_idle_capacity which uses thermal_load_avg() and not thermal
> pressure. Why this difference?
>
> [...]
Dietmar Eggemann Jan. 16, 2023, 2:56 p.m. UTC | #5
On 16/01/2023 12:23, Vincent Guittot wrote:
> On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 13/01/2023 14:40, Vincent Guittot wrote:

[...]

>>> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
>>>       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
>>>       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
>>>
>>> +     /* Return true only if the utlization doesn't fit its capacity */
>>
>> s/utlization/utilization
>> s/its/CPU ?
>>
>>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
>>>  }
>>
>> cpu_overutilized() is the only place where we now only test for
>> !util_fits_cpu(). The new comment says we only care about utilization
>> not fitting CPU capacity.
>>
>> Does this mean the rq uclamp values are not important here and we could
>> go back to use fits_capacity()?
>>
>> Not sure since util_fits_cpu() is still coded differently:
> 
> uclamp_min is not important but uclamp_max still cap the utilization

OK, makes sense.

I.e. we could pass in `rq_util_min = 0` to avoid fetching it
unnecessary? In case `fits == 1` before the uclamp_min condition in
util_fits_cpu() it doesn't matter if we switch to return `-1` when
called from cpu_overutilized(). Detail though ...

[...]

>>> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>>>
>>>               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>>>                       continue;
>>> -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
>>> +
>>> +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>>> +
>>> +             /* This CPU fits with all capacity and performance requirements */
>>
>> In task_fits_cpu() `utilization and performance (better uclamp)
>> requirements` term was used. I assume it's the same thing here?
>>
>>> +             if (fits > 0)
>>>                       return cpu;
>>> +             /*
>>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
>>> +              * for the CPU with highest performance capacity.
>>                                             ^^^^^^^^^^^^^^^^^^^^
>>
>> Do we use a new CPU capacity value `performance capacity (1)` here?
>>
>> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
>>
>> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
>> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
>> to return -1. Shouldn't (1) and (2) be the same?
> 
> I'm all in favor of both being capacity_orig_of(cpu) -
> thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection

I think we need a handy name for this new capacity value, which seems to
be `capacity_orig - capacity reduced by thermal`. And we should either
use `thermal_load_avg` or `thermal pressure` for the latter part. And
then we should use this consistently in all these places:
util_fits_cpu(), feec(), sic().

[...]
Qais Yousef Jan. 17, 2023, 4:20 p.m. UTC | #6
On 01/16/23 12:08, Vincent Guittot wrote:
> On Sun, 15 Jan 2023 at 01:19, Qais Yousef <qyousef@layalina.io> wrote:
> >
> > On 01/13/23 14:40, Vincent Guittot wrote:
> > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > and cpu overutilized is no more true as a task with a small util_avg of
> > > may not fit a high capacity cpu because of uclamp_min constraint.
> > >
> > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > a CPU except for the uclamp_min hint which is a performance requirement.
> > >
> > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > > can use this new value to take additional action to select the best CPU
> > > that doesn't match uclamp_min hint.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >
> > > Change since v2:
> > > - fix a condition in feec()
> > > - add comments
> > >
> > >  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
> > >  1 file changed, 83 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e9d906a9bba9..29adb9e27b3d 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
> > >        *     2. The system is being saturated when we're operating near
> > >        *        max capacity, it doesn't make sense to block overutilized.
> > >        */
> > > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
> >
> > I think this hunk is what is causing the overutilized issues Kajetan reported
> > against your patch.
> 
> Yeah, I have been to agressive with uclamp_max
> 
> >
> > For the big cpu, this expression will always be true. So overutilized will
> > trigger only for little and medium cores.
> >
> > I appreciate writing the boolean in a shorter form might appear like less code,
> > but it makes things harder to read too; the compiler should be good at
> > simplifying the expression if it can, no?
> >
> > Shall we leave the original expression as-is since it's easier to reason about?
> >
> > I think already by 'saving' using another variable we reduced readability and
> > lead to this error. First line checks if we are the max_capacity which is
> > the corner case we'd like to avoid and accidentally lost
> >
> > v1 code was:
> >
> > +       max_capacity = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > +       uclamp_max_fits = !max_capacity && (uclamp_max <= capacity_orig);
> > +       fits = fits || uclamp_max_fits;
> >
> > I think that extra variable was well named to help make it obvious what corner
> > case we're trying to catch here. Especially it matches the comment above it
> > explaining this corner case. This auto variable should be free, no?
> >
> > Can we go back to this form please?
> 
> Yes, I'm going to come back to previous version
> 
> >
> > >       fits = fits || uclamp_max_fits;
> > >
> > >       /*
> > > @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
> > >        * handle the case uclamp_min > uclamp_max.
> > >        */
> > >       uclamp_min = min(uclamp_min, uclamp_max);
> > > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > > +             return -1;
> >
> > Here shouldn't this be
> >
> >         if (util < uclamp_min) {
> >                 fits = fits && (uclamp_min <= capacity_orig);
> >                 if (fits && (uclamp_min > capacity_orig_thermal))
> >                         return -1;
> >         }
> >
> > uclamp_min should fit capacity_orig first then we'd check for the corner case
> 
> I don't get why we should test capacity_orig first ?
> 
> case 1:
>   capacity_orig = 800
>   uclamp_min = 512
>   capacity_orig_thermal = 400
>   util = 200
> util_fits_cpu should return -1
> 
> case 2:
>   uclamp_min = 900
>   capacity_orig = 800
>   capacity_orig_thermal = 400
>   utili_avg = 200
> util_fits_cpu should return -1 whereas with your proposal above it will return 0

Ah. Our definition of what -1 means is different.

I thought it's a fallback case for when we fit capacity_orig but due to thermal
pressure we'll actually fail to get. So what I understood -1 would be is that:

	We fit in capacity_orig but due to thermal pressure we will actually
	not get the perf point we'd hope for. So might be worthwhile to search
	harder if another CPU can fit.

But your definition is:

	If util fits but uclamp_min doesn't (regardless because of thermal
	pressure or we fail capacity_orig); then fallback to the cpu with the
	highest (capacity_orig - thermal_pressure) that fits util.

Did I get this right?

Hmm, sounds reasonable, I need to sleep on it but this looks better, yes.


Cheers

--
Qais Yousef

> 
> > if thermal pressure is causing it not to fit. And all of this should only
> > matter if utill < uclamp_min. Otherwise util_avg should be driving force if
> > uclamp_max is not trumping it.
> >
> > I need time to update my unit test [1] to catch these error cases as I didn't
> > see them. In the next version I think it's worth including the changes to
> > remove the capacity inversion in the patch.
> >
> > [1] https://github.com/qais-yousef/uclamp_test/blob/master/uclamp_test_thermal_pressure.c
> >
> >
> > Thanks!
> >
> > --
> > Qais Yousef
> >
> > >
> > >       return fits;
> > >  }
> > > @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > >       unsigned long util = task_util_est(p);
> > > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > > +     /*
> > > +      * Return true only if the cpu fully fits the task requirements, which
> > > +      * include the utilization but also the performance.
> > > +      */
> > > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> > >  }
> > >
> > >  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> > >       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> > >       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> > >
> > > +     /* Return true only if the utlization doesn't fit its capacity */
> > >       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > >  }
> > >
> > > @@ -6925,6 +6929,7 @@ static int
> > >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >  {
> > >       unsigned long task_util, util_min, util_max, best_cap = 0;
> > > +     int fits, best_fits = 0;
> > >       int cpu, best_cpu = -1;
> > >       struct cpumask *cpus;
> > >
> > > @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >
> > >               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > >                       continue;
> > > -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > > +
> > > +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > > +
> > > +             /* This CPU fits with all capacity and performance requirements */
> > > +             if (fits > 0)
> > >                       return cpu;
> > > +             /*
> > > +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > > +              * for the CPU with highest performance capacity.
> > > +              */
> > > +             else if (fits < 0)
> > > +                     cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> > >
> > > -             if (cpu_cap > best_cap) {
> > > +             /*
> > > +              * First, select cpu which fits better (-1 being better than 0).
> > > +              * Then, select the one with largest capacity at same level.
> > > +              */
> > > +             if ((fits < best_fits) ||
> > > +                 ((fits == best_fits) && (cpu_cap > best_cap))) {
> > >                       best_cap = cpu_cap;
> > >                       best_cpu = cpu;
> > > +                     best_fits = fits;
> > >               }
> > >       }
> > >
> > > @@ -6958,7 +6979,11 @@ static inline bool asym_fits_cpu(unsigned long util,
> > >                                int cpu)
> > >  {
> > >       if (sched_asym_cpucap_active())
> > > -             return util_fits_cpu(util, util_min, util_max, cpu);
> > > +             /*
> > > +              * Return true only if the cpu fully fits the task requirements
> > > +              * which include the utilization but also the performance.
> > > +              */
> > > +             return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > >
> > >       return true;
> > >  }
> > > @@ -7325,6 +7350,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >       unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> > >       struct root_domain *rd = this_rq()->rd;
> > >       int cpu, best_energy_cpu, target = -1;
> > > +     int prev_fits = -1, best_fits = -1;
> > > +     unsigned long best_thermal_cap = 0;
> > > +     unsigned long prev_thermal_cap = 0;
> > >       struct sched_domain *sd;
> > >       struct perf_domain *pd;
> > >       struct energy_env eenv;
> > > @@ -7360,6 +7388,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >               unsigned long prev_spare_cap = 0;
> > >               int max_spare_cap_cpu = -1;
> > >               unsigned long base_energy;
> > > +             int fits, max_fits = -1;
> > >
> > >               cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > >
> > > @@ -7412,7 +7441,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                                       util_max = max(rq_util_max, p_util_max);
> > >                               }
> > >                       }
> > > -                     if (!util_fits_cpu(util, util_min, util_max, cpu))
> > > +
> > > +                     fits = util_fits_cpu(util, util_min, util_max, cpu);
> > > +                     if (!fits)
> > >                               continue;
> > >
> > >                       lsub_positive(&cpu_cap, util);
> > > @@ -7420,7 +7451,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                       if (cpu == prev_cpu) {
> > >                               /* Always use prev_cpu as a candidate. */
> > >                               prev_spare_cap = cpu_cap;
> > > -                     } else if (cpu_cap > max_spare_cap) {
> > > +                             prev_fits = fits;
> > > +                     } else if ((fits > max_fits) ||
> > > +                                ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > >                               /*
> > >                                * Find the CPU with the maximum spare capacity
> > >                                * among the remaining CPUs in the performance
> > > @@ -7428,6 +7461,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                                */
> > >                               max_spare_cap = cpu_cap;
> > >                               max_spare_cap_cpu = cpu;
> > > +                             max_fits = fits;
> > >                       }
> > >               }
> > >
> > > @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > >                       if (prev_delta < base_energy)
> > >                               goto unlock;
> > >                       prev_delta -= base_energy;
> > > +                     prev_thermal_cap = cpu_thermal_cap;
> > >                       best_delta = min(best_delta, prev_delta);
> > >               }
> > >
> > >               /* Evaluate the energy impact of using max_spare_cap_cpu. */
> > >               if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > > +                     /* Current best energy cpu fits better */
> > > +                     if (max_fits < best_fits)
> > > +                             continue;
> > > +
> > > +                     /*
> > > +                      * Both don't fit performance (i.e. uclamp_min) but
> > > +                      * best energy cpu has better performance.
> > > +                      */
> > > +                     if ((max_fits < 0) &&
> > > +                         (cpu_thermal_cap <= best_thermal_cap))
> > > +                             continue;
> > > +
> > >                       cur_delta = compute_energy(&eenv, pd, cpus, p,
> > >                                                  max_spare_cap_cpu);
> > >                       /* CPU utilization has changed */
> > >                       if (cur_delta < base_energy)
> > >                               goto unlock;
> > >                       cur_delta -= base_energy;
> > > -                     if (cur_delta < best_delta) {
> > > -                             best_delta = cur_delta;
> > > -                             best_energy_cpu = max_spare_cap_cpu;
> > > -                     }
> > > +
> > > +                     /*
> > > +                      * Both fit for the task but best energy cpu has lower
> > > +                      * energy impact.
> > > +                      */
> > > +                     if ((max_fits > 0) && (best_fits > 0) &&
> > > +                         (cur_delta >= best_delta))
> > > +                             continue;
> > > +
> > > +                     best_delta = cur_delta;
> > > +                     best_energy_cpu = max_spare_cap_cpu;
> > > +                     best_fits = max_fits;
> > > +                     best_thermal_cap = cpu_thermal_cap;
> > >               }
> > >       }
> > >       rcu_read_unlock();
> > >
> > > -     if (best_delta < prev_delta)
> > > +     if ((best_fits > prev_fits) ||
> > > +         ((best_fits > 0) && (best_delta < prev_delta)) ||
> > > +         ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> > >               target = best_energy_cpu;
> > >
> > >       return target;
> > > @@ -10259,24 +10317,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >        */
> > >       update_sd_lb_stats(env, &sds);
> > >
> > > -     if (sched_energy_enabled()) {
> > > -             struct root_domain *rd = env->dst_rq->rd;
> > > -
> > > -             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > > -                     goto out_balanced;
> > > -     }
> > > -
> > > -     local = &sds.local_stat;
> > > -     busiest = &sds.busiest_stat;
> > > -
> > >       /* There is no busy sibling group to pull tasks from */
> > >       if (!sds.busiest)
> > >               goto out_balanced;
> > >
> > > +     busiest = &sds.busiest_stat;
> > > +
> > >       /* Misfit tasks should be dealt with regardless of the avg load */
> > >       if (busiest->group_type == group_misfit_task)
> > >               goto force_balance;
> > >
> > > +     if (sched_energy_enabled()) {
> > > +             struct root_domain *rd = env->dst_rq->rd;
> > > +
> > > +             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > > +                     goto out_balanced;
> > > +     }
> > > +
> > >       /* ASYM feature bypasses nice load balance check */
> > >       if (busiest->group_type == group_asym_packing)
> > >               goto force_balance;
> > > @@ -10289,6 +10346,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > >       if (busiest->group_type == group_imbalanced)
> > >               goto force_balance;
> > >
> > > +     local = &sds.local_stat;
> > >       /*
> > >        * If the local group is busier than the selected busiest group
> > >        * don't try and pull any tasks.
> > > --
> > > 2.34.1
> > >
Qais Yousef Jan. 17, 2023, 4:38 p.m. UTC | #7
On 01/16/23 09:07, Dietmar Eggemann wrote:

[...]

> Not sure if people get what `performance requirements` mean here? I know
> we want to use `performance` rather `bandwidth hint` to describe what
> uclamp is. So shouldn't we use `utilization but also uclamp`?

We do have the uclamp doc now which explains this, no? I'm not keen on
utilization because it's an overloaded term. In the context of uclamp
- utilization _signal_ in the scheduler is used to indicate performance
requirements of a workload, no?

Using 'uclamp hint' if you found it really confusing, is fine by me. But I'd
rather steer away from 'bandwidth' or 'utilization' when describing uclamp and
its intention.

I like using performance requirements because it enforces what this hint
actually means.


Cheers

--
Qais Yousef
Qais Yousef Jan. 17, 2023, 8:20 p.m. UTC | #8
On 01/16/23 14:56, Dietmar Eggemann wrote:
> On 16/01/2023 12:23, Vincent Guittot wrote:
> > On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 13/01/2023 14:40, Vincent Guittot wrote:
> 
> [...]
> 
> >>> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> >>>       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> >>>       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >>>
> >>> +     /* Return true only if the utlization doesn't fit its capacity */
> >>
> >> s/utlization/utilization
> >> s/its/CPU ?
> >>
> >>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> >>>  }
> >>
> >> cpu_overutilized() is the only place where we now only test for
> >> !util_fits_cpu(). The new comment says we only care about utilization
> >> not fitting CPU capacity.
> >>
> >> Does this mean the rq uclamp values are not important here and we could
> >> go back to use fits_capacity()?
> >>
> >> Not sure since util_fits_cpu() is still coded differently:
> > 
> > uclamp_min is not important but uclamp_max still cap the utilization
> 
> OK, makes sense.
> 
> I.e. we could pass in `rq_util_min = 0` to avoid fetching it
> unnecessary? In case `fits == 1` before the uclamp_min condition in
> util_fits_cpu() it doesn't matter if we switch to return `-1` when
> called from cpu_overutilized(). Detail though ...
> 
> [...]
> 
> >>> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >>>
> >>>               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> >>>                       continue;
> >>> -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> >>> +
> >>> +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >>> +
> >>> +             /* This CPU fits with all capacity and performance requirements */
> >>
> >> In task_fits_cpu() `utilization and performance (better uclamp)
> >> requirements` term was used. I assume it's the same thing here?
> >>
> >>> +             if (fits > 0)
> >>>                       return cpu;
> >>> +             /*
> >>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> >>> +              * for the CPU with highest performance capacity.
> >>                                             ^^^^^^^^^^^^^^^^^^^^
> >>
> >> Do we use a new CPU capacity value `performance capacity (1)` here?
> >>
> >> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
> >>
> >> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> >> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> >> to return -1. Shouldn't (1) and (2) be the same?
> > 
> > I'm all in favor of both being capacity_orig_of(cpu) -
> > thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection
> 
> I think we need a handy name for this new capacity value, which seems to
> be `capacity_orig - capacity reduced by thermal`. And we should either
> use `thermal_load_avg` or `thermal pressure` for the latter part. And
> then we should use this consistently in all these places:
> util_fits_cpu(), feec(), sic().

So we had reports from Xuewen that not using instantaneous pressure causes
problems.

Lukasz came up with this patch to help address the problem, but it's still
waiting discussions. I think we need to discuss this problem more there.

	https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@arm.com/

At the moment there's no good solution and either comes with its own set of
caveat(s). Being consistent is not an option now AFAICT? We need to improve
thermal_load_avg() response time first somehow.

For now, to make best decision - we look at instantaneous. But when falling
back - I think using the long term pressure signal makes more sense because we
wouldn't be doing this fallback path if everything works as expected thermal
wise.

At least, that's what I think is a good trade-of for now :)


Thanks!

--
Qais Yousef
Vincent Guittot Jan. 18, 2023, 8:15 a.m. UTC | #9
On Mon, 16 Jan 2023 at 15:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 16/01/2023 12:23, Vincent Guittot wrote:
> > On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 13/01/2023 14:40, Vincent Guittot wrote:
>
> [...]
>
> >>> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> >>>       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> >>>       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> >>>
> >>> +     /* Return true only if the utlization doesn't fit its capacity */
> >>
> >> s/utlization/utilization
> >> s/its/CPU ?
> >>
> >>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> >>>  }
> >>
> >> cpu_overutilized() is the only place where we now only test for
> >> !util_fits_cpu(). The new comment says we only care about utilization
> >> not fitting CPU capacity.
> >>
> >> Does this mean the rq uclamp values are not important here and we could
> >> go back to use fits_capacity()?
> >>
> >> Not sure since util_fits_cpu() is still coded differently:
> >
> > uclamp_min is not important but uclamp_max still cap the utilization
>
> OK, makes sense.
>
> I.e. we could pass in `rq_util_min = 0` to avoid fetching it
> unnecessary? In case `fits == 1` before the uclamp_min condition in
> util_fits_cpu() it doesn't matter if we switch to return `-1` when
> called from cpu_overutilized(). Detail though ...

One comment from Qais was to minimize knowledge outside
util_fits_cpu() that's why I pass both uclamp_min and uclamp_max.

>
> [...]
>
> >>> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> >>>
> >>>               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> >>>                       continue;
> >>> -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> >>> +
> >>> +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >>> +
> >>> +             /* This CPU fits with all capacity and performance requirements */
> >>
> >> In task_fits_cpu() `utilization and performance (better uclamp)
> >> requirements` term was used. I assume it's the same thing here?
> >>
> >>> +             if (fits > 0)
> >>>                       return cpu;
> >>> +             /*
> >>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> >>> +              * for the CPU with highest performance capacity.
> >>                                             ^^^^^^^^^^^^^^^^^^^^
> >>
> >> Do we use a new CPU capacity value `performance capacity (1)` here?
> >>
> >> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
> >>
> >> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> >> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> >> to return -1. Shouldn't (1) and (2) be the same?
> >
> > I'm all in favor of both being capacity_orig_of(cpu) -
> > thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection
>
> I think we need a handy name for this new capacity value, which seems to
> be `capacity_orig - capacity reduced by thermal`. And we should either
> use `thermal_load_avg` or `thermal pressure` for the latter part. And
> then we should use this consistently in all these places:
> util_fits_cpu(), feec(), sic().

Ok, let me change this everywhere

>
> [...]
Vincent Guittot Jan. 18, 2023, 1:49 p.m. UTC | #10
On Tue, 17 Jan 2023 at 17:20, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/16/23 12:08, Vincent Guittot wrote:
> > On Sun, 15 Jan 2023 at 01:19, Qais Yousef <qyousef@layalina.io> wrote:
> > >
> > > On 01/13/23 14:40, Vincent Guittot wrote:
> > > > By taking into account uclamp_min, the 1:1 relation between task misfit
> > > > and cpu overutilized is no more true as a task with a small util_avg of
> > > > may not fit a high capacity cpu because of uclamp_min constraint.
> > > >
> > > > Add a new state in util_fits_cpu() to reflect the case that task would fit
> > > > a CPU except for the uclamp_min hint which is a performance requirement.
> > > >
> > > > Use -1 to reflect that a CPU doesn't fit only because of uclamp_min so we
> > > > can use this new value to take additional action to select the best CPU
> > > > that doesn't match uclamp_min hint.
> > > >
> > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > > ---
> > > >
> > > > Change since v2:
> > > > - fix a condition in feec()
> > > > - add comments
> > > >
> > > >  kernel/sched/fair.c | 108 ++++++++++++++++++++++++++++++++++----------
> > > >  1 file changed, 83 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index e9d906a9bba9..29adb9e27b3d 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4525,8 +4525,7 @@ static inline int util_fits_cpu(unsigned long util,
> > > >        *     2. The system is being saturated when we're operating near
> > > >        *        max capacity, it doesn't make sense to block overutilized.
> > > >        */
> > > > -     uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > > > -     uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
> > > > +     uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
> > >
> > > I think this hunk is what is causing the overutilized issues Kajetan reported
> > > against your patch.
> >
> > Yeah, I have been to agressive with uclamp_max
> >
> > >
> > > For the big cpu, this expression will always be true. So overutilized will
> > > trigger only for little and medium cores.
> > >
> > > I appreciate writing the boolean in a shorter form might appear like less code,
> > > but it makes things harder to read too; the compiler should be good at
> > > simplifying the expression if it can, no?
> > >
> > > Shall we leave the original expression as-is since it's easier to reason about?
> > >
> > > I think already by 'saving' using another variable we reduced readability and
> > > lead to this error. First line checks if we are the max_capacity which is
> > > the corner case we'd like to avoid and accidentally lost
> > >
> > > v1 code was:
> > >
> > > +       max_capacity = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
> > > +       uclamp_max_fits = !max_capacity && (uclamp_max <= capacity_orig);
> > > +       fits = fits || uclamp_max_fits;
> > >
> > > I think that extra variable was well named to help make it obvious what corner
> > > case we're trying to catch here. Especially it matches the comment above it
> > > explaining this corner case. This auto variable should be free, no?
> > >
> > > Can we go back to this form please?
> >
> > Yes, I'm going to come back to previous version
> >
> > >
> > > >       fits = fits || uclamp_max_fits;
> > > >
> > > >       /*
> > > > @@ -4561,8 +4560,8 @@ static inline int util_fits_cpu(unsigned long util,
> > > >        * handle the case uclamp_min > uclamp_max.
> > > >        */
> > > >       uclamp_min = min(uclamp_min, uclamp_max);
> > > > -     if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
> > > > -             fits = fits && (uclamp_min <= capacity_orig_thermal);
> > > > +     if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
> > > > +             return -1;
> > >
> > > Here shouldn't this be
> > >
> > >         if (util < uclamp_min) {
> > >                 fits = fits && (uclamp_min <= capacity_orig);
> > >                 if (fits && (uclamp_min > capacity_orig_thermal))
> > >                         return -1;
> > >         }
> > >
> > > uclamp_min should fit capacity_orig first then we'd check for the corner case
> >
> > I don't get why we should test capacity_orig first ?
> >
> > case 1:
> >   capacity_orig = 800
> >   uclamp_min = 512
> >   capacity_orig_thermal = 400
> >   util = 200
> > util_fits_cpu should return -1
> >
> > case 2:
> >   uclamp_min = 900
> >   capacity_orig = 800
> >   capacity_orig_thermal = 400
> >   utili_avg = 200
> > util_fits_cpu should return -1 whereas with your proposal above it will return 0
>
> Ah. Our definition of what -1 means is different.
>
> I thought it's a fallback case for when we fit capacity_orig but due to thermal
> pressure we'll actually fail to get. So what I understood -1 would be is that:
>
>         We fit in capacity_orig but due to thermal pressure we will actually
>         not get the perf point we'd hope for. So might be worthwhile to search
>         harder if another CPU can fit.
>
> But your definition is:
>
>         If util fits but uclamp_min doesn't (regardless because of thermal
>         pressure or we fail capacity_orig); then fallback to the cpu with the
>         highest (capacity_orig - thermal_pressure) that fits util.
>
> Did I get this right?

Sorry I forgot to reply but yes that my definition

>
> Hmm, sounds reasonable, I need to sleep on it but this looks better, yes.
>
>
> Cheers
>
> --
> Qais Yousef
>
> >
> > > if thermal pressure is causing it not to fit. And all of this should only
> > > matter if utill < uclamp_min. Otherwise util_avg should be driving force if
> > > uclamp_max is not trumping it.
> > >
> > > I need time to update my unit test [1] to catch these error cases as I didn't
> > > see them. In the next version I think it's worth including the changes to
> > > remove the capacity inversion in the patch.
> > >
> > > [1] https://github.com/qais-yousef/uclamp_test/blob/master/uclamp_test_thermal_pressure.c
> > >
> > >
> > > Thanks!
> > >
> > > --
> > > Qais Yousef
> > >
> > > >
> > > >       return fits;
> > > >  }
> > > > @@ -4572,7 +4571,11 @@ static inline int task_fits_cpu(struct task_struct *p, int cpu)
> > > >       unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
> > > >       unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
> > > >       unsigned long util = task_util_est(p);
> > > > -     return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
> > > > +     /*
> > > > +      * Return true only if the cpu fully fits the task requirements, which
> > > > +      * include the utilization but also the performance.
> > > > +      */
> > > > +     return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
> > > >  }
> > > >
> > > >  static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > > > @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> > > >       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> > > >       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> > > >
> > > > +     /* Return true only if the utlization doesn't fit its capacity */
> > > >       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > > >  }
> > > >
> > > > @@ -6925,6 +6929,7 @@ static int
> > > >  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > >  {
> > > >       unsigned long task_util, util_min, util_max, best_cap = 0;
> > > > +     int fits, best_fits = 0;
> > > >       int cpu, best_cpu = -1;
> > > >       struct cpumask *cpus;
> > > >
> > > > @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > >
> > > >               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > > >                       continue;
> > > > -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > > > +
> > > > +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > > > +
> > > > +             /* This CPU fits with all capacity and performance requirements */
> > > > +             if (fits > 0)
> > > >                       return cpu;
> > > > +             /*
> > > > +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > > > +              * for the CPU with highest performance capacity.
> > > > +              */
> > > > +             else if (fits < 0)
> > > > +                     cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> > > >
> > > > -             if (cpu_cap > best_cap) {
> > > > +             /*
> > > > +              * First, select cpu which fits better (-1 being better than 0).
> > > > +              * Then, select the one with largest capacity at same level.
> > > > +              */
> > > > +             if ((fits < best_fits) ||
> > > > +                 ((fits == best_fits) && (cpu_cap > best_cap))) {
> > > >                       best_cap = cpu_cap;
> > > >                       best_cpu = cpu;
> > > > +                     best_fits = fits;
> > > >               }
> > > >       }
> > > >
> > > > @@ -6958,7 +6979,11 @@ static inline bool asym_fits_cpu(unsigned long util,
> > > >                                int cpu)
> > > >  {
> > > >       if (sched_asym_cpucap_active())
> > > > -             return util_fits_cpu(util, util_min, util_max, cpu);
> > > > +             /*
> > > > +              * Return true only if the cpu fully fits the task requirements
> > > > +              * which include the utilization but also the performance.
> > > > +              */
> > > > +             return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > > >
> > > >       return true;
> > > >  }
> > > > @@ -7325,6 +7350,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >       unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
> > > >       struct root_domain *rd = this_rq()->rd;
> > > >       int cpu, best_energy_cpu, target = -1;
> > > > +     int prev_fits = -1, best_fits = -1;
> > > > +     unsigned long best_thermal_cap = 0;
> > > > +     unsigned long prev_thermal_cap = 0;
> > > >       struct sched_domain *sd;
> > > >       struct perf_domain *pd;
> > > >       struct energy_env eenv;
> > > > @@ -7360,6 +7388,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >               unsigned long prev_spare_cap = 0;
> > > >               int max_spare_cap_cpu = -1;
> > > >               unsigned long base_energy;
> > > > +             int fits, max_fits = -1;
> > > >
> > > >               cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
> > > >
> > > > @@ -7412,7 +7441,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                                       util_max = max(rq_util_max, p_util_max);
> > > >                               }
> > > >                       }
> > > > -                     if (!util_fits_cpu(util, util_min, util_max, cpu))
> > > > +
> > > > +                     fits = util_fits_cpu(util, util_min, util_max, cpu);
> > > > +                     if (!fits)
> > > >                               continue;
> > > >
> > > >                       lsub_positive(&cpu_cap, util);
> > > > @@ -7420,7 +7451,9 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                       if (cpu == prev_cpu) {
> > > >                               /* Always use prev_cpu as a candidate. */
> > > >                               prev_spare_cap = cpu_cap;
> > > > -                     } else if (cpu_cap > max_spare_cap) {
> > > > +                             prev_fits = fits;
> > > > +                     } else if ((fits > max_fits) ||
> > > > +                                ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
> > > >                               /*
> > > >                                * Find the CPU with the maximum spare capacity
> > > >                                * among the remaining CPUs in the performance
> > > > @@ -7428,6 +7461,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                                */
> > > >                               max_spare_cap = cpu_cap;
> > > >                               max_spare_cap_cpu = cpu;
> > > > +                             max_fits = fits;
> > > >                       }
> > > >               }
> > > >
> > > > @@ -7446,26 +7480,50 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > >                       if (prev_delta < base_energy)
> > > >                               goto unlock;
> > > >                       prev_delta -= base_energy;
> > > > +                     prev_thermal_cap = cpu_thermal_cap;
> > > >                       best_delta = min(best_delta, prev_delta);
> > > >               }
> > > >
> > > >               /* Evaluate the energy impact of using max_spare_cap_cpu. */
> > > >               if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
> > > > +                     /* Current best energy cpu fits better */
> > > > +                     if (max_fits < best_fits)
> > > > +                             continue;
> > > > +
> > > > +                     /*
> > > > +                      * Both don't fit performance (i.e. uclamp_min) but
> > > > +                      * best energy cpu has better performance.
> > > > +                      */
> > > > +                     if ((max_fits < 0) &&
> > > > +                         (cpu_thermal_cap <= best_thermal_cap))
> > > > +                             continue;
> > > > +
> > > >                       cur_delta = compute_energy(&eenv, pd, cpus, p,
> > > >                                                  max_spare_cap_cpu);
> > > >                       /* CPU utilization has changed */
> > > >                       if (cur_delta < base_energy)
> > > >                               goto unlock;
> > > >                       cur_delta -= base_energy;
> > > > -                     if (cur_delta < best_delta) {
> > > > -                             best_delta = cur_delta;
> > > > -                             best_energy_cpu = max_spare_cap_cpu;
> > > > -                     }
> > > > +
> > > > +                     /*
> > > > +                      * Both fit for the task but best energy cpu has lower
> > > > +                      * energy impact.
> > > > +                      */
> > > > +                     if ((max_fits > 0) && (best_fits > 0) &&
> > > > +                         (cur_delta >= best_delta))
> > > > +                             continue;
> > > > +
> > > > +                     best_delta = cur_delta;
> > > > +                     best_energy_cpu = max_spare_cap_cpu;
> > > > +                     best_fits = max_fits;
> > > > +                     best_thermal_cap = cpu_thermal_cap;
> > > >               }
> > > >       }
> > > >       rcu_read_unlock();
> > > >
> > > > -     if (best_delta < prev_delta)
> > > > +     if ((best_fits > prev_fits) ||
> > > > +         ((best_fits > 0) && (best_delta < prev_delta)) ||
> > > > +         ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
> > > >               target = best_energy_cpu;
> > > >
> > > >       return target;
> > > > @@ -10259,24 +10317,23 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > > >        */
> > > >       update_sd_lb_stats(env, &sds);
> > > >
> > > > -     if (sched_energy_enabled()) {
> > > > -             struct root_domain *rd = env->dst_rq->rd;
> > > > -
> > > > -             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > > > -                     goto out_balanced;
> > > > -     }
> > > > -
> > > > -     local = &sds.local_stat;
> > > > -     busiest = &sds.busiest_stat;
> > > > -
> > > >       /* There is no busy sibling group to pull tasks from */
> > > >       if (!sds.busiest)
> > > >               goto out_balanced;
> > > >
> > > > +     busiest = &sds.busiest_stat;
> > > > +
> > > >       /* Misfit tasks should be dealt with regardless of the avg load */
> > > >       if (busiest->group_type == group_misfit_task)
> > > >               goto force_balance;
> > > >
> > > > +     if (sched_energy_enabled()) {
> > > > +             struct root_domain *rd = env->dst_rq->rd;
> > > > +
> > > > +             if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
> > > > +                     goto out_balanced;
> > > > +     }
> > > > +
> > > >       /* ASYM feature bypasses nice load balance check */
> > > >       if (busiest->group_type == group_asym_packing)
> > > >               goto force_balance;
> > > > @@ -10289,6 +10346,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> > > >       if (busiest->group_type == group_imbalanced)
> > > >               goto force_balance;
> > > >
> > > > +     local = &sds.local_stat;
> > > >       /*
> > > >        * If the local group is busier than the selected busiest group
> > > >        * don't try and pull any tasks.
> > > > --
> > > > 2.34.1
> > > >
Dietmar Eggemann Jan. 18, 2023, 4:17 p.m. UTC | #11
On 17/01/2023 16:38, Qais Yousef wrote:
> On 01/16/23 09:07, Dietmar Eggemann wrote:
> 
> [...]
> 
>> Not sure if people get what `performance requirements` mean here? I know
>> we want to use `performance` rather `bandwidth hint` to describe what
>> uclamp is. So shouldn't we use `utilization but also uclamp`?
> 
> We do have the uclamp doc now which explains this, no? I'm not keen on
> utilization because it's an overloaded term. In the context of uclamp

And `performance` isn't ? ;-) True, the doc refers to uclamp as a
`performance requirements`.

> - utilization _signal_ in the scheduler is used to indicate performance
> requirements of a workload, no?

I was referring to:

 4569 static inline int task_fits_cpu(struct task_struct *p, int cpu)
 4570 {
 4571   unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
 4572   unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
 4573   unsigned long util = task_util_est(p);
 4574   /*
 4575   * Return true only if the cpu fully fits the task requirements,
 4576   * which include the utilization but also the performance.
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 4577   */
 4578   return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 4579 }

So here we explicitly talk about `utilization` (util_avg/util_est)
versus `uclamp (max/min)` and the latter is referred as `performance`.
You're right, we shouldn't refer to `uclamp (min/max)` as `utilization`
either.

In other places we use:

select_idle_capacity()

/* This CPU fits with all capacity and performance requirements */
                          ^^^^^^^^     ^^^^^^^^^^^^^^^^^^^^^^^^

`capacity` is probably equal `utilization`? and `performance
requirements` stand for `uclamp (min/max)`.

/* Only the min performance (i.e. uclamp_min) doesn't fit */
            ^^^^^^^^^^^^^^^
here we link `min performance` explicitly to `uclamp_min`.

/* Look for the CPU with highest performance capacity.
                                 ^^^^^^^^^^^^^^^^^^^^
I guess this stands for `cap_orig - thermal_load_avg()`

feec()

/* Both don't fit performance (i.e. uclamp_min) but best energy cpu has
                  ^^^^^^^^^^^  ^^^^^^^^^^^^^^^
better performance. */
^^^^^^ ^^^^^^^^^^^

Here I assume `better performance` stands for higher `cap_orig -
thermal_pressure', not for `uclamp min or max`?

---

IMHO, referring to `uclamp (min/max)` as `performance (min/max)
hint/(requirement)` is fine as long as it's done consistently in
comments and the alias is not used for other items.

> 
> Using 'uclamp hint' if you found it really confusing, is fine by me. But I'd
> rather steer away from 'bandwidth' or 'utilization' when describing uclamp and
> its intention.
> 
> I like using performance requirements because it enforces what this hint
> actually means.
Qais Yousef Jan. 18, 2023, 4:48 p.m. UTC | #12
On 01/18/23 09:15, Vincent Guittot wrote:
> On Mon, 16 Jan 2023 at 15:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >
> > On 16/01/2023 12:23, Vincent Guittot wrote:
> > > On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>
> > >> On 13/01/2023 14:40, Vincent Guittot wrote:
> >
> > [...]
> >
> > >>> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> > >>>       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> > >>>       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> > >>>
> > >>> +     /* Return true only if the utlization doesn't fit its capacity */
> > >>
> > >> s/utlization/utilization
> > >> s/its/CPU ?
> > >>
> > >>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > >>>  }
> > >>
> > >> cpu_overutilized() is the only place where we now only test for
> > >> !util_fits_cpu(). The new comment says we only care about utilization
> > >> not fitting CPU capacity.
> > >>
> > >> Does this mean the rq uclamp values are not important here and we could
> > >> go back to use fits_capacity()?
> > >>
> > >> Not sure since util_fits_cpu() is still coded differently:
> > >
> > > uclamp_min is not important but uclamp_max still cap the utilization
> >
> > OK, makes sense.
> >
> > I.e. we could pass in `rq_util_min = 0` to avoid fetching it
> > unnecessary? In case `fits == 1` before the uclamp_min condition in
> > util_fits_cpu() it doesn't matter if we switch to return `-1` when
> > called from cpu_overutilized(). Detail though ...
> 
> One comment from Qais was to minimize knowledge outside
> util_fits_cpu() that's why I pass both uclamp_min and uclamp_max.
> 
> >
> > [...]
> >
> > >>> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > >>>
> > >>>               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > >>>                       continue;
> > >>> -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > >>> +
> > >>> +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > >>> +
> > >>> +             /* This CPU fits with all capacity and performance requirements */
> > >>
> > >> In task_fits_cpu() `utilization and performance (better uclamp)
> > >> requirements` term was used. I assume it's the same thing here?
> > >>
> > >>> +             if (fits > 0)
> > >>>                       return cpu;
> > >>> +             /*
> > >>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > >>> +              * for the CPU with highest performance capacity.
> > >>                                             ^^^^^^^^^^^^^^^^^^^^
> > >>
> > >> Do we use a new CPU capacity value `performance capacity (1)` here?
> > >>
> > >> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
> > >>
> > >> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> > >> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> > >> to return -1. Shouldn't (1) and (2) be the same?
> > >
> > > I'm all in favor of both being capacity_orig_of(cpu) -
> > > thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection
> >
> > I think we need a handy name for this new capacity value, which seems to
> > be `capacity_orig - capacity reduced by thermal`. And we should either
> > use `thermal_load_avg` or `thermal pressure` for the latter part. And
> > then we should use this consistently in all these places:
> > util_fits_cpu(), feec(), sic().
> 
> Ok, let me change this everywhere

I'm not keen on this :-/

Changing this everywhere could have implications beyond our simple capabilities
of testing now :(

Current choice (in util_fits_cpu()) was based on a direct feedback from Xuewen.
I think we should discuss how we can improve the situation instead rather than
worry about consistency. I don't think we can be consistent without doing some
improvements on thermal pressure response time.

A separate proposal patch to invoke some testing and discussion is fine by me.

Better keep it a separate work item please?


Cheers

--
Qais Yousef
Vincent Guittot Jan. 19, 2023, 10:08 a.m. UTC | #13
On Wed, 18 Jan 2023 at 17:48, Qais Yousef <qyousef@layalina.io> wrote:
>
> On 01/18/23 09:15, Vincent Guittot wrote:
> > On Mon, 16 Jan 2023 at 15:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >
> > > On 16/01/2023 12:23, Vincent Guittot wrote:
> > > > On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > > >>
> > > >> On 13/01/2023 14:40, Vincent Guittot wrote:
> > >
> > > [...]
> > >
> > > >>> @@ -6132,6 +6135,7 @@ static inline bool cpu_overutilized(int cpu)
> > > >>>       unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
> > > >>>       unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
> > > >>>
> > > >>> +     /* Return true only if the utlization doesn't fit its capacity */
> > > >>
> > > >> s/utlization/utilization
> > > >> s/its/CPU ?
> > > >>
> > > >>>       return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
> > > >>>  }
> > > >>
> > > >> cpu_overutilized() is the only place where we now only test for
> > > >> !util_fits_cpu(). The new comment says we only care about utilization
> > > >> not fitting CPU capacity.
> > > >>
> > > >> Does this mean the rq uclamp values are not important here and we could
> > > >> go back to use fits_capacity()?
> > > >>
> > > >> Not sure since util_fits_cpu() is still coded differently:
> > > >
> > > > uclamp_min is not important but uclamp_max still cap the utilization
> > >
> > > OK, makes sense.
> > >
> > > I.e. we could pass in `rq_util_min = 0` to avoid fetching it
> > > unnecessary? In case `fits == 1` before the uclamp_min condition in
> > > util_fits_cpu() it doesn't matter if we switch to return `-1` when
> > > called from cpu_overutilized(). Detail though ...
> >
> > One comment from Qais was to minimize knowledge outside
> > util_fits_cpu() that's why I pass both uclamp_min and uclamp_max.
> >
> > >
> > > [...]
> > >
> > > >>> @@ -6940,12 +6945,28 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > > >>>
> > > >>>               if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
> > > >>>                       continue;
> > > >>> -             if (util_fits_cpu(task_util, util_min, util_max, cpu))
> > > >>> +
> > > >>> +             fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> > > >>> +
> > > >>> +             /* This CPU fits with all capacity and performance requirements */
> > > >>
> > > >> In task_fits_cpu() `utilization and performance (better uclamp)
> > > >> requirements` term was used. I assume it's the same thing here?
> > > >>
> > > >>> +             if (fits > 0)
> > > >>>                       return cpu;
> > > >>> +             /*
> > > >>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
> > > >>> +              * for the CPU with highest performance capacity.
> > > >>                                             ^^^^^^^^^^^^^^^^^^^^
> > > >>
> > > >> Do we use a new CPU capacity value `performance capacity (1)` here?
> > > >>
> > > >> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
> > > >>
> > > >> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
> > > >> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
> > > >> to return -1. Shouldn't (1) and (2) be the same?
> > > >
> > > > I'm all in favor of both being capacity_orig_of(cpu) -
> > > > thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection
> > >
> > > I think we need a handy name for this new capacity value, which seems to
> > > be `capacity_orig - capacity reduced by thermal`. And we should either
> > > use `thermal_load_avg` or `thermal pressure` for the latter part. And
> > > then we should use this consistently in all these places:
> > > util_fits_cpu(), feec(), sic().
> >
> > Ok, let me change this everywhere
>
> I'm not keen on this :-/
>
> Changing this everywhere could have implications beyond our simple capabilities
> of testing now :(
>
> Current choice (in util_fits_cpu()) was based on a direct feedback from Xuewen.
> I think we should discuss how we can improve the situation instead rather than
> worry about consistency. I don't think we can be consistent without doing some
> improvements on thermal pressure response time.
>
> A separate proposal patch to invoke some testing and discussion is fine by me.
>
> Better keep it a separate work item please?

Ok, I'm going to keep the current use of arch_scale_thermal_pressure
and thermal_load_avg for this patch

Thanks


>
>
> Cheers
>
> --
> Qais Yousef
Dietmar Eggemann Jan. 20, 2023, 6:53 p.m. UTC | #14
On 19/01/2023 10:08, Vincent Guittot wrote:
> On Wed, 18 Jan 2023 at 17:48, Qais Yousef <qyousef@layalina.io> wrote:
>>
>> On 01/18/23 09:15, Vincent Guittot wrote:
>>> On Mon, 16 Jan 2023 at 15:56, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 16/01/2023 12:23, Vincent Guittot wrote:
>>>>> On Mon, 16 Jan 2023 at 10:07, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>>
>>>>>> On 13/01/2023 14:40, Vincent Guittot wrote:

[...]

>>>>>> In task_fits_cpu() `utilization and performance (better uclamp)
>>>>>> requirements` term was used. I assume it's the same thing here?
>>>>>>
>>>>>>> +             if (fits > 0)
>>>>>>>                       return cpu;
>>>>>>> +             /*
>>>>>>> +              * Only the min performance (i.e. uclamp_min) doesn't fit. Look
>>>>>>> +              * for the CPU with highest performance capacity.
>>>>>>                                             ^^^^^^^^^^^^^^^^^^^^
>>>>>>
>>>>>> Do we use a new CPU capacity value `performance capacity (1)` here?
>>>>>>
>>>>>> Which I guess is `capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)`.
>>>>>>
>>>>>> I'm asking since util_fits_cpu() still uses: `capacity_orig_thermal (2)
>>>>>> = capacity_orig - arch_scale_thermal_pressure()` when checking whether
>>>>>> to return -1. Shouldn't (1) and (2) be the same?
>>>>>
>>>>> I'm all in favor of both being capacity_orig_of(cpu) -
>>>>> thermal_load_avg(cpu_rq(cpu) like the capacity inversion detection
>>>>
>>>> I think we need a handy name for this new capacity value, which seems to
>>>> be `capacity_orig - capacity reduced by thermal`. And we should either
>>>> use `thermal_load_avg` or `thermal pressure` for the latter part. And
>>>> then we should use this consistently in all these places:
>>>> util_fits_cpu(), feec(), sic().
>>>
>>> Ok, let me change this everywhere
>>
>> I'm not keen on this :-/
>>
>> Changing this everywhere could have implications beyond our simple capabilities
>> of testing now :(

It's actually not everywhere. I'm aware of 2 occurrences now in which we
use 'cap_orig - th_pressure': in feec()/compute_energy() (commit
489f16459e00 "sched/fair: Take thermal pressure into account while
estimating energy") and now also in util_fits_cpu().

>> Current choice (in util_fits_cpu()) was based on a direct feedback from Xuewen.

I went through these ~40 emails in the '[PATCH] sched: Take thermal
pressure into account when determine rt fits capacity' thread (1):

https://lkml.kernel.org/r/20220407051932.4071-1-xuewen.yan@unisoc.com

and the '[PATCH 1/7] sched/uclamp: Fix relationship between uclamp and
migration margin' (2):

https://lkml.kernel.org/r/20220629194632.1117723-2-qais.yousef@arm.com

There is this email from Xuewen in (1):

https://lkml.kernel.org/r/CAB8ipk--Y8HxetcmUhBmtWq6Mmd726QmDbcbibGLERJw_PUqkQ@mail.gmail.com

in which he mentioned that he prefers th_pressure but this was a CapInv
prototype in update_cpu_capacity() (the whole discussion was about
th_pressure in rt_task_fits_capacity()) rather than util_fits_cpu().

Maybe I missed something more directly related to util_fits_cpu()?

>> I think we should discuss how we can improve the situation instead rather than
>> worry about consistency. I don't think we can be consistent without doing some
>> improvements on thermal pressure response time.

I'm fine with discussing this next Wednesday.

We just have to watch out for v4 of this patch which uses `cap_orig -
thermal_load_avg()` in sic().

>> A separate proposal patch to invoke some testing and discussion is fine by me.
>>
>> Better keep it a separate work item please?
> 
> Ok, I'm going to keep the current use of arch_scale_thermal_pressure
> and thermal_load_avg for this patch

OK.
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9d906a9bba9..29adb9e27b3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4525,8 +4525,7 @@  static inline int util_fits_cpu(unsigned long util,
 	 *     2. The system is being saturated when we're operating near
 	 *        max capacity, it doesn't make sense to block overutilized.
 	 */
-	uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
-	uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
+	uclamp_max_fits = (uclamp_max <= capacity_orig) || (capacity_orig == SCHED_CAPACITY_SCALE);
 	fits = fits || uclamp_max_fits;
 
 	/*
@@ -4561,8 +4560,8 @@  static inline int util_fits_cpu(unsigned long util,
 	 * handle the case uclamp_min > uclamp_max.
 	 */
 	uclamp_min = min(uclamp_min, uclamp_max);
-	if (util < uclamp_min && capacity_orig != SCHED_CAPACITY_SCALE)
-		fits = fits && (uclamp_min <= capacity_orig_thermal);
+	if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+		return -1;
 
 	return fits;
 }
@@ -4572,7 +4571,11 @@  static inline int task_fits_cpu(struct task_struct *p, int cpu)
 	unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
 	unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
 	unsigned long util = task_util_est(p);
-	return util_fits_cpu(util, uclamp_min, uclamp_max, cpu);
+	/*
+	 * Return true only if the cpu fully fits the task requirements, which
+	 * include the utilization but also the performance.
+	 */
+	return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
 }
 
 static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6132,6 +6135,7 @@  static inline bool cpu_overutilized(int cpu)
 	unsigned long rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
 	unsigned long rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
 
+	/* Return true only if the utlization doesn't fit its capacity */
 	return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
 }
 
@@ -6925,6 +6929,7 @@  static int
 select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	unsigned long task_util, util_min, util_max, best_cap = 0;
+	int fits, best_fits = 0;
 	int cpu, best_cpu = -1;
 	struct cpumask *cpus;
 
@@ -6940,12 +6945,28 @@  select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
 
 		if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
 			continue;
-		if (util_fits_cpu(task_util, util_min, util_max, cpu))
+
+		fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+
+		/* This CPU fits with all capacity and performance requirements */
+		if (fits > 0)
 			return cpu;
+		/*
+		 * Only the min performance (i.e. uclamp_min) doesn't fit. Look
+		 * for the CPU with highest performance capacity.
+		 */
+		else if (fits < 0)
+			cpu_cap = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
 
-		if (cpu_cap > best_cap) {
+		/*
+		 * First, select cpu which fits better (-1 being better than 0).
+		 * Then, select the one with largest capacity at same level.
+		 */
+		if ((fits < best_fits) ||
+		    ((fits == best_fits) && (cpu_cap > best_cap))) {
 			best_cap = cpu_cap;
 			best_cpu = cpu;
+			best_fits = fits;
 		}
 	}
 
@@ -6958,7 +6979,11 @@  static inline bool asym_fits_cpu(unsigned long util,
 				 int cpu)
 {
 	if (sched_asym_cpucap_active())
-		return util_fits_cpu(util, util_min, util_max, cpu);
+		/*
+		 * Return true only if the cpu fully fits the task requirements
+		 * which include the utilization but also the performance.
+		 */
+		return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
 
 	return true;
 }
@@ -7325,6 +7350,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 	unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
 	struct root_domain *rd = this_rq()->rd;
 	int cpu, best_energy_cpu, target = -1;
+	int prev_fits = -1, best_fits = -1;
+	unsigned long best_thermal_cap = 0;
+	unsigned long prev_thermal_cap = 0;
 	struct sched_domain *sd;
 	struct perf_domain *pd;
 	struct energy_env eenv;
@@ -7360,6 +7388,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 		unsigned long prev_spare_cap = 0;
 		int max_spare_cap_cpu = -1;
 		unsigned long base_energy;
+		int fits, max_fits = -1;
 
 		cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
 
@@ -7412,7 +7441,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 					util_max = max(rq_util_max, p_util_max);
 				}
 			}
-			if (!util_fits_cpu(util, util_min, util_max, cpu))
+
+			fits = util_fits_cpu(util, util_min, util_max, cpu);
+			if (!fits)
 				continue;
 
 			lsub_positive(&cpu_cap, util);
@@ -7420,7 +7451,9 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (cpu == prev_cpu) {
 				/* Always use prev_cpu as a candidate. */
 				prev_spare_cap = cpu_cap;
-			} else if (cpu_cap > max_spare_cap) {
+				prev_fits = fits;
+			} else if ((fits > max_fits) ||
+				   ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
 				/*
 				 * Find the CPU with the maximum spare capacity
 				 * among the remaining CPUs in the performance
@@ -7428,6 +7461,7 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 				 */
 				max_spare_cap = cpu_cap;
 				max_spare_cap_cpu = cpu;
+				max_fits = fits;
 			}
 		}
 
@@ -7446,26 +7480,50 @@  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
 			if (prev_delta < base_energy)
 				goto unlock;
 			prev_delta -= base_energy;
+			prev_thermal_cap = cpu_thermal_cap;
 			best_delta = min(best_delta, prev_delta);
 		}
 
 		/* Evaluate the energy impact of using max_spare_cap_cpu. */
 		if (max_spare_cap_cpu >= 0 && max_spare_cap > prev_spare_cap) {
+			/* Current best energy cpu fits better */
+			if (max_fits < best_fits)
+				continue;
+
+			/*
+			 * Both don't fit performance (i.e. uclamp_min) but
+			 * best energy cpu has better performance.
+			 */
+			if ((max_fits < 0) &&
+			    (cpu_thermal_cap <= best_thermal_cap))
+				continue;
+
 			cur_delta = compute_energy(&eenv, pd, cpus, p,
 						   max_spare_cap_cpu);
 			/* CPU utilization has changed */
 			if (cur_delta < base_energy)
 				goto unlock;
 			cur_delta -= base_energy;
-			if (cur_delta < best_delta) {
-				best_delta = cur_delta;
-				best_energy_cpu = max_spare_cap_cpu;
-			}
+
+			/*
+			 * Both fit for the task but best energy cpu has lower
+			 * energy impact.
+			 */
+			if ((max_fits > 0) && (best_fits > 0) &&
+			    (cur_delta >= best_delta))
+				continue;
+
+			best_delta = cur_delta;
+			best_energy_cpu = max_spare_cap_cpu;
+			best_fits = max_fits;
+			best_thermal_cap = cpu_thermal_cap;
 		}
 	}
 	rcu_read_unlock();
 
-	if (best_delta < prev_delta)
+	if ((best_fits > prev_fits) ||
+	    ((best_fits > 0) && (best_delta < prev_delta)) ||
+	    ((best_fits < 0) && (best_thermal_cap > prev_thermal_cap)))
 		target = best_energy_cpu;
 
 	return target;
@@ -10259,24 +10317,23 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
 
-	if (sched_energy_enabled()) {
-		struct root_domain *rd = env->dst_rq->rd;
-
-		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
-			goto out_balanced;
-	}
-
-	local = &sds.local_stat;
-	busiest = &sds.busiest_stat;
-
 	/* There is no busy sibling group to pull tasks from */
 	if (!sds.busiest)
 		goto out_balanced;
 
+	busiest = &sds.busiest_stat;
+
 	/* Misfit tasks should be dealt with regardless of the avg load */
 	if (busiest->group_type == group_misfit_task)
 		goto force_balance;
 
+	if (sched_energy_enabled()) {
+		struct root_domain *rd = env->dst_rq->rd;
+
+		if (rcu_dereference(rd->pd) && !READ_ONCE(rd->overutilized))
+			goto out_balanced;
+	}
+
 	/* ASYM feature bypasses nice load balance check */
 	if (busiest->group_type == group_asym_packing)
 		goto force_balance;
@@ -10289,6 +10346,7 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	if (busiest->group_type == group_imbalanced)
 		goto force_balance;
 
+	local = &sds.local_stat;
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.