diff mbox series

sched/fair: Fix frequency selection for non invariant case

Message ID 20240114183600.135316-1-vincent.guittot@linaro.org
State Accepted
Commit e37617c8e53a1f7fcba6d5e1041f4fd8a2425c27
Headers show
Series sched/fair: Fix frequency selection for non invariant case | expand

Commit Message

Vincent Guittot Jan. 14, 2024, 6:36 p.m. UTC
When frequency invariance is not enabled, get_capacity_ref_freq(policy)
returns the current frequency and the performance margin applied by
map_util_perf(), enabled the utilization to go above the maximum compute
capacity and to select a higher frequency than the current one.

The performance margin is now applied earlier in the path to take into
account some utilization clampings and we can't get an utilization higher
than the maximum compute capacity.

We must use a frequency above the current frequency to get a chance to
select a higher OPP when the current one becomes fully used. Apply
the same margin and returns a frequency 25% higher than the current one in
order to switch to the next OPP before we fully use the cpu at the current
one.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
Reported-by: Wyes Karny <wkarny@gmail.com>
Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
---
 kernel/sched/cpufreq_schedutil.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Qais Yousef Jan. 15, 2024, 12:15 p.m. UTC | #1
On 01/14/24 19:36, Vincent Guittot wrote:
> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> returns the current frequency and the performance margin applied by
> map_util_perf(), enabled the utilization to go above the maximum compute
> capacity and to select a higher frequency than the current one.
> 
> The performance margin is now applied earlier in the path to take into
> account some utilization clampings and we can't get an utilization higher
> than the maximum compute capacity.
> 
> We must use a frequency above the current frequency to get a chance to
> select a higher OPP when the current one becomes fully used. Apply
> the same margin and returns a frequency 25% higher than the current one in
> order to switch to the next OPP before we fully use the cpu at the current
> one.
> 
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> Reported-by: Wyes Karny <wkarny@gmail.com>
> Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Wyes Karny <wkarny@gmail.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..d12e95d30e2e 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>  	if (arch_scale_freq_invariant())
>  		return policy->cpuinfo.max_freq;
>  
> -	return policy->cur;
> +	/*
> +	 * Apply a 25% margin so that we select a higher frequency than
> +	 * the current one before the CPU is full busy
> +	 */
> +	return policy->cur + (policy->cur >> 2);

I think we can do better, but this does re-instate the previous behavior at
least for this merge window. So FWIW

Reviewed-and-tested-by: Qais Yousef <qyousef@layalina.io>

>  }
>  
>  /**
> -- 
> 2.34.1
>
Dietmar Eggemann Jan. 15, 2024, 8:06 p.m. UTC | #2
On 14/01/2024 19:36, Vincent Guittot wrote:
> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> returns the current frequency and the performance margin applied by
> map_util_perf(), enabled the utilization to go above the maximum compute
> capacity and to select a higher frequency than the current one.
> 
> The performance margin is now applied earlier in the path to take into
> account some utilization clampings and we can't get an utilization higher
> than the maximum compute capacity.
> 
> We must use a frequency above the current frequency to get a chance to
> select a higher OPP when the current one becomes fully used. Apply
> the same margin and returns a frequency 25% higher than the current one in
> order to switch to the next OPP before we fully use the cpu at the current
> one.
> 
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> Reported-by: Wyes Karny <wkarny@gmail.com>
> Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Wyes Karny <wkarny@gmail.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..d12e95d30e2e 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>  	if (arch_scale_freq_invariant())
>  		return policy->cpuinfo.max_freq;
>  
> -	return policy->cur;
> +	/*
> +	 * Apply a 25% margin so that we select a higher frequency than
> +	 * the current one before the CPU is full busy
> +	 */
> +	return policy->cur + (policy->cur >> 2);
>  }
>  
>  /**

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

(on Intel Xeon CPU E5-2690 v2 with frequency invariance disabled)
Ingo Molnar Jan. 16, 2024, 9:59 a.m. UTC | #3
* Vincent Guittot <vincent.guittot@linaro.org> wrote:

> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> returns the current frequency and the performance margin applied by
> map_util_perf(), enabled the utilization to go above the maximum compute
> capacity and to select a higher frequency than the current one.
> 
> The performance margin is now applied earlier in the path to take into
> account some utilization clampings and we can't get an utilization higher
> than the maximum compute capacity.
> 
> We must use a frequency above the current frequency to get a chance to
> select a higher OPP when the current one becomes fully used. Apply
> the same margin and returns a frequency 25% higher than the current one in
> order to switch to the next OPP before we fully use the cpu at the current
> one.
> 
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> Reported-by: Wyes Karny <wkarny@gmail.com>
> Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Wyes Karny <wkarny@gmail.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..d12e95d30e2e 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>  	if (arch_scale_freq_invariant())
>  		return policy->cpuinfo.max_freq;
>  
> -	return policy->cur;
> +	/*
> +	 * Apply a 25% margin so that we select a higher frequency than
> +	 * the current one before the CPU is full busy
> +	 */
> +	return policy->cur + (policy->cur >> 2);
>  }

I've updated the changelog to better express what was broken and how we 
fixed it. Ack?

	Ingo

==========================>
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Sun, 14 Jan 2024 19:36:00 +0100
Subject: [PATCH] sched/fair: Fix frequency selection for non-invariant case

Linus reported a ~50% performance regression on single-threaded
workloads on his AMD Ryzen system, and bisected it to:

  9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")

When frequency invariance is not enabled, get_capacity_ref_freq(policy)
is supposed to return the current frequency and the performance margin
applied by map_util_perf(), enabling the utilization to go above the
maximum compute capacity and to select a higher frequency than the current one.

After the changes in 9c0b4bb7f630, the performance margin was applied
earlier in the path to take into account utilization clampings and
we couldn't get a utilization higher than the maximum compute capacity,
and the CPU remained 'stuck' at lower frequencies.

To fix this, we must use a frequency above the current frequency to
get a chance to select a higher OPP when the current one becomes fully used.
Apply the same margin and return a frequency 25% higher than the current
one in order to switch to the next OPP before we fully use the CPU
at the current one.

[ mingo: Clarified the changelog. ]

Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Wyes Karny <wkarny@gmail.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Wyes Karny <wkarny@gmail.com>
Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
---
 kernel/sched/cpufreq_schedutil.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..eece6244f9d2 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
 	if (arch_scale_freq_invariant())
 		return policy->cpuinfo.max_freq;
 
-	return policy->cur;
+	/*
+	 * Apply a 25% margin so that we select a higher frequency than
+	 * the current one before the CPU is fully busy:
+	 */
+	return policy->cur + (policy->cur >> 2);
 }
 
 /**
Vincent Guittot Jan. 16, 2024, 10:04 a.m. UTC | #4
On Tue, 16 Jan 2024 at 10:59, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> > When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> > returns the current frequency and the performance margin applied by
> > map_util_perf(), enabled the utilization to go above the maximum compute
> > capacity and to select a higher frequency than the current one.
> >
> > The performance margin is now applied earlier in the path to take into
> > account some utilization clampings and we can't get an utilization higher
> > than the maximum compute capacity.
> >
> > We must use a frequency above the current frequency to get a chance to
> > select a higher OPP when the current one becomes fully used. Apply
> > the same margin and returns a frequency 25% higher than the current one in
> > order to switch to the next OPP before we fully use the cpu at the current
> > one.
> >
> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> > Reported-by: Wyes Karny <wkarny@gmail.com>
> > Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> > Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Tested-by: Wyes Karny <wkarny@gmail.com>
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..d12e95d30e2e 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> >       if (arch_scale_freq_invariant())
> >               return policy->cpuinfo.max_freq;
> >
> > -     return policy->cur;
> > +     /*
> > +      * Apply a 25% margin so that we select a higher frequency than
> > +      * the current one before the CPU is full busy
> > +      */
> > +     return policy->cur + (policy->cur >> 2);
> >  }
>
> I've updated the changelog to better express what was broken and how we
> fixed it. Ack?

Looks good

Thanks

>
>         Ingo
>
> ==========================>
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Date: Sun, 14 Jan 2024 19:36:00 +0100
> Subject: [PATCH] sched/fair: Fix frequency selection for non-invariant case
>
> Linus reported a ~50% performance regression on single-threaded
> workloads on his AMD Ryzen system, and bisected it to:
>
>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
>
> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> is supposed to return the current frequency and the performance margin
> applied by map_util_perf(), enabling the utilization to go above the
> maximum compute capacity and to select a higher frequency than the current one.
>
> After the changes in 9c0b4bb7f630, the performance margin was applied
> earlier in the path to take into account utilization clampings and
> we couldn't get a utilization higher than the maximum compute capacity,
> and the CPU remained 'stuck' at lower frequencies.
>
> To fix this, we must use a frequency above the current frequency to
> get a chance to select a higher OPP when the current one becomes fully used.
> Apply the same margin and return a frequency 25% higher than the current
> one in order to switch to the next OPP before we fully use the CPU
> at the current one.
>
> [ mingo: Clarified the changelog. ]
>
> Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Bisected-by: Linus Torvalds <torvalds@linux-foundation.org>
> Reported-by: Wyes Karny <wkarny@gmail.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Tested-by: Wyes Karny <wkarny@gmail.com>
> Link: https://lore.kernel.org/r/20240114183600.135316-1-vincent.guittot@linaro.org
> ---
>  kernel/sched/cpufreq_schedutil.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..eece6244f9d2 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>         if (arch_scale_freq_invariant())
>                 return policy->cpuinfo.max_freq;
>
> -       return policy->cur;
> +       /*
> +        * Apply a 25% margin so that we select a higher frequency than
> +        * the current one before the CPU is fully busy:
> +        */
> +       return policy->cur + (policy->cur >> 2);
>  }
>
>  /**
Linux regression tracking (Thorsten Leemhuis) Jan. 17, 2024, 2:28 p.m. UTC | #5
On 16.01.24 10:59, Ingo Molnar wrote:
> 
> * Vincent Guittot <vincent.guittot@linaro.org> wrote:
> 
>> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
>> returns the current frequency and the performance margin applied by
>> map_util_perf(), enabled the utilization to go above the maximum compute
>> capacity and to select a higher frequency than the current one.
>> [...]
>> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
>> Reported-by: Wyes Karny <wkarny@gmail.com>
>> Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/

Thx for resolving this everyone. Allow me a quick question:

I noticed the two Closes: tags above are missing in the actual commit:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=e37617c8e53a1f7fcba6d5e1041f4fd8a2425c27

Is there a overeager script here that removes them when it shouldn't?

Just asking, because the lack of these tags makes regression tracking
hard. And Linus really wants those tags in cases like this[1].

Ciao, Thorsten

[1] for details, see:
https://lore.kernel.org/all/CAHk-=wjMmSZzMJ3Xnskdg4+GGz=5p5p+GSYyFBTh0f-DgvdBWg@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wgs38ZrfPvy=nOwVkVzjpM3VFU1zobP37Fwd_h9iAD5JQ@mail.gmail.com/
https://lore.kernel.org/all/CAHk-=wjxzafG-=J8oT30s7upn4RhBs6TX-uVFZ5rME+L5_DoJA@mail.gmail.com/
Jon Hunter Feb. 14, 2024, 5:12 p.m. UTC | #6
Hi Vincent,

On 14/01/2024 18:36, Vincent Guittot wrote:
> When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> returns the current frequency and the performance margin applied by
> map_util_perf(), enabled the utilization to go above the maximum compute
> capacity and to select a higher frequency than the current one.
> 
> The performance margin is now applied earlier in the path to take into
> account some utilization clampings and we can't get an utilization higher
> than the maximum compute capacity.
> 
> We must use a frequency above the current frequency to get a chance to
> select a higher OPP when the current one becomes fully used. Apply
> the same margin and returns a frequency 25% higher than the current one in
> order to switch to the next OPP before we fully use the cpu at the current
> one.
> 
> Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> Reported-by: Wyes Karny <wkarny@gmail.com>
> Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Tested-by: Wyes Karny <wkarny@gmail.com>
> ---
>   kernel/sched/cpufreq_schedutil.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 95c3c097083e..d12e95d30e2e 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
>   	if (arch_scale_freq_invariant())
>   		return policy->cpuinfo.max_freq;
>   
> -	return policy->cur;
> +	/*
> +	 * Apply a 25% margin so that we select a higher frequency than
> +	 * the current one before the CPU is full busy
> +	 */
> +	return policy->cur + (policy->cur >> 2);
>   }
>   
>   /**


We have also observed a performance degradation on our Tegra platforms 
with v6.8-rc1. Unfortunately, the above change does not fix the problem 
for us and we are still seeing a performance issue with v6.8-rc4. For 
example, running Dhrystone on Tegra234 I am seeing the following ...

Linux v6.7:
[ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
[ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
[ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
[ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)

Linux v6.8-rc4:
[   44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
[   51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
[   59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
[   66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)

If I revert this change and the following ...

  b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
  f12560779f9d ("sched/cpufreq: Rework iowait boost")
  9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor

... then the perf is similar to where it was ...

Linux v6.8-rc4 plus reverts:
[   31.768189] CPU0: Dhrystones per Second: 48421678 (27559 DMIPS)
[   36.556838] CPU1: Dhrystones per Second: 48401324 (27547 DMIPS)
[   41.343343] CPU2: Dhrystones per Second: 48421678 (27559 DMIPS)
[   46.163347] CPU3: Dhrystones per Second: 47679814 (27137 DMIPS)

All CPUs are running with the schedutil CPUFREQ governor. We have not 
looked any further but wanted to report this in case you have any more 
thoughts or suggestions for us to try.

Thanks
Jon
Vincent Guittot Feb. 14, 2024, 5:19 p.m. UTC | #7
Hi John,

On Wed, 14 Feb 2024 at 18:12, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi Vincent,
>
> On 14/01/2024 18:36, Vincent Guittot wrote:
> > When frequency invariance is not enabled, get_capacity_ref_freq(policy)
> > returns the current frequency and the performance margin applied by
> > map_util_perf(), enabled the utilization to go above the maximum compute
> > capacity and to select a higher frequency than the current one.
> >
> > The performance margin is now applied earlier in the path to take into
> > account some utilization clampings and we can't get an utilization higher
> > than the maximum compute capacity.
> >
> > We must use a frequency above the current frequency to get a chance to
> > select a higher OPP when the current one becomes fully used. Apply
> > the same margin and returns a frequency 25% higher than the current one in
> > order to switch to the next OPP before we fully use the cpu at the current
> > one.
> >
> > Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Closes: https://lore.kernel.org/lkml/CAHk-=wgWcYX2oXKtgvNN2LLDXP7kXkbo-xTfumEjmPbjSer2RQ@mail.gmail.com/
> > Reported-by: Wyes Karny <wkarny@gmail.com>
> > Closes: https://lore.kernel.org/lkml/20240114091240.xzdvqk75ifgfj5yx@wyes-pc/
> > Fixes: 9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor performance estimation")
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Tested-by: Wyes Karny <wkarny@gmail.com>
> > ---
> >   kernel/sched/cpufreq_schedutil.c | 6 +++++-
> >   1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 95c3c097083e..d12e95d30e2e 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -133,7 +133,11 @@ unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
> >       if (arch_scale_freq_invariant())
> >               return policy->cpuinfo.max_freq;
> >
> > -     return policy->cur;
> > +     /*
> > +      * Apply a 25% margin so that we select a higher frequency than
> > +      * the current one before the CPU is full busy
> > +      */
> > +     return policy->cur + (policy->cur >> 2);
> >   }
> >
> >   /**
>
>
> We have also observed a performance degradation on our Tegra platforms
> with v6.8-rc1. Unfortunately, the above change does not fix the problem
> for us and we are still seeing a performance issue with v6.8-rc4. For
> example, running Dhrystone on Tegra234 I am seeing the following ...
>
> Linux v6.7:
> [ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
> [ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
> [ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
> [ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
>
> Linux v6.8-rc4:
> [   44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
> [   51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
> [   59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
> [   66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
>
> If I revert this change and the following ...
>
>   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>   f12560779f9d ("sched/cpufreq: Rework iowait boost")
>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>
> ... then the perf is similar to where it was ...
>
> Linux v6.8-rc4 plus reverts:
> [   31.768189] CPU0: Dhrystones per Second: 48421678 (27559 DMIPS)
> [   36.556838] CPU1: Dhrystones per Second: 48401324 (27547 DMIPS)
> [   41.343343] CPU2: Dhrystones per Second: 48421678 (27559 DMIPS)
> [   46.163347] CPU3: Dhrystones per Second: 47679814 (27137 DMIPS)
>
> All CPUs are running with the schedutil CPUFREQ governor. We have not
> looked any further but wanted to report this in case you have any more
> thoughts or suggestions for us to try.

Have you tried this :
https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/

It's in driver-core-linus' branch and should be sent to Linus soon

Vincent

>
> Thanks
> Jon
>
> --
> nvpublic
Linus Torvalds Feb. 14, 2024, 5:19 p.m. UTC | #8
On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
>
> We have also observed a performance degradation on our Tegra platforms
> with v6.8-rc1. Unfortunately, the above change does not fix the problem
> for us and we are still seeing a performance issue with v6.8-rc4. For
> example, running Dhrystone on Tegra234 I am seeing the following ...
>
> Linux v6.7:
> [ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
> [ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
> [ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
> [ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
>
> Linux v6.8-rc4:
> [   44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
> [   51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
> [   59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
> [   66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
>
> If I revert this change and the following ...
>
>   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>   f12560779f9d ("sched/cpufreq: Rework iowait boost")
>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>
> ... then the perf is similar to where it was ...

Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
completely buggered.

Please tell me why we shouldn't just revert things as per above?

Sure, the problem _I_ experienced is fixed, but apparently there are
others just lurking, and they are even bigger degradations than the
one I saw.

We're now at rc4, we're not releasing a 6.8 with the above kinds of
numbers. So either there's another obvious one-liner fix, or we need
to revert this whole thing.

Yes, dhrystones is a truly crappy benchmark, but partly _because_ it's
such a horribly bad benchmark it's also a very simple case. It's pure
CPU load with absolutely nothing interesting going on. Regressing on
that by a factor of three is a sign of complete failure.

                  Linus
Vincent Guittot Feb. 14, 2024, 5:22 p.m. UTC | #9
On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
> >
> > We have also observed a performance degradation on our Tegra platforms
> > with v6.8-rc1. Unfortunately, the above change does not fix the problem
> > for us and we are still seeing a performance issue with v6.8-rc4. For
> > example, running Dhrystone on Tegra234 I am seeing the following ...
> >
> > Linux v6.7:
> > [ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
> > [ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
> > [ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
> > [ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
> >
> > Linux v6.8-rc4:
> > [   44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
> > [   51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
> > [   59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
> > [   66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
> >
> > If I revert this change and the following ...
> >
> >   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> >   f12560779f9d ("sched/cpufreq: Rework iowait boost")
> >   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> >
> > ... then the perf is similar to where it was ...
>
> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
> completely buggered.
>
> Please tell me why we shouldn't just revert things as per above?
>
> Sure, the problem _I_ experienced is fixed, but apparently there are
> others just lurking, and they are even bigger degradations than the
> one I saw.
>
> We're now at rc4, we're not releasing a 6.8 with the above kinds of
> numbers. So either there's another obvious one-liner fix, or we need
> to revert this whole thing.

This should fix it:
https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/

>
> Yes, dhrystones is a truly crappy benchmark, but partly _because_ it's
> such a horribly bad benchmark it's also a very simple case. It's pure
> CPU load with absolutely nothing interesting going on. Regressing on
> that by a factor of three is a sign of complete failure.
>
>                   Linus
Jon Hunter Feb. 14, 2024, 5:57 p.m. UTC | #10
On 14/02/2024 17:22, Vincent Guittot wrote:
> On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
>>>
>>> We have also observed a performance degradation on our Tegra platforms
>>> with v6.8-rc1. Unfortunately, the above change does not fix the problem
>>> for us and we are still seeing a performance issue with v6.8-rc4. For
>>> example, running Dhrystone on Tegra234 I am seeing the following ...
>>>
>>> Linux v6.7:
>>> [ 2216.301949] CPU0: Dhrystones per Second: 31976326 (18199 DMIPS)
>>> [ 2220.993877] CPU1: Dhrystones per Second: 49568123 (28211 DMIPS)
>>> [ 2225.685280] CPU2: Dhrystones per Second: 49568123 (28211 DMIPS)
>>> [ 2230.364423] CPU3: Dhrystones per Second: 49632220 (28248 DMIPS)
>>>
>>> Linux v6.8-rc4:
>>> [   44.661686] CPU0: Dhrystones per Second: 16068483 (9145 DMIPS)
>>> [   51.895107] CPU1: Dhrystones per Second: 16077457 (9150 DMIPS)
>>> [   59.105410] CPU2: Dhrystones per Second: 16095436 (9160 DMIPS)
>>> [   66.333297] CPU3: Dhrystones per Second: 16064000 (9142 DMIPS)
>>>
>>> If I revert this change and the following ...
>>>
>>>    b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>>>    f12560779f9d ("sched/cpufreq: Rework iowait boost")
>>>    9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>>>
>>> ... then the perf is similar to where it was ...
>>
>> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
>> completely buggered.
>>
>> Please tell me why we shouldn't just revert things as per above?
>>
>> Sure, the problem _I_ experienced is fixed, but apparently there are
>> others just lurking, and they are even bigger degradations than the
>> one I saw.
>>
>> We're now at rc4, we're not releasing a 6.8 with the above kinds of
>> numbers. So either there's another obvious one-liner fix, or we need
>> to revert this whole thing.
> 
> This should fix it:
> https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/


Yes I can confirm that this does fix it ...

[   29.440836] CPU0: Dhrystones per Second: 48340366 (27513 DMIPS)
[   34.221323] CPU1: Dhrystones per Second: 48585127 (27652 DMIPS)
[   38.988036] CPU2: Dhrystones per Second: 48667266 (27699 DMIPS)
[   43.769430] CPU3: Dhrystones per Second: 48544161 (27629 DMIPS)

>> Yes, dhrystones is a truly crappy benchmark, but partly _because_ it's
>> such a horribly bad benchmark it's also a very simple case. It's pure
>> CPU load with absolutely nothing interesting going on. Regressing on
>> that by a factor of three is a sign of complete failure.


We have a few other more extensive tests that have been failing due to 
the perf issue. We will run those with the above and if we see any more 
issues I will let everyone know.

Thanks
Jon
Linux regression tracking (Thorsten Leemhuis) Feb. 15, 2024, 8:45 a.m. UTC | #11
Linus, what...

On 14.02.24 18:22, Vincent Guittot wrote:
> On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
>>> We have also observed a performance degradation on our Tegra platforms
>>> with v6.8-rc1. Unfortunately, the above change does not fix the problem
>>> for us and we are still seeing a performance issue with v6.8-rc4. For
>>> example, running Dhrystone on Tegra234 I am seeing the following ...
>>> [...]
>>> If I revert this change and the following ...
>>>   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>>>   f12560779f9d ("sched/cpufreq: Rework iowait boost")
>>>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>>> ... then the perf is similar to where it was ...
>>
>> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
>> completely buggered.
>> [...]
> This should fix it:
> https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/

...do you want me to do in situations like this? I'm asking, as I see
situations like this frequently -- e.g. people reporting problems a
second, third, or fourth time while the fix is already sitting in -next
for a few days.

Want me to list them in the weekly reports so that you can cherry-pick
them from -next if you want?

Ciao, Thorsten
Greg KH Feb. 15, 2024, 9:13 a.m. UTC | #12
On Thu, Feb 15, 2024 at 09:45:01AM +0100, Thorsten Leemhuis wrote:
> Linus, what...
> 
> On 14.02.24 18:22, Vincent Guittot wrote:
> > On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
> >>> We have also observed a performance degradation on our Tegra platforms
> >>> with v6.8-rc1. Unfortunately, the above change does not fix the problem
> >>> for us and we are still seeing a performance issue with v6.8-rc4. For
> >>> example, running Dhrystone on Tegra234 I am seeing the following ...
> >>> [...]
> >>> If I revert this change and the following ...
> >>>   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
> >>>   f12560779f9d ("sched/cpufreq: Rework iowait boost")
> >>>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
> >>> ... then the perf is similar to where it was ...
> >>
> >> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
> >> completely buggered.
> >> [...]
> > This should fix it:
> > https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/
> 
> ...do you want me to do in situations like this? I'm asking, as I see
> situations like this frequently -- e.g. people reporting problems a
> second, third, or fourth time while the fix is already sitting in -next
> for a few days.
> 
> Want me to list them in the weekly reports so that you can cherry-pick
> them from -next if you want?

Poke the maintainer to get off their butt and submit the pull request to
Linus (note, this is me in this case...)

I'll get it into the next -rc, sorry for the delay, other things got in
the way, my fault.

greg k-h
Linux regression tracking (Thorsten Leemhuis) Feb. 15, 2024, 10 a.m. UTC | #13
On 15.02.24 10:13, Greg KH wrote:
> On Thu, Feb 15, 2024 at 09:45:01AM +0100, Thorsten Leemhuis wrote:
>> Linus, what...
>>
>> On 14.02.24 18:22, Vincent Guittot wrote:
>>> On Wed, 14 Feb 2024 at 18:20, Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>> On Wed, 14 Feb 2024 at 09:12, Jon Hunter <jonathanh@nvidia.com> wrote:
>>>>> We have also observed a performance degradation on our Tegra platforms
>>>>> with v6.8-rc1. Unfortunately, the above change does not fix the problem
>>>>> for us and we are still seeing a performance issue with v6.8-rc4. For
>>>>> example, running Dhrystone on Tegra234 I am seeing the following ...
>>>>> [...]
>>>>> If I revert this change and the following ...
>>>>>   b3edde44e5d4 ("cpufreq/schedutil: Use a fixed reference frequency")
>>>>>   f12560779f9d ("sched/cpufreq: Rework iowait boost")
>>>>>   9c0b4bb7f630 ("sched/cpufreq: Rework schedutil governor
>>>>> ... then the perf is similar to where it was ...
>>>>
>>>> Ok, guys, this whole scheduler / cpufreq rewrite seems to have been
>>>> completely buggered.
>>>> [...]
>>> This should fix it:
>>> https://lore.kernel.org/lkml/20240117190545.596057-1-vincent.guittot@linaro.org/
>>
>> ...do you want me to do in situations like this? I'm asking, as I see
>> situations like this frequently -- e.g. people reporting problems a
>> second, third, or fourth time while the fix is already sitting in -next
>> for a few days.
>>
>> Want me to list them in the weekly reports so that you can cherry-pick
>> them from -next if you want?
> 
> Poke the maintainer to get off their butt and submit the pull request to
> Linus 

Well, I did that sometimes and will continue to do so. But some
maintainers then feel pestered and become annoyed by my efforts -- which
in the long-term is counter productive, as regression tracking will only
work well if maintainers and I work well together. That's why I'm a bit
careful with such things (side note: don't worry, I know that some
conflict is inevitable -- but I don't have your or Linus standing, so I
have to choose my fights carefully...).

I sometimes also got replies along the lines of "we are only at -rc2,
this can wait till -rc5 or -rc6" -- and I have no quote from Linus at
hand I can point maintainers to that says something along the lines of
"if a regression fix was in -next for at least two days, submit it to
mainline before the next -rc, unless there is a strong reason why that
particular fix needs more testing" (or whatever he actually wants).

> (note, this is me in this case...)
> I'll get it into the next -rc, sorry for the delay, other things got in
> the way, my fault.

Happens, I don't care too much about this specific event, more about the
general problem. Especially the -mm tree bugs me sometimes, as I noticed
that Andrew often lets regression fixes linger in -next for round about
a week; he furthermore sometimes sends this stuff to Linus on Mondays or
Tuesdays. Due to that the fixes often miss at least one, sometimes two
-rcs. That is especially hard to watch if the regression made it to a
stable kernel and you are waiting for the fix to get mainlined.

Ciao, Thorsten
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 95c3c097083e..d12e95d30e2e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -133,7 +133,11 @@  unsigned long get_capacity_ref_freq(struct cpufreq_policy *policy)
 	if (arch_scale_freq_invariant())
 		return policy->cpuinfo.max_freq;
 
-	return policy->cur;
+	/*
+	 * Apply a 25% margin so that we select a higher frequency than
+	 * the current one before the CPU is full busy
+	 */
+	return policy->cur + (policy->cur >> 2);
 }
 
 /**