diff mbox series

[1/1] cpufreq: Rewire arch specific feedback for cpuinfo/scaling_cur_freq

Message ID 20240603081331.3829278-2-beata.michalska@arm.com
State New
Headers show
Series [1/1] cpufreq: Rewire arch specific feedback for cpuinfo/scaling_cur_freq | expand

Commit Message

Beata Michalska June 3, 2024, 8:13 a.m. UTC
Some architectures provide a way to determine an average frequency over
a certain period of time, based on available performance monitors (AMU on
ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
represent the current frequency of a given CPU,as obtained by the hardware.
This is the type of feedback that counters do provide.
At the same time, keep the scaling_cur_freq attribute align with the docs
and make it provide most recently requested frequency, still allowing to
fallback to using arch_freq_get_on_cpu for cases when cpuinfo_cur_freq is
not available.

Signed-off-by: Beata Michalska <beata.michalska@arm.com>
---
 drivers/cpufreq/cpufreq.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Beata Michalska June 5, 2024, 8:36 a.m. UTC | #1
On Mon, Jun 03, 2024 at 03:43:12PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 3, 2024 at 1:48 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > Hi Beata,
> >
> > Thanks for taking this forward.
> >
> > On 03-06-24, 09:13, Beata Michalska wrote:
> > > Some architectures provide a way to determine an average frequency over
> > > a certain period of time, based on available performance monitors (AMU on
> > > ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
> > > into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
> > > represent the current frequency of a given CPU,as obtained by the hardware.
> > > This is the type of feedback that counters do provide.
> >
> > Please add blank line between paragraphs, it makes it easier to read
> > them.
> >
> > > At the same time, keep the scaling_cur_freq attribute align with the docs
> > > and make it provide most recently requested frequency, still allowing to
> > > fallback to using arch_freq_get_on_cpu for cases when cpuinfo_cur_freq is
> > > not available.
> >
> > Please split this patch into two parts, they are very distinct changes
> > and should be kept separate.
> >
> > > Signed-off-by: Beata Michalska <beata.michalska@arm.com>
> > > ---
> > >  drivers/cpufreq/cpufreq.c | 8 ++++++--
> > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > index a45aac17c20f..3b0eabe4a983 100644
> > > --- a/drivers/cpufreq/cpufreq.c
> > > +++ b/drivers/cpufreq/cpufreq.c
> > > @@ -758,7 +758,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> > >       ssize_t ret;
> > >       unsigned int freq;
> > >
> > > -     freq = arch_freq_get_on_cpu(policy->cpu);
> > > +     freq = !cpufreq_driver->get ? arch_freq_get_on_cpu(policy->cpu)
> > > +                                 : 0;
> >
> > This is getting trickier than I thought as I dived into more details
> > of all the changes to the file.
> >
> > Rafael,
> >
> > We probably need to decide on a policy for these two files, it is
> > getting a bit confusing.
> >
> > cpuinfo_cur_freq:
> >
> > The purpose of this file is abundantly clear. This returns the best
> > possible guess of the current hardware frequency. It should rely on
> > arch_freq_get_on_cpu() or ->get() to get the value.
> 
> Let me quote the documentation:
> 
> "This is expected to be the frequency the hardware actually runs at.
> If that frequency cannot be determined, this attribute should not be
> present."
> 
> In my reading, this has nothing to do with arch_freq_get_on_cpu(), at
> least on x86.
My reading on this (and I might be wrong) is that for x86,
arch_freq_get_on_cpu() is utilizing the APERF/MPERF registers to get
a rough/average frequency and as such it does, in a way, get a hw feedback
and it does somewhat fall under "frequency the hardware actually runs at".
Because it is an average value, it might not provide an instant view on
current frequency, but either way, the value provided here migh be a bit off
anyway. But then, we could adjust the timewidnow being used to make it more
accurate. I might be looking at it the wrong way though.
> 
> > Perhaps we can
> > make this available all the time, instead of conditionally on ->get()
> > callback (which isn't present for intel-pstate for example).
> 
> We could, but then on x86 there is no expectation that this file will
> be present and changing this may introduce significant confusion
> because of the way it is documented (which would need to be changed,
> but people might be forgiven for failing to notice the change of
> interpretation of this file).
> 
> > scaling_cur_freq:
> >
> > This should better reflect the last requested frequency, but since a
> > significant time now it is trying to show what cpuinfo_cur_freq shows.
> 
> Well, not really.
> 
> > commit c034b02e213d ("cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers")
> > commit f8475cef9008 ("x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF")
> 
> "In the majority of cases, this is the frequency of the last P-state
> requested by the scaling driver from the hardware using the scaling
> interface provided by it, which may or may not reflect the frequency
> the CPU is actually running at (due to hardware design and other
> limitations).
> 
> Some architectures (e.g. x86) may attempt to provide information more
> precisely reflecting the current CPU frequency through this attribute,
> but that still may not be the exact current CPU frequency as seen by
> the hardware at the moment."
> 
> So the problem is that on Intel x86 with HWP and intel_pstate in the
> active mode, say, "the frequency of the last P-state requested by the
> scaling driver from the hardware" is actually never known, so exposing
> it via scaling_cur_freq is not possible.
> 
> Moreover, because cpuinfo_cur_freq is not present at all in that case,
> scaling_cur_freq is the only way to allow user space to get an idea
> about the CPU current frequency.  I don't think it can be changed now
> without confusing users.
> 
What's your take on leaving the scaling_cur_freq to resolve to
arch_freq_get_on_cpu() when cpuinfo_cur_freq is not present ?
This way nothing will change for the intel_pstate + HWP
but it will allow using the info provided by arch_freq_get_on_cpu() for
cpuinfo_cur_freq if one is provided and it will automatically ignore it for
scaling_cur_freq ? Though I guess it falls under "making all confused" that you
have described donw the line.

> > What should we do ? I wonder if we will break some userspace tools
> > (which may have started relying on these changes).
> 
> We will.
> 
> IIUC, it is desirable to expose "the frequency of the last P-state
> requested by the scaling driver from the hardware" via
> scaling_cur_freq on ARM, but it is also desirable to expose an
> approximation of the actual current CPU frequency, so the only way to
> do that without confusing the heck out of everybody downstream would
> be to introduce a new attribute for this purpose and document it
> precisely.
What would introducing new attribute mean for scaling_cur_freq on x86 then ?
I do assume we would leave cpuinfo_cur_freq untouched without calling
arch_freq_get_on_cpu() (as it is now). Then keeping scaling_cur_freq to actually
use it - what would be the 3rd attribute meaning, if:
- cpuinfo_cur_freq -> actual freq as seen by the hardware (but not the
  counters?)
- scaling_cur_freq -> last requested frequency and for some cases, feedback from
  counters
- whatever_name_we_will_come_with -> average frequency based on counters ?
  not always available

 It is still bit confusing.

 ---
 BR
 Beata
Rafael J. Wysocki June 5, 2024, 2:17 p.m. UTC | #2
On Wed, Jun 5, 2024 at 10:36 AM Beata Michalska <beata.michalska@arm.com> wrote:
>
> On Mon, Jun 03, 2024 at 03:43:12PM +0200, Rafael J. Wysocki wrote:
> > On Mon, Jun 3, 2024 at 1:48 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > >
> > > Hi Beata,
> > >
> > > Thanks for taking this forward.
> > >
> > > On 03-06-24, 09:13, Beata Michalska wrote:
> > > > Some architectures provide a way to determine an average frequency over
> > > > a certain period of time, based on available performance monitors (AMU on
> > > > ARM or APERF/MPERf on x86). With those at hand, enroll arch_freq_get_on_cpu
> > > > into cpuinfo_cur_freq policy sysfs attribute handler, which is expected to
> > > > represent the current frequency of a given CPU,as obtained by the hardware.
> > > > This is the type of feedback that counters do provide.
> > >
> > > Please add blank line between paragraphs, it makes it easier to read
> > > them.
> > >
> > > > At the same time, keep the scaling_cur_freq attribute align with the docs
> > > > and make it provide most recently requested frequency, still allowing to
> > > > fallback to using arch_freq_get_on_cpu for cases when cpuinfo_cur_freq is
> > > > not available.
> > >
> > > Please split this patch into two parts, they are very distinct changes
> > > and should be kept separate.
> > >
> > > > Signed-off-by: Beata Michalska <beata.michalska@arm.com>
> > > > ---
> > > >  drivers/cpufreq/cpufreq.c | 8 ++++++--
> > > >  1 file changed, 6 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > > > index a45aac17c20f..3b0eabe4a983 100644
> > > > --- a/drivers/cpufreq/cpufreq.c
> > > > +++ b/drivers/cpufreq/cpufreq.c
> > > > @@ -758,7 +758,8 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
> > > >       ssize_t ret;
> > > >       unsigned int freq;
> > > >
> > > > -     freq = arch_freq_get_on_cpu(policy->cpu);
> > > > +     freq = !cpufreq_driver->get ? arch_freq_get_on_cpu(policy->cpu)
> > > > +                                 : 0;
> > >
> > > This is getting trickier than I thought as I dived into more details
> > > of all the changes to the file.
> > >
> > > Rafael,
> > >
> > > We probably need to decide on a policy for these two files, it is
> > > getting a bit confusing.
> > >
> > > cpuinfo_cur_freq:
> > >
> > > The purpose of this file is abundantly clear. This returns the best
> > > possible guess of the current hardware frequency. It should rely on
> > > arch_freq_get_on_cpu() or ->get() to get the value.
> >
> > Let me quote the documentation:
> >
> > "This is expected to be the frequency the hardware actually runs at.
> > If that frequency cannot be determined, this attribute should not be
> > present."
> >
> > In my reading, this has nothing to do with arch_freq_get_on_cpu(), at
> > least on x86.
> My reading on this (and I might be wrong) is that for x86,
> arch_freq_get_on_cpu() is utilizing the APERF/MPERF registers to get
> a rough/average frequency and as such it does, in a way, get a hw feedback
> and it does somewhat fall under "frequency the hardware actually runs at".

But it is not expected to somewhat fall under that definition, it is
expected to be exactly the frequency the CPU is running at when the
file is accessed.

In the past, there was hardware that could provide that information.

> Because it is an average value, it might not provide an instant view on
> current frequency, but either way, the value provided here migh be a bit off
> anyway.

Well, not just a bit.  It may be off completely.

> But then, we could adjust the timewidnow being used to make it more
> accurate.

Not on x86 AFAICS because the time window is the scheduler tick period.

On x86 it is better to use a utility like turbostat to measure
frequency instead of the cpufreq sysfs (or /proc/cpuinfo for that
matter).

> I might be looking at it the wrong way though.
> >
> > > Perhaps we can
> > > make this available all the time, instead of conditionally on ->get()
> > > callback (which isn't present for intel-pstate for example).
> >
> > We could, but then on x86 there is no expectation that this file will
> > be present and changing this may introduce significant confusion
> > because of the way it is documented (which would need to be changed,
> > but people might be forgiven for failing to notice the change of
> > interpretation of this file).
> >
> > > scaling_cur_freq:
> > >
> > > This should better reflect the last requested frequency, but since a
> > > significant time now it is trying to show what cpuinfo_cur_freq shows.
> >
> > Well, not really.
> >
> > > commit c034b02e213d ("cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers")
> > > commit f8475cef9008 ("x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF")
> >
> > "In the majority of cases, this is the frequency of the last P-state
> > requested by the scaling driver from the hardware using the scaling
> > interface provided by it, which may or may not reflect the frequency
> > the CPU is actually running at (due to hardware design and other
> > limitations).
> >
> > Some architectures (e.g. x86) may attempt to provide information more
> > precisely reflecting the current CPU frequency through this attribute,
> > but that still may not be the exact current CPU frequency as seen by
> > the hardware at the moment."
> >
> > So the problem is that on Intel x86 with HWP and intel_pstate in the
> > active mode, say, "the frequency of the last P-state requested by the
> > scaling driver from the hardware" is actually never known, so exposing
> > it via scaling_cur_freq is not possible.
> >
> > Moreover, because cpuinfo_cur_freq is not present at all in that case,
> > scaling_cur_freq is the only way to allow user space to get an idea
> > about the CPU current frequency.  I don't think it can be changed now
> > without confusing users.
> >
> What's your take on leaving the scaling_cur_freq to resolve to
> arch_freq_get_on_cpu() when cpuinfo_cur_freq is not present ?

IIUC that's what happens right now on x86 and I don't see major
problems with it.

> This way nothing will change for the intel_pstate + HWP
> but it will allow using the info provided by arch_freq_get_on_cpu() for
> cpuinfo_cur_freq if one is provided and it will automatically ignore it for
> scaling_cur_freq ? Though I guess it falls under "making all confused" that you
> have described donw the line.

Well, it does.

> > > What should we do ? I wonder if we will break some userspace tools
> > > (which may have started relying on these changes).
> >
> > We will.
> >
> > IIUC, it is desirable to expose "the frequency of the last P-state
> > requested by the scaling driver from the hardware" via
> > scaling_cur_freq on ARM, but it is also desirable to expose an
> > approximation of the actual current CPU frequency, so the only way to
> > do that without confusing the heck out of everybody downstream would
> > be to introduce a new attribute for this purpose and document it
> > precisely.
>
> What would introducing new attribute mean for scaling_cur_freq on x86 then ?

Nothing to start with, but later it might be possible to make it only
work for drivers that don't implement ->setpolicy(), ie. when
policy->cur is well defined, so scaling_cur_freq, if it were present,
would only return the last frequency requested by the driver.

> I do assume we would leave cpuinfo_cur_freq untouched without calling
> arch_freq_get_on_cpu() (as it is now). Then keeping scaling_cur_freq to actually
> use it - what would be the 3rd attribute meaning, if:
> - cpuinfo_cur_freq -> actual freq as seen by the hardware (but not the
>   counters?)

This is to be used only when the hardware can tell what frequency the
CPU is running at when it is asked for that value.

> - scaling_cur_freq -> last requested frequency and for some cases, feedback from
>   counters

Yes, but see above - I would prefer to limit it to the last requested
frequency in the future.

> - whatever_name_we_will_come_with -> average frequency based on counters ?

Yes.

>   not always available

None of them would be always present, but at least one of them would
be present on any system.

>  It is still bit confusing.

Fair enough.
Viresh Kumar June 6, 2024, 8:55 a.m. UTC | #3
On 03-06-24, 15:43, Rafael J. Wysocki wrote:
> On Mon, Jun 3, 2024 at 1:48 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > Rafael,
> >
> > We probably need to decide on a policy for these two files, it is
> > getting a bit confusing.
> >
> > cpuinfo_cur_freq:
> >
> > The purpose of this file is abundantly clear. This returns the best
> > possible guess of the current hardware frequency. It should rely on
> > arch_freq_get_on_cpu() or ->get() to get the value.
> 
> Let me quote the documentation:
> 
> "This is expected to be the frequency the hardware actually runs at.
> If that frequency cannot be determined, this attribute should not be
> present."
> 
> In my reading, this has nothing to do with arch_freq_get_on_cpu(), at
> least on x86.
> 
> > Perhaps we can
> > make this available all the time, instead of conditionally on ->get()
> > callback (which isn't present for intel-pstate for example).
> 
> We could, but then on x86 there is no expectation that this file will
> be present and changing this may introduce significant confusion
> because of the way it is documented (which would need to be changed,
> but people might be forgiven for failing to notice the change of
> interpretation of this file).
 
> > scaling_cur_freq:
> >
> > This should better reflect the last requested frequency, but since a
> > significant time now it is trying to show what cpuinfo_cur_freq shows.
> 
> Well, not really.
> 
> > commit c034b02e213d ("cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers")
> > commit f8475cef9008 ("x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF")
> 
> "In the majority of cases, this is the frequency of the last P-state
> requested by the scaling driver from the hardware using the scaling
> interface provided by it, which may or may not reflect the frequency
> the CPU is actually running at (due to hardware design and other
> limitations).
> 
> Some architectures (e.g. x86) may attempt to provide information more
> precisely reflecting the current CPU frequency through this attribute,
> but that still may not be the exact current CPU frequency as seen by
> the hardware at the moment."

Right, with time the documentation is updated and now it has mixed
the purpose of both these files IMO.

> So the problem is that on Intel x86 with HWP and intel_pstate in the
> active mode, say, "the frequency of the last P-state requested by the
> scaling driver from the hardware" is actually never known, so exposing
> it via scaling_cur_freq is not possible.
> 
> Moreover, because cpuinfo_cur_freq is not present at all in that case,
> scaling_cur_freq is the only way to allow user space to get an idea
> about the CPU current frequency.  I don't think it can be changed now
> without confusing users.

Yes, this is a valid concern. The changes in documentation have been
there for many years and changing the behavior now is not going to be
an easy / right thing to do.

> > What should we do ? I wonder if we will break some userspace tools
> > (which may have started relying on these changes).
> 
> We will.
> 
> IIUC, it is desirable to expose "the frequency of the last P-state
> requested by the scaling driver from the hardware" via
> scaling_cur_freq on ARM, but it is also desirable to expose an
> approximation of the actual current CPU frequency, so the only way to
> do that without confusing the heck out of everybody downstream would
> be to introduce a new attribute for this purpose and document it
> precisely.

Hmm, having 3 files would confuse people even more I guess. I wanted
to get this sorted to have the same behavior for all platforms, but it
seems somewhat difficult to achieve now.

What about this, hopefully this doesn't break any existing platforms
and fix the problems for ARM (and others):

- scaling_cur_freq:

  Returns the frequency of the last P-state requested by the scaling
  driver from the hardware. For set_policy() drivers, use the ->get()
  callback to get a value that can provide the best estimate to user.

  To make this work, we can add get() callback to intel and amd pstate
  drivers, and use arch_freq_get_on_cpu().

  This will keep the current behavior intact for such drivers.

- cpuinfo_cur_freq:

  Currently this file is available only if the get() callback is
  available. Maybe we can keep this behavior as is, and expose this
  now for both the pstate drivers (once above change is added). We
  will be left with only one driver that doesn't provide the get()
  callback: pasemi-cpufreq.c

  Coming back to the implementation of the file read operation, I
  think the whole purpose of arch_freq_get_on_cpu() was to get a
  better estimate (which may not be perfect) of the frequency the
  hardware is really running at (in the last window) and if a platform
  provides this, then it can be given priority over the ->get()
  callback in order to show the value to userspace.

  And so, if the platform provides, we can use arch_freq_get_on_cpu()
  first and then the get() callback.

That would leave us to this change for the core, and yes a get()
callback for intel-pstate and amd-pstate:

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7c6879efe9ef..e265f8450160 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -756,12 +756,8 @@ __weak unsigned int arch_freq_get_on_cpu(int cpu)
 static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
 {
        ssize_t ret;
-       unsigned int freq;

-       freq = arch_freq_get_on_cpu(policy->cpu);
-       if (freq)
-               ret = sprintf(buf, "%u\n", freq);
-       else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
+       if (cpufreq_driver->setpolicy && cpufreq_driver->get)
                ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
        else
                ret = sprintf(buf, "%u\n", policy->cur);
@@ -795,7 +791,10 @@ store_one(scaling_max_freq, max);
 static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
                                        char *buf)
 {
-       unsigned int cur_freq = __cpufreq_get(policy);
+       unsigned int cur_freq = arch_freq_get_on_cpu(policy->cpu);
+
+       if (!cur_freq)
+               cur_freq = __cpufreq_get(policy);

        if (cur_freq)
                return sprintf(buf, "%u\n", cur_freq);


I think this will also make more sense from documentation's
perspective, which says that:

"In the majority of cases, this is the frequency of the last P-state
requested by the scaling driver from the hardware using the scaling
interface provided by it, which may or may not reflect the frequency
the CPU is actually running at (due to hardware design and other
limitations)."

-- we do this at the core level.

"Some architectures (e.g. x86) may attempt to provide information more
precisely reflecting the current CPU frequency through this attribute,
but that still may not be the exact current CPU frequency as seen by
the hardware at the moment."

-- and this at driver level, as a special case.
Rafael J. Wysocki June 7, 2024, 2:21 p.m. UTC | #4
On Thu, Jun 6, 2024 at 10:55 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 03-06-24, 15:43, Rafael J. Wysocki wrote:
> > On Mon, Jun 3, 2024 at 1:48 PM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > Rafael,
> > >
> > > We probably need to decide on a policy for these two files, it is
> > > getting a bit confusing.
> > >
> > > cpuinfo_cur_freq:
> > >
> > > The purpose of this file is abundantly clear. This returns the best
> > > possible guess of the current hardware frequency. It should rely on
> > > arch_freq_get_on_cpu() or ->get() to get the value.
> >
> > Let me quote the documentation:
> >
> > "This is expected to be the frequency the hardware actually runs at.
> > If that frequency cannot be determined, this attribute should not be
> > present."
> >
> > In my reading, this has nothing to do with arch_freq_get_on_cpu(), at
> > least on x86.
> >
> > > Perhaps we can
> > > make this available all the time, instead of conditionally on ->get()
> > > callback (which isn't present for intel-pstate for example).
> >
> > We could, but then on x86 there is no expectation that this file will
> > be present and changing this may introduce significant confusion
> > because of the way it is documented (which would need to be changed,
> > but people might be forgiven for failing to notice the change of
> > interpretation of this file).
>
> > > scaling_cur_freq:
> > >
> > > This should better reflect the last requested frequency, but since a
> > > significant time now it is trying to show what cpuinfo_cur_freq shows.
> >
> > Well, not really.
> >
> > > commit c034b02e213d ("cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers")
> > > commit f8475cef9008 ("x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF")
> >
> > "In the majority of cases, this is the frequency of the last P-state
> > requested by the scaling driver from the hardware using the scaling
> > interface provided by it, which may or may not reflect the frequency
> > the CPU is actually running at (due to hardware design and other
> > limitations).
> >
> > Some architectures (e.g. x86) may attempt to provide information more
> > precisely reflecting the current CPU frequency through this attribute,
> > but that still may not be the exact current CPU frequency as seen by
> > the hardware at the moment."
>
> Right, with time the documentation is updated and now it has mixed
> the purpose of both these files IMO.
>
> > So the problem is that on Intel x86 with HWP and intel_pstate in the
> > active mode, say, "the frequency of the last P-state requested by the
> > scaling driver from the hardware" is actually never known, so exposing
> > it via scaling_cur_freq is not possible.
> >
> > Moreover, because cpuinfo_cur_freq is not present at all in that case,
> > scaling_cur_freq is the only way to allow user space to get an idea
> > about the CPU current frequency.  I don't think it can be changed now
> > without confusing users.
>
> Yes, this is a valid concern. The changes in documentation have been
> there for many years and changing the behavior now is not going to be
> an easy / right thing to do.
>
> > > What should we do ? I wonder if we will break some userspace tools
> > > (which may have started relying on these changes).
> >
> > We will.
> >
> > IIUC, it is desirable to expose "the frequency of the last P-state
> > requested by the scaling driver from the hardware" via
> > scaling_cur_freq on ARM, but it is also desirable to expose an
> > approximation of the actual current CPU frequency, so the only way to
> > do that without confusing the heck out of everybody downstream would
> > be to introduce a new attribute for this purpose and document it
> > precisely.
>
> Hmm, having 3 files would confuse people even more I guess. I wanted
> to get this sorted to have the same behavior for all platforms, but it
> seems somewhat difficult to achieve now.
>
> What about this, hopefully this doesn't break any existing platforms
> and fix the problems for ARM (and others):
>
> - scaling_cur_freq:
>
>   Returns the frequency of the last P-state requested by the scaling
>   driver from the hardware.

This would change the behavior for intel_pstate in the passive mode AFAICS.

ATM it calls arch_freq_get_on_cpu(), after the change it would return
policy->cur which would not be the same value most of the time.  And
in the ->adjust_perf() case policy->cur is not updated by it even.

>  For set_policy() drivers, use the ->get()
>   callback to get a value that can provide the best estimate to user.
>
>   To make this work, we can add get() callback to intel and amd pstate
>   drivers, and use arch_freq_get_on_cpu().
>
>   This will keep the current behavior intact for such drivers.

Well, the passive mode thing would need to be addressed then.

> - cpuinfo_cur_freq:
>
>   Currently this file is available only if the get() callback is
>   available. Maybe we can keep this behavior as is, and expose this
>   now for both the pstate drivers (once above change is added). We
>   will be left with only one driver that doesn't provide the get()
>   callback: pasemi-cpufreq.c

I would rather get rid of it completely.

>   Coming back to the implementation of the file read operation, I
>   think the whole purpose of arch_freq_get_on_cpu() was to get a
>   better estimate (which may not be perfect) of the frequency the
>   hardware is really running at (in the last window) and if a platform
>   provides this, then it can be given priority over the ->get()
>   callback in order to show the value to userspace.

There was a reason to add it and it was related to policy->cur being
meaningless on x86 in general (even in the acpi-cpufreq case), but
let's not go there.

Hooking this up to cpuinfo_cur_freq on x86 wouldn't make much sense
IMV because at times it is not even close to the frequency the
hardware is running at.  It comes from the previous tick period,
basically, and the hardware can adjust the frequency with a resolution
that is orders of magnitude higher than the tick rate.

>   And so, if the platform provides, we can use arch_freq_get_on_cpu()
>   first and then the get() callback.
>
> That would leave us to this change for the core, and yes a get()
> callback for intel-pstate and amd-pstate:
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 7c6879efe9ef..e265f8450160 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -756,12 +756,8 @@ __weak unsigned int arch_freq_get_on_cpu(int cpu)
>  static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
>  {
>         ssize_t ret;
> -       unsigned int freq;
>
> -       freq = arch_freq_get_on_cpu(policy->cpu);
> -       if (freq)
> -               ret = sprintf(buf, "%u\n", freq);
> -       else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
> +       if (cpufreq_driver->setpolicy && cpufreq_driver->get)
>                 ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
>         else
>                 ret = sprintf(buf, "%u\n", policy->cur);
> @@ -795,7 +791,10 @@ store_one(scaling_max_freq, max);
>  static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
>                                         char *buf)
>  {
> -       unsigned int cur_freq = __cpufreq_get(policy);
> +       unsigned int cur_freq = arch_freq_get_on_cpu(policy->cpu);
> +
> +       if (!cur_freq)
> +               cur_freq = __cpufreq_get(policy);
>
>         if (cur_freq)
>                 return sprintf(buf, "%u\n", cur_freq);
>
>
> I think this will also make more sense from documentation's
> perspective, which says that:
>
> "In the majority of cases, this is the frequency of the last P-state
> requested by the scaling driver from the hardware using the scaling
> interface provided by it, which may or may not reflect the frequency
> the CPU is actually running at (due to hardware design and other
> limitations)."
>
> -- we do this at the core level.
>
> "Some architectures (e.g. x86) may attempt to provide information more
> precisely reflecting the current CPU frequency through this attribute,
> but that still may not be the exact current CPU frequency as seen by
> the hardware at the moment."
>
> -- and this at driver level, as a special case.

Well, this sounds nice, but the changes are a bit problematic.

If you don't want 3 files, I'd drop cpuinfo_cur_freq and introduce
something else to replace it which will expose the
arch_freq_get_on_cpu() return value and will be documented
accordingly.

Then scaling_cur_freq can be (over time) switched over to returning
policy->cur in the cases when it is meaningful and -ENODATA otherwise.

This would at least allow us to stop making up stuff.
Viresh Kumar June 13, 2024, 8:23 a.m. UTC | #5
On 07-06-24, 16:21, Rafael J. Wysocki wrote:
> On Thu, Jun 6, 2024 at 10:55 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > What about this, hopefully this doesn't break any existing platforms
> > and fix the problems for ARM (and others):
> >
> > - scaling_cur_freq:
> >
> >   Returns the frequency of the last P-state requested by the scaling
> >   driver from the hardware.
> 
> This would change the behavior for intel_pstate in the passive mode AFAICS.
> 
> ATM it calls arch_freq_get_on_cpu(), after the change it would return
> policy->cur which would not be the same value most of the time.  And
> in the ->adjust_perf() case policy->cur is not updated by it even.

Yeah, we would need to do the below part to make it work.

> >  For set_policy() drivers, use the ->get()
> >   callback to get a value that can provide the best estimate to user.
> >
> >   To make this work, we can add get() callback to intel and amd pstate
> >   drivers, and use arch_freq_get_on_cpu().
> >
> >   This will keep the current behavior intact for such drivers.
> 
> Well, the passive mode thing would need to be addressed then.

Right. So this would keep the behavior of the file as is for all platforms and
simplify the core.

> > - cpuinfo_cur_freq:
> >
> >   Currently this file is available only if the get() callback is
> >   available. Maybe we can keep this behavior as is, and expose this
> >   now for both the pstate drivers (once above change is added). We
> >   will be left with only one driver that doesn't provide the get()
> >   callback: pasemi-cpufreq.c
> 
> I would rather get rid of it completely.

cpuinfo_cur_freq itself ? I thought such changes aren't allowed as they may end
up breaking userspace tools.

> >   Coming back to the implementation of the file read operation, I
> >   think the whole purpose of arch_freq_get_on_cpu() was to get a
> >   better estimate (which may not be perfect) of the frequency the
> >   hardware is really running at (in the last window) and if a platform
> >   provides this, then it can be given priority over the ->get()
> >   callback in order to show the value to userspace.
> 
> There was a reason to add it and it was related to policy->cur being
> meaningless on x86 in general (even in the acpi-cpufreq case), but
> let's not go there.

Right.

> Hooking this up to cpuinfo_cur_freq on x86 wouldn't make much sense
> IMV because at times it is not even close to the frequency the
> hardware is running at.  It comes from the previous tick period,
> basically, and the hardware can adjust the frequency with a resolution
> that is orders of magnitude higher than the tick rate.

Hmm. If that is the concern (which looks valid), how come it makes sense to do
the same on ARM ? Beata, Ionela ?

I thought, just like X86, ARM also doesn't have a guaranteed way to know the
exact frequency anymore and AMUs are providing a better picture, and so we are
moving to the same.

If we don't want it for X86, then it can be done with help of a new driver flag
CPUFREQ_NO_CPUINFO_SCALING_FREQ, instead of the availability of the get()
callback.

> Well, this sounds nice, but the changes are a bit problematic.
> 
> If you don't want 3 files, I'd drop cpuinfo_cur_freq and introduce
> something else to replace it which will expose the
> arch_freq_get_on_cpu() return value and will be documented
> accordingly.

Well it is still meaningful to show the return value of the ->get() callback
where the hardware provides it.

> Then scaling_cur_freq can be (over time) switched over to returning
> policy->cur in the cases when it is meaningful and -ENODATA otherwise.
> 
> This would at least allow us to stop making up stuff.

Maybe a third file, just for arch_freq_get_on_cpu() is not that bad of an idea
:)
Rafael J. Wysocki June 13, 2024, 9:27 a.m. UTC | #6
On Thu, Jun 13, 2024 at 10:24 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> On 07-06-24, 16:21, Rafael J. Wysocki wrote:
> > On Thu, Jun 6, 2024 at 10:55 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > What about this, hopefully this doesn't break any existing platforms
> > > and fix the problems for ARM (and others):
> > >
> > > - scaling_cur_freq:
> > >
> > >   Returns the frequency of the last P-state requested by the scaling
> > >   driver from the hardware.
> >
> > This would change the behavior for intel_pstate in the passive mode AFAICS.
> >
> > ATM it calls arch_freq_get_on_cpu(), after the change it would return
> > policy->cur which would not be the same value most of the time.  And
> > in the ->adjust_perf() case policy->cur is not updated by it even.
>
> Yeah, we would need to do the below part to make it work.
>
> > >  For set_policy() drivers, use the ->get()
> > >   callback to get a value that can provide the best estimate to user.
> > >
> > >   To make this work, we can add get() callback to intel and amd pstate
> > >   drivers, and use arch_freq_get_on_cpu().
> > >
> > >   This will keep the current behavior intact for such drivers.
> >
> > Well, the passive mode thing would need to be addressed then.
>
> Right. So this would keep the behavior of the file as is for all platforms and
> simplify the core.
>
> > > - cpuinfo_cur_freq:
> > >
> > >   Currently this file is available only if the get() callback is
> > >   available. Maybe we can keep this behavior as is, and expose this
> > >   now for both the pstate drivers (once above change is added). We
> > >   will be left with only one driver that doesn't provide the get()
> > >   callback: pasemi-cpufreq.c
> >
> > I would rather get rid of it completely.
>
> cpuinfo_cur_freq itself ? I thought such changes aren't allowed as they may end
> up breaking userspace tools.

cpuinfo_cur_freq is not always present anyway, so user space tools
need to be able to cope with the lack of it anyway.

> > >   Coming back to the implementation of the file read operation, I
> > >   think the whole purpose of arch_freq_get_on_cpu() was to get a
> > >   better estimate (which may not be perfect) of the frequency the
> > >   hardware is really running at (in the last window) and if a platform
> > >   provides this, then it can be given priority over the ->get()
> > >   callback in order to show the value to userspace.
> >
> > There was a reason to add it and it was related to policy->cur being
> > meaningless on x86 in general (even in the acpi-cpufreq case), but
> > let's not go there.
>
> Right.
>
> > Hooking this up to cpuinfo_cur_freq on x86 wouldn't make much sense
> > IMV because at times it is not even close to the frequency the
> > hardware is running at.  It comes from the previous tick period,
> > basically, and the hardware can adjust the frequency with a resolution
> > that is orders of magnitude higher than the tick rate.
>
> Hmm. If that is the concern (which looks valid), how come it makes sense to do
> the same on ARM ? Beata, Ionela ?
>
> I thought, just like X86, ARM also doesn't have a guaranteed way to know the
> exact frequency anymore and AMUs are providing a better picture, and so we are
> moving to the same.
>
> If we don't want it for X86, then it can be done with help of a new driver flag
> CPUFREQ_NO_CPUINFO_SCALING_FREQ, instead of the availability of the get()
> callback.
>
> > Well, this sounds nice, but the changes are a bit problematic.
> >
> > If you don't want 3 files, I'd drop cpuinfo_cur_freq and introduce
> > something else to replace it which will expose the
> > arch_freq_get_on_cpu() return value and will be documented
> > accordingly.
>
> Well it is still meaningful to show the return value of the ->get() callback
> where the hardware provides it.

But this is a valid point.

> > Then scaling_cur_freq can be (over time) switched over to returning
> > policy->cur in the cases when it is meaningful and -ENODATA otherwise.
> >
> > This would at least allow us to stop making up stuff.
>
> Maybe a third file, just for arch_freq_get_on_cpu() is not that bad of an idea
> :)

/me thinks so.
Beata Michalska June 13, 2024, 9:47 a.m. UTC | #7
On Thu, Jun 13, 2024 at 01:53:58PM +0530, Viresh Kumar wrote:
> On 07-06-24, 16:21, Rafael J. Wysocki wrote:
> > On Thu, Jun 6, 2024 at 10:55 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > What about this, hopefully this doesn't break any existing platforms
> > > and fix the problems for ARM (and others):
> > >
> > > - scaling_cur_freq:
> > >
> > >   Returns the frequency of the last P-state requested by the scaling
> > >   driver from the hardware.
> > 
> > This would change the behavior for intel_pstate in the passive mode AFAICS.
> > 
> > ATM it calls arch_freq_get_on_cpu(), after the change it would return
> > policy->cur which would not be the same value most of the time.  And
> > in the ->adjust_perf() case policy->cur is not updated by it even.
> 
> Yeah, we would need to do the below part to make it work.
> 
> > >  For set_policy() drivers, use the ->get()
> > >   callback to get a value that can provide the best estimate to user.
> > >
> > >   To make this work, we can add get() callback to intel and amd pstate
> > >   drivers, and use arch_freq_get_on_cpu().
> > >
> > >   This will keep the current behavior intact for such drivers.
> > 
> > Well, the passive mode thing would need to be addressed then.
> 
> Right. So this would keep the behavior of the file as is for all platforms and
> simplify the core.
> 
> > > - cpuinfo_cur_freq:
> > >
> > >   Currently this file is available only if the get() callback is
> > >   available. Maybe we can keep this behavior as is, and expose this
> > >   now for both the pstate drivers (once above change is added). We
> > >   will be left with only one driver that doesn't provide the get()
> > >   callback: pasemi-cpufreq.c
> > 
> > I would rather get rid of it completely.
> 
> cpuinfo_cur_freq itself ? I thought such changes aren't allowed as they may end
> up breaking userspace tools.
> 
> > >   Coming back to the implementation of the file read operation, I
> > >   think the whole purpose of arch_freq_get_on_cpu() was to get a
> > >   better estimate (which may not be perfect) of the frequency the
> > >   hardware is really running at (in the last window) and if a platform
> > >   provides this, then it can be given priority over the ->get()
> > >   callback in order to show the value to userspace.
> > 
> > There was a reason to add it and it was related to policy->cur being
> > meaningless on x86 in general (even in the acpi-cpufreq case), but
> > let's not go there.
> 
> Right.
> 
> > Hooking this up to cpuinfo_cur_freq on x86 wouldn't make much sense
> > IMV because at times it is not even close to the frequency the
> > hardware is running at.  It comes from the previous tick period,
> > basically, and the hardware can adjust the frequency with a resolution
> > that is orders of magnitude higher than the tick rate.
> 
> Hmm. If that is the concern (which looks valid), how come it makes sense to do
> the same on ARM ? Beata, Ionela ?
>
The arch_freq_get_on_cpu that is using AMU counters on ARM is also bound to
a tick - this is still the average over a tick period.
Now it is not exactly current frequency as seen by hardware but gives a good
estimate as of what is happening under the hood.
I guess the question would be: what is the use of instant read on hw freq for
userspace tools if, as mentioned already, the freq can change at any point of
time so it is rather cumbersome to reason about. Once the userspace gets that
info it might be no longer valid. 
> I thought, just like X86, ARM also doesn't have a guaranteed way to know the
> exact frequency anymore and AMUs are providing a better picture, and so we are
> moving to the same.
It is very platform specific - and even if there is a way to get a feedback on
every freq change - again what userspace is supposed to do when it gets stale
data?

---
BR
Beata
> 
> If we don't want it for X86, then it can be done with help of a new driver flag
> CPUFREQ_NO_CPUINFO_SCALING_FREQ, instead of the availability of the get()
> callback.
> 
> > Well, this sounds nice, but the changes are a bit problematic.
> > 
> > If you don't want 3 files, I'd drop cpuinfo_cur_freq and introduce
> > something else to replace it which will expose the
> > arch_freq_get_on_cpu() return value and will be documented
> > accordingly.
> 
> Well it is still meaningful to show the return value of the ->get() callback
> where the hardware provides it.
> 
> > Then scaling_cur_freq can be (over time) switched over to returning
> > policy->cur in the cases when it is meaningful and -ENODATA otherwise.
> > 
> > This would at least allow us to stop making up stuff.
> 
> Maybe a third file, just for arch_freq_get_on_cpu() is not that bad of an idea
> :)
> 
> -- 
> viresh
Beata Michalska June 13, 2024, 9:55 a.m. UTC | #8
On Thu, Jun 13, 2024 at 11:27:52AM +0200, Rafael J. Wysocki wrote:
> On Thu, Jun 13, 2024 at 10:24 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > On 07-06-24, 16:21, Rafael J. Wysocki wrote:
> > > On Thu, Jun 6, 2024 at 10:55 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> > > > What about this, hopefully this doesn't break any existing platforms
> > > > and fix the problems for ARM (and others):
> > > >
> > > > - scaling_cur_freq:
> > > >
> > > >   Returns the frequency of the last P-state requested by the scaling
> > > >   driver from the hardware.
> > >
> > > This would change the behavior for intel_pstate in the passive mode AFAICS.
> > >
> > > ATM it calls arch_freq_get_on_cpu(), after the change it would return
> > > policy->cur which would not be the same value most of the time.  And
> > > in the ->adjust_perf() case policy->cur is not updated by it even.
> >
> > Yeah, we would need to do the below part to make it work.
> >
> > > >  For set_policy() drivers, use the ->get()
> > > >   callback to get a value that can provide the best estimate to user.
> > > >
> > > >   To make this work, we can add get() callback to intel and amd pstate
> > > >   drivers, and use arch_freq_get_on_cpu().
> > > >
> > > >   This will keep the current behavior intact for such drivers.
> > >
> > > Well, the passive mode thing would need to be addressed then.
> >
> > Right. So this would keep the behavior of the file as is for all platforms and
> > simplify the core.
> >
> > > > - cpuinfo_cur_freq:
> > > >
> > > >   Currently this file is available only if the get() callback is
> > > >   available. Maybe we can keep this behavior as is, and expose this
> > > >   now for both the pstate drivers (once above change is added). We
> > > >   will be left with only one driver that doesn't provide the get()
> > > >   callback: pasemi-cpufreq.c
> > >
> > > I would rather get rid of it completely.
> >
> > cpuinfo_cur_freq itself ? I thought such changes aren't allowed as they may end
> > up breaking userspace tools.
> 
> cpuinfo_cur_freq is not always present anyway, so user space tools
> need to be able to cope with the lack of it anyway.
>
> > > >   Coming back to the implementation of the file read operation, I
> > > >   think the whole purpose of arch_freq_get_on_cpu() was to get a
> > > >   better estimate (which may not be perfect) of the frequency the
> > > >   hardware is really running at (in the last window) and if a platform
> > > >   provides this, then it can be given priority over the ->get()
> > > >   callback in order to show the value to userspace.
> > >
> > > There was a reason to add it and it was related to policy->cur being
> > > meaningless on x86 in general (even in the acpi-cpufreq case), but
> > > let's not go there.
> >
> > Right.
> >
> > > Hooking this up to cpuinfo_cur_freq on x86 wouldn't make much sense
> > > IMV because at times it is not even close to the frequency the
> > > hardware is running at.  It comes from the previous tick period,
> > > basically, and the hardware can adjust the frequency with a resolution
> > > that is orders of magnitude higher than the tick rate.
> >
> > Hmm. If that is the concern (which looks valid), how come it makes sense to do
> > the same on ARM ? Beata, Ionela ?
> >
> > I thought, just like X86, ARM also doesn't have a guaranteed way to know the
> > exact frequency anymore and AMUs are providing a better picture, and so we are
> > moving to the same.
> >
> > If we don't want it for X86, then it can be done with help of a new driver flag
> > CPUFREQ_NO_CPUINFO_SCALING_FREQ, instead of the availability of the get()
> > callback.
> >
> > > Well, this sounds nice, but the changes are a bit problematic.
> > >
> > > If you don't want 3 files, I'd drop cpuinfo_cur_freq and introduce
> > > something else to replace it which will expose the
> > > arch_freq_get_on_cpu() return value and will be documented
> > > accordingly.
> >
> > Well it is still meaningful to show the return value of the ->get() callback
> > where the hardware provides it.
> 
> But this is a valid point.
> 
> > > Then scaling_cur_freq can be (over time) switched over to returning
> > > policy->cur in the cases when it is meaningful and -ENODATA otherwise.
> > >
> > > This would at least allow us to stop making up stuff.
> >
> > Maybe a third file, just for arch_freq_get_on_cpu() is not that bad of an idea
> > :)
> 
> /me thinks so.
I am starting to lean towards that option.
Making both cpuinfo_cur_freq and scaling_cur_freq sane, might create even more
confusion as per which is providing what. We are already in a rather tricky
situation. The interface should be clean, leaving no room for various
interpretations - as much as possible, of course.

---
BR
Beata
diff mbox series

Patch

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a45aac17c20f..3b0eabe4a983 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -758,7 +758,8 @@  static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
 	ssize_t ret;
 	unsigned int freq;
 
-	freq = arch_freq_get_on_cpu(policy->cpu);
+	freq = !cpufreq_driver->get ? arch_freq_get_on_cpu(policy->cpu)
+				    : 0;
 	if (freq)
 		ret = sprintf(buf, "%u\n", freq);
 	else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
@@ -795,7 +796,10 @@  store_one(scaling_max_freq, max);
 static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
 					char *buf)
 {
-	unsigned int cur_freq = __cpufreq_get(policy);
+	unsigned int cur_freq = arch_freq_get_on_cpu(policy->cpu);
+
+	if (!cur_freq)
+		cur_freq = __cpufreq_get(policy);
 
 	if (cur_freq)
 		return sprintf(buf, "%u\n", cur_freq);