Message ID | 20220415133356.179706384@linutronix.de |
---|---|
Headers | show |
Series | x86/cpu: Consolidate APERF/MPERF code | expand |
On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote: > APERF/MPERF is utilized in two ways: > > 1) Ad hoc readout of CPU frequency which requires IPIs > > 2) Frequency scale calculation for frequency invariant scheduling which > reads APERF/MPERF on every tick. > > These are completely independent code parts. Eric observed long latencies > when reading /proc/cpuinfo which reads out CPU frequency via #1 and > proposed to replace the per CPU single IPI with a broadcast IPI. > > While this makes the latency smaller, it is not necessary at all because #2 > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs > which are excluded from IPI already. > > It could be argued that not all APERF/MPERF capable systems have the > required BIOS information to enable frequency invariance support, but in > practice most of them do. So the APERF/MPERF sampling can be made > unconditional and just the frequency scale calculation for the scheduler > excluded. > > The following series consolidates that. Acked-by: Paul E. McKenney <paulmck@kernel.org> > Thanks, > > tglx > --- > arch/x86/include/asm/cpu.h | 2 > arch/x86/include/asm/topology.h | 17 - > arch/x86/kernel/acpi/cppc.c | 28 -- > arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++-------- > arch/x86/kernel/cpu/proc.c | 2 > arch/x86/kernel/smpboot.c | 358 ----------------------------- > fs/proc/cpuinfo.c | 6 > include/linux/cpufreq.h | 1 > 8 files changed, 405 insertions(+), 483 deletions(-) > >
Hi Thomas, Rafael, Thank you for your replies. On 2022.04.19 14:11 Thomas Gleixner wrote: > On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote: >> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote: >>> For intel_pstate (active), both HWP enabled or disabled, the behaviour >>> of scaling_cur_freq is inconsistent with prior to this patch set and other >>> scaling driver governor combinations. >>> >>> Note there is no issue with " grep MHz /proc/cpuinfo" for any >>> combination. >>> >>> Examples: >>> >>> No-HWP: >>> >>> active/powersave: >>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq >>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418 >>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006 >>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005 >>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 >> >> That's because after the changes in this series scaling_cur_freq >> returns 0 if the given CPU is idle. > > Which is sensible IMO as there is really no point in waking an idle CPU > just to read those MSRs, then wait 20ms wake it up again to read those > MSRs again. I totally agree. It is the inconsistency for what is displayed as a function of driver/governor that is my concern. > >> I guess it could return the last known result, but that wouldn't be >> more meaningful. > > Right. How about something like this, which I realize might break something else, but just to demonstrate: doug@s19:~/kernel/linux$ git diff diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 80f535cc8a75..a161e75794cd 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -710,7 +710,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf) else if (cpufreq_driver->setpolicy && cpufreq_driver->get) ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu)); else - ret = sprintf(buf, "%u\n", policy->cur); + ret = sprintf(buf, "%u\n", freq); return ret; } Note: I left the other 0 return condition, because I do not know what uses it. Which gives: acpi-cpufreq/schedutil doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4100723 intel_pstate/powersave (no-HWP) doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800295 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800015 intel_cpufreq/schedutil (no-HWP) doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:1971265 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2785446 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 Which I suggest is more consistent. Note: because it was deleted from this thread, and just for reference, I'll repost the previous intel_cpufreq/schedutil (no-HWP) output: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000 ... Doug
On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote: > On 2022.04.19 14:11 Thomas Gleixner wrote: >>> That's because after the changes in this series scaling_cur_freq >>> returns 0 if the given CPU is idle. >> >> Which is sensible IMO as there is really no point in waking an idle CPU >> just to read those MSRs, then wait 20ms wake it up again to read those >> MSRs again. > > I totally agree. > It is the inconsistency for what is displayed as a function of driver/governor > that is my concern. Raphael suggested to move the show_cpuinfo() logic into the a/mperf code. See below. Thanks, tglx --- Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo() From: Thomas Gleixner <tglx@linutronix.de> Date: Mon, 25 Apr 2022 15:19:29 +0200 Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return 0 when the last sample was too long ago. show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to return cpu_khz, but the readout code for the per CPU scaling frequency in sysfs does not. Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same when reading /proc/cpuinfo and /sys/..../cur_scaling_freq. Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++--- arch/x86/kernel/cpu/proc.c | 7 +------ 2 files changed, 8 insertions(+), 9 deletions(-) --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -405,12 +405,12 @@ void arch_scale_freq_tick(void) unsigned int arch_freq_get_on_cpu(int cpu) { struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu); + unsigned int seq, freq; unsigned long last; - unsigned int seq; u64 acnt, mcnt; if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)) - return 0; + goto fallback; do { seq = raw_read_seqcount_begin(&s->seq); @@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cp * which covers idle and NOHZ full CPUs. */ if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) - return 0; + goto fallback; return div64_u64((cpu_khz * acnt), mcnt); + +fallback: + freq = cpufreq_quick_get(cpu); + return freq ? freq : cpu_khz; } static int __init bp_init_aperfmperf(void) --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file if (cpu_has(c, X86_FEATURE_TSC)) { unsigned int freq = arch_freq_get_on_cpu(cpu); - if (!freq) - freq = cpufreq_quick_get(cpu); - if (!freq) - freq = cpu_khz; - seq_printf(m, "cpu MHz\t\t: %u.%03u\n", - freq / 1000, (freq % 1000)); + seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000)); } /* Cache size */
On Mon, Apr 25, 2022 at 8:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote: >> On 2022.04.19 14:11 Thomas Gleixner wrote: >>>> That's because after the changes in this series scaling_cur_freq >>>> returns 0 if the given CPU is idle. >>> >>> Which is sensible IMO as there is really no point in waking an idle CPU >>> just to read those MSRs, then wait 20ms wake it up again to read those >>> MSRs again. >> >> I totally agree. >> It is the inconsistency for what is displayed as a function of driver/governor >> that is my concern. > > Raphael suggested to move the show_cpuinfo() logic into the a/mperf > code. See below. Hi Thomas, I tested the patch on top of your 10 patch set on kernel 5.18-rc3. It addresses my consistency concerns. Thank you ... Doug > --- > Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo() > From: Thomas Gleixner <tglx@linutronix.de> > Date: Mon, 25 Apr 2022 15:19:29 +0200 > ...