Message ID | 20220921063638.2489-1-kprateek.nayak@amd.com |
---|---|
State | New |
Series | ACPI: processor_idle: Skip dummy wait for processors based on the Zen microarchitecture |
On 22-09-2022 01:21 am, Borislav Petkov wrote:
> On Wed, Sep 21, 2022 at 07:15:07AM -0700, Dave Hansen wrote:
>> In the end, the delay is because of buggy, circa 2006 chipsets? So, we
>> use a CPU vendor specific check to approximate that the chipset is
>> recent and not affected by the bug? If so, is there no better way to
>> check for a newer chipset than this?
>
> So I did some git archeology but that particular addition is in some
> conglomerate, glued-together patch from 2007 which added the cpuidle
> tree:
>
> commit 4f86d3a8e297205780cca027e974fd5f81064780
> Author: Len Brown <len.brown@intel.com>
> Date: Wed Oct 3 18:58:00 2007 -0400
>
>     cpuidle: consolidate 2.6.22 cpuidle branch into one patch

In fact, the code has moved around a fair bit and the check in its
initial form goes as far back as ACPI's posting for inclusion in the
kernel in March 2002 [1]. We could not find any way of digging further
back, yet.

Prior to that, I think the ACPI enablement code was being released
independent of the kernel per
https://kernel.org/doc/ols/2004/ols2004v1-pages-121-132.pdf
and was included in Andrew's mm tree for a while.
On Thu, Sep 22, 2022 at 05:21:21PM +0200, Rafael J. Wysocki wrote:
> Well, it can be forced to use ACPI idle instead.
Yeah, I did that earlier. The dummy IO read in question costs ~3K on
average on my Coffeelake box here.
On 9/20/22 23:36, K Prateek Nayak wrote:
> Cc: stable@vger.kernel.org
> Cc: regressions@lists.linux.dev

*Is* this a regression?
[Public]

> -----Original Message-----
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Thursday, September 22, 2022 13:18
> To: Limonciello, Mario <Mario.Limonciello@amd.com>; Nayak, K Prateek
> <KPrateek.Nayak@amd.com>; linux-kernel@vger.kernel.org
> Cc: rafael@kernel.org; lenb@kernel.org; linux-acpi@vger.kernel.org;
> linux-pm@vger.kernel.org; dave.hansen@linux.intel.com; bp@alien8.de;
> tglx@linutronix.de; andi@lisas.de; puwen@hygon.cn; peterz@infradead.org;
> rui.zhang@intel.com; gpiccoli@igalia.com; daniel.lezcano@linaro.org;
> Narayan, Ananth <Ananth.Narayan@amd.com>; Shenoy, Gautham Ranjal
> <gautham.shenoy@amd.com>; Ong, Calvin <Calvin.Ong@amd.com>;
> stable@vger.kernel.org; regressions@lists.linux.dev
> Subject: Re: [PATCH] ACPI: processor_idle: Skip dummy wait for processors
> based on the Zen microarchitecture
>
> On 9/22/22 10:48, Limonciello, Mario wrote:
> >
> > 2) The title says to limit it to old intel systems, but nothing about this
> > actually enforces that.
> > It actually is limited to all Intel systems, but effectively won't be used on
> > anything but new ones because of intel_idle.
> >
> > As an idea for #2 you could check for CONFIG_INTEL_IDLE in the Intel case and
> > if it's not defined show a pr_notice_once() type of message trying to tell
> > people to use Intel Idle instead for better performance.
>
> What does that have to do with *this* patch, though?

It was just a thought triggered by your commit message title.

> If you've got CONFIG_INTEL_IDLE disabled, you'll be slow before this
> patch. You'll also be slow after this patch. It's entirely orthogonal.

Yeah, it's orthogonal, but with this discussion happening and the code
changing /anyway/, a pr_notice_once() seemed like a nice way to guide people
towards intel_idle at the same time, so they don't trip into the same problem
AMD systems do today.

> I can add a "Practically" to the subject so folks don't confuse it with
> some hard limit that is being enforced:
>
>     ACPI: processor idle: Practically limit "Dummy wait" workaround to old
>     Intel systems

That works.

> BTW, is there seriously a strong technical reason that AMD systems are
> still using this code? Or is it pure inertia?

Maybe a better question for Ananth and Prateek to comment on.
Hi,

On Thu, Sep 22, 2022 at 10:01:46AM -0700, Dave Hansen wrote:
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 16a1663d02d4..9f40917c49ef 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -531,10 +531,27 @@ static void wait_for_freeze(void)
>      /* No delay is needed if we are in guest */
>      if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
>          return;
> +    /*
> +     * Modern (>=Nehalem) Intel systems use ACPI via intel_idle,
> +     * not this code. Assume that any Intel systems using this
> +     * are ancient and may need the dummy wait. This also assumes
> +     * that the motivating chipset issue was Intel-only.
> +     */
> +    if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
> +        return;
>  #endif
> -    /* Dummy wait op - must do something useless after P_LVL2 read
> -       because chipsets cannot guarantee that STPCLK# signal
> -       gets asserted in time to freeze execution properly. */

16 years ago, I did my testing on a VIA 8233/8235 chipset (AMD Athlon/Duron)
system...... (plus reading VIA spec PDFs which mentioned "STPCLK#" etc.).

AFAIR I was doing kernel profiling (via oprofile, IIRC) for painful
performance hotspots (read: I/O accesses etc.), and this was one resulting
place which I stumbled over. And if I'm not completely mistaken, that dummy
wait I/O op *was* needed (else "nice" effects) on my system (put loud and
clear: *non*-Intel).

So one can see where my profiling effort went
(*optimizing* things, not degrading them)
--> hints that current Zen3-originating effort is not
about a regression in the "regression bug" sense -
merely a (albeit rather appreciable/sizeable... congrats!)
performance deterioration vs.
an optimal (currently non-achieved) software implementation state
(also: of PORT-based handling [vs. MWAIT], mind you!).

I still have that VIA hardware, but inactive
(had the oh-so-usual capacitors issue :( ).

Sorry for sabotaging your current fix efforts ;-) -
but thank you very much for your work/discussion
in this very central/hotpath area! (this extends to all of you...)

Greetings

Andreas Mohr
On Thu, Sep 22, 2022 at 09:42:15PM +0200, Andreas Mohr wrote:
> So one can see where my profiling effort went
> (*optimizing* things, not degrading them)
> --> hints that current Zen3-originating effort is not
> about a regression in the "regression bug" sense -
> merely a (albeit rather appreciable/sizeable... congrats!)
> performance deterioration vs.
> an optimal (currently non-achieved) software implementation state
> (also: of PORT-based handling [vs. MWAIT], mind you!).

I'd like to add a word of caution here:

AFAIK power management (here: ACPI Cx) handling generally is about a painful
*tradeoff* between achieving best-possible performance (that's the
respectable Zen3 32MB/s vs. 33MB/s argument) and achieving maximum power
savings. We all know that one can configure the system for non-idle mode
(idle=poll cmdline?) and achieve record numbers in performance
(...*and* power consumption - ouch!).

Current decision/implementation aspects AFAICS:
- why is the Zen3 config used here choosing
  less-favourable(?) PORT-based operation mode?
- Zen3 is said to not have the STPCLK# issue
  (- but then what about other more modern chipsets?)

--> we need to achieve (hopefully sufficiently precisely) a solution which
takes into account Zen3 STPCLK# improvements while
preserving "accepted" behaviour/requirements on *all* STPCLK#-hampered chipsets
("STPCLK# I/O wait is default/traditional handling"?).

Greetings

Andreas Mohr
On 9/22/22 13:10, Andreas Mohr wrote:
> (- but then what about other more modern chipsets?)
>
> --> we need to achieve (hopefully sufficiently precisely) a solution which
> takes into account Zen3 STPCLK# improvements while
> preserving "accepted" behaviour/requirements on *all* STPCLK#-hampered chipsets
> ("STPCLK# I/O wait is default/traditional handling"?).

Ideally, sure. But, we're talking about theoretically regressing the
idle behavior of some indeterminate set of old systems, the majority of
which are sitting in a puddle of capacitor goo at the bottom of a
landfill right now. This is far from an ideal situation.

FWIW, I'd much rather do something like

    if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
        (boot_cpu_data.x86_model >= 0xF))
        return;

    inl(slow_whatever);

than a Zen check. AMD has, as far as I know, been a lot more sequential
and sane about model numbers than Intel, and there are some AMD model
number range checks in the codebase today.

A check like this would also be _relatively_ future-proof in the case
that X86_FEATURE_ZEN stops getting set on future AMD CPUs. That's a lot
more likely than AMD going and reusing a <0xF model.
[Public]

> -----Original Message-----
> From: Dave Hansen <dave.hansen@intel.com>
> Sent: Thursday, September 22, 2022 16:22
> To: Andreas Mohr <andi@lisas.de>
> Cc: Nayak, K Prateek <KPrateek.Nayak@amd.com>;
> linux-kernel@vger.kernel.org; rafael@kernel.org; lenb@kernel.org;
> linux-acpi@vger.kernel.org; linux-pm@vger.kernel.org;
> dave.hansen@linux.intel.com; bp@alien8.de; tglx@linutronix.de;
> puwen@hygon.cn; Limonciello, Mario <Mario.Limonciello@amd.com>;
> peterz@infradead.org; rui.zhang@intel.com; gpiccoli@igalia.com;
> daniel.lezcano@linaro.org; Narayan, Ananth <Ananth.Narayan@amd.com>;
> Shenoy, Gautham Ranjal <gautham.shenoy@amd.com>; Ong, Calvin
> <Calvin.Ong@amd.com>; stable@vger.kernel.org;
> regressions@lists.linux.dev
> Subject: Re: [PATCH] ACPI: processor_idle: Skip dummy wait for processors
> based on the Zen microarchitecture
>
> On 9/22/22 13:10, Andreas Mohr wrote:
> > (- but then what about other more modern chipsets?)
> >
> > --> we need to achieve (hopefully sufficiently precisely) a solution which
> > takes into account Zen3 STPCLK# improvements while
> > preserving "accepted" behaviour/requirements on *all* STPCLK#-hampered chipsets
> > ("STPCLK# I/O wait is default/traditional handling"?).
>
> Ideally, sure. But, we're talking about theoretically regressing the
> idle behavior of some indeterminate set of old systems, the majority of
> which are sitting in a puddle of capacitor goo at the bottom of a
> landfill right now. This is far from an ideal situation.
>
> FWIW, I'd much rather do something like
>
>     if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
>         (boot_cpu_data.x86_model >= 0xF))
>         return;
>
>     inl(slow_whatever);
>
> than a Zen check. AMD has, as far as I know, been a lot more sequential
> and sane about model numbers than Intel, and there are some AMD model
> number range checks in the codebase today.
>
> A check like this would also be _relatively_ future-proof in the case
> that X86_FEATURE_ZEN stops getting set on future AMD CPUs. That's a lot
> more likely than AMD going and reusing a <0xF model.

If you're going to use a family check instead it should be 0x17 or newer.
(c->x86 >= 0x17)

That does match what's used to set X86_FEATURE_ZEN at least then right now too.
On Thu, Sep 22, 2022 at 02:21:31PM -0700, Dave Hansen wrote:
> On 9/22/22 13:10, Andreas Mohr wrote:
> > (- but then what about other more modern chipsets?)
> >
> > --> we need to achieve (hopefully sufficiently precisely) a solution which
> > takes into account Zen3 STPCLK# improvements while
> > preserving "accepted" behaviour/requirements on *all* STPCLK#-hampered chipsets
> > ("STPCLK# I/O wait is default/traditional handling"?).
>
> Ideally, sure. But, we're talking about theoretically regressing the
> idle behavior of some indeterminate set of old systems, the majority of
> which are sitting in a puddle of capacitor goo at the bottom of a
> landfill right now. This is far from an ideal situation.
>
> FWIW, I'd much rather do something like
>
>     if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
>         (boot_cpu_data.x86_model >= 0xF))
>         return;
>
>     inl(slow_whatever);
>
> than a Zen check. AMD has, as far as I know, been a lot more sequential
> and sane about model numbers than Intel, and there are some AMD model
> number range checks in the codebase today.
>
> A check like this would also be _relatively_ future-proof in the case
> that X86_FEATURE_ZEN stops getting set on future AMD CPUs. That's a lot
> more likely than AMD going and reusing a <0xF model.

Except you need to add VENDOR_HYGON at the very least.

All of this turns into a trainwreck real quick.
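For readers following the vendor/family discussion, here is a minimal userspace sketch of what a combined check might look like once the family threshold (0x17 or newer) and the Hygon remark are folded in. It is not kernel code and not the patch that was posted: `enum cpu_vendor`, `struct cpu_id` and `can_skip_dummy_wait()` are hypothetical names made up for illustration, and applying the same 0x17 threshold to Hygon is an assumption, not something the thread settles.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's X86_VENDOR_* constants. */
enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_HYGON, VENDOR_OTHER };

/* Hypothetical, simplified stand-in for the kernel's per-CPU info. */
struct cpu_id {
	enum cpu_vendor vendor;
	unsigned int family;	/* plays the role of c->x86 */
};

/*
 * Return true when the dummy wait could be skipped under a
 * family-based rule: AMD or Hygon, family 0x17 or newer.
 * Assumption: Hygon is treated with the same threshold as AMD.
 */
bool can_skip_dummy_wait(const struct cpu_id *c)
{
	if (c->vendor != VENDOR_AMD && c->vendor != VENDOR_HYGON)
		return false;

	return c->family >= 0x17;
}
```

A caller would return early from the wait path when `can_skip_dummy_wait()` is true and fall through to the dummy I/O read otherwise. Every extra vendor or family exception adds another branch to this predicate, which is the maintenance concern raised above.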
On Thu, Sep 22, 2022 at 02:21:31PM -0700, Dave Hansen wrote:
> FWIW, I'd much rather do something like
>
>     if ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
>         (boot_cpu_data.x86_model >= 0xF))
>         return;
>
>     inl(slow_whatever);
>
> than a Zen check. AMD has, as far as I know, been a lot more sequential
> and sane about model numbers than Intel, and there are some AMD model
> number range checks in the codebase today.

Some might be broken; apparently their SoC/Entertainment divisions have a
few out of order SKUs that were not listed in their regular documents.
(yay interweb)

I ran into this when I tried doing a Zen2 range check for retbleed. In
the end we ended up using the availability of STIBP as a heuristic to
identify Zen2+ or something.
On 22-09-2022 11:58 pm, Limonciello, Mario wrote:
>> BTW, is there seriously a strong technical reason that AMD systems are
>> still using this code? Or is it pure inertia?
>
> Maybe a better question for Ananth and Prateek to comment on.

We have evaluated using MWAIT for C2 entry and feel that there are good
micro architectural reasons to stick to IOPORT based transitions for now.

Regards,
Ananth
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 16a1663d02d4..18850aa2b79b 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -528,8 +528,11 @@ static int acpi_idle_bm_check(void)
 static void wait_for_freeze(void)
 {
 #ifdef CONFIG_X86
-	/* No delay is needed if we are in guest */
-	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+	/*
+	 * No delay is needed if we are in guest or on a processor
+	 * based on the Zen microarchitecture.
+	 */
+	if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || boot_cpu_has(X86_FEATURE_ZEN))
 		return;
 #endif
 	/* Dummy wait op - must do something useless after P_LVL2 read
Processors based on the Zen microarchitecture support IOPORT based deeper
C-states. The idle driver reads the
acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit
path which is claimed to be a "Dummy wait op" and has been around since
ACPI's introduction to Linux, dating back to Andy Grover's Mar 14, 2002
posting [1].

The comment above the dummy operation was elaborated by Andreas Mohr back
in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying 'extra'
P_LVLx access") [2] where the commit log claims:

"this dummy read was about: STPCLK# doesn't get asserted in time on
(some) chipsets, which is why we need to have a dummy I/O read to delay
further instruction processing until the CPU is fully stopped."

However, sampling certain workloads with IBS on an AMD Zen3 system shows
that a significant amount of time is spent in the dummy op, which
incorrectly gets accounted as C-State residency. A large C-State
residency value can prime the cpuidle governor to recommend a deeper
C-State during the subsequent idle instances, starting a vicious cycle,
leading to performance degradation on workloads that rapidly switch
between busy and idle phases.

One such workload is tbench where a massive performance degradation can
be observed during certain runs.
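The feedback loop described above can be pictured with a deliberately simplified toy model. Nothing below is the kernel's cpuidle governor: `struct toy_cstate`, `measured_residency_us()` and `toy_pick_state()` are hypothetical names, the selection rule is a stand-in for much richer real heuristics, and all numbers a caller feeds in would be illustrative rather than measured.

```c
#include <stdint.h>

/* Hypothetical idle-state description, ordered shallow to deep. */
struct toy_cstate {
	const char *name;
	uint64_t target_residency_us;	/* idle time needed for the state to pay off */
};

/*
 * Residency taken as a timestamp delta around the idle entry:
 * time burnt in a dummy read after wakeup is counted as if the
 * CPU had been idle for that long.
 */
uint64_t measured_residency_us(uint64_t true_idle_us, uint64_t dummy_wait_us)
{
	return true_idle_us + dummy_wait_us;
}

/*
 * Toy rule: pick the deepest state whose target residency fits
 * within the last measured residency. Assumes `states` is sorted
 * by increasing target residency.
 */
int toy_pick_state(const struct toy_cstate *states, int n,
		   uint64_t last_residency_us)
{
	int pick = 0;

	for (int i = 0; i < n; i++) {
		if (states[i].target_residency_us <= last_residency_us)
			pick = i;
	}
	return pick;
}
```

In this model a short real idle period plus a long dummy read yields a large measured residency, so the next selection lands on the deepest state again, which in turn pays the dummy read again. That is the shape of the cycle the commit message describes, under the stated simplifications.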
Following are some statistics gathered by running tbench with 128
clients, on a dual socket (2 x 64C/128T) Zen3 system with the baseline
kernel, baseline kernel keeping C2 disabled, and baseline kernel with this
patch applied keeping C2 enabled:

baseline kernel was tip:sched/core at
commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle")

Kernel        : baseline      baseline + C2 disabled   baseline + patch

Min (MB/s)    : 2215.06       33072.10 (+1393.05%)     33016.10 (+1390.52%)
Max (MB/s)    : 32938.80      34399.10                 34774.50
Median (MB/s) : 32191.80      33476.60                 33805.70
AMean (MB/s)  : 22448.55      33649.27 (+49.89%)       33865.43 (+50.85%)
AMean Stddev  : 17526.70      680.14                   880.72
AMean CoefVar : 78.07%        2.02%                    2.60%

The data shows there are edge cases that can cause massive regressions
in case of tbench. Profiling the bad runs with IBS shows a significant
amount of time being spent in acpi_idle_do_entry method:

Overhead  Command     Shared Object      Symbol
  74.76%  swapper     [kernel.kallsyms]  [k] acpi_idle_do_entry
   0.71%  tbench      [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
   0.69%  tbench_srv  [kernel.kallsyms]  [k] update_sd_lb_stats.constprop.0
   0.49%  swapper     [kernel.kallsyms]  [k] psi_group_change
   ...

Annotation of acpi_idle_do_entry method reveals almost all the time in
acpi_idle_do_entry is spent on the port I/O in wait_for_freeze():

  0.14 │      in     (%dx),%al       # <------ First "in" corresponding to inb(cx->address)
  0.51 │      mov    0x144d64d(%rip),%rax
  0.00 │      test   $0x80000000,%eax
       │    ↓ jne    62              # <------ Skip if running in guest
  0.00 │      mov    0x19800c3(%rip),%rdx
 99.33 │      in     (%dx),%eax      # <------ Second "in" corresponding to inl(acpi_gbl_FADT.xpm_timer_block.address)
  0.00 │62:   mov    -0x8(%rbp),%r12
  0.00 │      leave
  0.00 │    ← ret

This overhead is reflected in the C2 residency on the test system where
C2 is an IOPORT based C-State.
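For anyone cross-checking the tbench table above, the bracketed percentages and the CoefVar row appear to follow the usual definitions: relative change against the baseline column, and standard deviation divided by the mean. The helpers below are a small sketch of that arithmetic; the function names are hypothetical, and recomputed values agree with the printed ones only to within about 0.01 percentage points, since the table seems to drop digits beyond the second decimal.

```c
/* Relative change of `value` against `base`, in percent. */
double pct_change(double base, double value)
{
	return (value - base) / base * 100.0;
}

/* Coefficient of variation: stddev as a percentage of the mean. */
double coef_var_pct(double stddev, double mean)
{
	return stddev / mean * 100.0;
}
```

For example, `pct_change(2215.06, 33072.10)` corresponds to the Min (MB/s) entry for the C2-disabled kernel, and `coef_var_pct(17526.70, 22448.55)` corresponds to the baseline AMean CoefVar entry.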
The total C-state residency reported by "cpupower idle-info" on CPU0 for
good and bad case over the 80s tbench run is as follows (all numbers are
in microseconds):

                       Good Run     Bad Run
                                    (Baseline)

POLL:                  43338        6231      (-85.62%)
C1 (MWAIT Based):      23576156     363861    (-98.45%)
C2 (IOPORT Based):     10781218     77027280  (+614.45%)

The larger residency value in bad case leads to the system recommending
C2 state again for subsequent idle instances. The pattern lasts till the
end of the tbench run. Following is the breakdown of "entry_method"
passed to acpi_idle_do_entry during good run and bad run:

                                                                    Good Run   Bad Run
                                                                               (Baseline)

Number of times acpi_idle_do_entry was called:                      6149573    6149050  (-0.01%)
 |-> Number of times entry_method was "ACPI_CSTATE_FFH":            6141494    88144    (-98.56%)
 |-> Number of times entry_method was "ACPI_CSTATE_HALT":           0          0        (+0.00%)
 |-> Number of times entry_method was "ACPI_CSTATE_SYSTEMIO":       8079       6060906  (+74920.49%)

For processors based on the Zen microarchitecture, this dummy wait op is
unnecessary and can be skipped when choosing IOPORT based C-States to
avoid polluting the C-state residency information.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=972c16130d9dc182cedcdd408408d9eacc7d6a2d [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b488f02156d3deb08f5ad7816d565c370a8cc6f1 [2]

Suggested-by: Calvin Ong <calvin.ong@amd.com>
Cc: stable@vger.kernel.org
Cc: regressions@lists.linux.dev
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 drivers/acpi/processor_idle.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)