diff mbox series

[v2] x86,acpi: Limit "Dummy wait" workaround to older AMD and Intel processors

Message ID 20220923153801.9167-1-kprateek.nayak@amd.com
State New
Headers show
Series [v2] x86,acpi: Limit "Dummy wait" workaround to older AMD and Intel processors | expand

Commit Message

K Prateek Nayak Sept. 23, 2022, 3:38 p.m. UTC
Processors based on the Zen microarchitecture support IOPORT based deeper
C-states. The ACPI idle driver reads the
acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit
path which is claimed to be a "Dummy wait op" and has been around since
ACPI's introduction to Linux dating back to Andy Grover's Mar 14, 2002
posting [1].

Old, circa 2002 chipsets have a bug which was elaborated by Andreas Mohr
back in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying
'extra' P_LVLx access") where the commit log claims:
"this dummy read was about: STPCLK# doesn't get asserted in time on
(some) chipsets, which is why we need to have a dummy I/O read to delay
further instruction processing until the CPU is fully stopped."

This workaround is very painful on modern systems with a large number of
cores. The "inl()" can take thousands of cycles. Sampling certain
workloads with IBS on AMD Zen3 system shows that a significant amount of
time is spent in the dummy op, which incorrectly gets accounted as
C-State residency. A large C-State residency value can prime the cpuidle
governor to recommend a deeper C-State during the subsequent idle
instances, starting a vicious cycle, leading to performance degradation
on workloads that rapidly switch between busy and idle phases.
(For the extent of the performance degradation refer link [2])

The dummy wait is unnecessary on processors based on the Zen
microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent
polluting the C-state residency information. Among the pre-family 17h
AMD processors, there has been at least one report of an AMD Athlon on a
VIA chipset (circa 2006) where this this problem was seen (see [3] for
report by Andreas Mohr).

Modern Intel processors use MWAIT based C-States in the intel_idle driver
and are not impacted by this code path. For older Intel processors that
use the acpi_idle driver, a workaround was suggested by Dave Hansen and
Rafael J. Wysocki to regard all Intel chipsets using the IOPORT based
C-state management as being affected by this problem (see [4] for
workaround proposed).

For these reasons, mark all the Intel processors and pre-family 17h
AMD processors with x86_BUG_STPCLK. In the acpi_idle driver, restrict the
dummy wait during IOPORT based C-state transitions to only these
processors.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=972c16130d9dc182cedcdd408408d9eacc7d6a2d [1]
Link: https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/ [2]
Link: https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/ [3]
Link: https://lore.kernel.org/lkml/88c17568-8694-940a-0f1f-9d345e8dcbdb@intel.com/ [4]

Suggested-by: Calvin Ong <calvin.ong@amd.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
CC: Pu Wen <puwen@hygon.cn>
Cc: stable@vger.kernel.org
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v1->v2:
o Introduce X86_BUG_STPCLK to mark chipsets as being affected by the
  STPCLK# signal assertion issue.
o Mark all Intel chipsets and pre fam-17h AMD chipsets as being affected
  by the X86_BUG_STPCLK.
o Skip dummy xpm_timer_block read in IOPORT based C-state exit path in
  ACPI processor_idle if chipset is not affected by X86_BUG_STPCLK.
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/amd.c          | 12 ++++++++++++
 arch/x86/kernel/cpu/intel.c        | 12 ++++++++++++
 drivers/acpi/processor_idle.c      |  8 ++++++++
 4 files changed, 33 insertions(+)

Comments

David Laight Sept. 26, 2022, 8:19 a.m. UTC | #1
From: K Prateek Nayak
> Sent: 23 September 2022 16:38
....
> 
> This workaround is very painful on modern systems with a large number of
> cores. The "inl()" can take thousands of cycles. Sampling certain
> workloads with IBS on AMD Zen3 system shows that a significant amount of
> time is spent in the dummy op, which incorrectly gets accounted as
> C-State residency. A large C-State residency value can prime the cpuidle
> governor to recommend a deeper C-State during the subsequent idle
> instances, starting a vicious cycle, leading to performance degradation
> on workloads that rapidly switch between busy and idle phases.
> (For the extent of the performance degradation refer link [2])

Isn't that a horrid bug itself?
Sounds like it affects any code that is doing pio reads of hardware buffers.
While they are slow they are necessary.
IIRC any PCIe read into an Altera fpga takes about 128 cycles of the 125MHz
clock. The Intel cpu I've checked will only execute one concurrent PCIe read
for each cpu core - so the cpu soon stalls for thousands of clocks.

> The dummy wait is unnecessary on processors based on the Zen
> microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent
> polluting the C-state residency information. Among the pre-family 17h
> AMD processors, there has been at least one report of an AMD Athlon on a
> VIA chipset (circa 2006) where this this problem was seen (see [3] for
> report by Andreas Mohr).
> 
> Modern Intel processors use MWAIT based C-States in the intel_idle driver
> and are not impacted by this code path. For older Intel processors that
> use the acpi_idle driver, a workaround was suggested by Dave Hansen and
> Rafael J. Wysocki to regard all Intel chipsets using the IOPORT based
> C-state management as being affected by this problem (see [4] for
> workaround proposed).

Can you use a surrogate (maybe AVX support?) to exclude large groups
on modern cpu?

Another possibility is that is the io address doesn't really matter
are there any locations that have moved on-die and are now executed
much faster than the ISA bus speed of older systems?
Or do all the 'originally ISA' peripherals still run at ISA speeds?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Peter Zijlstra Sept. 26, 2022, 12:07 p.m. UTC | #2
On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index ef4775c6db01..fcd3617ed315 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -460,5 +460,6 @@
>  #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>  #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
>  #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
> +#define X86_BUG_STPCLK			X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
>  
>  #endif /* _ASM_X86_CPUFEATURES_H */
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 48276c0e479d..8cb5887a53a3 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c)
>  	if (!cpu_has(c, X86_FEATURE_XENPV))
>  		set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
>  
> +	/*
> +	 * CPUs based on the Zen microarchitecture (Fam 17h onward) can
> +	 * guarantee that STPCLK# signal is asserted in time after the
> +	 * P_LVL2 read to freeze execution after an IOPORT based C-state
> +	 * entry. Among the older AMD processors, there has been at least
> +	 * one report of an AMD Athlon processor on a VIA chipset
> +	 * (circa 2006) having this issue. Mark all these older AMD
> +	 * processor families as being affected.
> +	 */
> +	if (c->x86 < 0x17)
> +		set_cpu_bug(c, X86_BUG_STPCLK);
> +
>  	/*
>  	 * Turn on the Instructions Retired free counter on machines not
>  	 * susceptible to erratum #1054 "Instructions Retired Performance
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 2d7ea5480ec3..96fe1320c238 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c)
>  		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
>  		set_cpu_bug(c, X86_BUG_MONITOR);
>  
> +	/*
> +	 * Intel chipsets prior to Nehalem used the ACPI processor_idle
> +	 * driver for C-state management. Some of these processors that
> +	 * used IOPORT based C-states could not guarantee that STPCLK#
> +	 * signal gets asserted in time after P_LVL2 read to freeze
> +	 * execution properly. Since a clear cut-off point is not known
> +	 * as to when this bug was solved, mark all the chipsets as
> +	 * being affected. Only the ones that use IOPORT based C-state
> +	 * transitions via the acpi_idle driver will be impacted.
> +	 */
> +	set_cpu_bug(c, X86_BUG_STPCLK);
> +
>  #ifdef CONFIG_X86_64
>  	if (c->x86 == 15)
>  		c->x86_cache_alignment = c->x86_clflush_size * 2;

Quiz time:

  #define X86_VENDOR_INTEL       0
  #define X86_VENDOR_CYRIX       1
  #define X86_VENDOR_AMD         2
  #define X86_VENDOR_UMC         3
  #define X86_VENDOR_CENTAUR     5
  #define X86_VENDOR_TRANSMETA   7
  #define X86_VENDOR_NSC         8
  #define X86_VENDOR_HYGON       9
  #define X86_VENDOR_ZHAOXIN     10
  #define X86_VENDOR_VORTEX      11
  #define X86_VENDOR_NUM         12
  #define X86_VENDOR_UNKNOWN     0xff

For how many of the above have you changed behaviour?

Not to mention that this is the gazillion-th time AMD has failed to
change HYGON in lock-step. That's Zen too -- deal with it.
K Prateek Nayak Sept. 26, 2022, 4:32 p.m. UTC | #3
Hello Peter,

On 9/26/2022 5:37 PM, Peter Zijlstra wrote:
> On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
>> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
>> index ef4775c6db01..fcd3617ed315 100644
>> --- a/arch/x86/include/asm/cpufeatures.h
>> +++ b/arch/x86/include/asm/cpufeatures.h
>> @@ -460,5 +460,6 @@
>>  #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
>>  #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
>>  #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
>> +#define X86_BUG_STPCLK			X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
>>  
>>  #endif /* _ASM_X86_CPUFEATURES_H */
>> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
>> index 48276c0e479d..8cb5887a53a3 100644
>> --- a/arch/x86/kernel/cpu/amd.c
>> +++ b/arch/x86/kernel/cpu/amd.c
>> @@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c)
>>  	if (!cpu_has(c, X86_FEATURE_XENPV))
>>  		set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
>>  
>> +	/*
>> +	 * CPUs based on the Zen microarchitecture (Fam 17h onward) can
>> +	 * guarantee that STPCLK# signal is asserted in time after the
>> +	 * P_LVL2 read to freeze execution after an IOPORT based C-state
>> +	 * entry. Among the older AMD processors, there has been at least
>> +	 * one report of an AMD Athlon processor on a VIA chipset
>> +	 * (circa 2006) having this issue. Mark all these older AMD
>> +	 * processor families as being affected.
>> +	 */
>> +	if (c->x86 < 0x17)
>> +		set_cpu_bug(c, X86_BUG_STPCLK);
>> +
>>  	/*
>>  	 * Turn on the Instructions Retired free counter on machines not
>>  	 * susceptible to erratum #1054 "Instructions Retired Performance
>> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
>> index 2d7ea5480ec3..96fe1320c238 100644
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c)
>>  		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
>>  		set_cpu_bug(c, X86_BUG_MONITOR);
>>  
>> +	/*
>> +	 * Intel chipsets prior to Nehalem used the ACPI processor_idle
>> +	 * driver for C-state management. Some of these processors that
>> +	 * used IOPORT based C-states could not guarantee that STPCLK#
>> +	 * signal gets asserted in time after P_LVL2 read to freeze
>> +	 * execution properly. Since a clear cut-off point is not known
>> +	 * as to when this bug was solved, mark all the chipsets as
>> +	 * being affected. Only the ones that use IOPORT based C-state
>> +	 * transitions via the acpi_idle driver will be impacted.
>> +	 */
>> +	set_cpu_bug(c, X86_BUG_STPCLK);
>> +
>>  #ifdef CONFIG_X86_64
>>  	if (c->x86 == 15)
>>  		c->x86_cache_alignment = c->x86_clflush_size * 2;
> 
> Quiz time:
> 
>   #define X86_VENDOR_INTEL       0
>   #define X86_VENDOR_CYRIX       1
>   #define X86_VENDOR_AMD         2
>   #define X86_VENDOR_UMC         3
>   #define X86_VENDOR_CENTAUR     5
>   #define X86_VENDOR_TRANSMETA   7
>   #define X86_VENDOR_NSC         8
>   #define X86_VENDOR_HYGON       9
>   #define X86_VENDOR_ZHAOXIN     10
>   #define X86_VENDOR_VORTEX      11
>   #define X86_VENDOR_NUM         12
>   #define X86_VENDOR_UNKNOWN     0xff
> 
> For how many of the above have you changed behaviour?

The proposed logic does alter the behavior for x86 chipsets that depend
on acpi_idle driver and have IOPORT based C-state. Based on what
Rafael and Dave suggested, I have marked all Intel processors to be
affected by this bug. In light of Andreas' report, I've also marked
all the pre-family 17h AMD processors to be affected by this bug to avoid
causing any regression.

It is hard to tell if any other vendor had this bug in their chipsets.
Dave's patch does not make this consideration either and limits the
dummy operation to only Intel chipsets using acpi_idle driver.
(https://lore.kernel.org/all/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com/)
If folks reported a regression, I would have been happy to fix it for
them.

> 
> Not to mention that this is the gazillion-th time AMD has failed to
> change HYGON in lock-step. That's Zen too -- deal with it.

Hygon is based on the Zen microarchitecture (equivalent to Fam 17h on
AMD) and they too do not need the the dummy wait op to ensure correct
behavior. Hence, they are not marked with x86_BUG_STPCLK.

In the patch description, I've called out:

"mark all the Intel processors and pre-family 17h
AMD processors with x86_BUG_STPCLK. In the acpi_idle driver, restrict the
dummy wait during IOPORT based C-state transitions to only these
processors."
 
Both Hygon and AMD Fam 17h+, which are based on Zen microachitecture, are
not affected by x86_BUG_STPCLK and hence skip the dummy wait op.

Did I miss something?
--
Thanks and Regards,
Prateek
Andreas Mohr Sept. 26, 2022, 7:55 p.m. UTC | #4
Hi,

Am Mon, Sep 26, 2022 at 10:02:44PM +0530 schrieb K Prateek Nayak:
> Hello Peter,
> 
> On 9/26/2022 5:37 PM, Peter Zijlstra wrote:
> > For how many of the above have you changed behaviour?
> 
> The proposed logic does alter the behavior for x86 chipsets that depend
> on acpi_idle driver and have IOPORT based C-state. Based on what
> Rafael and Dave suggested, I have marked all Intel processors to be
> affected by this bug. In light of Andreas' report, I've also marked
> all the pre-family 17h AMD processors to be affected by this bug to avoid
> causing any regression.
> 
> It is hard to tell if any other vendor had this bug in their chipsets.
> Dave's patch does not make this consideration either and limits the
> dummy operation to only Intel chipsets using acpi_idle driver.
> (https://lore.kernel.org/all/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com/)
> If folks reported a regression, I would have been happy to fix it for
> them.

Despite certain, umm, controversies, the discussion/patch activities
appear to be heading into a good direction ;)



This text somehow prompted me to think of
whether STPCLK# [quirk] behaviour is
a property of the CPU family, or the chipset, or actually a combination of it.

Given that [from recollection] VIA 8233/8235 spec PDFs do mention STPCLK#,
possibly a chipset does have a say in it? (which
obviously would then mean that
the kernel's quirk state decision-making would have to be refined)

Minor reference (note 8237, not 8233):
http://www.chipset-ic.com/datasheet/VT8237.pdf
  "STPCLK# is asserted by the VT8237R to the CPU to throttle the processor."
  (and many other STPCLK# mentions there)

Greetings

Andreas Mohr
diff mbox series

Patch

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..fcd3617ed315 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@ 
 #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
 #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_STPCLK			X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 48276c0e479d..8cb5887a53a3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -988,6 +988,18 @@  static void init_amd(struct cpuinfo_x86 *c)
 	if (!cpu_has(c, X86_FEATURE_XENPV))
 		set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
 
+	/*
+	 * CPUs based on the Zen microarchitecture (Fam 17h onward) can
+	 * guarantee that STPCLK# signal is asserted in time after the
+	 * P_LVL2 read to freeze execution after an IOPORT based C-state
+	 * entry. Among the older AMD processors, there has been at least
+	 * one report of an AMD Athlon processor on a VIA chipset
+	 * (circa 2006) having this issue. Mark all these older AMD
+	 * processor families as being affected.
+	 */
+	if (c->x86 < 0x17)
+		set_cpu_bug(c, X86_BUG_STPCLK);
+
 	/*
 	 * Turn on the Instructions Retired free counter on machines not
 	 * susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 2d7ea5480ec3..96fe1320c238 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -696,6 +696,18 @@  static void init_intel(struct cpuinfo_x86 *c)
 		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
 		set_cpu_bug(c, X86_BUG_MONITOR);
 
+	/*
+	 * Intel chipsets prior to Nehalem used the ACPI processor_idle
+	 * driver for C-state management. Some of these processors that
+	 * used IOPORT based C-states could not guarantee that STPCLK#
+	 * signal gets asserted in time after P_LVL2 read to freeze
+	 * execution properly. Since a clear cut-off point is not known
+	 * as to when this bug was solved, mark all the chipsets as
+	 * being affected. Only the ones that use IOPORT based C-state
+	 * transitions via the acpi_idle driver will be impacted.
+	 */
+	set_cpu_bug(c, X86_BUG_STPCLK);
+
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 16a1663d02d4..493f9ccdb72d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -528,6 +528,14 @@  static int acpi_idle_bm_check(void)
 static void wait_for_freeze(void)
 {
 #ifdef	CONFIG_X86
+	/*
+	 * A dummy wait operation is only required for those chipsets
+	 * that cannot assert STPCLK# signal in time after P_LVL2 read.
+	 * If a chipset is not affected by this problem, skip it.
+	 */
+	if (!static_cpu_has_bug(X86_BUG_STPCLK))
+		return;
+
 	/* No delay is needed if we are in guest */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return;