Message ID | 20201125144847.3920-3-punitagrawal@gmail.com |
---|---|
State | New |
Headers | show |
Series | Add processor to the ignore PSD override list | expand |
On 11/25/20 8:48 AM, Punit Agrawal wrote: > Booting Linux on a Zen2 based processor (family: 0x17, model: 0x60, > stepping: 0x1) shows the following message in the logs - > > acpi_cpufreq: overriding BIOS provided _PSD data > > Although commit 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting > on new AMD CPUs") indicates that the override is not required for Zen3 > onwards, it seems that domain information can be trusted even on Given that the original quirk acd316248205 ("acpi-cpufreq: Add quirk to disable _PSD usage on all AMD CPUs") was submitted 8 years ago, it is not a surprise that some system firmware before family 19h might been fixed. Unfortunately, like what Punit said, I didn't find any documentation on the list of existing, fixed CPUs. In my commit 5368512abe ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs"), family 19h was picked because 1) we know BIOS will fix this problem for this specific generation of CPUs, and 2) without this commit, it _might_ cause issues on certain CPUs. In summary, this patch is fine if Punit already verified it. My only concern is the list can potentially increase over the time, and we will keep coming back to fix override_acpi_psd() function. > certain earlier systems. Update the check, to skip the override for > Zen2 processors known to work without the override. > > Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> > Cc: Wei Huang <wei.huang2@amd.com> > --- > drivers/cpufreq/acpi-cpufreq.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c > index b1e7df96d428..29f1cd93541e 100644 > --- a/drivers/cpufreq/acpi-cpufreq.c > +++ b/drivers/cpufreq/acpi-cpufreq.c > @@ -198,8 +198,13 @@ static int override_acpi_psd(unsigned int cpu_id) > if (c->x86_vendor == X86_VENDOR_AMD) { > if (!check_amd_hwpstate_cpu(cpu_id)) > return false; > - > - return c->x86 < 0x19; > + /* > + * CPU's before Zen3 (except some Zen2) need the > + * override. > + */ > + return (c->x86 < 0x19) && > + !(c->x86 == 0x17 && c->x86_model == 0x60 && > + c->x86_stepping == 0x1); > } > > return false; >
On Mon, Dec 07, 2020 at 02:20:55PM -0600, Wei Huang wrote: > In summary, this patch is fine if Punit already verified it. My only > concern is the list can potentially increase over the time, and we will > keep coming back to fix override_acpi_psd() function. Can the detection be done by looking at those _PSD things instead of comparing f/m/s? And, alternatively, what is this fixing? So what if some zen2 boxes have correct _PSD objects? Why do they need to ignore the override? Hmmm? -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
On 12/7/20 2:26 PM, Borislav Petkov wrote: > On Mon, Dec 07, 2020 at 02:20:55PM -0600, Wei Huang wrote: >> In summary, this patch is fine if Punit already verified it. My only >> concern is the list can potentially increase over the time, and we will >> keep coming back to fix override_acpi_psd() function. > > Can the detection be done by looking at those _PSD things instead of > comparing f/m/s? Not I am aware of. I don't know the correlation between _PSD configuration and CPU's f/m/s. > > And, alternatively, what is this fixing? > > So what if some zen2 boxes have correct _PSD objects? Why do they need > to ignore the override? I think we shouldn't override zen2 if _PSD is correct. In my opinion, there are two approaches: * Keep override_acpi_psd() Let us keep the original quirk and override_acpi_psd() function. Over the time, people may want to add new CPUs to override_acpi_psd(). The maintainer may declare that only CPUs >= family 17h will be fixed, to avoid exploding the check-list. * Remove the quirk completely We can completely remove commit acd316248205 ("acpi-cpufreq: Add quirk to disable _PSD usage on all AMD CPUs")? I am not sure what "AMD desktop boards" was referring to in the original commit message of acd316248205. Maybe such machines aren't in use anymore. > > Hmmm? >
On Mon, Dec 07, 2020 at 04:07:52PM -0600, Wei Huang wrote: > I think we shouldn't override zen2 if _PSD is correct. In my opinion, > there are two approaches: > > * Keep override_acpi_psd() > Let us keep the original quirk and override_acpi_psd() function. Over > the time, people may want to add new CPUs to override_acpi_psd(). The > maintainer may declare that only CPUs >= family 17h will be fixed, to > avoid exploding the check-list. > > * Remove the quirk completely > We can completely remove commit acd316248205 ("acpi-cpufreq: Add quirk > to disable _PSD usage on all AMD CPUs")? I am not sure what "AMD desktop > boards" was referring to in the original commit message of acd316248205. > Maybe such machines aren't in use anymore. * Third option: do not do anything. Why? - Let sleeping dogs lie and leave the workaround acd316248205 for old machines. - Make a clear cut that the override is not needed from Zen3 on, i.e., your patch 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs") Punit's commit message reads "...indicates that the override is not required for Zen3 onwards, it seems that domain information can be trusted even on certain earlier systems." That's not nearly a justification in my book to do this on anything < Zen3. This way you have a clear cut, you don't need to deal with adding any more models to override_acpi_psd() and all is good. Unless there's a better reason to skip the override on machines < Zen3 but I haven't heard any so far... Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
On 12/7/20 4:30 PM, Borislav Petkov wrote: > On Mon, Dec 07, 2020 at 04:07:52PM -0600, Wei Huang wrote: >> I think we shouldn't override zen2 if _PSD is correct. In my opinion, >> there are two approaches: >> >> * Keep override_acpi_psd() >> Let us keep the original quirk and override_acpi_psd() function. Over >> the time, people may want to add new CPUs to override_acpi_psd(). The >> maintainer may declare that only CPUs >= family 17h will be fixed, to >> avoid exploding the check-list. >> >> * Remove the quirk completely >> We can completely remove commit acd316248205 ("acpi-cpufreq: Add quirk >> to disable _PSD usage on all AMD CPUs")? I am not sure what "AMD desktop >> boards" was referring to in the original commit message of acd316248205. >> Maybe such machines aren't in use anymore. > > * Third option: do not do anything. Why? I am fine with this option, unless Punit can prove me that I am wrong, (i.e. some Zen2 is broken because of acd316248205). > > - Let sleeping dogs lie and leave the workaround acd316248205 for old > machines. > > - Make a clear cut that the override is not needed from Zen3 on, i.e., > your patch > > 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs") > > > Punit's commit message reads "...indicates that the override is not > required for Zen3 onwards, it seems that domain information can be > trusted even on certain earlier systems." > > That's not nearly a justification in my book to do this on anything < Zen3. > > This way you have a clear cut, you don't need to deal with adding any > more models to override_acpi_psd() and all is good. > > Unless there's a better reason to skip the override on machines < Zen3 > but I haven't heard any so far... > > Thx. >
Borislav Petkov <bp@alien8.de> writes: > On Mon, Dec 07, 2020 at 04:07:52PM -0600, Wei Huang wrote: >> I think we shouldn't override zen2 if _PSD is correct. In my opinion, >> there are two approaches: >> >> * Keep override_acpi_psd() >> Let us keep the original quirk and override_acpi_psd() function. Over >> the time, people may want to add new CPUs to override_acpi_psd(). The >> maintainer may declare that only CPUs >= family 17h will be fixed, to >> avoid exploding the check-list. >> >> * Remove the quirk completely >> We can completely remove commit acd316248205 ("acpi-cpufreq: Add quirk >> to disable _PSD usage on all AMD CPUs")? I am not sure what "AMD desktop >> boards" was referring to in the original commit message of acd316248205. >> Maybe such machines aren't in use anymore. > > * Third option: do not do anything. Why? > > - Let sleeping dogs lie and leave the workaround acd316248205 for old > machines. According to the commit log, acd316248205 seems to be only targeted at powernow-K8 - "in order to use the hardware to its full potential and keep the original powernow-k8 behavior, lets override the _PSD table setting on AMD hardware." It's unfortunate that it has continued to apply to all systems since. An alternate to this and 5368512abe08 would be to restrict the original workaround to the system mentioned in the commit log. > - Make a clear cut that the override is not needed from Zen3 on, i.e., > your patch > > 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs") > > > Punit's commit message reads "...indicates that the override is not > required for Zen3 onwards, it seems that domain information can be > trusted even on certain earlier systems." > > That's not nearly a justification in my book to do this on anything < > Zen3. > > This way you have a clear cut, you don't need to deal with adding any > more models to override_acpi_psd() and all is good. PSD (P-State Dependency) object is used to express the coupling of processors that share the same frequency domain. Ignoring the hardware topology information will cause both power management and scheduling to make sub-optimal choices. Wasn't this the motivation for taking it into account for Zen3 based systems in 5368512abe08? Or the powernow-k8 in the original workaround (though there it required ignoring firmware)? The exact same argument applies here to the Zen2 system I have running in a thermally constrained environment. In an ideal case, I was hoping AMD folks would come back with confirmation of how far back the override can be dropped and the workaround condition could be relaxed further. But if that is not available, the only way we have is to include systems that have been verified to not need the override and somebody that cares to send the patch.
On Wed, Dec 09, 2020 at 08:21:48AM +0900, Punit Agrawal wrote: > According to the commit log, acd316248205 seems to be only targeted at > powernow-K8 - No, it is not targeted at powernow-k8 - acpi-cpufreq.c is what is used on AMD hw. He means to make acpi-cpufreq's behavior consistent with powernow-k8. > But if that is not available, the only way we have is to include > systems that have been verified to not need the override You have verified exactly *one* system - yours. Or are you sure that *all* family 0x17, model 0x60, stepping 0x1 machines don't need the override? Also, you still haven't explained what you're trying to fix here. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes: > On Wed, Dec 09, 2020 at 08:21:48AM +0900, Punit Agrawal wrote: >> According to the commit log, acd316248205 seems to be only targeted at >> powernow-K8 - > > No, it is not targeted at powernow-k8 - acpi-cpufreq.c is what is used > on AMD hw. He means to make acpi-cpufreq's behavior consistent with > powernow-k8. So "powernow-k8" is not a cpu but a cpufreq driver. That doesn't change the fact that the patch causes all AMD systems using acpi-cpufreq to ignore processor frequency groupings and treat each processor to be an in independent frequency domain from cpufreq's point of view. >> But if that is not available, the only way we have is to include >> systems that have been verified to not need the override > > You have verified exactly *one* system - yours. Or are you sure that > *all* family 0x17, model 0x60, stepping 0x1 machines don't need the > override? Unfortunately, I only have access to one system with that F/M/S. Since posting the non-RFC patches, I was able to inspect the ACPI tables for more CPUs - Family: 0x17h, Model: 0x71h (Ryzen 3950X) Family: 0x17h, Model: 0x18h (Ryzen 3500u) To me it suggests, that there are likely more systems from the family that show the characteristic described below. > Also, you still haven't explained what you're trying to fix here. All the CPUs here are multi-threaded with 2 threads per core. The _PSD for the system describes the cores as having a coupling that consist of a frequency domain per core that contains both the threads. The firmware description makes sense and seems to accurately describe the hardware topology. In all these systems, the override causes this topology information to be ignored - treating each core to be a separate domain. The proposed patch removes the override so that _PSD is taken into account.
On Sat, Dec 12, 2020 at 08:36:48AM +0900, Punit Agrawal wrote: > To me it suggests, that there are likely more systems from the family > that show the characteristic described below. Until we find a *single* system with a broken BIOS which has those objects kaputt and then this heuristic would need an exception. VS the clear statement from AMD that from zen3 onwards, all BIOS will be tested. I hope they boot Linux at least before they ship. > In all these systems, the override causes this topology information to > be ignored - treating each core to be a separate domain. The proposed > patch removes the override so that _PSD is taken into account. You're still not answering my question: what does the coupling of the SMT threads bring on those systems? Power savings? Perf improvement? Anything palpable or measurable? Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes: > On Sat, Dec 12, 2020 at 08:36:48AM +0900, Punit Agrawal wrote: >> To me it suggests, that there are likely more systems from the family >> that show the characteristic described below. > > Until we find a *single* system with a broken BIOS which has those > objects kaputt and then this heuristic would need an exception. > > VS the clear statement from AMD that from zen3 onwards, all BIOS will be > tested. I hope they boot Linux at least before they ship. IIUC, this suggests that Linux booting on anything prior to Zen3 is down to pure luck - I hope that wasn't the case. >> In all these systems, the override causes this topology information to >> be ignored - treating each core to be a separate domain. The proposed >> patch removes the override so that _PSD is taken into account. > > You're still not answering my question: what does the coupling of the > SMT threads bring on those systems? Power savings? Perf improvement? > Anything palpable or measurable? At the moment acpi thermals is bust on this and other affected AMD system I have access to. That'll need fixing before any sensible measurements can be run. Tbh, I didn't quite expect the patch to the PSD exclusion list to be so controversial - especially when a similar change for zen3 had recently been merged. If you're really not keen on the change, I will carry it locally for the time being and revisit once the other issues have been resolved.
On Mon, Dec 14, 2020 at 10:27:09PM +0900, Punit Agrawal wrote: > IIUC, this suggests that Linux booting on anything prior to Zen3 is down > to pure luck - I hope that wasn't the case. WTF does this have to do with linux booting?! > At the moment acpi thermals is bust on this and other affected AMD > system I have access to. That'll need fixing before any sensible > measurements can be run. Nope, still not answering my questions. > Tbh, I didn't quite expect the patch to the PSD exclusion list to be > so controversial It won't be if you explain properly what your patch is fixing. That is, if it fixes anything. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov <bp@alien8.de> writes: > On Mon, Dec 14, 2020 at 10:27:09PM +0900, Punit Agrawal wrote: >> IIUC, this suggests that Linux booting on anything prior to Zen3 is down >> to pure luck - I hope that wasn't the case. > > WTF does this have to do with linux booting?! I guess I misunderstood the comment from your previous mail. Pasting back for context (emphasis mine) - VS the clear statement from AMD that from zen3 onwards, all BIOS will be tested. *I hope they boot Linux at least before they ship.* >> At the moment acpi thermals is bust on this and other affected AMD >> system I have access to. That'll need fixing before any sensible >> measurements can be run. > > Nope, still not answering my questions. > >> Tbh, I didn't quite expect the patch to the PSD exclusion list to be >> so controversial > > It won't be if you explain properly what your patch is fixing. That is, > if it fixes anything. I stared at the driver code (and the ACPI tables for the platform) to see if I could provide a better explanation. That's when I realised that as the platform advertises hardware frequency co-ordination, even without the override it still skips setting the policy cpus - /* * Will let policy->cpus know about dependency only when software * coordination is required. */ if (policy->shared_type == CPUFREQ_SHARED_TYPE_ALL || policy->shared_type == CPUFREQ_SHARED_TYPE_ANY) { cpumask_copy(policy->cpus, perf->shared_cpu_map); } This ends up treating each CPU as an independent frequency domain anyways. So even ignoring the override for the CPU, doesn't change anything other than dropping the message from boot log - overriding BIOS provided _PSD data As such, there's no point in merging the patch as it is. Apologies for the noise. I should've checked more thoroughly before sending the patches. [ Aside: If Zen3 is using hardware co-ordination it'll probably face the issue described above as well. ]
diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c index b1e7df96d428..29f1cd93541e 100644 --- a/drivers/cpufreq/acpi-cpufreq.c +++ b/drivers/cpufreq/acpi-cpufreq.c @@ -198,8 +198,13 @@ static int override_acpi_psd(unsigned int cpu_id) if (c->x86_vendor == X86_VENDOR_AMD) { if (!check_amd_hwpstate_cpu(cpu_id)) return false; - - return c->x86 < 0x19; + /* + * CPU's before Zen3 (except some Zen2) need the + * override. + */ + return (c->x86 < 0x19) && + !(c->x86 == 0x17 && c->x86_model == 0x60 && + c->x86_stepping == 0x1); } return false;
Booting Linux on a Zen2 based processor (family: 0x17, model: 0x60, stepping: 0x1) shows the following message in the logs - acpi_cpufreq: overriding BIOS provided _PSD data Although commit 5368512abe08 ("acpi-cpufreq: Honor _PSD table setting on new AMD CPUs") indicates that the override is not required for Zen3 onwards, it seems that domain information can be trusted even on certain earlier systems. Update the check, to skip the override for Zen2 processors known to work without the override. Signed-off-by: Punit Agrawal <punitagrawal@gmail.com> Cc: Wei Huang <wei.huang2@amd.com> --- drivers/cpufreq/acpi-cpufreq.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-)