Message ID | CAK8P3a1mkHEjRJgJPsRy+kuN=48=JEDJAeR2z9n+O71qbJ8hSA@mail.gmail.com |
---|---|
State | New |
Headers | show |
Series | x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK? | expand |
On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote: > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada > <srinivas.pandruvada@linux.intel.com> wrote: > > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote: > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that > > > started > > > constantly spewing messages from the therm_throt driver after one > > > core overheated: > > > > > I think this is a Cascade Lake system. Have you tried the latest > > micro- > > code? > > Thanks for your quick reply. I have installed the latest microcode > 0x5003302 > now (manually, because the version provided by the distro was still > using > version 0x5003102). > > After that, I tried writing the value 0x2a80 from userspace, and > that did not cause a trap, so I assume that fixed it. > Thanks for reporting. I am aware of this issue and should be fixed by microcode update. Thanks, Srinivas > It's hard to be sure, as the system has only run into the broken > state twice during its life, and now it's fine. I'll reply here if it > ever comes back with the new microcode. > > Thanks a lot! > > Arnd
On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada <srinivas.pandruvada@linux.intel.com> wrote: > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote: > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada > > <srinivas.pandruvada@linux.intel.com> wrote: > > > > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote: > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that > > > > started > > > > constantly spewing messages from the therm_throt driver after one > > > > core overheated: > > > > > > > I think this is a Cascade Lake system. Have you tried the latest > > > micro- > > > code? > > > > Thanks for your quick reply. I have installed the latest microcode > > 0x5003302 > > now (manually, because the version provided by the distro was still > > using > > version 0x5003102). > > > > After that, I tried writing the value 0x2a80 from userspace, and > > that did not cause a trap, so I assume that fixed it. > > > Thanks for reporting. > I am aware of this issue and should be fixed by microcode update. I wonder how common this problem it is. Would it help to add a driver workaround like this? diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c index 8352083b87c7..acb402e56796 100644 --- a/drivers/thermal/intel/therm_throt.c +++ b/drivers/thermal/intel/therm_throt.c @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level) rdmsrl(msr, msr_val); msr_val &= mask; - wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG); + if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) { + /* work around Cascade Lake SKZ57 erratum */ + printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, " + "please upgrade microcode\n"); + wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG & + ~BIT(13) & ~BIT(15)); + } } static void get_therm_status(int level, bool *proc_hot, u8 *temp) Arnd
On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote: > On Thu, Jun 2, 2022 at 6:25 PM srinivas pandruvada > <srinivas.pandruvada@linux.intel.com> wrote: > > On Thu, 2022-06-02 at 18:18 +0200, Arnd Bergmann wrote: > > > On Thu, Jun 2, 2022 at 5:52 PM srinivas pandruvada > > > <srinivas.pandruvada@linux.intel.com> wrote: > > > > > > > > On Thu, 2022-06-02 at 11:19 +0200, Arnd Bergmann wrote: > > > > > I have a Xeon W-2265 (family 6, model 85, stepping 7) that > > > > > started > > > > > constantly spewing messages from the therm_throt driver after > > > > > one > > > > > core overheated: > > > > > > > > > I think this is a Cascade Lake system. Have you tried the > > > > latest > > > > micro- > > > > code? > > > > > > Thanks for your quick reply. I have installed the latest > > > microcode > > > 0x5003302 > > > now (manually, because the version provided by the distro was > > > still > > > using > > > version 0x5003102). > > > > > > After that, I tried writing the value 0x2a80 from userspace, and > > > that did not cause a trap, so I assume that fixed it. > > > > > Thanks for reporting. > > I am aware of this issue and should be fixed by microcode update. > > I wonder how common this problem it is. Would it help to add a driver > workaround > like this? This issue affects only certain skews. The others already working as expected. These are important log bits for debug, we don't want to clear in this path. Printing warning for CLX stepping is fine without clearing unrelated bits 13 and 15. Read-modify-update should always work where we only update the bits of interest. Writing 1s to this register should be NOP. Thanks, Srinivas > > diff --git a/drivers/thermal/intel/therm_throt.c > b/drivers/thermal/intel/therm_throt.c > index 8352083b87c7..acb402e56796 100644 > --- a/drivers/thermal/intel/therm_throt.c > +++ b/drivers/thermal/intel/therm_throt.c > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level) > > rdmsrl(msr, msr_val); > msr_val &= mask; > - wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG); > + if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) { > + /* work around Cascade Lake SKZ57 erratum */ > + printk_once(KERN_WARNING "Failed to update > IA32_THERM_STATUS, " > + "please upgrade > microcode\n"); > + wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG & > + ~BIT(13) & ~BIT(15)); > + } > } > > static void get_therm_status(int level, bool *proc_hot, u8 *temp) > > Arnd
On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada <srinivas.pandruvada@linux.intel.com> wrote: > On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote: > > > > I wonder how common this problem it is. Would it help to add a driver > > workaround > > like this? > This issue affects only certain skews. The others already working as > expected. These are important log bits for debug, we don't want to > clear in this path. Printing warning for CLX stepping is fine without > clearing unrelated bits 13 and 15. > Read-modify-update should always work where we only update the bits of > interest. Writing 1s to this register should be NOP. The patch I suggested doesn't change the behavior unless the initial write causes an exception. As long as only buggy microcode rejects the write, the second write just serves to clear the state that causes the repeated stack dumps. Arnd > > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int level) > > > > rdmsrl(msr, msr_val); > > msr_val &= mask; > > - wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG); > > + if (wrmsrl_safe(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG)) { > > + /* work around Cascade Lake SKZ57 erratum */ > > + printk_once(KERN_WARNING "Failed to update IA32_THERM_STATUS, " > > + "please upgrade microcode\n"); > > + wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG & > > + ~BIT(13) & ~BIT(15)); > > + } > > } > >
On Thu, 2022-06-02 at 22:42 +0200, Arnd Bergmann wrote: > On Thu, Jun 2, 2022 at 10:10 PM srinivas pandruvada > <srinivas.pandruvada@linux.intel.com> wrote: > > On Thu, 2022-06-02 at 20:53 +0200, Arnd Bergmann wrote: > > > > > > I wonder how common this problem it is. Would it help to add a > > > driver > > > workaround > > > like this? > > This issue affects only certain skews. The others already working > > as > > expected. These are important log bits for debug, we don't want to > > clear in this path. Printing warning for CLX stepping is fine > > without > > clearing unrelated bits 13 and 15. > > Read-modify-update should always work where we only update the bits > > of > > interest. Writing 1s to this register should be NOP. > > The patch I suggested doesn't change the behavior unless the initial > write causes an exception. As long as only buggy microcode rejects > the > write, the second write just serves to clear the state that causes > the > repeated stack dumps. But it will clear BIT 13 and 15 in this case. So atleast print the current msr value in the warning message so that we don't loose the BIT 13 and BIT 15 values, in case we need them for debug. Thanks, Srinivas > > Arnd > > > > @@ -214,7 +214,13 @@ static void clear_therm_status_log(int > > > level) > > > > > > rdmsrl(msr, msr_val); > > > msr_val &= mask; > > > - wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG); > > > + if (wrmsrl_safe(msr, msr_val & > > > ~THERM_STATUS_PROCHOT_LOG)) { > > > + /* work around Cascade Lake SKZ57 erratum */ > > > + printk_once(KERN_WARNING "Failed to update > > > IA32_THERM_STATUS, " > > > + "please upgrade > > > microcode\n"); > > > + wrmsrl(msr, msr_val & ~THERM_STATUS_PROCHOT_LOG & > > > + ~BIT(13) & ~BIT(15)); > > > + } > > > } > > >
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c index 8352083b87c7..620d7f4c013e 100644 --- a/drivers/thermal/intel/therm_throt.c +++ b/drivers/thermal/intel/therm_throt.c @@ -196,7 +196,7 @@ static const struct attribute_group thermal_attr_group = { #define THERM_THROT_POLL_INTERVAL HZ #define THERM_STATUS_PROCHOT_LOG BIT(1) -#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11) | BIT(13) | BIT(15)) +#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11)) #define THERM_STATUS_CLEAR_PKG_MASK (BIT(1) | BIT(3) | BIT(5) |