Message ID | 20241030130840.2890904-1-quic_sibis@quicinc.com |
---|---|
Headers | show |
Series | qcom: x1e80100: Enable CPUFreq | expand |
[ +CC: Marc, who I think I saw reporting something similar even if I can seem to find where right now ] On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote: > This series enables CPUFreq support on the X1E SoC using the SCMI perf > protocol. This was originally part of the RFC: firmware: arm_scmi: > Qualcomm Vendor Protocol [1]. I've split it up so that this part can > land earlier. Warnings Introduced by the series are fixed by [2] Sibi Sankar (2): > arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes > arm64: dts: qcom: x1e80100: Enable cpufreq I've been running with v6 of these for a while now, without noticing any issues, and just updated to v7 to be able to provide a Tested-by tag. I wanted to run a compilation and see how the frequencies varied, but before I got around to that I just grepped the cpufreq sysfs attributes for CPU0 four times. And this triggered a reset of the machine (x1e80100 CRD). The last values output were: affected_cpus:0 1 2 3 cpuinfo_cur_freq:<unknown> cpuinfo_max_freq:3417600 cpuinfo_min_freq:710400 cpuinfo_transition_latency:30000 related_cpus:0 1 2 3 scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320 scaling_available_governors:ondemand userspace performance schedutil scaling_cur_freq:806400 scaling_driver:scmi scaling_governor:schedutil scaling_max_freq:3417600 scaling_min_freq:710400 scaling_setspeed:<unsupported> Notice the <unknown> current frequency (the previous greps said 710400 and 2515200). The last thing I see on the serial console, presumably just before the reset, is: [ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564) I just rebooted and grepped again and it triggered on the first attempt (cur_freq also said '<unknown>'). Same error in the log, printed when grepping. Johan
On Fri, 01 Nov 2024 13:00:37 +0000, Johan Hovold <johan@kernel.org> wrote: > > [ +CC: Marc, who I think I saw reporting something similar even if I can > seem to find where right now ] It was on IRC. > > On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote: > > This series enables CPUFreq support on the X1E SoC using the SCMI perf > > protocol. This was originally part of the RFC: firmware: arm_scmi: > > Qualcomm Vendor Protocol [1]. I've split it up so that this part can > > land earlier. Warnings Introduced by the series are fixed by [2] > > Sibi Sankar (2): > > arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes > > arm64: dts: qcom: x1e80100: Enable cpufreq > > I've been running with v6 of these for a while now, without noticing any > issues, and just updated to v7 to be able to provide a Tested-by tag. > > I wanted to run a compilation and see how the frequencies varied, but > before I got around to that I just grepped the cpufreq sysfs attributes > for CPU0 four times. And this triggered a reset of the machine (x1e80100 > CRD). > > The last values output were: > > affected_cpus:0 1 2 3 > cpuinfo_cur_freq:<unknown> > cpuinfo_max_freq:3417600 > cpuinfo_min_freq:710400 > cpuinfo_transition_latency:30000 > related_cpus:0 1 2 3 > scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320 > scaling_available_governors:ondemand userspace performance schedutil > scaling_cur_freq:806400 > scaling_driver:scmi > scaling_governor:schedutil > scaling_max_freq:3417600 > scaling_min_freq:710400 > scaling_setspeed:<unsupported> > > Notice the <unknown> current frequency (the previous greps said 710400 > and 2515200). > > The last thing I see on the serial console, presumably just before > the reset, is: > > [ 196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564) > > I just rebooted and grepped again and it triggered on the first attempt > (cur_freq also said '<unknown>'). Same error in the log, printed when > grepping. I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* results in hard resets, although I don't get much on the serial console when that happens. Interestingly, I also see some errors in dmesg at boot time: maz@semi-fraudulent:~$ dmesg| grep -i scmi [ 0.966175] scmi_core: SCMI protocol bus registered [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains All these "Failed" are a bit worrying. Happy to put any theory to the test. Thanks, M.
On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote: > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* > results in hard resets, although I don't get much on the serial > console when that happens. Interestingly, I also see some errors in > dmesg at boot time: > > maz@semi-fraudulent:~$ dmesg| grep -i scmi > [ 0.966175] scmi_core: SCMI protocol bus registered > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains > > All these "Failed" are a bit worrying. Happy to put any theory to the > test. Yes, those warnings indeed look troubling. Fortunately they appear to be mostly benign and only indicate that the firmware is reporting duplicate OPPs, which the kernel is now ignoring without any other side effects than the warnings. The side-effects and these remaining warnings are addressed by this series: https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ but I think we should try to make the warnings a bit more informative (and less scary) by printing something along the lines of: arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC instead. Johan
On Fri, 01 Nov 2024 14:19:54 +0000, Johan Hovold <johan@kernel.org> wrote: > > On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote: > > > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy* > > results in hard resets, although I don't get much on the serial > > console when that happens. Interestingly, I also see some errors in > > dmesg at boot time: > > > > maz@semi-fraudulent:~$ dmesg| grep -i scmi > > [ 0.966175] scmi_core: SCMI protocol bus registered > > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport > > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms > > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0 > > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled ! > > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled. > > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000 > > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16 > > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains > > > > All these "Failed" are a bit worrying. Happy to put any theory to the > > test. > > Yes, those warnings indeed look troubling. Fortunately they appear to be > mostly benign and only indicate that the firmware is reporting duplicate > OPPs, which the kernel is now ignoring without any other side effects > than the warnings. Right. Not something that would explain the hard reset behaviour then. > > The side-effects and these remaining warnings are addressed by this > series: > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > but I think we should try to make the warnings a bit more informative > (and less scary) by printing something along the lines of: > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > instead. Indeed. Seeing [Firmware Bug] has a comforting feeling of familiarity... :) I wonder whether the same sort of reset happen on more "commercial" systems (such as some of the laptops). You expect that people look at the cpufreq stuff closely, and don't see things exploding like we are. M.
On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > On Fri, 01 Nov 2024 14:19:54 +0000, > Johan Hovold <johan@kernel.org> wrote: > > The side-effects and these remaining warnings are addressed by this > > series: > > > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > > > but I think we should try to make the warnings a bit more informative > > (and less scary) by printing something along the lines of: > > > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > > > instead. > > Indeed. Seeing [Firmware Bug] has a comforting feeling of > familiarity... :) > > I wonder whether the same sort of reset happen on more "commercial" > systems (such as some of the laptops). You expect that people look at > the cpufreq stuff closely, and don't see things exploding like we are. I finally got around to getting my Lenovo ThinkPad T14s to boot (it refuses to start the kernel when using GRUB, and it's not due to the known 64 GB memory issue as it only has 32 GB) and can confirm that it hard resets when accessing the cpufreq sysfs attributes as well. On the bright side, at least I don't see any warnings due to duplicate OPPs on this machine (x1e78100, latest UEFI fw). Johan
On Tue, 05 Nov 2024 16:57:07 +0000, Johan Hovold <johan@kernel.org> wrote: > > On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > > On Fri, 01 Nov 2024 14:19:54 +0000, > > Johan Hovold <johan@kernel.org> wrote: > > > > The side-effects and these remaining warnings are addressed by this > > > series: > > > > > > https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ > > > > > > but I think we should try to make the warnings a bit more informative > > > (and less scary) by printing something along the lines of: > > > > > > arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC > > > > > > instead. > > > > Indeed. Seeing [Firmware Bug] has a comforting feeling of > > familiarity... :) > > > > I wonder whether the same sort of reset happen on more "commercial" > > systems (such as some of the laptops). You expect that people look at > > the cpufreq stuff closely, and don't see things exploding like we are. > > I finally got around to getting my Lenovo ThinkPad T14s to boot (it > refuses to start the kernel when using GRUB, and it's not due to the > known 64 GB memory issue as it only has 32 GB) <cry> I know the feeling. My devkit can't use GRUB either, so I added a hook to the GRUB config to generate EFI scripts that directly execute the kernel with initrd, dtb, and command line. This is probably the worse firmware I've seen in a very long while. </cry> > and can confirm that it > hard resets when accessing the cpufreq sysfs attributes as well. Right. So this also happens on non-abandonware machines. > On the bright side, at least I don't see any warnings due to duplicate > OPPs on this machine (x1e78100, latest UEFI fw). One bug fixed... M.
On 11/5/24 23:42, Marc Zyngier wrote: > On Tue, 05 Nov 2024 16:57:07 +0000, > Johan Hovold <johan@kernel.org> wrote: >> >> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: >>> On Fri, 01 Nov 2024 14:19:54 +0000, >>> Johan Hovold <johan@kernel.org> wrote: >> >>>> The side-effects and these remaining warnings are addressed by this >>>> series: >>>> >>>> https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/ >>>> >>>> but I think we should try to make the warnings a bit more informative >>>> (and less scary) by printing something along the lines of: >>>> >>>> arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC >>>> >>>> instead. >>> >>> Indeed. Seeing [Firmware Bug] has a comforting feeling of >>> familiarity... :) >>> >>> I wonder whether the same sort of reset happen on more "commercial" >>> systems (such as some of the laptops). You expect that people look at >>> the cpufreq stuff closely, and don't see things exploding like we are. >> >> I finally got around to getting my Lenovo ThinkPad T14s to boot (it >> refuses to start the kernel when using GRUB, and it's not due to the >> known 64 GB memory issue as it only has 32 GB) > > <cry> > I know the feeling. My devkit can't use GRUB either, so I added a > hook to the GRUB config to generate EFI scripts that directly execute > the kernel with initrd, dtb, and command line. > > This is probably the worse firmware I've seen in a very long while. The PERF_LEVEL_GET implementation in the SCP firmware side is the reason for the crash :|, currently there is a bug in the kernel that picks up index that we set with LEVEL_SET with fast channel and that masks the crash. I was told the crash happens when idle states are enabled and a regular LEVEL_GET message is triggered from the kernel. This was fixed a while back but it will take a while to flow back to all the devices. It should already be out CRD's. Johan, Now that you are aware of the the limitations can we make a call on how to deal with this and land cpufreq? -Sibi > </cry> > >> and can confirm that it >> hard resets when accessing the cpufreq sysfs attributes as well. > > Right. So this also happens on non-abandonware machines. > >> On the bright side, at least I don't see any warnings due to duplicate >> OPPs on this machine (x1e78100, latest UEFI fw). > > One bug fixed... > > M. >
On Thu, Dec 05, 2024 at 04:53:05PM +0530, Sibi Sankar wrote: > On 11/5/24 23:42, Marc Zyngier wrote: > > On Tue, 05 Nov 2024 16:57:07 +0000, > > Johan Hovold <johan@kernel.org> wrote: > >> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > >>> I wonder whether the same sort of reset happen on more "commercial" > >>> systems (such as some of the laptops). You expect that people look at > >>> the cpufreq stuff closely, and don't see things exploding like we are. > >> > >> I finally got around to getting my Lenovo ThinkPad T14s to boot (it > >> refuses to start the kernel when using GRUB, and it's not due to the > >> known 64 GB memory issue as it only has 32 GB) > > > > <cry> > > I know the feeling. My devkit can't use GRUB either, so I added a > > hook to the GRUB config to generate EFI scripts that directly execute > > the kernel with initrd, dtb, and command line. > > > > This is probably the worse firmware I've seen in a very long while. > > The PERF_LEVEL_GET implementation in the SCP firmware side > is the reason for the crash :|, currently there is a bug > in the kernel that picks up index that we set with LEVEL_SET > with fast channel and that masks the crash. I was told the > crash happens when idle states are enabled and a regular > LEVEL_GET message is triggered from the kernel. This was > fixed a while back but it will take a while to flow back > to all the devices. It should already be out CRD's. > > Johan, > Now that you are aware of the the limitations can we make > a call on how to deal with this and land cpufreq? As Marc said, it seems you need to come up with a way to detect and work around the broken firmware. We want to get the fast channel issue fixed, but when we merge that fix it will trigger these crashes if we also merge cpufreq support for x1e. Can you expand the on the PERF_LEVEL_GET issue? Is it possible to implement some workaround for the buggy firmware? Like returning a dummy value? How exactly are things working today? Can't that be used a basis for a quirk? > > </cry> > > > >> and can confirm that it > >> hard resets when accessing the cpufreq sysfs attributes as well. Johan
On 12/5/24 21:16, Johan Hovold wrote: > On Thu, Dec 05, 2024 at 04:53:05PM +0530, Sibi Sankar wrote: >> On 11/5/24 23:42, Marc Zyngier wrote: >>> On Tue, 05 Nov 2024 16:57:07 +0000, >>> Johan Hovold <johan@kernel.org> wrote: >>>> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote: > >>>>> I wonder whether the same sort of reset happen on more "commercial" >>>>> systems (such as some of the laptops). You expect that people look at >>>>> the cpufreq stuff closely, and don't see things exploding like we are. >>>> >>>> I finally got around to getting my Lenovo ThinkPad T14s to boot (it >>>> refuses to start the kernel when using GRUB, and it's not due to the >>>> known 64 GB memory issue as it only has 32 GB) >>> >>> <cry> >>> I know the feeling. My devkit can't use GRUB either, so I added a >>> hook to the GRUB config to generate EFI scripts that directly execute >>> the kernel with initrd, dtb, and command line. >>> >>> This is probably the worse firmware I've seen in a very long while. >> >> The PERF_LEVEL_GET implementation in the SCP firmware side >> is the reason for the crash :|, currently there is a bug >> in the kernel that picks up index that we set with LEVEL_SET >> with fast channel and that masks the crash. I was told the >> crash happens when idle states are enabled and a regular >> LEVEL_GET message is triggered from the kernel. This was >> fixed a while back but it will take a while to flow back >> to all the devices. It should already be out CRD's. >> >> Johan, >> Now that you are aware of the the limitations can we make >> a call on how to deal with this and land cpufreq? > > As Marc said, it seems you need to come up with a way to detect and work > around the broken firmware. The perf protocol version won't have any changes so detecting it isn't possible :( > > We want to get the fast channel issue fixed, but when we merge that fix > it will trigger these crashes if we also merge cpufreq support for x1e. > > Can you expand the on the PERF_LEVEL_GET issue? Is it possible to > implement some workaround for the buggy firmware? Like returning a dummy > value? How exactly are things working today? Can't that be used a basis > for a quirk? The main problem is the X1E firmware supports fast channel level get but when queried it says it doesn't support it :|. The PERF_LEVEL_GET regular messaging which gets used as a fallback has a bug which causes the device to crash. So we either enable cpufreq only on platforms that has the fix in place or live with the warning that certain messages don't support fast channel which I don't think will fly. I've also been told the crash wouldn't show up if we have all sleep states disabled. > >>> </cry> >>> >>>> and can confirm that it >>>> hard resets when accessing the cpufreq sysfs attributes as well. > > Johan
On Mon, Jan 06, 2025 at 05:52:48PM +0530, Sibi Sankar wrote: > On 12/5/24 21:16, Johan Hovold wrote: > > As Marc said, it seems you need to come up with a way to detect and work > > around the broken firmware. > > The perf protocol version won't have any changes so detecting > it isn't possible :( But there could be other ways, see below. > > We want to get the fast channel issue fixed, but when we merge that fix > > it will trigger these crashes if we also merge cpufreq support for x1e. > > > > Can you expand the on the PERF_LEVEL_GET issue? Is it possible to > > implement some workaround for the buggy firmware? Like returning a dummy > > value? How exactly are things working today? Can't that be used a basis > > for a quirk? > > The main problem is the X1E firmware supports fast channel level get > but when queried it says it doesn't support it :|. The PERF_LEVEL_GET > regular messaging which gets used as a fallback has a bug which causes > the device to crash. So we either enable cpufreq only on platforms > that has the fix in place or live with the warning that certain messages > don't support fast channel which I don't think will fly. I've also been > told the crash wouldn't show up if we have all sleep states disabled. We certainly want cpufreq enabled also on the current/older firmware which have these bugs. Based on the above, it sounds like your fix: https://lore.kernel.org/lkml/20241030125512.2884761-2-quic_sibis@quicinc.com/ is correct even if it triggers the crash on machines with buggy firmware. Why can't you add a quirk for x1e platforms that makes sure that the driver always uses fastchannel level get? You know it is supported (and as has to be used) even if the buggy firmware says it's not. Just set the corresponding attribute bit unconditionally based on the DT machine compatible (or fall back to the current implementation which theoretically other broken fw implementations may also be relying on), or similar. Johan