mbox series

[V7,0/2] qcom: x1e80100: Enable CPUFreq

Message ID 20241030130840.2890904-1-quic_sibis@quicinc.com
Headers show
Series qcom: x1e80100: Enable CPUFreq | expand

Message

Sibi Sankar Oct. 30, 2024, 1:08 p.m. UTC
This series enables CPUFreq support on the X1E SoC using the SCMI perf
protocol. This was originally part of the RFC: firmware: arm_scmi:
Qualcomm Vendor Protocol [1]. I've split it up so that this part can
land earlier. Warnings Introduced by the series are fixed by [2]

V6:
* Drop patches that are already picked up.
* Rebase to the latest next-20241029.

V5:
* Fix build error reported by kernel test robot by adding 64BIT requirement
  to COMPILE_TEST
* Pick Rbs

V4:
* Move val, flag and chan to local loop variables. [Jassi]
* Add cpucp mailbox to the MAINTAINERS file. [Jassi]
* Move to core_initcall. [Konrad]
* Skip explicitly setting txdone_irq/txdone_poll to zero. [Konrad]

V3:
* Fix Maintainer info in cpucp mbox bindings. [Bjorn]
* Fix copyright info in cpucp driver. [Bjorn]
* Drop unused APSS_CPUCP_TX_MBOX_IDR, value init and drv_data. [Bjorn/Dmitry]
* Convert to lower case hex. [Bjorn]
* Convert irq and dev to local variables. [Bjorn]
* Replace for and if with for_each_set_bit. [Bjorn]
* Document the need for spinlock. [Bjorn]
* Add space after " for aesthetics. [Bjorn]
* Fix err in calc and add fixes tag. [Bjorn]
* Include io.h and re-order platform_device.h
* Use GENMASK_ULL to generate APSS_CPUCP_RX_MBOX_CMD_MASK.

V2:
* Fix series version number [Rob]
* Pickup Rbs from Dimitry and Rob.
* Use power-domain instead of clocks. [Sudeep/Ulf]
* Rename sram sub-nodes according to schema. [Dmitry]
* Use BIT() instead of manual shift. [Dmitry]
* Define RX_MBOX_CMD to account for chan calculation. [Dmitry]
* Clear the bit instead of the entire status within the spinlock. [Dmitry]
* Use dev_err_probe instead. [Dmitry]
* Drop superfluous error message while handling errors from get_irq. [Dmitry]
* Use devm_mbox_controller_register and drop remove path. [Dmitry]
* Define TX_MBOX_CMD to account for chan calculation.
* Use cpucp->dev in probe path for conformity.

RFC V1:
* Use x1e80100 as the fallback for future SoCs using the cpucp-mbox
  controller. [Krzysztoff/Konrad/Rob]
* Use chan->lock and chan->cl to detect if the channel is no longer
  Available. [Dmitry]
* Use BIT() instead of using manual shifts. [Dmitry]
* Don't use integer as a pointer value. [Dmitry]
* Allow it to default to of_mbox_index_xlate. [Dmitry]
* Use devm_of_iomap. [Dmitry]
* Use module_platform_driver instead of module init/exit. [Dmitry]
* Get channel number using mailbox core (like other drivers) and
  further simplify the driver by dropping setup_mbox func.

[1]: https://lore.kernel.org/lkml/20240117173458.2312669-1-quic_sibis@quicinc.com/#r
[2]: https://lore.kernel.org/lkml/20241030125512.2884761-1-quic_sibis@quicinc.com/

Other relevant Links:
https://lore.kernel.org/lkml/be2e475a-349f-4e98-b238-262dd7117a4e@linaro.org/

base: next-20241029

Sibi Sankar (2):
  arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
  arm64: dts: qcom: x1e80100: Enable cpufreq

 arch/arm64/boot/dts/qcom/x1e80100.dtsi | 89 +++++++++++++++++++-------
 1 file changed, 65 insertions(+), 24 deletions(-)

Comments

Johan Hovold Nov. 1, 2024, 1 p.m. UTC | #1
[ +CC: Marc, who I think I saw reporting something similar even if I can
  seem to find where right now ]

On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> This series enables CPUFreq support on the X1E SoC using the SCMI perf
> protocol. This was originally part of the RFC: firmware: arm_scmi:
> Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> land earlier. Warnings Introduced by the series are fixed by [2]

 Sibi Sankar (2):
>   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
>   arm64: dts: qcom: x1e80100: Enable cpufreq

I've been running with v6 of these for a while now, without noticing any
issues, and just updated to v7 to be able to provide a Tested-by tag.

I wanted to run a compilation and see how the frequencies varied, but
before I got around to that I just grepped the cpufreq sysfs attributes
for CPU0 four times. And this triggered a reset of the machine (x1e80100
CRD).

The last values output were:

	affected_cpus:0 1 2 3
	cpuinfo_cur_freq:<unknown>
	cpuinfo_max_freq:3417600
	cpuinfo_min_freq:710400
	cpuinfo_transition_latency:30000
	related_cpus:0 1 2 3
	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
	scaling_available_governors:ondemand userspace performance schedutil
	scaling_cur_freq:806400
	scaling_driver:scmi
	scaling_governor:schedutil
	scaling_max_freq:3417600
	scaling_min_freq:710400
	scaling_setspeed:<unsupported>

Notice the <unknown> current frequency (the previous greps said 710400
and 2515200).

The last thing I see on the serial console, presumably just before
the reset, is:

	[  196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)

I just rebooted and grepped again and it triggered on the first attempt
(cur_freq also said '<unknown>'). Same error in the log, printed when
grepping.

Johan
Marc Zyngier Nov. 1, 2024, 2:08 p.m. UTC | #2
On Fri, 01 Nov 2024 13:00:37 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> [ +CC: Marc, who I think I saw reporting something similar even if I can
>   seem to find where right now ]

It was on IRC.

> 
> On Wed, Oct 30, 2024 at 06:38:38PM +0530, Sibi Sankar wrote:
> > This series enables CPUFreq support on the X1E SoC using the SCMI perf
> > protocol. This was originally part of the RFC: firmware: arm_scmi:
> > Qualcomm Vendor Protocol [1]. I've split it up so that this part can
> > land earlier. Warnings Introduced by the series are fixed by [2]
> 
>  Sibi Sankar (2):
> >   arm64: dts: qcom: x1e80100: Add cpucp mailbox and sram nodes
> >   arm64: dts: qcom: x1e80100: Enable cpufreq
> 
> I've been running with v6 of these for a while now, without noticing any
> issues, and just updated to v7 to be able to provide a Tested-by tag.
> 
> I wanted to run a compilation and see how the frequencies varied, but
> before I got around to that I just grepped the cpufreq sysfs attributes
> for CPU0 four times. And this triggered a reset of the machine (x1e80100
> CRD).
> 
> The last values output were:
> 
> 	affected_cpus:0 1 2 3
> 	cpuinfo_cur_freq:<unknown>
> 	cpuinfo_max_freq:3417600
> 	cpuinfo_min_freq:710400
> 	cpuinfo_transition_latency:30000
> 	related_cpus:0 1 2 3
> 	scaling_available_frequencies:710400 806400 998400 1190400 1440000 1670400 1920000 2188800 2515200 2707200 2976000 320
> 	scaling_available_governors:ondemand userspace performance schedutil
> 	scaling_cur_freq:806400
> 	scaling_driver:scmi
> 	scaling_governor:schedutil
> 	scaling_max_freq:3417600
> 	scaling_min_freq:710400
> 	scaling_setspeed:<unsupported>
> 
> Notice the <unknown> current frequency (the previous greps said 710400
> and 2515200).
> 
> The last thing I see on the serial console, presumably just before
> the reset, is:
> 
> 	[  196.268025] arm-scmi arm-scmi.0.auto: timed out in resp(caller: do_xfer+0x164/0x564)
> 
> I just rebooted and grepped again and it triggered on the first attempt
> (cur_freq also said '<unknown>'). Same error in the log, printed when
> grepping.

I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
results in hard resets, although I don't get much on the serial
console when that happens. Interestingly, I also see some errors in
dmesg at boot time:

maz@semi-fraudulent:~$ dmesg| grep -i scmi
[    0.966175] scmi_core: SCMI protocol bus registered
[    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
[    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
[    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
[    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
[    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
[    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
[    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
[    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains

All these "Failed" are a bit worrying. Happy to put any theory to the
test.

Thanks,

	M.
Johan Hovold Nov. 1, 2024, 2:19 p.m. UTC | #3
On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:

> I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> results in hard resets, although I don't get much on the serial
> console when that happens. Interestingly, I also see some errors in
> dmesg at boot time:
> 
> maz@semi-fraudulent:~$ dmesg| grep -i scmi
> [    0.966175] scmi_core: SCMI protocol bus registered
> [    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> [    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> [    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> [    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> [    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> [    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> [    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> [    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> 
> All these "Failed" are a bit worrying. Happy to put any theory to the
> test.

Yes, those warnings indeed look troubling. Fortunately they appear to be
mostly benign and only indicate that the firmware is reporting duplicate
OPPs, which the kernel is now ignoring without any other side effects
than the warnings.

The side-effects and these remaining warnings are addressed by this
series:

	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/

but I think we should try to make the warnings a bit more informative
(and less scary) by printing something along the lines of:

	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC

instead.

Johan
Marc Zyngier Nov. 1, 2024, 2:43 p.m. UTC | #4
On Fri, 01 Nov 2024 14:19:54 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:
> 
> > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> > results in hard resets, although I don't get much on the serial
> > console when that happens. Interestingly, I also see some errors in
> > dmesg at boot time:
> > 
> > maz@semi-fraudulent:~$ dmesg| grep -i scmi
> > [    0.966175] scmi_core: SCMI protocol bus registered
> > [    7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> > [    7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> > [    7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> > [    7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> > [    7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> > [    7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> > [    8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [    8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> > 
> > All these "Failed" are a bit worrying. Happy to put any theory to the
> > test.
> 
> Yes, those warnings indeed look troubling. Fortunately they appear to be
> mostly benign and only indicate that the firmware is reporting duplicate
> OPPs, which the kernel is now ignoring without any other side effects
> than the warnings.

Right. Not something that would explain the hard reset behaviour then.

> 
> The side-effects and these remaining warnings are addressed by this
> series:
> 
> 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> 
> but I think we should try to make the warnings a bit more informative
> (and less scary) by printing something along the lines of:
> 
> 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> 
> instead.

Indeed. Seeing [Firmware Bug] has a comforting feeling of
familiarity... :)

I wonder whether the same sort of reset happen on more "commercial"
systems (such as some of the laptops). You expect that people look at
the cpufreq stuff closely, and don't see things exploding like we are.

	M.
Johan Hovold Nov. 5, 2024, 4:57 p.m. UTC | #5
On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> On Fri, 01 Nov 2024 14:19:54 +0000,
> Johan Hovold <johan@kernel.org> wrote:

> > The side-effects and these remaining warnings are addressed by this
> > series:
> > 
> > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> > 
> > but I think we should try to make the warnings a bit more informative
> > (and less scary) by printing something along the lines of:
> > 
> > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> > 
> > instead.
> 
> Indeed. Seeing [Firmware Bug] has a comforting feeling of
> familiarity... :)
> 
> I wonder whether the same sort of reset happen on more "commercial"
> systems (such as some of the laptops). You expect that people look at
> the cpufreq stuff closely, and don't see things exploding like we are.

I finally got around to getting my Lenovo ThinkPad T14s to boot (it
refuses to start the kernel when using GRUB, and it's not due to the
known 64 GB memory issue as it only has 32 GB) and can confirm that it
hard resets when accessing the cpufreq sysfs attributes as well.

On the bright side, at least I don't see any warnings due to duplicate
OPPs on this machine (x1e78100, latest UEFI fw).

Johan
Marc Zyngier Nov. 5, 2024, 6:12 p.m. UTC | #6
On Tue, 05 Nov 2024 16:57:07 +0000,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> > On Fri, 01 Nov 2024 14:19:54 +0000,
> > Johan Hovold <johan@kernel.org> wrote:
> 
> > > The side-effects and these remaining warnings are addressed by this
> > > series:
> > > 
> > > 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
> > > 
> > > but I think we should try to make the warnings a bit more informative
> > > (and less scary) by printing something along the lines of:
> > > 
> > > 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
> > > 
> > > instead.
> > 
> > Indeed. Seeing [Firmware Bug] has a comforting feeling of
> > familiarity... :)
> > 
> > I wonder whether the same sort of reset happen on more "commercial"
> > systems (such as some of the laptops). You expect that people look at
> > the cpufreq stuff closely, and don't see things exploding like we are.
> 
> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
> refuses to start the kernel when using GRUB, and it's not due to the
> known 64 GB memory issue as it only has 32 GB)

<cry>
I know the feeling. My devkit can't use GRUB either, so I added a
hook to the GRUB config to generate EFI scripts that directly execute
the kernel with initrd, dtb, and command line.

This is probably the worse firmware I've seen in a very long while.
</cry>

> and can confirm that it
> hard resets when accessing the cpufreq sysfs attributes as well.

Right. So this also happens on non-abandonware machines.

> On the bright side, at least I don't see any warnings due to duplicate
> OPPs on this machine (x1e78100, latest UEFI fw).

One bug fixed...

	M.
Sibi Sankar Dec. 5, 2024, 11:23 a.m. UTC | #7
On 11/5/24 23:42, Marc Zyngier wrote:
> On Tue, 05 Nov 2024 16:57:07 +0000,
> Johan Hovold <johan@kernel.org> wrote:
>>
>> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
>>> On Fri, 01 Nov 2024 14:19:54 +0000,
>>> Johan Hovold <johan@kernel.org> wrote:
>>
>>>> The side-effects and these remaining warnings are addressed by this
>>>> series:
>>>>
>>>> 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@quicinc.com/
>>>>
>>>> but I think we should try to make the warnings a bit more informative
>>>> (and less scary) by printing something along the lines of:
>>>>
>>>> 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
>>>>
>>>> instead.
>>>
>>> Indeed. Seeing [Firmware Bug] has a comforting feeling of
>>> familiarity... :)
>>>
>>> I wonder whether the same sort of reset happen on more "commercial"
>>> systems (such as some of the laptops). You expect that people look at
>>> the cpufreq stuff closely, and don't see things exploding like we are.
>>
>> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
>> refuses to start the kernel when using GRUB, and it's not due to the
>> known 64 GB memory issue as it only has 32 GB)
> 
> <cry>
> I know the feeling. My devkit can't use GRUB either, so I added a
> hook to the GRUB config to generate EFI scripts that directly execute
> the kernel with initrd, dtb, and command line.
> 
> This is probably the worse firmware I've seen in a very long while.

The PERF_LEVEL_GET implementation in the SCP firmware side
is the reason for the crash :|, currently there is a bug
in the kernel that picks up index that we set with LEVEL_SET
with fast channel and that masks the crash. I was told the
crash happens when idle states are enabled and a regular
LEVEL_GET message is triggered from the kernel. This was
fixed a while back but it will take a while to flow back
to all the devices. It should already be out CRD's.

Johan,
Now that you are aware of the the limitations can we make
a call on how to deal with this and land cpufreq?

-Sibi

> </cry>
> 
>> and can confirm that it
>> hard resets when accessing the cpufreq sysfs attributes as well.
> 
> Right. So this also happens on non-abandonware machines.
> 
>> On the bright side, at least I don't see any warnings due to duplicate
>> OPPs on this machine (x1e78100, latest UEFI fw).
> 
> One bug fixed...
> 
> 	M.
>
Johan Hovold Dec. 5, 2024, 3:46 p.m. UTC | #8
On Thu, Dec 05, 2024 at 04:53:05PM +0530, Sibi Sankar wrote:
> On 11/5/24 23:42, Marc Zyngier wrote:
> > On Tue, 05 Nov 2024 16:57:07 +0000,
> > Johan Hovold <johan@kernel.org> wrote:
> >> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:

> >>> I wonder whether the same sort of reset happen on more "commercial"
> >>> systems (such as some of the laptops). You expect that people look at
> >>> the cpufreq stuff closely, and don't see things exploding like we are.
> >>
> >> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
> >> refuses to start the kernel when using GRUB, and it's not due to the
> >> known 64 GB memory issue as it only has 32 GB)
> > 
> > <cry>
> > I know the feeling. My devkit can't use GRUB either, so I added a
> > hook to the GRUB config to generate EFI scripts that directly execute
> > the kernel with initrd, dtb, and command line.
> > 
> > This is probably the worse firmware I've seen in a very long while.
> 
> The PERF_LEVEL_GET implementation in the SCP firmware side
> is the reason for the crash :|, currently there is a bug
> in the kernel that picks up index that we set with LEVEL_SET
> with fast channel and that masks the crash. I was told the
> crash happens when idle states are enabled and a regular
> LEVEL_GET message is triggered from the kernel. This was
> fixed a while back but it will take a while to flow back
> to all the devices. It should already be out CRD's.
> 
> Johan,
> Now that you are aware of the the limitations can we make
> a call on how to deal with this and land cpufreq?

As Marc said, it seems you need to come up with a way to detect and work
around the broken firmware.

We want to get the fast channel issue fixed, but when we merge that fix
it will trigger these crashes if we also merge cpufreq support for x1e.

Can you expand the on the PERF_LEVEL_GET issue? Is it possible to
implement some workaround for the buggy firmware? Like returning a dummy
value? How exactly are things working today? Can't that be used a basis
for a quirk?

> > </cry>
> > 
> >> and can confirm that it
> >> hard resets when accessing the cpufreq sysfs attributes as well.

Johan
Sibi Sankar Jan. 6, 2025, 12:22 p.m. UTC | #9
On 12/5/24 21:16, Johan Hovold wrote:
> On Thu, Dec 05, 2024 at 04:53:05PM +0530, Sibi Sankar wrote:
>> On 11/5/24 23:42, Marc Zyngier wrote:
>>> On Tue, 05 Nov 2024 16:57:07 +0000,
>>> Johan Hovold <johan@kernel.org> wrote:
>>>> On Fri, Nov 01, 2024 at 02:43:57PM +0000, Marc Zyngier wrote:
> 
>>>>> I wonder whether the same sort of reset happen on more "commercial"
>>>>> systems (such as some of the laptops). You expect that people look at
>>>>> the cpufreq stuff closely, and don't see things exploding like we are.
>>>>
>>>> I finally got around to getting my Lenovo ThinkPad T14s to boot (it
>>>> refuses to start the kernel when using GRUB, and it's not due to the
>>>> known 64 GB memory issue as it only has 32 GB)
>>>
>>> <cry>
>>> I know the feeling. My devkit can't use GRUB either, so I added a
>>> hook to the GRUB config to generate EFI scripts that directly execute
>>> the kernel with initrd, dtb, and command line.
>>>
>>> This is probably the worse firmware I've seen in a very long while.
>>
>> The PERF_LEVEL_GET implementation in the SCP firmware side
>> is the reason for the crash :|, currently there is a bug
>> in the kernel that picks up index that we set with LEVEL_SET
>> with fast channel and that masks the crash. I was told the
>> crash happens when idle states are enabled and a regular
>> LEVEL_GET message is triggered from the kernel. This was
>> fixed a while back but it will take a while to flow back
>> to all the devices. It should already be out CRD's.
>>
>> Johan,
>> Now that you are aware of the the limitations can we make
>> a call on how to deal with this and land cpufreq?
> 
> As Marc said, it seems you need to come up with a way to detect and work
> around the broken firmware.

The perf protocol version won't have any changes so detecting
it isn't possible :(

> 
> We want to get the fast channel issue fixed, but when we merge that fix
> it will trigger these crashes if we also merge cpufreq support for x1e.
> 
> Can you expand the on the PERF_LEVEL_GET issue? Is it possible to
> implement some workaround for the buggy firmware? Like returning a dummy
> value? How exactly are things working today? Can't that be used a basis
> for a quirk?

The main problem is the X1E firmware supports fast channel level get
but when queried it says it doesn't support it :|. The PERF_LEVEL_GET
regular messaging which gets used as a fallback has a bug which causes
the device to crash. So we either enable cpufreq only on platforms
that has the fix in place or live with the warning that certain messages
don't support fast channel which I don't think will fly. I've also been
told the crash wouldn't show up if we have all sleep states disabled.

> 
>>> </cry>
>>>
>>>> and can confirm that it
>>>> hard resets when accessing the cpufreq sysfs attributes as well.
> 
> Johan
Johan Hovold Jan. 10, 2025, 9:52 a.m. UTC | #10
On Mon, Jan 06, 2025 at 05:52:48PM +0530, Sibi Sankar wrote:
> On 12/5/24 21:16, Johan Hovold wrote:

> > As Marc said, it seems you need to come up with a way to detect and work
> > around the broken firmware.
> 
> The perf protocol version won't have any changes so detecting
> it isn't possible :(

But there could be other ways, see below.

> > We want to get the fast channel issue fixed, but when we merge that fix
> > it will trigger these crashes if we also merge cpufreq support for x1e.
> > 
> > Can you expand the on the PERF_LEVEL_GET issue? Is it possible to
> > implement some workaround for the buggy firmware? Like returning a dummy
> > value? How exactly are things working today? Can't that be used a basis
> > for a quirk?
> 
> The main problem is the X1E firmware supports fast channel level get
> but when queried it says it doesn't support it :|. The PERF_LEVEL_GET
> regular messaging which gets used as a fallback has a bug which causes
> the device to crash. So we either enable cpufreq only on platforms
> that has the fix in place or live with the warning that certain messages
> don't support fast channel which I don't think will fly. I've also been
> told the crash wouldn't show up if we have all sleep states disabled.

We certainly want cpufreq enabled also on the current/older firmware
which have these bugs.

Based on the above, it sounds like your fix:

	https://lore.kernel.org/lkml/20241030125512.2884761-2-quic_sibis@quicinc.com/

is correct even if it triggers the crash on machines with buggy firmware.

Why can't you add a quirk for x1e platforms that makes sure that the
driver always uses fastchannel level get?

You know it is supported (and as has to be used) even if the buggy
firmware says it's not. Just set the corresponding attribute bit
unconditionally based on the DT machine compatible (or fall back to the
current implementation which theoretically other broken fw
implementations may also be relying on), or similar.

Johan