[0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones

Message ID 20250103-topic-sm8650-thermal-cpu-idle-v1-0-faa1f011ecd9@linaro.org

Message

Neil Armstrong Jan. 3, 2025, 2:38 p.m. UTC
On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
the CPUs and GPU is handled by hardware & firmware using factory and
form-factor determined parameters in order to maximize frequency while
keeping the temperature well below the junction temperature at which the
SoC would experience a thermal shutdown, if not permanent damage.

On the other side, the High Level Operating System (HLOS), like Linux,
is able to adjust the CPU and GPU frequency using the internal SoC
temperature sensors (here tsens) and its UP/LOW interrupts, but it
effectively does the same work twice, in a less effective manner.

Let's take the Hardware & Firmware action into account and design the
thermal zones' trip points and cooling device mappings to use the HLOS
as a safety net in case the platform experiences a temperature surge,
to hopefully avoid a thermal shutdown and handle the scenario gracefully.

On the CPU side, the LMh hardware runs the DCVS control loop, so
let's set higher trip point temperatures, closer to the junction
and thermal shutdown temperatures, and add an idle injection cooling
device with a 100% duty cycle for each CPU that acts as an emergency
action to avoid the thermal shutdown.
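
As a rough illustration of the CPU side (a sketch only, not the content
of the patch: the node labels, the tsens instance and index, the trip
temperatures and the idle durations are all placeholder assumptions),
an idle injection cooling device mapped to a high passive trip could
look like this, following the generic thermal-idle binding:

```dts
/* Illustrative sketch: every label and value below is an assumption. */
/ {
	cpus {
		cpu2: cpu@200 {
			/* ... existing CPU properties ... */

			/* Idle injection cooling device for this CPU */
			cpu2_idle: thermal-idle {
				#cooling-cells = <2>;
				duration-us = <800000>;		/* placeholder */
				exit-latency-us = <10000>;	/* placeholder */
			};
		};
	};

	thermal-zones {
		cpu2-thermal {
			thermal-sensors = <&tsens0 5>;	/* placeholder sensor */

			trips {
				/* High passive trip, close to the shutdown limit */
				cpu2_hot: trip-point0 {
					temperature = <110000>;	/* millicelsius, placeholder */
					hysteresis = <1000>;
					type = "passive";
				};

				cpu2-critical {
					temperature = <115000>;	/* millicelsius, placeholder */
					hysteresis = <0>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&cpu2_hot>;
					/* min and max state both 100: full idle injection */
					cooling-device = <&cpu2_idle 100 100>;
				};
			};
		};
	};
};
```

With such a mapping, nothing happens in the HLOS below the passive
trip, and LMh remains the only actor in the normal temperature range.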

On the GPU side, the GPU Management Unit (GMU) runs the DCVS
control loop, but since we can't perform idle injection, let's
also set higher trip point temperatures, closer to the junction
and thermal shutdown temperatures, to reduce the GPU frequency only
as an emergency action before the thermal shutdown.
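
A similar sketch for the GPU side (again illustrative only: the zone
name, the tsens instance and index, the temperatures and the &gpu label
are assumptions, and the GPU node is assumed to declare
#cooling-cells = <2>) would map the GPU itself as the cooling device
on a single high passive trip:

```dts
/* Illustrative sketch: every label and value below is an assumption. */
#include <dt-bindings/thermal/thermal.h>

/ {
	thermal-zones {
		gpu0-thermal {
			thermal-sensors = <&tsens2 1>;	/* placeholder sensor */

			trips {
				/* High passive trip, close to the shutdown limit */
				gpu0_hot: trip-point0 {
					temperature = <110000>;	/* millicelsius, placeholder */
					hysteresis = <1000>;
					type = "passive";
				};

				gpu0-critical {
					temperature = <115000>;	/* millicelsius, placeholder */
					hysteresis = <0>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&gpu0_hot>;
					/* Let the thermal core use the full frequency range */
					cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
				};
			};
		};
	};
};
```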

These 2 changes optimize the thermal management design by avoiding
concurrent thermal management, redundant calculations & unnecessary
interrupts, moving the HLOS management to a last-resort emergency role
if the Hardware & Firmware fail to avoid a thermal shutdown.

Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
---
Neil Armstrong (2):
      arm64: dts: qcom: sm8650: setup cpu thermal with idle on high temperatures
      arm64: dts: qcom: sm8650: setup gpu thermal with higher temperatures

 arch/arm64/boot/dts/qcom/sm8650.dtsi | 322 ++++++++++++++++++++++++++---------
 1 file changed, 238 insertions(+), 84 deletions(-)
---
base-commit: 8155b4ef3466f0e289e8fcc9e6e62f3f4dceeac2
change-id: 20250103-topic-sm8650-thermal-cpu-idle-1e19181a94ed

Best regards,

Comments

Konrad Dybcio Jan. 3, 2025, 2:43 p.m. UTC | #1
On 3.01.2025 3:38 PM, Neil Armstrong wrote:
> [...]

Got any numbers to back this?

Konrad
Neil Armstrong Jan. 3, 2025, 2:49 p.m. UTC | #2
On 03/01/2025 15:43, Konrad Dybcio wrote:
> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>> [...]
> 
> Got any numbers to back this?

To back which part? Yes, I've been running loads with different
scenarios, and the hardware does indeed do a much better job, with
a more linear correction and slightly better performance because
it sets slightly higher OPPs while keeping the cores closer to
the target temperature range. Which is kind of expected.

I don't have easy numbers to share, sorry...

So yes, I consider that avoiding the concurrent effort is better, and
since we already take the firmware design into account in the whole
platform representation in DT (DSPs, SCM, GMU, ...), we should also
extend this to thermal.

Neil

> 
> Konrad
Konrad Dybcio Jan. 9, 2025, 3:20 p.m. UTC | #3
On 3.01.2025 3:49 PM, Neil Armstrong wrote:
> On 03/01/2025 15:43, Konrad Dybcio wrote:
>> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>>> [...]
>>>
>>> On the GPU side, the GPU Management Unit (GMU) runs the DCVS
>>> control loop, but since we can't perform idle injection, let's
>>> also set higher trip point temperatures, closer to the junction
>>> and thermal shutdown temperatures, to reduce the GPU frequency only
>>> as an emergency action before the thermal shutdown.

We could probably work out some mechanism for drm to say "gpu is too
hot / too busy" and stall userspace's requests, if that doesn't
exist already (+RobC)

>>> [...]
>>
>> Got any numbers to back this?
> 
> To back which part? Yes, I've been running loads with different
> scenarios, and the hardware does indeed do a much better job, with
> a more linear correction and slightly better performance because
> it sets slightly higher OPPs while keeping the cores closer to
> the target temperature range. Which is kind of expected.
> 
> I don't have easy numbers to share, sorry...

Ok, what you said above sounds good already.

Konrad