Message ID | 20250103-topic-sm8650-thermal-cpu-idle-v1-0-faa1f011ecd9@linaro.org |
---|---|
Series | arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones |
On 3.01.2025 3:38 PM, Neil Armstrong wrote:
> On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
> the CPUs and GPU is handled by hardware & firmware using factory and
> form-factor determined parameters in order to maximize frequency while
> keeping the temperature well below the junction temperature where the SoC
> would experience a thermal shutdown, if not permanent damage.
>
> On the other hand, the High Level Operating System (HLOS), like Linux,
> is able to adjust the CPU and GPU frequency using the internal SoC
> temperature sensors (here tsens) and its UP/LOW interrupts, but it
> effectively does the same work twice in a less effective manner.
>
> Let's take the Hardware & Firmware action into account and design the
> thermal zones trip points and cooling devices mapping to use the HLOS
> as a safety warrant in case the platform experiences a temperature surge,
> to hopefully avoid a thermal shutdown and handle the scenario gracefully.
>
> On the CPU side, the LMh hardware does the DCVS control loop, so
> let's set higher trip point temperatures closer to the junction
> and thermal shutdown temperatures and add an idle injection cooling
> device with 100% duty cycle for each CPU that would act as an emergency
> action to avoid the thermal shutdown.
>
> On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
> control loop, but since we can't perform idle injection, let's
> also set higher trip point temperatures closer to the junction
> and thermal shutdown temperatures to reduce the GPU frequency only
> as an emergency action before the thermal shutdown.
>
> These two changes optimize the thermal management design by avoiding
> concurrent thermal management, calculations & avoidable interrupts
> by moving the HLOS management to a last-resort emergency if the
> Hardware & Firmware fail to avoid a thermal shutdown.
>
> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
> ---

Got any numbers to back this?

Konrad
On 03/01/2025 15:43, Konrad Dybcio wrote:
> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>> On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
>> the CPUs and GPU is handled by hardware & firmware using factory and
>> form-factor determined parameters in order to maximize frequency while
>> keeping the temperature well below the junction temperature where the SoC
>> would experience a thermal shutdown, if not permanent damage.
>>
>> On the other hand, the High Level Operating System (HLOS), like Linux,
>> is able to adjust the CPU and GPU frequency using the internal SoC
>> temperature sensors (here tsens) and its UP/LOW interrupts, but it
>> effectively does the same work twice in a less effective manner.
>>
>> Let's take the Hardware & Firmware action into account and design the
>> thermal zones trip points and cooling devices mapping to use the HLOS
>> as a safety warrant in case the platform experiences a temperature surge,
>> to hopefully avoid a thermal shutdown and handle the scenario gracefully.
>>
>> On the CPU side, the LMh hardware does the DCVS control loop, so
>> let's set higher trip point temperatures closer to the junction
>> and thermal shutdown temperatures and add an idle injection cooling
>> device with 100% duty cycle for each CPU that would act as an emergency
>> action to avoid the thermal shutdown.
>>
>> On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
>> control loop, but since we can't perform idle injection, let's
>> also set higher trip point temperatures closer to the junction
>> and thermal shutdown temperatures to reduce the GPU frequency only
>> as an emergency action before the thermal shutdown.
>>
>> These two changes optimize the thermal management design by avoiding
>> concurrent thermal management, calculations & avoidable interrupts
>> by moving the HLOS management to a last-resort emergency if the
>> Hardware & Firmware fail to avoid a thermal shutdown.
>>
>> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
>> ---
>
> Got any numbers to back this?

To back which part?

Yes, I've been running loads with different scenarios, and the hardware
effectively does a much better job, with a more linear correction and
slightly better performance because it sets slightly higher OPPs while
maintaining the core closer to the target temperature range. Which is
kind of expected.

I don't have easy numbers to share, sorry...

So yes, I consider that avoiding the concurrent effort is better, but since
we also take the firmware design into account in the whole platform
representation in DT (DSPs, SCM, GMU, ...), we should also extend this
to thermal.

Neil

>
> Konrad
On 3.01.2025 3:49 PM, Neil Armstrong wrote:
> On 03/01/2025 15:43, Konrad Dybcio wrote:
>> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>>> On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
>>> the CPUs and GPU is handled by hardware & firmware using factory and
>>> form-factor determined parameters in order to maximize frequency while
>>> keeping the temperature well below the junction temperature where the SoC
>>> would experience a thermal shutdown, if not permanent damage.
>>>
>>> On the other hand, the High Level Operating System (HLOS), like Linux,
>>> is able to adjust the CPU and GPU frequency using the internal SoC
>>> temperature sensors (here tsens) and its UP/LOW interrupts, but it
>>> effectively does the same work twice in a less effective manner.
>>>
>>> Let's take the Hardware & Firmware action into account and design the
>>> thermal zones trip points and cooling devices mapping to use the HLOS
>>> as a safety warrant in case the platform experiences a temperature surge,
>>> to hopefully avoid a thermal shutdown and handle the scenario gracefully.
>>>
>>> On the CPU side, the LMh hardware does the DCVS control loop, so
>>> let's set higher trip point temperatures closer to the junction
>>> and thermal shutdown temperatures and add an idle injection cooling
>>> device with 100% duty cycle for each CPU that would act as an emergency
>>> action to avoid the thermal shutdown.
>>>
>>> On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
>>> control loop, but since we can't perform idle injection, let's
>>> also set higher trip point temperatures closer to the junction
>>> and thermal shutdown temperatures to reduce the GPU frequency only
>>> as an emergency action before the thermal shutdown.

We could probably work out some mechanism for drm to say "gpu is too hot /
too busy" and stall userspace's requests, if that doesn't exist
already (+RobC).

>>>
>>> These two changes optimize the thermal management design by avoiding
>>> concurrent thermal management, calculations & avoidable interrupts
>>> by moving the HLOS management to a last-resort emergency if the
>>> Hardware & Firmware fail to avoid a thermal shutdown.
>>>
>>> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
>>> ---
>>
>> Got any numbers to back this?
>
> To back which part?
>
> Yes, I've been running loads with different scenarios, and the hardware
> effectively does a much better job, with a more linear correction and
> slightly better performance because it sets slightly higher OPPs while
> maintaining the core closer to the target temperature range. Which is
> kind of expected.
>
> I don't have easy numbers to share, sorry...

Ok, what you said above sounds good already.

Konrad
On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
the CPUs and GPU is handled by hardware & firmware using factory and
form-factor determined parameters in order to maximize frequency while
keeping the temperature well below the junction temperature where the SoC
would experience a thermal shutdown, if not permanent damage.

On the other hand, the High Level Operating System (HLOS), like Linux,
is able to adjust the CPU and GPU frequency using the internal SoC
temperature sensors (here tsens) and its UP/LOW interrupts, but it
effectively does the same work twice in a less effective manner.

Let's take the Hardware & Firmware action into account and design the
thermal zones trip points and cooling devices mapping to use the HLOS
as a safety warrant in case the platform experiences a temperature surge,
to hopefully avoid a thermal shutdown and handle the scenario gracefully.

On the CPU side, the LMh hardware does the DCVS control loop, so
let's set higher trip point temperatures closer to the junction
and thermal shutdown temperatures and add an idle injection cooling
device with 100% duty cycle for each CPU that would act as an emergency
action to avoid the thermal shutdown.

On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
control loop, but since we can't perform idle injection, let's
also set higher trip point temperatures closer to the junction
and thermal shutdown temperatures to reduce the GPU frequency only
as an emergency action before the thermal shutdown.

These two changes optimize the thermal management design by avoiding
concurrent thermal management, calculations & avoidable interrupts
by moving the HLOS management to a last-resort emergency if the
Hardware & Firmware fail to avoid a thermal shutdown.

Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org>
---
Neil Armstrong (2):
      arm64: dts: qcom: sm8650: setup cpu thermal with idle on high temperatures
      arm64: dts: qcom: sm8650: setup gpu thermal with higher temperatures

 arch/arm64/boot/dts/qcom/sm8650.dtsi | 322 ++++++++++++++++++++++++++---------
 1 file changed, 238 insertions(+), 84 deletions(-)
---
base-commit: 8155b4ef3466f0e289e8fcc9e6e62f3f4dceeac2
change-id: 20250103-topic-sm8650-thermal-cpu-idle-1e19181a94ed

Best regards,
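
For readers unfamiliar with the bindings involved, the approach described in the cover letter might look roughly like the fragment below. It is only a sketch, not the content of the actual patches: the labels (`cpu0`, `cpu0_idle`, `tsens0`, `tsens2`, `gpu`), the sensor indices, and every temperature, hysteresis and timing value are assumptions chosen for illustration, and the fragment presumes it sits in a dtsi where those labelled nodes already exist. It uses the generic `thermal-idle` and `thermal-zones` bindings, where the two cooling cells are the minimum and maximum cooling state and, for idle injection, the state is the idle duty cycle in percent.

```dts
/* Illustrative sketch only: labels, sensor indices and values are assumed. */
#include <dt-bindings/thermal/thermal.h>

&cpu0 {
	/* Per-CPU idle injection cooling device (thermal-idle binding). */
	cpu0_idle: thermal-idle {
		#cooling-cells = <2>;
		duration-us = <800000>;		/* assumed idle injection period */
		exit-latency-us = <10000>;	/* assumed maximum exit latency */
	};
};

/ {
	thermal-zones {
		cpu0-thermal {
			thermal-sensors = <&tsens0 1>;	/* hypothetical sensor */

			trips {
				/* Set close to the shutdown temperature on purpose. */
				cpu0_emergency: trip-point0 {
					temperature = <105000>;	/* millicelsius, assumed */
					hysteresis = <5000>;
					type = "passive";
				};

				cpu0-critical {
					temperature = <110000>;	/* assumed */
					hysteresis = <1000>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&cpu0_emergency>;
					/* min = max = 100: inject idle at 100% duty cycle. */
					cooling-device = <&cpu0_idle 100 100>;
				};
			};
		};

		gpu0-thermal {
			thermal-sensors = <&tsens2 1>;	/* hypothetical sensor */

			trips {
				gpu0_emergency: trip-point0 {
					temperature = <105000>;	/* assumed */
					hysteresis = <5000>;
					type = "passive";
				};

				gpu0-critical {
					temperature = <110000>;	/* assumed */
					hysteresis = <1000>;
					type = "critical";
				};
			};

			cooling-maps {
				map0 {
					trip = <&gpu0_emergency>;
					/* No idle injection on the GPU: lower its frequency instead. */
					cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
				};
			};
		};
	};
};
```

The point of the sketch is the shape rather than the numbers: a single high passive trip per zone, mapped to an emergency cooling action, leaves the normal DCVS loop to LMh and the GMU, while the critical trip remains as the final shutdown threshold.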