
thermal/intel: introduce tcc cooling driver

Message ID 20210115094744.21156-1-rui.zhang@intel.com
State Superseded

Commit Message

Zhang, Rui Jan. 15, 2021, 9:47 a.m. UTC
On Intel processors, the core frequency can be reduced below the OS
request when the current temperature reaches the TCC (Thermal Control
Circuit) activation temperature.

The default TCC activation temperature is specified by
MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by specifying an
offset in degrees C, using the TCC Offset bits in the same MSR register.

This patch introduces a cooling device driver that utilizes the TCC
Offset feature. The higher the current cooling state, the lower the
effective TCC activation temperature, so that the processors can be
throttled earlier, before the system overheats critically.
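As an illustration of the arithmetic described above, here is a minimal Python sketch that decodes an MSR_IA32_TEMPERATURE_TARGET value into the default activation temperature, the TCC offset, and the effective activation temperature. The bit positions (23:16 for the default, 29:24 for the offset) are an assumption, chosen because they reproduce the decoded values reported in the comments of this thread; the function name is hypothetical and this is not the driver code.

```python
# Sketch only: field positions are assumed, not taken from the patch.
# Assumed layout: bits 23:16 = default TCC activation temperature (deg C),
#                 bits 29:24 = TCC offset (deg C, 6-bit field).

def decode_temperature_target(msr: int) -> dict:
    """Split a raw MSR_IA32_TEMPERATURE_TARGET value into its parts."""
    default_c = (msr >> 16) & 0xFF
    offset_c = (msr >> 24) & 0x3F
    return {
        "default_c": default_c,
        "offset_c": offset_c,
        # A larger offset lowers the effective activation temperature.
        "effective_c": default_c - offset_c,
    }


if __name__ == "__main__":
    # Raw values as printed by turbostat in the comments of this thread.
    for raw in (0x2B64100D, 0x2A64100D, 0x3F64100D):
        d = decode_temperature_target(raw)
        print(f"0x{raw:08x}: {d['effective_c']} C "
              f"({d['default_c']} default - {d['offset_c']} offset)")
```

With the three raw values reported in the comments, this gives 57 C, 58 C and 37 C, matching the turbostat lines quoted there.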

This patch has been tested on a KBL mobile platform.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
 drivers/thermal/intel/Kconfig             |   8 ++
 drivers/thermal/intel/Makefile            |   1 +
 drivers/thermal/intel/intel_tcc_cooling.c | 128 ++++++++++++++++++++++
 3 files changed, 137 insertions(+)
 create mode 100644 drivers/thermal/intel/intel_tcc_cooling.c

Comments

Zhang, Rui Jan. 18, 2021, 9:31 a.m. UTC | #1
> -----Original Message-----

> From: Doug Smythies <dsmythies@telus.net>

> Sent: Sunday, January 17, 2021 5:22 AM

> To: Zhang, Rui <rui.zhang@intel.com>; Brown, Len <len.brown@intel.com>

> Cc: daniel.lezcano@linaro.org; srinivas.pandruvada@linux.intel.com;

> linux-pm@vger.kernel.org; 'Doug Smythies' <dsmythies@telus.net>

> Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver

> Importance: High

> 

> On 2021.01.16 09:08 Doug Smythies wrote:

> > On 2021.01.15 Zhang Rui wrote:

> 

> Added Len to the "To" list:

> 

> Turbostat has another issue with this stuff.

> It will be more work than I want to do to submit a fix patch, so I am not, but

> see further down for my hack fix.

> 

> ...

> 

> > Example step function overshoot, trip point set to 55 degrees C.

> >

> > doug@s18:~$ sudo ~/turbostat --Summary --quiet --show

> > Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1

> > Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> > 0.07    800     45      24      1.89    0.00

> > 0.04    800     29      23      1.89    0.00

> > 61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6

> cores

> > 67.76   4570    4476    66      120.42  0.00

> > 68.03   4567    4488    66      120.73  0.00

> > 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point

> > 68.10   4489    4493    58      109.19  0.00 < this throttling is either the power

> servo or the temp

> > servo.

> > 68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.

> > 68.13   4143    4513    48      75.16   0.00

> > 68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't

> know why.

> > 68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.

> 

> It turns out that turbostat does not list the package temperature properly if it

> is started with an active TCC offset.

> It erroneously includes the offset in the temperature math.

> In the above example turbostat had also not yet been fixed for the bit mask

> issue. So the real temp above was 59 degrees C.

> 

> > 68.44   4000    4502    45      67.16   0.00

> > 68.06   4000    4483    45      66.95   0.00

> > 68.02   3973    4490    44      65.20   0.00

> > 67.94   3900    4489    43      60.51   0.00

> > 67.88   3900    4501    44      60.55   0.00

> > 67.85   3900    4472    43      60.52   0.00

> 

> And it settled at about 56 degrees, close to what was asked for.

> 

> To proceed with my work, I did a hack fix to turbostat:

> 

> doug@s18:~/temp-k-git/linux/tools/power/x86/turbostat$ git diff

> diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c

> index d7acdd4d16c4..7f0a22ab3a0d 100644

> --- a/tools/power/x86/turbostat/turbostat.c

> +++ b/tools/power/x86/turbostat/turbostat.c

> @@ -4831,6 +4831,7 @@ int read_tcc_activation_temp()

>                 fprintf(outf, "cpu%d: MSR_IA32_TEMPERATURE_TARGET: 0x%08llx (%d C) (%d default - %d offset)\n",

>                         base_cpu, msr, tcc, target_c, offset_c);

> 

> +       tcc = target_c;

>         return tcc;

>  }

> 


Yes, this is the right fix.
I think Len already knows about this breakage and will propose a fix soon.
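To make the effect of the hack concrete, the following Python sketch reproduces the before/after package temperatures that Doug quotes further down in this message. It assumes that the digital readout in MSR_IA32_PACKAGE_THERM_STATUS sits in bits 22:16 and counts degrees below a reference temperature; the function names are hypothetical and this is not turbostat's code.

```python
# Sketch only: the readout position (bits 22:16) is an assumption that
# happens to reproduce the numbers quoted in this message.

def therm_status_readout(msr: int) -> int:
    """Degrees below the reference temperature, from the raw status MSR."""
    return (msr >> 16) & 0x7F


def package_temp_c(reference_c: int, therm_status_msr: int) -> int:
    """Package temperature = reference temperature minus the readout."""
    return reference_c - therm_status_readout(therm_status_msr)


if __name__ == "__main__":
    # Before the hack: the reference wrongly includes the 43 degree offset.
    print(package_temp_c(57, 0x88420000))
    # After the hack: the reference is the 100 degree default.
    print(package_temp_c(100, 0x88400000))
```

Using the offset-adjusted 57 C as the reference yields the bogus -9 C reading, while using the 100 C default yields 36 C, which is the difference the one-line change makes.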

> So this:

> 

> cpu4: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default -

> 43 offset)

> cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88420000 (-9 C)

> 

> becomes this:

> 

> cpu1: MSR_IA32_TEMPERATURE_TARGET: 0x2b64100d (57 C) (100 default -

> 43 offset)

> cpu0: MSR_IA32_PACKAGE_THERM_STATUS: 0x88400000 (36 C)

> 

> and this:

> 

> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> 0.08    1079    928     -11     1.91    0.00

> 

> Becomes this:

> 

> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> 0.05    1046    846     32      1.94    0.00

> 

> So now back to my overshoot example:

> 

> This:

> 

> > 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point

> 

> Was actually:

> 

> > 67.98   4572    4492    80      121.00  0.00 <<< 25 degrees over trip point

> 

> But let's just do it again:

> 

> doug@s18:~$ cat /sys/devices/virtual/thermal/cooling_device11/cur_state

> 43       <<< so 100 - 43 = 57 degrees trip point.

> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show

> Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 0.25

> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> 0.09    800     6       36      2.01    0.00

> 0.16    800     23      36      2.00    0.00

> 0.11    800     14      36      2.15    0.00

> 66.81   4461    1160    70      101.17  0.00 <<< load applied, temp up 34 degrees in

> less than 0.25 seconds. Normal.

> 68.06   4581    1126    74      117.36  0.00

> 67.69   4589    1119    76      119.60  0.00

> 67.80   4589    1125    77      120.94  0.00

> 67.83   4587    1132    78      120.75  0.00

> 67.68   4591    1125    78      121.63  0.00

> 68.07   4585    1139    77      121.25  0.00

> 67.80   4588    1121    79      121.41  0.00 <<< now 20 degrees over trip point.

> 68.57   4579    1139    79      121.71  0.00

> ...

> 68.03   4220    1130    63      80.28   0.00 <<< it takes quite awhile (>7 seconds) to

> really throttle down.


What platform is this?
On a KBL platform I'm running right now, with the performance governor, the TCC offset set to 30 (effective TCC is 70C), and turbostat fixed,
I can observe that
1. all CPUs run at max turbo frequency (3.9G) when idle, with PkgTmp around 40C
2. with load applied (I use the stress tool to get 100% CPU load), PkgTmp reports 70C and the frequency drops to around 3G, IMMEDIATELY.
3. when I change the TCC Offset to 60, the CPU is throttled to around 200MHz and the temperature is around 50C, IMMEDIATELY.
4. when I change the TCC Offset to 20, the CPU frequency rises to the turbo range and PkgTmp reaches 80C, IMMEDIATELY.

So in your test, there are two things I don't understand. 😊
a) it takes such a long time (7+ seconds) to throttle
b) it throttles to a frequency that is not low enough (in order to keep the system under the effective TCC temperature, the frequency can be throttled to below the turbo range, to LFM, and even below LFM in my case)

Can you please try the performance governor and 100% CPU load to see if the symptom is the same?

thanks,
rui
> 

> ... Doug

>
Zhang, Rui Jan. 18, 2021, 9:46 a.m. UTC | #2
Hi, Doug,

Thanks for testing this patch.

> -----Original Message-----

> From: Doug Smythies <dsmythies@telus.net>

> Sent: Sunday, January 17, 2021 1:08 AM

> To: Zhang, Rui <rui.zhang@intel.com>

> Cc: daniel.lezcano@linaro.org; srinivas.pandruvada@linux.intel.com;

> linux-pm@vger.kernel.org

> Subject: RE: [PATCH] thermal/intel: introduce tcc cooling driver

> Importance: High

> 

> On 2021.01.15 Zhang Rui wrote:

> >

> > On Intel processors, the core frequency can be reduced below OS

> > request, when the current temperature reaches the TCC (Thermal Control

> > Circuit) activation temperature.

> >

> > The default TCC activation temperature is specified by

> > MSR_IA32_TEMPERATURE_TARGET. However, it can be adjusted by

> specifying

> > an offset in degrees C, using the TCC Offset bits in the same MSR register.

> >

> > This patch introduces a cooling devices driver that utilizes the TCC

> > Offset feature. The bigger the current cooling state is, the lower the

> > effective TCC activation temperature is, so that the processors can be

> > throttled earlier before system critical overheats.

> 

> Thank you for this useful patch.

> My systems don't need thermald or any other thermal control, but it is nice

> to have this extra margin to add to the critical stuff, as a backup.

> I also like to use the offset to test stuff.

> 

> I use the internal power limit servo for power limiting, and that servo works

> very well indeed. Using this temperature offset as a way to servo the

> thermal operating limit does work, but tends to overshoot, oscillate, hold low

> excessively long (minutes). 


Do you have a script to test and show the drawbacks of this feature?
It seems that it behaves differently on different platforms.
Maybe we can evaluate this on more platforms.

> It also seems to limit CPU clock frequency

> reduction to the non-turbo limit, regardless of the desired maximum

> temperature.

> 

> I am not familiar with the thermal stuff at all, and didn't know where to find

> the trip point knob. Anyway, found "cooling_devices11".
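For readers who also have to hunt for the right cooling device, here is a small Python sketch that scans the thermal sysfs directory for cooling devices of a given type instead of guessing the index. The `cooling_deviceN/type` layout is the standard thermal sysfs one; the type string "TCC Offset" is an assumption about what this driver registers, and the function name is hypothetical.

```python
from pathlib import Path


def find_cooling_devices(type_name: str, root: str = "/sys/class/thermal") -> list:
    """Return the cooling_device* directories whose 'type' file matches."""
    matches = []
    for type_file in sorted(Path(root).glob("cooling_device*/type")):
        try:
            if type_file.read_text().strip() == type_name:
                matches.append(type_file.parent)
        except OSError:
            # Unreadable entries are skipped rather than treated as errors.
            continue
    return matches


# Example use (the type string is an assumption, see above):
#   for dev in find_cooling_devices("TCC Offset"):
#       print(dev, (dev / "cur_state").read_text().strip())
```

The `root` parameter exists only so the function can be pointed at a scratch directory when trying it out without the hardware.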

> 

> I do not understand this:

> 

> ~$ cat /sys/devices/virtual/thermal/cooling_device11/stats/trans_table

> cat: /sys/devices/virtual/thermal/cooling_device11/stats/trans_table: File

> too large


This is a known issue: the stats table cannot handle devices with too many cooling states, say, the 127 cooling states of the TCC Offset cooling device.
We can ignore this for now.

> 

> Rather than enter the actual TCC offset, I would rather enter the desired trip

> point, and have the driver do the math to convert it to the offset.


Hmmm, a writable trip point? I need to think about this.
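Until something like that exists, the conversion Doug asks for is simple enough to do by hand or in a helper script. The sketch below is a hypothetical user-space helper in Python, not part of the patch. It assumes a default activation temperature of 100 C (the value turbostat reports on Doug's system) and a maximum offset of 63 (a 6-bit field); both are parameters because they may differ on other platforms.

```python
def offset_for_trip_point(trip_c: int, default_c: int = 100, max_offset: int = 63) -> int:
    """Convert a desired trip point (deg C) into a TCC offset / cooling state.

    default_c and max_offset are assumptions; pass the values that apply
    to the platform at hand (for example the device's max_state).
    """
    if trip_c > default_c:
        raise ValueError("trip point cannot exceed the default activation temperature")
    offset = default_c - trip_c
    if offset > max_offset:
        raise ValueError("trip point is too low for the available offset range")
    return offset


if __name__ == "__main__":
    for trip in (70, 57, 37):
        print(f"trip point {trip} C -> offset {offset_for_trip_point(trip)}")
```

The three trip points in the example correspond to offsets that appear elsewhere in this thread (30, 43 and 63).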

> 

> Example step function overshoot, trip point set to 55 degrees C.

> 

> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show

> Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1

> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> 0.07    800     45      24      1.89    0.00

> 0.04    800     29      23      1.89    0.00

> 61.76   4546    4151    66      103.77  0.00 < step function load applied on 4 of 6

> cores

> 67.76   4570    4476    66      120.42  0.00

> 68.03   4567    4488    66      120.73  0.00

> 67.98   4572    4492    67      121.00  0.00 < 19 degrees over trip point

> 68.10   4489    4493    58      109.19  0.00 < this throttling is either the power

> servo or the temp servo.

> 68.08   4262    4476    51      82.82   0.00 < this throttling is the temp servo.

> 68.13   4143    4513    48      75.16   0.00

> 68.03   4086    4488    46      71.87   0.00 < It actually undershoots often, I don't

> know why.

> 68.12   4000    4505    46      67.02   0.00 < often it doesn't undershoot.

> 68.44   4000    4502    45      67.16   0.00

> 68.06   4000    4483    45      66.95   0.00

> 68.02   3973    4490    44      65.20   0.00

> 67.94   3900    4489    43      60.51   0.00

> 67.88   3900    4501    44      60.55   0.00

> 67.85   3900    4472    43      60.52   0.00

> 67.96   3900    4481    43      60.59   0.00

> 68.26   3900    4501    44      60.70   0.00

> 67.93   3900    4498    43      60.58   0.00

> 68.03   3900    4476    43      60.68   0.00

> 67.83   3900    4481    44      60.54   0.00

> 35.06   3895    2412    25      32.13   0.00 < load removed.

> 0.04    800     25      24      1.89    0.00

> 0.04    800     22      23      1.89    0.00

> 0.06    800     35      23      1.90    0.00

> 0.03    800     18      23      1.89    0.00

> 0.04    800     26      22      1.90    0.00

> 0.30    1927    44      23      1.97    0.00

> ^C0.10  800     25      23      1.91    0.00

> 

> Example long time to recover:

> (actually, this example never recovers, unusual):

> Note: 3.7 GHz is the limit.

> 

> doug@s18:~$ sudo ~/turbostat --Summary --quiet --show

> Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 30

> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt

> 67.58   3700    134812  42      52.15   0.00 <<< the trip point was changed from 37

> to 57 degrees

> 67.90   3700    134964  42      52.08   0.00

> 68.07   3700    134424  42      52.06   0.00

> 68.01   3700    134415  41      50.76   0.00

> 68.14   3700    134521  41      50.78   0.00

> 68.11   3700    134424  42      50.75   0.00

> 68.03   3700    134329  42      50.70   0.00

> 68.11   3700    134321  42      50.76   0.00

> 68.05   3700    134456  42      51.09   0.00

> 68.12   3700    134549  42      52.21   0.00

> 68.12   3700    134482  42      52.19   0.00

> 68.10   3700    134301  42      52.20   0.00

> 68.11   3700    134444  42      52.14   0.00

> 68.08   3700    134422  42      52.17   0.00

> 68.07   3700    134430  42      52.23   0.00

> 68.00   3700    134723  42      52.12   0.00

> 67.96   3711    135207  44      52.53   0.00 <<< It takes 8 minutes until the

> frequency goes above 3.7 GHz

> 68.05   3765    134519  42      54.34   0.00

> 68.11   3771    134461  43      54.60   0.00

> 67.83   3763    134867  43      54.26   0.00

> 67.93   3773    134577  43      54.78   0.00 <<< But it never recovers, Why not?

> ...

> 

> For unknown reason the processor seems to now think it is not heavily

> loaded. From my MSR decoder:

> 

> 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL

> 

> From the book:

> 

> > Autonomous Utilization-Based Frequency Control Status (R0) When set,

> > frequency is reduced below the operating system request because the

> > processor has detected that utilization is low.

> 

> Which is not true.

> 

> Anyway,

> 

> Acked-by: Doug Smythies <dsmythies@telus.net>

> 

thanks,
rui
Doug Smythies Jan. 19, 2021, 7:10 a.m. UTC | #3
On 2021.01.18 01:32 Zhang, Rui wrote:
>  On 2021.01.17 05:22 Doug Smythies wrote:

> > On 2021.01.16 09:08 Doug Smythies wrote:

> > > On 2021.01.15 Zhang Rui wrote:

...
> 

> What platform is this?


My i5-9600K test server.
Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz
6 CPUs and 6 cores.
Kernel: 5.11-rc3 + this patch.
Water cooled, with water pump always running full speed.

> On a KBL platform I'm running right now, with performance governor, and tcc offset set to 30

> (Effective TCC  is 70c), and also turbostat fixed,

> I can observe that

> 1. all cpus running at max turbo freq (3.9G) when idle, PkgTmp around 40C

> 2. with load applied (I use stress tool to get 100% CPU load), the PkgTmp reports 70C and the

> frequency drops to  around 3G, IMMEDIATELY.

> 3. when I change TCC Offset to 60, cpu is throttled to around 200MHz, and the temperature is at around

> 50C, IMMEDIATELY.

> 4. when I change TCC Offset to  20, cpu freq raises to turbo range, and PkgTmp reaches 80C,

> IMMEDIATELY.


O.K. You should be able to measure "IMMEDIATELY" and tell us what it is.
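One way to put a number on it is to post-process the turbostat output. The Python sketch below parses `--Summary` rows, finds the first loaded sample, and reports how long it takes before the busy frequency first drops, plus the peak overshoot above a given trip point. The thresholds (`busy_min`, `drop_mhz`) and function names are hypothetical choices, and the sample rows are copied from Test 4.1 later in this message.

```python
SAMPLE = """
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.01    4498    25      32      1.95
0.01    4603    17      32      1.93
0.01    4153    31      31      1.96
36.34   4600    2581    51      35.31 <<< load for last ~ 0.727 seconds.
51.13   4600    3562    52      47.88
50.77   4600    3620    52      47.98
51.03   4600    3551    52      48.11
51.13   4600    3596    53      48.14
50.87   4600    3627    52      48.20
51.30   4600    3535    52      48.30
51.17   4550    3534    50      46.26 <<< start throttling, ~ 7 seconds
51.27   4452    3567    48      42.05
"""


def parse_rows(text: str) -> list:
    """Parse a header line plus numeric rows; '<' annotations are dropped."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split("<")[0].split()
        if len(fields) != len(header):
            continue  # skip "..." markers and wrapped comment lines
        try:
            rows.append({k: float(v) for k, v in zip(header, fields)})
        except ValueError:
            continue
    return rows


def time_to_throttle(rows, interval_s, busy_min=20.0, drop_mhz=20.0):
    """Seconds from the first loaded sample to the first frequency drop."""
    loaded = [i for i, r in enumerate(rows) if r["Busy%"] >= busy_min]
    if not loaded:
        return None
    start = loaded[0]
    peak = rows[start]["Bzy_MHz"]
    for i in range(start + 1, len(rows)):
        if rows[i]["Bzy_MHz"] <= peak - drop_mhz:
            return (i - start) * interval_s
        peak = max(peak, rows[i]["Bzy_MHz"])
    return None


def peak_overshoot(rows, trip_c):
    """Highest reported PkgTmp minus the trip point."""
    return max(r["PkgTmp"] for r in rows) - trip_c


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    print("time to throttle (s):", time_to_throttle(rows, 1.0))
    print("peak overshoot (C):", peak_overshoot(rows, 45))
```

The result is quantized to the sampling interval, and the load may start partway through a sample, so a short `--interval` is needed for anything that really is close to immediate.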

> 

> So in your test, there is something I don't understand. 😊

> a) it takes such a long time (7+ seconds) to throttle


See test results below, it does seem to throttle quickly, but
then the temperature creeps up.

> b) it throttles to a frequency that is not low enough (in order to keep the system under effective TCC

> temperature, the frequency can be throttled to below turbo range, LFM, and even below LFM in my case)


c) it can take a long time to respond to an increase in the allowed temperature. This is likely
related to some integral term build-up from condition "b" above: because yours isn't clamped
to 3.7 GHz, the response is more "immediate". I test both conditions, repeatedly, below.

> 

> Can you please try performance governor and 100% CPU load to see if the symptom is the same?


I did 100% load on 4 of 6 CPUs on purpose: so as not to hit PKG Limit #2 from the outset, and to
have 2 CPUs idle, as I thought it might be more challenging.

In terms of maximum heat generation, or maximum energy used, I studied every method I could
find, including several of my own methods, settling on prime95 / torture test / max heat method.

Note: all previous work was done with the intel_pstate driver, HWP enabled, powersave governor.

Test 1: intel_cpufreq,  HWP enabled, performance governor.
Test 1.1: startup delay, requires faster sampling:
MSR_IA32_TEMPERATURE_TARGET: 0x2a64100d (58 C) (100 default - 42 offset)
at 58 degrees it shouldn't clamp.

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 0.25
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
0.02    4600    6       31      1.98    0.00
0.53    4600    41      31      2.54    0.00
33.29   4360    645     52      37.34   0.00 <<< PKG Limit #2 already engaged
99.03   4271    1512    59      121.84  0.00 <<< O.K. Seems additional throttling is "IMMEDIATE"
98.85   4244    1511    60      119.81  0.00
98.80   4239    1516    61      119.71  0.00
98.82   4230    1510    63      120.02  0.00
98.84   4228    1509    63      119.32  0.00
98.81   4230    1514    63      120.16  0.00
98.78   4224    1511    63      119.00  0.00
98.82   4226    1510    63      119.18  0.00
98.81   4225    1514    64      119.77  0.00
98.84   4225    1509    63      119.23  0.00
98.82   4225    1511    65      119.56  0.00 <<< But, what? Now 7 degrees over.
   Note: increase in waste heat for otherwise unchanged operating
   conditions is normal at high limits of operation.
   Note: I do not know the level of hysteresis, if any. This might be normal.
98.80   4227    1515    63      119.93  0.00
... delete 14.5 seconds ...
100.25  4217    1514    63      111.25  0.00
100.26  4200    1514    62      109.29  0.00 <<< O.K. finally brings it down.
100.26  4200    1509    62      109.15  0.00
... delete 8.75 seconds
100.26  4100    1509    60      101.64  0.00
100.26  4100    1511    60      101.61  0.00  <<< These two are important, because they
100.25  4010    1515    58      94.65   0.00  <<< reveal that we did not hit PKG Limit #1
                                              <<< 100.0 watts
                                              <<< and we know for certain it is the temp
                                              <<< servo.

Test 1.2: clamp and recover delay, requires slower sampling:

MSR_IA32_TEMPERATURE_TARGET: 0x3f64100d (37 C) (100 default - 63 offset)

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 30
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
100.26  3700    180608  59      72.80   0.00
100.26  3700    180407  60      72.70   0.00 <<< steady state
100.26  3700    181663  59      72.65   0.00

100.26  3700    46322   59      72.66   0.00 <<< (close to time offset set to 37)
100.26  3700    180508  60      72.93   0.00
100.26  3700    180396  59      74.24   0.00
100.26  3700    180330  60      74.74   0.00
100.26  3700    180359  59      74.77   0.00
100.26  3775    180327  64      79.08   0.00 <<< ~~2 minutes 30 seconds response time
100.26  3853    180369  62      84.72   0.00
100.26  3865    180571  64      85.83   0.00
100.26  3866    180383  62      85.90   0.00

Now, change to 1 second sample time and change the offset again,
but this time it is not clamped already first.

doug@s18:~$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
100.26  3875    6093    62      87.49   0.00 
100.26  3800    6017    62      81.03   0.00 <<< by the way, notice the oscillations
100.26  3883    6023    64      87.98   0.00
100.26  3900    6020    64      89.52   0.00 <<< Processor package power oscillates quite a lot
100.26  3801    6021    62      81.09   0.00 <<< Frequency oscillates also.
100.26  3857    6021    64      85.70   0.00 <<< but in this region, 1 pstate ~= 10 watts
100.26  3900    6018    64      89.34   0.00
...
100.26  3852    6020    62      85.24   0.00
100.26  3800    6019    62      80.82   0.00
100.26  3885    6047    64      87.77   0.00 <<< trip point changed to 70
100.26  3963    6017    67      94.88   0.00 <<< yes, offset change response is fast
100.26  4000    6017    67      98.35   0.00
100.26  4079    6018    69      105.17  0.00
100.26  4100    6017    69      107.02  0.00
... delete 25 seconds ...
100.24  4042    6017    67      102.16  0.00 <<< PKG Limit #1 takes over
100.23  4016    6017    67      99.84   0.00 <<< All throttling is now PKG Limit #1
100.23  4017    6024    68      99.84   0.00
100.23  4015    6026    67      99.77   0.00

Test 2: intel_pstate,  HWP enabled, powersave governor.
Test 2.1: startup delay, requires faster sampling:
MSR_IA32_TEMPERATURE_TARGET: 0x2a64100d (58 C) (100 default - 42 offset)
at 58 degrees it shouldn't clamp.

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 0.1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.28    800     12      33      1.93
0.25    800     10      33      1.90
0.31    800     13      33      1.90
0.79    800     19      33      1.92
0.34    800     32      34      1.90
0.22    800     5       33      1.91   <<< ~ 77% of next sample is busy and 20 degrees already
61.91   4103    469     53      60.20  <<< 260 degrees per second
99.01   4264    610     56      121.94 <<< how much PKG Limit #2 and/or TCC loop, I don't know.
98.87   4251    614     61      120.74 <<< unthrottled would be 4.60 GHz
98.87   4235    609     62      119.87
98.85   4226    609     63      119.60
... delete 18.4 seconds
100.26  4100    613     62      102.41
100.26  4100    609     61      102.28
100.25  4040    609     60      97.49  <<< Don't know between PKG Limit #1 and/or TCC loop
100.26  4000    609     59      95.02  <<< definitely TCC loop
100.26  4000    615     60      94.01
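The "~77%" and "260 degrees per second" annotations in the table above can be reproduced with a little arithmetic. The Python sketch below assumes (this is an inference from the numbers, not something stated in the message) that the loaded fraction of the partial sample is estimated from the ratio of its IRQ count to the IRQ count of the following fully loaded sample. The function name is hypothetical.

```python
def ramp_rate(temp_before_c, temp_after_c, irq_partial, irq_full, interval_s):
    """Estimate the loaded fraction of a sample and the heating rate.

    Assumption: interrupts arrive at a steady rate under load, so the
    IRQ ratio approximates the fraction of the interval spent loaded.
    """
    loaded_fraction = irq_partial / irq_full
    loaded_s = loaded_fraction * interval_s
    rate_c_per_s = (temp_after_c - temp_before_c) / loaded_s
    return loaded_fraction, rate_c_per_s


if __name__ == "__main__":
    # Test 2.1 rows: 33 C -> 53 C, 469 IRQs vs 610 in the next sample, 0.1 s.
    fraction, rate = ramp_rate(33, 53, 469, 610, 0.1)
    print(f"loaded fraction ~{fraction:.0%}, ramp ~{rate:.0f} C/s")
```

Under that assumption the partial sample was loaded for roughly 77 ms, and a 20 degree rise over that time comes out at about 260 degrees per second.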

Test 2.2: clamp and recover delay, requires slower sampling:

MSR_IA32_TEMPERATURE_TARGET: 0x3f64100d (37 C) (100 default - 63 offset)

sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 15
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.26  3700    90167   59      73.97
100.26  3700    90234   58      73.96
100.26  3700    90184   58      74.07

100.26  3700    4073    58      74.09 <<< (close to time offset set to 37)
100.26  3700    90222   59      74.12
100.26  3700    90169   59      74.19
100.26  3700    90294   59      73.03
100.26  3700    90164   59      72.63
100.26  3700    90174   59      72.62
100.26  3700    90163   58      72.60
100.26  3700    90208   59      72.58
100.26  3702    90164   60      72.73 <<< 2 minutes until response.
100.26  3831    90169   63      80.67
100.26  3880    90199   63      84.56
100.26  3889    90187   63      85.34
100.26  3900    90170   63      86.24
100.26  3900    90178   62      86.26

Now, change to 0.1 second sample time and change the offset again,
but this time it is not clamped already first.

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 0.1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.26  3900    609     63      89.02
100.26  3900    609     63      89.10

100.26  3900    131     63      89.47  <<< it takes a finite time between here and 
100.26  3900    615     63      89.31  <<< the actual change of offset to 30
... delete 2.7 seconds...              <<< but nowhere near this long.
100.26  3900    614     63      90.08
100.24  3915    609     64      90.42  <<< O.K. responding.
100.26  4000    611     65      98.06
... delete 1.2 seconds ...
100.26  4000    609     65      98.27
100.24  4091    616     67      106.93
100.26  4100    610     68      106.74 <<< Next step.
100.26  4100    609     68      106.90
... delete 4.4 seconds ...
100.26  4100    609     68      108.02
100.24  4107    615     69      107.42 <<< Next step.
100.26  4200    609     70      115.93
100.26  4200    609     71      115.99
100.26  4200    610     71      117.14
100.26  4200    615     70      116.00
100.26  4200    609     70      116.17
100.26  4200    609     70      116.09
100.26  4200    612     71      117.23
100.26  4200    617     70      115.96
100.26  4200    611     70      116.10
100.26  4200    609     70      116.10
100.26  4200    609     70      117.38
100.26  4200    615     70      116.09
100.26  4200    610     70      116.12
100.26  4200    609     70      116.03
100.24  4117    609     69      109.74 <<< O.K. go down again.
100.26  4100    617     69      106.86


Test 3: intel_pstate,  HWP enabled, performance governor.
Test 3.1: startup delay, requires faster sampling:
MSR_IA32_TEMPERATURE_TARGET: 0x2a64100d (58 C) (100 default - 42 offset)
at 58 degrees it shouldn't clamp.

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 0.1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.06    4169    15      32      2.23
0.04    4598    6       33      2.03    <<< ~275 degrees per second
20.09   4599    268     45      15.37   <<< 12 degrees in ~43.7 mSec
99.10   4282    612     55      121.10  <<< how much PKG Limit #2 and/or TCC loop, I don't know.
98.94   4263    610     59      122.18
...delete 17.8 seconds. 63-66 degrees Example:
100.26  4300    609     66      118.93
...
100.25  4154    610     62      106.06  <<< finally comes down again.
100.26  4100    609     62      101.90
... delete 4.5 seconds
100.26  4100    611     61      102.06
100.26  4100    609     61      102.10
100.25  4038    615     59      98.09   <<< finally gets to temp.
100.26  4000    610     59      93.84   <<< will oscillate here
100.26  4000    609     59      93.88   <<< between pstates 40 and 41
... delete 1.3 seconds ...
100.26  4000    615     58      94.81
100.26  4000    611     59      93.83
100.24  4030    609     60      96.27
100.26  4100    609     60      101.99  
100.26  4100    615     61      102.91
... delete 0.8 seconds ...
100.26  4100    614     61      103.44
100.25  4091    609     61      101.53
100.26  4000    610     59      94.07
...

Test 3.2: clamp and recover delay, requires slower sampling:

MSR_IA32_TEMPERATURE_TARGET: 0x3f64100d (37 C) (100 default - 63 offset)

sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 15
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.26  3700    90181   59      74.72
100.26  3700    90383   59      74.75

100.26  3700    2847    59      74.47 <<< (close to time offset set to 37)
100.26  3700    90240   59      74.83
100.26  3700    90164   59      74.83
100.26  3700    90225   59      74.85
100.26  3700    90219   59      74.90
100.26  3700    90191   59      74.86
100.26  3700    90166   59      74.86
100.26  3700    90164   59      74.80
100.26  3728    90286   60      76.19 <<< 2 minutes, because it was clamped.
100.26  3832    90162   64      82.67
100.26  3870    90177   63      85.94

Now, change to 0.1 second sample time and change the offset again,
but this time it is not clamped already first.

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 0.1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.26  3900    609     63      88.73
100.26  3900    615     63      88.83

100.26  3900    303     63      89.85 <<< it takes a finite time between here and
100.26  3900    615     63      88.91 <<< the actual change of offset to 30
100.26  3900    610     64      89.64
... delete 2 seconds ...
100.26  3900    611     64      88.75
100.25  3911    609     64      89.58  <<< 1st response
100.26  4000    615     65      97.82
... delete 1.2 seconds ...
100.26  4000    615     66      98.77
100.24  4086    609     68      104.94 <<< next step
100.26  4100    609     67      106.35
...

Test 4: intel_pstate,  HWP enabled, performance governor.
Method of creating 100% CPU load changed to use much less
energy per thread.
Test 4.1: startup delay, requires faster sampling:
MSR_IA32_TEMPERATURE_TARGET: 0x2a64100d (45 C) (100 default - 55 offset)

Multiple tests were run with 2 through 6 threads.
It took between 6 and 9 seconds to begin to throttle.

Example, 3 threads:
$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.01    4498    25      32      1.95
0.01    4603    17      32      1.93
0.01    4153    31      31      1.96
36.34   4600    2581    51      35.31 <<< load for last ~ 0.727 seconds.
51.13   4600    3562    52      47.88 
50.77   4600    3620    52      47.98
51.03   4600    3551    52      48.11
51.13   4600    3596    53      48.14
50.87   4600    3627    52      48.20
51.30   4600    3535    52      48.30
51.17   4550    3534    50      46.26 <<< start throttling, ~ 7 seconds
51.27   4452    3567    48      42.05
50.82   4395    3585    47      39.68
51.28   4300    3529    46      36.53 <<< plus another couple to get there.
50.98   4300    3522    47      36.28
51.15   4219    3530    45      34.72
51.08   4200    3678    46      34.33
50.74   4200    3697    46      34.17
51.16   4200    3522    46      34.40
50.99   4126    3534    46      32.44
51.22   4100    3590    44      32.41

... Doug
Doug Smythies Jan. 26, 2021, 7:18 p.m. UTC | #4
Hi, Just a small follow up on this one:

On 2021.01.16 09:08 Doug Smythies wrote:
> On 2021.01.15 Zhang Rui wrote:

...
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt

> 67.93   3773    134577  43      54.78

> 

> For unknown reason the processor seems to now

> think it is not heavily loaded. From my MSR decoder:

> 

> 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL

> 

> From the book:

> 

> > Autonomous Utilization-Based Frequency Control

> > Status (R0)

> > When set, frequency is reduced below the operating

> > system request because the processor has detected

> > that utilization is low.

> 

> Which is not true.

> 

> Anyway,

> 

> Acked-by: Doug Smythies <dsmythies@telus.net>


O.K. there were 2 things wrong above:

1.) I used the wrong Intel SDM table for those bit definitions.
They should have been: RATL and RATLL.

From the proper page of the book:

> Running Average Thermal Limit Status (R0)

> When set, frequency is reduced below the operating

> system request due to Running Average Thermal Limit

> (RATL).


2.) Due to the already discussed turbostat issue, that was not
the actual temperature and so the RATL bit being set was actually
valid at that time.
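To show how the same raw value ends up with two different labels, here is a minimal Python sketch of such a decoder. It assumes the printed value 200020 is hexadecimal, that bit 5 is the status bit in question, and that each log bit sits 16 positions above its status bit (which is consistent with the "L" suffix in the decoder output). The two one-entry tables are hypothetical stand-ins for the two SDM tables, not complete bit maps.

```python
def set_bits(value: int) -> list:
    """Positions of all bits set in value, lowest first."""
    return [b for b in range(value.bit_length()) if (value >> b) & 1]


def decode_limit_reasons(value: int, status_names: dict) -> list:
    """Name each set bit; log bits reuse the status name plus 'L'."""
    names = []
    for bit in set_bits(value):
        if bit < 16:
            names.append(status_names.get(bit, f"bit{bit}"))
        else:
            # Assumed: log bit = status bit + 16.
            names.append(status_names.get(bit - 16, f"bit{bit - 16}") + "L")
    return names


# Only the bit discussed in this thread is listed in each table.
WRONG_TABLE = {5: "AUTO"}   # table used by mistake
RIGHT_TABLE = {5: "RATL"}   # table for this processor, per the correction

if __name__ == "__main__":
    raw = 0x200020
    print(decode_limit_reasons(raw, WRONG_TABLE))
    print(decode_limit_reasons(raw, RIGHT_TABLE))
```

The raw value has only bits 5 and 21 set, so the choice of table is the whole difference between reading it as "AUTO AUTOL" and as "RATL RATLL".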

I have not been able to find the time window knob for this, if there
even is one, similar to the time window knobs for the package power limits.
I wanted to reduce the time constant, just as a test, in an attempt
to reduce the step function load potential temperature overshoot.

One additional informational follow up note:

There always seems to be a significant turn on transient to using the
TCC offset, appearing as temperature undershoot. I am saying that
an offset of 0 seems to also act as some sort of on/off switch to the
running average.

Example 1 - start with offset 0:

$ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.17   4600    3531    71      93.57
51.37   4600    3543    71      93.60
51.37   4600    3590    71      93.63  <<< offset changed from 0 to 24
50.99   3737    3566    52      43.49  <<< trip point = 76 degrees
51.20   3700    3550    51      41.14  <<< TCC offset turn on transient
51.09   3700    3559    51      41.30  <<< There was no need to throttle
51.12   3779    3515    53      43.78
50.95   4064    3553    58      55.57
51.55   4271    3522    63      65.30
51.18   4424    3534    67      76.58
51.27   4500    3532    68      84.12
51.14   4500    3529    68      84.14
51.24   4599    3522    71      93.61
51.14   4600    3523    71      93.71  <<< Eventually it does return to not throttled.

Example 2 - start with offset 1:

Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
51.14   4600    3554    73      94.73
51.37   4600    3544    73      94.85
51.03   4600    3560    74      94.64 <<< offset changed from 1 to 24
51.33   4600    3508    73      94.88 <<< trip point = 76 degrees
51.14   4600    3526    73      94.69 <<< No TCC offset transient
51.22   4600    3614    73      94.85
51.24   4600    3531    73      94.84
51.50   4600    3578    73      94.92
51.15   4600    3571    73      94.77
51.20   4600    3521    73      94.91
51.19   4600    3550    73      94.76
51.27   4600    3522    74      94.81
51.27   4600    3530    74      94.98

... Doug
Zhang, Rui Jan. 28, 2021, 5:29 p.m. UTC | #5
Hi, Doug,

On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
> Hi, Just a small follow up on this one:
>
> On 2021.01.16 09:08 Doug Smythies wrote:
> > On 2021.01.15 Zhang Rui wrote:
>
> ...
> > Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> > 67.93   3773    134577  43      54.78
> >
> > For unknown reason the processor seems to now
> > think it is not heavily loaded. From my MSR decoder:
> >
> > 0x64F: MSR_CORE_PERF_LIMIT_REASONS: 200020 AUTO AUTOL
> >
> > From the book:
> >
> > > Autonomous Utilization-Based Frequency Control
> > > Status (R0)
> > > When set, frequency is reduced below the operating
> > > system request because the processor has detected
> > > that utilization is low.
> >
> > Which is not true.
> >
> > Anyway,
> >
> > Acked-by: Doug Smythies <dsmythies@telus.net>
>
> O.K. there were 2 things wrong above:
>
> 1.) I used the wrong intel SDM table for those bit definitions.
> They should have been: RATL and RATLL.
>
> From the proper page of the book:
>
> > Running Average Thermal Limit Status (R0)
> > When set, frequency is reduced below the operating
> > system request due to Running Average Thermal Limit
> > (RATL).
>
> 2.) Due to the already discussed turbostat issue, that was not
> the actual temperature and so the RATL bit being set was actually
> valid at that time.
>

On my side, I got the "Thermal status bit" set.

> I have not been able to find the time window knob for this, if there
> even is one, similar to the time window knobs for the package power limits.
> I wanted to reduce the time constant, just as a test, in an attempt
> to reduce the step function load potential temperature overshoot.
>
> One additional informational follow up note:
>
> There always seems to be a significant turn on transient to using the
> TCC offset, appearing as temperature undershoot. I am saying that
> an offset of 0 seems to also act as some sort of on/off switch to the
> running average.
>
> Example 1 - start with offset 0:
>
> $ sudo ~/turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> 51.17   4600    3531    71      93.57
> 51.37   4600    3543    71      93.60
> 51.37   4600    3590    71      93.63  <<< offset changed from 0 to 24
> 50.99   3737    3566    52      43.49  <<< trip point = 76 degrees
> 51.20   3700    3550    51      41.14  <<< TCC offset turn on transient
> 51.09   3700    3559    51      41.30  <<< There was no need to throttle
> 51.12   3779    3515    53      43.78
> 50.95   4064    3553    58      55.57
> 51.55   4271    3522    63      65.30
> 51.18   4424    3534    67      76.58
> 51.27   4500    3532    68      84.12
> 51.14   4500    3529    68      84.14
> 51.24   4599    3522    71      93.61
> 51.14   4600    3523    71      93.71  <<< Eventually it does return to not throttled.
>
> Example 2 - start with offset 1:
>
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
> 51.14   4600    3554    73      94.73
> 51.37   4600    3544    73      94.85
> 51.03   4600    3560    74      94.64 <<< offset changed from 1 to 24
> 51.33   4600    3508    73      94.88 <<< trip point = 76 degrees
> 51.14   4600    3526    73      94.69 <<< No TCC offset transient
> 51.22   4600    3614    73      94.85
> 51.24   4600    3531    73      94.84
> 51.50   4600    3578    73      94.92
> 51.15   4600    3571    73      94.77
> 51.20   4600    3521    73      94.91
> 51.19   4600    3550    73      94.76
> 51.27   4600    3522    74      94.81
> 51.27   4600    3530    74      94.98
>

Thanks for your test.
I'd prefer this is platform specific. 
Because it behaves really differently from what I observed.

$ sudo turbostat --Summary --quiet --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
99.45	2216	10656	80	14.81  <<< start with offset=0
99.48	2234	10621	79	15.02
99.47	2233	10436	80	14.96
99.45	2236	10587	79	14.94
99.49	2216	10673	79	15.04
99.46	2226	10685	79	14.87
99.43	2233	10776	79	14.89
99.73	399	9139	66	4.51   <<< offset set to 50
99.76	212	8998	65	3.31
99.77	212	8902	64	3.27
...                                    <<< throttled for 20 seconds
99.76	212	8911	55	2.97
99.77	211	8851	55	2.95
99.76	211	8916	55	2.94
99.77	211	8844	55	3.05
99.77	211	8828	54	3.21
99.77	211	8911	54	3.05
99.74	212	8998	54	3.20
99.77	212	8802	54	2.90
99.77	211	8849	54	2.90
99.76	212	8942	53	2.98
99.76	211	9039	53	3.22
99.74	212	8977	53	2.89
99.77	211	8913	53	2.89
99.76	212	8900	53	2.89
99.77	211	8817	52	2.87
99.77	212	8923	52	2.88
99.77	212	8985	52	2.88
99.73	212	8877	52	2.86
99.58	575	9308	66	5.54    <<< offset set to 32
98.92	2460	13694	66	17.32
98.98	2298	13768	66	15.24
99.03	2244	14652	66	14.48
98.97	2198	14489	66	13.95
99.03	2148	14583	66	13.43
99.02	2107	14093	66	13.45
99.06	2060	13750	66	12.61
99.06	2036	14195	66	12.27
99.07	2007	14240	66	12.07   
99.12	2888	12147	98	28.23   <<< offset cleared
99.03	3413	11503	98	37.21
98.96	3317	11698	98	34.64
99.07	3246	11410	98	32.89
98.95	3210	12107	98	32.13
98.94	3164	11790	98	31.08
99.00	3124	12106	98	30.84
99.00	3086	11876	98	29.60
98.94	3054	12482	98	29.00
98.89	3030	12629	98	28.54
99.39	2377	10764	82	17.62   <<< Didn't do anything, so it is probably thermald or something
99.49	2200	10679	81	14.44
99.52	2211	10267	80	14.66
99.49	2221	10318	80	14.71
99.45	2220	10289	81	14.74
99.43	2222	10326	81	14.76

I tried both tests, and the results are the same, in both cases, it
starts throttling immediately (within a second), and no over-throttling 
observed.

Do you have a script to do this? Say, run turbostat in background and
then change tcc offset at certain timestamp? Maybe we can try exactly
the same test on different machines.

thanks,
rui
Zhang, Rui Jan. 28, 2021, 5:32 p.m. UTC | #6
> > Rather than enter the actual TCC offset, I would rather enter the
> > desired trip point, and have the driver do the math to convert it to
> > the offset.
>
> Hmmm, a writable trip point? I need to think about this.


I think this is a better idea, and I will export this as a writable
trip point of the x86_pkg_temp_thermal driver later, thanks for the
suggestion.
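
As a rough illustration of the math such a driver would do (a sketch only,
not the patch's code): the trip point is TjMax minus the TCC offset, and the
offset has to fit the 6-bit field described by TCC_MASK in the patch. The
helper names below are hypothetical, and TjMax = 100 degrees C is only
inferred from Doug's numbers in this thread (offset 24 gave a 76 degree trip
point); real hardware reports its own value.

```c
#include <assert.h>
#include <stdint.h>

#define TCC_SHIFT      24
#define TCC_MAX_OFFSET 0x3f                              /* 6-bit field, as in the patch */
#define TCC_MASK       ((uint64_t)TCC_MAX_OFFSET << TCC_SHIFT)

/*
 * Hypothetical helper: convert a desired trip point (degrees C) into a
 * TCC offset. Returns -1 when the request does not fit the offset field.
 */
static int trip_to_tcc_offset(int tjmax_c, int trip_c)
{
	int offset = tjmax_c - trip_c;

	if (offset < 0 || offset > TCC_MAX_OFFSET)
		return -1;
	return offset;
}

/* Hypothetical helper: place an offset into a temperature-target value. */
static uint64_t tcc_pack_offset(uint64_t msr_val, int offset)
{
	assert(offset >= 0 && offset <= TCC_MAX_OFFSET);
	msr_val &= ~TCC_MASK;                     /* clear the old offset */
	msr_val |= (uint64_t)offset << TCC_SHIFT; /* widen before shifting */
	return msr_val;
}

/* Hypothetical helper: read the offset back out of the value. */
static int tcc_unpack_offset(uint64_t msr_val)
{
	return (int)((msr_val & TCC_MASK) >> TCC_SHIFT);
}
```

With TjMax assumed to be 100, trip_to_tcc_offset(100, 76) gives 24, while a
request for a 30 degree trip point is rejected because it would need an
offset of 70, which does not fit in 6 bits.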

thanks,
rui
Doug Smythies Jan. 30, 2021, 4:58 p.m. UTC | #7
On Thu, Jan 28, 2021 at 9:30 AM Zhang Rui <rui.zhang@intel.com> wrote:
> On Tue, 2021-01-26 at 11:18 -0800, Doug Smythies wrote:
> > On 2021.01.16 09:08 Doug Smythies wrote:
> > > On 2021.01.15 Zhang Rui wrote:
...
> > They should have been: RATL and RATLL.
> >
> > From the proper page of the book:
> >
> > > Running Average Thermal Limit Status (R0)
> > > When set, frequency is reduced below the operating
> > > system request due to Running Average Thermal Limit
> > > (RATL).
> >
> > 2.) Due to the already discussed turbostat issue, that was not
> > the actual temperature and so the RATL bit being set was actually
> > valid at that time.
> >
> On my side, I got the "Thermal status bit" set.


Yes, and if I understand your comment correctly, you are referring
to IA32_THERM_STATUS (0X19C) and/or
IA32_PACKAGE_THERM_STATUS (0X1B1). I am referring to
MSR_CORE_PERF_LIMIT_REASONS (0X64F).

>
> > I have not been able to find the time window knob for this, if there
> > even is one, similar to the time window knobs for the package power
> > limits.


I just assume there is a time window, similar to the RAPL based
power limits. But I haven't found it.

> > I wanted to reduce the time constant, just as a test, in an attempt
> > to reduce the step function load potential temperature overshoot.

...

> >
> Thanks for your test.
> I'd prefer this is platform specific.
> Because it behaves really differently from what I observed.


O.K. These oddities aside, in the end it does do
the expected job.

> 99.06   2036    14195   66      12.27
> 99.07   2007    14240   66      12.07
> 99.12   2888    12147   98      28.23   <<< offset cleared
> 99.03   3413    11503   98      37.21
> 98.96   3317    11698   98      34.64


very close to critical temp.
I never knowingly allow my processor
to go above 80 degrees.
Although, I admit it hit 90 degrees a couple of
times during this work.

> 99.07   3246    11410   98      32.89
> 98.95   3210    12107   98      32.13
> 98.94   3164    11790   98      31.08
> 99.00   3124    12106   98      30.84
> 99.00   3086    11876   98      29.60
> 98.94   3054    12482   98      29.00
> 98.89   3030    12629   98      28.54
> 99.39   2377    10764   82      17.62   <<< Didn't do anything, so it is probably thermald or something


or critical temp hit.

>
> I tried both tests, and the results are the same, in both cases, it
> starts throttling immediately (within a second), and no over-throttling
> observed.
>
> Do you have a script to do this?


No, all of my tests were done manually, varying:
. placement of high loads on some cores for more heat over smaller surface area.
. balance between 100% CPU load at max heat versus 100% CPU load at less heat.
. balance between this TCC Offset throttling versus package power limits.
. using ambient (coolant temperature) as a heat removal capacity knob.

In summary: I played around until I found something interesting.

> Say, run turbostat in background and
> then change tcc offset at certain timestamp? Maybe we can try exactly
> the same test on different machines.


I had an idea, and wasted way way too much time trying to make it work.
I thought to just get turbostat to also show the offset, so then we know for
certain when it changed. I tried virtually all combinations of:

turbostat --Summary --quiet --add /sys/devices/virtual/thermal/cooling_device11/cur_state,,,,TCC --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1
turbostat --Summary --quiet --add msr0x1a2,u32,package,raw,TCC --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1

and could never get it to work in "Summary" mode. (note: about 95% of
my use of turbostat is in "Summary" mode.)

Anyway, after too long, I did get this to work:

turbostat --quiet --cpu 0 --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^0"

Example 1:

turbostat --quiet --cpu 0 --add /sys/devices/virtual/thermal/cooling_device11/cur_state,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 1 | grep "^0"
CPU     Busy%   Bzy_MHz IRQ            TCC      PkgTmp  PkgWatt
0       100.26  4500    1002    0x00000001      78      99.88 <<< Offset = 1
0       100.26  4501    1002    0x00000001      77      99.90 <<< steady state power limit throttle
0       100.26  4501    1004    0x00000001      77      99.92
0       100.26  4500    1002    0x0000001e      78      99.91   <<< offset changed, trip point 70
0       100.25  4502    1003    0x0000001e      77      100.03
0       100.25  4503    1002    0x0000001e      77      99.85
0       100.25  4502    1002    0x0000001e      78      99.92
0       100.26  4501    1003    0x0000001e      78      99.95
0       100.25  4503    1002    0x0000001e      77      99.88
0       100.25  4502    1002    0x0000001e      78      99.86
0       100.25  4502    1004    0x0000001e      77      99.92
0       100.25  4503    1002    0x0000001e      77      99.98
0       100.25  4502    1002    0x0000001e      77      99.88
0       100.26  4498    1004    0x0000001e      77      100.06
0       100.26  4501    1002    0x0000001e      78      99.77
0       100.26  4500    1002    0x0000001e      78      99.53
0       100.26  4430    1002    0x0000001e      72      91.19  <<< Thermal throttling. 13 Seconds
0       100.26  4400    1002    0x0000001e      72      87.55
0       100.26  4400    1002    0x0000001e      71      87.52
0       100.26  4400    1005    0x0000001e      71      87.56
0       100.26  4400    1002    0x0000001e      72      87.53

Example 2:

0       100.26  4600    1002    0x00000000      83      113.26 <<< Offset = 0
0       100.26  4600    1002    0x00000000      84      113.43
0       100.25  4599    1002    0x00000000      83      113.42 <<< No power limit throttle yet.
0       100.26  4600    1004    0x00000000      83      113.40 <<< Not steady state.
0       100.26  4600    1002    0x00000000      83      113.25
0       100.25  3797    1003    0x00000018      56      54.11  <<< Overshoot is immediate.
0       100.26  3700    1002    0x00000018      56      47.09
0       100.26  3700    1002    0x00000018      55      47.08
0       100.26  3700    1002    0x00000018      54      46.98
0       100.26  3820    1002    0x00000018      58      51.62  <<< starts to recover.
0       100.26  4016    1002    0x00000018      62      61.55
0       100.26  4177    1002    0x00000018      64      69.91
0       100.26  4275    1004    0x00000018      68      75.81
0       100.26  4300    1002    0x00000018      68      77.36
0       100.26  4371    1002    0x00000018      71      84.53
0       100.26  4400    1002    0x00000018      72      87.52
0       100.26  4400    1003    0x00000018      72      87.62

Example 3:
This test is specifically an attempt to test the TCC Offset in the exact
way I intend to use it. trip point = 75 degrees, and never changes.
Power limit 2 is 115 watts, timing window short.
Power limit 1 is 100 watts, timing window 8 seconds.
Note: all previous work was with the timing window at 28 seconds.
Note: typically temperature < 75 at 100 watts.

The load is 4 prime95 maximum heat threads, plus 0 weaker memory
hammering threads.

The coolant had to be preheated for about an hour before this test
started, otherwise the processor would not get hot enough before
package power limit 1 took over the throttling duties.

Now, watching the TCC offset is useless for this test, so let's watch
MSR_CORE_PERF_LIMIT_REASONS instead:

turbostat --add msr0x64f,u32,,raw,TCC --show CPU,Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ,RAMWatt --interval 1 | grep "^0"

(O.K., I should have changed the name of the added column. The grep
filters out the header line anyhow; I added it back manually, edited.)

CPU     Busy%   Bzy_MHz IRQ            TCC      PkgTmp  PkgWatt RAMWatt
0       0.07    1081    5       0x08200000      38      2.31    0.45  <<< Note high idle start temp.
0       0.16    824     11      0x08200000      38      2.12    0.45
0       1.74    3430    44      0x00000000      38      2.65    0.45  <<< clear last times log bits
0       0.16    851     6       0x00000000      37      2.27    0.45
0       4.32    3313    269     0x00000000      75      47.15   0.45  <<< load applied
0       4.24    4585    458     0x08000800      78      97.16   0.45  <<< package power limit 2
0       2.80    4588    482     0x08000000      77      97.49   0.45  <<< temperature just high
0       2.87    4593    463     0x08000000      78      97.95   0.45
0       3.39    4600    465     0x08000000      78      97.68   0.45
0       2.66    4600    462     0x08000000      78      97.55   0.45
0       2.28    4584    490     0x08000000      78      97.97   0.45
0       3.29    4583    478     0x08000000      78      97.72   0.45
0       3.24    4595    465     0x08000000      77      97.52   0.45
0       2.47    4600    465     0x08000000      78      97.50   0.45
0       4.18    4570    464     0x08000000      78      97.72   0.45
0       2.51    4600    470     0x08000000      78      97.40   0.45
0       1.77    4601    482     0x08000000      78      97.33   0.45
0       3.13    4584    462     0x08000000      78      97.57   0.45
0       3.06    4600    466     0x08000000      78      97.77   0.45
0       2.86    4592    461     0x08000000      78      97.56   0.45
0       2.85    4569    486     0x08000000      78      97.99   0.45
0       2.96    4600    465     0x08000000      78      97.91   0.45
0       3.00    4585    451     0x08000000      78      97.68   0.45
0       2.06    4600    475     0x08000000      78      97.50   0.45
0       3.05    4594    462     0x08000000      78      97.78   0.45
0       3.11    4592    461     0x08000000      78      97.68   0.45
0       2.31    4546    463     0x08200020      73      93.00   0.45  <<< RATL
0       2.80    4525    454     0x08200000      78      91.29   0.45  <<< Oscillates within
0       3.32    4538    445     0x08200020      73      91.61   0.45  <<< 1 pstate
0       3.27    4557    434     0x08200000      78      93.12   0.45
0       3.26    4523    470     0x08200020      73      89.85   0.45  <<< rough estimate is
0       2.48    4586    466     0x08200020      74      95.67   0.45  <<< oscillation costs 0.4%
0       1.95    4521    468     0x08200000      76      87.93   0.45  <<< performance loss versus
0       3.28    4569    449     0x08200020      73      94.67   0.45  <<< the power limit 2 servo.
0       0.44    4546    495     0x08200000      78      91.77   0.45  <<< (very crude, hard to defend
0       1.91    4518    487     0x08200020      73      91.24   0.45  <<< data.)
0       3.25    4539    460     0x08200000      78      91.63   0.45
0       2.51    4546    469     0x08200020      74      91.12   0.45
0       3.60    4540    453     0x08200000      77      91.43   0.45
0       3.06    4542    463     0x08200020      73      91.56   0.45
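
For anyone decoding the raw column above by hand, here is a small user-space
sketch. It assumes only the bit positions this thread itself implies for
MSR_CORE_PERF_LIMIT_REASONS on this processor: bit 5 as the RATL status,
bit 11 as the package power limit 2 status, and each log bit sitting 16
positions above its status bit. The helper names are hypothetical; check the
SDM table for your model before relying on any of these positions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PLR_LOG_SHIFT 16 /* assumed: log bit = status bit + 16 */

struct plr_bit {
	unsigned int bit;
	const char *name;
};

/* Only the two reasons discussed in this thread; not a complete table. */
static const struct plr_bit plr_bits[] = {
	{ 5,  "RATL" }, /* Running Average Thermal Limit */
	{ 11, "PL2"  }, /* package power limit 2 */
};

/* Status bit: the limit is active right now. */
static int plr_status(uint32_t val, unsigned int bit)
{
	assert(bit < PLR_LOG_SHIFT);
	return (val >> bit) & 1;
}

/* Log bit: the limit was active at some point since the log was cleared. */
static int plr_log(uint32_t val, unsigned int bit)
{
	assert(bit < PLR_LOG_SHIFT);
	return (val >> (bit + PLR_LOG_SHIFT)) & 1;
}

/* Print the known reasons set in a raw register value. */
static void plr_print(uint32_t val)
{
	size_t i;

	printf("0x%08x:", (unsigned int)val);
	for (i = 0; i < sizeof(plr_bits) / sizeof(plr_bits[0]); i++) {
		if (plr_status(val, plr_bits[i].bit))
			printf(" %s", plr_bits[i].name);
		if (plr_log(val, plr_bits[i].bit))
			printf(" %s(log)", plr_bits[i].name);
	}
	printf("\n");
}
```

Under those assumptions, plr_print(0x08200020) prints
"0x08200020: RATL RATL(log) PL2(log)", which matches the rows marked RATL,
and 0x08000800 decodes to PL2 active with its log bit set.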

... Doug

Patch

diff --git a/drivers/thermal/intel/Kconfig b/drivers/thermal/intel/Kconfig
index 8025b21f43fa..67de49cc9fb4 100644
--- a/drivers/thermal/intel/Kconfig
+++ b/drivers/thermal/intel/Kconfig
@@ -75,3 +75,11 @@  config INTEL_PCH_THERMAL
 	  Enable this to support thermal reporting on certain intel PCHs.
 	  Thermal reporting device will provide temperature reading,
 	  programmable trip points and other information.
+
+config INTEL_TCC_COOLING
+	tristate "Intel TCC offset cooling Driver"
+	depends on X86
+	help
+	  Enable this to support system cooling by adjusting the effective TCC
+	  activation temperature via the TCC Offset register, which is widely
+	  supported on modern Intel platforms.
diff --git a/drivers/thermal/intel/Makefile b/drivers/thermal/intel/Makefile
index 0d9736ced5d4..40e86973f88d 100644
--- a/drivers/thermal/intel/Makefile
+++ b/drivers/thermal/intel/Makefile
@@ -10,3 +10,4 @@  obj-$(CONFIG_INTEL_QUARK_DTS_THERMAL)	+= intel_quark_dts_thermal.o
 obj-$(CONFIG_INT340X_THERMAL)  += int340x_thermal/
 obj-$(CONFIG_INTEL_BXT_PMIC_THERMAL) += intel_bxt_pmic_thermal.o
 obj-$(CONFIG_INTEL_PCH_THERMAL)	+= intel_pch_thermal.o
+obj-$(CONFIG_INTEL_TCC_COOLING)	+= intel_tcc_cooling.o
diff --git a/drivers/thermal/intel/intel_tcc_cooling.c b/drivers/thermal/intel/intel_tcc_cooling.c
new file mode 100644
index 000000000000..aa6bbb9ba898
--- /dev/null
+++ b/drivers/thermal/intel/intel_tcc_cooling.c
@@ -0,0 +1,128 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * cooling device driver that activates the processor throttling by
+ * programming the TCC Offset register.
+ * Copyright (c) 2021, Intel Corporation.
+ */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/module.h>
+#include <linux/thermal.h>
+#include <asm/cpu_device_id.h>
+
+#define TCC_SHIFT 24
+#define TCC_MASK	(0x3fULL<<24)
+#define TCC_PROGRAMMABLE	BIT(30)
+
+static struct thermal_cooling_device *tcc_cdev;
+
+static int tcc_get_max_state(struct thermal_cooling_device *cdev, unsigned long
+			     *state)
+{
+	*state = TCC_MASK >> TCC_SHIFT;
+	return 0;
+}
+
+static int tcc_offset_update(int tcc)
+{
+	u64 val;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, &val);
+	if (err)
+		return err;
+
+	val &= ~TCC_MASK;
+	val |= tcc << TCC_SHIFT;
+
+	err = wrmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, val);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+static int tcc_get_cur_state(struct thermal_cooling_device *cdev, unsigned long
+			     *state)
+{
+	u64 val;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_TEMPERATURE_TARGET, &val);
+	if (err)
+		return err;
+
+	*state = (val & TCC_MASK) >> TCC_SHIFT;
+	return 0;
+}
+
+static int tcc_set_cur_state(struct thermal_cooling_device *cdev, unsigned long
+			     state)
+{
+	return tcc_offset_update(state);
+}
+
+static const struct thermal_cooling_device_ops tcc_cooling_ops = {
+	.get_max_state = tcc_get_max_state,
+	.get_cur_state = tcc_get_cur_state,
+	.set_cur_state = tcc_set_cur_state,
+};
+
+static const struct x86_cpu_id tcc_ids[] __initconst = {
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
+	X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, NULL),
+	{}
+};
+
+MODULE_DEVICE_TABLE(x86cpu, tcc_ids);
+
+static int __init tcc_cooling_init(void)
+{
+	int ret;
+	u64 val;
+	const struct x86_cpu_id *id;
+
+	int err;
+
+	id = x86_match_cpu(tcc_ids);
+	if (!id)
+		return -ENODEV;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &val);
+	if (err)
+		return err;
+
+	if (!(val & TCC_PROGRAMMABLE))
+		return -ENODEV;
+
+	pr_info("Programmable TCC Offset detected\n");
+
+	tcc_cdev =
+	    thermal_cooling_device_register("TCC Offset", NULL,
+					    &tcc_cooling_ops);
+	if (IS_ERR(tcc_cdev)) {
+		ret = PTR_ERR(tcc_cdev);
+		return ret;
+	}
+	return 0;
+}
+
+module_init(tcc_cooling_init)
+
+static void __exit tcc_cooling_exit(void)
+{
+	thermal_cooling_device_unregister(tcc_cdev);
+}
+
+module_exit(tcc_cooling_exit)
+
+MODULE_DESCRIPTION("TCC offset cooling device Driver");
+MODULE_AUTHOR("Zhang Rui <rui.zhang@intel.com>");
+MODULE_LICENSE("GPL v2");