mbox series

[RFC,0/4] Support for passing runtime state idle time to TF-A

Message ID 1619123448-10138-1-git-send-email-skomatineni@nvidia.com
Headers show
Series Support for passing runtime state idle time to TF-A | expand

Message

Sowjanya Komatineni April 22, 2021, 8:30 p.m. UTC
Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is
in charge of deciding on state transition based on target state, state idle
time, and some other Tegra CPU core cluster states information.

Current PSCI specification don't have function defined for passing runtime
state idle time predicted by governor (based on next events and state target
residency) to ARM trusted firmware.

With the support of adding new PSCI function to allow passing runtime state
idle time from kernel to ARM trusted firmware, Tegra194 platforms can use
generic psci cpuidle driver rather than having Tegra specific cpuidle driver.

During Tegra specific cpuidle driver V1 review, Sudeep Holla from ARM also
suggested to use generic cpuidle driver by generalizing the need of runtime
state idle time.

So had internal discussion between ARM and NVIDIA on adding new PSCI function
to allow passing runtime state idle time from kernel to TF-A through PSCI and
once this implementation is accepted by upstream, ARM will look into further
to update PSCI specification for this new PSCI function.

So sending these patches as RFC as new PSCI function added in this series is
not part of PSCI specification and once this implementation is accepted by ARM
and upstream community, ARM can help to take this forward to add to PSCI
specification.

To keep the backward compatibility we can't update CPU_SUSPEND function to pass
state idle time argument. So added seperate function for passing state idle time
and serializing this with cpu suspend state enter.

Once this approach is agreed, we can either use this way of separate PSCI
function for passing state idle time or with PSCI specification update we can
use same CPU_SUSPEND function with extra argument for state idle time which can
be decided later for final patches based on discussion with ARM.


Sowjanya Komatineni (4):
  firmware/psci: add support for PSCI function SET_STATE_IDLE_TIME
  cpuidle: menu: add idle_time to cpuidle_state
  cpuidle: psci: pass state idle time before state enter callback
  arm64: dts: tegra194: Add CPU idle states

 arch/arm64/boot/dts/nvidia/tegra194.dtsi | 19 +++++++++++++++++++
 drivers/cpuidle/cpuidle-psci.c           | 11 +++++++++++
 drivers/cpuidle/governors/menu.c         |  7 ++++++-
 drivers/firmware/psci/psci.c             |  9 +++++++++
 include/linux/cpuidle.h                  |  1 +
 include/linux/psci.h                     |  1 +
 include/uapi/linux/psci.h                |  2 ++
 7 files changed, 49 insertions(+), 1 deletion(-)

Comments

Sowjanya Komatineni April 23, 2021, 1:03 a.m. UTC | #1
kernel test robot reported below errors for patch-2 of this series. Will 
fix in next version to use div_u64() for nsec to usec conversion along 
with the other feedback I may be receiving.

drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'
ld: drivers/cpuidle/governors/menu.c:398: undefined reference to `__udivdi3'

Thanks

Sowjanya

On 4/22/21 1:30 PM, Sowjanya Komatineni wrote:
> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is

> in charge of deciding on state transition based on target state, state idle

> time, and some other Tegra CPU core cluster states information.

>

> Current PSCI specification don't have function defined for passing runtime

> state idle time predicted by governor (based on next events and state target

> residency) to ARM trusted firmware.

>

> With the support of adding new PSCI function to allow passing runtime state

> idle time from kernel to ARM trusted firmware, Tegra194 platforms can use

> generic psci cpuidle driver rather than having Tegra specific cpuidle driver.

>

> During Tegra specific cpuidle driver V1 review, Sudeep Holla from ARM also

> suggested to use generic cpuidle driver by generalizing the need of runtime

> state idle time.

>

> So had internal discussion between ARM and NVIDIA on adding new PSCI function

> to allow passing runtime state idle time from kernel to TF-A through PSCI and

> once this implementation is accepted by upstream, ARM will look into further

> to update PSCI specification for this new PSCI function.

>

> So sending these patches as RFC as new PSCI function added in this series is

> not part of PSCI specification and once this implementation is accepted by ARM

> and upstream community, ARM can help to take this forward to add to PSCI

> specification.

>

> To keep the backward compatibility we can't update CPU_SUSPEND function to pass

> state idle time argument. So added seperate function for passing state idle time

> and serializing this with cpu suspend state enter.

>

> Once this approach is agreed, we can either use this way of separate PSCI

> function for passing state idle time or with PSCI specification update we can

> use same CPU_SUSPEND function with extra argument for state idle time which can

> be decided later for final patches based on discussion with ARM.

>

>

> Sowjanya Komatineni (4):

>    firmware/psci: add support for PSCI function SET_STATE_IDLE_TIME

>    cpuidle: menu: add idle_time to cpuidle_state

>    cpuidle: psci: pass state idle time before state enter callback

>    arm64: dts: tegra194: Add CPU idle states

>

>   arch/arm64/boot/dts/nvidia/tegra194.dtsi | 19 +++++++++++++++++++

>   drivers/cpuidle/cpuidle-psci.c           | 11 +++++++++++

>   drivers/cpuidle/governors/menu.c         |  7 ++++++-

>   drivers/firmware/psci/psci.c             |  9 +++++++++

>   include/linux/cpuidle.h                  |  1 +

>   include/linux/psci.h                     |  1 +

>   include/uapi/linux/psci.h                |  2 ++

>   7 files changed, 49 insertions(+), 1 deletion(-)

>
Rafael J. Wysocki April 23, 2021, 12:22 p.m. UTC | #2
On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni
<skomatineni@nvidia.com> wrote:
>

> Some platforms use separate CPU firmware running in background to

> handle state transitions which may need runtime idle time of the

> corresponding target state from the kernel.


How exactly does this work?

> This patch adds idle_time to cpuidle state to expose to cpuidle driver the

> idle time that the governor menu predicts based on next events and states

> target residency for selecting proper idle state.

>

> CPU idle driver passes this runtime state idle time to TF-A.

>

> Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>

> ---

>  drivers/cpuidle/governors/menu.c | 7 ++++++-

>  include/linux/cpuidle.h          | 1 +

>  2 files changed, 7 insertions(+), 1 deletion(-)

>

> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c

> index c3aa8d6..0da5bc5 100644

> --- a/drivers/cpuidle/governors/menu.c

> +++ b/drivers/cpuidle/governors/menu.c

> @@ -382,8 +382,10 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

>                          * stuck in the shallow one for too long.

>                          */

>                         if (drv->states[idx].target_residency_ns < TICK_NSEC &&

> -                           s->target_residency_ns <= delta_tick)

> +                           s->target_residency_ns <= delta_tick) {

> +                               drv->states[idx].idle_time = delta_tick / NSEC_PER_USEC;

>                                 idx = i;

> +                       }

>

>                         return idx;

>                 }

> @@ -393,6 +395,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

>                 idx = i;

>         }

>

> +       drv->states[idx].idle_time = predicted_ns / NSEC_PER_USEC;

>         if (idx == -1)

>                 idx = 0; /* No states enabled. Must use 0. */

>

> @@ -419,6 +422,8 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,

>                                 if (drv->states[i].target_residency_ns <= delta_tick)

>                                         break;

>                         }

> +

> +                       drv->states[idx].idle_time = delta_tick / NSEC_PER_USEC;

>                 }

>         }

>

> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h

> index fce4762..12db2e9 100644

> --- a/include/linux/cpuidle.h

> +++ b/include/linux/cpuidle.h

> @@ -55,6 +55,7 @@ struct cpuidle_state {

>         unsigned int    exit_latency; /* in US */

>         int             power_usage; /* in mW */

>         unsigned int    target_residency; /* in US */

> +       unsigned int    idle_time; /* in US */


No way.

This structure holds idle state properties of and not some random data
passed between cpuidle components.  And it is not per-CPU, while the
governors work on the per-CPU basis.

state_usage might be more suitable, but that only if I'm convinced
that this is really necessary.

>

>         int (*enter)    (struct cpuidle_device *dev,

>                         struct cpuidle_driver *drv,

> --
Rafael J. Wysocki April 23, 2021, 12:27 p.m. UTC | #3
On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni
<skomatineni@nvidia.com> wrote:
>

> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is

> in charge of deciding on state transition based on target state, state idle

> time, and some other Tegra CPU core cluster states information.

>

> Current PSCI specification don't have function defined for passing runtime

> state idle time predicted by governor (based on next events and state target

> residency) to ARM trusted firmware.


Presumably that's because this is not a good idea.

A basic design principle of cpuidle is that it should be possible to
use every governor with every driver and the changes in this series
make the platforms in question only work with menu AFAICS.
Sowjanya Komatineni April 23, 2021, 6:32 p.m. UTC | #4
On 4/23/21 5:27 AM, Rafael J. Wysocki wrote:
> On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni

> <skomatineni@nvidia.com> wrote:

>> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is

>> in charge of deciding on state transition based on target state, state idle

>> time, and some other Tegra CPU core cluster states information.

>>

>> Current PSCI specification don't have function defined for passing runtime

>> state idle time predicted by governor (based on next events and state target

>> residency) to ARM trusted firmware.

> Presumably that's because this is not a good idea.

Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs.

MCE firmware handles CPU complex power management states entry/exit 
based on its background tasks and uses expected wake time of the core to 
decide on state transition.

Expected wake time is based on next event and allowed  min residency of 
the state which governor predicts in kernel which MCE is not aware of.

So need a way to pass this info from kernel through PSCI to TF-A and 
TF-A will update this to MCE along with requested state of entry.

For example, When C6 core idle state is requested, MCE notes the time at 
which core is likely to wake up. There could be background tasks running 
on core which kernel is not aware of.

When those tasks are completed it will check the remaining time against 
states crossover thresholds (programmed by ARM trusted FW) to determine 
if its still have enough time to enter into C6 state.

While a core is power gates, it could be woken up for background tasks 
and put back to power gated state again by MCE based on expected wake 
time of the corresponding core.


So, Tegra194/Tegra186 CPU idle support, we need this runtime state 
expected wake time predicted by governor to be passed from kernel to TF-A.

Thanks

Sowjanya

>

> A basic design principle of cpuidle is that it should be possible to

> use every governor with every driver and the changes in this series

> make the platforms in question only work with menu AFAICS.
Sowjanya Komatineni April 23, 2021, 6:33 p.m. UTC | #5
On 4/23/21 5:22 AM, Rafael J. Wysocki wrote:
> On Thu, Apr 22, 2021 at 10:31 PM Sowjanya Komatineni

> <skomatineni@nvidia.com>  wrote:

>> Some platforms use separate CPU firmware running in background to

>> handle state transitions which may need runtime idle time of the

>> corresponding target state from the kernel.

> How exactly does this work?

>

Explained this as part of other feedback in Patch-0

Thanks

Sowjanya
Lukasz Luba April 23, 2021, 8:16 p.m. UTC | #6
Hi Sowjanya,

On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs which is

> in charge of deciding on state transition based on target state, state idle

> time, and some other Tegra CPU core cluster states information.

> 

> Current PSCI specification don't have function defined for passing runtime

> state idle time predicted by governor (based on next events and state target

> residency) to ARM trusted firmware.


Do you have some numbers from experiments showing that these idle
governor prediction values, which are passed from kernel to MCE
firmware, are making a good 'guess'?
How much precision (1us? 1ms?) in the values do you need there?

IIRC (probably Rafael's presentations) predicting in the kernel
something like CPU idle time residency is not a trivial thing.

Another idea (depending on DT structure and PSCI bits):
Could this be solved differently, but just having a knowledge that if
the governor requested some C-state, this means governor 'predicted'
an idle residency to be greater that min_residency attached to this
C-state?
Then, when that request shows up in your FW, you know that it must be at
least min_residency because of this C-state id.
It would depend on number of available states, max_residency, scale
that you would choose while assigning values from [0, max_residency]
to each state.
IIRC there can be many state IDs for idle, so it would depend on
number of bits encoding this state, and your needs. Example of
linear scale:
4-bits encoding idle state and max predicted residency 10msec,
that means 10000us / 16 states = 625us/state.
The max_residency might be split differently, using different than
linear function, to have some rage more precised.

Open question is if these idle states must be all represented
in DT, or there is a way of describing a 'set of idle states'
automatically.

Regards,
Lukasz
Sowjanya Komatineni April 23, 2021, 10:24 p.m. UTC | #7
On 4/23/21 1:16 PM, Lukasz Luba wrote:
> Hi Sowjanya,
>
> On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:
>> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs 
>> which is
>> in charge of deciding on state transition based on target state, 
>> state idle
>> time, and some other Tegra CPU core cluster states information.
>>
>> Current PSCI specification don't have function defined for passing 
>> runtime
>> state idle time predicted by governor (based on next events and state 
>> target
>> residency) to ARM trusted firmware.
>
> Do you have some numbers from experiments showing that these idle
> governor prediction values, which are passed from kernel to MCE
> firmware, are making a good 'guess'?
> How much precision (1us? 1ms?) in the values do you need there?

it could also be in few ms depending on when next cpu event/activity 
might happen which is not transparent to MCE firmware.

>
> IIRC (probably Rafael's presentations) predicting in the kernel
> something like CPU idle time residency is not a trivial thing.
>
> Another idea (depending on DT structure and PSCI bits):
> Could this be solved differently, but just having a knowledge that if
> the governor requested some C-state, this means governor 'predicted'
> an idle residency to be greater that min_residency attached to this
> C-state?
> Then, when that request shows up in your FW, you know that it must be at
> least min_residency because of this C-state id.
C6 is the only deepest state for Tegra194 Carmel CPU that we support in 
addition to C1 (WFI) idle state.

MCE firmware gets state crossover thresholds for C1 to C6 transition 
from TF-A and uses it along with state idle time to decide on C6 state 
entry based on its background work.

Assuming for now if we use min_residency as state idle time which is 
static value from DT, then it enters into deepest state C6 always as we 
use min_residency value we use is always higher than state crossover 
threshold.

But MCE firmware is not aware of when next cpu event can happen to 
predict if next event can take longer than state min_residency time.

Using min residency in such case is very conservative where MCE firmware 
exits C6 state early where we may not have better power saving.

But with MCE firmware being aware of when next event can happen it can 
use that to stay in C6 state without early exit for better power savings.

> It would depend on number of available states, max_residency, scale
> that you would choose while assigning values from [0, max_residency]
> to each state.
> IIRC there can be many state IDs for idle, so it would depend on
> number of bits encoding this state, and your needs. Example of
> linear scale:
> 4-bits encoding idle state and max predicted residency 10msec,
> that means 10000us / 16 states = 625us/state.
> The max_residency might be split differently, using different than
> linear function, to have some rage more precised.
>
> Open question is if these idle states must be all represented
> in DT, or there is a way of describing a 'set of idle states'
> automatically.
We only support C6 state through DT as C6 is the only deepest state for 
Tegra194 carmel CPU. WFI idle state is completely handled by kernel and 
does not require MCE sequences for entry/exit.
>
> Regards,
> Lukasz
Souvik Chakravarty April 26, 2021, 10:10 a.m. UTC | #8
Hi Sowjanya,

> From: Sowjanya Komatineni

> Sent: Friday, April 23, 2021 11:25 PM

> 

> On 4/23/21 1:16 PM, Lukasz Luba wrote:

> > Hi Sowjanya,

> >

> > On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:

> >> Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs

> >> which is in charge of deciding on state transition based on target

> >> state, state idle time, and some other Tegra CPU core cluster states

> >> information.

> >>

> >> Current PSCI specification don't have function defined for passing

> >> runtime state idle time predicted by governor (based on next events

> >> and state target

> >> residency) to ARM trusted firmware.

> >

> > Do you have some numbers from experiments showing that these idle

> > governor prediction values, which are passed from kernel to MCE

> > firmware, are making a good 'guess'?

> > How much precision (1us? 1ms?) in the values do you need there?

> 

> it could also be in few ms depending on when next cpu event/activity might

> happen which is not transparent to MCE firmware.

> 

> >

> > IIRC (probably Rafael's presentations) predicting in the kernel

> > something like CPU idle time residency is not a trivial thing.

> >

> > Another idea (depending on DT structure and PSCI bits):

> > Could this be solved differently, but just having a knowledge that if

> > the governor requested some C-state, this means governor 'predicted'

> > an idle residency to be greater that min_residency attached to this

> > C-state?

> > Then, when that request shows up in your FW, you know that it must be

> > at least min_residency because of this C-state id.

> C6 is the only deepest state for Tegra194 Carmel CPU that we support in

> addition to C1 (WFI) idle state.

> 

> MCE firmware gets state crossover thresholds for C1 to C6 transition from TF-

> A and uses it along with state idle time to decide on C6 state entry based on

> its background work.

> 

> Assuming for now if we use min_residency as state idle time which is static

> value from DT, then it enters into deepest state C6 always as we use

> min_residency value we use is always higher than state crossover threshold.

> 

> But MCE firmware is not aware of when next cpu event can happen to

> predict if next event can take longer than state min_residency time.

> 

> Using min residency in such case is very conservative where MCE firmware

> exits C6 state early where we may not have better power saving.

> 

> But with MCE firmware being aware of when next event can happen it can

> use that to stay in C6 state without early exit for better power savings.


This part confuses me. Are you saying that the firmware will forcefully wake up
the core, even if no wakeups are pending, when min residency for C6 expires?

Regards,
Souvik

> 

> > It would depend on number of available states, max_residency, scale

> > that you would choose while assigning values from [0, max_residency]

> > to each state.

> > IIRC there can be many state IDs for idle, so it would depend on

> > number of bits encoding this state, and your needs. Example of linear

> > scale:

> > 4-bits encoding idle state and max predicted residency 10msec, that

> > means 10000us / 16 states = 625us/state.

> > The max_residency might be split differently, using different than

> > linear function, to have some rage more precised.

> >

> > Open question is if these idle states must be all represented in DT,

> > or there is a way of describing a 'set of idle states'

> > automatically.

> We only support C6 state through DT as C6 is the only deepest state for

> Tegra194 carmel CPU. WFI idle state is completely handled by kernel and

> does not require MCE sequences for entry/exit.

> >

> > Regards,

> > Lukasz
Morten Rasmussen April 26, 2021, 1:11 p.m. UTC | #9
Hi,

On Fri, Apr 23, 2021 at 03:24:51PM -0700, Sowjanya Komatineni wrote:
> On 4/23/21 1:16 PM, Lukasz Luba wrote:

> > Hi Sowjanya,

> > 

> > On 4/22/21 9:30 PM, Sowjanya Komatineni wrote:

> > > Tegra194 and Tegra186 platforms use separate MCE firmware for CPUs

> > > which is

> > > in charge of deciding on state transition based on target state,

> > > state idle

> > > time, and some other Tegra CPU core cluster states information.

> > > 

> > > Current PSCI specification don't have function defined for passing

> > > runtime

> > > state idle time predicted by governor (based on next events and

> > > state target

> > > residency) to ARM trusted firmware.

> > 

> > Do you have some numbers from experiments showing that these idle

> > governor prediction values, which are passed from kernel to MCE

> > firmware, are making a good 'guess'?

> > How much precision (1us? 1ms?) in the values do you need there?

> 

> it could also be in few ms depending on when next cpu event/activity might

> happen which is not transparent to MCE firmware.

> 

> > 

> > IIRC (probably Rafael's presentations) predicting in the kernel

> > something like CPU idle time residency is not a trivial thing.

> > 

> > Another idea (depending on DT structure and PSCI bits):

> > Could this be solved differently, but just having a knowledge that if

> > the governor requested some C-state, this means governor 'predicted'

> > an idle residency to be greater that min_residency attached to this

> > C-state?

> > Then, when that request shows up in your FW, you know that it must be at

> > least min_residency because of this C-state id.

> C6 is the only deepest state for Tegra194 Carmel CPU that we support in

> addition to C1 (WFI) idle state.

> 

> MCE firmware gets state crossover thresholds for C1 to C6 transition from

> TF-A and uses it along with state idle time to decide on C6 state entry

> based on its background work.

> 

> Assuming for now if we use min_residency as state idle time which is static

> value from DT, then it enters into deepest state C6 always as we use

> min_residency value we use is always higher than state crossover threshold.

> 

> But MCE firmware is not aware of when next cpu event can happen to predict

> if next event can take longer than state min_residency time.

> 

> Using min residency in such case is very conservative where MCE firmware

> exits C6 state early where we may not have better power saving.

> 

> But with MCE firmware being aware of when next event can happen it can use

> that to stay in C6 state without early exit for better power savings.

> 

> > It would depend on number of available states, max_residency, scale

> > that you would choose while assigning values from [0, max_residency]

> > to each state.

> > IIRC there can be many state IDs for idle, so it would depend on

> > number of bits encoding this state, and your needs. Example of

> > linear scale:

> > 4-bits encoding idle state and max predicted residency 10msec,

> > that means 10000us / 16 states = 625us/state.

> > The max_residency might be split differently, using different than

> > linear function, to have some rage more precised.

> > 

> > Open question is if these idle states must be all represented

> > in DT, or there is a way of describing a 'set of idle states'

> > automatically.

> We only support C6 state through DT as C6 is the only deepest state for

> Tegra194 carmel CPU. WFI idle state is completely handled by kernel and does

> not require MCE sequences for entry/exit.


I think Lukasz's point is that you can encode the predicted idle time by
having multiple idle_state entries with different min_residency mapping
to the same actual idle-state. So you would several variants of C6 with
different min_residencies and if the OS picks one with longer
min_residency firmware would have a better estimate of the predicted
idle residency.

I'm not convinced it is the right way to work around passing this
information on to firmware. I would rather see an example of how well
this works (best with numbers) and have a proper solution.

Morten