mbox series

[v5,0/4] vfio/pci: power management changes

Message ID 20220518111612.16985-1-abhsahu@nvidia.com
Headers show
Series vfio/pci: power management changes | expand

Message

Abhishek Sahu May 18, 2022, 11:16 a.m. UTC
Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of vfio-pci device,
then it will be moved into D3Hot state. Similarly, if we enable the
runtime power management for vfio-pci device in the guest OS, then the
device is being runtime suspended (for linux guest OS) and the PCI
device will be put into D3hot state (in function
vfio_pm_config_write()). If the D3cold state can be used instead of
D3hot, then it will help in saving maximum power. The D3cold state can't
be possible with native PCI PM. It requires interaction with platform
firmware which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used which
internally interacts with PCI and platform firmware and puts the device
into the lowest possible D-States.

This patch series registers the vfio-pci driver with runtime
PM framework and uses the same for moving the physical PCI
device to go into the low power state for unused idle devices.
There will be separate patch series that will add the support
for using runtime PM framework for used idle devices.

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state") where the following point was
mentioned regarding D3cold state.

 "It's tempting to try to use D3cold, but we have no reason to inhibit
  hotplug of idle devices and we might get into a loop of having the
  device disappear before we have a chance to try to use it."

With the runtime PM, if the user want to prevent going into D3cold then
/sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
devices where the above functionality is required instead of
disallowing the D3cold state for all the cases.

The BAR access needs to be disabled if device is in D3hot state.
Also, there should not be any config access if device is in D3cold
state. For SR-IOV, the PF power state should be higher than VF's power
state.

* Changes in v5

- Rebased over https://github.com/awilliam/linux-vfio/tree/next.
- Renamed vfio_pci_lock_and_set_power_state() to
  vfio_lock_and_set_power_state() and made it static.
- Inside vfio_pci_core_sriov_configure(), protected setting of
  power state and sriov enablement with 'memory_lock'.
- Removed CONFIG_PM macro use since it is not needed with current
  code.

* Changes in v4
  (https://lore.kernel.org/lkml/20220517100219.15146-1-abhsahu@nvidia.com)

- Rebased over https://github.com/awilliam/linux-vfio/tree/next.
- Split the patch series into 2 parts. This part contains the patches
  for using runtime PM for unused idle device.
- Used the 'pdev->current_state' for checking if the device in D3 state.
- Adds the check in __vfio_pci_memory_enabled() function itself instead
  of adding power state check at each caller.
- Make vfio_pci_lock_and_set_power_state() global since it is needed
  in different files.
- Used vfio_pci_lock_and_set_power_state() instead of
  vfio_pci_set_power_state() before pci_enable_sriov().
- Inside vfio_pci_core_sriov_configure(), handled both the cases
  (the device is in low power state with and without user).
- Used list_for_each_entry_continue_reverse() in
  vfio_pci_dev_set_pm_runtime_get().

* Changes in v3
  (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation to keep device in D0 state if there is any
  access from host side. This is specific to NVIDIA use case and
  will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
  IOCTL for power management.
- Removed all custom code related with power management in runtime
  suspend/resume callbacks and IOCTL handling. Now, the callbacks
  contain code related with INTx handling and few other stuffs and
  all the PCI state and platform PM handling will be done by PCI core
  functions itself.
- Add the support of wake-up in main vfio layer itself since now we have
  more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in individual parent
  driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'. 
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before during runtime suspended state.
- Changed the order of patches so that Fix related things are at beginning
  of this patch series.
- Removed storing the power state locally and used one new boolean to
  track the d3 (D3cold and D3hot) power state 
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
  https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/

Abhishek Sahu (4):
  vfio/pci: Invalidate mmaps and block the access in D3hot power state
  vfio/pci: Change the PF power state to D0 before enabling VFs
  vfio/pci: Virtualize PME related registers bits and initialize to zero
  vfio/pci: Move the unused device into low power state with runtime PM

 drivers/vfio/pci/vfio_pci_config.c |  56 ++++++++-
 drivers/vfio/pci/vfio_pci_core.c   | 178 ++++++++++++++++++++---------
 2 files changed, 178 insertions(+), 56 deletions(-)

Comments

Abhishek Sahu May 18, 2022, 11:16 a.m. UTC | #1
According to [PCIe v5 9.6.2] for PF Device Power Management States

 "The PF's power management state (D-state) has global impact on its
  associated VFs. If a VF does not implement the Power Management
  Capability, then it behaves as if it is in an equivalent
  power state of its associated PF.

  If a VF implements the Power Management Capability, the Device behavior
  is undefined if the PF is placed in a lower power state than the VF.
  Software should avoid this situation by placing all VFs in lower power
  state before lowering their associated PF's power state."
Abhishek Sahu May 19, 2022, 4:51 a.m. UTC | #2
On 5/18/2022 11:21 PM, Alex Williamson wrote:
> On Wed, 18 May 2022 16:46:08 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, there is very limited power management support available
>> in the upstream vfio-pci driver. If there is no user of vfio-pci device,
>> then it will be moved into D3Hot state. Similarly, if we enable the
>> runtime power management for vfio-pci device in the guest OS, then the
>> device is being runtime suspended (for linux guest OS) and the PCI
>> device will be put into D3hot state (in function
>> vfio_pm_config_write()). If the D3cold state can be used instead of
>> D3hot, then it will help in saving maximum power. The D3cold state can't
>> be possible with native PCI PM. It requires interaction with platform
>> firmware which is system-specific. To go into low power states
>> (including D3cold), the runtime PM framework can be used which
>> internally interacts with PCI and platform firmware and puts the device
>> into the lowest possible D-States.
>>
>> This patch series registers the vfio-pci driver with runtime
>> PM framework and uses the same for moving the physical PCI
>> device to go into the low power state for unused idle devices.
>> There will be separate patch series that will add the support
>> for using runtime PM framework for used idle devices.
>>
>> The current PM support was added with commit 6eb7018705de ("vfio-pci:
>> Move idle devices to D3hot power state") where the following point was
>> mentioned regarding D3cold state.
>>
>>  "It's tempting to try to use D3cold, but we have no reason to inhibit
>>   hotplug of idle devices and we might get into a loop of having the
>>   device disappear before we have a chance to try to use it."
>>
>> With the runtime PM, if the user want to prevent going into D3cold then
>> /sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
>> devices where the above functionality is required instead of
>> disallowing the D3cold state for all the cases.
>>
>> The BAR access needs to be disabled if device is in D3hot state.
>> Also, there should not be any config access if device is in D3cold
>> state. For SR-IOV, the PF power state should be higher than VF's power
>> state.
>>
>> * Changes in v5
>>
>> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
>> - Renamed vfio_pci_lock_and_set_power_state() to
>>   vfio_lock_and_set_power_state() and made it static.
>> - Inside vfio_pci_core_sriov_configure(), protected setting of
>>   power state and sriov enablement with 'memory_lock'.
>> - Removed CONFIG_PM macro use since it is not needed with current
>>   code.
> 
> Applied to vfio next branch for v5.19.  Thanks!
> 
> Alex
> 

 Thanks Alex for your thorough review and support in getting
 this series merged. I will start exploring for the second part
 and will find out a generic way to support all the use cases.

 Regards,
 Abhishek
  
>> * Changes in v4
>>   (https://lore.kernel.org/lkml/20220517100219.15146-1-abhsahu@nvidia.com)
>>
>> - Rebased over https://github.com/awilliam/linux-vfio/tree/next.
>> - Split the patch series into 2 parts. This part contains the patches
>>   for using runtime PM for unused idle device.
>> - Used the 'pdev->current_state' for checking if the device in D3 state.
>> - Adds the check in __vfio_pci_memory_enabled() function itself instead
>>   of adding power state check at each caller.
>> - Make vfio_pci_lock_and_set_power_state() global since it is needed
>>   in different files.
>> - Used vfio_pci_lock_and_set_power_state() instead of
>>   vfio_pci_set_power_state() before pci_enable_sriov().
>> - Inside vfio_pci_core_sriov_configure(), handled both the cases
>>   (the device is in low power state with and without user).
>> - Used list_for_each_entry_continue_reverse() in
>>   vfio_pci_dev_set_pm_runtime_get().
>>
>> * Changes in v3
>>   (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)
>>
>> - Rebased patches on v5.18-rc3.
>> - Marked this series as PATCH instead of RFC.
>> - Addressed the review comments given in v2.
>> - Removed the limitation to keep device in D0 state if there is any
>>   access from host side. This is specific to NVIDIA use case and
>>   will be handled separately.
>> - Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
>>   IOCTL for power management.
>> - Removed all custom code related with power management in runtime
>>   suspend/resume callbacks and IOCTL handling. Now, the callbacks
>>   contain code related with INTx handling and few other stuffs and
>>   all the PCI state and platform PM handling will be done by PCI core
>>   functions itself.
>> - Add the support of wake-up in main vfio layer itself since now we have
>>   more vfio/pci based drivers.
>> - Instead of assigning the 'struct dev_pm_ops' in individual parent
>>   driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'. 
>> - Added handling of power management around SR-IOV handling.
>> - Moved the setting of drvdata in a separate patch.
>> - Masked INTx before during runtime suspended state.
>> - Changed the order of patches so that Fix related things are at beginning
>>   of this patch series.
>> - Removed storing the power state locally and used one new boolean to
>>   track the d3 (D3cold and D3hot) power state 
>> - Removed check for IO access in D3 power state.
>> - Used another helper function vfio_lock_and_set_power_state() instead
>>   of touching vfio_pci_set_power_state().
>> - Considered the fixes made in
>>   https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
>>   and updated the patches accordingly.
>>
>> * Changes in v2
>>   (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)
>>
>> - Rebased patches on v5.17-rc1.
>> - Included the patch to handle BAR access in D3cold.
>> - Included the patch to fix memory leak.
>> - Made a separate IOCTL that can be used to change the power state from
>>   D3hot to D3cold and D3cold to D0.
>> - Addressed the review comments given in v1.
>>
>> * v1
>>   https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/
>>
>> Abhishek Sahu (4):
>>   vfio/pci: Invalidate mmaps and block the access in D3hot power state
>>   vfio/pci: Change the PF power state to D0 before enabling VFs
>>   vfio/pci: Virtualize PME related registers bits and initialize to zero
>>   vfio/pci: Move the unused device into low power state with runtime PM
>>
>>  drivers/vfio/pci/vfio_pci_config.c |  56 ++++++++-
>>  drivers/vfio/pci/vfio_pci_core.c   | 178 ++++++++++++++++++++---------
>>  2 files changed, 178 insertions(+), 56 deletions(-)
>>
>