diff mbox series

[v4,3/6] vfio: Increment the runtime PM usage count during IOCTL call

Message ID 20220701110814.7310-4-abhsahu@nvidia.com
State New
Headers show
Series vfio/pci: power management changes | expand

Commit Message

Abhishek Sahu July 1, 2022, 11:08 a.m. UTC
The vfio-pci based driver will have runtime power management
support where the user can put the device into the low power state
and then PCI devices can go into the D3cold state. If the device is
in the low power state and the user issues any IOCTL, then the
device should be moved out of the low power state first. Once
the IOCTL is serviced, then it can go into the low power state again.
The runtime PM framework manages this with help of usage count.

One option was to add the runtime PM related API's inside vfio-pci
driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
different path and more IOCTL can be added in the future. Also, the
runtime PM will be added for vfio-pci based drivers variant currently,
but the other VFIO based drivers can use the same in the
future. So, this patch adds the runtime calls runtime-related API in
the top-level IOCTL function itself.

For the VFIO drivers which do not have runtime power management
support currently, the runtime PM API's won't be invoked. Only for
vfio-pci based drivers currently, the runtime PM API's will be invoked
to increment and decrement the usage count.

Taking this usage count incremented while servicing IOCTL will make
sure that the user won't put the device into low power state when any
other IOCTL is being serviced in parallel. Let's consider the
following scenario:

 1. Some other IOCTL is called.
 2. The user has opened another device instance and called the power
    management IOCTL for the low power entry.
 3. The power management IOCTL moves the device into the low power state.
 4. The other IOCTL finishes.

If we don't keep the usage count incremented then the device
access will happen between step 3 and 4 while the device has already
gone into the low power state.

The runtime PM API's should not be invoked for
VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
the runtime power management entry and exit for the VFIO device.

The pm_runtime_resume_and_get() will be the first call so its error
should not be propagated to user space directly. For example, if
pm_runtime_resume_and_get() can return -EINVAL for the cases where the
user has passed the correct argument. So the
pm_runtime_resume_and_get() errors have been masked behind -EIO.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 8 deletions(-)

Comments

Alex Williamson July 6, 2022, 3:40 p.m. UTC | #1
On Fri, 1 Jul 2022 16:38:11 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio-pci based driver will have runtime power management
> support where the user can put the device into the low power state
> and then PCI devices can go into the D3cold state. If the device is
> in the low power state and the user issues any IOCTL, then the
> device should be moved out of the low power state first. Once
> the IOCTL is serviced, then it can go into the low power state again.
> The runtime PM framework manages this with help of usage count.
> 
> One option was to add the runtime PM related API's inside vfio-pci
> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
> different path and more IOCTL can be added in the future. Also, the
> runtime PM will be added for vfio-pci based drivers variant currently,
> but the other VFIO based drivers can use the same in the
> future. So, this patch adds the runtime calls runtime-related API in
> the top-level IOCTL function itself.
>
> For the VFIO drivers which do not have runtime power management
> support currently, the runtime PM API's won't be invoked. Only for
> vfio-pci based drivers currently, the runtime PM API's will be invoked
> to increment and decrement the usage count.

Variant drivers can easily opt-out of runtime pm support by performing
a gratuitous pm-get in their device-open function.
 
> Taking this usage count incremented while servicing IOCTL will make
> sure that the user won't put the device into low power state when any
> other IOCTL is being serviced in parallel. Let's consider the
> following scenario:
> 
>  1. Some other IOCTL is called.
>  2. The user has opened another device instance and called the power
>     management IOCTL for the low power entry.
>  3. The power management IOCTL moves the device into the low power state.
>  4. The other IOCTL finishes.
> 
> If we don't keep the usage count incremented then the device
> access will happen between step 3 and 4 while the device has already
> gone into the low power state.
> 
> The runtime PM API's should not be invoked for
> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> the runtime power management entry and exit for the VFIO device.

I think the one-shot interface I proposed in the previous patch avoids
the need for special handling for these feature ioctls.  Thanks,

Alex
 
> The pm_runtime_resume_and_get() will be the first call so its error
> should not be propagated to user space directly. For example, if
> pm_runtime_resume_and_get() can return -EINVAL for the cases where the
> user has passed the correct argument. So the
> pm_runtime_resume_and_get() errors have been masked behind -EIO.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 61e71c1154be..61a8d9f7629a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pm_runtime.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>  	.release	= vfio_group_fops_release,
>  };
>  
> +/*
> + * Wrapper around pm_runtime_resume_and_get().
> + * Return error code on failure or 0 on success.
> + */
> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm) {
> +		int ret;
> +
> +		ret = pm_runtime_resume_and_get(dev);
> +		if (ret < 0) {
> +			dev_info_ratelimited(dev,
> +				"vfio: runtime resume failed %d\n", ret);
> +			return -EIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Wrapper around pm_runtime_put().
> + */
> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm)
> +		pm_runtime_put(dev);
> +}
> +
>  /*
>   * VFIO Device fd
>   */
> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  {
>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>  	struct vfio_device_feature feature;
> +	int ret = 0;
> +	u16 feature_cmd;
>  
>  	if (copy_from_user(&feature, arg, minsz))
>  		return -EFAULT;
> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>  		return -EINVAL;
>  
> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
> +
> +	/*
> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
> +	 * power management entry and exit for the VFIO device, so the runtime
> +	 * PM API's should not be called for this feature.
> +	 */
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	switch (feature_cmd) {
>  	case VFIO_DEVICE_FEATURE_MIGRATION:
> -		return vfio_ioctl_device_feature_migration(
> +		ret = vfio_ioctl_device_feature_migration(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
> -		return vfio_ioctl_device_feature_mig_device_state(
> +		ret = vfio_ioctl_device_feature_mig_device_state(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	default:
>  		if (unlikely(!device->ops->device_feature))
> -			return -EINVAL;
> -		return device->ops->device_feature(device, feature.flags,
> -						   arg->data,
> -						   feature.argsz - minsz);
> +			ret = -EINVAL;
> +		else
> +			ret = device->ops->device_feature(
> +				device, feature.flags, arg->data,
> +				feature.argsz - minsz);
> +		break;
>  	}
> +
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
> +		vfio_device_pm_runtime_put(device);
> +
> +	return ret;
>  }
>  
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_device *device = filep->private_data;
> +	int ret;
>  
>  	switch (cmd) {
>  	case VFIO_DEVICE_FEATURE:
> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  	default:
>  		if (unlikely(!device->ops->ioctl))
>  			return -EINVAL;
> -		return device->ops->ioctl(device, cmd, arg);
> +
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +
> +		ret = device->ops->ioctl(device, cmd, arg);
> +		vfio_device_pm_runtime_put(device);
> +		return ret;
>  	}
>  }
>
Abhishek Sahu July 8, 2022, 9:43 a.m. UTC | #2
On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:11 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> The vfio-pci based driver will have runtime power management
>> support where the user can put the device into the low power state
>> and then PCI devices can go into the D3cold state. If the device is
>> in the low power state and the user issues any IOCTL, then the
>> device should be moved out of the low power state first. Once
>> the IOCTL is serviced, then it can go into the low power state again.
>> The runtime PM framework manages this with help of usage count.
>>
>> One option was to add the runtime PM related API's inside vfio-pci
>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>> different path and more IOCTL can be added in the future. Also, the
>> runtime PM will be added for vfio-pci based drivers variant currently,
>> but the other VFIO based drivers can use the same in the
>> future. So, this patch adds the runtime calls runtime-related API in
>> the top-level IOCTL function itself.
>>
>> For the VFIO drivers which do not have runtime power management
>> support currently, the runtime PM API's won't be invoked. Only for
>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>> to increment and decrement the usage count.
> 
> Variant drivers can easily opt-out of runtime pm support by performing
> a gratuitous pm-get in their device-open function.
>  

 Do I need to add this line in the commit message?
 
>> Taking this usage count incremented while servicing IOCTL will make
>> sure that the user won't put the device into low power state when any
>> other IOCTL is being serviced in parallel. Let's consider the
>> following scenario:
>>
>>  1. Some other IOCTL is called.
>>  2. The user has opened another device instance and called the power
>>     management IOCTL for the low power entry.
>>  3. The power management IOCTL moves the device into the low power state.
>>  4. The other IOCTL finishes.
>>
>> If we don't keep the usage count incremented then the device
>> access will happen between step 3 and 4 while the device has already
>> gone into the low power state.
>>
>> The runtime PM API's should not be invoked for
>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>> the runtime power management entry and exit for the VFIO device.
> 
> I think the one-shot interface I proposed in the previous patch avoids
> the need for special handling for these feature ioctls.  Thanks,
> 

 Okay. So, for low power exit case (means feature GET ioctl in the
 updated case) also, we will trigger eventfd. Correct?

 Thanks,
 Abhishek
 
> Alex
>  
>> The pm_runtime_resume_and_get() will be the first call so its error
>> should not be propagated to user space directly. For example, if
>> pm_runtime_resume_and_get() can return -EINVAL for the cases where the
>> user has passed the correct argument. So the
>> pm_runtime_resume_and_get() errors have been masked behind -EIO.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 74 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 61e71c1154be..61a8d9f7629a 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/vfio.h>
>>  #include <linux/wait.h>
>>  #include <linux/sched/signal.h>
>> +#include <linux/pm_runtime.h>
>>  #include "vfio.h"
>>  
>>  #define DRIVER_VERSION	"0.3"
>> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>>  	.release	= vfio_group_fops_release,
>>  };
>>  
>> +/*
>> + * Wrapper around pm_runtime_resume_and_get().
>> + * Return error code on failure or 0 on success.
>> + */
>> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm) {
>> +		int ret;
>> +
>> +		ret = pm_runtime_resume_and_get(dev);
>> +		if (ret < 0) {
>> +			dev_info_ratelimited(dev,
>> +				"vfio: runtime resume failed %d\n", ret);
>> +			return -EIO;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Wrapper around pm_runtime_put().
>> + */
>> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm)
>> +		pm_runtime_put(dev);
>> +}
>> +
>>  /*
>>   * VFIO Device fd
>>   */
>> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  {
>>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>>  	struct vfio_device_feature feature;
>> +	int ret = 0;
>> +	u16 feature_cmd;
>>  
>>  	if (copy_from_user(&feature, arg, minsz))
>>  		return -EFAULT;
>> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>>  		return -EINVAL;
>>  
>> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
>> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
>> +
>> +	/*
>> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
>> +	 * power management entry and exit for the VFIO device, so the runtime
>> +	 * PM API's should not be called for this feature.
>> +	 */
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	switch (feature_cmd) {
>>  	case VFIO_DEVICE_FEATURE_MIGRATION:
>> -		return vfio_ioctl_device_feature_migration(
>> +		ret = vfio_ioctl_device_feature_migration(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
>> -		return vfio_ioctl_device_feature_mig_device_state(
>> +		ret = vfio_ioctl_device_feature_mig_device_state(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	default:
>>  		if (unlikely(!device->ops->device_feature))
>> -			return -EINVAL;
>> -		return device->ops->device_feature(device, feature.flags,
>> -						   arg->data,
>> -						   feature.argsz - minsz);
>> +			ret = -EINVAL;
>> +		else
>> +			ret = device->ops->device_feature(
>> +				device, feature.flags, arg->data,
>> +				feature.argsz - minsz);
>> +		break;
>>  	}
>> +
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
>> +		vfio_device_pm_runtime_put(device);
>> +
>> +	return ret;
>>  }
>>  
>>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  				       unsigned int cmd, unsigned long arg)
>>  {
>>  	struct vfio_device *device = filep->private_data;
>> +	int ret;
>>  
>>  	switch (cmd) {
>>  	case VFIO_DEVICE_FEATURE:
>> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  	default:
>>  		if (unlikely(!device->ops->ioctl))
>>  			return -EINVAL;
>> -		return device->ops->ioctl(device, cmd, arg);
>> +
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = device->ops->ioctl(device, cmd, arg);
>> +		vfio_device_pm_runtime_put(device);
>> +		return ret;
>>  	}
>>  }
>>  
>
Alex Williamson July 8, 2022, 4:49 p.m. UTC | #3
On Fri, 8 Jul 2022 15:13:16 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:10 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:11 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> The vfio-pci based driver will have runtime power management
> >> support where the user can put the device into the low power state
> >> and then PCI devices can go into the D3cold state. If the device is
> >> in the low power state and the user issues any IOCTL, then the
> >> device should be moved out of the low power state first. Once
> >> the IOCTL is serviced, then it can go into the low power state again.
> >> The runtime PM framework manages this with help of usage count.
> >>
> >> One option was to add the runtime PM related API's inside vfio-pci
> >> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
> >> different path and more IOCTL can be added in the future. Also, the
> >> runtime PM will be added for vfio-pci based drivers variant currently,
> >> but the other VFIO based drivers can use the same in the
> >> future. So, this patch adds the runtime calls runtime-related API in
> >> the top-level IOCTL function itself.
> >>
> >> For the VFIO drivers which do not have runtime power management
> >> support currently, the runtime PM API's won't be invoked. Only for
> >> vfio-pci based drivers currently, the runtime PM API's will be invoked
> >> to increment and decrement the usage count.  
> > 
> > Variant drivers can easily opt-out of runtime pm support by performing
> > a gratuitous pm-get in their device-open function.
> >    
> 
>  Do I need to add this line in the commit message?

Maybe I misinterpreted, but my initial read was that there was some
sort of opt-in, which there is by providing pm-runtime support in the
driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
there's also an opt-out, where a vfio-pci variant driver might not want
to support pm-runtime support and could accomplish that by bumping the
pm reference count on device-open such that the user cannot trigger a
pm-suspend.

> >> Taking this usage count incremented while servicing IOCTL will make
> >> sure that the user won't put the device into low power state when any
> >> other IOCTL is being serviced in parallel. Let's consider the
> >> following scenario:
> >>
> >>  1. Some other IOCTL is called.
> >>  2. The user has opened another device instance and called the power
> >>     management IOCTL for the low power entry.
> >>  3. The power management IOCTL moves the device into the low power state.
> >>  4. The other IOCTL finishes.
> >>
> >> If we don't keep the usage count incremented then the device
> >> access will happen between step 3 and 4 while the device has already
> >> gone into the low power state.
> >>
> >> The runtime PM API's should not be invoked for
> >> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> >> the runtime power management entry and exit for the VFIO device.  
> > 
> > I think the one-shot interface I proposed in the previous patch avoids
> > the need for special handling for these feature ioctls.  Thanks,
> >   
> 
>  Okay. So, for low power exit case (means feature GET ioctl in the
>  updated case) also, we will trigger eventfd. Correct?

If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
would wakeup and signal the eventfd via the pm-get.  I'm not sure if
it's worthwhile to try to surprise this eventfd.  Do you foresee any
issues?  Thanks,

Alex
Abhishek Sahu July 11, 2022, 9:50 a.m. UTC | #4
On 7/8/2022 10:19 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:13:16 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:10 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:11 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> The vfio-pci based driver will have runtime power management
>>>> support where the user can put the device into the low power state
>>>> and then PCI devices can go into the D3cold state. If the device is
>>>> in the low power state and the user issues any IOCTL, then the
>>>> device should be moved out of the low power state first. Once
>>>> the IOCTL is serviced, then it can go into the low power state again.
>>>> The runtime PM framework manages this with help of usage count.
>>>>
>>>> One option was to add the runtime PM related API's inside vfio-pci
>>>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>>>> different path and more IOCTL can be added in the future. Also, the
>>>> runtime PM will be added for vfio-pci based drivers variant currently,
>>>> but the other VFIO based drivers can use the same in the
>>>> future. So, this patch adds the runtime calls runtime-related API in
>>>> the top-level IOCTL function itself.
>>>>
>>>> For the VFIO drivers which do not have runtime power management
>>>> support currently, the runtime PM API's won't be invoked. Only for
>>>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>>>> to increment and decrement the usage count.  
>>>
>>> Variant drivers can easily opt-out of runtime pm support by performing
>>> a gratuitous pm-get in their device-open function.
>>>    
>>
>>  Do I need to add this line in the commit message?
> 
> Maybe I misinterpreted, but my initial read was that there was some
> sort of opt-in, which there is by providing pm-runtime support in the
> driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
> there's also an opt-out, where a vfio-pci variant driver might not want
> to support pm-runtime support and could accomplish that by bumping the
> pm reference count on device-open such that the user cannot trigger a
> pm-suspend.
> 
>>>> Taking this usage count incremented while servicing IOCTL will make
>>>> sure that the user won't put the device into low power state when any
>>>> other IOCTL is being serviced in parallel. Let's consider the
>>>> following scenario:
>>>>
>>>>  1. Some other IOCTL is called.
>>>>  2. The user has opened another device instance and called the power
>>>>     management IOCTL for the low power entry.
>>>>  3. The power management IOCTL moves the device into the low power state.
>>>>  4. The other IOCTL finishes.
>>>>
>>>> If we don't keep the usage count incremented then the device
>>>> access will happen between step 3 and 4 while the device has already
>>>> gone into the low power state.
>>>>
>>>> The runtime PM API's should not be invoked for
>>>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>>>> the runtime power management entry and exit for the VFIO device.  
>>>
>>> I think the one-shot interface I proposed in the previous patch avoids
>>> the need for special handling for these feature ioctls.  Thanks,
>>>   
>>
>>  Okay. So, for low power exit case (means feature GET ioctl in the
>>  updated case) also, we will trigger eventfd. Correct?
> 
> If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
> would wakeup and signal the eventfd via the pm-get.  I'm not sure if
> it's worthwhile to try to surprise this eventfd.  Do you foresee any
> issues?  Thanks,
> 
> Alex
> 

 I think it should be fine. It requires some extra handling in the
 hypervisor or QEMU side where it needs to skip the virtual PME or
 similar notification handling for the cases when the guest has
 actually triggered the low power exit but that can be easily done.
 
 Regards,
 Abhishek
diff mbox series

Patch

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 61e71c1154be..61a8d9f7629a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@ 
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pm_runtime.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1333,6 +1334,39 @@  static const struct file_operations vfio_group_fops = {
 	.release	= vfio_group_fops_release,
 };
 
+/*
+ * Wrapper around pm_runtime_resume_and_get().
+ * Return error code on failure or 0 on success.
+ */
+static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm) {
+		int ret;
+
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0) {
+			dev_info_ratelimited(dev,
+				"vfio: runtime resume failed %d\n", ret);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Wrapper around pm_runtime_put().
+ */
+static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm)
+		pm_runtime_put(dev);
+}
+
 /*
  * VFIO Device fd
  */
@@ -1607,6 +1641,8 @@  static int vfio_ioctl_device_feature(struct vfio_device *device,
 {
 	size_t minsz = offsetofend(struct vfio_device_feature, flags);
 	struct vfio_device_feature feature;
+	int ret = 0;
+	u16 feature_cmd;
 
 	if (copy_from_user(&feature, arg, minsz))
 		return -EFAULT;
@@ -1626,28 +1662,51 @@  static int vfio_ioctl_device_feature(struct vfio_device *device,
 	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
 		return -EINVAL;
 
-	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
+
+	/*
+	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
+	 * power management entry and exit for the VFIO device, so the runtime
+	 * PM API's should not be called for this feature.
+	 */
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+	}
+
+	switch (feature_cmd) {
 	case VFIO_DEVICE_FEATURE_MIGRATION:
-		return vfio_ioctl_device_feature_migration(
+		ret = vfio_ioctl_device_feature_migration(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
-		return vfio_ioctl_device_feature_mig_device_state(
+		ret = vfio_ioctl_device_feature_mig_device_state(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	default:
 		if (unlikely(!device->ops->device_feature))
-			return -EINVAL;
-		return device->ops->device_feature(device, feature.flags,
-						   arg->data,
-						   feature.argsz - minsz);
+			ret = -EINVAL;
+		else
+			ret = device->ops->device_feature(
+				device, feature.flags, arg->data,
+				feature.argsz - minsz);
+		break;
 	}
+
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
+		vfio_device_pm_runtime_put(device);
+
+	return ret;
 }
 
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
+	int ret;
 
 	switch (cmd) {
 	case VFIO_DEVICE_FEATURE:
@@ -1655,7 +1714,14 @@  static long vfio_device_fops_unl_ioctl(struct file *filep,
 	default:
 		if (unlikely(!device->ops->ioctl))
 			return -EINVAL;
-		return device->ops->ioctl(device, cmd, arg);
+
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+
+		ret = device->ops->ioctl(device, cmd, arg);
+		vfio_device_pm_runtime_put(device);
+		return ret;
 	}
 }