Message ID: 20220701110814.7310-1-abhsahu@nvidia.com
Series: vfio/pci: power management changes
On Fri, 1 Jul 2022 16:38:09 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > This patch adds INTx handling during runtime suspend/resume. > All the suspend/resume related code for the user to put the device > into the low power state will be added in subsequent patches. > > The INTx are shared among devices. Whenever any INTx interrupt comes "The INTx lines may be shared..." > for the VFIO devices, then vfio_intx_handler() will be called for each > device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx() "...device sharing the interrupt." > and checks if the interrupt has been generated for the current device. > Now, if the device is already in the D3cold state, then the config space > can not be read. Attempt to read config space in D3cold state can > cause system unresponsiveness in a few systems. To prevent this, mask > INTx in runtime suspend callback and unmask the same in runtime resume > callback. If INTx has been already masked, then no handling is needed > in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and > vfio_pci_intx_mask() has been updated to return true if INTx has been > masked inside this function. > > For the runtime suspend which is triggered for the no user of VFIO > device, the is_intx() will return false and these callbacks won't do > anything. > > The MSI/MSI-X are not shared so similar handling should not be > needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal() > without doing any device-specific config access. When the user performs > any config access or IOCTL after receiving the eventfd notification, > then the device will be moved to the D0 state first before > servicing any request. > > Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 37 +++++++++++++++++++++++++++---- > drivers/vfio/pci/vfio_pci_intrs.c | 6 ++++- > include/linux/vfio_pci_core.h | 3 ++- > 3 files changed, 40 insertions(+), 6 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index a0d69ddaf90d..5948d930449b 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > return ret; > } > > +#ifdef CONFIG_PM > +static int vfio_pci_core_runtime_suspend(struct device *dev) > +{ > + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > + > + /* > + * If INTx is enabled, then mask INTx before going into the runtime > + * suspended state and unmask the same in the runtime resume. > + * If INTx has already been masked by the user, then > + * vfio_pci_intx_mask() will return false and in that case, INTx > + * should not be unmasked in the runtime resume. > + */ > + vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev)); > + > + return 0; > +} > + > +static int vfio_pci_core_runtime_resume(struct device *dev) > +{ > + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > + > + if (vdev->pm_intx_masked) > + vfio_pci_intx_unmask(vdev); > + > + return 0; > +} > +#endif /* CONFIG_PM */ > + > /* > - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, > - * so use structure without any callbacks. > - * > * The pci-driver core runtime PM routines always save the device state > * before going into suspended state. If the device is going into low power > * state with only with runtime PM ops, then no explicit handling is needed > * for the devices which have NoSoftRst-. 
> */ > -static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; > +static const struct dev_pm_ops vfio_pci_core_pm_ops = { > + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, > + vfio_pci_core_runtime_resume, > + NULL) > +}; > > int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > { > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > index 6069a11fb51a..1a37db99df48 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) > eventfd_signal(vdev->ctx[0].trigger, 1); > } > > -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > +/* Returns true if INTx has been masked by this function. */ > +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > { > struct pci_dev *pdev = vdev->pdev; > unsigned long flags; > + bool intx_masked = false; > > spin_lock_irqsave(&vdev->irqlock, flags); > > @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > disable_irq_nosync(pdev->irq); > > vdev->ctx[0].masked = true; > + intx_masked = true; > } > > spin_unlock_irqrestore(&vdev->irqlock, flags); > + return intx_masked; > } There's certainly another path through this function that masks the interrupt, which makes the definition of this return value a bit confusing. Wouldn't it be simpler not to overload the masked flag on the interrupt context like this and instead set a new flag on the vdev under irqlock to indicate the device is unable to generate interrupts. The irq handler would add a test of this flag before any tests that would access the device. Thanks, Alex > /* > diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h > index 23c176d4b073..cdfd328ba6b1 100644 > --- a/include/linux/vfio_pci_core.h > +++ b/include/linux/vfio_pci_core.h > @@ -124,6 +124,7 @@ struct vfio_pci_core_device { > bool needs_reset; > bool nointx; > bool needs_pm_restore; > + bool pm_intx_masked; > struct pci_saved_state *pci_saved_state; > struct pci_saved_state *pm_save; > int ioeventfds_nr; > @@ -147,7 +148,7 @@ struct vfio_pci_core_device { > #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev))) > #define irq_is(vdev, type) (vdev->irq_type == type) > > -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); > +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); > extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev); > > extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
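For reference, the handler path the commit message walks through looks like this (lightly annotated from drivers/vfio/pci/vfio_pci_intrs.c at this series' baseline); the pci_check_and_mask_intx() call is the config space access that is unsafe once the device is in D3cold:

static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
{
	struct vfio_pci_core_device *vdev = dev_id;
	unsigned long flags;
	int ret = IRQ_NONE;

	spin_lock_irqsave(&vdev->irqlock, flags);

	if (!vdev->pci_2_3) {
		/* Exclusive INTx: always ours, mask at the line. */
		disable_irq_nosync(vdev->pdev->irq);
		vdev->ctx[0].masked = true;
		ret = IRQ_HANDLED;
	} else if (!vdev->ctx[0].masked &&  /* may be shared */
		   pci_check_and_mask_intx(vdev->pdev)) {
		/* Config read of PCI_STATUS: can hang some systems in D3cold. */
		vdev->ctx[0].masked = true;
		ret = IRQ_HANDLED;
	}

	spin_unlock_irqrestore(&vdev->irqlock, flags);

	if (ret == IRQ_HANDLED)
		vfio_send_intx_eventfd(vdev, NULL);

	return ret;
}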
On Fri, 1 Jul 2022 16:38:13 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > Some devices (Like NVIDIA VGA or 3D controller) require driver > involvement each time before going into D3cold. In the regular flow, > the guest driver do all the required steps inside the guest OS and > then the IOCTL will be called for D3cold entry by the hypervisor. Now, > if there is any activity on the host side (For example user has run > lspci, dump the config space through sysfs, etc.), then the runtime PM > framework will resume the device first, perform the operation and then > suspend the device again. This second time suspend will be without > guest driver involvement. This patch adds the support to prevent > second-time runtime suspend if there is any wake-up. This prevention > is either based on the predefined vendor/class id list or the user can > specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for > the same. > > 'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed. > It will be set during the entry time. > > 'pm_runtime_resumed' flag tracks if there is any wake-up before the > guest performs the wake-up. If re-entry is not allowed, then during > runtime resume, the runtime PM count will be incremented, and this > flag will be set. This flag will be checked during guest D3cold exit > and then skip the runtime PM-related handling if this flag is set. > > During guest low power exit time, all vdev power-related flags are > accessed under 'memory_lock' and usage count will be incremented. The > resume will be triggered after releasing the lock since the runtime > resume callback again requires the lock. pm_runtime_get_noresume()/ > pm_runtime_resume() have been used instead of > pm_runtime_resume_and_get() to handle the following scenario during > the race condition. > > a. The guest triggered the low power exit. > b. The guest thread got the lock and cleared the vdev related > flags and released the locks. > c. Before pm_runtime_resume_and_get(), the host lspci thread got > scheduled and triggered the runtime resume. > d. Now, all the vdev related flags are cleared so there won't be > any extra handling inside the runtime resume. > e. The runtime PM put the device again into the suspended state. > f. The guest triggered pm_runtime_resume_and_get() got called. > > So, at step (e), the suspend is happening without the guest driver > involvement. Now, by using pm_runtime_get_noresume() before releasing > 'memory_lock', the runtime PM framework can't suspend the device due > to incremented usage count. Nak, this policy should be implemented in userspace. Thanks, Alex > Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > --- > drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++-- > include/linux/vfio_pci_core.h | 2 + > 2 files changed, 84 insertions(+), 5 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 8c17ca41d156..1ddaaa6ccef5 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev) > return false; > } > > +static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev) > +{ > + /* > + * The NVIDIA display class requires driver involvement for every > + * D3cold entry. The audio and other classes can go into D3cold > + * without driver involvement. 
> + */ > + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && > + ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)) > + return false; > + > + return true; > +} > + > static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev) > { > struct pci_dev *pdev = vdev->pdev; > @@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev) > if (vdev->pm_intx_masked) > vfio_pci_intx_unmask(vdev); > > + down_write(&vdev->memory_lock); > + > + /* > + * The runtime resume callback will be called for one of the following > + * two cases: > + * > + * - If the user has called IOCTL explicitly to move the device out of > + * the low power state or closed the device. > + * - If there is device access on the host side. > + * > + * For the second case, check if re-entry to the low power state is > + * allowed. If not, then increment the usage count so that runtime PM > + * framework won't suspend the device and set the 'pm_runtime_resumed' > + * flag. > + */ > + if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) { > + pm_runtime_get_noresume(dev); > + vdev->pm_runtime_resumed = true; > + } > + up_write(&vdev->memory_lock); > + > return 0; > } > #endif /* CONFIG_PM */ > @@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > */ > down_write(&vdev->memory_lock); > if (vdev->pm_runtime_engaged) { > + if (!vdev->pm_runtime_resumed) { > + pm_runtime_get_noresume(&pdev->dev); > + do_resume = true; > + } > + vdev->pm_runtime_resumed = false; > vdev->pm_runtime_engaged = false; > - pm_runtime_get_noresume(&pdev->dev); > - do_resume = true; > } > up_write(&vdev->memory_lock); > > @@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags) > if (!flags) > return -EINVAL; > /* Only valid flags should be set */ > - if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) > + if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT | > + VFIO_PM_LOW_POWER_REENTERY_DISABLE)) > return -EINVAL; > /* Both enter and exit should not be set */ > if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) == > (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) > return -EINVAL; > + /* re-entry disable can only be set with enter */ > + if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) && > + !(flags & VFIO_PM_LOW_POWER_ENTER)) > + return -EINVAL; > > return 0; > } > @@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags, > > if (flags & VFIO_DEVICE_FEATURE_GET) { > down_read(&vdev->memory_lock); > - if (vdev->pm_runtime_engaged) > + if (vdev->pm_runtime_engaged) { > vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER; > - else > + if (!vdev->pm_runtime_reentry_allowed) > + vfio_pm.flags |= > + VFIO_PM_LOW_POWER_REENTERY_DISABLE; > + } else { > vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT; > + if (!vfio_pci_low_power_reentry_allowed(pdev)) > + vfio_pm.flags |= > + VFIO_PM_LOW_POWER_REENTERY_DISABLE; > + } > up_read(&vdev->memory_lock); > > if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm))) > @@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags, > } > > vdev->pm_runtime_engaged = true; > + vdev->pm_runtime_resumed = false; > + > + /* > + * If there is any access when the device is in the runtime > + * suspended state, then the device will be resumed first > + * before access and then the device will be suspended again. > + * Check if this second time suspend is allowed and track the > + * same in 'pm_runtime_reentry_allowed' flag. 
> + */ > + vdev->pm_runtime_reentry_allowed = > + vfio_pci_low_power_reentry_allowed(pdev) && > + !(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE); > + > up_write(&vdev->memory_lock); > pm_runtime_put(&pdev->dev); > } else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) { > @@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags, > } > > vdev->pm_runtime_engaged = false; > + if (vdev->pm_runtime_resumed) { > + vdev->pm_runtime_resumed = false; > + up_write(&vdev->memory_lock); > + return 0; > + } > + > + /* > + * The 'memory_lock' will be acquired again inside the runtime > + * resume callback. So, increment the usage count inside the > + * lock and call pm_runtime_resume() after releasing the lock. > + * If there is any race condition between the wake-up generated > + * at the host and the current path. Then the incremented usage > + * count will prevent the device to go into the suspended state. > + */ > pm_runtime_get_noresume(&pdev->dev); > up_write(&vdev->memory_lock); > ret = pm_runtime_resume(&pdev->dev); > diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h > index bf4823b008f9..18cc83b767b8 100644 > --- a/include/linux/vfio_pci_core.h > +++ b/include/linux/vfio_pci_core.h > @@ -126,6 +126,8 @@ struct vfio_pci_core_device { > bool needs_pm_restore; > bool pm_intx_masked; > bool pm_runtime_engaged; > + bool pm_runtime_resumed; > + bool pm_runtime_reentry_allowed; > struct pci_saved_state *pci_saved_state; > struct pci_saved_state *pm_save; > int ioeventfds_nr;
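Condensing the exit path the commit message reasons about, the ordering looks roughly like the sketch below (the helper name is illustrative; in the patch this logic is inline in vfio_pci_core_feature_pm(), and error handling is trimmed here). Taking the usage count before dropping memory_lock is what closes the window in steps (c)-(e):

static int vfio_pci_low_power_exit(struct vfio_pci_core_device *vdev)
{
	struct pci_dev *pdev = vdev->pdev;

	down_write(&vdev->memory_lock);
	vdev->pm_runtime_engaged = false;
	/*
	 * Hold the usage count before releasing the lock so that a
	 * racing host-side wake-up cannot let runtime PM suspend the
	 * device again without guest driver involvement.
	 */
	pm_runtime_get_noresume(&pdev->dev);
	up_write(&vdev->memory_lock);

	/* Resume outside memory_lock; the resume callback takes it too. */
	return pm_runtime_resume(&pdev->dev);
}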
On Fri, 1 Jul 2022 16:38:12 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > Currently, if the runtime power management is enabled for vfio-pci > based devices in the guest OS, then the guest OS will do the register > write for PCI_PM_CTRL register. This write request will be handled in > vfio_pm_config_write() where it will do the actual register write of > PCI_PM_CTRL register. With this, the maximum D3hot state can be > achieved for low power. If we can use the runtime PM framework, then > we can achieve the D3cold state (on the supported systems) which will > help in saving maximum power. > > 1. D3cold state can't be achieved by writing PCI standard > PM config registers. This patch implements the newly added > 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which > can be used for putting the device into the D3cold state. > > 2. The hypervisors can implement virtual ACPI methods. For example, > in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power > resources with _ON/_OFF method, then guest linux OS invokes > the _OFF method during D3cold transition and then _ON during D0 > transition. The hypervisor can tap these virtual ACPI calls and then > call the 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with respective flags. > > 3. The vfio-pci driver uses runtime PM framework to achieve the > D3cold state. For the D3cold transition, decrement the usage count and > for the D0 transition, increment the usage count. > > 4. If the D3cold state is not supported, then the device will > still be in the D3hot state. But with the runtime PM, the root port > can now also go into the suspended state. > > 5. The 'pm_runtime_engaged' flag tracks the entry and exit to > runtime PM. This flag is protected with 'memory_lock' semaphore. > > 6. During exit time, the flag clearing and usage count increment > are protected with 'memory_lock'. The actual wake-up is happening > outside 'memory_lock' since 'memory_lock' will be needed inside > runtime_resume callback also in subsequent patches. > > 7. In D3cold, all kinds of device-related access (BAR read/write, > config read/write, etc.) need to be disabled. For BAR-related access, > we can use existing D3hot memory disable support. During the low power > entry, invalidate the mmap mappings and add the check for > 'pm_runtime_engaged' flag. Not disabled, just wrapped in pm-get/put. If the device is indefinitely in low-power without a wake-up eventfd, mmap faults are fatal to the user. > 8. For config space, ideally, we need to return an error whenever > there is any config access from the user side once the user moved the > device into low power state. But adding a check for > 'pm_runtime_engaged' flag alone won't be sufficient due to the > following possible scenario from the user side where config space > access happens parallelly with the low power entry IOCTL. > > a. Config space access happens and vfio_pci_config_rw() will be > called. > b. The IOCTL to move into low power state is called. > c. The IOCTL will move the device into d3cold. > d. Exit from vfio_pci_config_rw() happened. > > Now, if we just check 'pm_runtime_engaged', then in the above > sequence the config space access will happen when the device already > is in the low power state. To prevent this situation, we increment the > usage count before any config space access and decrement the same > after completing the access. Also, to prevent any similar cases for > other types of access, the usage count will be incremented for all > kinds of access. 
Unnecessary, just wrap in pm-get/put. Thanks, Alex > Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > --- > drivers/vfio/pci/vfio_pci_config.c | 2 +- > drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++++++++-- > include/linux/vfio_pci_core.h | 1 + > 3 files changed, 164 insertions(+), 8 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c > index 9343f597182d..21a4743d011f 100644 > --- a/drivers/vfio/pci/vfio_pci_config.c > +++ b/drivers/vfio/pci/vfio_pci_config.c > @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev) > * PF SR-IOV capability, there's therefore no need to trigger > * faults based on the virtual value. > */ > - return pdev->current_state < PCI_D3hot && > + return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot && > (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY)); > } > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > index 5948d930449b..8c17ca41d156 100644 > --- a/drivers/vfio/pci/vfio_pci_core.c > +++ b/drivers/vfio/pci/vfio_pci_core.c > @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev) > { > struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > > + down_write(&vdev->memory_lock); > + /* > + * The user can move the device into D3hot state before invoking > + * power management IOCTL. Move the device into D0 state here and then > + * the pci-driver core runtime PM suspend function will move the device > + * into the low power state. Also, for the devices which have > + * NoSoftRst-, it will help in restoring the original state > + * (saved locally in 'vdev->pm_save'). > + */ > + vfio_pci_set_power_state(vdev, PCI_D0); > + up_write(&vdev->memory_lock); > + > /* > * If INTx is enabled, then mask INTx before going into the runtime > * suspended state and unmask the same in the runtime resume. > @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > struct pci_dev *pdev = vdev->pdev; > struct vfio_pci_dummy_resource *dummy_res, *tmp; > struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp; > + bool do_resume = false; > int i, bar; > > /* For needs_reset */ > @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) > > /* > * This function can be invoked while the power state is non-D0. > + * This non-D0 power state can be with or without runtime PM. > + * Increment the usage count corresponding to pm_runtime_put() > + * called during setting of 'pm_runtime_engaged'. The device will > + * wake up if it has already gone into the suspended state. > + * Otherwise, the next vfio_pci_set_power_state() will change the > + * device power state to D0. > + */ > + down_write(&vdev->memory_lock); > + if (vdev->pm_runtime_engaged) { > + vdev->pm_runtime_engaged = false; > + pm_runtime_get_noresume(&pdev->dev); > + do_resume = true; > + } > + up_write(&vdev->memory_lock); > + > + if (do_resume) > + pm_runtime_resume(&pdev->dev); > + > + /* > * This function calls __pci_reset_function_locked() which internally > * can use pci_pm_reset() for the function reset. pci_pm_reset() will > * fail if the power state is non-D0. 
Also, for the devices which > @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, > } > EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); > > +static int vfio_pci_pm_validate_flags(u32 flags) > +{ > + if (!flags) > + return -EINVAL; > + /* Only valid flags should be set */ > + if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) > + return -EINVAL; > + /* Both enter and exit should not be set */ > + if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) == > + (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) > + return -EINVAL; > + > + return 0; > +} > + > +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags, > + void __user *arg, size_t argsz) > +{ > + struct vfio_pci_core_device *vdev = > + container_of(device, struct vfio_pci_core_device, vdev); > + struct pci_dev *pdev = vdev->pdev; > + struct vfio_device_feature_power_management vfio_pm = { 0 }; > + int ret = 0; > + > + ret = vfio_check_feature(flags, argsz, > + VFIO_DEVICE_FEATURE_SET | > + VFIO_DEVICE_FEATURE_GET, > + sizeof(vfio_pm)); > + if (ret != 1) > + return ret; > + > + if (flags & VFIO_DEVICE_FEATURE_GET) { > + down_read(&vdev->memory_lock); > + if (vdev->pm_runtime_engaged) > + vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER; > + else > + vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT; > + up_read(&vdev->memory_lock); > + > + if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm))) > + return -EFAULT; > + > + return 0; > + } > + > + if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm))) > + return -EFAULT; > + > + ret = vfio_pci_pm_validate_flags(vfio_pm.flags); > + if (ret) > + return ret; > + > + /* > + * The vdev power related flags are protected with 'memory_lock' > + * semaphore. > + */ > + if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) { > + vfio_pci_zap_and_down_write_memory_lock(vdev); > + if (vdev->pm_runtime_engaged) { > + up_write(&vdev->memory_lock); > + return -EINVAL; > + } > + > + vdev->pm_runtime_engaged = true; > + up_write(&vdev->memory_lock); > + pm_runtime_put(&pdev->dev); > + } else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) { > + down_write(&vdev->memory_lock); > + if (!vdev->pm_runtime_engaged) { > + up_write(&vdev->memory_lock); > + return -EINVAL; > + } > + > + vdev->pm_runtime_engaged = false; > + pm_runtime_get_noresume(&pdev->dev); > + up_write(&vdev->memory_lock); > + ret = pm_runtime_resume(&pdev->dev); > + if (ret < 0) { > + down_write(&vdev->memory_lock); > + if (!vdev->pm_runtime_engaged) { > + vdev->pm_runtime_engaged = true; > + pm_runtime_put_noidle(&pdev->dev); > + } > + up_write(&vdev->memory_lock); > + return ret; > + } > + } else { > + return -EINVAL; > + } > + > + return 0; > +} > + > static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, > void __user *arg, size_t argsz) > { > @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, > switch (flags & VFIO_DEVICE_FEATURE_MASK) { > case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: > return vfio_pci_core_feature_token(device, flags, arg, argsz); > + case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT: > + return vfio_pci_core_feature_pm(device, flags, arg, argsz); > default: > return -ENOTTY; > } > @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf, > size_t count, loff_t *ppos, bool iswrite) > { > unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); > + int ret; > > if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions) > return -EINVAL; > > + ret = 
pm_runtime_resume_and_get(&vdev->pdev->dev); > + if (ret < 0) { > + pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n", > + ret); > + return -EIO; > + } > + > switch (index) { > case VFIO_PCI_CONFIG_REGION_INDEX: > - return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite); > + ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite); > + break; > > case VFIO_PCI_ROM_REGION_INDEX: > if (iswrite) > - return -EINVAL; > - return vfio_pci_bar_rw(vdev, buf, count, ppos, false); > + ret = -EINVAL; > + else > + ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false); > + break; > > case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX: > - return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite); > + ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite); > + break; > > case VFIO_PCI_VGA_REGION_INDEX: > - return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite); > + ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite); > + break; > + > default: > index -= VFIO_PCI_NUM_REGIONS; > - return vdev->region[index].ops->rw(vdev, buf, > + ret = vdev->region[index].ops->rw(vdev, buf, > count, ppos, iswrite); > + break; > } > > - return -EINVAL; > + pm_runtime_put(&vdev->pdev->dev); > + return ret; > } > > ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf, > @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > goto err_unlock; > } > > + /* > + * Some of the devices in the dev_set can be in the runtime suspended > + * state. Increment the usage count for all the devices in the dev_set > + * before reset and decrement the same after reset. > + */ > + ret = vfio_pci_dev_set_pm_runtime_get(dev_set); > + if (ret) > + goto err_unlock; > + > list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) { > /* > * Test whether all the affected devices are contained by the > @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, > else > mutex_unlock(&cur->vma_lock); > } > + > + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) > + pm_runtime_put(&cur->pdev->dev); > err_unlock: > mutex_unlock(&dev_set->lock); > return ret; > diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h > index cdfd328ba6b1..bf4823b008f9 100644 > --- a/include/linux/vfio_pci_core.h > +++ b/include/linux/vfio_pci_core.h > @@ -125,6 +125,7 @@ struct vfio_pci_core_device { > bool nointx; > bool needs_pm_restore; > bool pm_intx_masked; > + bool pm_runtime_engaged; > struct pci_saved_state *pci_saved_state; > struct pci_saved_state *pm_save; > int ioeventfds_nr;
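From the user side, the proposed feature would be driven through the existing VFIO_DEVICE_FEATURE ioctl. A hypothetical userspace sketch follows; the feature ID, flag value, and payload layout below are illustrative stand-ins, since the series' uapi header patch is not quoted in this thread (and the feature did not land upstream in this form):

#include <sys/ioctl.h>
#include <linux/types.h>
#include <linux/vfio.h>

/* Illustrative stand-ins for the series' proposed uapi additions. */
#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)

struct vfio_device_feature_power_management {
	__u32 flags;
	__u32 reserved;
};

static int vfio_pm_low_power_enter(int device_fd)
{
	__u8 buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_device_feature_power_management)]
		__attribute__((aligned(8))) = {};
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_power_management *pm =
		(void *)feature->data;

	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_SET |
			 VFIO_DEVICE_FEATURE_POWER_MANAGEMENT;
	pm->flags = VFIO_PM_LOW_POWER_ENTER;

	/* On success the kernel drops its PM usage count (pm_runtime_put). */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
}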
On 7/6/2022 9:09 PM, Alex Williamson wrote: > On Fri, 1 Jul 2022 16:38:09 +0530 > Abhishek Sahu <abhsahu@nvidia.com> wrote: > >> This patch adds INTx handling during runtime suspend/resume. >> All the suspend/resume related code for the user to put the device >> into the low power state will be added in subsequent patches. >> >> The INTx are shared among devices. Whenever any INTx interrupt comes > > "The INTx lines may be shared..." > >> for the VFIO devices, then vfio_intx_handler() will be called for each >> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx() > > "...device sharing the interrupt." > >> and checks if the interrupt has been generated for the current device. >> Now, if the device is already in the D3cold state, then the config space >> can not be read. Attempt to read config space in D3cold state can >> cause system unresponsiveness in a few systems. To prevent this, mask >> INTx in runtime suspend callback and unmask the same in runtime resume >> callback. If INTx has been already masked, then no handling is needed >> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and >> vfio_pci_intx_mask() has been updated to return true if INTx has been >> masked inside this function. >> >> For the runtime suspend which is triggered for the no user of VFIO >> device, the is_intx() will return false and these callbacks won't do >> anything. >> >> The MSI/MSI-X are not shared so similar handling should not be >> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal() >> without doing any device-specific config access. When the user performs >> any config access or IOCTL after receiving the eventfd notification, >> then the device will be moved to the D0 state first before >> servicing any request. >> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> >> --- >> drivers/vfio/pci/vfio_pci_core.c | 37 +++++++++++++++++++++++++++---- >> drivers/vfio/pci/vfio_pci_intrs.c | 6 ++++- >> include/linux/vfio_pci_core.h | 3 ++- >> 3 files changed, 40 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >> index a0d69ddaf90d..5948d930449b 100644 >> --- a/drivers/vfio/pci/vfio_pci_core.c >> +++ b/drivers/vfio/pci/vfio_pci_core.c >> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >> return ret; >> } >> >> +#ifdef CONFIG_PM >> +static int vfio_pci_core_runtime_suspend(struct device *dev) >> +{ >> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); >> + >> + /* >> + * If INTx is enabled, then mask INTx before going into the runtime >> + * suspended state and unmask the same in the runtime resume. >> + * If INTx has already been masked by the user, then >> + * vfio_pci_intx_mask() will return false and in that case, INTx >> + * should not be unmasked in the runtime resume. >> + */ >> + vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev)); >> + >> + return 0; >> +} >> + >> +static int vfio_pci_core_runtime_resume(struct device *dev) >> +{ >> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); >> + >> + if (vdev->pm_intx_masked) >> + vfio_pci_intx_unmask(vdev); >> + >> + return 0; >> +} >> +#endif /* CONFIG_PM */ >> + >> /* >> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, >> - * so use structure without any callbacks. >> - * >> * The pci-driver core runtime PM routines always save the device state >> * before going into suspended state. 
If the device is going into low power >> * state with only with runtime PM ops, then no explicit handling is needed >> * for the devices which have NoSoftRst-. >> */ >> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; >> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { >> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, >> + vfio_pci_core_runtime_resume, >> + NULL) >> +}; >> >> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >> { >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c >> index 6069a11fb51a..1a37db99df48 100644 >> --- a/drivers/vfio/pci/vfio_pci_intrs.c >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c >> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) >> eventfd_signal(vdev->ctx[0].trigger, 1); >> } >> >> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >> +/* Returns true if INTx has been masked by this function. */ >> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >> { >> struct pci_dev *pdev = vdev->pdev; >> unsigned long flags; >> + bool intx_masked = false; >> >> spin_lock_irqsave(&vdev->irqlock, flags); >> >> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >> disable_irq_nosync(pdev->irq); >> >> vdev->ctx[0].masked = true; >> + intx_masked = true; >> } >> >> spin_unlock_irqrestore(&vdev->irqlock, flags); >> + return intx_masked; >> } > > > There's certainly another path through this function that masks the > interrupt, which makes the definition of this return value a bit > confusing. For our case we should not hit that path. But we can return the intx_masked true from that path as well to align return value. > Wouldn't it be simpler not to overload the masked flag on > the interrupt context like this and instead set a new flag on the vdev > under irqlock to indicate the device is unable to generate interrupts. > The irq handler would add a test of this flag before any tests that > would access the device. Thanks, > > Alex > We will set this flag inside runtime_suspend callback but the device can be in non-D3cold state (For example, if user has disabled d3cold explicitly by sysfs, the D3cold is not supported in the platform, etc.). Also, in D3cold supported case, the device will be in D0 till the PCI core moves the device into D3cold. In this case, there is possibility that the device can generate an interrupt. If we add check in the IRQ handler, then we won’t check and clear the IRQ status, but the interrupt line will still be asserted which can cause interrupt flooding. This was the reason for disabling interrupt itself instead of checking flag in the IRQ handler. 
Thanks, Abhishek >> /* >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h >> index 23c176d4b073..cdfd328ba6b1 100644 >> --- a/include/linux/vfio_pci_core.h >> +++ b/include/linux/vfio_pci_core.h >> @@ -124,6 +124,7 @@ struct vfio_pci_core_device { >> bool needs_reset; >> bool nointx; >> bool needs_pm_restore; >> + bool pm_intx_masked; >> struct pci_saved_state *pci_saved_state; >> struct pci_saved_state *pm_save; >> int ioeventfds_nr; >> @@ -147,7 +148,7 @@ struct vfio_pci_core_device { >> #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev))) >> #define irq_is(vdev, type) (vdev->irq_type == type) >> >> -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); >> +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); >> extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev); >> >> extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, >
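To make the trade-off concrete, the flag-in-the-handler alternative under discussion would look something like the sketch below ('pm_intx_blocked' is a hypothetical field name, not from the series). As Abhishek notes above, bailing out while the level-triggered line is still asserted returns IRQ_NONE with the interrupt still pending, which is the flooding concern:

static irqreturn_t vfio_intx_handler(int irq, void *dev_id)
{
	struct vfio_pci_core_device *vdev = dev_id;
	unsigned long flags;
	int ret = IRQ_NONE;

	spin_lock_irqsave(&vdev->irqlock, flags);
	if (vdev->pm_intx_blocked) {
		/*
		 * No device access while (possibly) in D3cold -- but the
		 * line may still be asserted, so returning IRQ_NONE here
		 * can flood until the device is resumed.
		 */
		spin_unlock_irqrestore(&vdev->irqlock, flags);
		return IRQ_NONE;
	}
	/* ... existing masking logic unchanged ... */
	spin_unlock_irqrestore(&vdev->irqlock, flags);
	return ret;
}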
On Fri, 8 Jul 2022 14:51:30 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > On 7/6/2022 9:09 PM, Alex Williamson wrote: > > On Fri, 1 Jul 2022 16:38:09 +0530 > > Abhishek Sahu <abhsahu@nvidia.com> wrote: > > > >> This patch adds INTx handling during runtime suspend/resume. > >> All the suspend/resume related code for the user to put the device > >> into the low power state will be added in subsequent patches. > >> > >> The INTx are shared among devices. Whenever any INTx interrupt comes > > > > "The INTx lines may be shared..." > > > >> for the VFIO devices, then vfio_intx_handler() will be called for each > >> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx() > > > > "...device sharing the interrupt." > > > >> and checks if the interrupt has been generated for the current device. > >> Now, if the device is already in the D3cold state, then the config space > >> can not be read. Attempt to read config space in D3cold state can > >> cause system unresponsiveness in a few systems. To prevent this, mask > >> INTx in runtime suspend callback and unmask the same in runtime resume > >> callback. If INTx has been already masked, then no handling is needed > >> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and > >> vfio_pci_intx_mask() has been updated to return true if INTx has been > >> masked inside this function. > >> > >> For the runtime suspend which is triggered for the no user of VFIO > >> device, the is_intx() will return false and these callbacks won't do > >> anything. > >> > >> The MSI/MSI-X are not shared so similar handling should not be > >> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal() > >> without doing any device-specific config access. When the user performs > >> any config access or IOCTL after receiving the eventfd notification, > >> then the device will be moved to the D0 state first before > >> servicing any request. > >> > >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > >> --- > >> drivers/vfio/pci/vfio_pci_core.c | 37 +++++++++++++++++++++++++++---- > >> drivers/vfio/pci/vfio_pci_intrs.c | 6 ++++- > >> include/linux/vfio_pci_core.h | 3 ++- > >> 3 files changed, 40 insertions(+), 6 deletions(-) > >> > >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > >> index a0d69ddaf90d..5948d930449b 100644 > >> --- a/drivers/vfio/pci/vfio_pci_core.c > >> +++ b/drivers/vfio/pci/vfio_pci_core.c > >> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > >> return ret; > >> } > >> > >> +#ifdef CONFIG_PM > >> +static int vfio_pci_core_runtime_suspend(struct device *dev) > >> +{ > >> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > >> + > >> + /* > >> + * If INTx is enabled, then mask INTx before going into the runtime > >> + * suspended state and unmask the same in the runtime resume. > >> + * If INTx has already been masked by the user, then > >> + * vfio_pci_intx_mask() will return false and in that case, INTx > >> + * should not be unmasked in the runtime resume. 
> >> + */ > >> + vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev)); > >> + > >> + return 0; > >> +} > >> + > >> +static int vfio_pci_core_runtime_resume(struct device *dev) > >> +{ > >> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > >> + > >> + if (vdev->pm_intx_masked) > >> + vfio_pci_intx_unmask(vdev); > >> + > >> + return 0; > >> +} > >> +#endif /* CONFIG_PM */ > >> + > >> /* > >> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, > >> - * so use structure without any callbacks. > >> - * > >> * The pci-driver core runtime PM routines always save the device state > >> * before going into suspended state. If the device is going into low power > >> * state with only with runtime PM ops, then no explicit handling is needed > >> * for the devices which have NoSoftRst-. > >> */ > >> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; > >> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { > >> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, > >> + vfio_pci_core_runtime_resume, > >> + NULL) > >> +}; > >> > >> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > >> { > >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > >> index 6069a11fb51a..1a37db99df48 100644 > >> --- a/drivers/vfio/pci/vfio_pci_intrs.c > >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c > >> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) > >> eventfd_signal(vdev->ctx[0].trigger, 1); > >> } > >> > >> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >> +/* Returns true if INTx has been masked by this function. */ > >> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >> { > >> struct pci_dev *pdev = vdev->pdev; > >> unsigned long flags; > >> + bool intx_masked = false; > >> > >> spin_lock_irqsave(&vdev->irqlock, flags); > >> > >> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >> disable_irq_nosync(pdev->irq); > >> > >> vdev->ctx[0].masked = true; > >> + intx_masked = true; > >> } > >> > >> spin_unlock_irqrestore(&vdev->irqlock, flags); > >> + return intx_masked; > >> } > > > > > > There's certainly another path through this function that masks the > > interrupt, which makes the definition of this return value a bit > > confusing. > > For our case we should not hit that path. But we can return the > intx_masked true from that path as well to align return value. > > > Wouldn't it be simpler not to overload the masked flag on > > the interrupt context like this and instead set a new flag on the vdev > > under irqlock to indicate the device is unable to generate interrupts. > > The irq handler would add a test of this flag before any tests that > > would access the device. Thanks, > > > > Alex > > > > We will set this flag inside runtime_suspend callback but the > device can be in non-D3cold state (For example, if user has > disabled d3cold explicitly by sysfs, the D3cold is not supported in > the platform, etc.). Also, in D3cold supported case, the device will > be in D0 till the PCI core moves the device into D3cold. In this case, > there is possibility that the device can generate an interrupt. > If we add check in the IRQ handler, then we won’t check and clear > the IRQ status, but the interrupt line will still be asserted > which can cause interrupt flooding. > > This was the reason for disabling interrupt itself instead of > checking flag in the IRQ handler. 
Ok, maybe this is largely a clarification of the return value of vfio_pci_intx_mask(). I think what you're looking for is whether the context mask was changed, rather than whether the interrupt is masked, which I think avoids the confusion regarding whether the first branch should return true or false. So the variable should be something like "masked_changed" and the comment changed to "Returns true if the INTx vfio_pci_irq_ctx.masked value is changed". Testing is_intx() outside of the irqlock is potentially racy, so do we need to add the pm-get/put wrappers on ioctls first to avoid the possibility that pm-suspend can race a SET_IRQS ioctl? Thanks, Alex
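Applied to the function quoted earlier, the suggested rename reads roughly as follows (a sketch, not the eventual respin). Only the INTx branch changes vfio_pci_irq_ctx.masked, so only it sets the return value; the !is_intx() branch still masks the physical line without affecting it:

/* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
{
	struct pci_dev *pdev = vdev->pdev;
	unsigned long flags;
	bool masked_changed = false;

	spin_lock_irqsave(&vdev->irqlock, flags);

	/*
	 * This path masks the physical line too, but leaves
	 * ctx[0].masked alone, so it does not change the return value.
	 */
	if (unlikely(!is_intx(vdev))) {
		if (vdev->pci_2_3)
			pci_intx(pdev, 0);
	} else if (!vdev->ctx[0].masked) {
		if (vdev->pci_2_3)
			pci_intx(pdev, 0);
		else
			disable_irq_nosync(pdev->irq);

		vdev->ctx[0].masked = true;
		masked_changed = true;
	}

	spin_unlock_irqrestore(&vdev->irqlock, flags);
	return masked_changed;
}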
On 7/8/2022 9:15 PM, Alex Williamson wrote: > On Fri, 8 Jul 2022 14:51:30 +0530 > Abhishek Sahu <abhsahu@nvidia.com> wrote: > >> On 7/6/2022 9:09 PM, Alex Williamson wrote: >>> On Fri, 1 Jul 2022 16:38:09 +0530 >>> Abhishek Sahu <abhsahu@nvidia.com> wrote: >>> >>>> This patch adds INTx handling during runtime suspend/resume. >>>> All the suspend/resume related code for the user to put the device >>>> into the low power state will be added in subsequent patches. >>>> >>>> The INTx are shared among devices. Whenever any INTx interrupt comes >>> >>> "The INTx lines may be shared..." >>> >>>> for the VFIO devices, then vfio_intx_handler() will be called for each >>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx() >>> >>> "...device sharing the interrupt." >>> >>>> and checks if the interrupt has been generated for the current device. >>>> Now, if the device is already in the D3cold state, then the config space >>>> can not be read. Attempt to read config space in D3cold state can >>>> cause system unresponsiveness in a few systems. To prevent this, mask >>>> INTx in runtime suspend callback and unmask the same in runtime resume >>>> callback. If INTx has been already masked, then no handling is needed >>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and >>>> vfio_pci_intx_mask() has been updated to return true if INTx has been >>>> masked inside this function. >>>> >>>> For the runtime suspend which is triggered for the no user of VFIO >>>> device, the is_intx() will return false and these callbacks won't do >>>> anything. >>>> >>>> The MSI/MSI-X are not shared so similar handling should not be >>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal() >>>> without doing any device-specific config access. When the user performs >>>> any config access or IOCTL after receiving the eventfd notification, >>>> then the device will be moved to the D0 state first before >>>> servicing any request. >>>> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> >>>> --- >>>> drivers/vfio/pci/vfio_pci_core.c | 37 +++++++++++++++++++++++++++---- >>>> drivers/vfio/pci/vfio_pci_intrs.c | 6 ++++- >>>> include/linux/vfio_pci_core.h | 3 ++- >>>> 3 files changed, 40 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c >>>> index a0d69ddaf90d..5948d930449b 100644 >>>> --- a/drivers/vfio/pci/vfio_pci_core.c >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c >>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat >>>> return ret; >>>> } >>>> >>>> +#ifdef CONFIG_PM >>>> +static int vfio_pci_core_runtime_suspend(struct device *dev) >>>> +{ >>>> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); >>>> + >>>> + /* >>>> + * If INTx is enabled, then mask INTx before going into the runtime >>>> + * suspended state and unmask the same in the runtime resume. >>>> + * If INTx has already been masked by the user, then >>>> + * vfio_pci_intx_mask() will return false and in that case, INTx >>>> + * should not be unmasked in the runtime resume. 
>>>> + */ >>>> + vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev)); >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +static int vfio_pci_core_runtime_resume(struct device *dev) >>>> +{ >>>> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); >>>> + >>>> + if (vdev->pm_intx_masked) >>>> + vfio_pci_intx_unmask(vdev); >>>> + >>>> + return 0; >>>> +} >>>> +#endif /* CONFIG_PM */ >>>> + >>>> /* >>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, >>>> - * so use structure without any callbacks. >>>> - * >>>> * The pci-driver core runtime PM routines always save the device state >>>> * before going into suspended state. If the device is going into low power >>>> * state with only with runtime PM ops, then no explicit handling is needed >>>> * for the devices which have NoSoftRst-. >>>> */ >>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; >>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { >>>> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, >>>> + vfio_pci_core_runtime_resume, >>>> + NULL) >>>> +}; >>>> >>>> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) >>>> { >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c >>>> index 6069a11fb51a..1a37db99df48 100644 >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c >>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) >>>> eventfd_signal(vdev->ctx[0].trigger, 1); >>>> } >>>> >>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >>>> +/* Returns true if INTx has been masked by this function. */ >>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >>>> { >>>> struct pci_dev *pdev = vdev->pdev; >>>> unsigned long flags; >>>> + bool intx_masked = false; >>>> >>>> spin_lock_irqsave(&vdev->irqlock, flags); >>>> >>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) >>>> disable_irq_nosync(pdev->irq); >>>> >>>> vdev->ctx[0].masked = true; >>>> + intx_masked = true; >>>> } >>>> >>>> spin_unlock_irqrestore(&vdev->irqlock, flags); >>>> + return intx_masked; >>>> } >>> >>> >>> There's certainly another path through this function that masks the >>> interrupt, which makes the definition of this return value a bit >>> confusing. >> >> For our case we should not hit that path. But we can return the >> intx_masked true from that path as well to align return value. >> >>> Wouldn't it be simpler not to overload the masked flag on >>> the interrupt context like this and instead set a new flag on the vdev >>> under irqlock to indicate the device is unable to generate interrupts. >>> The irq handler would add a test of this flag before any tests that >>> would access the device. Thanks, >>> >>> Alex >>> >> >> We will set this flag inside runtime_suspend callback but the >> device can be in non-D3cold state (For example, if user has >> disabled d3cold explicitly by sysfs, the D3cold is not supported in >> the platform, etc.). Also, in D3cold supported case, the device will >> be in D0 till the PCI core moves the device into D3cold. In this case, >> there is possibility that the device can generate an interrupt. >> If we add check in the IRQ handler, then we won’t check and clear >> the IRQ status, but the interrupt line will still be asserted >> which can cause interrupt flooding. >> >> This was the reason for disabling interrupt itself instead of >> checking flag in the IRQ handler. 
> > Ok, maybe this is largely a clarification of the return value of > vfio_pci_intx_mask(). I think what you're looking for is whether the > context mask was changed, rather than whether the interrupt is masked, > which I think avoids the confusion regarding whether the first branch > should return true or false. So the variable should be something like > "masked_changed" and the comment changed to "Returns true if the INTx > vfio_pci_irq_ctx.masked value is changed". > Thanks Alex. I will rename the variable and update the comment. > Testing is_intx() outside of the irqlock is potentially racy, so do we > need to add the pm-get/put wrappers on ioctls first to avoid the > possibility that pm-suspend can race a SET_IRQS ioctl? Thanks, > > Alex > Even after adding this patch, the runtime suspend will not be supported for the device with users. It will be supported only after patch 4 in this patch series. So with this patch, the pm-suspend can be called only for the cases where vfio device has no user and there we should not see the race condition. But I can move the patch related with pm-get/put wrappers on ioctls before this patch since both are independent. Regards, Abhishek
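For context, the pm-get/put wrapping being reordered here follows the pattern the series already applies to vfio_pci_rw(); a sketch for an ioctl entry point (the vfio_pci_core_do_ioctl() helper is hypothetical, used only to keep the wrapper readable):

long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
			 unsigned long arg)
{
	struct vfio_pci_core_device *vdev =
		container_of(core_vdev, struct vfio_pci_core_device, vdev);
	long ret;

	/* Wake the device (and keep it awake) for the whole ioctl. */
	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
	if (ret < 0)
		return ret;

	ret = vfio_pci_core_do_ioctl(vdev, cmd, arg);	/* hypothetical */

	pm_runtime_put(&vdev->pdev->dev);
	return ret;
}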
On Mon, 11 Jul 2022 14:48:34 +0530 Abhishek Sahu <abhsahu@nvidia.com> wrote: > On 7/8/2022 9:15 PM, Alex Williamson wrote: > > On Fri, 8 Jul 2022 14:51:30 +0530 > > Abhishek Sahu <abhsahu@nvidia.com> wrote: > > > >> On 7/6/2022 9:09 PM, Alex Williamson wrote: > >>> On Fri, 1 Jul 2022 16:38:09 +0530 > >>> Abhishek Sahu <abhsahu@nvidia.com> wrote: > >>> > >>>> This patch adds INTx handling during runtime suspend/resume. > >>>> All the suspend/resume related code for the user to put the device > >>>> into the low power state will be added in subsequent patches. > >>>> > >>>> The INTx are shared among devices. Whenever any INTx interrupt comes > >>> > >>> "The INTx lines may be shared..." > >>> > >>>> for the VFIO devices, then vfio_intx_handler() will be called for each > >>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx() > >>> > >>> "...device sharing the interrupt." > >>> > >>>> and checks if the interrupt has been generated for the current device. > >>>> Now, if the device is already in the D3cold state, then the config space > >>>> can not be read. Attempt to read config space in D3cold state can > >>>> cause system unresponsiveness in a few systems. To prevent this, mask > >>>> INTx in runtime suspend callback and unmask the same in runtime resume > >>>> callback. If INTx has been already masked, then no handling is needed > >>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and > >>>> vfio_pci_intx_mask() has been updated to return true if INTx has been > >>>> masked inside this function. > >>>> > >>>> For the runtime suspend which is triggered for the no user of VFIO > >>>> device, the is_intx() will return false and these callbacks won't do > >>>> anything. > >>>> > >>>> The MSI/MSI-X are not shared so similar handling should not be > >>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal() > >>>> without doing any device-specific config access. When the user performs > >>>> any config access or IOCTL after receiving the eventfd notification, > >>>> then the device will be moved to the D0 state first before > >>>> servicing any request. > >>>> > >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com> > >>>> --- > >>>> drivers/vfio/pci/vfio_pci_core.c | 37 +++++++++++++++++++++++++++---- > >>>> drivers/vfio/pci/vfio_pci_intrs.c | 6 ++++- > >>>> include/linux/vfio_pci_core.h | 3 ++- > >>>> 3 files changed, 40 insertions(+), 6 deletions(-) > >>>> > >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c > >>>> index a0d69ddaf90d..5948d930449b 100644 > >>>> --- a/drivers/vfio/pci/vfio_pci_core.c > >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c > >>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat > >>>> return ret; > >>>> } > >>>> > >>>> +#ifdef CONFIG_PM > >>>> +static int vfio_pci_core_runtime_suspend(struct device *dev) > >>>> +{ > >>>> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > >>>> + > >>>> + /* > >>>> + * If INTx is enabled, then mask INTx before going into the runtime > >>>> + * suspended state and unmask the same in the runtime resume. > >>>> + * If INTx has already been masked by the user, then > >>>> + * vfio_pci_intx_mask() will return false and in that case, INTx > >>>> + * should not be unmasked in the runtime resume. 
> >>>> + */ > >>>> + vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev)); > >>>> + > >>>> + return 0; > >>>> +} > >>>> + > >>>> +static int vfio_pci_core_runtime_resume(struct device *dev) > >>>> +{ > >>>> + struct vfio_pci_core_device *vdev = dev_get_drvdata(dev); > >>>> + > >>>> + if (vdev->pm_intx_masked) > >>>> + vfio_pci_intx_unmask(vdev); > >>>> + > >>>> + return 0; > >>>> +} > >>>> +#endif /* CONFIG_PM */ > >>>> + > >>>> /* > >>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working, > >>>> - * so use structure without any callbacks. > >>>> - * > >>>> * The pci-driver core runtime PM routines always save the device state > >>>> * before going into suspended state. If the device is going into low power > >>>> * state with only with runtime PM ops, then no explicit handling is needed > >>>> * for the devices which have NoSoftRst-. > >>>> */ > >>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { }; > >>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { > >>>> + SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend, > >>>> + vfio_pci_core_runtime_resume, > >>>> + NULL) > >>>> +}; > >>>> > >>>> int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) > >>>> { > >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c > >>>> index 6069a11fb51a..1a37db99df48 100644 > >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c > >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c > >>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused) > >>>> eventfd_signal(vdev->ctx[0].trigger, 1); > >>>> } > >>>> > >>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >>>> +/* Returns true if INTx has been masked by this function. */ > >>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >>>> { > >>>> struct pci_dev *pdev = vdev->pdev; > >>>> unsigned long flags; > >>>> + bool intx_masked = false; > >>>> > >>>> spin_lock_irqsave(&vdev->irqlock, flags); > >>>> > >>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev) > >>>> disable_irq_nosync(pdev->irq); > >>>> > >>>> vdev->ctx[0].masked = true; > >>>> + intx_masked = true; > >>>> } > >>>> > >>>> spin_unlock_irqrestore(&vdev->irqlock, flags); > >>>> + return intx_masked; > >>>> } > >>> > >>> > >>> There's certainly another path through this function that masks the > >>> interrupt, which makes the definition of this return value a bit > >>> confusing. > >> > >> For our case we should not hit that path. But we can return the > >> intx_masked true from that path as well to align return value. > >> > >>> Wouldn't it be simpler not to overload the masked flag on > >>> the interrupt context like this and instead set a new flag on the vdev > >>> under irqlock to indicate the device is unable to generate interrupts. > >>> The irq handler would add a test of this flag before any tests that > >>> would access the device. Thanks, > >>> > >>> Alex > >>> > >> > >> We will set this flag inside runtime_suspend callback but the > >> device can be in non-D3cold state (For example, if user has > >> disabled d3cold explicitly by sysfs, the D3cold is not supported in > >> the platform, etc.). Also, in D3cold supported case, the device will > >> be in D0 till the PCI core moves the device into D3cold. In this case, > >> there is possibility that the device can generate an interrupt. 
> >> If we add check in the IRQ handler, then we won’t check and clear > >> the IRQ status, but the interrupt line will still be asserted > >> which can cause interrupt flooding. > >> > >> This was the reason for disabling interrupt itself instead of > >> checking flag in the IRQ handler. > > > > Ok, maybe this is largely a clarification of the return value of > > vfio_pci_intx_mask(). I think what you're looking for is whether the > > context mask was changed, rather than whether the interrupt is masked, > > which I think avoids the confusion regarding whether the first branch > > should return true or false. So the variable should be something like > > "masked_changed" and the comment changed to "Returns true if the INTx > > vfio_pci_irq_ctx.masked value is changed". > > > > Thanks Alex. > I will rename the variable and update the comment. > > > Testing is_intx() outside of the irqlock is potentially racy, so do we > > need to add the pm-get/put wrappers on ioctls first to avoid the > > possibility that pm-suspend can race a SET_IRQS ioctl? Thanks, > > > > Alex > > > > Even after adding this patch, the runtime suspend will not be supported > for the device with users. It will be supported only after patch 4 in > this patch series. So with this patch, the pm-suspend can be called only > for the cases where vfio device has no user and there we should not see > the race condition. We should also not see is_intx() true for unused devices. Thanks, Alex