
[v2,00/10] Add IO page table replacement support

Message ID cover.1675802050.git.nicolinc@nvidia.com

Message

Nicolin Chen Feb. 7, 2023, 9:17 p.m. UTC
Changelog
v1->v2:
 * Rebased on top of vfio_device cdev v2 series.
 * Update the kdoc and commit message of iommu_group_replace_domain().
 * Dropped revert-to-core-domain part in iommu_group_replace_domain().
 * Dropped !ops->dma_unmap check in vfio_iommufd_emulated_attach_ioas().
 * Added missing rc value in vfio_iommufd_emulated_attach_ioas() from the
   iommufd_access_set_ioas() call.
 * Added a new patch in vfio_main to deny vfio_pin/unpin_pages() calls if
   vdev->ops->dma_unmap is not implemented.
 * Added a __iommmufd_device_detach helper and let the replace routine do
   a partial detach().
 * Added restriction on auto_domains to use the replace feature.
 * Added the patch "iommufd/device: Make hwpt_list list_add/del symmetric"
   from the has_group removal series.

Hi all,

The existing IOMMU APIs provide a pair of functions: iommu_attach_group()
for callers to attach a device from the default_domain (NULL if not
supported) to a given iommu domain, and iommu_detach_group() for callers
to detach a device from a given domain back to the default_domain.
Internally, the detach_dev op is deprecated for newer drivers that have a
default_domain. This means those drivers can likely switch an attached
domain to another one, without staging the device at a blocking or default
domain, for use cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
   table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
   to an S1 domain, or when switching between relevant S1 domains.

This series introduces a new iommu_group_replace_domain() API for that,
and adds the corresponding support throughout the uAPI, so user space can
perform such a REPLACE by reusing the existing
VFIO_DEVICE_ATTACH_IOMMUFD_PT ioctl. This means user space needs to be
aware whether the device is attached or not: calling
VFIO_DEVICE_ATTACH_IOMMUFD_PT on an unattached device is a regular ATTACH,
while calling it on an already attached device is a REPLACE.
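
A minimal sketch of the expected userspace flow, assuming the cdev uAPI
from the prerequisite vfio_device cdev series (the struct and ioctl names
are taken from that in-flight series and may still change):

	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Attach on the first call; replace on later calls while attached */
	static int vfio_attach_or_replace(int device_fd, __u32 pt_id)
	{
		struct vfio_device_attach_iommufd_pt attach = {
			.argsz = sizeof(attach),
			.flags = 0,
			.pt_id = pt_id,	/* IOAS or hw_pagetable object ID */
		};

		/*
		 * On an unattached device this is a regular ATTACH; on an
		 * already attached device the kernel replaces the current
		 * IO page table with pt_id, without passing the device
		 * through a blocking domain.
		 */
		return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
	}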

QEMU with this feature should have the vIOMMU maintain a cache of the
guest I/O page table addresses and assign a unique IOAS to each unique
guest page table.

As the guest writes a page table address to the HW registers, QEMU should
then use the 'replace domain' operation on VFIO to assign the VFIO device
to the correct, de-duplicated page table.
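
A minimal sketch of that de-duplication idea (illustrative only, not
actual QEMU code; alloc_ioas() here stands in for an IOMMU_IOAS_ALLOC
call on the iommufd and is hypothetical):

	#include <stdint.h>

	#define MAX_GPTS 64

	static struct {
		uint64_t gpt_addr;	/* guest I/O page table base address */
		uint32_t ioas_id;	/* IOAS de-duplicated for that table */
	} gpt_cache[MAX_GPTS];
	static unsigned int gpt_cache_len;

	/* Look up (or allocate) the unique IOAS backing a guest page table */
	static uint32_t ioas_for_guest_pgtable(uint64_t gpt_addr,
					       uint32_t (*alloc_ioas)(void))
	{
		unsigned int i;

		for (i = 0; i < gpt_cache_len; i++)
			if (gpt_cache[i].gpt_addr == gpt_addr)
				return gpt_cache[i].ioas_id;

		if (gpt_cache_len == MAX_GPTS)
			return UINT32_MAX;	/* full; real code would evict */

		gpt_cache[gpt_cache_len].gpt_addr = gpt_addr;
		gpt_cache[gpt_cache_len].ioas_id = alloc_ioas();
		return gpt_cache[gpt_cache_len++].ioas_id;
	}

The returned IOAS (or a hw_pagetable on top of it) then feeds the
VFIO_DEVICE_ATTACH_IOMMUFD_PT call sketched above, which becomes a REPLACE
when the device is already attached.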

The algorithm where QEMU uses one VFIO container per-device and removes
all the mappings to change the assignment should ideally not be used with
iommufd.

To apply this series, please rebase on top of the following patches:
1) [PATCH v2 00/14] Add vfio_device cdev for iommufd support
   https://lore.kernel.org/kvm/20230206090532.95598-1-yi.l.liu@intel.com/

Or you can also find this series on Github:
https://github.com/nicolinc/iommufd/commits/iommu_group_replace_domain-v2

Thank you
Nicolin Chen

Nicolin Chen (9):
  iommu: Introduce a new iommu_group_replace_domain() API
  iommufd: Create access in vfio_iommufd_emulated_bind()
  iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
  iommufd: Add replace support in iommufd_access_set_ioas()
  iommufd/selftest: Add coverage for access->ioas replacement
  iommufd/device: Make hwpt_list list_add/del symmetric
  iommufd/device: Use iommu_group_replace_domain()
  vfio: Support IO page table replacement
  vfio: Do not allow !ops->dma_unmap in vfio_pin/unpin_pages()

Yi Liu (1):
  iommu: Move dev_iommu_ops() to private header

 drivers/iommu/iommu-priv.h                    |  22 ++
 drivers/iommu/iommu.c                         |  30 +++
 drivers/iommu/iommufd/device.c                | 221 +++++++++++++-----
 drivers/iommu/iommufd/iommufd_private.h       |   4 +
 drivers/iommu/iommufd/iommufd_test.h          |   4 +
 drivers/iommu/iommufd/selftest.c              |  25 +-
 drivers/vfio/iommufd.c                        |  30 ++-
 drivers/vfio/vfio_main.c                      |   4 +
 include/linux/iommu.h                         |  11 -
 include/linux/iommufd.h                       |   3 +-
 include/uapi/linux/vfio.h                     |   6 +
 tools/testing/selftests/iommu/iommufd.c       |  29 ++-
 tools/testing/selftests/iommu/iommufd_utils.h |  22 +-
 13 files changed, 321 insertions(+), 90 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h

Comments

Yi Liu Feb. 8, 2023, 8:12 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> +/**
> + * __iommmufd_device_detach - Detach a device from idev->hwpt to

Nit: s/_iommmufd/__iommufd

Regards,
Yi Liu
Tian, Kevin Feb. 9, 2023, 2:49 a.m. UTC | #2
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> From: Yi Liu <yi.l.liu@intel.com>
> 
> dev_iommu_ops() is essentially only used in iommu subsystem, so
> move to a private header to avoid being abused by other drivers.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tian, Kevin Feb. 9, 2023, 2:50 a.m. UTC | #3
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> QEMU with this feature should have the vIOMMU maintain a cache of the
> guest io page table addresses and assign a unique IOAS to each unique
> guest page table.
> 

I still didn't see a clear reply why above is required. Please elaborate.
Tian, Kevin Feb. 9, 2023, 2:55 a.m. UTC | #4
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> +int iommu_group_replace_domain(struct iommu_group *group,
> +			       struct iommu_domain *new_domain)
> +{
> +	int ret;
> +
> +	if (!new_domain)
> +		return -EINVAL;

Is there value of allowing NULL new domain so this plays like
iommu_detach_group() then iommufd only needs call one
function in both attach/detach path?

> +
> +	mutex_lock(&group->mutex);
> +	ret = __iommu_group_set_domain(group, new_domain);
> +	if (ret)
> +		__iommu_group_set_domain(group, group->domain);
> +	mutex_unlock(&group->mutex);
> +	return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommu_group_replace_domain,
> IOMMUFD_INTERNAL);
> +
>  static int iommu_group_do_set_platform_dma(struct device *dev, void
> *data)
>  {
>  	const struct iommu_ops *ops = dev_iommu_ops(dev);
> --
> 2.39.1
>
Tian, Kevin Feb. 9, 2023, 2:59 a.m. UTC | #5
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> Add a new IOMMU_TEST_OP_ACCESS_SET_IOAS to allow setting access->ioas
> individually, corresponding to the iommufd_access_set_ioas() helper.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tian, Kevin Feb. 9, 2023, 4:06 a.m. UTC | #6
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, February 8, 2023 5:18 AM
> 
> Remove the vdev->iommufd_attached check, since the kernel can internally
> handle a replacement of the IO page table now.
> 
> Also update the VFIO_DEVICE_ATTACH_IOMMUFD_PT kdoc in the uAPI
> header.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jason Gunthorpe Feb. 9, 2023, 1:23 p.m. UTC | #7
On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> > 
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > +			       struct iommu_domain *new_domain)
> > +{
> > +	int ret;
> > +
> > +	if (!new_domain)
> > +		return -EINVAL;
> 
> Is there value of allowing NULL new domain so this plays like
> iommu_detach_group() then iommufd only needs call one
> function in both attach/detach path?

We've used NULL to mean the 'platform domain' in the iommu core code
in a few places, I'd prefer to avoid overloading NULL.

IMHO it doesn't help iommufd to substitute detach with replace.

Jason
Nicolin Chen Feb. 9, 2023, 4:13 p.m. UTC | #8
On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
> 
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > QEMU with this feature should have the vIOMMU maintain a cache of the
> > guest io page table addresses and assign a unique IOAS to each unique
> > guest page table.
> >
> 
> I still didn't see a clear reply why above is required. Please elaborate.

Hmm..this piece was directly copied from Jason's inputs before I
sent v1. And I thought he explained in the previous version. I
can drop it in v3 if that makes you happy.

Thanks
Nic
Nicolin Chen Feb. 9, 2023, 8:55 p.m. UTC | #9
On Wed, Feb 08, 2023 at 08:08:42AM +0000, Liu, Yi L wrote:
 
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > iommu_group_replace_domain() is introduced to support use cases where
> > an
> > iommu_group can be attached to a new domain without getting detached
> > from
> > the old one. This replacement feature will be useful, for cases such as:
> > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> >    table with a larger table (PASID=N)
> > 2) Nesting mode, when switching the attaching device from an S2 domain
> >    to an S1 domain, or when switching between relevant S1 domains.
> > as it allows these cases to switch seamlessly without a DMA disruption.
> >
> > So, call iommu_group_replace_domain() in the
> > iommufd_device_do_attach().
> > And add a __iommmufd_device_detach helper to allow the replace routine
> > to
> > do a partial detach on the current hwpt that's being replaced. Though the
> > updated locking logic is overcomplicated, it will be eased, once those
> > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > allocation/destroy() functions in the coming nesting series, as that'll
> > depend on a new ->domain_alloc_user op in the iommu core.
> >
> > Also, block replace operations that are from/to auto_domains, i.e. only
> > user-allocated hw_pagetables can be replaced or replaced with.
> >
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> >  drivers/iommu/iommufd/device.c          | 101 +++++++++++++++++-------
> >  drivers/iommu/iommufd/iommufd_private.h |   2 +
> >  2 files changed, 76 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/iommufd/device.c
> > b/drivers/iommu/iommufd/device.c
> > index b8c3e3baccb5..8a9834fc129a 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -9,6 +9,8 @@
> >  #include "io_pagetable.h"
> >  #include "iommufd_private.h"
> >
> > +MODULE_IMPORT_NS(IOMMUFD_INTERNAL);
> > +
> >  static bool allow_unsafe_interrupts;
> >  module_param(allow_unsafe_interrupts, bool, S_IRUGO | S_IWUSR);
> >  MODULE_PARM_DESC(
> > @@ -194,9 +196,61 @@ static bool
> > iommufd_hw_pagetable_has_group(struct iommufd_hw_pagetable *hwpt,
> >       return false;
> >  }
> >
> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> > new_hwpt
> 
> This function doesn't do anything to make this device attached to new_hwpt.
> It is done in the iommufd_device_attach_ioas(). New_hwpt here indicates if
> this detach requires to do some extra thing. E.g. remove reserved iova from
> the idev->hwpt->ioas. So may just say " Detach a device from idev->hwpt",
> and explain the usage of new_hwpt in the below.

Yea. You are right.

> > + * @idev: device to detach
> > + * @new_hwpt: new hw_pagetable to attach (pass in NULL for a simple
> > detach)
> 
> The new hw_pagetable to be attached.

OK.

> > + * @detach_group: flag to call iommu_detach_group
> > + *
> > + * This is a cleanup helper shared by the replace and detach routines.
> > Comparing
> > + * to a detach routine, a replace routine only needs a partial detach
> > procedure:
> > + * it does not need the iommu_detach_group(); it will attach the device to
> > a new
> > + * hw_pagetable after a partial detach from the currently attached
> > hw_pagetable,
> > + * so certain steps can be skipped if two hw_pagetables have the same
> > IOAS.
> > + */
> > +static void __iommmufd_device_detach(struct iommufd_device *idev,
> > +                                  struct iommufd_hw_pagetable
> > *new_hwpt,
> > +                                  bool detach_group)
> > +{
> > +     struct iommufd_hw_pagetable *hwpt = idev->hwpt;
> > +     struct iommufd_ioas *new_ioas = NULL;
> > +
> > +     if (new_hwpt)
> > +             new_ioas = new_hwpt->ioas;
> > +
> > +     mutex_lock(&hwpt->devices_lock);
> > +     list_del(&idev->devices_item);
> > +     if (hwpt->ioas != new_ioas)
> > +             mutex_lock(&hwpt->ioas->mutex);
> 
> The lock order is mostly hwpt->ioas->mutex and then hwpt->devices_lock.
> See the iommufd_device_auto_get_domain(). If possible, may switch the
> order sequence here.

Yea, I know it's a bit strange. Yet...

Our nesting series simplifies this part to:
	if (cur_ioas != new_ioas) {
		mutex_lock(&hwpt->ioas->mutex);
		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
		mutex_unlock(&hwpt->ioas->mutex);
	}

So, here is trying to avoid something like:
	if (cur_ioas != new_ioas)
		mutex_lock(&hwpt->ioas->mutex);
	// doing something
	if (cur_ioas != new_ioas)
		iopt_remove_reserved_iova(&hwpt->ioas->iopt, idev->dev);
	// doing something
	if (cur_ioas != new_ioas)
		mutex_unlock(&hwpt->ioas->mutex);

> Also, rename hwpt to be cur_hwpt, this may help
> reviewers to distinguish it from the hwpt in the caller of this function. It
> looks to be a deadlock at first look, but not after closer reading.

Sure.

> > @@ -345,6 +406,13 @@ int iommufd_device_attach(struct iommufd_device
> > *idev, u32 *pt_id)
> >               struct iommufd_hw_pagetable *hwpt =
> >                       container_of(pt_obj, struct
> > iommufd_hw_pagetable, obj);
> >
> > +             if (idev->hwpt == hwpt)
> > +                     goto out_done;
> > +             if (idev->hwpt && idev->hwpt->auto_domain) {
> > +                     rc = -EBUSY;
> 
> This means if device was attached to an auto_created hwpt, then we
> cannot replace it with a user allocated hwpt? If yes, this means the
> replace is not available until user hwpt support, which is part of nesting.

After aligning with Jason, this limit here might be wrong, as we
should be able to support replacing an IOAS. I'd need to take a
closer look and fix it in v3.

> > +             if (idev->hwpt)
> > +                     return -EBUSY;
> 
> So we don't allow ioas replacement for physical devices. Is it?
> Looks like emulated devices allows it.

This was to avoid an replace with an auto_domain. Similarly, it's
likely wrong, as I replied above.

Thanks
Nic
Nicolin Chen Feb. 9, 2023, 8:56 p.m. UTC | #10
On Wed, Feb 08, 2023 at 08:12:55AM +0000, Liu, Yi L wrote:
> 
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, February 8, 2023 5:18 AM
> >
> > +/**
> > + * __iommmufd_device_detach - Detach a device from idev->hwpt to
> 
> Nit: s/_iommmufd/__iommufd

Will fix it. Thanks!
Jason Gunthorpe Feb. 10, 2023, 12:01 a.m. UTC | #11
On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
> On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
>  
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > iommu_group_replace_domain() is introduced to support use cases where
> > > an
> > > iommu_group can be attached to a new domain without getting detached
> > > from
> > > the old one. This replacement feature will be useful, for cases such as:
> > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > >    table with a larger table (PASID=N)
> > > 2) Nesting mode, when switching the attaching device from an S2 domain
> > >    to an S1 domain, or when switching between relevant S1 domains.
> > > as it allows these cases to switch seamlessly without a DMA disruption.
> > >
> > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> > > And add a __iommmufd_device_detach helper to allow the replace routine
> > > to
> > > do a partial detach on the current hwpt that's being replaced. Though the
> > > updated locking logic is overcomplicated, it will be eased, once those
> > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > > allocation/destroy() functions in the coming nesting series, as that'll
> > > depend on a new ->domain_alloc_user op in the iommu core.
> > 
> > then why not moving those changes into this series to make it simple?
> 
> The simplification depends on the new ->domain_alloc_user op and
> its implementation in SMMU driver, which would be introduced by
> the nesting series of VT-d and SMMU respectively.

Since we are not fixing all the drivers at this point, this argument
doesn't hold water.

It is as I said in the other email, there should be no changes to the
normal attach/replace path when adding manual HWPT creation - once
those are removed there should be minimal connection between these two
series.

Jason
Tian, Kevin Feb. 10, 2023, 1:34 a.m. UTC | #12
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, February 10, 2023 12:13 AM
> 
> On Thu, Feb 09, 2023 at 02:50:39AM +0000, Tian, Kevin wrote:
> >
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > QEMU with this feature should have the vIOMMU maintain a cache of the
> > > guest io page table addresses and assign a unique IOAS to each unique
> > > guest page table.
> > >
> >
> > I still didn't see a clear reply why above is required. Please elaborate.
> 
> Hmm..this piece was directly copied from Jason's inputs before I
> sent v1. And I thought he explained in the previous version. I
> can drop it in v3 if that makes you happy.
> 

Jason's reply was:

> I think the guidance is about the change to VFIO uAPI where it is now
> possible to change the domain attached, previously that was not
> possible

I cannot connect it to the above statement. Anyway if you think
that guidance to userspace is necessary please connect it to the
replace semantics to explain the motivation.
Tian, Kevin Feb. 10, 2023, 1:34 a.m. UTC | #13
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 9, 2023 9:23 PM
> 
> On Thu, Feb 09, 2023 at 02:55:24AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, February 8, 2023 5:18 AM
> > >
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > +			       struct iommu_domain *new_domain)
> > > +{
> > > +	int ret;
> > > +
> > > +	if (!new_domain)
> > > +		return -EINVAL;
> >
> > Is there value of allowing NULL new domain so this plays like
> > iommu_detach_group() then iommufd only needs call one
> > function in both attach/detach path?
> 
> We've used NULL to mean the 'platform domain' in the iommu core code
> in a few places, I'd prefer to avoid overloading NULL.
> 
> IMHO it doesn't help iommufd to substitute detach with replace.
> 

OK
Nicolin Chen Feb. 10, 2023, 8:50 p.m. UTC | #14
On Thu, Feb 09, 2023 at 08:01:00PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 09, 2023 at 01:13:07PM -0800, Nicolin Chen wrote:
> > On Thu, Feb 09, 2023 at 04:00:52AM +0000, Tian, Kevin wrote:
> >  
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, February 8, 2023 5:18 AM
> > > >
> > > > iommu_group_replace_domain() is introduced to support use cases where
> > > > an
> > > > iommu_group can be attached to a new domain without getting detached
> > > > from
> > > > the old one. This replacement feature will be useful, for cases such as:
> > > > 1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
> > > >    table with a larger table (PASID=N)
> > > > 2) Nesting mode, when switching the attaching device from an S2 domain
> > > >    to an S1 domain, or when switching between relevant S1 domains.
> > > > as it allows these cases to switch seamlessly without a DMA disruption.
> > > >
> > > > So, call iommu_group_replace_domain() in the iommufd_device_do_attach().
> > > > And add a __iommmufd_device_detach helper to allow the replace routine
> > > > to
> > > > do a partial detach on the current hwpt that's being replaced. Though the
> > > > updated locking logic is overcomplicated, it will be eased, once those
> > > > iopt_table_add/remove_ioas and list_add/del calls are moved to hwpt's
> > > > allocation/destroy() functions in the coming nesting series, as that'll
> > > > depend on a new ->domain_alloc_user op in the iommu core.
> > > 
> > > then why not moving those changes into this series to make it simple?
> > 
> > The simplification depends on the new ->domain_alloc_user op and
> > its implementation in SMMU driver, which would be introduced by
> > the nesting series of VT-d and SMMU respectively.
> 
> Since we are not fixing all the drivers at this point, this argument
> doesn't hold water.
> 
> It is as I said in the other email, there should be no changes to the
> normal attach/replace path when adding manual HWPT creation - once
> those are removed there should be minimal connection between these two
> series.

Yes. I replied here earlier than the discussion with you in
that thread. I think I should just drop this line of commit
messages.

Thanks
Nic