[v1,0/8] Add IO page table replacement support

Message ID: cover.1675320212.git.nicolinc@nvidia.com

Message

Nicolin Chen Feb. 2, 2023, 7:05 a.m. UTC
Hi all,

The existing IOMMU APIs provide a pair of functions: iommu_attach_group()
for callers to attach a device from the default_domain (NULL if not
supported) to a given iommu domain, and iommu_detach_group() for callers
to detach a device from a given domain back to the default_domain.
Internally, the detach_dev op is deprecated for the newer drivers that
have a default_domain. This means that those drivers can likely switch
the attached domain to another one without staging the device at a
blocking or default domain, for use cases such as:
1) vPASID mode, when a guest wants to replace a single pasid (PASID=0)
   table with a larger table (PASID=N)
2) Nesting mode, when switching the attaching device from an S2 domain
   to an S1 domain, or when switching between relevant S1 domains.

This series introduces a new iommu_group_replace_domain() API for that,
and adds corresponding support throughout the uAPI, so user space can do
such a REPLACE by reusing the existing VFIO_DEVICE_ATTACH_IOMMUFD_PT
ioctl. This means that user space needs to be aware of whether the device
is attached or not: calling VFIO_DEVICE_ATTACH_IOMMUFD_PT on an unattached
device means a regular ATTACH, while calling it on an already-attached
device means a REPLACE.
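
A minimal userspace sketch of that usage (assuming the uAPI proposed in
the vfio_device cdev series listed below, i.e. a struct
vfio_device_attach_iommufd_pt carrying a pt_id; names may change before
merge):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int attach_or_replace(int device_fd, uint32_t ioas_id)
{
	struct vfio_device_attach_iommufd_pt attach = {
		.argsz = sizeof(attach),
		.pt_id = ioas_id,
	};

	/* ATTACH if nothing is attached yet, REPLACE otherwise */
	return ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach);
}

/* e.g.:
 *	attach_or_replace(device_fd, ioas_a);	regular ATTACH
 *	attach_or_replace(device_fd, ioas_b);	REPLACE, no detach in between
 */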

QEMU with this feature should have the vIOMMU maintain a cache of the
guest io page table addresses and assign a unique IOAS to each unique
guest page table.

As the guest writes the page table address to the HW registers, QEMU
should then use the 'replace domain' operation on VFIO to assign the VFIO
device to the correct de-duplicated page table.
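
For illustration only, a rough sketch of that de-duplication (not QEMU
code; it reuses the attach_or_replace() helper sketched above and the
IOMMU_IOAS_ALLOC ioctl, with error handling omitted):

#include <linux/iommufd.h>

struct gpt_cache_entry {
	uint64_t gpt_addr;	/* guest IO page table address */
	uint32_t ioas_id;	/* IOAS assigned to that table */
};

static struct gpt_cache_entry cache[64];
static int cache_len;

static uint32_t ioas_for_guest_pgtable(int iommufd, uint64_t gpt_addr)
{
	struct iommu_ioas_alloc alloc = { .size = sizeof(alloc) };
	int i;

	for (i = 0; i < cache_len; i++)
		if (cache[i].gpt_addr == gpt_addr)
			return cache[i].ioas_id;	/* de-duplicated */

	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc);	/* one IOAS per table */
	cache[cache_len].gpt_addr = gpt_addr;
	cache[cache_len].ioas_id = alloc.out_ioas_id;
	cache_len++;
	return alloc.out_ioas_id;
}

/* On a guest write of the IO page table base register:
 *	attach_or_replace(device_fd, ioas_for_guest_pgtable(iommufd, gpt_addr));
 */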

The algorithm where QEMU uses one VFIO container per-device and removes
all the mappings to change the assignment should ideally not be used with
iommufd.

To apply this series, please rebase on top of the following patches:
1) [PATCH 00/13] Add vfio_device cdev for iommufd support
   https://lore.kernel.org/kvm/20230117134942.101112-1-yi.l.liu@intel.com/
2) (Merged) [PATCH v5 0/5] iommu: Retire detach_dev callback
   https://lore.kernel.org/linux-iommu/20230110025408.667767-1-baolu.lu@linux.intel.com/
3) (Merged) [PATCH] selftests: iommu: Fix test_cmd_destroy_access() call in user_copy
   https://lore.kernel.org/lkml/20230120074204.1368-1-nicolinc@nvidia.com/

You can also find this series on GitHub:
https://github.com/nicolinc/iommufd/commits/iommu_group_replace_domain-v1

Thank you
Nicolin Chen

Nicolin Chen (7):
  iommu: Introduce a new iommu_group_replace_domain() API
  iommufd: Create access in vfio_iommufd_emulated_bind()
  iommufd/selftest: Add IOMMU_TEST_OP_ACCESS_SET_IOAS coverage
  iommufd: Add replace support in iommufd_access_set_ioas()
  iommufd/selftest: Add coverage for access->ioas replacement
  iommufd/device: Use iommu_group_replace_domain()
  vfio-iommufd: Support IO page table replacement

Yi Liu (1):
  iommu: Move dev_iommu_ops() to private header

 drivers/iommu/iommu-priv.h                    |  22 +++
 drivers/iommu/iommu.c                         |  32 ++++
 drivers/iommu/iommufd/device.c                | 150 +++++++++++++++---
 drivers/iommu/iommufd/iommufd_private.h       |   4 +
 drivers/iommu/iommufd/iommufd_test.h          |   4 +
 drivers/iommu/iommufd/selftest.c              |  25 ++-
 drivers/vfio/iommufd.c                        |  33 ++--
 include/linux/iommu.h                         |  11 --
 include/linux/iommufd.h                       |   4 +-
 tools/testing/selftests/iommu/iommufd.c       |  29 +++-
 tools/testing/selftests/iommu/iommufd_utils.h |  22 ++-
 11 files changed, 273 insertions(+), 63 deletions(-)
 create mode 100644 drivers/iommu/iommu-priv.h

Comments

Nicolin Chen Feb. 2, 2023, 7:14 p.m. UTC | #1
On Thu, Feb 02, 2023 at 06:21:20PM +0800, Baolu Lu wrote:
> On 2023/2/2 15:05, Nicolin Chen wrote:
> > +/**
> > + * iommu_group_replace_domain - replace the domain that a group is attached to
> > + * @new_domain: new IOMMU domain to replace with
> > + * @group: IOMMU group that will be attached to the new domain
> > + *
> > + * This API allows the group to switch domains without being forced to go to
> > + * the blocking domain in-between.
> > + *
> > + * If the attached domain is a core domain (e.g. a default_domain), it will act
> > + * just like the iommu_attach_group().
> 
> I am not following above two lines. Why and how could iommufd set a
> core domain to an iommu_group?

Perhaps this isn't the best narrative. What it's supposed to say
is that this function acts as an iommu_attach_group() call if the
device is "detached", yet we have changed the semantics about the
word "detach". So, what would be the correct way to write such a
note?

Thanks
Nic
Baolu Lu Feb. 3, 2023, 1:33 a.m. UTC | #2
On 2023/2/3 3:14, Nicolin Chen wrote:
> On Thu, Feb 02, 2023 at 06:21:20PM +0800, Baolu Lu wrote:
>> On 2023/2/2 15:05, Nicolin Chen wrote:
>>> +/**
>>> + * iommu_group_replace_domain - replace the domain that a group is attached to
>>> + * @new_domain: new IOMMU domain to replace with
>>> + * @group: IOMMU group that will be attached to the new domain
>>> + *
>>> + * This API allows the group to switch domains without being forced to go to
>>> + * the blocking domain in-between.
>>> + *
>>> + * If the attached domain is a core domain (e.g. a default_domain), it will act
>>> + * just like the iommu_attach_group().
>> I am not following above two lines. Why and how could iommufd set a
>> core domain to an iommu_group?
> Perhaps this isn't the best narrative. What it's supposed to say
> is that this function acts as an iommu_attach_group() call if the
> device is "detached", yet we have changed the semantics about the
> word "detach". So, what should the correct way to write such a
> note?

How could this interface be used to detach a domain from a group?
Even if it could be used, doesn't it act as an iommu_detach_group()?

Best regards,
baolu
Nicolin Chen Feb. 3, 2023, 1:41 a.m. UTC | #3
On Fri, Feb 03, 2023 at 09:33:44AM +0800, Baolu Lu wrote:
> On 2023/2/3 3:14, Nicolin Chen wrote:
> > On Thu, Feb 02, 2023 at 06:21:20PM +0800, Baolu Lu wrote:
> > > On 2023/2/2 15:05, Nicolin Chen wrote:
> > > > +/**
> > > > + * iommu_group_replace_domain - replace the domain that a group is attached to
> > > > + * @new_domain: new IOMMU domain to replace with
> > > > + * @group: IOMMU group that will be attached to the new domain
> > > > + *
> > > > + * This API allows the group to switch domains without being forced to go to
> > > > + * the blocking domain in-between.
> > > > + *
> > > > + * If the attached domain is a core domain (e.g. a default_domain), it will act
> > > > + * just like the iommu_attach_group().
> > > I am not following above two lines. Why and how could iommufd set a
> > > core domain to an iommu_group?
> > Perhaps this isn't the best narrative. What it's supposed to say
> > is that this function acts as an iommu_attach_group() call if the
> > device is "detached", yet we have changed the semantics about the
> > word "detach". So, what should the correct way to write such a
> > note?
> 
> How could this interface be used as detaching a domain from a group?
> Even it could be used, doesn't it act as an iommu_detach_group()?

No. I didn't say that. It doesn't act as detach(), but attach()
when a device is already "detached".

The original statement is saying, "if the attached domain is a
core domain", i.e. the device is detach()-ed, "it will act just
like the iommu_attach_group()".

Thanks
Nic
Baolu Lu Feb. 3, 2023, 2:35 a.m. UTC | #4
On 2023/2/3 9:41, Nicolin Chen wrote:
> On Fri, Feb 03, 2023 at 09:33:44AM +0800, Baolu Lu wrote:
>> On 2023/2/3 3:14, Nicolin Chen wrote:
>>> On Thu, Feb 02, 2023 at 06:21:20PM +0800, Baolu Lu wrote:
>>>> On 2023/2/2 15:05, Nicolin Chen wrote:
>>>>> +/**
>>>>> + * iommu_group_replace_domain - replace the domain that a group is attached to
>>>>> + * @new_domain: new IOMMU domain to replace with
>>>>> + * @group: IOMMU group that will be attached to the new domain
>>>>> + *
>>>>> + * This API allows the group to switch domains without being forced to go to
>>>>> + * the blocking domain in-between.
>>>>> + *
>>>>> + * If the attached domain is a core domain (e.g. a default_domain), it will act
>>>>> + * just like the iommu_attach_group().
>>>> I am not following above two lines. Why and how could iommufd set a
>>>> core domain to an iommu_group?
>>> Perhaps this isn't the best narrative. What it's supposed to say
>>> is that this function acts as an iommu_attach_group() call if the
>>> device is "detached", yet we have changed the semantics about the
>>> word "detach". So, what should the correct way to write such a
>>> note?
>> How could this interface be used as detaching a domain from a group?
>> Even it could be used, doesn't it act as an iommu_detach_group()?
> No. I didn't say that. It doesn't act as detach(), but attach()
> when a device is already "detached".
> 
> The original statement is saying, "if the attached domain is a
> core domain", i.e. the device is detach()-ed, "it will act just
> like the iommu_attach_group()".

Oh! My bad. I misunderstood it. Sorry for the noise. :-)

Best regards,
baolu
Tian, Kevin Feb. 3, 2023, 8:09 a.m. UTC | #5
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 2, 2023 3:05 PM
> 
> QEMU with this feature should have the vIOMMU maintain a cache of the
> guest io page table addresses and assign a unique IOAS to each unique
> guest page table.

I didn't get why we impose such a requirement on userspace.

> 
> To apply this series, please rebase on top of the following patches:
> 1) [PATCH 00/13] Add vfio_device cdev for iommufd support
>    https://lore.kernel.org/kvm/20230117134942.101112-1-yi.l.liu@intel.com/
> 2) (Merged) [PATCH v5 0/5] iommu: Retire detach_dev callback
>    https://lore.kernel.org/linux-iommu/20230110025408.667767-1-
> baolu.lu@linux.intel.com/
> 3) (Merged) [PATCH] selftests: iommu: Fix test_cmd_destroy_access() call in
> user_copy
>    https://lore.kernel.org/lkml/20230120074204.1368-1-nicolinc@nvidia.com/
> 

No need to list merged work as a dependency. In concept every commit
merged in the version which this series is rebased on is required. 😊
Tian, Kevin Feb. 3, 2023, 10:10 a.m. UTC | #6
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 2, 2023 3:05 PM
> 
> Support an access->ioas replacement in iommufd_access_set_ioas(), which
> sets the access->ioas to NULL provisionally so that any further incoming
> iommufd_access_pin_pages() callback can be blocked.
> 
> Then, call access->ops->unmap() to clean up the entire iopt. To allow an
> iommufd_access_unpin_pages() callback to happen via this unmap() call,
> add an ioas_unpin pointer so the unpin routine won't be affected by the
> "access->ioas = NULL" trick above.
> 
> Also, a vdev without an ops->dma_unmap implementation cannot replace its
> access->ioas pointer. So add an iommufd_access_ioas_is_attached() helper
> to sanity that.
> 

Presumably a driver which doesn't implement ops->dma_unmap shouldn't
be allowed to do pin/unpin. But it could use vfio_dma_rw() to access an
iova range. In the latter case I don't see why replace cannot work.

Probably what's required here is to deny !ops->dma_unmap in
vfio_pin/unpin_pages and then make replace always allowed here?
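
(For illustration, the kind of guard this suggests, as a hypothetical
sketch against the current vfio_pin_pages() rather than an actual patch:)

int vfio_pin_pages(struct vfio_device *device, dma_addr_t iova,
		   int npage, int prot, struct page **pages)
{
	/* A driver that cannot be notified of unmaps must not pin pages;
	 * with that denied here, an access->ioas replacement can always
	 * be allowed for such drivers. */
	if (!device->ops->dma_unmap)
		return -EINVAL;

	/* ... existing pinning path unchanged ... */
}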
Jason Gunthorpe Feb. 3, 2023, 12:31 p.m. UTC | #7
On Fri, Feb 03, 2023 at 10:10:45AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, February 2, 2023 3:05 PM
> > 
> > Support an access->ioas replacement in iommufd_access_set_ioas(), which
> > sets the access->ioas to NULL provisionally so that any further incoming
> > iommufd_access_pin_pages() callback can be blocked.
> > 
> > Then, call access->ops->unmap() to clean up the entire iopt. To allow an
> > iommufd_access_unpin_pages() callback to happen via this unmap() call,
> > add an ioas_unpin pointer so the unpin routine won't be affected by the
> > "access->ioas = NULL" trick above.
> > 
> > Also, a vdev without an ops->dma_unmap implementation cannot replace its
> > access->ioas pointer. So add an iommufd_access_ioas_is_attached() helper
> > to sanity that.
> > 
> 
> Presumably a driver which doesn't implement ops->dma_unmap shouldn't
> be allowed to do pin/unpin. But it could use vfio_dma_rw() to access an
> iova range. In the latter case I don't see why replace cannot work.
> 
> Probably what's required here is to deny !ops->dma_unmap in
> vfio_pin/unpin_pages then making here replace always allowed?

That makes sense

Jason
Jason Gunthorpe Feb. 3, 2023, 3:03 p.m. UTC | #8
On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, February 2, 2023 3:05 PM
> > 
> > All drivers are already required to support changing between active
> > UNMANAGED domains when using their attach_dev ops.
> 
> All drivers which don't have *broken* UNMANAGED domain?

No, all drivers.. It has always been used by VFIO.

> > + */
> > +int iommu_group_replace_domain(struct iommu_group *group,
> > +			       struct iommu_domain *new_domain)
> 
> what actual value does 'replace' give us? It's just a wrapper of
> __iommu_group_set_domain() then calling it set_domain is
> probably clearer. You can clarify the 'replace' behavior in the
> comment.

As the APIs are setup:

attach demands that no domain is currently set (eg the domain must be blocking)

replace permits the domain to be currently set

'set' vs 'attach' is really unclear what the intended difference is.

We could also address this by simply removing the protection from
attach, but it is not so clear if that is safe for the few users.
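
(For reference, the "protection" in attach is roughly this check in
__iommu_attach_group() at the time of writing; the exact conditions vary
by kernel version. replace simply does not apply it:)

	if (group->domain && group->domain != group->default_domain &&
	    group->domain != group->blocking_domain)
		return -EBUSY;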

> > +{
> > +	int ret;
> > +
> > +	if (!new_domain)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&group->mutex);
> > +	ret = __iommu_group_set_domain(group, new_domain);
> > +	if (ret) {
> > +		if (__iommu_group_set_domain(group, group->domain))
> > +			__iommu_group_set_core_domain(group);
> > +	}
> 
> Can you elaborate the error handling here? Ideally if
> __iommu_group_set_domain() fails then group->domain shouldn't
> be changed. 

That isn't what it implements though. The internal helper leaves
things in a mess, it is for the caller to fix it, and it depends on
the caller what that means.

In this case the API cannot retain a hidden reference to the new
domain, so it must be purged, one way or another.

Jason
Jason Gunthorpe Feb. 3, 2023, 3:04 p.m. UTC | #9
On Fri, Feb 03, 2023 at 08:09:30AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, February 2, 2023 3:05 PM
> > 
> > QEMU with this feature should have the vIOMMU maintain a cache of the
> > guest io page table addresses and assign a unique IOAS to each unique
> > guest page table.
> 
> I didn't get why we impose such requirement to userspace.

I read this as implementation guidance for qemu. qemu can do what it
wants of course

Jason
Nicolin Chen Feb. 3, 2023, 5:53 p.m. UTC | #10
On Fri, Feb 03, 2023 at 11:03:20AM -0400, Jason Gunthorpe wrote:
> > > + */
> > > +int iommu_group_replace_domain(struct iommu_group *group,
> > > +			       struct iommu_domain *new_domain)
> > 
> > what actual value does 'replace' give us? It's just a wrapper of
> > __iommu_group_set_domain() then calling it set_domain is
> > probably clearer. You can clarify the 'replace' behavior in the
> > comment.
> 
> As the APIs are setup:
> 
> attach demands that no domain is currently set (eg the domain must be blocking)
> 
> replace permits the domain to be currently set
> 
> 'set' vs 'attach' is really unclear what the intended difference is.
> 
> We could also address this by simply removing the protection from
> attach, but it is not so clear if that is safe for the few users.

I can add a couple of lines to the commit message to make things
clear.

> > > +{
> > > +	int ret;
> > > +
> > > +	if (!new_domain)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&group->mutex);
> > > +	ret = __iommu_group_set_domain(group, new_domain);
> > > +	if (ret) {
> > > +		if (__iommu_group_set_domain(group, group->domain))
> > > +			__iommu_group_set_core_domain(group);
> > > +	}
> > 
> > Can you elaborate the error handling here? Ideally if
> > __iommu_group_set_domain() fails then group->domain shouldn't
> > be changed. 
> 
> That isn't what it implements though. The internal helper leaves
> things in a mess, it is for the caller to fix it, and it depends on
> the caller what that means.
> 
> In this case the API cannot retain a hidden reference to the new
> domain, so it must be purged, one way or another.

Considering it is a bit different than the existing APIs like
iommu_attach_group(), perhaps I should add a line of comments
in the fallback routine.

Thanks
Nic
Nicolin Chen Feb. 3, 2023, 10:25 p.m. UTC | #11
On Fri, Feb 03, 2023 at 08:31:52AM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 03, 2023 at 10:10:45AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Thursday, February 2, 2023 3:05 PM
> > > 
> > > Support an access->ioas replacement in iommufd_access_set_ioas(), which
> > > sets the access->ioas to NULL provisionally so that any further incoming
> > > iommufd_access_pin_pages() callback can be blocked.
> > > 
> > > Then, call access->ops->unmap() to clean up the entire iopt. To allow an
> > > iommufd_access_unpin_pages() callback to happen via this unmap() call,
> > > add an ioas_unpin pointer so the unpin routine won't be affected by the
> > > "access->ioas = NULL" trick above.
> > > 
> > > Also, a vdev without an ops->dma_unmap implementation cannot replace its
> > > access->ioas pointer. So add an iommufd_access_ioas_is_attached() helper
> > > to sanity that.
> > > 
> > 
> > Presumably a driver which doesn't implement ops->dma_unmap shouldn't
> > be allowed to do pin/unpin. But it could use vfio_dma_rw() to access an
> > iova range. In the latter case I don't see why replace cannot work.
> > 
> > Probably what's required here is to deny !ops->dma_unmap in
> > vfio_pin/unpin_pages then making here replace always allowed?
> 
> That makes sense

Will change that in v2.

Thanks
Nic
Tian, Kevin Feb. 6, 2023, 6:39 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 3, 2023 11:04 PM
> 
> On Fri, Feb 03, 2023 at 08:09:30AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Thursday, February 2, 2023 3:05 PM
> > >
> > > QEMU with this feature should have the vIOMMU maintain a cache of the
> > > guest io page table addresses and assign a unique IOAS to each unique
> > > guest page table.
> >
> > I didn't get why we impose such requirement to userspace.
> 
> I read this as implementation guidance for qemu. qemu can do what it
> wants of course
> 

sure but I still didn't get why this is guidance specific to the
new replace cmd...
Tian, Kevin Feb. 6, 2023, 6:57 a.m. UTC | #13
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, February 3, 2023 11:03 PM
> 
> On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Thursday, February 2, 2023 3:05 PM
> > >
> > > All drivers are already required to support changing between active
> > > UNMANAGED domains when using their attach_dev ops.
> >
> > All drivers which don't have *broken* UNMANAGED domain?
> 
> No, all drivers.. It has always been used by VFIO.

existing iommu_attach_group() doesn't support changing between
two UNMANAGED domains. only from default->unmanaged or
blocking->unmanaged.

Above statement is true only if this series is based on your unmerged
work removing DMA domain types.

> 
> > > +{
> > > +	int ret;
> > > +
> > > +	if (!new_domain)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&group->mutex);
> > > +	ret = __iommu_group_set_domain(group, new_domain);
> > > +	if (ret) {
> > > +		if (__iommu_group_set_domain(group, group->domain))
> > > +			__iommu_group_set_core_domain(group);
> > > +	}
> >
> > Can you elaborate the error handling here? Ideally if
> > __iommu_group_set_domain() fails then group->domain shouldn't
> > be changed.
> 
> That isn't what it implements though. The internal helper leaves
> things in a mess, it is for the caller to fix it, and it depends on
> the caller what that means.

I didn't see any warning of the mess and the caller's responsibility
in __iommu_group_set_domain(). Can it be documented clearly
so if someone wants to add a new caller on it he can clearly know
what to do?

and why doesn't iommu_attach_group() need to do anything
when an attach attempt fails? In the end it calls the same
iommu_group_do_attach_device() as __iommu_group_set_domain()
does.

btw looking at the code __iommu_group_set_domain():

	 * Note that this is called in error unwind paths, attaching to a
	 * domain that has already been attached cannot fail.
	 */
	ret = __iommu_group_for_each_dev(group, new_domain,
				iommu_group_do_attach_device);

with that we don't need to fall back to the core domain in the above
error unwinding, per this comment.

> 
> In this case the API cannot retain a hidden reference to the new
> domain, so it must be purged, one way or another.
> 

Could you elaborate where the hidden reference is retained?
Jason Gunthorpe Feb. 6, 2023, 1:25 p.m. UTC | #14
On Mon, Feb 06, 2023 at 06:57:35AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, February 3, 2023 11:03 PM
> > 
> > On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > >
> > > > All drivers are already required to support changing between active
> > > > UNMANAGED domains when using their attach_dev ops.
> > >
> > > All drivers which don't have *broken* UNMANAGED domain?
> > 
> > No, all drivers.. It has always been used by VFIO.
> 
> existing iommu_attach_group() doesn't support changing between
> two UNMANAGED domains. only from default->unmanaged or
> blocking->unmanaged.

Yes, but before we added the blocking domains VFIO was changing
between unmanaged domains. Blocking domains are so new that no driver
could have suddenly started to depend on this.

> > > Can you elaborate the error handling here? Ideally if
> > > __iommu_group_set_domain() fails then group->domain shouldn't
> > > be changed.
> > 
> > That isn't what it implements though. The internal helper leaves
> > things in a mess, it is for the caller to fix it, and it depends on
> > the caller what that means.
> 
> I didn't see any warning of the mess and the caller's responsibility
> in __iommu_group_set_domain(). Can it be documented clearly
> so if someone wants to add a new caller on it he can clearly know
> what to do?

That would be nice..
 
> and why doesn't iommu_attach_group() need to do anything
> when an attach attempt fails? In the end it calls the same
> iommu_group_do_attach_device() as __iommu_group_set_domain()
> does.

That's a bug for sure.

 
> btw looking at the code __iommu_group_set_domain():
> 
> 	 * Note that this is called in error unwind paths, attaching to a
> 	 * domain that has already been attached cannot fail.
> 	 */
> 	ret = __iommu_group_for_each_dev(group, new_domain,
> 				iommu_group_do_attach_device);
> 
> with that we don't need fall back to core domain in above error
> unwinding per this comment.

That does make some sense.

I tried to make a patch to consolidate all this error handling once,
that would be the better way to approach this.

> > In this case the API cannot retain a hidden reference to the new
> > domain, so it must be purged, one way or another.
> 
> Could you elaborate where the hidden reference is retained?

Inside the driver, it can keep track of the domain pointer if
attach_dev succeeds
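
For illustration, a made-up driver's attach_dev showing where such a
reference lives (the mydrv_* names are hypothetical):

static int mydrv_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	struct mydrv_master *master = dev_iommu_priv_get(dev);

	/* ... program the hardware for the new domain ... */
	master->domain = domain;	/* the pointer the driver keeps */
	return 0;
}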

Jason
Jason Gunthorpe Feb. 6, 2023, 1:26 p.m. UTC | #15
On Mon, Feb 06, 2023 at 06:39:29AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, February 3, 2023 11:04 PM
> > 
> > On Fri, Feb 03, 2023 at 08:09:30AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > >
> > > > QEMU with this feature should have the vIOMMU maintain a cache of the
> > > > guest io page table addresses and assign a unique IOAS to each unique
> > > > guest page table.
> > >
> > > I didn't get why we impose such requirement to userspace.
> > 
> > I read this as implementation guidance for qemu. qemu can do what it
> > wants of course
> > 
> 
> sure but I still didn't get why this is a guidance specific to the
> new replace cmd...

I think the guidance is about the change to the VFIO uAPI where it is now
possible to change the attached domain; previously that was not
possible

Jason
Tian, Kevin Feb. 7, 2023, 12:32 a.m. UTC | #16
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, February 6, 2023 9:25 PM
> 
> On Mon, Feb 06, 2023 at 06:57:35AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, February 3, 2023 11:03 PM
> > >
> > > On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > > >
> > > > > All drivers are already required to support changing between active
> > > > > UNMANAGED domains when using their attach_dev ops.
> > > >
> > > > All drivers which don't have *broken* UNMANAGED domain?
> > >
> > > No, all drivers.. It has always been used by VFIO.
> >
> > existing iommu_attach_group() doesn't support changing between
> > two UNMANAGED domains. only from default->unmanaged or
> > blocking->unmanaged.
> 
> Yes, but before we added the blocking domains VFIO was changing
> between unmanaged domains. Blocking domains are so new that no driver
> could have suddenly started to depend on this.

In legacy VFIO an unmanaged domain was 1:1 associated with a vfio
container. I didn't see how a group can switch between two
containers w/o going through a transition to/from the default
domain, i.e. detach from the 1st container and then attach to the 2nd.

> > btw looking at the code __iommu_group_set_domain():
> >
> > 	 * Note that this is called in error unwind paths, attaching to a
> > 	 * domain that has already been attached cannot fail.
> > 	 */
> > 	ret = __iommu_group_for_each_dev(group, new_domain,
> > 				iommu_group_do_attach_device);
> >
> > with that we don't need fall back to core domain in above error
> > unwinding per this comment.
> 
> That does make some sense.
> 
> I tried to make a patch to consolidate all this error handling once,
> that would be the better way to approach this.

that would be good.

> 
> > > In this case the API cannot retain a hidden reference to the new
> > > domain, so it must be purged, one way or another.
> >
> > Could you elaborate where the hidden reference is retained?
> 
> Inside the driver, it can keep track of the domain pointer if
> attach_dev succeeds
> 

Are you referring to the lack of error unwinding in __iommu_group_for_each_dev(),
i.e. if it fails, some devices may have had attach_dev succeed, so simply
recovering group->domain in __iommu_attach_group() is insufficient?
Tian, Kevin Feb. 7, 2023, 12:34 a.m. UTC | #17
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, February 6, 2023 9:26 PM
> 
> On Mon, Feb 06, 2023 at 06:39:29AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, February 3, 2023 11:04 PM
> > >
> > > On Fri, Feb 03, 2023 at 08:09:30AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > > >
> > > > > QEMU with this feature should have the vIOMMU maintain a cache of
> the
> > > > > guest io page table addresses and assign a unique IOAS to each
> unique
> > > > > guest page table.
> > > >
> > > > I didn't get why we impose such requirement to userspace.
> > >
> > > I read this as implementation guidance for qemu. qemu can do what it
> > > wants of course
> > >
> >
> > sure but I still didn't get why this is a guidance specific to the
> > new replace cmd...
> 
> I think the guidance is about the change to VFIO uAPI where it is now
> possible to change the domain attached, previously that was not
> possible
> 

that is fine. I just didn't get why the original description emphasized
the cache and unique IOAS aspects in Qemu.
Jason Gunthorpe Feb. 7, 2023, 12:22 p.m. UTC | #18
On Tue, Feb 07, 2023 at 12:32:50AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, February 6, 2023 9:25 PM
> > 
> > On Mon, Feb 06, 2023 at 06:57:35AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, February 3, 2023 11:03 PM
> > > >
> > > > On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > > > >
> > > > > > All drivers are already required to support changing between active
> > > > > > UNMANAGED domains when using their attach_dev ops.
> > > > >
> > > > > All drivers which don't have *broken* UNMANAGED domain?
> > > >
> > > > No, all drivers.. It has always been used by VFIO.
> > >
> > > existing iommu_attach_group() doesn't support changing between
> > > two UNMANAGED domains. only from default->unmanaged or
> > > blocking->unmanaged.
> > 
> > Yes, but before we added the blocking domains VFIO was changing
> > between unmanaged domains. Blocking domains are so new that no driver
> > could have suddenly started to depend on this.
> 
> In legacy VFIO unmanaged domain was 1:1 associated with vfio
> container. I didn't say how a group can switch between two
> containers w/o going through transition to/from the default
> domain, i.e. detach from 1st container and then attach to the 2nd.

Yes, in the past we went through the default domain which is basically
another unmanaged domain type. So unmanaged -> unmanaged is OK.

> > Inside the driver, it can keep track of the domain pointer if
> > attach_dev succeeds
> 
> Are you referring to no error unwinding in __iommu_group_for_each_dev()
> so if it is failed some devices may have attach_dev succeeds then simply
> recovering group->domain in __iommu_attach_group() is insufficient?

Yes

Jason
Nicolin Chen Feb. 7, 2023, 7:16 p.m. UTC | #19
On Mon, Feb 06, 2023 at 09:25:17AM -0400, Jason Gunthorpe wrote:

> > > > Can you elaborate the error handling here? Ideally if
> > > > __iommu_group_set_domain() fails then group->domain shouldn't
> > > > be changed.
> > > 
> > > That isn't what it implements though. The internal helper leaves
> > > things in a mess, it is for the caller to fix it, and it depends on
> > > the caller what that means.
> > 
> > I didn't see any warning of the mess and the caller's responsibility
> > in __iommu_group_set_domain(). Can it be documented clearly
> > so if someone wants to add a new caller on it he can clearly know
> > what to do?
> 
> That would be nice..

I'd expect the doc to come with some other patch/series than this
replace series, so I think we should be fine without adding a line
of comments in this patch?

> > btw looking at the code __iommu_group_set_domain():
> > 
> > 	 * Note that this is called in error unwind paths, attaching to a
> > 	 * domain that has already been attached cannot fail.
> > 	 */
> > 	ret = __iommu_group_for_each_dev(group, new_domain,
> > 				iommu_group_do_attach_device);
> > 
> > with that we don't need fall back to core domain in above error
> > unwinding per this comment.
> 
> That does make some sense.
> 
> I tried to make a patch to consolidate all this error handling once,
> that would be the better way to approach this.

Then, I'll drop the core-domain line. Combining my reply above:

+	mutex_lock(&group->mutex);
+	ret = __iommu_group_set_domain(group, new_domain);
+	if (ret)
+		__iommu_group_set_domain(group, group->domain);
+	mutex_unlock(&group->mutex);

Will wrap things up and send v2 today.

Thanks
Nic
Tian, Kevin Feb. 8, 2023, 4:25 a.m. UTC | #20
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 7, 2023 8:23 PM
> 
> On Tue, Feb 07, 2023 at 12:32:50AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, February 6, 2023 9:25 PM
> > >
> > > On Mon, Feb 06, 2023 at 06:57:35AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Friday, February 3, 2023 11:03 PM
> > > > >
> > > > > On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > > > > >
> > > > > > > All drivers are already required to support changing between
> active
> > > > > > > UNMANAGED domains when using their attach_dev ops.
> > > > > >
> > > > > > All drivers which don't have *broken* UNMANAGED domain?
> > > > >
> > > > > No, all drivers.. It has always been used by VFIO.
> > > >
> > > > existing iommu_attach_group() doesn't support changing between
> > > > two UNMANAGED domains. only from default->unmanaged or
> > > > blocking->unmanaged.
> > >
> > > Yes, but before we added the blocking domains VFIO was changing
> > > between unmanaged domains. Blocking domains are so new that no
> driver
> > > could have suddenly started to depend on this.
> >
> > In legacy VFIO unmanaged domain was 1:1 associated with vfio
> > container. I didn't say how a group can switch between two
> > containers w/o going through transition to/from the default
> > domain, i.e. detach from 1st container and then attach to the 2nd.
> 
> Yes, in the past we went through the default domain which is basically
> another unmanaged domain type. So unmanaged -> unmanaged is OK.
> 

it's right in concept but confusing in the current context whether the
DMA domain still has its own type. That's why I replied earlier that the
statement makes more sense after your patch cleaning up the domain types
is merged.
Jason Gunthorpe Feb. 8, 2023, 12:42 p.m. UTC | #21
On Wed, Feb 08, 2023 at 04:25:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, February 7, 2023 8:23 PM
> > 
> > On Tue, Feb 07, 2023 at 12:32:50AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Monday, February 6, 2023 9:25 PM
> > > >
> > > > On Mon, Feb 06, 2023 at 06:57:35AM +0000, Tian, Kevin wrote:
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Friday, February 3, 2023 11:03 PM
> > > > > >
> > > > > > On Fri, Feb 03, 2023 at 08:26:44AM +0000, Tian, Kevin wrote:
> > > > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > > > Sent: Thursday, February 2, 2023 3:05 PM
> > > > > > > >
> > > > > > > > All drivers are already required to support changing between
> > active
> > > > > > > > UNMANAGED domains when using their attach_dev ops.
> > > > > > >
> > > > > > > All drivers which don't have *broken* UNMANAGED domain?
> > > > > >
> > > > > > No, all drivers.. It has always been used by VFIO.
> > > > >
> > > > > existing iommu_attach_group() doesn't support changing between
> > > > > two UNMANAGED domains. only from default->unmanaged or
> > > > > blocking->unmanaged.
> > > >
> > > > Yes, but before we added the blocking domains VFIO was changing
> > > > between unmanaged domains. Blocking domains are so new that no
> > driver
> > > > could have suddenly started to depend on this.
> > >
> > > In legacy VFIO unmanaged domain was 1:1 associated with vfio
> > > container. I didn't say how a group can switch between two
> > > containers w/o going through transition to/from the default
> > > domain, i.e. detach from 1st container and then attach to the 2nd.
> > 
> > Yes, in the past we went through the default domain which is basically
> > another unmanaged domain type. So unmanaged -> unmanaged is OK.
> > 
> 
> it's right in concept but confusing in current context whether DMA
> domain still has its own type. That's why I replied earlier the statement
> makes more sense after your patch which cleans up the domain type
> is merged.

We are just reasoning about why the existing drivers are safe with
this - and my draft patch to remove DMA simply demonstrates that DMA
== UNMANAGED, and the above reasoning about VFIO in the past
demonstrates that this has historically been expected of drivers.

So I'm not so worried about patch order as long as things do work

Jason