[v7,1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

Message ID 20231117131816.24359-2-yi.l.liu@intel.com
State New
Series Add Intel VT-d nested translation (part 2/2)

Commit Message

Yi Liu Nov. 17, 2023, 1:18 p.m. UTC
This adds the data structure for flushing iotlb for the nested domain
allocated with IOMMU_HWPT_DATA_VTD_S1 type.

This only supports invalidating IOTLB, but no for device-TLB as device-TLB
invalidation will be covered automatically in the IOTLB invalidation if the
underlying IOMMU driver has enabled ATS for the affected device.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

Comments

Jason Gunthorpe Nov. 20, 2023, 11:04 p.m. UTC | #1
On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, November 17, 2023 9:18 PM
> > 
> > This adds the data structure for flushing iotlb for the nested domain
> > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > 
> > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> > invalidation will be covered automatically in the IOTLB invalidation if the
> > underlying IOMMU driver has enabled ATS for the affected device.
> 
> "no for device-TLB" is misleading. Here just say that cache invalidation
> request applies to both IOTLB and device TLB (if ATS is enabled ...)

I think we should forward the ATS invalidation from the guest too?
That is what ARM and AMD will have to do, can we keep them all
consistent?

I understand Intel keeps track of enough stuff to know what the RIDs
are, but is it necessary to make it different?

Jason
Tian, Kevin Nov. 21, 2023, 2:54 a.m. UTC | #2
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, November 21, 2023 7:05 AM
> 
> On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, November 17, 2023 9:18 PM
> > >
> > > This adds the data structure for flushing iotlb for the nested domain
> > > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
> > >
> > > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
> > > invalidation will be covered automatically in the IOTLB invalidation if the
> > > underlying IOMMU driver has enabled ATS for the affected device.
> >
> > "no for device-TLB" is misleading. Here just say that cache invalidation
> > request applies to both IOTLB and device TLB (if ATS is enabled ...)
> 
> I think we should forward the ATS invalidation from the guest too?
> That is what ARM and AMD will have to do, can we keep them all
> consistent?
> 
> I understand Intel keeps track of enough stuff to know what the RIDs
> are, but is it necessary to make it different?
> 

probably ask the other way: the intel-iommu driver always flushes the
iotlb and device tlb together, so is it necessary to separate them
in the uAPI for no good reason (except doubled syscalls)? :)

anyway this is a driver-specific contract. I don't see a need to keep
it consistent for all.
Baolu Lu Nov. 22, 2023, 2:32 a.m. UTC | #3
On 11/21/23 8:17 PM, Jason Gunthorpe wrote:
> On Tue, Nov 21, 2023 at 02:54:15AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Tuesday, November 21, 2023 7:05 AM
>>>
>>> On Mon, Nov 20, 2023 at 08:26:31AM +0000, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Friday, November 17, 2023 9:18 PM
>>>>>
>>>>> This adds the data structure for flushing iotlb for the nested domain
>>>>> allocated with IOMMU_HWPT_DATA_VTD_S1 type.
>>>>>
>>>>> This only supports invalidating IOTLB, but no for device-TLB as device-TLB
>>>>> invalidation will be covered automatically in the IOTLB invalidation if the
>>>>> underlying IOMMU driver has enabled ATS for the affected device.
>>>>
>>>> "no for device-TLB" is misleading. Here just say that cache invalidation
>>>> request applies to both IOTLB and device TLB (if ATS is enabled ...)
>>>
>>> I think we should forward the ATS invalidation from the guest too?
>>> That is what ARM and AMD will have to do, can we keep them all
>>> consistent?
>>>
>>> I understand Intel keeps track of enough stuff to know what the RIDs
>>> are, but is it necessary to make it different?
>>
>> probably ask the other way. Now intel-iommu driver always flushes
>> iotlb and device tlb together then is it necessary to separate them
>> in uAPI for no good (except doubled syscalls)? :)
> 
> I wish I knew more about Intel CC design to be able to answer that :|
> 
> Doesn't the VM issue the ATC flush command regardless? How does it
> know it has a working ATC but does not need to flush it?
> 

The Intel VT-d spec doesn't require the driver to flush iotlb and device
tlb together. Therefore, the current approach of relying on caching mode
to determine whether device TLB invalidation is necessary appears to be
a performance optimization rather than an architectural requirement.

The vIOMMU driver assumes that it is running within a VM guest when
caching mode is enabled. This assumption leads to an omission of device
TLB invalidation, relying on the hypervisor to perform a combined flush
of the IOTLB and device TLB.

While this optimization aims to reduce VMEXIT overhead, it introduces
potential issues:

- When a Linux guest is running on a hypervisor other than KVM/QEMU, the
   assumption of combined IOTLB and device TLB flushing by the hypervisor
   may be incorrect, potentially leading to missed device TLB
   invalidation.

- The caching mode doesn't apply to first-stage translation. Therefore,
   if the driver uses first-stage translation and still relies on caching
   mode to determine device TLB invalidation, the optimization fails.

A more reasonable optimization would be to allocate a bit in the iommu
capability registers. The vIOMMU driver could then leverage this bit to
determine whether it could eliminate a device invalidation request.
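
The capability-bit idea above could look roughly like the sketch below on the
guest side. This is purely illustrative: VTD_ECAP_HYP_COMBINED_FLUSH is an
invented name for the hypothetical bit, not anything defined by the VT-d spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bit: the hypervisor advertises that a trapped
 * IOTLB invalidation also flushes the device TLB of attached devices. */
#define VTD_ECAP_HYP_COMBINED_FLUSH (1ULL << 60)

/* Decide whether the guest driver must issue its own device-TLB
 * invalidation after invalidating the IOTLB. */
static bool need_devtlb_flush(uint64_t ecap, bool ats_enabled)
{
	if (!ats_enabled)
		return false; /* no device TLB to flush at all */
	/* Skip the flush only when the hypervisor explicitly claims
	 * combined flushing; never infer it from caching mode. */
	return !(ecap & VTD_ECAP_HYP_COMBINED_FLUSH);
}
```

With such a bit the decision no longer depends on caching mode, so it would
also hold for first-stage translation and for hypervisors other than KVM/QEMU.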

Best regards,
baolu
Jason Gunthorpe Nov. 24, 2023, 1:46 p.m. UTC | #4
On Fri, Nov 24, 2023 at 03:00:45AM +0000, Tian, Kevin wrote:

> > I'm fully expecting that Intel will adopt an direct-DMA flush queue
> > like SMMU and AMD have already done as a performance optimization. In
> > this world it makes no sense that the behavior of the direct DMA queue
> > and driver mediated queue would be different.
> > 
> 
> that's an orthogonal topic. I don't think the value of the direct-DMA flush
> queue should prevent possible optimization in the mediation path
> (as long as guest-expected deterministic behavior is sustained).

Okay, well as long as the guest is generating the ATC invalidations we
can always make an iommufd API flag to include or exclude the ATC
invalidation when doing the ASID invalidation. So we aren't trapped.

Jason
Yi Liu Dec. 14, 2023, 11:26 a.m. UTC | #5
On 2023/11/17 21:18, Yi Liu wrote:
 > This adds the data structure for flushing iotlb for the nested domain
 > allocated with IOMMU_HWPT_DATA_VTD_S1 type.
 >
 > This only supports invalidating IOTLB, but no for device-TLB as device-TLB
 > invalidation will be covered automatically in the IOTLB invalidation if the
 > underlying IOMMU driver has enabled ATS for the affected device.
 >
 > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
 > ---
 >   include/uapi/linux/iommufd.h | 36 ++++++++++++++++++++++++++++++++++++
 >   1 file changed, 36 insertions(+)
 >
 > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
 > index 7f92cecc87d7..cafd98642abf 100644
 > --- a/include/uapi/linux/iommufd.h
 > +++ b/include/uapi/linux/iommufd.h
 > @@ -614,6 +614,42 @@ struct iommu_hwpt_get_dirty_bitmap {
 >   #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
 >   					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
 >
 > +/**
 > + * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
 > + *                                           stage-1 cache invalidation
 > + * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
 > + *                            leaf PTE caching needs to be invalidated
 > + *                            and other paging structure caches can be
 > + *                            preserved.
 > + */
 > +enum iommu_hwpt_vtd_s1_invalidate_flags {
 > +	IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
 > +};
 > +
 > +/**
 > + * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
 > + *                                       (IOMMU_HWPT_DATA_VTD_S1)
 > + * @addr: The start address of the addresses to be invalidated. It needs
 > + *        to be 4KB aligned.
 > + * @npages: Number of contiguous 4K pages to be invalidated.
 > + * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
 > + * @__reserved: Must be 0
 > + *
 > + * The Intel VT-d specific invalidation data for user-managed stage-1 cache
 > + * invalidation in nested translation. Userspace uses this structure to
 > + * tell the impacted cache scope after modifying the stage-1 page table.
 > + *
 > + * Invalidating all the caches related to the page table by setting @addr
 > + * to be 0 and @npages to be __aligned_u64(-1). This includes the
 > + * corresponding device-TLB if ATS is enabled on the attached devices.
 > + */
 > +struct iommu_hwpt_vtd_s1_invalidate {
 > +	__aligned_u64 addr;
 > +	__aligned_u64 npages;
 > +	__u32 flags;
 > +	__u32 __reserved;
 > +};
 > +
 >   /**
 >    * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
 >    * @size: sizeof(struct iommu_hwpt_invalidate)
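
To make the intended usage of the quoted structure concrete, userspace might
build the driver data roughly as below. This is only a sketch: the structure
is re-declared locally with stdint types, and the enclosing struct
iommu_hwpt_invalidate / ioctl plumbing is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the proposed uAPI structure (re-declared for illustration). */
struct iommu_hwpt_vtd_s1_invalidate {
	uint64_t addr;
	uint64_t npages;
	uint32_t flags;
	uint32_t __reserved;
};

#define IOMMU_VTD_INV_FLAGS_LEAF (1u << 0)

/* Full-cache invalidation: addr == 0 and npages == (u64)-1, per the
 * kernel-doc convention quoted above. */
static struct iommu_hwpt_vtd_s1_invalidate inv_all(void)
{
	struct iommu_hwpt_vtd_s1_invalidate inv = {
		.addr = 0,
		.npages = UINT64_MAX,
		.flags = 0,
	};
	return inv;
}

/* Ranged, leaf-only invalidation; addr must be 4KiB aligned and the
 * byte length is rounded up to whole 4K pages. */
static struct iommu_hwpt_vtd_s1_invalidate inv_range(uint64_t addr,
						     uint64_t bytes)
{
	struct iommu_hwpt_vtd_s1_invalidate inv = {
		.addr = addr,
		.npages = (bytes + 0xfffull) >> 12,
		.flags = IOMMU_VTD_INV_FLAGS_LEAF,
	};
	assert((addr & 0xfffull) == 0);
	return inv;
}
```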

Hi Jason, Kevin,

Per the prior discussion[1], we agreed to move the error reporting into the
driver specific part. On Intel side, we want to report two devTLB
invalidation errors: ICE (invalid completion error) and ITE (invalidation
timeout error). Such errors have an additional SID information to tell
which device failed the devTLB invalidation. I've got the below structure.

+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *                                       (IOMMU_HWPT_DATA_VTD_S1)
+ * @addr: The start address of the addresses to be invalidated. It needs
+ *        to be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0.
+ * @error: One of enum iommu_hwpt_vtd_s1_invalidate_error_code
+ * @dev_id: The device in the invalidation completion message. It is
+ *          meaningful when @error is set to
+ *          IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ICE or
+ *          IOMMU_HWPT_VTD_S1_INVALIDATE_DEVTLB_ITE.
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * Invalidating all the caches related to the page table by setting @addr
+ * to be 0 and @npages to be U64_MAX.
+ *
+ * @error is meaningful only if the request is handled by the kernel. This
+ * can be known by checking struct iommu_hwpt_invalidate::req_num output.
+ * @error only covers the errors detected by hardware after submitting
+ * the invalidation. Software-detected errors go through the normal
+ * ioctl errno.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+	__aligned_u64 addr;
+	__aligned_u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+	__u32 error;
+	__u32 dev_id;
+};

dev_id is used to report the failed device; userspace should be able to map
it to a vRID and inject it into the VM as part of the ITE/ICE error.
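
On the VMM side, that dev_id -> vRID translation could be as simple as a
lookup over the devices the VMM has bound (an illustrative sketch; the names
are invented and no real QEMU interface is implied):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per vfio device the VMM has bound: the iommufd dev_id it
 * received at bind time, and the virtual RID it exposed to the guest. */
struct vdev_map {
	uint32_t dev_id; /* iommufd device ID */
	uint16_t vrid;   /* virtual requester ID (bus:dev.fn) */
};

/* Look up the vRID for a dev_id reported in the invalidation error so
 * an ITE/ICE fault record carrying SID == vrid can be injected. */
static int dev_id_to_vrid(const struct vdev_map *map, size_t n,
			  uint32_t dev_id, uint16_t *vrid)
{
	for (size_t i = 0; i < n; i++) {
		if (map[i].dev_id == dev_id) {
			*vrid = map[i].vrid;
			return 0;
		}
	}
	return -1; /* unknown device: inject the error without a SID */
}
```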

However, I ran into a problem when trying to get the dev_id in the cache
invalidation path, since this field is filled in the intel iommu driver and
there seems to be no good way to do it there. I have the below alternatives
to move forward; please have a look.

- Reporting pSID instead of dev_id. This may not work if userspace, for
example Qemu, gets a vfio device cdev fd from the management stack. Maybe
you have a different opinion, do let me know.

- Let iommufd convert a SID or device pointer to a dev_id, and then
report it back to userspace. This seems easiest, but breaks layering and
also requires vt-d specific logic. :(

- Reuse Nicolin's vRID->pRID mapping. If the vRID->pRID mapping is
maintained, then the intel iommu driver can report a vRID back to userspace.
But the intel iommu driver does not have a viommu context, so there is no
place to hold the vRID->pRID mapping. TBH, it may require reasons other
than the error reporting need to introduce it. Anyhow, this requires more
thinking and also has dependencies even if it is doable on the Intel side.

- Only report the error code, with no device info at first. We may add the
device info (dev_id or vRID) in a future series. In reality, the existing
Qemu vIOMMU reports neither ICE/ITE errors nor the SID to the VM. Also, the
VT-d spec first defined the ICE/ITE errors in the 2007 1.1 spec, and added
the SID info later in the 2019 3.1 spec. We may do it in stages as well.

What is your opinion?

[1] 
https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-3ec52e5bc81a@intel.com/
Tian, Kevin Dec. 15, 2023, 1:50 a.m. UTC | #6
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 14, 2023 7:27 PM
> 
> On 2023/11/17 21:18, Yi Liu wrote:
> > This adds the data structure for flushing iotlb for the nested domain
> 
> +struct iommu_hwpt_vtd_s1_invalidate {
> +	__aligned_u64 addr;
> +	__aligned_u64 npages;
> +	__u32 flags;
> +	__u32 __reserved;
> +	__u32 error;
> +	__u32 dev_id;
> +};
> 
> dev_id is used to report the failed device, userspace should be able to map
> it to a vRID, and inject it to VM as part of ITE/ICE error.
> 
> However, I got a problem when trying to get dev_id in cache invalidation
> path, since this is filled in intel iommu driver. Seems like there is no
> good way for it. I've below alternatives to move forward, wish you have
> a look.
> 
> - Reporting pSID instead of dev_id. This may not work if userspace for
> example Qemu cen get a vfio device cdev fd from management stack. Maybe
> you
> have different opinion, do let me know.

yes, there is no guarantee that pRID is always visible to the user.

> 
> - Let iommufd to convert a SID info or device pointer to a dev_id, and then
> report it back to userspace. This seems easiest, but breaks layer and also
> requires vt-d specific logic. :(

yes, the current philosophy of iommufd is to keep driver-specific
knowledge out of iommufd.

> 
> - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> maintained, then intel iommu can report a vRID back to user. But intel
> iommu driver does not have viommu context, no place to hold the vRID-
> >pRID
> mapping. TBH. It may require other reasons to introduce it other than the
> error reporting need. Anyhow, this requires more thinking and also has
> dependency even if it is doable in intel side.

this sounds like a cleaner way to inject knowledge which iommu driver
requires to find out the user tag. but yes it's a bit weird to introduce
viommu awareness in intel iommu driver when there is no such thing
in real hardware.

and for this error reporting case what we actually require is the
reverse map i.e. pRID->vRID. Not sure whether we can leverage the
same RID mapping uAPI as for ARM/AMD but ignore viommu_id
and then store vRID under device_domain_info. a bit tricky on
life cycle management and also incompatible with SIOV...

let's see whether Jason has a better idea here.

> 
> - Only report error code, but no device info at first. May adding the
> device info (dev_id or vRID) in future series. In reality, the existing
> Qemu vIOMMU does not report ICE, ITE, neither the SID to VM. Also, VT-d

and IOAS_UNMAP doesn't provide such ATS error either.

> spec defined the ICE/ITE errors first in 2007 spec 1.1, and added SID info
> later in 2019 spec 3.1. We may do it in stage as well.

and it's not tied to a specific iommu version. the spec is stated in
a way that software treats a zero value in SID as no hw support, so
theoretically even a modern hw may not report SID for certain
reasons.

> 
> What about your opinion?
> 
> [1]
> https://lore.kernel.org/linux-iommu/a9699f71-805a-4a5a-9282-
> 3ec52e5bc81a@intel.com/

I'm fine with this staged approach given the spec allows this
behavior and no vIOMMU properly emulates ITE/ICE today. 

let's work out a new version w/o dev info and make sure it's
in a good state first. Then let Jason decide next week whether
he wants to take it for this cycle or not.
Nicolin Chen Dec. 15, 2023, 2:28 a.m. UTC | #7
On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, December 14, 2023 7:27 PM
> >
> > On 2023/11/17 21:18, Yi Liu wrote:
> > > This adds the data structure for flushing iotlb for the nested domain
> >
> > +struct iommu_hwpt_vtd_s1_invalidate {
> > +     __aligned_u64 addr;
> > +     __aligned_u64 npages;
> > +     __u32 flags;
> > +     __u32 __reserved;
> > +     __u32 error;
> > +     __u32 dev_id;
> > +};
> >
> > dev_id is used to report the failed device, userspace should be able to map
> > it to a vRID, and inject it to VM as part of ITE/ICE error.
> >
> > However, I got a problem when trying to get dev_id in cache invalidation
> > path, since this is filled in intel iommu driver. Seems like there is no
> > good way for it. I've below alternatives to move forward, wish you have
> > a look.

> >
> > - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> > maintained, then intel iommu can report a vRID back to user. But intel
> > iommu driver does not have viommu context, no place to hold the vRID-
> > >pRID
> > mapping. TBH. It may require other reasons to introduce it other than the
> > error reporting need. Anyhow, this requires more thinking and also has
> > dependency even if it is doable in intel side.
> 
> this sounds like a cleaner way to inject knowledge which iommu driver
> requires to find out the user tag. but yes it's a bit weird to introduce
> viommu awareness in intel iommu driver when there is no such thing
> in real hardware.

I think a viommu is defined more like a software object representing
the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case,
there could be an object for it too?

> and for this error reporting case what we actually require is the
> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> and then store vRID under device_domain_info. a bit tricky on
> life cycle management and also incompatible with SIOV...

One thing that I am not very clear here: since both vRID and dev_id
are given by the VMM, shouldn't it already know the mapping if the
point is to translate (pRID->)dev_id->vRID?

Thanks
Nicolin
Tian, Kevin Dec. 15, 2023, 3:04 a.m. UTC | #8
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, December 15, 2023 10:28 AM
> 
> On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Thursday, December 14, 2023 7:27 PM
> > >
> > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > This adds the data structure for flushing iotlb for the nested domain
> > >
> > > +struct iommu_hwpt_vtd_s1_invalidate {
> > > +     __aligned_u64 addr;
> > > +     __aligned_u64 npages;
> > > +     __u32 flags;
> > > +     __u32 __reserved;
> > > +     __u32 error;
> > > +     __u32 dev_id;
> > > +};
> > >
> > > dev_id is used to report the failed device, userspace should be able to
> map
> > > it to a vRID, and inject it to VM as part of ITE/ICE error.
> > >
> > > However, I got a problem when trying to get dev_id in cache invalidation
> > > path, since this is filled in intel iommu driver. Seems like there is no
> > > good way for it. I've below alternatives to move forward, wish you have
> > > a look.
> 
> > >
> > > - Reuse Nicolin's vRID->pRID mapping. If thevRID->pRID mapping is
> > > maintained, then intel iommu can report a vRID back to user. But intel
> > > iommu driver does not have viommu context, no place to hold the vRID-
> > > >pRID
> > > mapping. TBH. It may require other reasons to introduce it other than the
> > > error reporting need. Anyhow, this requires more thinking and also has
> > > dependency even if it is doable in intel side.
> >
> > this sounds like a cleaner way to inject knowledge which iommu driver
> > requires to find out the user tag. but yes it's a bit weird to introduce
> > viommu awareness in intel iommu driver when there is no such thing
> > in real hardware.
> 
> I think a viommu is defined more like a software object representing
> the virtual IOMMU in a VM. Since VT-d has a vIOMMU in a nesting case,
> there could be an object for it too?

for VT-d it's not necessary to maintain such vIOMMU awareness in
the kernel (before this error reporting case) given its interfaces are
simply around hwpt's. there is no vIOMMU-scope operation provided
by intel-iommu driver so far.

> 
> > and for this error reporting case what we actually require is the
> > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > and then store vRID under device_domain_info. a bit tricky on
> > life cycle management and also incompatible with SIOV...
> 
> One thing that I am not very clear here: since both vRID and dev_id
> are given by the VMM, shouldn't it already know the mapping if the
> point is to translate (pRID->)dev_id->vRID?
> 

it's true for current Qemu.

but there is a plan to support Qemu accepting an fd passed by Libvirt.
In that case Qemu doesn't even see the sysfs path, hence it is not
aware of the pRID. otherwise yes we could leave the translation to
the VMM instead.
Nicolin Chen Dec. 15, 2023, 3:32 a.m. UTC | #9
On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, December 15, 2023 10:28 AM
> > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > >
> > > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > > This adds the data structure for flushing iotlb for the nested domain
> > > >
> > > > +struct iommu_hwpt_vtd_s1_invalidate {
> > > > +     __aligned_u64 addr;
> > > > +     __aligned_u64 npages;
> > > > +     __u32 flags;
> > > > +     __u32 __reserved;
> > > > +     __u32 error;
> > > > +     __u32 dev_id;
> > > > +};
> > > >
> > > > dev_id is used to report the failed device, userspace should be able to
> > map
> > > > it to a vRID, and inject it to VM as part of ITE/ICE error.

> > > and for this error reporting case what we actually require is the
> > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > and then store vRID under device_domain_info. a bit tricky on
> > > life cycle management and also incompatible with SIOV...
> >
> > One thing that I am not very clear here: since both vRID and dev_id
> > are given by the VMM, shouldn't it already know the mapping if the
> > point is to translate (pRID->)dev_id->vRID?
> >
> 
> it's true for current Qemu.
> 
> but there is plan to support Qemu accepting a fd passed by Libvirt.
> In that case Qemu even doesn't see the sysfs path hence is not
> aware of pRID. otherwise yes we could leave the translation to
> VMM instead.

I think I misread Yi's narrative: dev_id is a working approach
for VMM to convert to a vRID, while he is asking for a better
alternative :)
Yi Liu Dec. 15, 2023, 4:01 a.m. UTC | #10
On 2023/12/15 11:32, Nicolin Chen wrote:
> On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Sent: Friday, December 15, 2023 10:28 AM
>>> On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Thursday, December 14, 2023 7:27 PM
>>>>>
>>>>> On 2023/11/17 21:18, Yi Liu wrote:
>>>>>> This adds the data structure for flushing iotlb for the nested domain
>>>>>
>>>>> +struct iommu_hwpt_vtd_s1_invalidate {
>>>>> +     __aligned_u64 addr;
>>>>> +     __aligned_u64 npages;
>>>>> +     __u32 flags;
>>>>> +     __u32 __reserved;
>>>>> +     __u32 error;
>>>>> +     __u32 dev_id;
>>>>> +};
>>>>>
>>>>> dev_id is used to report the failed device, userspace should be able to
>>> map
>>>>> it to a vRID, and inject it to VM as part of ITE/ICE error.
> 
>>>> and for this error reporting case what we actually require is the
>>>> reverse map i.e. pRID->vRID. Not sure whether we can leverage the
>>>> same RID mapping uAPI as for ARM/AMD but ignore viommu_id
>>>> and then store vRID under device_domain_info. a bit tricky on
>>>> life cycle management and also incompatible with SIOV...
>>>
>>> One thing that I am not very clear here: since both vRID and dev_id
>>> are given by the VMM, shouldn't it already know the mapping if the
>>> point is to translate (pRID->)dev_id->vRID?
>>>
>>
>> it's true for current Qemu.
>>
>> but there is plan to support Qemu accepting a fd passed by Libvirt.
>> In that case Qemu even doesn't see the sysfs path hence is not
>> aware of pRID. otherwise yes we could leave the translation to
>> VMM instead.
> 
> I think I misread Yi's narrative: dev_id is a working approach
> for VMM to convert to a vRID, while he is asking for a better
> alternative :)

In concept, dev_id works, but in reality we have a problem getting a dev_id
for a given device in the intel iommu driver, hence I'm asking for help here. :)
Nicolin Chen Dec. 16, 2023, 6:49 p.m. UTC | #11
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> On 2023/12/15 11:32, Nicolin Chen wrote:
> > On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Friday, December 15, 2023 10:28 AM
> > > > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > > > > 
> > > > > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > > and for this error reporting case what we actually require is the
> > > > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > > > and then store vRID under device_domain_info. a bit tricky on
> > > > > life cycle management and also incompatible with SIOV...
> > > > 
> > > > One thing that I am not very clear here: since both vRID and dev_id
> > > > are given by the VMM, shouldn't it already know the mapping if the
> > > > point is to translate (pRID->)dev_id->vRID?
> > > > 
> > > 
> > > it's true for current Qemu.
> > > 
> > > but there is plan to support Qemu accepting a fd passed by Libvirt.
> > > In that case Qemu even doesn't see the sysfs path hence is not
> > > aware of pRID. otherwise yes we could leave the translation to
> > > VMM instead.
> > 
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
> 
> In concept, dev_id works, but in reality we have problem to get a dev_id
> for a given device in intel iommu driver, hence I'm asking for help here. :)

Yea, I got that.

Would it be possible for us to postpone this error reporting in the
vtd driver?

Jason is taking vacation, so he is unlikely to be very active until
the new year, although he would probably spare some time to take
the cache_invalidate series once it's mature.

If the final solution is to use pRID<->vRID mappings for vtd too,
we'd likely need the viommu/dev_set_virtual_id series that I am
still working on, which certainly won't make it to this cycle.

Thanks
Nic
Tian, Kevin Dec. 17, 2023, 11:28 p.m. UTC | #12
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Sunday, December 17, 2023 2:49 AM
> 
> On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > On 2023/12/15 11:32, Nicolin Chen wrote:
> > > On Fri, Dec 15, 2023 at 03:04:44AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Friday, December 15, 2023 10:28 AM
> > > > > On Fri, Dec 15, 2023 at 01:50:07AM +0000, Tian, Kevin wrote:
> > > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Sent: Thursday, December 14, 2023 7:27 PM
> > > > > > >
> > > > > > > On 2023/11/17 21:18, Yi Liu wrote:
> > > > > > and for this error reporting case what we actually require is the
> > > > > > reverse map i.e. pRID->vRID. Not sure whether we can leverage the
> > > > > > same RID mapping uAPI as for ARM/AMD but ignore viommu_id
> > > > > > and then store vRID under device_domain_info. a bit tricky on
> > > > > > life cycle management and also incompatible with SIOV...
> > > > >
> > > > > One thing that I am not very clear here: since both vRID and dev_id
> > > > > are given by the VMM, shouldn't it already know the mapping if the
> > > > > point is to translate (pRID->)dev_id->vRID?
> > > > >
> > > >
> > > > it's true for current Qemu.
> > > >
> > > > but there is plan to support Qemu accepting a fd passed by Libvirt.
> > > > In that case Qemu even doesn't see the sysfs path hence is not
> > > > aware of pRID. otherwise yes we could leave the translation to
> > > > VMM instead.
> > >
> > > I think I misread Yi's narrative: dev_id is a working approach
> > > for VMM to convert to a vRID, while he is asking for a better
> > > alternative :)
> >
> > In concept, dev_id works, but in reality we have problem to get a dev_id
> > for a given device in intel iommu driver, hence I'm asking for help here. :)
> 
> Yea, I got that.
> 
> Would it be possible for us to postpone this error report in the
> vtd driver?
> 

that is the plan. We'll still report the error code but no dev_id error
data. this is allowed by the vtd spec.
Jason Gunthorpe Jan. 2, 2024, 11:38 p.m. UTC | #13
On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > I think I misread Yi's narrative: dev_id is a working approach
> > for VMM to convert to a vRID, while he is asking for a better
> > alternative :)
> 
> In concept, dev_id works, but in reality we have problem to get a dev_id
> for a given device in intel iommu driver, hence I'm asking for help here. :)

I think we just need to solve this one way or another. Even if you
use a viommu object you still end up having difficult coupling to
iommufd.

Some:
  iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)

Callable by a driver (using the driver-callable function
infrastructure we made for dirty tracking) is really all that is
needed here.
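
A user-space model of what such a helper could do is sketched below. This is
hypothetical: iommufd_get_dev_id() does not exist at this point in the
thread, and a real implementation would look the device up in iommufd's
object tables rather than a fixed array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct device; /* opaque stand-in for the kernel's struct device */

/* Minimal model of an iommufd context tracking bound devices. */
struct iommufd_ctx {
	struct {
		struct device *dev;
		uint32_t id; /* user-visible iommufd device object ID */
	} devs[8];
	size_t ndevs;
};

/* Model of the proposed driver-callable helper: translate a struct
 * device held by the IOMMU driver into the user-visible dev_id.
 * Returns 0 when the device is not bound to this context. */
static uint32_t iommufd_get_dev_id(struct iommufd_ctx *ictx,
				   struct device *dev)
{
	for (size_t i = 0; i < ictx->ndevs; i++)
		if (ictx->devs[i].dev == dev)
			return ictx->devs[i].id;
	return 0;
}
```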

Jason
Yi Liu Jan. 3, 2024, 2:24 a.m. UTC | #14
On 2024/1/3 07:38, Jason Gunthorpe wrote:
> On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
>>> I think I misread Yi's narrative: dev_id is a working approach
>>> for VMM to convert to a vRID, while he is asking for a better
>>> alternative :)
>>
>> In concept, dev_id works, but in reality we have problem to get a dev_id
>> for a given device in intel iommu driver, hence I'm asking for help here. :)
> 
> I think we just need to solve this one way or another.. Even if you
> use a viommu object you still end up having difficult coupling to
> iommufd
> 
> Some:
>    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> 
> Callable by a driver (using the driver-callable function
> infrastructure we made for dirty tracking) Is really all that is
> needed here.

yep, I noticed IOMMUFD_DRIVER is selected by the intel iommu driver when
IOMMUFD is configured. Maybe such an API could be compiled when
IOMMUFD_DRIVER is enabled as well. This does address my concern about
making the intel iommu driver depend on iommufd. But we still need a way to
pass the iommufd_ctx pointer into the intel iommu driver and store it, so
that the intel iommu driver could call the above iommufd_get_dev_id(). One
possible way is passing it when attaching a device to a domain and clearing
it in detach. However, this seems not ideal, as the iommufd_ctx information
should be static between bind_iommufd and unbind, but we don't call into
the intel iommu driver in the bind and unbind operations. We may need to
add a new iommu op. Any suggestions here?
Jason Gunthorpe Jan. 3, 2024, 4:01 p.m. UTC | #15
On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
> On 2024/1/3 07:38, Jason Gunthorpe wrote:
> > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > > > I think I misread Yi's narrative: dev_id is a working approach
> > > > for VMM to convert to a vRID, while he is asking for a better
> > > > alternative :)
> > > 
> > > In concept, dev_id works, but in reality we have problem to get a dev_id
> > > for a given device in intel iommu driver, hence I'm asking for help here. :)
> > 
> > I think we just need to solve this one way or another.. Even if you
> > use a viommu object you still end up having difficult coupling to
> > iommufd
> > 
> > Some:
> >    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> > 
> > Callable by a driver (using the driver-callable function
> > infrastructure we made for dirty tracking) Is really all that is
> > needed here.
> 
> yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when
> IOMMUFD is configed. Maybe such an API could be compiled when
> IOMMUFD_DRIVER is enabled as well. This does address my concern on making
> intel iommu driver depending on iommufd. But still need a way to pass in
> the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel
> iommu driver could call the above iommufd_get_dev_id(). One possible way is
> passing it when attaching device to domain and clear it in detach. However,
> this seems not ideal as iommufd_ctx information should be static between
> bind_iommufd and unbind. But we don't call into intel iommu driver in the
> bind and unbind operations. May need to add new iommu op. Any suggestion
> here?

You can pass the ctx to the invalidate op, it is already implied
because the passed iommu_domain is linked to a single iommufd ctx.

Jason
Nicolin Chen Jan. 3, 2024, 4:48 p.m. UTC | #16
On Wed, Jan 03, 2024 at 12:01:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 10:24:42AM +0800, Yi Liu wrote:
> > On 2024/1/3 07:38, Jason Gunthorpe wrote:
> > > On Fri, Dec 15, 2023 at 12:01:19PM +0800, Yi Liu wrote:
> > > > > I think I misread Yi's narrative: dev_id is a working approach
> > > > > for VMM to convert to a vRID, while he is asking for a better
> > > > > alternative :)
> > > > 
> > > > In concept, dev_id works, but in reality we have problem to get a dev_id
> > > > for a given device in intel iommu driver, hence I'm asking for help here. :)
> > > 
> > > I think we just need to solve this one way or another.. Even if you
> > > use a viommu object you still end up having difficult coupling to
> > > iommufd
> > > 
> > > Some:
> > >    iommufd_get_dev_id(struct iommufd_ctx *ictx, struct device *dev)
> > > 
> > > Callable by a driver (using the driver-callable function
> > > infrastructure we made for dirty tracking) Is really all that is
> > > needed here.
> > 
> > yep, I noticed IOMMUFD_DRIVER was selected by intel iommu driver when
> > IOMMUFD is configed. Maybe such an API could be compiled when
> > IOMMUFD_DRIVER is enabled as well. This does address my concern on making
> > intel iommu driver depending on iommufd. But still need a way to pass in
> > the iommufd_ctx pointer to intel iommu driver, and store it. Hence intel
> > iommu driver could call the above iommufd_get_dev_id(). One possible way is
> > passing it when attaching device to domain and clear it in detach. However,
> > this seems not ideal as iommufd_ctx information should be static between
> > bind_iommufd and unbind. But we don't call into intel iommu driver in the
> > bind and unbind operations. May need to add new iommu op. Any suggestion
> > here?
> 
> You can pass the ctx to the invalidate op, it is already implied
> because the passed iommu_domain is linked to a single iommufd ctx.

The device virtual id lookup API needs something similar, yet it
likely needs a viommu pointer (or its id) instead? As the table
is attached to a viommu while an ictx can have multiple viommus,
right?

Thanks
Nic
Jason Gunthorpe Jan. 3, 2024, 4:58 p.m. UTC | #17
On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > You can pass the ctx to the invalidate op, it is already implied
> > because the passed iommu_domain is linked to a single iommufd ctx.
> 
> The device virtual id lookup API needs something similar, yet it
> likely needs a viommu pointer (or its id) instead? As the table
> is attached to a viommu while an ictx can have multiple viommus,
> right?

Yes, when we get to an API for that it will have to be some op
'invalidate_viommu(..)' and it can get the necessary pointers.

The viommu object will have to be some driver object like the
iommu_domain.

Jason
Nicolin Chen Jan. 3, 2024, 5:06 p.m. UTC | #18
On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > You can pass the ctx to the invalidate op, it is already implied
> > > because the passed iommu_domain is linked to a single iommufd ctx.
> > 
> > The device virtual id lookup API needs something similar, yet it
> > likely needs a viommu pointer (or its id) instead? As the table
> > is attached to a viommu while an ictx can have multiple viommus,
> > right?
> 
> Yes, when we get to an API for that it will have to be some op
> 'invalidate_viommu(..)' and it can get the necessary pointers.

OK! I will try that first.

> The viommu object will have to be some driver object like the
> iommu_domain.

I drafted something like this, linking it to struct iommu_device:

+struct iommufd_viommu {
+       struct iommufd_object obj;
+       struct iommufd_ctx *ictx;
+       struct iommu_device *iommu_dev;
+       struct iommufd_hwpt_paging *hwpt;
+       /* array of struct iommufd_device, indexed by device virtual id */
+       struct xarray device_ids;
+};

Thanks
Nicolin
Jason Gunthorpe Jan. 3, 2024, 5:52 p.m. UTC | #19
On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
> On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > > You can pass the ctx to the invalidate op, it is already implied
> > > > because the passed iommu_domain is linked to a single iommufd ctx.
> > > 
> > > The device virtual id lookup API needs something similar, yet it
> > > likely needs a viommu pointer (or its id) instead? As the table
> > > is attached to a viommu while an ictx can have multiple viommus,
> > > right?
> > 
> > Yes, when we get to an API for that it will have to be some op
> > 'invalidate_viommu(..)' and it can get the necessary pointers.
> 
> OK! I will try that first.
> 
> > The viommu object will have to be some driver object like the
> > iommu_domain.
> 
> I drafted something like this, linking it to struct iommu_device:
> 
> +struct iommufd_viommu {
> +       struct iommufd_object obj;
> +       struct iommufd_ctx *ictx;
> +       struct iommu_device *iommu_dev;
> +       struct iommufd_hwpt_paging *hwpt;
> +       /* array of struct iommufd_device, indexed by device virtual id */
> +       struct xarray device_ids;
> +};

The driver would have to create it and there would be some driver
specific enclosing struct to go with it

Perhaps device_ids goes in the driver specific struct, I don't know.

Not sure it should have hwpt at all, probably vmid should come from
the driver specific struct in some driver specific way

Jason
Nicolin Chen Jan. 3, 2024, 8:18 p.m. UTC | #20
On Wed, Jan 03, 2024 at 01:52:02PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 09:06:23AM -0800, Nicolin Chen wrote:
> > On Wed, Jan 03, 2024 at 12:58:48PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 03, 2024 at 08:48:46AM -0800, Nicolin Chen wrote:
> > > > > You can pass the ctx to the invalidate op, it is already implied
> > > > > because the passed iommu_domain is linked to a single iommufd ctx.
> > > > 
> > > > The device virtual id lookup API needs something similar, yet it
> > > > likely needs a viommu pointer (or its id) instead? As the table
> > > > is attached to a viommu while an ictx can have multiple viommus,
> > > > right?
> > > 
> > > Yes, when we get to an API for that it will have to be some op
> > > 'invalidate_viommu(..)' and it can get the necessary pointers.
> > 
> > OK! I will try that first.
> > 
> > > The viommu object will have to be some driver object like the
> > > iommu_domain.
> > 
> > I drafted something like this, linking it to struct iommu_device:
> > 
> > +struct iommufd_viommu {
> > +       struct iommufd_object obj;
> > +       struct iommufd_ctx *ictx;
> > +       struct iommu_device *iommu_dev;
> > +       struct iommufd_hwpt_paging *hwpt;
> > +       /* array of struct iommufd_device, indexed by device virtual id */
> > +       struct xarray device_ids;
> > +};
> 
> The driver would have to create it and there would be some driver
> specific enclosing struct to go with it
> 
> Perhaps device_ids goes in the driver specific struct, I don't know.

+struct iommufd_viommu {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommu_device *iommu_dev;
+	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
+
+	int vmid;
+
+	union iommu_driver_user_data {
+		struct iommu_driver_user_vtd;
+		struct iommu_driver_user_arm_smmuv3;
+		struct iommu_driver_user_amd_viommu;
+	};
+};

Then iommu_ops would need something like:
	iommu_user_alloc/free(struct iommu_device *iommu_dev,
			      union *iommu_driver_user_data, int *vmid);
	iommu_user_set/unset_dev_id(union iommu_driver_user_data,
				    struct device *dev, u32/u64 id);
	iommu_user_invalidate(union iommu_driver_user_data *iommu,
			      struct iommu_user_data_array *array);

Comments and ideas on a better naming convention?

> Not sure it should have hwpt at all, probably vmid should come from
> the driver specific struct in some driver specific way

The idea of having a hwpt pointer is to share the paging hwpt among
the viommu objects. I don't think it "shouldn't have" one; likely we
can avoid it, depending on whether it will have some use in this
context.

Nicolin
Jason Gunthorpe Jan. 4, 2024, 12:02 a.m. UTC | #21
On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > The driver would have to create it and there would be some driver
> > specific enclosing struct to go with it
> > 
> > Perhaps device_ids goes in the driver specific struct, I don't know.
> 
> +struct iommufd_viommu {
> +	struct iommufd_object obj;
> +	struct iommufd_ctx *ictx;
> +	struct iommu_device *iommu_dev;
> +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> +
> +	int vmid;
> +
> +	union iommu_driver_user_data {
> +		struct iommu_driver_user_vtd;
> +		struct iommu_driver_user_arm_smmuv3;
> +		struct iommu_driver_user_amd_viommu;
> +	};

Not like that, in the usual container_of way

Jason
Jason Gunthorpe Jan. 4, 2024, 2:36 p.m. UTC | #22
On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> Per the prior discussion[1], we agreed to move the error reporting into the
> driver specific part. On Intel side, we want to report two devTLB
> invalidation errors: ICE (invalid completion error) and ITE (invalidation
> timeout error). Such errors have an additional SID information to tell
> which device failed the devTLB invalidation. I've got the below structure.

IMHO all of this complexity is a consequence of the decision to hide
the devtlb invalidation from the VM..

On the other hand I guess you want to do this because of the SIOV
troubles where the vPCI function in the VM is entirely virtual and
can't be trivially mapped to a real PCI function for ATC invalidation
like ARM and AMD can do (but they also can't support SIOV because of
this). :(

However it also makes it very confusing about how the VM would
perceive an error - eg if it invalidates an SIOV device single PASID
and that devtlb fails then the error should be connected back to the
vPCI function for the SIOV's specific PASID and not back to the
physical PCI function for the SIOV owner.

As the iommu driver itself has no idea about the vPCI functions this
seems like it is going to get really confusing. The API I suggested in
the other email is not entirely going to work as the vPCI function for
SIOV cases will have to be identified by the (struct device, PASID) -
while it would be easy enough for the iommu driver to provide the
PASID, I'm not sure how the iommufd core will relate the PASID back to
the iommu_device to understand SIOV without actually being aware of
SIOV to some degree :\

(Given SIOVr1 seems on track to be replaced by SIOVr2 so this is all a
one-off I was hoping to minimize such awareness)

Jason
Tian, Kevin Jan. 5, 2024, 2:16 a.m. UTC | #23
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 4, 2024 10:37 PM
> 
> On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> > Per the prior discussion[1], we agreed to move the error reporting into the
> > driver specific part. On Intel side, we want to report two devTLB
> > invalidation errors: ICE (invalid completion error) and ITE (invalidation
> > timeout error). Such errors have an additional SID information to tell
> > which device failed the devTLB invalidation. I've got the below structure.
> 
> IMHO all of this complexity is a consequence of the decision to hide
> the devtlb invalidation from the VM..
> 
> On the other hand I guess you want to do this because of the SIOV
> troubles where the vPCI function in the VM is entirely virtual and
> can't be trivially mapped to a real PCI function for ATC invalidation
> like ARM and AMD can do (but they also can't support SIOV because of
> this). :(
> 
> However it also makes it very confusing about how the VM would
> perceive an error - eg if it invalidates an SIOV device single PASID
> and that devtlb fails then the error should be connected back to the
> vPCI function for the SIOV's specific PASID and not back to the
> physical PCI function for the SIOV owner.
> 
> As the iommu driver itself has no idea about the vPCI functions this
> seems like it is going to get really confusing. The API I suggested in
> the other email is not entirely going to work as the vPCI function for
> SIOV cases will have to be identified by the (struct device, PASID) -
> while it would be easy enough for the iommu driver to provide the
> PASID, I'm not sure how the iommufd core will relate the PASID back to
> the iommu_device to understand SIOV without actually being aware of
> SIOV to some degree :\

we plan to add such awareness with a new binding helper:

https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/

and with metadata to track association between PASID's and iommufd vdev.

but in reality the relation could be identified in an easy way due to a SIOV
restriction which we discussed before - shared PASID space of PF disallows
assigning sibling vdev's to the same VM (otherwise there is no way to identify
which sibling vdev triggered an iopf when a pasid is used on both vdev's). That
restriction implies that within an iommufd context every iommufd_device
object should contain a unique struct device pointer. So PASID can be
instead ignored in the lookup then just always do iommufd_get_dev_id()
using struct device.
Tian, Kevin Jan. 5, 2024, 2:52 a.m. UTC | #24
> From: Tian, Kevin
> Sent: Friday, January 5, 2024 10:16 AM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, January 4, 2024 10:37 PM
> >
> > On Thu, Dec 14, 2023 at 07:26:39PM +0800, Yi Liu wrote:
> > > Per the prior discussion[1], we agreed to move the error reporting into
> the
> > > driver specific part. On Intel side, we want to report two devTLB
> > > invalidation errors: ICE (invalid completion error) and ITE (invalidation
> > > timeout error). Such errors have an additional SID information to tell
> > > which device failed the devTLB invalidation. I've got the below structure.
> >
> > IMHO all of this complexity is a consequence of the decision to hide
> > the devtlb invalidation from the VM..
> >
> > On the other hand I guess you want to do this because of the SIOV
> > troubles where the vPCI function in the VM is entirely virtual and
> > can't be trivially mapped to a real PCI function for ATC invalidation
> > like ARM and AMD can do (but they also can't support SIOV because of
> > this). :(
> >
> > However it also makes it very confusing about how the VM would
> > perceive an error - eg if it invalidates an SIOV device single PASID
> > and that devtlb fails then the error should be connected back to the
> > vPCI function for the SIOV's specific PASID and not back to the
> > physical PCI function for the SIOV owner.
> >
> > As the iommu driver itself has no idea about the vPCI functions this
> > seems like it is going to get really confusing. The API I suggested in
> > the other email is not entirely going to work as the vPCI function for
> > SIOV cases will have to be identified by the (struct device, PASID) -
> > while it would be easy enough for the iommu driver to provide the
> > PASID, I'm not sure how the iommufd core will relate the PASID back to
> > the iommu_device to understand SIOV without actually being aware of
> > SIOV to some degree :\
> 
> we plan to add such awareness with a new binding helper:
> 
> https://lore.kernel.org/lkml/20231009085123.463179-4-yi.l.liu@intel.com/
> 
> and with metadata to track association between PASID's and iommufd vdev.
> 
> but in reality the relation could be identified in an easy way due to a SIOV
> restriction which we discussed before - shared PASID space of PF disallows
> assigning sibling vdev's to a same VM (otherwise no way to identify which
> sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> restriction implies that within an iommufd context every iommufd_device
> object should contain a unique struct device pointer. So PASID can be
> instead ignored in the lookup then just always do iommufd_get_dev_id()
> using struct device.

A bit more background.

Previously we thought this restriction only applies to SIOV+vSVA, as
a guest process may bind to both sibling vdev's, leading to the same
pasid situation.

In concept w/o vSVA it's still possible to assign sibling vdev's to
a same VM as each vdev is allocated with a unique pasid to mark vRID
so can be differentiated from each other in the fault/error path.

But when looking at this err code issue with Yi closely, we found
there is another gap in the VT-d spec. Upon devtlb invalidation
timeout, the hw doesn't report pasid in the error info register. This
makes it impossible to identify the source vdev if a hwpt invalidation
request involves sibling vdev's from a same PF.

with that I'm inclined to always impose this restriction for SIOV.
One may argue that SIOV w/o vSVA w/o devtlb is conceptually immune,
but I'm with you that, given SIOVr1 is a one-off, I'd prefer limiting its
usability over complicating the kernel.
Nicolin Chen Jan. 5, 2024, 7:38 a.m. UTC | #25
On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > > The driver would have to create it and there would be some driver
> > > specific enclosing struct to go with it
> > > 
> > > Perhaps device_ids goes in the driver specific struct, I don't know.
> > 
> > +struct iommufd_viommu {
> > +	struct iommufd_object obj;
> > +	struct iommufd_ctx *ictx;
> > +	struct iommu_device *iommu_dev;
> > +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> > +
> > +	int vmid;
> > +
> > +	union iommu_driver_user_data {
> > +		struct iommu_driver_user_vtd;
> > +		struct iommu_driver_user_arm_smmuv3;
> > +		struct iommu_driver_user_amd_viommu;
> > +	};
> 
> Not like that, in the usual container_of way

How about this:

// iommu.h
@@ -490,6 +490,16 @@ struct iommu_ops {
+       /* User space instance allocation and freeing by the iommu driver */
+       struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu);
+       void (*iommu_free_user)(struct iommu_device_user *iommu_user);
+       int (*iommu_user_set_dev_id)(struct iommu_device_user *iommu_user,
+                                    struct device *dev, u64 dev_id);
+       int (*iommu_user_unset_dev_id)(struct iommu_device_user *iommu_user,
+                                      struct device *dev);
+       int (*iommu_cache_invalidate_user)(struct iommu_device_user *iommu_user,
+                                           struct iommu_user_data_array *array);
...
+/**
+ * struct iommu_device_user - IOMMU core representation of one IOMMU virtual
+ *                            instance
+ * @iommu_dev: Underlying IOMMU hardware instance
+ * @id: Virtual instance ID, e.g. a vmid
+ */
+struct iommu_device_user {
+       struct iommu_device *iommu_dev;
+       struct xarray virtual_ids;
+       u32 id;
+};

// iommufd_private.h
+struct iommufd_viommu {
+       struct iommufd_object obj;
+       struct iommufd_ctx *ictx;
+       struct iommu_device *iommu_dev;
+       struct iommu_device_user *iommu_user;
+       struct iommufd_hwpt_paging *hwpt;
+};
+
+int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_viommu_destroy(struct iommufd_object *obj);

So iommu_alloc_user op allocates a driver private structure that
contains an iommu_device_user and returns &priv->iommu_user.

The set/unset_dev_id ops probably need a type argument if there
can be a multi-xarray case, then the virtual_ids xarray should
be moved to the driver private structure too?

Also, a 64-bit virtual id in the uAPI was suggested previously.
But xarray is limited to a 32-bit index? Should we compromise
the uAPI to 32-bit or is there an alternative for a 64-bit id
lookup?

Thanks
Nicolin
Jason Gunthorpe Jan. 5, 2024, 2:45 p.m. UTC | #26
On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
> > but in reality the relation could be identified in an easy way due to a SIOV
> > restriction which we discussed before - shared PASID space of PF disallows
> > assigning sibling vdev's to a same VM (otherwise no way to identify which
> > sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> > restriction implies that within an iommufd context every iommufd_device
> > object should contain a unique struct device pointer. So PASID can be
> > instead ignored in the lookup then just always do iommufd_get_dev_id()
> > using struct device.
> 
> A bit more background.
> 
> Previously we thought this restriction only applies to SIOV+vSVA, as
> a guest process may bind to both sibling vdev's, leading to the same
> pasid situation.
> 
> In concept w/o vSVA it's still possible to assign sibling vdev's to
> a same VM as each vdev is allocated with a unique pasid to mark vRID
> so can be differentiated from each other in the fault/error path.

I thought the SIOV plan was that each "vdev" ie vpci function would
get a slice of the pRID's PASID space statically selected at creation?

So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
pPASID into each VM.
Jason Gunthorpe Jan. 5, 2024, 3:46 p.m. UTC | #27
On Thu, Jan 04, 2024 at 11:38:40PM -0800, Nicolin Chen wrote:
> On Wed, Jan 03, 2024 at 08:02:04PM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 03, 2024 at 12:18:35PM -0800, Nicolin Chen wrote:
> > > > The driver would have to create it and there would be some driver
> > > > specific enclosing struct to go with it
> > > > 
> > > > Perhaps device_ids goes in the driver specific struct, I don't know.
> > > 
> > > +struct iommufd_viommu {
> > > +	struct iommufd_object obj;
> > > +	struct iommufd_ctx *ictx;
> > > +	struct iommu_device *iommu_dev;
> > > +	struct iommufd_hwpt_paging *hwpt;	/* maybe unneeded */
> > > +
> > > +	int vmid;
> > > +
> > > +	union iommu_driver_user_data {
> > > +		struct iommu_driver_user_vtd;
> > > +		struct iommu_driver_user_arm_smmuv3;
> > > +		struct iommu_driver_user_amd_viommu;
> > > +	};
> > 
> > Not like that, in the usual container_of way
> 
> How about this:
> 
> // iommu.h
> @@ -490,6 +490,16 @@ struct iommu_ops {
> +       /* User space instance allocation and freeing by the iommu driver */
> +       struct iommu_device_user *(*iommu_alloc_user)(struct iommu_device *iommu);

struct iommufd_viommu *iommu_alloc_viommu(struct device *dev);

> +/**
> + * struct iommu_device_user - IOMMU core representation of one IOMMU virtual
> + *                            instance
> + * @iommu_dev: Underlying IOMMU hardware instance
> + * @id: Virtual instance ID, e.g. a vmid
> + */
> +struct iommu_device_user {
> +       struct iommu_device *iommu_dev;
> +       struct xarray virtual_ids;
> +       u32 id;
> +};
> 
> // iommufd_private.h
> +struct iommufd_viommu {
> +       struct iommufd_object obj;
> +       struct iommufd_ctx *ictx;
> +       struct iommu_device *iommu_dev;
> +       struct iommu_device_user *iommu_user;
> +       struct iommufd_hwpt_paging *hwpt;
> +};

And you probably don't need two structs, just combine them together

And don't repeat the iommu_domain misdesign, there should be some
helper to init (or maybe allocate and init) the core structure along
with the driver part.

> The set/unset_dev_id ops probably need a type argument if there
> can be a multi-xarray case, then the virtual_ids xarray should
> be moved to the driver private structure too?

Yeah, probably. No need to add things that are not used right now, but
it does make some sense that the driver could control the
datastructure used for mapping. eg AMD has a HW backed datastructure
so they may not need an xarray.

> Also, a 64-bit virtual id in the uAPI was suggested previously.
> But xarray is limited to a 32-bit index? Should we compromise
> the uAPI to 32-bit or is there an alternative for a 64-bit id
> lookup?

You can use 64 bits in the uapi and forbid values that are too
large. xarray uses an unsigned long for the index so it is 64 bit on
64 bit systems.

Jason
Tian, Kevin Jan. 8, 2024, 4:07 a.m. UTC | #28
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, January 5, 2024 10:45 PM
> 
> On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
> > > but in reality the relation could be identified in an easy way due to a SIOV
> > > restriction which we discussed before - shared PASID space of PF
> disallows
> > > assigning sibling vdev's to a same VM (otherwise no way to identify which
> > > sibling vdev triggering an iopf when a pasid is used on both vdev's). That
> > > restriction implies that within an iommufd context every iommufd_device
> > > object should contain a unique struct device pointer. So PASID can be
> > > instead ignored in the lookup then just always do iommufd_get_dev_id()
> > > using struct device.
> >
> > A bit more background.
> >
> > Previously we thought this restriction only applies to SIOV+vSVA, as
> > a guest process may bind to both sibling vdev's, leading to the same
> > pasid situation.
> >
> > In concept w/o vSVA it's still possible to assign sibling vdev's to
> > a same VM as each vdev is allocated with a unique pasid to mark vRID
> > so can be differentiated from each other in the fault/error path.
> 
> I thought the SIOV plan was that each "vdev" ie vpci function would
> get a slice of the pRID's PASID space statically selected at creation?
> 
> So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
> pPASID into each VM.
> 
> From that view you can't identify the iommufd dev_id without knowing
> both the pRID and pPASID which will disambiguate the different SIOV
> iommufd dev_id instances sharing a rid.

true when assigning those instances to different VMs.

Here I was talking about assigning them to a same VM being a problem.
with rid sharing plus same ENQCMD pPASID potentially used on both
instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.

we agreed before on preventing sibling vdev's in one VM for above
reason, but only as far as vSVA is concerned.

then given the new finding in err reporting I wondered whether this
restriction should be applied to all SIOV scenarios (but that is irrelevant
now, per the explanation below, after further thought)

> 
> > But when looking at this err code issue with Yi closely, we found
> > there is another gap in the VT-d spec. Upon devtlb invalidation
> > timeout the hw doesn't report pasid in the error info register. this
> > makes it impossible to identify the source vdev if a hwpt invalidation
> > request involves sibling vdev's from a same PF.
> 
> Don't you know which command timed out?

unfortunately no.

for errors related to descriptor fetch the driver can tell the command
by looking at the head pointer of the invalidation queue.

command completion is indirectly detected by inserting a wait descriptor
as fence. completion timeout error is reported in an error register. but
this register doesn't record pasid, nor does the command location. if there
are multiple pending devtlb invalidation commands upon a timeout
error, the spec suggests the driver treat all of them as timed out, as the
register can only record one rid.

this is kind of moot. If the driver submits only one command (plus wait)
at a time it doesn't need hw's help to identify the timeout command. 
If the driver batches invalidation commands it must treat them all as
timed out if a timeout error is reported.

from this angle whether to record pasid doesn't really matter.

the intel-iommu driver doesn't batch commands, so it's possible for
the driver to figure out the timeout device itself and identify rid plus
pasid to find dev_id from iommufd.

Thanks
Kevin
Jason Gunthorpe Jan. 8, 2024, 1:51 p.m. UTC | #29
On Mon, Jan 08, 2024 at 04:07:12AM +0000, Tian, Kevin wrote:
> > > In concept w/o vSVA it's still possible to assign sibling vdev's to
> > > a same VM as each vdev is allocated with a unique pasid to mark vRID
> > > so can be differentiated from each other in the fault/error path.
> > 
> > I thought the SIOV plan was that each "vdev" ie vpci function would
> > get a slice of the pRID's PASID space statically selected at creation?
> > 
> > So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
> > pPASID into each VM.
> > 
> > From that view you can't identify the iommufd dev_id without knowing
> > both the pRID and pPASID which will disambiguate the different SIOV
> > iommufd dev_id instances sharing a rid.
> 
> true when assigning those instances to different VMs.
> 
> Here I was talking about assigning them to a same VM being a problem.
> with rid sharing plus same ENQCMD pPASID potentially used on both
> instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.

Oh, you're imagining sharing the pPASID if things have the same translation?
I guess I can see why, but given where things are overall I'd say just
don't do that.

Indeed we can't do that because it makes the vRID unknowable.

(again I continue to think that vt-d cache design is messed up, using
the PASID for the cache tag is a *terrible* design, and causes exactly
these kinds of problems)

> for errors related to descriptor fetch the driver can tell the command
> by looking at the head pointer of the invalidation queue.
> 
> command completion is indirectly detected by inserting a wait descriptor
> as fence. completion timeout error is reported in an error register. but
> this register doesn't record pasid, nor does the command location. if there
> are multiple pending devtlb invalidation commands upon timeout 
> error the spec suggests the driver to treat all of them timeout as the
> register can only record one rid.

Makes sense, or at least you have to re-issue them one by one

> this is kind of moot. If the driver submits only one command (plus wait)
> at a time it doesn't need hw's help to identify the timeout command. 
> If the driver batches invalidation commands it must treat all timeout if
> an timeout error is reported.

Yes
 
> from this angle whether to record pasid doesn't really matter.

At least for error handling..
 
Jason
Yi Liu Jan. 9, 2024, 6 a.m. UTC | #30
On 2024/1/8 12:07, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, January 5, 2024 10:45 PM
>>
>> On Fri, Jan 05, 2024 at 02:52:50AM +0000, Tian, Kevin wrote:
>>>> but in reality the relation could be identified in an easy way due to a SIOV
>>>> restriction which we discussed before - shared PASID space of PF
>> disallows
>>>> assigning sibling vdev's to a same VM (otherwise no way to identify which
>>>> sibling vdev triggering an iopf when a pasid is used on both vdev's). That
>>>> restriction implies that within an iommufd context every iommufd_device
>>>> object should contain a unique struct device pointer. So PASID can be
>>>> instead ignored in the lookup then just always do iommufd_get_dev_id()
>>>> using struct device.
>>>
>>> A bit more background.
>>>
>>> Previously we thought this restriction only applies to SIOV+vSVA, as
>>> a guest process may bind to both sibling vdev's, leading to the same
>>> pasid situation.
>>>
>>> In concept w/o vSVA it's still possible to assign sibling vdev's to
>>> a same VM as each vdev is allocated with a unique pasid to mark vRID
>>> so can be differentiated from each other in the fault/error path.
>>
>> I thought the SIOV plan was that each "vdev" ie vpci function would
>> get a slice of the pRID's PASID space statically selected at creation?
>>
>> So SVA/etc doesn't matter, you reliably get a disjoint set of pRID &
>> pPASID into each VM.
>>
>>  From that view you can't identify the iommufd dev_id without knowing
>> both the pRID and pPASID which will disambiguate the different SIOV
>> iommufd dev_id instances sharing a rid.
> 
> true when assigning those instances to different VMs.
> 
> Here I was talking about assigning them to a same VM being a problem.
> with rid sharing plus same ENQCMD pPASID potentially used on both
> instances there'd be ambiguity in vSVA e.g. iopf to identify dev_id.
> 
> we agreed before on preventing sibling vdev's in one VM for above
> reason, but only as far as vSVA is concerned.
> 
> then given the new finding in err reporting I wondered whether this
> restriction should be applied to all SIOV scenarios (but irrelevant now
> with below explanation after another thinking)
> 
>>
>>> But when looking at this err code issue with Yi closely, we found
>>> there is another gap in the VT-d spec. Upon devtlb invalidation
>>> timeout the hw doesn't report pasid in the error info register. this
>>> makes it impossible to identify the source vdev if a hwpt invalidation
>>> request involves sibling vdev's from a same PF.
>>
>> Don't you know which command timed out?
> 
> unfortunately no.
> 
> for errors related to descriptor fetch the driver can tell the command
> by looking at the head pointer of the invalidation queue.
> 
> command completion is indirectly detected by inserting a wait descriptor
> as fence. completion timeout error is reported in an error register. but
> this register doesn't record pasid, nor does the command location. if there
> are multiple pending devtlb invalidation commands upon timeout
> error the spec suggests the driver treat all of them as timed out, as
> the register can only record one rid.
> 
> this is kind of moot. If the driver submits only one command (plus wait)
> at a time it doesn't need hw's help to identify the timeout command.
> If the driver batches invalidation commands it must treat them all as
> timed out if a timeout error is reported.
> 
> from this angle whether to record pasid doesn't really matter.
> 
> intel-iommu driver doesn't batch commands. so it's possible for
> the driver to figure out the timeout device itself and identify rid plus
> pasid to find dev_id from iommufd.

Based on this, even the RID is unnecessary. Software already knows to
which device it has sent a devTLB invalidation.
diff mbox series

Patch

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 7f92cecc87d7..cafd98642abf 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -614,6 +614,42 @@  struct iommu_hwpt_get_dirty_bitmap {
 #define IOMMU_HWPT_GET_DIRTY_BITMAP _IO(IOMMUFD_TYPE, \
 					IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP)
 
+/**
+ * enum iommu_hwpt_vtd_s1_invalidate_flags - Flags for Intel VT-d
+ *                                           stage-1 cache invalidation
+ * @IOMMU_VTD_INV_FLAGS_LEAF: The LEAF flag indicates whether only the
+ *                            leaf PTE caching needs to be invalidated
+ *                            and other paging structure caches can be
+ *                            preserved.
+ */
+enum iommu_hwpt_vtd_s1_invalidate_flags {
+	IOMMU_VTD_INV_FLAGS_LEAF = 1 << 0,
+};
+
+/**
+ * struct iommu_hwpt_vtd_s1_invalidate - Intel VT-d cache invalidation
+ *                                       (IOMMU_HWPT_DATA_VTD_S1)
+ * @addr: The start address of the range to be invalidated. It needs
+ *        to be 4KB aligned.
+ * @npages: Number of contiguous 4K pages to be invalidated.
+ * @flags: Combination of enum iommu_hwpt_vtd_s1_invalidate_flags
+ * @__reserved: Must be 0
+ *
+ * The Intel VT-d specific invalidation data for user-managed stage-1 cache
+ * invalidation in nested translation. Userspace uses this structure to
+ * tell the impacted cache scope after modifying the stage-1 page table.
+ *
+ * To invalidate all the caches related to the page table, set @addr to 0
+ * and @npages to __aligned_u64(-1). This includes the corresponding
+ * device-TLB if ATS is enabled on the attached devices.
+ */
+struct iommu_hwpt_vtd_s1_invalidate {
+	__aligned_u64 addr;
+	__aligned_u64 npages;
+	__u32 flags;
+	__u32 __reserved;
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)