[v2,00/19] iommufd: Add VIOMMU infrastructure (Part-1)

Message ID cover.1724776335.git.nicolinc@nvidia.com

Message

Nicolin Chen Aug. 27, 2024, 4:59 p.m. UTC
This series introduces a new VIOMMU infrastructure and related ioctls.

IOMMUFD has been using the HWPT infrastructure for all cases, including
nested IO page table support. Yet, an HWPT-based structure has limitations
in supporting some advanced HW-accelerated features, such as CMDQV on
NVIDIA Grace and HW-accelerated vIOMMU on AMD. Even in a multi-IOMMU
environment, it is not straightforward for nested HWPTs to share the same
parent HWPT (stage-2 IO pagetable) with the HWPT infrastructure alone.

The new VIOMMU object is an additional layer between the nested HWPT and
its parent HWPT, giving both the IOMMUFD core and an IOMMU driver an
additional structure to support HW-accelerated features:
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         | HW-accel feats |
                     ----------------------------

On a multi-IOMMU system, one VIOMMU object can be instantiated per vIOMMU
in a guest VM, with all of them holding the same parent HWPT to share the
stage-2 IO pagetable. Each VIOMMU then only needs to allocate its own
VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested0 |--->| viommu0 ------------------
 ----------------    |         |     VMID0      |
                     ----------------------------
                     ----------------------------
 ----------------    |         |  paging_hwpt0  |
 | hwpt_nested1 |--->| viommu1 ------------------
 ----------------    |         |     VMID1      |
                     ----------------------------
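
For illustration only, the layout in the diagram can be modelled as a toy
sketch in plain C. This is not code from the series: every name here
(toy_piommu, toy_s2_hwpt, toy_viommu, toy_vmid_alloc, toy_viommu_init) and
the pool size are hypothetical, chosen only to show two vIOMMUs pointing at
one shared stage-2 while each takes a VMID from its own physical IOMMU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VMID_POOL_SIZE 8	/* tiny on purpose; a real pool is far larger */

/* Hypothetical model of one physical IOMMU with its own VMID pool. */
struct toy_piommu {
	bool vmid_used[TOY_VMID_POOL_SIZE];
};

/* Hypothetical stand-in for the shared stage-2 table (paging_hwpt0). */
struct toy_s2_hwpt {
	uint64_t pgtbl_root;
};

/* Hypothetical vIOMMU: bound to one physical IOMMU and the shared S2. */
struct toy_viommu {
	struct toy_piommu *piommu;
	struct toy_s2_hwpt *parent;
	int vmid;
};

/* Take the lowest free VMID from this physical IOMMU, or -1 if full. */
static int toy_vmid_alloc(struct toy_piommu *p)
{
	for (int i = 0; i < TOY_VMID_POOL_SIZE; i++) {
		if (!p->vmid_used[i]) {
			p->vmid_used[i] = true;
			return i;
		}
	}
	return -1;
}

/* Bind a vIOMMU to a physical IOMMU and the shared parent; 0 on success. */
static int toy_viommu_init(struct toy_viommu *v, struct toy_piommu *p,
			   struct toy_s2_hwpt *parent)
{
	int vmid = toy_vmid_alloc(p);

	if (vmid < 0)
		return -1;
	v->piommu = p;
	v->parent = parent;
	v->vmid = vmid;
	return 0;
}
```

In this model, calling toy_viommu_init() twice with two different
toy_piommu instances and the same toy_s2_hwpt gives two objects whose
parent pointers are equal while their VMIDs are chosen independently.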

As an initial part-1, add ioctls to support a VIOMMU-based invalidation:
    IOMMUFD_CMD_VIOMMU_ALLOC to allocate a VIOMMU object
    IOMMUFD_CMD_VIOMMU_SET/UNSET_VDEV_ID to set/clear device's virtual ID
    (Reuse IOMMUFD_CMD_HWPT_INVALIDATE for a VIOMMU object to flush caches
     using driver-specific data)

Note that the VDEV_ID is kept in a per-VIOMMU device list, which drivers
use to look up a device's physical instance from its virtual ID in a VM.
It is essential for a VIOMMU-based invalidation where the request carries
a device's virtual ID for its device cache flush, e.g. ATC invalidation.
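
As a rough sketch of the lookup described above (not the series' actual
data structure or helpers), a per-vIOMMU table that maps a virtual device
ID to a physical device could look like the following. All names
(toy_pdev, toy_vdev_table, toy_set_vdev_id, toy_find_pdev) and the fixed
table size are assumptions made for the example, and locking is left out
entirely.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_VDEVS 4	/* arbitrary limit for the sketch */

/* Hypothetical physical device, identified by a physical stream ID. */
struct toy_pdev {
	uint32_t phys_sid;
};

/* One (virtual ID -> physical device) mapping. */
struct toy_vdev_entry {
	uint64_t vdev_id;
	struct toy_pdev *pdev;
};

/* Hypothetical per-vIOMMU device list. */
struct toy_vdev_table {
	struct toy_vdev_entry entries[TOY_MAX_VDEVS];
	size_t nr_vdevs;
};

/* Return the physical device for a virtual ID, or NULL if it is unset. */
static struct toy_pdev *toy_find_pdev(const struct toy_vdev_table *t,
				      uint64_t vdev_id)
{
	for (size_t i = 0; i < t->nr_vdevs; i++) {
		if (t->entries[i].vdev_id == vdev_id)
			return t->entries[i].pdev;
	}
	return NULL;
}

/*
 * Record a virtual ID for a device. Fails with -1 if the table is full,
 * the ID is already taken, or the device already has an ID (one per device).
 */
static int toy_set_vdev_id(struct toy_vdev_table *t, uint64_t vdev_id,
			   struct toy_pdev *pdev)
{
	if (t->nr_vdevs == TOY_MAX_VDEVS)
		return -1;
	for (size_t i = 0; i < t->nr_vdevs; i++) {
		if (t->entries[i].vdev_id == vdev_id ||
		    t->entries[i].pdev == pdev)
			return -1;
	}
	t->entries[t->nr_vdevs].vdev_id = vdev_id;
	t->entries[t->nr_vdevs].pdev = pdev;
	t->nr_vdevs++;
	return 0;
}
```

An invalidation request that names a virtual ID would first go through
toy_find_pdev() and then issue the flush against the returned physical
device; an unknown virtual ID yields NULL and the request can be rejected.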

As for the implementation of the series, add an IOMMU_VIOMMU_TYPE_DEFAULT
type for a core-allocated-core-managed VIOMMU object, allowing drivers to
simply hook up default viommu ops for viommu-based invalidation alone. Also
provide some viommu helpers to drivers for VDEV_ID translation and parent
domain lookup. Add VIOMMU invalidation support to the ARM SMMUv3 driver as
a real-world use case. This adds support for arm-smmu-v3's CMDQ_OP_ATC_INV
and CMDQ_OP_CFGI_CD/ALL commands, supplementing HWPT-based invalidations.

In the future, drivers will also be able to choose a driver-managed type
to hold their own structure by adding a new type to enum iommu_viommu_type.
More VIOMMU-based structures and ioctls will be introduced in part-2/3 to
support a driver-managed VIOMMU, e.g. a VQUEUE object for a HW-accelerated
queue and a VIRQ (or VEVENT) object for IRQ injections. The VIOMMU object
was repurposed from an earlier RFC discussion; for reference:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/

This series is on GitHub:
https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_viommu_p1-v2

Changelog
v2
 * Limited vdev_id to one per idev
 * Added a rw_sem to protect the vdev_id list
 * Reworked driver-level APIs with proper locking
 * Added a new viommu_api file for IOMMUFD_DRIVER config
 * Dropped useless iommu_dev pointer from the viommu structure
 * Added missing index numbers to new types in the uAPI header
 * Dropped IOMMU_VIOMMU_INVALIDATE uAPI; instead, reuse the HWPT one
 * Reworked mock_viommu_cache_invalidate() using the new iommu helper
 * Reordered details of set/unset_vdev_id handlers for proper locking
 * Added arm_smmu_cache_invalidate_user patch from Jason's nesting series
v1
 https://lore.kernel.org/all/cover.1723061377.git.nicolinc@nvidia.com/

Thanks!
Nicolin

Jason Gunthorpe (3):
  iommu: Add iommu_copy_struct_from_full_user_array helper
  iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Update comments about ATS and bypass

Nicolin Chen (16):
  iommufd: Reorder struct forward declarations
  iommufd/viommu: Add IOMMUFD_OBJ_VIOMMU and IOMMU_VIOMMU_ALLOC ioctl
  iommu: Pass in a viommu pointer to domain_alloc_user op
  iommufd: Allow pt_id to carry viommu_id for IOMMU_HWPT_ALLOC
  iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage
  iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl
  iommufd/selftest: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID test coverage
  iommufd/viommu: Add cache_invalidate for IOMMU_VIOMMU_TYPE_DEFAULT
  iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
  iommufd/viommu: Add vdev_id helpers for IOMMU drivers
  iommufd/selftest: Add mock_viommu_invalidate_user op
  iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
  iommufd/selftest: Add VIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
  iommufd/viommu: Add iommufd_viommu_to_parent_domain helper
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
  iommu/arm-smmu-v3: Add arm_smmu_viommu_cache_invalidate

 drivers/iommu/amd/iommu.c                     |   1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 218 ++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   3 +
 drivers/iommu/intel/iommu.c                   |   1 +
 drivers/iommu/iommufd/Makefile                |   5 +-
 drivers/iommu/iommufd/device.c                |  12 +
 drivers/iommu/iommufd/hw_pagetable.c          |  59 +++-
 drivers/iommu/iommufd/iommufd_private.h       |  37 +++
 drivers/iommu/iommufd/iommufd_test.h          |  30 ++
 drivers/iommu/iommufd/main.c                  |  12 +
 drivers/iommu/iommufd/selftest.c              | 101 ++++++-
 drivers/iommu/iommufd/viommu.c                | 196 +++++++++++++
 drivers/iommu/iommufd/viommu_api.c            |  53 ++++
 include/linux/iommu.h                         |  56 +++-
 include/linux/iommufd.h                       |  51 +++-
 include/uapi/linux/iommufd.h                  | 117 +++++++-
 tools/testing/selftests/iommu/iommufd.c       | 259 +++++++++++++++++-
 tools/testing/selftests/iommu/iommufd_utils.h | 126 +++++++++
 18 files changed, 1299 insertions(+), 38 deletions(-)
 create mode 100644 drivers/iommu/iommufd/viommu.c
 create mode 100644 drivers/iommu/iommufd/viommu_api.c

Comments

Tian, Kevin Sept. 11, 2024, 6:12 a.m. UTC | #1
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 1:00 AM
> 
[...]
> On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> of vIOMMUs in a guest VM, while holding the same parent HWPT to share the

Is there a restriction that multiple vIOMMU objects can only be created
on a multi-IOMMU system?

> stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:

this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
entire context it actually means the physical 'VMID' allocated on the
associated physical IOMMU, correct?
Nicolin Chen Sept. 11, 2024, 7:08 a.m. UTC | #2
On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, August 28, 2024 1:00 AM
> >
> [...]
> > On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> > of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
> 
> Is there restriction that multiple vIOMMU objects can be only created
> on a multi-IOMMU system?

I think it should be generally restricted to the number of pIOMMUs,
although likely (not 100% sure) we could do multiple vIOMMUs on a
single-pIOMMU system. Any reason for doing that?

> > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
> 
> this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> entire context it actually means the physical 'VMID' allocated on the
> associated physical IOMMU, correct?

Quoting Jason's narratives, a VMID is a "Security namespace for
guest owned ID". The allocation, using SMMU as an example, should
be a part of vIOMMU instance allocation in the host SMMU driver.
Then, this VMID will be used to mark the cache tags. So, it is
still a software allocated ID, while HW would use it too.

Thanks
Nicolin
Tian, Kevin Sept. 11, 2024, 7:18 a.m. UTC | #3
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, September 11, 2024 3:08 PM
> 
> On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, August 28, 2024 1:00 AM
> > >
> > [...]
> > > On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> > > of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
> >
> > Is there restriction that multiple vIOMMU objects can be only created
> > on a multi-IOMMU system?
> 
> I think it should be generally restricted to the number of pIOMMUs,
> although likely (not 100% sure) we could do multiple vIOMMUs on a
> single-pIOMMU system. Any reason for doing that?

No idea. But if you state so, then there should be code to enforce it, e.g.
failing the attempt to create a vIOMMU object on a pIOMMU to which
another vIOMMU object is already linked?

> 
> > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
> >
> > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> > entire context it actually means the physical 'VMID' allocated on the
> > associated physical IOMMU, correct?
> 
> Quoting Jason's narratives, a VMID is a "Security namespace for
> guest owned ID". The allocation, using SMMU as an example, should

the VMID alone is not a namespace. It's one ID to tag another namespace.

> be a part of vIOMMU instance allocation in the host SMMU driver.
> Then, this VMID will be used to mark the cache tags. So, it is
> still a software allocated ID, while HW would use it too.
> 

VMIDs are a physical resource belonging to the host SMMU driver.

But I got your original point: each vIOMMU gets a unique VMID
from the host SMMU driver, not that each vIOMMU maintains
its own VMID namespace. That would be a different concept.
Nicolin Chen Sept. 11, 2024, 7:40 a.m. UTC | #4
On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, September 11, 2024 3:08 PM
> >
> > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, August 28, 2024 1:00 AM
> > > >
> > > [...]
> > > > On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> > > > of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
> > >
> > > Is there restriction that multiple vIOMMU objects can be only created
> > > on a multi-IOMMU system?
> >
> > I think it should be generally restricted to the number of pIOMMUs,
> > although likely (not 100% sure) we could do multiple vIOMMUs on a
> > single-pIOMMU system. Any reason for doing that?
> 
> No idea. But if you stated so then there will be code to enforce it e.g.
> failing the attempt to create a vIOMMU object on a pIOMMU to which
> another vIOMMU object is already linked?

Yea, I can do that.

> > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
> > >
> > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> > > entire context it actually means the physical 'VMID' allocated on the
> > > associated physical IOMMU, correct?
> >
> > Quoting Jason's narratives, a VMID is a "Security namespace for
> > guest owned ID". The allocation, using SMMU as an example, should
> 
> the VMID alone is not a namespace. It's one ID to tag another namespace.
> 
> > be a part of vIOMMU instance allocation in the host SMMU driver.
> > Then, this VMID will be used to mark the cache tags. So, it is
> > still a software allocated ID, while HW would use it too.
> >
> 
> VMIDs are physical resource belonging to the host SMMU driver.

Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e.
the guest.

> but I got your original point that it's each vIOMMU gets an unique VMID
> from the host SMMU driver, not exactly that each vIOMMU maintains
> its own VMID namespace. that'd be a different concept.

What's a VMID namespace actually? Please educate me :)

Thanks
Nicolin
Tian, Kevin Sept. 11, 2024, 8:08 a.m. UTC | #5
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, September 11, 2024 3:41 PM
> 
> On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, September 11, 2024 3:08 PM
> > >
> > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Wednesday, August 28, 2024 1:00 AM
> > > > >
> > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
> > > >
> > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> > > > entire context it actually means the physical 'VMID' allocated on the
> > > > associated physical IOMMU, correct?
> > >
> > > Quoting Jason's narratives, a VMID is a "Security namespace for
> > > guest owned ID". The allocation, using SMMU as an example, should
> >
> > the VMID alone is not a namespace. It's one ID to tag another namespace.
> >
> > > be a part of vIOMMU instance allocation in the host SMMU driver.
> > > Then, this VMID will be used to mark the cache tags. So, it is
> > > still a software allocated ID, while HW would use it too.
> > >
> >
> > VMIDs are physical resource belonging to the host SMMU driver.
> 
> Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e.
> the guest.
> 
> > but I got your original point that it's each vIOMMU gets an unique VMID
> > from the host SMMU driver, not exactly that each vIOMMU maintains
> > its own VMID namespace. that'd be a different concept.
> 
> What's a VMID namespace actually? Please educate me :)
> 

I meant the 16-bit VMID pool under each SMMU.
Nicolin Chen Sept. 11, 2024, 8:21 p.m. UTC | #6
On Wed, Sep 11, 2024 at 08:08:04AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, September 11, 2024 3:41 PM
> >
> > On Wed, Sep 11, 2024 at 07:18:10AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, September 11, 2024 3:08 PM
> > > >
> > > > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > > Sent: Wednesday, August 28, 2024 1:00 AM
> > > > > >
> > > > > > stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
> > > > > > VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
> > > > >
> > > > > this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
> > > > > entire context it actually means the physical 'VMID' allocated on the
> > > > > associated physical IOMMU, correct?
> > > >
> > > > Quoting Jason's narratives, a VMID is a "Security namespace for
> > > > guest owned ID". The allocation, using SMMU as an example, should
> > >
> > > the VMID alone is not a namespace. It's one ID to tag another namespace.
> > >
> > > > be a part of vIOMMU instance allocation in the host SMMU driver.
> > > > Then, this VMID will be used to mark the cache tags. So, it is
> > > > still a software allocated ID, while HW would use it too.
> > > >
> > >
> > > VMIDs are physical resource belonging to the host SMMU driver.
> >
> > Yes. Just the lifecycle of a VMID is controlled by a vIOMMU, i.e.
> > the guest.
> >
> > > but I got your original point that it's each vIOMMU gets an unique VMID
> > > from the host SMMU driver, not exactly that each vIOMMU maintains
> > > its own VMID namespace. that'd be a different concept.
> >
> > What's a VMID namespace actually? Please educate me :)
> >
> 
> I meant the 16bit VMID pool under each SMMU.

I see. Makes sense now.

Thanks
Nicolin
Yi Liu Sept. 25, 2024, 10:30 a.m. UTC | #7
Hi Nic,

On 2024/8/28 00:59, Nicolin Chen wrote:
> This series introduces a new VIOMMU infrastructure and related ioctls.
> 
> IOMMUFD has been using the HWPT infrastructure for all cases, including a
> nested IO page table support. Yet, there're limitations for an HWPT-based
> structure to support some advanced HW-accelerated features, such as CMDQV
> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> environment, it is not straightforward for nested HWPTs to share the same
> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.

Could you elaborate a bit on the last sentence in the above paragraph?

> [...]
Nicolin Chen Sept. 26, 2024, 8:03 p.m. UTC | #8
On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
> On 2024/9/26 02:55, Nicolin Chen wrote:
> > On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
> > > Hi Nic,
> > > 
> > > On 2024/8/28 00:59, Nicolin Chen wrote:
> > > > This series introduces a new VIOMMU infrastructure and related ioctls.
> > > > 
> > > > IOMMUFD has been using the HWPT infrastructure for all cases, including a
> > > > nested IO page table support. Yet, there're limitations for an HWPT-based
> > > > structure to support some advanced HW-accelerated features, such as CMDQV
> > > > on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
> > > > environment, it is not straightforward for nested HWPTs to share the same
> > > > parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
> > > 
> > > could you elaborate a bit for the last sentence in the above paragraph?
> > 
> > Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
> > domain across IOMMU instances, we'd have to make sure that VMID
> > is available on all IOMMU instances. There comes the limitation
> > and potential resource starving, so not ideal.
> 
> got it.
> 
> > Baolu told me that Intel may have the same: different domain IDs
> > on different IOMMUs; multiple IOMMU instances on one chip:
> > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
> > So, I think we are having the same situation here.
> 
> yes, it's called iommu unit or dmar. A typical Intel server can have
> multiple iommu units. But like Baolu mentioned in that thread, the intel
> iommu driver maintains separate domain ID spaces for iommu units, which
> means a given iommu domain has different DIDs when associated with
> different iommu units. So intel side is not suffering from this so far.

An ARM SMMU has its own VMID pool as well. The suffering comes
from associating VMIDs to one shared parent S2 domain.
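
To make that constraint concrete, here is a toy model, assuming each SMMU
instance has a small bitmap-style VMID pool. The names (toy_smmu,
toy_find_common_vmid, toy_find_local_vmid) and the pool size are
hypothetical and do not correspond to the arm-smmu-v3 driver; the point is
only to contrast one VMID that must be free on every instance with one
VMID picked per instance.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_VMID_POOL_SIZE 4	/* tiny on purpose; a real pool is far larger */

/* Hypothetical SMMU instance with its own VMID pool. */
struct toy_smmu {
	bool vmid_used[TOY_VMID_POOL_SIZE];
};

/*
 * Shared-S2-holds-the-VMID model: find the lowest VMID that is free on
 * every instance, or -1 if no single VMID is free everywhere.
 */
static int toy_find_common_vmid(const struct toy_smmu *smmus, size_t n)
{
	for (int id = 0; id < TOY_VMID_POOL_SIZE; id++) {
		bool free_everywhere = true;

		for (size_t i = 0; i < n; i++) {
			if (smmus[i].vmid_used[id]) {
				free_everywhere = false;
				break;
			}
		}
		if (free_everywhere)
			return id;
	}
	return -1;
}

/* Per-instance model: find the lowest VMID free on this instance only. */
static int toy_find_local_vmid(const struct toy_smmu *s)
{
	for (int id = 0; id < TOY_VMID_POOL_SIZE; id++) {
		if (!s->vmid_used[id])
			return id;
	}
	return -1;
}
```

With two instances whose free VMIDs do not overlap, the common search
fails even though each instance still has free entries, while the
per-instance search succeeds on both.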

Is a DID allocated per S1 nested domain or per parent S2? If it is per S2,
I think the same suffering applies when we share the S2 across
IOMMU instances?

> > Adding another vIOMMU wrapper on the other hand can allow us to
> > allocate different VMIDs/DIDs for different IOMMUs.
> 
> that looks like to generalize the association of the iommu domain and the
> iommu units?

A vIOMMU is a representation/object of a physical IOMMU instance
in a VM. This representation gives a VMM some capability to take
advantage of some of the HW resources of the physical IOMMU:
- a VMID is a small HW resource to tag the cache;
- a vIOMMU invalidation allows access to the device cache, which
  is not straightforwardly done via an S1 HWPT invalidation;
- a virtual device representation of a physical device in a VM,
  related to the vIOMMU in the VM, which contains some VM-level
  info: virtual device ID, security level (ARM CCA), etc.;
- non-PRI IRQ forwarding to the guest VM;
- HW-accelerated virtualization resources: vCMDQ, AMD VIOMMU.

Thanks
Nicolin
Baolu Lu Sept. 27, 2024, 2:05 a.m. UTC | #9
On 9/27/24 4:03 AM, Nicolin Chen wrote:
> On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
>> On 2024/9/26 02:55, Nicolin Chen wrote:
>>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
>>>> Hi Nic,
>>>>
>>>> On 2024/8/28 00:59, Nicolin Chen wrote:
>>>>> This series introduces a new VIOMMU infrastructure and related ioctls.
>>>>>
>>>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a
>>>>> nested IO page table support. Yet, there're limitations for an HWPT-based
>>>>> structure to support some advanced HW-accelerated features, such as CMDQV
>>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
>>>>> environment, it is not straightforward for nested HWPTs to share the same
>>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
>>>> could you elaborate a bit for the last sentence in the above paragraph?
>>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
>>> domain across IOMMU instances, we'd have to make sure that VMID
>>> is available on all IOMMU instances. There comes the limitation
>>> and potential resource starving, so not ideal.
>> got it.
>>
>>> Baolu told me that Intel may have the same: different domain IDs
>>> on different IOMMUs; multiple IOMMU instances on one chip:
>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
>>> So, I think we are having the same situation here.
>> yes, it's called iommu unit or dmar. A typical Intel server can have
>> multiple iommu units. But like Baolu mentioned in that thread, the intel
>> iommu driver maintains separate domain ID spaces for iommu units, which
>> means a given iommu domain has different DIDs when associated with
>> different iommu units. So intel side is not suffering from this so far.
> An ARM SMMU has its own VMID pool as well. The suffering comes
> from associating VMIDs to one shared parent S2 domain.
> 
> Does a DID per S1 nested domain or parent S2? If it is per S2,
> I think the same suffering applies when we share the S2 across
> IOMMU instances?

It's per S1 nested domain in current VT-d design. It's simple but lacks
sharing of DID within a VM. We probably will change this later.

Thanks,
baolu
Yi Liu Sept. 27, 2024, 5:54 a.m. UTC | #10
On 2024/9/27 04:03, Nicolin Chen wrote:
> On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
>> On 2024/9/26 02:55, Nicolin Chen wrote:
>>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
>>>> Hi Nic,
>>>>
>>>> On 2024/8/28 00:59, Nicolin Chen wrote:
>>>>> This series introduces a new VIOMMU infrastructure and related ioctls.
>>>>>
>>>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a
>>>>> nested IO page table support. Yet, there're limitations for an HWPT-based
>>>>> structure to support some advanced HW-accelerated features, such as CMDQV
>>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
>>>>> environment, it is not straightforward for nested HWPTs to share the same
>>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
>>>>
>>>> could you elaborate a bit for the last sentence in the above paragraph?
>>>
>>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
>>> domain across IOMMU instances, we'd have to make sure that VMID
>>> is available on all IOMMU instances. There comes the limitation
>>> and potential resource starving, so not ideal.
>>
>> got it.
>>
>>> Baolu told me that Intel may have the same: different domain IDs
>>> on different IOMMUs; multiple IOMMU instances on one chip:
>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
>>> So, I think we are having the same situation here.
>>
>> yes, it's called iommu unit or dmar. A typical Intel server can have
>> multiple iommu units. But like Baolu mentioned in that thread, the intel
>> iommu driver maintains separate domain ID spaces for iommu units, which
>> means a given iommu domain has different DIDs when associated with
>> different iommu units. So intel side is not suffering from this so far.
> 
> An ARM SMMU has its own VMID pool as well. The suffering comes
> from associating VMIDs to one shared parent S2 domain.

Is this because the VMID is tied to an S2 domain?

> Does a DID per S1 nested domain or parent S2? If it is per S2,
> I think the same suffering applies when we share the S2 across
> IOMMU instances?

Per S1, I think. The iotlb efficiency is low, as S2 caches would be
tagged with different DIDs even though the page table is the same. :)

>>> Adding another vIOMMU wrapper on the other hand can allow us to
>>> allocate different VMIDs/DIDs for different IOMMUs.
>>
>> that looks like to generalize the association of the iommu domain and the
>> iommu units?
> 
> A vIOMMU is a presentation/object of a physical IOMMU instance
> in a VM.

a slice of a physical IOMMU. is it? and you treat S2 hwpt as a resource
of the physical IOMMU as well.

> This presentation gives a VMM some capability to take
> advantage of some of HW resource of the physical IOMMU:
> - a VMID is a small HW reousrce to tag the cache;
> - a vIOMMU invalidation allows to access device cache that's
>    not straightforwardly done via an S1 HWPT invalidation;
> - a virtual device presentation of a physical device in a VM,
>    related to the vIOMMU in the VM, which contains some VM-level
>    info: virtual device ID, security level (ARM CCA), and etc;
> - Non-PRI IRQ forwarding to the guest VM;
> - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;

might be helpful to draw a diagram to show what the vIOMMU obj contains.:)
Yi Liu Sept. 27, 2024, 6:14 a.m. UTC | #11
On 2024/9/27 10:05, Baolu Lu wrote:
> On 9/27/24 4:03 AM, Nicolin Chen wrote:
>> On Thu, Sep 26, 2024 at 04:47:02PM +0800, Yi Liu wrote:
>>> On 2024/9/26 02:55, Nicolin Chen wrote:
>>>> On Wed, Sep 25, 2024 at 06:30:20PM +0800, Yi Liu wrote:
>>>>> Hi Nic,
>>>>>
>>>>> On 2024/8/28 00:59, Nicolin Chen wrote:
>>>>>> This series introduces a new VIOMMU infrastructure and related ioctls.
>>>>>>
>>>>>> IOMMUFD has been using the HWPT infrastructure for all cases, including a
>>>>>> nested IO page table support. Yet, there're limitations for an HWPT-based
>>>>>> structure to support some advanced HW-accelerated features, such as CMDQV
>>>>>> on NVIDIA Grace, and HW-accelerated vIOMMU on AMD. Even for a multi-IOMMU
>>>>>> environment, it is not straightforward for nested HWPTs to share the same
>>>>>> parent HWPT (stage-2 IO pagetable), with the HWPT infrastructure alone.
>>>>> could you elaborate a bit for the last sentence in the above paragraph?
>>>> Stage-2 HWPT/domain on ARM holds a VMID. If we share the parent
>>>> domain across IOMMU instances, we'd have to make sure that VMID
>>>> is available on all IOMMU instances. There comes the limitation
>>>> and potential resource starving, so not ideal.
>>> got it.
>>>
>>>> Baolu told me that Intel may have the same: different domain IDs
>>>> on different IOMMUs; multiple IOMMU instances on one chip:
>>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
>>>> So, I think we are having the same situation here.
>>> yes, it's called iommu unit or dmar. A typical Intel server can have
>>> multiple iommu units. But like Baolu mentioned in that thread, the intel
>>> iommu driver maintains separate domain ID spaces for iommu units, which
>>> means a given iommu domain has different DIDs when associated with
>>> different iommu units. So intel side is not suffering from this so far.
>> An ARM SMMU has its own VMID pool as well. The suffering comes
>> from associating VMIDs to one shared parent S2 domain.
>>
>> Does a DID per S1 nested domain or parent S2? If it is per S2,
>> I think the same suffering applies when we share the S2 across
>> IOMMU instances?
> 
> It's per S1 nested domain in current VT-d design. It's simple but lacks
> sharing of DID within a VM. We probably will change this later.

Could you share a bit more about this? I hope it is not going to share the
DID if the S1 nested domains share the same S2 hwpt. For first-stage caches,
the tag is PASID, DID and address. If both PASID and DID are the same, then
there is a cache conflict. And the typical scenario is the gIOVA, which uses
the RIDPASID. :)
Nicolin Chen Sept. 27, 2024, 6:32 a.m. UTC | #12
On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote:
> > > > Baolu told me that Intel may have the same: different domain IDs
> > > > on different IOMMUs; multiple IOMMU instances on one chip:
> > > > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
> > > > So, I think we are having the same situation here.
> > > 
> > > yes, it's called iommu unit or dmar. A typical Intel server can have
> > > multiple iommu units. But like Baolu mentioned in that thread, the intel
> > > iommu driver maintains separate domain ID spaces for iommu units, which
> > > means a given iommu domain has different DIDs when associated with
> > > different iommu units. So intel side is not suffering from this so far.
> > 
> > An ARM SMMU has its own VMID pool as well. The suffering comes
> > from associating VMIDs to one shared parent S2 domain.
> 
> Is this because of the VMID is tied with a S2 domain?

On ARM, yes. VMID is a part of S2 domain stuff.

> > Does a DID per S1 nested domain or parent S2? If it is per S2,
> > I think the same suffering applies when we share the S2 across
> > IOMMU instances?
> 
> per S1 I think. The iotlb efficiency is low as S2 caches would be
> tagged with different DIDs even the page table is the same. :)

On ARM, the stage-1 is tagged with an ASID (Address Space ID)
while the stage-2 is tagged with a VMID. Then an invalidation
for a nested S1 domain must require the VMID from the S2. The
ASID may be also required if the invalidation is specific to
that address space (otherwise, broadcast per VMID.)

I feel these two might act somehow similarly to the two DIDs
during nested translations?
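
As a rough illustration of the tagging described above (a sketch only;
the struct and function names below are hypothetical and are not taken
from the SMMU driver), an invalidation for a nested S1 domain always
carries the VMID and may or may not carry an ASID:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cache tag for an entry created by a nested translation. */
struct tlb_tag {
	uint16_t vmid;	/* comes from the stage-2 (parent) domain */
	uint16_t asid;	/* comes from the stage-1 (nested) domain */
};

/* Hypothetical invalidation request issued for a nested S1 domain. */
struct inv_req {
	uint16_t vmid;	/* always required, taken from the S2 */
	bool has_asid;	/* false: applies to every ASID under this VMID */
	uint16_t asid;	/* only meaningful when has_asid is true */
};

/* Returns true if the cached entry would be dropped by the request. */
static bool inv_matches(const struct tlb_tag *tag, const struct inv_req *req)
{
	if (tag->vmid != req->vmid)
		return false;
	if (!req->has_asid)
		return true;
	return tag->asid == req->asid;
}
```

A request without an ASID hits every entry under the VMID, which is the
"broadcast per VMID" case; a request with an ASID only hits that one
address space.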

> > > > Adding another vIOMMU wrapper on the other hand can allow us to
> > > > allocate different VMIDs/DIDs for different IOMMUs.
> > > 
> > > that looks like to generalize the association of the iommu domain and the
> > > iommu units?
> > 
> > A vIOMMU is a presentation/object of a physical IOMMU instance
> > in a VM.
> 
> a slice of a physical IOMMU. is it?

Yes. When multiple nested translations happen at the same time,
IOMMU (just like a CPU) is shared by these slices. And so is an
invalidation queue executing multiple requests.

Perhaps calling it a slice sounds more accurate, as I guess all
the confusion comes from the name "vIOMMU" that might be thought
to be a user space object/instance that likely holds all virtual
stuff like stage-1 HWPT or so?

> and you treat S2 hwpt as a resource of the physical IOMMU as well.

Yes. A parent HWPT (in the old days, we called it a "kernel-managed"
HWPT) is not a user space thing. It belongs to a kernel-owned
object.

> > This presentation gives a VMM some capability to take
> > advantage of some of HW resource of the physical IOMMU:
> > - a VMID is a small HW reousrce to tag the cache;
> > - a vIOMMU invalidation allows to access device cache that's
> >    not straightforwardly done via an S1 HWPT invalidation;
> > - a virtual device presentation of a physical device in a VM,
> >    related to the vIOMMU in the VM, which contains some VM-level
> >    info: virtual device ID, security level (ARM CCA), and etc;
> > - Non-PRI IRQ forwarding to the guest VM;
> > - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;
> 
> might be helpful to draw a diagram to show what the vIOMMU obj contains.:)

That's what I plan to do. Basically it looks like:
  device---->stage1--->[ viommu [s2_hwpt, vmid, virq, HW-acc, etc.] ]
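
The same layout can be written down as a C sketch. This is only an
illustration of the diagram; every type and field name here is made up
for the example and does not claim to match the structures in the
series:

```c
#include <stdint.h>

/* Hypothetical parent (stage-2) page table, owned by the kernel. */
struct s2_hwpt {
	uint64_t pgtable_root;
};

/* Hypothetical vIOMMU: a per-VM slice of one physical IOMMU. */
struct viommu {
	struct s2_hwpt *s2;	/* parent HWPT, may be shared by several vIOMMUs */
	uint16_t vmid;		/* cache tag allocated on the physical IOMMU */
	int virq;		/* placeholder for IRQ forwarding state */
	void *hw_accel;		/* placeholder for vCMDQ-like resources */
};

/* Hypothetical nested (stage-1) HWPT, which points at its vIOMMU. */
struct nested_hwpt {
	struct viommu *viommu;
	uint64_t s1_root;
};

/* A nested HWPT reaches its parent page table through the vIOMMU. */
static struct s2_hwpt *nested_parent(const struct nested_hwpt *n)
{
	return n->viommu->s2;
}
```

With this shape, two vIOMMUs can point at one s2_hwpt while each keeps
its own vmid, which is the multi-IOMMU case from the cover letter.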

Thanks
Nic
Yi Liu Sept. 27, 2024, 12:12 p.m. UTC | #13
On 2024/9/27 14:32, Nicolin Chen wrote:
> On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote:
>>>>> Baolu told me that Intel may have the same: different domain IDs
>>>>> on different IOMMUs; multiple IOMMU instances on one chip:
>>>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
>>>>> So, I think we are having the same situation here.
>>>>
>>>> yes, it's called iommu unit or dmar. A typical Intel server can have
>>>> multiple iommu units. But like Baolu mentioned in that thread, the intel
>>>> iommu driver maintains separate domain ID spaces for iommu units, which
>>>> means a given iommu domain has different DIDs when associated with
>>>> different iommu units. So intel side is not suffering from this so far.
>>>
>>> An ARM SMMU has its own VMID pool as well. The suffering comes
>>> from associating VMIDs to one shared parent S2 domain.
>>
>> Is this because of the VMID is tied with a S2 domain?
> 
> On ARM, yes. VMID is a part of S2 domain stuff.
> 
>>> Does a DID per S1 nested domain or parent S2? If it is per S2,
>>> I think the same suffering applies when we share the S2 across
>>> IOMMU instances?
>>
>> per S1 I think. The iotlb efficiency is low as S2 caches would be
>> tagged with different DIDs even the page table is the same. :)
> 
> On ARM, the stage-1 is tagged with an ASID (Address Space ID)
> while the stage-2 is tagged with a VMID. Then an invalidation
> for a nested S1 domain must require the VMID from the S2. The
> ASID may be also required if the invalidation is specific to
> that address space (otherwise, broadcast per VMID.)
Looks like the nested s1 caches are tagged with both ASID and VMID.

> I feel these two might act somehow similarly to the two DIDs
> during nested translations?

not quite the same. Is it possible that the ASID is the same for stage-1?
Intel VT-d side can have the pasid to be the same. Like the gIOVA, all
devices use the same ridpasid. Like the scenario I replied to Baolu[1],
we choose to use different DIDs to differentiate the caches for the
two devices.

[1] 
https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/

>>>>> Adding another vIOMMU wrapper on the other hand can allow us to
>>>>> allocate different VMIDs/DIDs for different IOMMUs.
>>>>
>>>> that looks like to generalize the association of the iommu domain and the
>>>> iommu units?
>>>
>>> A vIOMMU is a presentation/object of a physical IOMMU instance
>>> in a VM.
>>
>> a slice of a physical IOMMU. is it?
> 
> Yes. When multiple nested translations happen at the same time,
> IOMMU (just like a CPU) is shared by these slices. And so is an
> invalidation queue executing multiple requests.
> 
> Perhaps calling it a slice sounds more accurate, as I guess all
> the confusion comes from the name "vIOMMU" that might be thought
> to be a user space object/instance that likely holds all virtual
> stuff like stage-1 HWPT or so?

yeah. Maybe this confusion partly comes when you start it with the
cache invalidation as well. I failed to get why a S2 hwpt needs to
be part of the vIOMMU obj at the first glance.

> 
>> and you treat S2 hwpt as a resource of the physical IOMMU as well.
> 
> Yes. A parent HWPT (in the old day, we called it "kernel-manged"
> HWPT) is not a user space thing. This belongs to a kernel owned
> object.
> 
>>> This presentation gives a VMM some capability to take
>>> advantage of some of HW resource of the physical IOMMU:
>>> - a VMID is a small HW reousrce to tag the cache;
>>> - a vIOMMU invalidation allows to access device cache that's
>>>     not straightforwardly done via an S1 HWPT invalidation;
>>> - a virtual device presentation of a physical device in a VM,
>>>     related to the vIOMMU in the VM, which contains some VM-level
>>>     info: virtual device ID, security level (ARM CCA), and etc;
>>> - Non-PRI IRQ forwarding to the guest VM;
>>> - HW-accelerated virtualization resource: vCMDQ, AMD VIOMMU;
>>
>> might be helpful to draw a diagram to show what the vIOMMU obj contains.:)
> 
> That's what I plan to. Basically looks like:
>    device---->stage1--->[ viommu [s2_hwpt, vmid, virq, HW-acc, etc.] ]

ok. let's see your new doc.
Jason Gunthorpe Sept. 27, 2024, 12:20 p.m. UTC | #14
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
> > Perhaps calling it a slice sounds more accurate, as I guess all
> > the confusion comes from the name "vIOMMU" that might be thought
> > to be a user space object/instance that likely holds all virtual
> > stuff like stage-1 HWPT or so?
> 
> yeah. Maybe this confusion partly comes when you start it with the
> cache invalidation as well. I failed to get why a S2 hwpt needs to
> be part of the vIOMMU obj at the first glance.

Both amd and arm have direct-to-VM queues for the iommu and these
queues have their DMA translated by the S2.

So their viommu HW concepts come along with a requirement that there
be a fixed translation for the VM, which we model by attaching a S2
HWPT to the VIOMMU object which gets linked into the IOMMU HW as
the translation for the queue memory.

Jason
Nicolin Chen Sept. 27, 2024, 8:44 p.m. UTC | #15
On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
> On 2024/9/27 14:32, Nicolin Chen wrote:
> > On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote:
> > > > > > Baolu told me that Intel may have the same: different domain IDs
> > > > > > on different IOMMUs; multiple IOMMU instances on one chip:
> > > > > > https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
> > > > > > So, I think we are having the same situation here.
> > > > > 
> > > > > yes, it's called iommu unit or dmar. A typical Intel server can have
> > > > > multiple iommu units. But like Baolu mentioned in that thread, the intel
> > > > > iommu driver maintains separate domain ID spaces for iommu units, which
> > > > > means a given iommu domain has different DIDs when associated with
> > > > > different iommu units. So intel side is not suffering from this so far.
> > > > 
> > > > An ARM SMMU has its own VMID pool as well. The suffering comes
> > > > from associating VMIDs to one shared parent S2 domain.
> > > 
> > > Is this because of the VMID is tied with a S2 domain?
> > 
> > On ARM, yes. VMID is a part of S2 domain stuff.
> > 
> > > > Does a DID per S1 nested domain or parent S2? If it is per S2,
> > > > I think the same suffering applies when we share the S2 across
> > > > IOMMU instances?
> > > 
> > > per S1 I think. The iotlb efficiency is low as S2 caches would be
> > > tagged with different DIDs even the page table is the same. :)
> > 
> > On ARM, the stage-1 is tagged with an ASID (Address Space ID)
> > while the stage-2 is tagged with a VMID. Then an invalidation
> > for a nested S1 domain must require the VMID from the S2. The
> > ASID may be also required if the invalidation is specific to
> > that address space (otherwise, broadcast per VMID.)

> Looks like the nested s1 caches are tagged with both ASID and VMID.

Yea, my understanding is similar. If both stages are enabled for
a nested translation, VMID is tagged for S1 cache too.

> > I feel these two might act somehow similarly to the two DIDs
> > during nested translations?
> 
> not quite the same. Is it possible that the ASID is the same for stage-1?
> Intel VT-d side can have the pasid to be the same. Like the gIOVA, all
> devices use the same ridpasid. Like the scenario I replied to Baolu[1],
> do er choose to use different DIDs to differentiate the caches for the
> two devices.

On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or
an SVA PASID>0 domain) has a unique ASID. So it is unlikely to have
two identical ASIDs if they are on the same vIOMMU, because the
ASID pool is per IOMMU instance (whether p or v).

With two vIOMMU instances, there might be the same ASIDs but they
will be tagged with different VMIDs.
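
A small sketch of that point, assuming a toy 64-entry ASID pool per
vIOMMU (the names, the pool size and the 32-bit tag layout are all
assumptions made for illustration, not what the driver does):

```c
#include <stdint.h>

#define ASID_POOL_SIZE 64

/* Hypothetical per-vIOMMU ID state: one VMID plus a private ASID pool. */
struct viommu_ids {
	uint16_t vmid;
	uint64_t asid_bitmap;	/* bit i set: ASID i is in use */
};

/* Allocate the lowest free ASID from this instance's pool, or -1. */
static int asid_alloc(struct viommu_ids *v)
{
	for (int i = 0; i < ASID_POOL_SIZE; i++) {
		if (!(v->asid_bitmap & (1ULL << i))) {
			v->asid_bitmap |= 1ULL << i;
			return i;
		}
	}
	return -1;
}

/* Combine VMID and ASID into one value standing in for the cache tag. */
static uint32_t cache_tag(const struct viommu_ids *v, int asid)
{
	return ((uint32_t)v->vmid << 16) | (uint16_t)asid;
}
```

Two instances can both hand out ASID 0, yet the combined tags differ
because the VMIDs differ.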

> [1]
> https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/

Is "gIOVA" a type of invalidation that only uses "address" out of
"PASID, DID and address"? I.e. PASID and DID are not provided via
the invalidation request, so it's going to broadcast to all viommus?

Thanks
Nicolin
Yi Liu Sept. 29, 2024, 7:16 a.m. UTC | #16
On 2024/9/28 04:44, Nicolin Chen wrote:
> On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
>> On 2024/9/27 14:32, Nicolin Chen wrote:
>>> On Fri, Sep 27, 2024 at 01:54:45PM +0800, Yi Liu wrote:
>>>>>>> Baolu told me that Intel may have the same: different domain IDs
>>>>>>> on different IOMMUs; multiple IOMMU instances on one chip:
>>>>>>> https://lore.kernel.org/linux-iommu/cf4fe15c-8bcb-4132-a1fd-b2c8ddf2731b@linux.intel.com/
>>>>>>> So, I think we are having the same situation here.
>>>>>>
>>>>>> yes, it's called iommu unit or dmar. A typical Intel server can have
>>>>>> multiple iommu units. But like Baolu mentioned in that thread, the intel
>>>>>> iommu driver maintains separate domain ID spaces for iommu units, which
>>>>>> means a given iommu domain has different DIDs when associated with
>>>>>> different iommu units. So intel side is not suffering from this so far.
>>>>>
>>>>> An ARM SMMU has its own VMID pool as well. The suffering comes
>>>>> from associating VMIDs to one shared parent S2 domain.
>>>>
>>>> Is this because of the VMID is tied with a S2 domain?
>>>
>>> On ARM, yes. VMID is a part of S2 domain stuff.
>>>
>>>>> Does a DID per S1 nested domain or parent S2? If it is per S2,
>>>>> I think the same suffering applies when we share the S2 across
>>>>> IOMMU instances?
>>>>
>>>> per S1 I think. The iotlb efficiency is low as S2 caches would be
>>>> tagged with different DIDs even the page table is the same. :)
>>>
>>> On ARM, the stage-1 is tagged with an ASID (Address Space ID)
>>> while the stage-2 is tagged with a VMID. Then an invalidation
>>> for a nested S1 domain must require the VMID from the S2. The
>>> ASID may be also required if the invalidation is specific to
>>> that address space (otherwise, broadcast per VMID.)
> 
>> Looks like the nested s1 caches are tagged with both ASID and VMID.
> 
> Yea, my understanding is similar. If both stages are enabled for
> a nested translation, VMID is tagged for S1 cache too.
> 
>>> I feel these two might act somehow similarly to the two DIDs
>>> during nested translations?
>>
>> not quite the same. Is it possible that the ASID is the same for stage-1?
>> Intel VT-d side can have the pasid to be the same. Like the gIOVA, all
>> devices use the same ridpasid. Like the scenario I replied to Baolu[1],
>> do er choose to use different DIDs to differentiate the caches for the
>> two devices.
> 
> On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or
> an SVA PASID>0 domain) has a unique ASID.

I see. Looks like ASID is not the PASID.

> So it unlikely has the
> situation of two identical ASIDs if they are on the same vIOMMU,
> because the ASID pool is per IOMMU instance (whether p or v).
> 
> With two vIOMMU instances, there might be the same ASIDs but they
> will be tagged with different VMIDs.
> 
>> [1]
>> https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/
> 
> Is "gIOVA" a type of invalidation that only uses "address" out of
> "PASID, DID and address"? I.e. PASID and DID are not provided via
> the invalidation request, so it's going to broadcast all viommus?

gIOVA is just a term v.s. vSVA. Just want to differentiate it from vSVA. :)
PASID and DID are still provided in the invalidation.
Yi Liu Sept. 29, 2024, 7:19 a.m. UTC | #17
On 2024/9/27 20:20, Jason Gunthorpe wrote:
> On Fri, Sep 27, 2024 at 08:12:20PM +0800, Yi Liu wrote:
>>> Perhaps calling it a slice sounds more accurate, as I guess all
>>> the confusion comes from the name "vIOMMU" that might be thought
>>> to be a user space object/instance that likely holds all virtual
>>> stuff like stage-1 HWPT or so?
>>
>> yeah. Maybe this confusion partly comes when you start it with the
>> cache invalidation as well. I failed to get why a S2 hwpt needs to
>> be part of the vIOMMU obj at the first glance.
> 
> Both amd and arm have direct to VM queues for the iommu and these
> queues have their DMA translated by the S2.

ok, this explains why the S2 should be part of the vIOMMU obj.

> 
> So their viommu HW concepts come along with a requirement that there
> be a fixed translation for the VM, which we model by attaching a S2
> HWPT to the VIOMMU object which get's linked into the IOMMU HW as
> the translation for the queue memory.

Is the mapping of the S2 static? Or can it be unmapped by userspace?
Nicolin Chen Sept. 30, 2024, 9:59 p.m. UTC | #18
On Sun, Sep 29, 2024 at 03:16:55PM +0800, Yi Liu wrote:
> > > > I feel these two might act somehow similarly to the two DIDs
> > > > during nested translations?
> > > 
> > > not quite the same. Is it possible that the ASID is the same for stage-1?
> > > Intel VT-d side can have the pasid to be the same. Like the gIOVA, all
> > > devices use the same ridpasid. Like the scenario I replied to Baolu[1],
> > > do er choose to use different DIDs to differentiate the caches for the
> > > two devices.
> > 
> > On ARM, each S1 domain (either a normal stage-1 PASID=0 domain or
> > an SVA PASID>0 domain) has a unique ASID.
> 
> I see. Looks like ASID is not the PASID.

It's not. PASID is called Substream ID in SMMU terms. It's used to
index the PASID table. For cache invalidations, a PASID (ssid) is
for ATC (dev cache) or PASID table entry invalidation only.
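
To make the distinction concrete, here is a minimal sketch in which the
PASID (ssid) is only an index into a per-device table, and the ASID is a
separate field stored in the entry it selects. The entry layout and
names are hypothetical and much simpler than a real table entry:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical PASID table entry: the ASID lives here, not in the index. */
struct pasid_entry {
	bool valid;
	uint16_t asid;		/* tag used for stage-1 TLB entries */
	uint64_t s1_root;	/* stage-1 page table base */
};

/* Look up an entry by PASID (ssid); NULL if out of range or not valid. */
static const struct pasid_entry *
pasid_lookup(const struct pasid_entry *table, size_t nr, uint32_t ssid)
{
	if (ssid >= nr || !table[ssid].valid)
		return NULL;
	return &table[ssid];
}
```

In this model PASID 3 may well map to ASID 2, so the two numbers are
unrelated.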

> > So it unlikely has the
> > situation of two identical ASIDs if they are on the same vIOMMU,
> > because the ASID pool is per IOMMU instance (whether p or v).
> > 
> > With two vIOMMU instances, there might be the same ASIDs but they
> > will be tagged with different VMIDs.
> > 
> > > [1]
> > > https://lore.kernel.org/linux-iommu/4bc9bd20-5aae-440d-84fd-f530d0747c23@intel.com/
> > 
> > Is "gIOVA" a type of invalidation that only uses "address" out of
> > "PASID, DID and address"? I.e. PASID and DID are not provided via
> > the invalidation request, so it's going to broadcast all viommus?
> 
> gIOVA is just a term v.s. vSVA. Just want to differentiate it from vSVA. :)
> PASID and DID are still provided in the invalidation.

I am still not getting this gIOVA. What does it do exactly vs. vSVA?
And should RIDPASID be IOMMU_NO_PASID?

Nicolin
Alexey Kardashevskiy Oct. 1, 2024, 1:55 a.m. UTC | #19
On 11/9/24 17:08, Nicolin Chen wrote:
> On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Sent: Wednesday, August 28, 2024 1:00 AM
>>>
>> [...]
>>> On a multi-IOMMU system, the VIOMMU object can be instanced to the number
>>> of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
>>
>> Is there restriction that multiple vIOMMU objects can be only created
>> on a multi-IOMMU system?
> 
> I think it should be generally restricted to the number of pIOMMUs,
> although likely (not 100% sure) we could do multiple vIOMMUs on a
> single-pIOMMU system. Any reason for doing that?


Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?

On AMD, IOMMU is a pretend-pcie device, one per a rootport, manages a DT 
- device table, one entry per BDFn, the entry owns a queue. A slice of 
that can be passed to a VM (== queues mapped directly to the VM, and 
such IOMMU appears in the VM as a pretend-pcie device too). So what is 
[pv]IOMMU here? Thanks,


> 
>>> stage-2 IO pagetable. Each VIOMMU then just need to only allocate its own
>>> VMID to attach the shared stage-2 IO pagetable to the physical IOMMU:
>>
>> this reads like 'VMID' is a virtual ID allocated by vIOMMU. But from the
>> entire context it actually means the physical 'VMID' allocated on the
>> associated physical IOMMU, correct?
> 
> Quoting Jason's narratives, a VMID is a "Security namespace for
> guest owned ID". The allocation, using SMMU as an example, should
> be a part of vIOMMU instance allocation in the host SMMU driver.
> Then, this VMID will be used to mark the cache tags. So, it is
> still a software allocated ID, while HW would use it too.
> 
> Thanks
> Nicolin
Nicolin Chen Oct. 1, 2024, 3:36 a.m. UTC | #20
On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote:
> On 11/9/24 17:08, Nicolin Chen wrote:
> > On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, August 28, 2024 1:00 AM
> > > > 
> > > [...]
> > > > On a multi-IOMMU system, the VIOMMU object can be instanced to the number
> > > > of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
> > > 
> > > Is there restriction that multiple vIOMMU objects can be only created
> > > on a multi-IOMMU system?
> > 
> > I think it should be generally restricted to the number of pIOMMUs,
> > although likely (not 100% sure) we could do multiple vIOMMUs on a
> > single-pIOMMU system. Any reason for doing that?
> 
> 
> Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?
> 
> On AMD, IOMMU is a pretend-pcie device, one per a rootport, manages a DT
> - device table, one entry per BDFn, the entry owns a queue. A slice of
> that can be passed to a VM (== queues mapped directly to the VM, and
> such IOMMU appears in the VM as a pretend-pcie device too). So what is
> [pv]IOMMU here? Thanks,
 
The "p" stands for physical: the entire IOMMU unit/instance. In
the IOMMU subsystem terminology, it's a struct iommu_device. It
sounds like AMD would register one iommu device per rootport?

The "v" stands for virtual: a slice of the pIOMMU that could be
shared or passed through to a VM:
 - Intel IOMMU doesn't have passthrough queues, so it uses a
   shared queue (for invalidation). In this case, vIOMMU will
   be a pure SW structure for HW queue sharing (with the host
   machine and other VMs). That said, I think the channel (or
   the port) that Intel VT-d uses internally for a device to
   do a two-stage translation can be seen as a "passthrough"
   feature, held by a vIOMMU.
 - AMD IOMMU can assign passthrough queues to VMs, in which
   case, vIOMMU will be a structure holding all passthrough
   resource (of the pIOMMU) assigned to a VM. If there is a
   shared resource, it can be packed into the vIOMMU struct
   too. FYI, vQUEUE (future series) on the other hand will
   represent each passthrough queue in a vIOMMU struct. The
   VM then, per that specific pIOMMU (rootport?), will have
   one vIOMMU holding a number of vQUEUEs.
 - ARM SMMU is sort of in the middle, depending on the impls.
   vIOMMU will be a structure holding both passthrough and
   shared resource. It can define vQUEUEs, if the impl has
   passthrough queues like AMD does.

Allowing a vIOMMU to hold shared resource makes it a bit of an
upgraded model for IOMMU virtualization, from the existing HWPT
model that now looks like a subset of the vIOMMU model. 
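
One way to picture the three cases is a single structure where the
passthrough part is optional. This is a sketch under assumptions: the
names, the fixed queue limit and the fields are invented for the
example, and vQUEUE itself is only planned for a future series:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VQUEUES 4

/* Hypothetical physical IOMMU unit (cf. struct iommu_device). */
struct piommu {
	int id;
};

/* Hypothetical passthrough queue handed to a VM. */
struct vqueue {
	uint64_t guest_base;	/* queue base in guest address space */
	uint32_t size_log2;
};

/* Hypothetical vIOMMU: shared state plus zero or more passthrough queues. */
struct viommu_sketch {
	struct piommu *parent;	/* the pIOMMU this slice belongs to */
	void *shared;		/* SW-managed shared resource, may be NULL */
	size_t nr_vqueues;	/* 0 for a pure-SW (shared queue) vIOMMU */
	struct vqueue vqueues[MAX_VQUEUES];
};

/* Add a passthrough queue; returns its index, or -1 if the slice is full. */
static int viommu_add_vqueue(struct viommu_sketch *v, uint64_t base,
			     uint32_t size_log2)
{
	int idx;

	if (v->nr_vqueues >= MAX_VQUEUES)
		return -1;
	idx = (int)v->nr_vqueues;
	v->vqueues[idx] = (struct vqueue){ base, size_log2 };
	v->nr_vqueues++;
	return idx;
}
```

A vIOMMU with nr_vqueues == 0 corresponds to the shared-queue case,
while one with several vqueues corresponds to the passthrough case.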

Thanks
Nicolin
Alexey Kardashevskiy Oct. 1, 2024, 5:06 a.m. UTC | #21
On 1/10/24 13:36, Nicolin Chen wrote:
> On Tue, Oct 01, 2024 at 11:55:59AM +1000, Alexey Kardashevskiy wrote:
>> On 11/9/24 17:08, Nicolin Chen wrote:
>>> On Wed, Sep 11, 2024 at 06:12:21AM +0000, Tian, Kevin wrote:
>>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>>> Sent: Wednesday, August 28, 2024 1:00 AM
>>>>>
>>>> [...]
>>>>> On a multi-IOMMU system, the VIOMMU object can be instanced to the number
>>>>> of vIOMMUs in a guest VM, while holding the same parent HWPT to share the
>>>>
>>>> Is there restriction that multiple vIOMMU objects can be only created
>>>> on a multi-IOMMU system?
>>>
>>> I think it should be generally restricted to the number of pIOMMUs,
>>> although likely (not 100% sure) we could do multiple vIOMMUs on a
>>> single-pIOMMU system. Any reason for doing that?
>>
>>
>> Just to clarify the terminology here - what are pIOMMU and vIOMMU exactly?
>>
>> On AMD, IOMMU is a pretend-pcie device, one per a rootport, manages a DT
>> - device table, one entry per BDFn, the entry owns a queue. A slice of
>> that can be passed to a VM (== queues mapped directly to the VM, and
>> such IOMMU appears in the VM as a pretend-pcie device too). So what is
>> [pv]IOMMU here? Thanks,
>   
> The "p" stands for physical: the entire IOMMU unit/instance. In
> the IOMMU subsystem terminology, it's a struct iommu_device. It
> sounds like AMD would register one iommu device per rootport?

Yup, my test machine has 4 of these.


> The "v" stands for virtual: a slice of the pIOMMU that could be
> shared or passed through to a VM:
>   - Intel IOMMU doesn't have passthrough queues, so it uses a
>     shared queue (for invalidation). In this case, vIOMMU will
>     be a pure SW structure for HW queue sharing (with the host
>     machine and other VMs). That said, I think the channel (or
>     the port) that Intel VT-d uses internally for a device to
>     do a two-stage translation can be seen as a "passthrough"
>     feature, held by a vIOMMU.
>   - AMD IOMMU can assign passthrough queues to VMs, in which
>     case, vIOMMU will be a structure holding all passthrough
>     resource (of the pIOMMU) assisgned to a VM. If there is a
>     shared resource, it can be packed into the vIOMMU struct
>     too. FYI, vQUEUE (future series) on the other hand will
>     represent each passthrough queue in a vIOMMU struct. The
>     VM then, per that specific pIOMMU (rootport?), will have
>     one vIOMMU holding a number of vQUEUEs.
>   - ARM SMMU is sort of in the middle, depending on the impls.
>     vIOMMU will be a structure holding both passthrough and
>     shared resource. It can define vQUEUEs, if the impl has
>     passthrough queues like AMD does.
> 
> Allowing a vIOMMU to hold shared resource makes it a bit of an
> upgraded model for IOMMU virtualization, from the existing HWPT
> model that now looks like a subset of the vIOMMU model.

Thanks for confirming.

I've just read in this thread that "it should be generally restricted to
the number of pIOMMUs, although likely (not 100% sure) we could do
multiple vIOMMUs on a single-pIOMMU system. Any reason for doing that?" and
thought "we have every reason to do that, unless p means something
different", so I decided to ask :) Thanks,


> 
> Thanks
> Nicolin
Jason Gunthorpe Oct. 1, 2024, 1:44 p.m. UTC | #22
On Tue, Oct 01, 2024 at 03:06:57PM +1000, Alexey Kardashevskiy wrote:
> I've just read in this thread that "it should be generally restricted to the
> number of pIOMMUs, although likely (not 100% sure) we could do multiple
> vIOMMUs on a single-pIOMMU system. Any reason for doing that?"? thought "we
> have every reason to do that, unless p means something different", so I
> decided to ask :) Thanks,

I think that was intended as "multiple vIOMMUs per pIOMMU within a
single VM".

There would always be multiple vIOMMUs per pIOMMU across VMs/etc.

Jason
Jason Gunthorpe Oct. 1, 2024, 1:48 p.m. UTC | #23
On Sun, Sep 29, 2024 at 03:19:42PM +0800, Yi Liu wrote:
> > So their viommu HW concepts come along with a requirement that there
> > be a fixed translation for the VM, which we model by attaching a S2
> > HWPT to the VIOMMU object which get's linked into the IOMMU HW as
> > the translation for the queue memory.
> 
> Is the mapping of the S2 be static? or it an be unmapped per userspace?

In principle it should be dynamic, but I think the vCMDQ stuff will
struggle to do that

Jason
Nicolin Chen Oct. 1, 2024, 6:40 p.m. UTC | #24
On Tue, Oct 01, 2024 at 10:48:15AM -0300, Jason Gunthorpe wrote:
> On Sun, Sep 29, 2024 at 03:19:42PM +0800, Yi Liu wrote:
> > > So their viommu HW concepts come along with a requirement that there
> > > be a fixed translation for the VM, which we model by attaching a S2
> > > HWPT to the VIOMMU object which get's linked into the IOMMU HW as
> > > the translation for the queue memory.
> > 
> > Is the mapping of the S2 be static? or it an be unmapped per userspace?
> 
> In principle it should be dynamic, but I think the vCMDQ stuff will
> struggle to do that

Yea. vCMDQ HW requires the physical base address of a queue in
the VM's RAM space to be programmed. If the S2 mapping changes
(resulting in a different queue location in the physical memory),
the VMM should notify the kernel for a HW reconfiguration.
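
A toy model of that dependency, assuming the S2 is just a list of
contiguous ranges and the HW holds one programmed physical base per
queue (all names here are hypothetical; this is not the vCMDQ driver
logic):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical S2 range: guest physical address to host physical address. */
struct s2_range {
	uint64_t gpa;
	uint64_t pa;
	uint64_t len;
};

/* Translate a gpa through the S2 ranges; returns false if unmapped. */
static bool s2_translate(const struct s2_range *r, size_t nr,
			 uint64_t gpa, uint64_t *pa)
{
	for (size_t i = 0; i < nr; i++) {
		if (gpa >= r[i].gpa && gpa - r[i].gpa < r[i].len) {
			*pa = r[i].pa + (gpa - r[i].gpa);
			return true;
		}
	}
	return false;
}

/* Hypothetical queue state: guest base and the PA last programmed into HW. */
struct vcmdq_sketch {
	uint64_t guest_base;
	uint64_t hw_base_pa;
};

/* True if the programmed base no longer matches what the S2 resolves to. */
static bool vcmdq_needs_reconfig(const struct vcmdq_sketch *q,
				 const struct s2_range *r, size_t nr)
{
	uint64_t pa;

	if (!s2_translate(r, nr, q->guest_base, &pa))
		return true;
	return pa != q->hw_base_pa;
}
```

Because the HW keeps the resolved physical address rather than walking
the S2 for the queue base, any remap of that range leaves a stale value
until the kernel is told to reprogram it.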

I wonder what all the use cases are that can cause a shifting
of S2 mappings? VM migration? Any others?

Thanks
Nicolin