diff mbox series

[v2,06/19] iommufd/viommu: Add IOMMU_VIOMMU_SET/UNSET_VDEV_ID ioctl

Message ID 6348cc7a72ce9f2ac0e9caf9737e70177a01eb74.1724776335.git.nicolinc@nvidia.com
State New
Headers show
Series iommufd: Add VIOMMU infrastructure (Part-1) | expand

Commit Message

Nicolin Chen Aug. 27, 2024, 4:59 p.m. UTC
Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
that should be linked to a physical device id via an idev pointer.

Continue the support IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu.
Provide a lookup function for drivers to load device pointer by a virtual
device id.

Add a rw_semaphore protection around the vdev_id list. Any future ioctl
handlers that potentially access the list must grab the lock too.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          |  12 +++
 drivers/iommu/iommufd/iommufd_private.h |  21 ++++
 drivers/iommu/iommufd/main.c            |   6 ++
 drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
 include/uapi/linux/iommufd.h            |  40 ++++++++
 5 files changed, 200 insertions(+)

Comments

Jason Gunthorpe Sept. 5, 2024, 4:03 p.m. UTC | #1
On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
> Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> that should be linked to a physical device id via an idev pointer.

Given some of the other discussions around CC I suspect we should
rename these to 'create/destroy virtual device' with an eye that
eventually they would be extended like other ops with per-CC platform
data.

ie this would be the interface to tell the CC trusted world that a
secure device is being added to a VM with some additional flags..

Right now it only conveys the vRID parameter of the virtual device
being created.

A following question is if these objects should have their own IDs in
the iommufd space too, and then unset is not unset but just a normal
destroy object. If so then the thing you put in the ids xarray would
also just be a normal object struct.

This is probably worth doing if this is going to grow more CC stuff
later.

Jason
Tian, Kevin Sept. 11, 2024, 6:19 a.m. UTC | #2
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, September 6, 2024 4:15 AM
> 
> On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
> > > That being said, if we have a clear picture that in the long term
> > > we would extend it to hold more information, I think it could be
> > > a smart move.
> > >
> > > Perhaps virtual device can have its own "attach" to vIOMMU? Or
> > > would you still prefer attaching via proxy hwpt_nested?
> >
> > I was thinking just creating it against a vIOMMU is an effective
> > "attach" and the virtual device is permanently tied to the vIOMMU at
> > creation time.
> 
> Ah, right! The create is per-viommu, so it's being attached.
> 

presumably we also need check compatibility between the idev
which the virtual device is created against and the stage-2 pgtable,
as a normal attach required?
Nicolin Chen Sept. 11, 2024, 7:11 a.m. UTC | #3
On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, September 6, 2024 4:15 AM
> >
> > On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
> > > > That being said, if we have a clear picture that in the long term
> > > > we would extend it to hold more information, I think it could be
> > > > a smart move.
> > > >
> > > > Perhaps virtual device can have its own "attach" to vIOMMU? Or
> > > > would you still prefer attaching via proxy hwpt_nested?
> > >
> > > I was thinking just creating it against a vIOMMU is an effective
> > > "attach" and the virtual device is permanently tied to the vIOMMU at
> > > creation time.
> >
> > Ah, right! The create is per-viommu, so it's being attached.
> >
> 
> presumably we also need check compatibility between the idev
> which the virtual device is created against and the stage-2 pgtable,
> as a normal attach required?

If that's required, it can be a part of "create virtual device",
where idev and viommu (holding s2 hwpt) would be all available?

Thanks
Nicolin
Tian, Kevin Sept. 11, 2024, 7:18 a.m. UTC | #4
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, September 11, 2024 3:12 PM
> 
> On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Friday, September 6, 2024 4:15 AM
> > >
> > > On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
> > > > > That being said, if we have a clear picture that in the long term
> > > > > we would extend it to hold more information, I think it could be
> > > > > a smart move.
> > > > >
> > > > > Perhaps virtual device can have its own "attach" to vIOMMU? Or
> > > > > would you still prefer attaching via proxy hwpt_nested?
> > > >
> > > > I was thinking just creating it against a vIOMMU is an effective
> > > > "attach" and the virtual device is permanently tied to the vIOMMU at
> > > > creation time.
> > >
> > > Ah, right! The create is per-viommu, so it's being attached.
> > >
> >
> > presumably we also need check compatibility between the idev
> > which the virtual device is created against and the stage-2 pgtable,
> > as a normal attach required?
> 
> If that's required, it can be a part of "create virtual device",
> where idev and viommu (holding s2 hwpt) would be all available?
> 

yes
Nicolin Chen Sept. 11, 2024, 7:47 a.m. UTC | #5
On Wed, Sep 11, 2024 at 07:18:52AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, September 11, 2024 3:12 PM
> >
> > On Wed, Sep 11, 2024 at 06:19:10AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Friday, September 6, 2024 4:15 AM
> > > >
> > > > On Thu, Sep 05, 2024 at 02:43:26PM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Sep 05, 2024 at 10:37:45AM -0700, Nicolin Chen wrote:
> > > > > > That being said, if we have a clear picture that in the long term
> > > > > > we would extend it to hold more information, I think it could be
> > > > > > a smart move.
> > > > > >
> > > > > > Perhaps virtual device can have its own "attach" to vIOMMU? Or
> > > > > > would you still prefer attaching via proxy hwpt_nested?
> > > > >
> > > > > I was thinking just creating it against a vIOMMU is an effective
> > > > > "attach" and the virtual device is permanently tied to the vIOMMU at
> > > > > creation time.
> > > >
> > > > Ah, right! The create is per-viommu, so it's being attached.
> > > >
> > >
> > > presumably we also need check compatibility between the idev
> > > which the virtual device is created against and the stage-2 pgtable,
> > > as a normal attach required?
> >
> > If that's required, it can be a part of "create virtual device",
> > where idev and viommu (holding s2 hwpt) would be all available?
> >
> 
> yes

Oh, I misread your question actually. I think it's about a matching
validation between dev->iommu->iommu_dev and vIOMMU->iommu_dev.

Thanks
Nicolin
Mostafa Saleh Sept. 27, 2024, 1:50 p.m. UTC | #6
Hi Nicolin,

On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
> Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> that should be linked to a physical device id via an idev pointer.
> 
> Continue the support IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu.
> Provide a lookup function for drivers to load device pointer by a virtual
> device id.
> 
> Add a rw_semaphore protection around the vdev_id list. Any future ioctl
> handlers that potentially access the list must grab the lock too.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/iommufd/device.c          |  12 +++
>  drivers/iommu/iommufd/iommufd_private.h |  21 ++++
>  drivers/iommu/iommufd/main.c            |   6 ++
>  drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
>  include/uapi/linux/iommufd.h            |  40 ++++++++
>  5 files changed, 200 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 5fd3dd420290..3ad759971b32 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -136,6 +136,18 @@ void iommufd_device_destroy(struct iommufd_object *obj)
>  	struct iommufd_device *idev =
>  		container_of(obj, struct iommufd_device, obj);
>  
> +	/* Unlocked since there should be no race in a destroy() */
> +	if (idev->vdev_id) {
> +		struct iommufd_vdev_id *vdev_id = idev->vdev_id;
> +		struct iommufd_viommu *viommu = vdev_id->viommu;
> +		struct iommufd_vdev_id *old;
> +
> +		old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->id, vdev_id, NULL,
> +				 GFP_KERNEL);
> +		WARN_ON(old != vdev_id);
> +		kfree(vdev_id);
> +		idev->vdev_id = NULL;
> +	}
>  	iommu_device_release_dma_owner(idev->dev);
>  	iommufd_put_group(idev->igroup);
>  	if (!iommufd_selftest_is_mock_dev(idev->dev))
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 1f2a1c133b9a..2c6e168c5300 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -416,6 +416,7 @@ struct iommufd_device {
>  	struct iommufd_object obj;
>  	struct iommufd_ctx *ictx;
>  	struct iommufd_group *igroup;
> +	struct iommufd_vdev_id *vdev_id;
>  	struct list_head group_item;
>  	/* always the physical device */
>  	struct device *dev;
> @@ -533,11 +534,31 @@ struct iommufd_viommu {
>  	struct iommufd_ctx *ictx;
>  	struct iommufd_hwpt_paging *hwpt;
>  
> +	/* The locking order is vdev_ids_rwsem -> igroup::lock */
> +	struct rw_semaphore vdev_ids_rwsem;
> +	struct xarray vdev_ids;
> +
>  	unsigned int type;
>  };
>  
> +struct iommufd_vdev_id {
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	u64 id;
> +};
> +
> +static inline struct iommufd_viommu *
> +iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
> +{
> +	return container_of(iommufd_get_object(ucmd->ictx, id,
> +					       IOMMUFD_OBJ_VIOMMU),
> +			    struct iommufd_viommu, obj);
> +}
> +
>  int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
>  void iommufd_viommu_destroy(struct iommufd_object *obj);
> +int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd);
> +int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd);
>  
>  #ifdef CONFIG_IOMMUFD_TEST
>  int iommufd_test(struct iommufd_ucmd *ucmd);
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 288ee51b6829..199ad90fa36b 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -334,6 +334,8 @@ union ucmd_buffer {
>  	struct iommu_option option;
>  	struct iommu_vfio_ioas vfio_ioas;
>  	struct iommu_viommu_alloc viommu;
> +	struct iommu_viommu_set_vdev_id set_vdev_id;
> +	struct iommu_viommu_unset_vdev_id unset_vdev_id;
>  #ifdef CONFIG_IOMMUFD_TEST
>  	struct iommu_test_cmd test;
>  #endif
> @@ -387,6 +389,10 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>  		 __reserved),
>  	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>  		 struct iommu_viommu_alloc, out_viommu_id),
> +	IOCTL_OP(IOMMU_VIOMMU_SET_VDEV_ID, iommufd_viommu_set_vdev_id,
> +		 struct iommu_viommu_set_vdev_id, vdev_id),
> +	IOCTL_OP(IOMMU_VIOMMU_UNSET_VDEV_ID, iommufd_viommu_unset_vdev_id,
> +		 struct iommu_viommu_unset_vdev_id, vdev_id),
>  #ifdef CONFIG_IOMMUFD_TEST
>  	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>  #endif
> diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
> index 200653a4bf57..8ffcd72b16b8 100644
> --- a/drivers/iommu/iommufd/viommu.c
> +++ b/drivers/iommu/iommufd/viommu.c
> @@ -8,6 +8,15 @@ void iommufd_viommu_destroy(struct iommufd_object *obj)
>  {
>  	struct iommufd_viommu *viommu =
>  		container_of(obj, struct iommufd_viommu, obj);
> +	struct iommufd_vdev_id *vdev_id;
> +	unsigned long index;
> +
> +	xa_for_each(&viommu->vdev_ids, index, vdev_id) {
> +		/* Unlocked since there should be no race in a destroy() */
> +		vdev_id->idev->vdev_id = NULL;
> +		kfree(vdev_id);
> +	}
> +	xa_destroy(&viommu->vdev_ids);
>  
>  	refcount_dec(&viommu->hwpt->common.obj.users);
>  }
> @@ -53,6 +62,9 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
>  	viommu->ictx = ucmd->ictx;
>  	viommu->hwpt = hwpt_paging;
>  
> +	xa_init(&viommu->vdev_ids);
> +	init_rwsem(&viommu->vdev_ids_rwsem);
> +
>  	refcount_inc(&viommu->hwpt->common.obj.users);
>  
>  	cmd->out_viommu_id = viommu->obj.id;
> @@ -70,3 +82,112 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
>  	iommufd_put_object(ucmd->ictx, &idev->obj);
>  	return rc;
>  }
> +
> +int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
> +	struct iommufd_vdev_id *vdev_id, *curr;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	int rc = 0;
> +
> +	if (cmd->vdev_id > ULONG_MAX)
> +		return -EINVAL;
> +
> +	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +	if (IS_ERR(viommu))
> +		return PTR_ERR(viommu);
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_put_viommu;
> +	}
> +
> +	down_write(&viommu->vdev_ids_rwsem);
> +	mutex_lock(&idev->igroup->lock);
> +	if (idev->vdev_id) {
> +		rc = -EEXIST;
> +		goto out_unlock_igroup;
> +	}
> +
> +	vdev_id = kzalloc(sizeof(*vdev_id), GFP_KERNEL);
> +	if (!vdev_id) {
> +		rc = -ENOMEM;
> +		goto out_unlock_igroup;
> +	}
> +
> +	vdev_id->idev = idev;
> +	vdev_id->viommu = viommu;
> +	vdev_id->id = cmd->vdev_id;

My understanding of IOMMUFD is very little, but AFAICT, that means that
it’s assumed that each device can only have one stream ID(RID)?

As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it
converts the virtual ID to a physical one using master->streams[0].id.

Is that correct or am I missing something?

As I am looking at similar problem for paravirtual IOMMU with pKVM, where
the UAPI would be something similar to:

	GET_NUM_END_POINTS(dev) => nr_sids

	SET_END_POINT_VSID(dev, sid_index, vsid)

Similar to what VFIO does with IRQs.

As a device can have many SIDs.

Thanks,
Mostafa

> +
> +	curr = xa_cmpxchg(&viommu->vdev_ids, cmd->vdev_id, NULL, vdev_id,
> +			  GFP_KERNEL);
> +	if (curr) {
> +		rc = xa_err(curr) ? : -EBUSY;
> +		goto out_free;
> +	}
> +
> +	idev->vdev_id = vdev_id;
> +	goto out_unlock_igroup;
> +
> +out_free:
> +	kfree(vdev_id);
> +out_unlock_igroup:
> +	mutex_unlock(&idev->igroup->lock);
> +	up_write(&viommu->vdev_ids_rwsem);
> +	iommufd_put_object(ucmd->ictx, &idev->obj);
> +out_put_viommu:
> +	iommufd_put_object(ucmd->ictx, &viommu->obj);
> +	return rc;
> +}
> +
> +int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_unset_vdev_id *cmd = ucmd->cmd;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_vdev_id *old;
> +	struct iommufd_device *idev;
> +	int rc = 0;
> +
> +	if (cmd->vdev_id > ULONG_MAX)
> +		return -EINVAL;
> +
> +	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +	if (IS_ERR(viommu))
> +		return PTR_ERR(viommu);
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_put_viommu;
> +	}
> +
> +	down_write(&viommu->vdev_ids_rwsem);
> +	mutex_lock(&idev->igroup->lock);
> +	if (!idev->vdev_id) {
> +		rc = -ENOENT;
> +		goto out_unlock_igroup;
> +	}
> +	if (idev->vdev_id->id != cmd->vdev_id) {
> +		rc = -EINVAL;
> +		goto out_unlock_igroup;
> +	}
> +
> +	old = xa_cmpxchg(&viommu->vdev_ids, idev->vdev_id->id,
> +			 idev->vdev_id, NULL, GFP_KERNEL);
> +	if (xa_is_err(old)) {
> +		rc = xa_err(old);
> +		goto out_unlock_igroup;
> +	}
> +	kfree(old);
> +	idev->vdev_id = NULL;
> +
> +out_unlock_igroup:
> +	mutex_unlock(&idev->igroup->lock);
> +	up_write(&viommu->vdev_ids_rwsem);
> +	iommufd_put_object(ucmd->ictx, &idev->obj);
> +out_put_viommu:
> +	iommufd_put_object(ucmd->ictx, &viommu->obj);
> +	return rc;
> +}
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 51ce6a019c34..1816e89c922d 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -52,6 +52,8 @@ enum {
>  	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
>  	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
>  	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
> +	IOMMUFD_CMD_VIOMMU_SET_VDEV_ID = 0x90,
> +	IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID = 0x91,
>  };
>  
>  /**
> @@ -882,4 +884,42 @@ struct iommu_viommu_alloc {
>  	__u32 out_viommu_id;
>  };
>  #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
> +
> +/**
> + * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
> + * @size: sizeof(struct iommu_viommu_set_vdev_id)
> + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> + * @dev_id: device ID to set its virtual ID
> + * @__reserved: Must be 0
> + * @vdev_id: Virtual device ID
> + *
> + * Set a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_set_vdev_id {
> +	__u32 size;
> +	__u32 viommu_id;
> +	__u32 dev_id;
> +	__u32 __reserved;
> +	__aligned_u64 vdev_id;
> +};
> +#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
> +
> +/**
> + * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
> + * @size: sizeof(struct iommu_viommu_unset_vdev_id)
> + * @viommu_id: viommu ID associated with the device to delete its virtual ID
> + * @dev_id: device ID to unset its virtual ID
> + * @__reserved: Must be 0
> + * @vdev_id: Virtual device ID (for verification)
> + *
> + * Unset a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_unset_vdev_id {
> +	__u32 size;
> +	__u32 viommu_id;
> +	__u32 dev_id;
> +	__u32 __reserved;
> +	__aligned_u64 vdev_id;
> +};
> +#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
>  #endif
> -- 
> 2.43.0
>
Jason Gunthorpe Sept. 27, 2024, 2:01 p.m. UTC | #7
On Fri, Sep 27, 2024 at 01:50:52PM +0000, Mostafa Saleh wrote:

> My understanding of IOMMUFD is very little, but AFAICT, that means that
> it’s assumed that each device can only have one stream ID(RID)?
> 
> As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it
> converts the virtual ID to a physical one using master->streams[0].id.
> 
> Is that correct or am I missing something?
> 
> As I am looking at similar problem for paravirtual IOMMU with pKVM, where
> the UAPI would be something similar to:
> 
> 	GET_NUM_END_POINTS(dev) => nr_sids
> 
> 	SET_END_POINT_VSID(dev, sid_index, vsid)
> 
> Similar to what VFIO does with IRQs.
> 
> As a device can have many SIDs.

We don't support multi SID through this interface, at least in this
version.

To do multi-sid you have to inform the VM of all the different pSIDs
the device has and then setup the vSID/pSID translation to map them
all to the HW invalidation logic.

Which is alot more steps, and we have no use case right now. Multi-sid
is also not something I expect to see in any modern PCI device, and
this is VFIO PCI...

Jason
Mostafa Saleh Sept. 27, 2024, 2:22 p.m. UTC | #8
On Fri, Sep 27, 2024 at 11:01:41AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 27, 2024 at 01:50:52PM +0000, Mostafa Saleh wrote:
> 
> > My understanding of IOMMUFD is very little, but AFAICT, that means that
> > it’s assumed that each device can only have one stream ID(RID)?
> > 
> > As I can see in patch 17 in arm_smmu_convert_viommu_vdev_id(), it
> > converts the virtual ID to a physical one using master->streams[0].id.
> > 
> > Is that correct or am I missing something?
> > 
> > As I am looking at similar problem for paravirtual IOMMU with pKVM, where
> > the UAPI would be something similar to:
> > 
> > 	GET_NUM_END_POINTS(dev) => nr_sids
> > 
> > 	SET_END_POINT_VSID(dev, sid_index, vsid)
> > 
> > Similar to what VFIO does with IRQs.
> > 
> > As a device can have many SIDs.
> 
> We don't support multi SID through this interface, at least in this
> version.
> 
> To do multi-sid you have to inform the VM of all the different pSIDs
> the device has and then setup the vSID/pSID translation to map them
> all to the HW invalidation logic.

Why would the VM need to know the pSID? The way I view this is quite close to
how irq works, the VM only views the GSI which is the virtualized number.
The VMM then would need to configure vSID->pSID translation, also without
knowing the actual pSID, just how many SIDs are there per-device; very similar to
how it configures IRQs through
VFIO_DEVICE_GET_INFO/VFIO_DEVICE_SET_IRQS.

And as long as we only allow 1:1 vSID to pSID mapping, I guess it would be
easy to implement.

> 
> Which is alot more steps, and we have no use case right now. Multi-sid
> is also not something I expect to see in any modern PCI device, and
> this is VFIO PCI...
> 

Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover
platform devices (VFIO-platform) or am I missing something?

And multi-SIDs is common in platform devices and this would be quite
restricting, and I was hoping to support the pKVM vIOMMU through IOMMUFD
interface.

If possible, can the UAPI be designed with this in mind, even if not
implemented now?

Thanks,
Mostafa

> Jason
Jason Gunthorpe Sept. 27, 2024, 2:58 p.m. UTC | #9
On Fri, Sep 27, 2024 at 02:22:20PM +0000, Mostafa Saleh wrote:

> > We don't support multi SID through this interface, at least in this
> > version.
> > 
> > To do multi-sid you have to inform the VM of all the different pSIDs
> > the device has and then setup the vSID/pSID translation to map them
> > all to the HW invalidation logic.
> 
> Why would the VM need to know the pSID? 

It doesn't need to know the pSID exactly, but it needs to know all the
pSIDs that exist and have them be labeled with vSIDs.

With cmdq direct assignment the VM has to issue an ATS invalidation
for each and every physical device using its vSID. There is no HW path
to handle a 1:N v/p SID relationship.

> Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover
> platform devices (VFIO-platform) or am I missing something?

It does work with platform, but AFAIK nobody is interested in that so
it hasn't been any focus. Enabling multi-sid nesting, sub stream ids,
etc is some additional work I think.

But what I mean is the really hard use case for the vSID/pSID mapping
is ATS invalidation and you won't use ATS invalidation on platform so
multi-sid likely doesn't matter.

STE/CD invalidation could possibly be pushed down through the
per-domain ioctl and replicated to all domain attachments. We don't
have code in the series to do that, but it could work from a uAPI
perspective.

> If possible, can the UAPI be designed with this in mind, even if not
> implemented now?

It is reasonable to ask. I think things are extensible enough. I'd
imagine we can add a flag 'secondary ID' and then a new field
'secondary ID index' to the vdev operations when someone wants to take
this on.

Jason
Mostafa Saleh Sept. 27, 2024, 3:59 p.m. UTC | #10
On Fri, Sep 27, 2024 at 3:58 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Sep 27, 2024 at 02:22:20PM +0000, Mostafa Saleh wrote:
>
> > > We don't support multi SID through this interface, at least in this
> > > version.
> > >
> > > To do multi-sid you have to inform the VM of all the different pSIDs
> > > the device has and then setup the vSID/pSID translation to map them
> > > all to the HW invalidation logic.
> >
> > Why would the VM need to know the pSID?
>
> It doesn't need to know the pSID exactly, but it needs to know all the
> pSIDs that exist and have them be labeled with vSIDs.
>
> With cmdq direct assignment the VM has to issue an ATS invalidation
> for each and every physical device using its vSID. There is no HW path
> to handle a 1:N v/p SID relationship.
>

I see, that's for the cmdq assignment.

> > Ah, I thought IOMMUFD would be used instead of VFIO_TYPE1*, which should cover
> > platform devices (VFIO-platform) or am I missing something?
>
> It does work with platform, but AFAIK nobody is interested in that so
> it hasn't been any focus. Enabling multi-sid nesting, sub stream ids,
> etc is some additional work I think.
>
> But what I mean is the really hard use case for the vSID/pSID mapping
> is ATS invalidation and you won't use ATS invalidation on platform so
> multi-sid likely doesn't matter.
>
> STE/CD invalidation could possibly be pushed down through the
> per-domain ioctl and replicated to all domain attachments. We don't
> have code in the series to do that, but it could work from a uAPI
> perspective.
>
> > If possible, can the UAPI be designed with this in mind, even if not
> > implemented now?
>
> It is reasonable to ask. I think things are extensible enough. I'd
> imagine we can add a flag 'secondary ID' and then a new field
> 'secondary ID index' to the vdev operations when someone wants to take
> this on.
>

Makes sense, I can take this when I start doing the pKVM work with
IOMMUFD, in case it wasn't supported by then.

Thanks,
Mostafa

> Jason
Nicolin Chen Oct. 1, 2024, 8:54 a.m. UTC | #11
On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
> On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
> > > Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> > > that should be linked to a physical device id via an idev pointer.
> > 
> > Given some of the other discussions around CC I suspect we should
> > rename these to 'create/destroy virtual device' with an eye that
> > eventually they would be extended like other ops with per-CC platform
> > data.
> > 
> > ie this would be the interface to tell the CC trusted world that a
> > secure device is being added to a VM with some additional flags..
> > 
> > Right now it only conveys the vRID parameter of the virtual device
> > being created.
> >
> > A following question is if these objects should have their own IDs in
> > the iommufd space too, and then unset is not unset but just a normal
> > destroy object. If so then the thing you put in the ids xarray would
> > also just be a normal object struct.

I found that adding it as a new object makes things a lot of easier
since a vdevice can take refcounts of both viommu and idev. So both
destroy() callbacks wouldn't be bothered.

While confirming if I am missing something from the review comments,
I am not quite sure what is "the thing you put in the ids xarray"..
I only added a vRID xarray per viommu, yet that doesn't seem to be
able to merge into the normal object struct. Mind elaborating?

Thanks
Nicolin

> > This is probably worth doing if this is going to grow more CC stuff
> > later.
> 
> Having to admit that I have been struggling to find a better name
> than set_vdev_id, I also thought about something similar to that
> "create/destroy virtual device', yet was not that confident since
> we only have virtual device ID in its data structure. Also, the
> virtual device sounds a bit confusing, given we already have idev.
> 
> That being said, if we have a clear picture that in the long term
> we would extend it to hold more information, I think it could be
> a smart move.
> 
> Perhaps virtual device can have its own "attach" to vIOMMU? Or
> would you still prefer attaching via proxy hwpt_nested?
> 
> Thanks
> Nicolin
Jason Gunthorpe Oct. 1, 2024, 1:46 p.m. UTC | #12
On Tue, Oct 01, 2024 at 01:54:05AM -0700, Nicolin Chen wrote:
> On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
> > On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
> > > > Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> > > > that should be linked to a physical device id via an idev pointer.
> > > 
> > > Given some of the other discussions around CC I suspect we should
> > > rename these to 'create/destroy virtual device' with an eye that
> > > eventually they would be extended like other ops with per-CC platform
> > > data.
> > > 
> > > ie this would be the interface to tell the CC trusted world that a
> > > secure device is being added to a VM with some additional flags..
> > > 
> > > Right now it only conveys the vRID parameter of the virtual device
> > > being created.
> > >
> > > A following question is if these objects should have their own IDs in
> > > the iommufd space too, and then unset is not unset but just a normal
> > > destroy object. If so then the thing you put in the ids xarray would
> > > also just be a normal object struct.
> 
> I found that adding it as a new object makes things a lot of easier
> since a vdevice can take refcounts of both viommu and idev. So both
> destroy() callbacks wouldn't be bothered.
> 
> While confirming if I am missing something from the review comments,
> I am not quite sure what is "the thing you put in the ids xarray"..
> I only added a vRID xarray per viommu, yet that doesn't seem to be
> able to merge into the normal object struct. Mind elaborating?

I would think to point the vRID xarray directly to the new iommufd
vdevice object

Jason
Nicolin Chen Oct. 1, 2024, 6:45 p.m. UTC | #13
On Tue, Oct 01, 2024 at 10:46:20AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 01, 2024 at 01:54:05AM -0700, Nicolin Chen wrote:
> > On Thu, Sep 05, 2024 at 10:38:23AM -0700, Nicolin Chen wrote:
> > > On Thu, Sep 05, 2024 at 01:03:53PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 27, 2024 at 09:59:43AM -0700, Nicolin Chen wrote:
> > > > > Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> > > > > that should be linked to a physical device id via an idev pointer.
> > > > 
> > > > Given some of the other discussions around CC I suspect we should
> > > > rename these to 'create/destroy virtual device' with an eye that
> > > > eventually they would be extended like other ops with per-CC platform
> > > > data.
> > > > 
> > > > ie this would be the interface to tell the CC trusted world that a
> > > > secure device is being added to a VM with some additional flags..
> > > > 
> > > > Right now it only conveys the vRID parameter of the virtual device
> > > > being created.
> > > >
> > > > A following question is if these objects should have their own IDs in
> > > > the iommufd space too, and then unset is not unset but just a normal
> > > > destroy object. If so then the thing you put in the ids xarray would
> > > > also just be a normal object struct.
> > 
> > I found that adding it as a new object makes things a lot of easier
> > since a vdevice can take refcounts of both viommu and idev. So both
> > destroy() callbacks wouldn't be bothered.
> > 
> > While confirming if I am missing something from the review comments,
> > I am not quite sure what is "the thing you put in the ids xarray"..
> > I only added a vRID xarray per viommu, yet that doesn't seem to be
> > able to merge into the normal object struct. Mind elaborating?
> 
> I would think to point the vRID xarray directly to the new iommufd
> vdevice object

Oh, I think I already have that then:
ictx->xarray: objIds <-> { IOAS|HWPT|VIOMMU|VDEVICE|.. }->objs
viommu->xarray: vRids <-> { VDEVICE } pointers

Thanks
Nicolin
Alexey Kardashevskiy Oct. 4, 2024, 4:32 a.m. UTC | #14
On 28/8/24 02:59, Nicolin Chen wrote:
> Introduce a pair of new ioctls to set/unset a per-viommu virtual device id
> that should be linked to a physical device id via an idev pointer.
> 
> Continue the support IOMMU_VIOMMU_TYPE_DEFAULT for a core-managed viommu.
> Provide a lookup function for drivers to load device pointer by a virtual
> device id.
> 
> Add a rw_semaphore protection around the vdev_id list. Any future ioctl
> handlers that potentially access the list must grab the lock too.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/iommufd/device.c          |  12 +++
>   drivers/iommu/iommufd/iommufd_private.h |  21 ++++
>   drivers/iommu/iommufd/main.c            |   6 ++
>   drivers/iommu/iommufd/viommu.c          | 121 ++++++++++++++++++++++++
>   include/uapi/linux/iommufd.h            |  40 ++++++++
>   5 files changed, 200 insertions(+)
> 
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 5fd3dd420290..3ad759971b32 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -136,6 +136,18 @@ void iommufd_device_destroy(struct iommufd_object *obj)
>   	struct iommufd_device *idev =
>   		container_of(obj, struct iommufd_device, obj);
>   
> +	/* Unlocked since there should be no race in a destroy() */
> +	if (idev->vdev_id) {
> +		struct iommufd_vdev_id *vdev_id = idev->vdev_id;
> +		struct iommufd_viommu *viommu = vdev_id->viommu;
> +		struct iommufd_vdev_id *old;
> +
> +		old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->id, vdev_id, NULL,
> +				 GFP_KERNEL);
> +		WARN_ON(old != vdev_id);
> +		kfree(vdev_id);
> +		idev->vdev_id = NULL;
> +	}
>   	iommu_device_release_dma_owner(idev->dev);
>   	iommufd_put_group(idev->igroup);
>   	if (!iommufd_selftest_is_mock_dev(idev->dev))
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 1f2a1c133b9a..2c6e168c5300 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -416,6 +416,7 @@ struct iommufd_device {
>   	struct iommufd_object obj;
>   	struct iommufd_ctx *ictx;
>   	struct iommufd_group *igroup;
> +	struct iommufd_vdev_id *vdev_id;
>   	struct list_head group_item;
>   	/* always the physical device */
>   	struct device *dev;
> @@ -533,11 +534,31 @@ struct iommufd_viommu {
>   	struct iommufd_ctx *ictx;
>   	struct iommufd_hwpt_paging *hwpt;
>   
> +	/* The locking order is vdev_ids_rwsem -> igroup::lock */
> +	struct rw_semaphore vdev_ids_rwsem;
> +	struct xarray vdev_ids;
> +
>   	unsigned int type;
>   };
>   
> +struct iommufd_vdev_id {
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	u64 id;
> +};
> +
> +static inline struct iommufd_viommu *
> +iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
> +{
> +	return container_of(iommufd_get_object(ucmd->ictx, id,
> +					       IOMMUFD_OBJ_VIOMMU),
> +			    struct iommufd_viommu, obj);
> +}
> +
>   int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
>   void iommufd_viommu_destroy(struct iommufd_object *obj);
> +int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd);
> +int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd);
>   
>   #ifdef CONFIG_IOMMUFD_TEST
>   int iommufd_test(struct iommufd_ucmd *ucmd);
> diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
> index 288ee51b6829..199ad90fa36b 100644
> --- a/drivers/iommu/iommufd/main.c
> +++ b/drivers/iommu/iommufd/main.c
> @@ -334,6 +334,8 @@ union ucmd_buffer {
>   	struct iommu_option option;
>   	struct iommu_vfio_ioas vfio_ioas;
>   	struct iommu_viommu_alloc viommu;
> +	struct iommu_viommu_set_vdev_id set_vdev_id;
> +	struct iommu_viommu_unset_vdev_id unset_vdev_id;
>   #ifdef CONFIG_IOMMUFD_TEST
>   	struct iommu_test_cmd test;
>   #endif
> @@ -387,6 +389,10 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
>   		 __reserved),
>   	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
>   		 struct iommu_viommu_alloc, out_viommu_id),
> +	IOCTL_OP(IOMMU_VIOMMU_SET_VDEV_ID, iommufd_viommu_set_vdev_id,
> +		 struct iommu_viommu_set_vdev_id, vdev_id),
> +	IOCTL_OP(IOMMU_VIOMMU_UNSET_VDEV_ID, iommufd_viommu_unset_vdev_id,
> +		 struct iommu_viommu_unset_vdev_id, vdev_id),
>   #ifdef CONFIG_IOMMUFD_TEST
>   	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
>   #endif
> diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
> index 200653a4bf57..8ffcd72b16b8 100644
> --- a/drivers/iommu/iommufd/viommu.c
> +++ b/drivers/iommu/iommufd/viommu.c
> @@ -8,6 +8,15 @@ void iommufd_viommu_destroy(struct iommufd_object *obj)
>   {
>   	struct iommufd_viommu *viommu =
>   		container_of(obj, struct iommufd_viommu, obj);
> +	struct iommufd_vdev_id *vdev_id;
> +	unsigned long index;
> +
> +	xa_for_each(&viommu->vdev_ids, index, vdev_id) {
> +		/* Unlocked since there should be no race in a destroy() */
> +		vdev_id->idev->vdev_id = NULL;
> +		kfree(vdev_id);
> +	}
> +	xa_destroy(&viommu->vdev_ids);
>   
>   	refcount_dec(&viommu->hwpt->common.obj.users);
>   }
> @@ -53,6 +62,9 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
>   	viommu->ictx = ucmd->ictx;
>   	viommu->hwpt = hwpt_paging;
>   
> +	xa_init(&viommu->vdev_ids);
> +	init_rwsem(&viommu->vdev_ids_rwsem);
> +
>   	refcount_inc(&viommu->hwpt->common.obj.users);
>   
>   	cmd->out_viommu_id = viommu->obj.id;
> @@ -70,3 +82,112 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
>   	iommufd_put_object(ucmd->ictx, &idev->obj);
>   	return rc;
>   }
> +
> +int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
> +	struct iommufd_vdev_id *vdev_id, *curr;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	int rc = 0;
> +
> +	if (cmd->vdev_id > ULONG_MAX)
> +		return -EINVAL;
> +
> +	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +	if (IS_ERR(viommu))
> +		return PTR_ERR(viommu);
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_put_viommu;
> +	}
> +
> +	down_write(&viommu->vdev_ids_rwsem);
> +	mutex_lock(&idev->igroup->lock);
> +	if (idev->vdev_id) {
> +		rc = -EEXIST;
> +		goto out_unlock_igroup;
> +	}
> +
> +	vdev_id = kzalloc(sizeof(*vdev_id), GFP_KERNEL);
> +	if (!vdev_id) {
> +		rc = -ENOMEM;
> +		goto out_unlock_igroup;
> +	}
> +
> +	vdev_id->idev = idev;
> +	vdev_id->viommu = viommu;
> +	vdev_id->id = cmd->vdev_id;
> +
> +	curr = xa_cmpxchg(&viommu->vdev_ids, cmd->vdev_id, NULL, vdev_id,
> +			  GFP_KERNEL);
> +	if (curr) {
> +		rc = xa_err(curr) ? : -EBUSY;
> +		goto out_free;
> +	}
> +
> +	idev->vdev_id = vdev_id;
> +	goto out_unlock_igroup;
> +
> +out_free:
> +	kfree(vdev_id);
> +out_unlock_igroup:
> +	mutex_unlock(&idev->igroup->lock);
> +	up_write(&viommu->vdev_ids_rwsem);
> +	iommufd_put_object(ucmd->ictx, &idev->obj);
> +out_put_viommu:
> +	iommufd_put_object(ucmd->ictx, &viommu->obj);
> +	return rc;
> +}
> +
> +int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_unset_vdev_id *cmd = ucmd->cmd;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_vdev_id *old;
> +	struct iommufd_device *idev;
> +	int rc = 0;
> +
> +	if (cmd->vdev_id > ULONG_MAX)
> +		return -EINVAL;
> +
> +	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +	if (IS_ERR(viommu))
> +		return PTR_ERR(viommu);
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
> +	if (IS_ERR(idev)) {
> +		rc = PTR_ERR(idev);
> +		goto out_put_viommu;
> +	}
> +
> +	down_write(&viommu->vdev_ids_rwsem);
> +	mutex_lock(&idev->igroup->lock);
> +	if (!idev->vdev_id) {
> +		rc = -ENOENT;
> +		goto out_unlock_igroup;
> +	}
> +	if (idev->vdev_id->id != cmd->vdev_id) {
> +		rc = -EINVAL;
> +		goto out_unlock_igroup;
> +	}
> +
> +	old = xa_cmpxchg(&viommu->vdev_ids, idev->vdev_id->id,
> +			 idev->vdev_id, NULL, GFP_KERNEL);
> +	if (xa_is_err(old)) {
> +		rc = xa_err(old);
> +		goto out_unlock_igroup;
> +	}
> +	kfree(old);
> +	idev->vdev_id = NULL;
> +
> +out_unlock_igroup:
> +	mutex_unlock(&idev->igroup->lock);
> +	up_write(&viommu->vdev_ids_rwsem);
> +	iommufd_put_object(ucmd->ictx, &idev->obj);
> +out_put_viommu:
> +	iommufd_put_object(ucmd->ictx, &viommu->obj);
> +	return rc;
> +}
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 51ce6a019c34..1816e89c922d 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -52,6 +52,8 @@ enum {
>   	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
>   	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
>   	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
> +	IOMMUFD_CMD_VIOMMU_SET_VDEV_ID = 0x90,
> +	IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID = 0x91,
>   };
>   
>   /**
> @@ -882,4 +884,42 @@ struct iommu_viommu_alloc {
>   	__u32 out_viommu_id;
>   };
>   #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
> +
> +/**
> + * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
> + * @size: sizeof(struct iommu_viommu_set_vdev_id)
> + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> + * @dev_id: device ID to set its virtual ID
> + * @__reserved: Must be 0
> + * @vdev_id: Virtual device ID
> + *
> + * Set a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_set_vdev_id {
> +	__u32 size;
> +	__u32 viommu_id;
> +	__u32 dev_id;

Is this ID from vfio_device_bind_iommufd.out_devid?

> +	__u32 __reserved;
> +	__aligned_u64 vdev_id;

What is the nature of this id? It is not the guest's BDFn, is it? The 
code suggests it is ARM's "SID" == "stream ID" and "a device might be 
able to generate multiple StreamIDs" (how, why?) 🤯 And these streams 
seem to have nothing to do with PCIe IDE streams, right?

For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel 
interface to pass the guest's BDFs for a specific host device (which is 
passed through) and nothing in the kernel has any knowledge of it atm, 
is this the right place, or another ioctl() is needed here?

Sorry, I am too ignorant about ARM :)


> +};
> +#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
> +
> +/**
> + * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
> + * @size: sizeof(struct iommu_viommu_unset_vdev_id)
> + * @viommu_id: viommu ID associated with the device to delete its virtual ID
> + * @dev_id: device ID to unset its virtual ID
> + * @__reserved: Must be 0
> + * @vdev_id: Virtual device ID (for verification)
> + *
> + * Unset a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_unset_vdev_id {
> +	__u32 size;
> +	__u32 viommu_id;
> +	__u32 dev_id;
> +	__u32 __reserved;
> +	__aligned_u64 vdev_id;
> +};
> +#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
>   #endif

Nit: "git format-patch -O orderfile" makes patches nicer by putting the 
documentation first (.h before .c, in this case) with the "ordefile" 
looking like this:

===
*.txt
configure
*Makefile*
*.json
*.h
*.c
===

Thanks,
Nicolin Chen Oct. 4, 2024, 5:33 a.m. UTC | #15
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
> > +/**
> > + * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
> > + * @size: sizeof(struct iommu_viommu_set_vdev_id)
> > + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> > + * @dev_id: device ID to set its virtual ID
> > + * @__reserved: Must be 0
> > + * @vdev_id: Virtual device ID
> > + *
> > + * Set a viommu-specific virtual ID of a device
> > + */
> > +struct iommu_viommu_set_vdev_id {
> > +     __u32 size;
> > +     __u32 viommu_id;
> > +     __u32 dev_id;
> 
> Is this ID from vfio_device_bind_iommufd.out_devid?

Yes.

> > +     __u32 __reserved;
> > +     __aligned_u64 vdev_id;
> 
> What is the nature of this id? It is not the guest's BDFn, is it? The

Not exactly but certainly can be related. Explaining below..

> code suggests it is ARM's "SID" == "stream ID" and "

Yes. That's the first use case of that.

> a device might be
> able to generate multiple StreamIDs" (how, why?) 🤯 And these streams
> seem to have nothing to do with PCIe IDE streams, right?

PCI device only has one stream ID per its SMMU.

So the Stream ID is more like a channel ID or client ID from the
SMMU (IOMMU) view. A PCI device's Stream ID can be calculated from
the BDF numbers + the Stream-ID base of that PCI bus.

That said, this is all about IOMMU. So, it is likely more natural
to forward an IOMMU-specific ID (vStream ID for a vSMMU) v.s. BDF.

> For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel
> interface to pass the guest's BDFs for a specific host device (which is
> passed through) and nothing in the kernel has any knowledge of it atm,
> is this the right place, or another ioctl() is needed here?
> 
> Sorry, I am too ignorant about ARM :)

We are reworking this ioctl to an IOMMU_VDEVICE_ALLOC cmd, meaning
a virtual device allocation. A virtual device is another bond when
an iommufd_device connects to an iommufd_viommu in the VM. The name
"vDEVICE" and "virtual device" still need to go through discussion,
so they aren't finalized. But the idea here is to have a structure
to gather all virtualization information of the intersection of the
device and the vIOMMU in the VM.

On the other hand, BDF is very PCI specific yet IOMMU independent.
E.g. it could exist for a PCI device even without a vIOMMU in the
VM, i.e. there is no vDEVICE in such case. Right?

So, if your use case relies on IOMMU and it is even a part of the
IOMMU virtualization features, I think you are looking at the right
place. And we should discuss how to incorporate that. Otherwise, I
feel the struct vfio_pci might be the one to extend?

> > +};
> > +#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
> > +
> > +/**
> > + * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
> > + * @size: sizeof(struct iommu_viommu_unset_vdev_id)
> > + * @viommu_id: viommu ID associated with the device to delete its virtual ID
> > + * @dev_id: device ID to unset its virtual ID
> > + * @__reserved: Must be 0
> > + * @vdev_id: Virtual device ID (for verification)
> > + *
> > + * Unset a viommu-specific virtual ID of a device
> > + */
> > +struct iommu_viommu_unset_vdev_id {
> > +     __u32 size;
> > +     __u32 viommu_id;
> > +     __u32 dev_id;
> > +     __u32 __reserved;
> > +     __aligned_u64 vdev_id;
> > +};
> > +#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
> >   #endif
> 
> Nit: "git format-patch -O orderfile" makes patches nicer by putting the
> documentation first (.h before .c, in this case) with the "ordefile"
> looking like this:
> 
> ===
> *.txt
> configure
> *Makefile*
> *.json
> *.h
> *.c
> ===

Interesting :)

Will try it!

Thanks
Nicolin
Jason Gunthorpe Oct. 4, 2024, 11:41 a.m. UTC | #16
On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:

> > +	__u32 __reserved;
> > +	__aligned_u64 vdev_id;
> 
> What is the nature of this id?

It should be the vIOMMU's HW representation for the virtual device.

On ARM it is the stream id, the index into the Stream Table

On AMD it would be the "DeviceID" the index in the Device Table

On Intel it is an index into the context table

The primary usage is to transform virtual invalidations from the guest
into physical invalidations.

> For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface
> to pass the guest's BDFs for a specific host device (which is passed
> through) and nothing in the kernel has any knowledge of it atm, is this the
> right place, or another ioctl() is needed here?

We probably need to add the vRID as well to this struct for that
reason.

The vdev_id is the iommu handle, and there is a platform specific
transformation between Bus/Device/Function and the iommu handle. In
some cases this is math, in some cases it is ACPI/DT tables or
something.

So I think the kernel should not make an assumption about the
relationship.

Jason
Nicolin Chen Oct. 4, 2024, 6:13 p.m. UTC | #17
On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
> > For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface
> > to pass the guest's BDFs for a specific host device (which is passed
> > through) and nothing in the kernel has any knowledge of it atm, is this the
> > right place, or another ioctl() is needed here?
> 
> We probably need to add the vRID as well to this struct for that
> reason.

"vRID"/"vBDF" doesn't sound very generic to me to put in this
structure, though PCI devices are and very likely will be the
only users of this Virtual Device for a while. Any good idea?

Also, I am wondering if the uAPI structure of Virtual Device
should have a driver-specific data structure. And the vdev_id
should be in the driver-specific struct. So, it could stay in
corresponding naming, "Stream ID", "Device ID" or "Context ID"
v.s. a generic "Virtual ID" in the top-level structure? Then,
other info like CCA can be put in the driver-level structure
of SMMU's.

> The vdev_id is the iommu handle, and there is a platform specific
> transformation between Bus/Device/Function and the iommu handle. In
> some cases this is math, in some cases it is ACPI/DT tables or
> something.

> So I think the kernel should not make an assumption about the
> relationship.

Agreed. That also implies that a vRID is quite independent to
the IOMMU right? So, I think that the reason of adding a vRID
to the virtual deivce uAPI/structure should be IOMMU requiring
it?

Thanks
Nicolin
Jason Gunthorpe Oct. 4, 2024, 6:50 p.m. UTC | #18
On Fri, Oct 04, 2024 at 11:13:46AM -0700, Nicolin Chen wrote:
> On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
> > > For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface
> > > to pass the guest's BDFs for a specific host device (which is passed
> > > through) and nothing in the kernel has any knowledge of it atm, is this the
> > > right place, or another ioctl() is needed here?
> > 
> > We probably need to add the vRID as well to this struct for that
> > reason.
> 
> "vRID"/"vBDF" doesn't sound very generic to me to put in this
> structure, though PCI devices are and very likely will be the
> only users of this Virtual Device for a while. Any good idea?

It isn't necessarily bad to have a pci field as long as we can
somehow understand when it is used.

> Also, I am wondering if the uAPI structure of Virtual Device
> should have a driver-specific data structure. And the vdev_id
> should be in the driver-specific struct. So, it could stay in
> corresponding naming, "Stream ID", "Device ID" or "Context ID"
> v.s. a generic "Virtual ID" in the top-level structure? Then,
> other info like CCA can be put in the driver-level structure
> of SMMU's.

I'd to avoid a iommu-driver specific structure here, but I fear we
will have a "lowervisor" (sigh) specific structure for the widely
varied CC/pkvm/etc world.

> Agreed. That also implies that a vRID is quite independent to
> the IOMMU right? So, I think that the reason of adding a vRID
> to the virtual deivce uAPI/structure should be IOMMU requiring
> it?

I would like to use this API to link in the CC/pkvm/etc world, and use
it to create not just the vIOMMU components but link up to the
"lowervisor" components as well, since it is all the same stuff
basically.

Jason
Nicolin Chen Oct. 4, 2024, 7:25 p.m. UTC | #19
On Fri, Oct 04, 2024 at 03:50:19PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 11:13:46AM -0700, Nicolin Chen wrote:
> > On Fri, Oct 04, 2024 at 08:41:47AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 04, 2024 at 02:32:28PM +1000, Alexey Kardashevskiy wrote:
> > > > For my SEV-TIO exercise ("trusted IO"), I am looking for a kernel interface
> > > > to pass the guest's BDFs for a specific host device (which is passed
> > > > through) and nothing in the kernel has any knowledge of it atm, is this the
> > > > right place, or another ioctl() is needed here?
> > > 
> > > We probably need to add the vRID as well to this struct for that
> > > reason.
> > 
> > "vRID"/"vBDF" doesn't sound very generic to me to put in this
> > structure, though PCI devices are and very likely will be the
> > only users of this Virtual Device for a while. Any good idea?
> 
> It isn't necessarily bad to have a pci field as long as we can
> somehow understand when it is used.

OK.

> > Also, I am wondering if the uAPI structure of Virtual Device
> > should have a driver-specific data structure. And the vdev_id
> > should be in the driver-specific struct. So, it could stay in
> > corresponding naming, "Stream ID", "Device ID" or "Context ID"
> > v.s. a generic "Virtual ID" in the top-level structure? Then,
> > other info like CCA can be put in the driver-level structure
> > of SMMU's.
> 
> I'd to avoid a iommu-driver specific structure here, but I fear we
> will have a "lowervisor" (sigh) specific structure for the widely
> varied CC/pkvm/etc world.

The design of the structure also impacts how we implement the
API between iommufd and the drivers. Right now, forwarding the
ID via a function parameter is fine, but we would need a user
structure once we have more stuff to forward.

With that, I wonder what is better for the initial version of
this structure, a generic virtual ID or a driver-named ID like
"Stream ID"? The latter might be more understandable/flexible, 
so we won't need to justify a generic virtual ID along the way
if something changes in the nature?

> > Agreed. That also implies that a vRID is quite independent to
> > the IOMMU right? So, I think that the reason of adding a vRID
> > to the virtual deivce uAPI/structure should be IOMMU requiring
> > it?
> 
> I would like to use this API to link in the CC/pkvm/etc world, and use
> it to create not just the vIOMMU components but link up to the
> "lowervisor" components as well, since it is all the same stuff
> basically.

That sounds wider than what I defined it for in my patch:
 * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
 * ...
 * Allocate a virtual device instance (for a physical device) against a vIOMMU.
 * This instance holds the device's information in a VM, related to its vIOMMU.

Would you please help rephrase it? It'd be also helpful for me
to update the doc.

Though I feel slightly odd if we define it wider than "vIOMMU"
since this is an iommufd header...

Thanks
Nicolin
Jason Gunthorpe Oct. 4, 2024, 8:17 p.m. UTC | #20
On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:

> With that, I wonder what is better for the initial version of
> this structure, a generic virtual ID or a driver-named ID like
> "Stream ID"? The latter might be more understandable/flexible, 
> so we won't need to justify a generic virtual ID along the way
> if something changes in the nature?

I think the name could be a bit more specific "viommu_device_id"
maybe? And elaborate in the kdoc that this is about the identifier
that the iommu HW itself uses.
> That sounds wider than what I defined it for in my patch:
>  * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
>  * ...
>  * Allocate a virtual device instance (for a physical device) against a vIOMMU.
>  * This instance holds the device's information in a VM, related to its vIOMMU.
> 
> Would you please help rephrase it? It'd be also helpful for me
> to update the doc.

I think that is still OK for the moment.

> Though I feel slightly odd if we define it wider than "vIOMMU"
> since this is an iommufd header...

The notion I have is that vIOMMU would expand to encompass not just
the physical hypervisor controled vIOMMU but also the vIOMMU
controlled by the trusted "lowervisor" in a pkvm/cc/whatever world.

Alexey is working on vIOMMU support in CC which has the trusted world
do some of the trusted vIOMMU components. I'm hoping the other people
in this area will look at his design and make it fit nicely to
everyone.

Jason
Nicolin Chen Oct. 4, 2024, 8:33 p.m. UTC | #21
On Fri, Oct 04, 2024 at 05:17:46PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:
> 
> > With that, I wonder what is better for the initial version of
> > this structure, a generic virtual ID or a driver-named ID like
> > "Stream ID"? The latter might be more understandable/flexible, 
> > so we won't need to justify a generic virtual ID along the way
> > if something changes in the nature?
> 
> I think the name could be a bit more specific "viommu_device_id"
> maybe? And elaborate in the kdoc that this is about the identifier
> that the iommu HW itself uses.

A "vIOMMU Device ID" might sound a bit confusing with an ID of a
vIOMMU itself :-/

At this moment, I named it "virt_id" in uAPI with a description:
" * @virt_id: Virtual device ID per vIOMMU"

I could add to that (or just in the documentation):
"e.g. ARM's Stream ID, AMD's DeviceID, Intel's Context Table ID"
to clarify it further.

> > That sounds wider than what I defined it for in my patch:
> >  * struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
> >  * ...
> >  * Allocate a virtual device instance (for a physical device) against a vIOMMU.
> >  * This instance holds the device's information in a VM, related to its vIOMMU.
> > 
> > Would you please help rephrase it? It'd be also helpful for me
> > to update the doc.
> 
> I think that is still OK for the moment.
> 
> > Though I feel slightly odd if we define it wider than "vIOMMU"
> > since this is an iommufd header...
> 
> The notion I have is that vIOMMU would expand to encompass not just
> the physical hypervisor controled vIOMMU but also the vIOMMU
> controlled by the trusted "lowervisor" in a pkvm/cc/whatever world.
> 
> Alexey is working on vIOMMU support in CC which has the trusted world
> do some of the trusted vIOMMU components. I'm hoping the other people
> in this area will look at his design and make it fit nicely to
> everyone.

Oh, I didn't connect the dots that lowervisor must rely on the
vIOMMU too -- I'd need to check the CC stuff in detail. In that
case, having it in vIOMMU uAPI totally makes sense.

Thanks
Nicolin
Jason Gunthorpe Oct. 7, 2024, 5:18 p.m. UTC | #22
On Fri, Oct 04, 2024 at 01:33:33PM -0700, Nicolin Chen wrote:
> On Fri, Oct 04, 2024 at 05:17:46PM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 04, 2024 at 12:25:19PM -0700, Nicolin Chen wrote:
> > 
> > > With that, I wonder what is better for the initial version of
> > > this structure, a generic virtual ID or a driver-named ID like
> > > "Stream ID"? The latter might be more understandable/flexible, 
> > > so we won't need to justify a generic virtual ID along the way
> > > if something changes in the nature?
> > 
> > I think the name could be a bit more specific "viommu_device_id"
> > maybe? And elaborate in the kdoc that this is about the identifier
> > that the iommu HW itself uses.
> 
> A "vIOMMU Device ID" might sound a bit confusing with an ID of a
> vIOMMU itself :-/
> 
> At this moment, I named it "virt_id" in uAPI with a description:
> " * @virt_id: Virtual device ID per vIOMMU"
> 
> I could add to that (or just in the documentation):
> "e.g. ARM's Stream ID, AMD's DeviceID, Intel's Context Table ID"
> to clarify it further.

Yeah probably best

> > Alexey is working on vIOMMU support in CC which has the trusted world
> > do some of the trusted vIOMMU components. I'm hoping the other people
> > in this area will look at his design and make it fit nicely to
> > everyone.
> 
> Oh, I didn't connect the dots that lowervisor must rely on the
> vIOMMU too -- I'd need to check the CC stuff in detail. In that
> case, having it in vIOMMU uAPI totally makes sense.

I think we are still getting through this, and I do wonder how to
manage it with so much stuff still in flux and sometimes private.

At least for AMD's CC case where there is an entanglement with the
physical IOMMU it seems like it makes sense.

Even in the pKVM type case I think you end up ripping the translation
away from the "physical" iommu and there should be some coordination
to ensure this handover is clean and undone when iommufd is
closed. But I wonder how the S2 works there..

Jason
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 5fd3dd420290..3ad759971b32 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -136,6 +136,18 @@  void iommufd_device_destroy(struct iommufd_object *obj)
 	struct iommufd_device *idev =
 		container_of(obj, struct iommufd_device, obj);
 
+	/* Unlocked since there should be no race in a destroy() */
+	if (idev->vdev_id) {
+		struct iommufd_vdev_id *vdev_id = idev->vdev_id;
+		struct iommufd_viommu *viommu = vdev_id->viommu;
+		struct iommufd_vdev_id *old;
+
+		old = xa_cmpxchg(&viommu->vdev_ids, vdev_id->id, vdev_id, NULL,
+				 GFP_KERNEL);
+		WARN_ON(old != vdev_id);
+		kfree(vdev_id);
+		idev->vdev_id = NULL;
+	}
 	iommu_device_release_dma_owner(idev->dev);
 	iommufd_put_group(idev->igroup);
 	if (!iommufd_selftest_is_mock_dev(idev->dev))
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 1f2a1c133b9a..2c6e168c5300 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -416,6 +416,7 @@  struct iommufd_device {
 	struct iommufd_object obj;
 	struct iommufd_ctx *ictx;
 	struct iommufd_group *igroup;
+	struct iommufd_vdev_id *vdev_id;
 	struct list_head group_item;
 	/* always the physical device */
 	struct device *dev;
@@ -533,11 +534,31 @@  struct iommufd_viommu {
 	struct iommufd_ctx *ictx;
 	struct iommufd_hwpt_paging *hwpt;
 
+	/* The locking order is vdev_ids_rwsem -> igroup::lock */
+	struct rw_semaphore vdev_ids_rwsem;
+	struct xarray vdev_ids;
+
 	unsigned int type;
 };
 
+struct iommufd_vdev_id {
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	u64 id;
+};
+
+static inline struct iommufd_viommu *
+iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
+{
+	return container_of(iommufd_get_object(ucmd->ictx, id,
+					       IOMMUFD_OBJ_VIOMMU),
+			    struct iommufd_viommu, obj);
+}
+
 int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
 void iommufd_viommu_destroy(struct iommufd_object *obj);
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd);
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd);
 
 #ifdef CONFIG_IOMMUFD_TEST
 int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 288ee51b6829..199ad90fa36b 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -334,6 +334,8 @@  union ucmd_buffer {
 	struct iommu_option option;
 	struct iommu_vfio_ioas vfio_ioas;
 	struct iommu_viommu_alloc viommu;
+	struct iommu_viommu_set_vdev_id set_vdev_id;
+	struct iommu_viommu_unset_vdev_id unset_vdev_id;
 #ifdef CONFIG_IOMMUFD_TEST
 	struct iommu_test_cmd test;
 #endif
@@ -387,6 +389,10 @@  static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
 		 __reserved),
 	IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
 		 struct iommu_viommu_alloc, out_viommu_id),
+	IOCTL_OP(IOMMU_VIOMMU_SET_VDEV_ID, iommufd_viommu_set_vdev_id,
+		 struct iommu_viommu_set_vdev_id, vdev_id),
+	IOCTL_OP(IOMMU_VIOMMU_UNSET_VDEV_ID, iommufd_viommu_unset_vdev_id,
+		 struct iommu_viommu_unset_vdev_id, vdev_id),
 #ifdef CONFIG_IOMMUFD_TEST
 	IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
 #endif
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 200653a4bf57..8ffcd72b16b8 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -8,6 +8,15 @@  void iommufd_viommu_destroy(struct iommufd_object *obj)
 {
 	struct iommufd_viommu *viommu =
 		container_of(obj, struct iommufd_viommu, obj);
+	struct iommufd_vdev_id *vdev_id;
+	unsigned long index;
+
+	xa_for_each(&viommu->vdev_ids, index, vdev_id) {
+		/* Unlocked since there should be no race in a destroy() */
+		vdev_id->idev->vdev_id = NULL;
+		kfree(vdev_id);
+	}
+	xa_destroy(&viommu->vdev_ids);
 
 	refcount_dec(&viommu->hwpt->common.obj.users);
 }
@@ -53,6 +62,9 @@  int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	viommu->ictx = ucmd->ictx;
 	viommu->hwpt = hwpt_paging;
 
+	xa_init(&viommu->vdev_ids);
+	init_rwsem(&viommu->vdev_ids_rwsem);
+
 	refcount_inc(&viommu->hwpt->common.obj.users);
 
 	cmd->out_viommu_id = viommu->obj.id;
@@ -70,3 +82,112 @@  int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
 	iommufd_put_object(ucmd->ictx, &idev->obj);
 	return rc;
 }
+
+int iommufd_viommu_set_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_set_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_vdev_id *vdev_id, *curr;
+	struct iommufd_viommu *viommu;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (idev->vdev_id) {
+		rc = -EEXIST;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id = kzalloc(sizeof(*vdev_id), GFP_KERNEL);
+	if (!vdev_id) {
+		rc = -ENOMEM;
+		goto out_unlock_igroup;
+	}
+
+	vdev_id->idev = idev;
+	vdev_id->viommu = viommu;
+	vdev_id->id = cmd->vdev_id;
+
+	curr = xa_cmpxchg(&viommu->vdev_ids, cmd->vdev_id, NULL, vdev_id,
+			  GFP_KERNEL);
+	if (curr) {
+		rc = xa_err(curr) ? : -EBUSY;
+		goto out_free;
+	}
+
+	idev->vdev_id = vdev_id;
+	goto out_unlock_igroup;
+
+out_free:
+	kfree(vdev_id);
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
+
+int iommufd_viommu_unset_vdev_id(struct iommufd_ucmd *ucmd)
+{
+	struct iommu_viommu_unset_vdev_id *cmd = ucmd->cmd;
+	struct iommufd_viommu *viommu;
+	struct iommufd_vdev_id *old;
+	struct iommufd_device *idev;
+	int rc = 0;
+
+	if (cmd->vdev_id > ULONG_MAX)
+		return -EINVAL;
+
+	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+	if (IS_ERR(viommu))
+		return PTR_ERR(viommu);
+
+	idev = iommufd_get_device(ucmd, cmd->dev_id);
+	if (IS_ERR(idev)) {
+		rc = PTR_ERR(idev);
+		goto out_put_viommu;
+	}
+
+	down_write(&viommu->vdev_ids_rwsem);
+	mutex_lock(&idev->igroup->lock);
+	if (!idev->vdev_id) {
+		rc = -ENOENT;
+		goto out_unlock_igroup;
+	}
+	if (idev->vdev_id->id != cmd->vdev_id) {
+		rc = -EINVAL;
+		goto out_unlock_igroup;
+	}
+
+	old = xa_cmpxchg(&viommu->vdev_ids, idev->vdev_id->id,
+			 idev->vdev_id, NULL, GFP_KERNEL);
+	if (xa_is_err(old)) {
+		rc = xa_err(old);
+		goto out_unlock_igroup;
+	}
+	kfree(old);
+	idev->vdev_id = NULL;
+
+out_unlock_igroup:
+	mutex_unlock(&idev->igroup->lock);
+	up_write(&viommu->vdev_ids_rwsem);
+	iommufd_put_object(ucmd->ictx, &idev->obj);
+out_put_viommu:
+	iommufd_put_object(ucmd->ictx, &viommu->obj);
+	return rc;
+}
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 51ce6a019c34..1816e89c922d 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -52,6 +52,8 @@  enum {
 	IOMMUFD_CMD_HWPT_INVALIDATE = 0x8d,
 	IOMMUFD_CMD_FAULT_QUEUE_ALLOC = 0x8e,
 	IOMMUFD_CMD_VIOMMU_ALLOC = 0x8f,
+	IOMMUFD_CMD_VIOMMU_SET_VDEV_ID = 0x90,
+	IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID = 0x91,
 };
 
 /**
@@ -882,4 +884,42 @@  struct iommu_viommu_alloc {
 	__u32 out_viommu_id;
 };
 #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
+
+/**
+ * struct iommu_viommu_set_vdev_id - ioctl(IOMMU_VIOMMU_SET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_set_vdev_id)
+ * @viommu_id: viommu ID to associate with the device to store its virtual ID
+ * @dev_id: device ID to set its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID
+ *
+ * Set a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_set_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_SET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_VDEV_ID)
+
+/**
+ * struct iommu_viommu_unset_vdev_id - ioctl(IOMMU_VIOMMU_UNSET_VDEV_ID)
+ * @size: sizeof(struct iommu_viommu_unset_vdev_id)
+ * @viommu_id: viommu ID associated with the device to delete its virtual ID
+ * @dev_id: device ID to unset its virtual ID
+ * @__reserved: Must be 0
+ * @vdev_id: Virtual device ID (for verification)
+ *
+ * Unset a viommu-specific virtual ID of a device
+ */
+struct iommu_viommu_unset_vdev_id {
+	__u32 size;
+	__u32 viommu_id;
+	__u32 dev_id;
+	__u32 __reserved;
+	__aligned_u64 vdev_id;
+};
+#define IOMMU_VIOMMU_UNSET_VDEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_UNSET_VDEV_ID)
 #endif