
[v7,0/9] Add iommufd nesting (part 2/2)

Message ID 20231221153948.119007-1-yi.l.liu@intel.com

Message

Yi Liu Dec. 21, 2023, 3:39 p.m. UTC
Nested translation is a hardware feature that is supported by many modern
IOMMUs. It uses two stages of address translation (stage-1 and stage-2) to
reach the physical address. The stage-1 translation table is owned by
userspace (e.g. by a guest OS), while stage-2 is owned by the kernel. Changes
to the stage-1 translation table should be followed by an IOTLB invalidation.

Take Intel VT-d as an example: the stage-1 translation table is the I/O page
table. As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform the stage-1
address translation. Accordingly, modifications to present mappings in the
guest I/O page table should be followed by an IOTLB invalidation.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  |  FS for GIOVA->GPA     |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.----------------------------------.
    |             |   | SS for GPA->HPA, unmanaged domain|
    |             |   '----------------------------------'
    '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>

This series is based on the first part, which has been merged [1]. It adds the
cache invalidation interface for the userspace to invalidate caches after
modifying the stage-1 page table. This includes both the iommufd changes and the
VT-d driver changes.

Complete code can be found in [2]; the QEMU code can be found in [3].

Finally, this is joint work with Nicolin Chen and Lu Baolu. Thanks to them
for the help. ^_^ Looking forward to your feedback.

[1] https://lore.kernel.org/linux-iommu/20231026044216.64964-1-yi.l.liu@intel.com/ - merged
[2] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v7:
 - Remove domain->ops->cache_invalidate_user check in hwpt alloc path due
   to failure in bisect (Baolu)
 - Remove out_driver_error_code from struct iommu_hwpt_invalidate after
   discussion in v6; callers should expect a per-entry error code instead.
 - Rework the selftest cache invalidation part to report a per-entry error
 - Allow the user to pass in an empty array as a try-and-fail mechanism to
   check whether a given req_type is supported by the kernel (Jason)
 - Define a separate enum type for cache invalidation data (Jason)
 - Fix the IOMMU_HWPT_INVALIDATE to always update the req_num field before
   returning (Nicolin)
 - Merge the VT-d nesting part 2/2
   https://lore.kernel.org/linux-iommu/20231117131816.24359-1-yi.l.liu@intel.com/
   into this series to avoid defining empty enum in the middle of the series.
   The major difference is adding the VT-d related invalidation uapi structures
   together with the generic data structures in patch 02 of this series.
 - VT-d driver was refined to report ICE/ITE errors from the bottom cache
   invalidation submit helpers, hence the cache_invalidate_user op can
   report such errors via the per-entry error field to the user. The VT-d
   driver will not stop walking the invalidation array on ICE/ITE errors,
   as such errors are defined by the VT-d spec and userspace should be able
   to handle them and let the real user (say a Virtual Machine) know about
   them. But other errors, like an invalid uAPI data structure configuration
   or a memory copy failure, should stop the array walk, as going on may
   cause more issues.
 - Minor fixes per Jason and Kevin's review comments

v6: https://lore.kernel.org/linux-iommu/20231117130717.19875-1-yi.l.liu@intel.com/
 - Not much change, just a rebase on top of 6.7-rc1 as part 1/2 was merged

v5: https://lore.kernel.org/linux-iommu/20231020092426.13907-1-yi.l.liu@intel.com/#t
 - Split the iommufd nesting series into two parts of alloc_user and
   invalidation (Jason)
 - Split IOMMUFD_OBJ_HW_PAGETABLE to IOMMUFD_OBJ_HWPT_PAGING/_NESTED, and
   do the same with the structures/alloc()/abort()/destroy(). Reworked the
   selftest accordingly too. (Jason)
 - Move hwpt/data_type into struct iommu_user_data from standalone op
   arguments. (Jason)
 - Rename hwpt_type to be data_type, the HWPT_TYPE to be HWPT_ALLOC_DATA,
   _TYPE_DEFAULT to be _ALLOC_DATA_NONE (Jason, Kevin)
 - Rename iommu_copy_user_data() to iommu_copy_struct_from_user() (Kevin)
 - Add macro to the iommu_copy_struct_from_user() to calculate min_size
   (Jason)
 - Fix two bugs spotted by ZhaoYan

v4: https://lore.kernel.org/linux-iommu/20230921075138.124099-1-yi.l.liu@intel.com/
 - Separate HWPT alloc/destroy/abort functions between user-managed HWPTs
   and kernel-managed HWPTs
 - Rework invalidate uAPI to be a multi-request array-based design
 - Add a struct iommu_user_data_array and a helper for driver to sanitize
   and copy the entry data from user space invalidation array
 - Add a patch fixing TEST_LENGTH() in selftest program
 - Drop IOMMU_RESV_IOVA_RANGES patches
 - Update kdoc and inline comments
 - Drop the code to add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation,
   this does not change the rule that resv regions should only be added to the
   kernel-managed HWPT. The IOMMU_RESV_SW_MSI stuff will be added in later series
   as it is needed only by SMMU so far.

v3: https://lore.kernel.org/linux-iommu/20230724110406.107212-1-yi.l.liu@intel.com/
 - Add new uAPI things in alphabetical order
 - Pass in "enum iommu_hwpt_type hwpt_type" to op->domain_alloc_user for
   sanity, replacing the previous op->domain_alloc_user_data_len solution
 - Return ERR_PTR from domain_alloc_user instead of NULL
 - Only add IOMMU_RESV_SW_MSI to kernel-managed HWPT in nested translation (Kevin)
 - Add IOMMU_RESV_IOVA_RANGES to report resv iova ranges to userspace hence
   userspace is able to exclude the ranges in the stage-1 HWPT (e.g. guest I/O
   page table). (Kevin)
 - Add selftest coverage for the new IOMMU_RESV_IOVA_RANGES ioctl
 - Minor changes per Kevin's inputs

v2: https://lore.kernel.org/linux-iommu/20230511143844.22693-1-yi.l.liu@intel.com/
 - Add union iommu_domain_user_data to include all user data structures to avoid
   passing void * in kernel APIs.
 - Add iommu op to return user data length for user domain allocation
 - Rename struct iommu_hwpt_alloc::data_type to be hwpt_type
 - Store the invalidation data length in iommu_domain_ops::cache_invalidate_user_data_len
 - Convert cache_invalidate_user op to be int instead of void
 - Remove @data_type in struct iommu_hwpt_invalidate
 - Remove out_hwpt_type_bitmap in struct iommu_hw_info hence drop patch 08 of v1

v1: https://lore.kernel.org/linux-iommu/20230309080910.607396-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Lu Baolu (4):
  iommu: Add cache_invalidate_user op
  iommu/vt-d: Allow qi_submit_sync() to return the QI faults
  iommu/vt-d: Convert pasid based cache invalidation to return QI fault
  iommu/vt-d: Add iotlb flush for nested domain

Nicolin Chen (4):
  iommu: Add iommu_copy_struct_from_user_array helper
  iommufd/selftest: Add mock_domain_cache_invalidate_user support
  iommufd/selftest: Add IOMMU_TEST_OP_MD_CHECK_IOTLB test op
  iommufd/selftest: Add coverage for IOMMU_HWPT_INVALIDATE ioctl

Yi Liu (1):
  iommufd: Add IOMMU_HWPT_INVALIDATE

 drivers/iommu/intel/dmar.c                    |  36 ++--
 drivers/iommu/intel/iommu.c                   |  12 +-
 drivers/iommu/intel/iommu.h                   |   8 +-
 drivers/iommu/intel/irq_remapping.c           |   2 +-
 drivers/iommu/intel/nested.c                  | 116 ++++++++++++
 drivers/iommu/intel/pasid.c                   |  14 +-
 drivers/iommu/intel/svm.c                     |  14 +-
 drivers/iommu/iommufd/hw_pagetable.c          |  36 ++++
 drivers/iommu/iommufd/iommufd_private.h       |  10 ++
 drivers/iommu/iommufd/iommufd_test.h          |  39 ++++
 drivers/iommu/iommufd/main.c                  |   3 +
 drivers/iommu/iommufd/selftest.c              |  93 ++++++++++
 include/linux/iommu.h                         | 101 +++++++++++
 include/uapi/linux/iommufd.h                  | 100 +++++++++++
 tools/testing/selftests/iommu/iommufd.c       | 170 ++++++++++++++++++
 tools/testing/selftests/iommu/iommufd_utils.h |  57 ++++++
 16 files changed, 773 insertions(+), 38 deletions(-)

Comments

Tian, Kevin Dec. 22, 2023, 2:30 a.m. UTC | #1
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 21, 2023 11:40 PM
> 
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> The updates of the PTEs in the nested page table will be propagated to the
> hardware caches on both IOMMU (IOTLB) and devices (DevTLB/ATC).

this is incorrect. the scope of this cmd is driver specific.

> 
> Add a new domain op cache_invalidate_user for the userspace to flush the
> hardware caches for a nested domain through iommufd. No wrapper for it,
> as it's only supposed to be used by iommufd. Then, pass in invalidation
> requests in form of a user data array conatining a number of invalidation
> data entries.
> 
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/linux/iommu.h | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 6291aa7b079b..5c4a17f13761 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -284,6 +284,24 @@ struct iommu_user_data {
>  	size_t len;
>  };
> 
> +/**
> + * struct iommu_user_data_array - iommu driver specific user space data
> array
> + * @type: The data type of all the entries in the user buffer array
> + * @uptr: Pointer to the user buffer array for copy_from_user()

remove 'for copy_from_user()'

> + * @entry_len: The fixed-width length of a entry in the array, in bytes

s/a/an/

> + * @entry_num: The number of total entries in the array
> + *
> + * A array having a @entry_num number of @entry_len sized entries, each

the first sentence is redundant.

> entry is
> + * user space data, an uAPI defined in include/uapi/linux/iommufd.h where
> @type
> + * is also defined as enum iommu_xyz_data_type.

I'd just say:

"The user buffer includes an array of requests with format defined 
in include/uapi/linux/iommufd.h"
Tian, Kevin Dec. 22, 2023, 3:39 a.m. UTC | #2
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 21, 2023 11:40 PM
> +
> +
> +		if ((inv.flags & IOMMU_TEST_INVALIDATE_FLAG_ALL) &&
> +		    (inv.flags &
> IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR)) {
> +			rc = -EINVAL;
> +			break;
> +		}
> +

a nit: is there a reason why the two flags cannot be set together?

in concept a mock iommu error could occur in either invalidate-one
or invalidate-all.

otherwise,

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Yang, Weijiang Dec. 22, 2023, 3:56 a.m. UTC | #3
On 12/21/2023 11:39 PM, Yi Liu wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
>
> This implements the .cache_invalidate_user() callback to support iotlb
> flush for nested domain.
>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Co-developed-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   drivers/iommu/intel/nested.c | 116 +++++++++++++++++++++++++++++++++++
>   1 file changed, 116 insertions(+)
>
> diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
> index b5a5563ab32c..c665e2647045 100644
> --- a/drivers/iommu/intel/nested.c
> +++ b/drivers/iommu/intel/nested.c
> @@ -73,9 +73,125 @@ static void intel_nested_domain_free(struct iommu_domain *domain)
>   	kfree(to_dmar_domain(domain));
>   }
>   
> +static void nested_flush_pasid_iotlb(struct intel_iommu *iommu,
> +				     struct dmar_domain *domain, u64 addr,
> +				     unsigned long npages, bool ih)
> +{
> +	u16 did = domain_id_iommu(domain, iommu);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&domain->lock, flags);
> +	if (!list_empty(&domain->devices))
> +		qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr,
> +				npages, ih, NULL);
> +	spin_unlock_irqrestore(&domain->lock, flags);
> +}
> +
> +static void nested_flush_dev_iotlb(struct dmar_domain *domain, u64 addr,
> +				   unsigned mask, u32 *fault)
> +{
> +	struct device_domain_info *info;
> +	unsigned long flags;
> +	u16 sid, qdep;
> +
> +	spin_lock_irqsave(&domain->lock, flags);
> +	list_for_each_entry(info, &domain->devices, link) {
> +		if (!info->ats_enabled)
> +			continue;
> +		sid = info->bus << 8 | info->devfn;
> +		qdep = info->ats_qdep;
> +		qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
> +				   qdep, addr, mask, fault);
> +		quirk_extra_dev_tlb_flush(info, addr, mask,
> +					  IOMMU_NO_PASID, qdep);
> +	}
> +	spin_unlock_irqrestore(&domain->lock, flags);
> +}
> +
> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
> +				     unsigned long npages, u32 *error)
> +{
> +	struct iommu_domain_info *info;
> +	unsigned long i;
> +	unsigned mask;
> +	u32 fault = 0;
> +
> +	if (npages == U64_MAX)
> +		mask = 64 - VTD_PAGE_SHIFT;
> +	else
> +		mask = ilog2(__roundup_pow_of_two(npages));
> +
> +	xa_for_each(&domain->iommu_array, i, info) {
> +		nested_flush_pasid_iotlb(info->iommu, domain, addr, npages, 0);
> +
> +		if (domain->has_iotlb_device)
> +			continue;

Shouldn't this be if (!domain->has_iotlb_device)?
> +
> +		nested_flush_dev_iotlb(domain, addr, mask, &fault);
> +		if (fault & (DMA_FSTS_ITE | DMA_FSTS_ICE))
> +			break;
> +	}
> +
> +	if (fault & DMA_FSTS_ICE)
> +		*error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ICE;
> +	if (fault & DMA_FSTS_ITE)
> +		*error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ITE;
> +}
> +
> +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
> +					      struct iommu_user_data_array *array)
> +{
> +	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
> +	struct iommu_hwpt_vtd_s1_invalidate inv_entry;
> +	u32 processed = 0;
> +	int ret = 0;
> +	u32 index;
> +
> +	if (array->type != IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	for (index = 0; index < array->entry_num; index++) {
> +		ret = iommu_copy_struct_from_user_array(&inv_entry, array,
> +							IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
> +							index, inv_error);
> +		if (ret)
> +			break;
> +
> +		if (inv_entry.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) {
> +			ret = -EOPNOTSUPP;
> +			break;
> +		}
> +
> +		if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
> +		    ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		inv_entry.inv_error = 0;
> +		intel_nested_flush_cache(dmar_domain, inv_entry.addr,
> +					 inv_entry.npages, &inv_entry.inv_error);
> +
> +		ret = iommu_respond_struct_to_user_array(array, index,
> +							 (void *)&inv_entry,
> +							 sizeof(inv_entry));
> +		if (ret)
> +			break;
> +
> +		processed++;
> +	}
> +
> +out:
> +	array->entry_num = processed;
> +	return ret;
> +}
> +
>   static const struct iommu_domain_ops intel_nested_domain_ops = {
>   	.attach_dev		= intel_nested_attach_dev,
>   	.free			= intel_nested_domain_free,
> +	.cache_invalidate_user	= intel_nested_cache_invalidate_user,
>   };
>   
>   struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
Tian, Kevin Dec. 22, 2023, 4:01 a.m. UTC | #4
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 21, 2023 11:40 PM
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> Allow to test whether IOTLB has been invalidated or not.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tian, Kevin Dec. 22, 2023, 4:23 a.m. UTC | #5
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 21, 2023 11:40 PM
>
> +	fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
> +	if (fault) {
> +		if (fsts)
> +			*fsts |= fault;

do we expect the fault to be accumulated? otherwise it's clearer to
just do direct assignment instead of asking for the caller to clear
the variable before invocation.

the rest looks good:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tian, Kevin Dec. 22, 2023, 6:47 a.m. UTC | #6
> From: Yang, Weijiang <weijiang.yang@intel.com>
> Sent: Friday, December 22, 2023 11:56 AM
> > +
> > +	xa_for_each(&domain->iommu_array, i, info) {
> > +		nested_flush_pasid_iotlb(info->iommu, domain, addr,
> npages, 0);
> > +
> > +		if (domain->has_iotlb_device)
> > +			continue;
> 
> Shouldn't this be if (!domain->has_iotlb_device)?

yes that is wrong.

actually it's weird to put domain check in a loop of domain->iommu_array.

that check along with devtlb flush should be done out of that loop.
Tian, Kevin Dec. 22, 2023, 6:57 a.m. UTC | #7
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, December 21, 2023 11:40 PM
> 
> +
> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64
> addr,
> +				     unsigned long npages, u32 *error)
> +{
> +	struct iommu_domain_info *info;
> +	unsigned long i;
> +	unsigned mask;
> +	u32 fault = 0;
> +
> +	if (npages == U64_MAX)
> +		mask = 64 - VTD_PAGE_SHIFT;
> +	else
> +		mask = ilog2(__roundup_pow_of_two(npages));
> +
> +	xa_for_each(&domain->iommu_array, i, info) {
> +		nested_flush_pasid_iotlb(info->iommu, domain, addr,
> npages, 0);

so IOMMU_VTD_INV_FLAGS_LEAF is defined but ignored?

> +
> +		if (domain->has_iotlb_device)
> +			continue;
> +
> +		nested_flush_dev_iotlb(domain, addr, mask, &fault);
> +		if (fault & (DMA_FSTS_ITE | DMA_FSTS_ICE))
> +			break;

here you may add a note that we don't plan to forward invalidation
queue errors (i.e. IQE) to the caller, as they are caused only by a
driver-internal bug.

> +
> +		if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
> +		    ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +

why is [non-zero-addr, U64_MAX] an error? Is it explicitly stated to
be not supported by underlying helpers?
Yi Liu Dec. 22, 2023, 7 a.m. UTC | #8
> On Dec 22, 2023, at 11:56, Yang, Weijiang <weijiang.yang@intel.com> wrote:
> 
> On 12/21/2023 11:39 PM, Yi Liu wrote:
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>> 
>> This implements the .cache_invalidate_user() callback to support iotlb
>> flush for nested domain.
>> 
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Co-developed-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>  drivers/iommu/intel/nested.c | 116 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 116 insertions(+)
>> 
>> diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
>> index b5a5563ab32c..c665e2647045 100644
>> --- a/drivers/iommu/intel/nested.c
>> +++ b/drivers/iommu/intel/nested.c
>> @@ -73,9 +73,125 @@ static void intel_nested_domain_free(struct iommu_domain *domain)
>>      kfree(to_dmar_domain(domain));
>>  }
>>  +static void nested_flush_pasid_iotlb(struct intel_iommu *iommu,
>> +                     struct dmar_domain *domain, u64 addr,
>> +                     unsigned long npages, bool ih)
>> +{
>> +    u16 did = domain_id_iommu(domain, iommu);
>> +    unsigned long flags;
>> +
>> +    spin_lock_irqsave(&domain->lock, flags);
>> +    if (!list_empty(&domain->devices))
>> +        qi_flush_piotlb(iommu, did, IOMMU_NO_PASID, addr,
>> +                npages, ih, NULL);
>> +    spin_unlock_irqrestore(&domain->lock, flags);
>> +}
>> +
>> +static void nested_flush_dev_iotlb(struct dmar_domain *domain, u64 addr,
>> +                   unsigned mask, u32 *fault)
>> +{
>> +    struct device_domain_info *info;
>> +    unsigned long flags;
>> +    u16 sid, qdep;
>> +
>> +    spin_lock_irqsave(&domain->lock, flags);
>> +    list_for_each_entry(info, &domain->devices, link) {
>> +        if (!info->ats_enabled)
>> +            continue;
>> +        sid = info->bus << 8 | info->devfn;
>> +        qdep = info->ats_qdep;
>> +        qi_flush_dev_iotlb(info->iommu, sid, info->pfsid,
>> +                   qdep, addr, mask, fault);
>> +        quirk_extra_dev_tlb_flush(info, addr, mask,
>> +                      IOMMU_NO_PASID, qdep);
>> +    }
>> +    spin_unlock_irqrestore(&domain->lock, flags);
>> +}
>> +
>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64 addr,
>> +                     unsigned long npages, u32 *error)
>> +{
>> +    struct iommu_domain_info *info;
>> +    unsigned long i;
>> +    unsigned mask;
>> +    u32 fault = 0;
>> +
>> +    if (npages == U64_MAX)
>> +        mask = 64 - VTD_PAGE_SHIFT;
>> +    else
>> +        mask = ilog2(__roundup_pow_of_two(npages));
>> +
>> +    xa_for_each(&domain->iommu_array, i, info) {
>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr, npages, 0);
>> +
>> +        if (domain->has_iotlb_device)
>> +            continue;
> 
> Shouldn't this be if (!domain->has_iotlb_device)?

oops, yes it is.

>> +
>> +        nested_flush_dev_iotlb(domain, addr, mask, &fault);
>> +        if (fault & (DMA_FSTS_ITE | DMA_FSTS_ICE))
>> +            break;
>> +    }
>> +
>> +    if (fault & DMA_FSTS_ICE)
>> +        *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ICE;
>> +    if (fault & DMA_FSTS_ITE)
>> +        *error |= IOMMU_HWPT_INVALIDATE_VTD_S1_ITE;
>> +}
>> +
>> +static int intel_nested_cache_invalidate_user(struct iommu_domain *domain,
>> +                          struct iommu_user_data_array *array)
>> +{
>> +    struct dmar_domain *dmar_domain = to_dmar_domain(domain);
>> +    struct iommu_hwpt_vtd_s1_invalidate inv_entry;
>> +    u32 processed = 0;
>> +    int ret = 0;
>> +    u32 index;
>> +
>> +    if (array->type != IOMMU_HWPT_INVALIDATE_DATA_VTD_S1) {
>> +        ret = -EINVAL;
>> +        goto out;
>> +    }
>> +
>> +    for (index = 0; index < array->entry_num; index++) {
>> +        ret = iommu_copy_struct_from_user_array(&inv_entry, array,
>> +                            IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
>> +                            index, inv_error);
>> +        if (ret)
>> +            break;
>> +
>> +        if (inv_entry.flags & ~IOMMU_VTD_INV_FLAGS_LEAF) {
>> +            ret = -EOPNOTSUPP;
>> +            break;
>> +        }
>> +
>> +        if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
>> +            ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
>> +            ret = -EINVAL;
>> +            break;
>> +        }
>> +
>> +        inv_entry.inv_error = 0;
>> +        intel_nested_flush_cache(dmar_domain, inv_entry.addr,
>> +                     inv_entry.npages, &inv_entry.inv_error);
>> +
>> +        ret = iommu_respond_struct_to_user_array(array, index,
>> +                             (void *)&inv_entry,
>> +                             sizeof(inv_entry));
>> +        if (ret)
>> +            break;
>> +
>> +        processed++;
>> +    }
>> +
>> +out:
>> +    array->entry_num = processed;
>> +    return ret;
>> +}
>> +
>>  static const struct iommu_domain_ops intel_nested_domain_ops = {
>>      .attach_dev        = intel_nested_attach_dev,
>>      .free            = intel_nested_domain_free,
>> +    .cache_invalidate_user    = intel_nested_cache_invalidate_user,
>>  };
>>    struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain *parent,
>
Yi Liu Dec. 22, 2023, 7:01 a.m. UTC | #9
> On Dec 22, 2023, at 14:47, Tian, Kevin <kevin.tian@intel.com> wrote:
> 
> 
>> 
>> From: Yang, Weijiang <weijiang.yang@intel.com>
>> Sent: Friday, December 22, 2023 11:56 AM
>>> +
>>> +    xa_for_each(&domain->iommu_array, i, info) {
>>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr,
>> npages, 0);
>>> +
>>> +        if (domain->has_iotlb_device)
>>> +            continue;
>> 
>> Shouldn't this be if (!domain->has_iotlb_device)?
> 
> yes that is wrong.
> 
> actually it's weird to put domain check in a loop of domain->iommu_array.
> 
> that check along with devtlb flush should be done out of that loop.

Maybe add a bool, set it outside the loop, and check the bool inside the loop.
Tian, Kevin Dec. 22, 2023, 7:12 a.m. UTC | #10
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, December 22, 2023 3:02 PM
> 
> 
> > On Dec 22, 2023, at 14:47, Tian, Kevin <kevin.tian@intel.com> wrote:
> >
> > 
> >>
> >> From: Yang, Weijiang <weijiang.yang@intel.com>
> >> Sent: Friday, December 22, 2023 11:56 AM
> >>> +
> >>> +    xa_for_each(&domain->iommu_array, i, info) {
> >>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr,
> >> npages, 0);
> >>> +
> >>> +        if (domain->has_iotlb_device)
> >>> +            continue;
> >>
> >> Shouldn't this be if (!domain->has_iotlb_device)?
> >
> > yes that is wrong.
> >
> > actually it's weird to put domain check in a loop of domain->iommu_array.
> >
> > that check along with devtlb flush should be done out of that loop.
> 
> Maybe adding a bool, set it out of the loop, check the bool in the loop.

the point is that dev iotlb doesn't rely on info->iommu:

	nested_flush_dev_iotlb(domain, addr, mask, &fault);

then why do it in the loop of info->iommu?
Yi Liu Dec. 22, 2023, 11:59 a.m. UTC | #11
> On Dec 22, 2023, at 15:12, Tian, Kevin <kevin.tian@intel.com> wrote:
> 
> 
>> 
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, December 22, 2023 3:02 PM
>> 
>> 
>>>> On Dec 22, 2023, at 14:47, Tian, Kevin <kevin.tian@intel.com> wrote:
>>> 
>>> 
>>>> 
>>>> From: Yang, Weijiang <weijiang.yang@intel.com>
>>>> Sent: Friday, December 22, 2023 11:56 AM
>>>>> +
>>>>> +    xa_for_each(&domain->iommu_array, i, info) {
>>>>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr,
>>>> npages, 0);
>>>>> +
>>>>> +        if (domain->has_iotlb_device)
>>>>> +            continue;
>>>> 
>>>> Shouldn't this be if (!domain->has_iotlb_device)?
>>> 
>>> yes that is wrong.
>>> 
>>> actually it's weird to put domain check in a loop of domain->iommu_array.
>>> 
>>> that check along with devtlb flush should be done out of that loop.
>> 
>> Maybe adding a bool, set it out of the loop, check the bool in the loop.
> 
> the point is that dev iotlb doesn't rely on info->iommu:
> 
>    nested_flush_dev_iotlb(domain, addr, mask, &fault);
> 
> then why do it in the loop of info->iommu?

yes. It should have another device loop instead.
Yi Liu Dec. 26, 2023, 2:24 a.m. UTC | #12
On 2023/12/22 10:30, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, December 21, 2023 11:40 PM
>>
>> From: Lu Baolu <baolu.lu@linux.intel.com>
>>
>> The updates of the PTEs in the nested page table will be propagated to the
>> hardware caches on both IOMMU (IOTLB) and devices (DevTLB/ATC).
> 
> this is incorrect. the scope of this cmd is driver specific.

yes. May just say the hardware caches.

> 
>>
>> Add a new domain op cache_invalidate_user for the userspace to flush the
>> hardware caches for a nested domain through iommufd. No wrapper for it,
>> as it's only supposed to be used by iommufd. Then, pass in invalidation
>> requests in form of a user data array conatining a number of invalidation
>> data entries.
>>
>> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   include/linux/iommu.h | 27 +++++++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index 6291aa7b079b..5c4a17f13761 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -284,6 +284,24 @@ struct iommu_user_data {
>>   	size_t len;
>>   };
>>
>> +/**
>> + * struct iommu_user_data_array - iommu driver specific user space data
>> array
>> + * @type: The data type of all the entries in the user buffer array
>> + * @uptr: Pointer to the user buffer array for copy_from_user()
> 
> remove 'for copy_from_user();
> 
>> + * @entry_len: The fixed-width length of a entry in the array, in bytes
> 
> s/a/an/
> 
>> + * @entry_num: The number of total entries in the array
>> + *
>> + * A array having a @entry_num number of @entry_len sized entries, each
> 
> the first sentence is redundant.
> 
>> entry is
>> + * user space data, an uAPI defined in include/uapi/linux/iommufd.h where
>> @type
>> + * is also defined as enum iommu_xyz_data_type.
> 
> I'd just say:
> 
> "The user buffer includes an array of requests with format defined
> in include/uapi/linux/iommufd.h"

sure.
Yi Liu Dec. 26, 2023, 3:35 a.m. UTC | #13
On 2023/12/22 11:39, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, December 21, 2023 11:40 PM
>> +
>> +
>> +		if ((inv.flags & IOMMU_TEST_INVALIDATE_FLAG_ALL) &&
>> +		    (inv.flags &
>> IOMMU_TEST_INVALIDATE_FLAG_TRIGGER_ERROR)) {
>> +			rc = -EINVAL;
>> +			break;
>> +		}
>> +
> 
> a  nit. is there a reason why the two flags can not be set together?
> 
> in concept a mock iommu error could occur in either invalidate-one
> or invalidate-all.

I see. I'm ok to relax this check and remove the selftest case as well.

> otherwise,
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Yi Liu Dec. 26, 2023, 4:03 a.m. UTC | #14
On 2023/12/22 12:23, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, December 21, 2023 11:40 PM
>>
>> +	fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>> +	if (fault) {
>> +		if (fsts)
>> +			*fsts |= fault;
> 
> do we expect the fault to be accumulated? otherwise it's clearer to
> just do direct assignment instead of asking for the caller to clear
> the variable before invocation.

not quite getting it. Do you mean the fault should not be cleared on the
caller side?

> the rest looks good:
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tian, Kevin Dec. 26, 2023, 4:13 a.m. UTC | #15
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, December 26, 2023 12:03 PM
> 
> On 2023/12/22 12:23, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Thursday, December 21, 2023 11:40 PM
> >>
> >> +	fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
> >> +	if (fault) {
> >> +		if (fsts)
> >> +			*fsts |= fault;
> >
> > do we expect the fault to be accumulated? otherwise it's clearer to
> > just do direct assignment instead of asking for the caller to clear
> > the variable before invocation.
> 
> not quite get. do you mean the fault should not be cleared in the caller
> side?
> 

I meant:

	if (fsts)
		*fsts = fault;

unless there is a reason to *OR* the original value.
Yi Liu Dec. 26, 2023, 4:51 a.m. UTC | #16
On 2023/12/22 14:57, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, December 21, 2023 11:40 PM
>>
>> +
>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64
>> addr,
>> +				     unsigned long npages, u32 *error)
>> +{
>> +	struct iommu_domain_info *info;
>> +	unsigned long i;
>> +	unsigned mask;
>> +	u32 fault = 0;
>> +
>> +	if (npages == U64_MAX)
>> +		mask = 64 - VTD_PAGE_SHIFT;
>> +	else
>> +		mask = ilog2(__roundup_pow_of_two(npages));
>> +
>> +	xa_for_each(&domain->iommu_array, i, info) {
>> +		nested_flush_pasid_iotlb(info->iommu, domain, addr,
>> npages, 0);
> 
> so IOMMU_VTD_INV_FLAGS_LEAF is defined but ignored?

yeah... it is. It is named ih in the driver code, but it appears the code
below is the only place that sets ih: when calling iommu_flush_iotlb_psi(),
the 5th parameter (ih) may be true.

static int intel_iommu_memory_notifier(struct notifier_block *nb,
				       unsigned long val, void *v)
{
	struct memory_notify *mhp = v;
	unsigned long start_vpfn = mm_to_dma_pfn(mhp->start_pfn);
	unsigned long last_vpfn = mm_to_dma_pfn(mhp->start_pfn +
			mhp->nr_pages - 1);

	switch (val) {
	case MEM_GOING_ONLINE:
		if (iommu_domain_identity_map(si_domain,
					      start_vpfn, last_vpfn)) {
			pr_warn("Failed to build identity map for [%lx-%lx]\n",
				start_vpfn, last_vpfn);
			return NOTIFY_BAD;
		}
		break;

	case MEM_OFFLINE:
	case MEM_CANCEL_ONLINE:
		{
			struct dmar_drhd_unit *drhd;
			struct intel_iommu *iommu;
			LIST_HEAD(freelist);

			domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);

			rcu_read_lock();
			for_each_active_iommu(iommu, drhd)
				iommu_flush_iotlb_psi(iommu, si_domain,
					start_vpfn, mhp->nr_pages,
					list_empty(&freelist), 0);
			rcu_read_unlock();
			put_pages_list(&freelist);
		}
		break;
	}

	return NOTIFY_OK;
}

> 
>> +
>> +		if (domain->has_iotlb_device)
>> +			continue;
>> +
>> +		nested_flush_dev_iotlb(domain, addr, mask, &fault);
>> +		if (fault & (DMA_FSTS_ITE | DMA_FSTS_ICE))
>> +			break;
> 
> here you may add a note that we don't plan to forward invalidation
> queue error (i.e. IQE) to the caller as it's caused only by driver
> internal bug.

yes.

> 
>> +
>> +		if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
>> +		    ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
>> +			ret = -EINVAL;
>> +			break;
>> +		}
>> +
> 
> why is [non-zero-addr, U64_MAX] an error? Is it explicitly stated to
> be not supported by underlying helpers?

There is no such limitation in the underlying helpers. But in that case,
addr + npages * PAGE_SIZE would exceed U64_MAX, which seems a bit
strange. I'm fine with relaxing the check, though, since the underlying
helper only checks npages when determining whether the invalidation is
PASID-selective or not.
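As an aside, the mask computation quoted at the top of this message can be
checked in isolation. Below is a minimal userspace sketch, assuming
VTD_PAGE_SHIFT is 12 as in the driver; nested_flush_mask() is a made-up
standalone name that mirrors the npages-to-mask logic, not a driver function:

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SHIFT 12

/* Mirror ilog2(__roundup_pow_of_two(npages)) from the quoted code:
 * round npages up to a power of two, then take its log2. */
static unsigned int nested_flush_mask(uint64_t npages)
{
	unsigned int mask = 0;

	if (npages == UINT64_MAX)
		return 64 - VTD_PAGE_SHIFT;	/* flush the whole address space */

	while (((uint64_t)1 << mask) < npages)
		mask++;

	return mask;
}
```

For example, 3 pages round up to 4, giving mask = 2, so the hardware
invalidates a 4-page aligned region covering the requested range.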
Tian, Kevin Dec. 26, 2023, 6:11 a.m. UTC | #17
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, December 26, 2023 12:52 PM
> >> +
> >> +		if (!IS_ALIGNED(inv_entry.addr, VTD_PAGE_SIZE) ||
> >> +		    ((inv_entry.npages == U64_MAX) && inv_entry.addr)) {
> >> +			ret = -EINVAL;
> >> +			break;
> >> +		}
> >> +
> >
> > why is [non-zero-addr, U64_MAX] an error? Is it explicitly stated to
> > be not supported by underlying helpers?
> 
> no such limitation by underlying helpers. But in such case, the
> addr+npages*PAGE_SIZE would exceed U64_MAX, this seems a bit
> strange. But I'm fine to relax the check since the underlying helper
> only checks npages when determining paid-selective or not.
> 

I overlooked that; I had read npages as an end address. Let's keep the check.
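For completeness, the semantics of the check being kept can be exercised
standalone. A sketch under the assumption that VTD_PAGE_SIZE is 4 KiB;
nested_inv_range_ok() is a hypothetical helper for illustration, not a
driver function:

```c
#include <assert.h>
#include <stdint.h>

#define VTD_PAGE_SIZE 4096ULL

/* Mirrors the quoted validation: addr must be page-aligned, and a
 * whole-address-space flush (npages == U64_MAX) must start at addr 0,
 * since otherwise addr + npages * PAGE_SIZE would wrap past U64_MAX. */
static int nested_inv_range_ok(uint64_t addr, uint64_t npages)
{
	if (addr & (VTD_PAGE_SIZE - 1))
		return 0;			/* misaligned address */
	if (npages == UINT64_MAX && addr)
		return 0;			/* full flush must start at 0 */
	return 1;
}
```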
Yi Liu Dec. 26, 2023, 6:15 a.m. UTC | #18
On 2023/12/26 12:13, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Tuesday, December 26, 2023 12:03 PM
>>
>> On 2023/12/22 12:23, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>>
>>>> +	fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>>>> +	if (fault) {
>>>> +		if (fsts)
>>>> +			*fsts |= fault;
>>>
>>> do we expect the fault to be accumulated? otherwise it's clearer to
>>> just do direct assignment instead of asking for the caller to clear
>>> the variable before invocation.
>>
>> not quite get. do you mean the fault should not be cleared in the caller
>> side?
>>
> 
> I meant:
> 
> 	if (fsts)
> 		*fsts = fault;
> 
> unless there is a reason to *OR* the original value.

I guess there is no such reason. :) Let me modify it.
Yi Liu Dec. 26, 2023, 8:44 a.m. UTC | #19
On 2023/12/26 14:15, Yi Liu wrote:
> 
> 
> On 2023/12/26 12:13, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Tuesday, December 26, 2023 12:03 PM
>>>
>>> On 2023/12/22 12:23, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>>>
>>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>>>>> +    if (fault) {
>>>>> +        if (fsts)
>>>>> +            *fsts |= fault;
>>>>
>>>> do we expect the fault to be accumulated? otherwise it's clearer to
>>>> just do direct assignment instead of asking for the caller to clear
>>>> the variable before invocation.
>>>
>>> not quite get. do you mean the fault should not be cleared in the caller
>>> side?
>>>
>>
>> I meant:
>>
>>     if (fsts)
>>         *fsts = fault;
>>
>> unless there is a reason to *OR* the original value.
> 
> I guess no such a reason. :) let me modify it.

Hmmm, I replied too soon. qi_check_fault() is called multiple times in
one invalidation cycle, as qi_submit_sync() needs to check whether any
fault happened before the hw writes QI_DONE back to the wait descriptor.
There can be an ICE which may eventually result in an ITE, so the caller
of qi_check_fault() continues to wait for QI_DONE; qi_check_fault()
returns 0 to let qi_submit_sync() go on even though an ICE was detected.
If we used '*fsts = fault;', the ICE would be lost, since the same fsts
pointer is passed on every call within one qi_submit_sync() invocation.
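To illustrate with a toy model (the helper shapes below are simplified
stand-ins, not the driver code, and the DMA_FSTS_* bit positions are only
illustrative): when the fault check runs twice in one cycle against the
same out-pointer, first seeing ICE and later ITE, OR-ing keeps both bits
while plain assignment drops the earlier ICE:

```c
#include <assert.h>
#include <stdint.h>

#define DMA_FSTS_ICE (1u << 5)	/* bit positions illustrative only */
#define DMA_FSTS_ITE (1u << 6)

/* Accumulating variant: each call ORs newly observed faults in. */
static void record_fault_or(uint32_t fault, uint32_t *fsts)
{
	if (fault && fsts)
		*fsts |= fault;
}

/* Direct-assignment variant: a later call overwrites earlier faults. */
static void record_fault_assign(uint32_t fault, uint32_t *fsts)
{
	if (fault && fsts)
		*fsts = fault;
}
```

With two calls (ICE, then ITE) on the same pointer, the OR variant ends
with both bits set; the assignment variant reports only ITE, which is
exactly the lost-ICE scenario described above.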
Yi Liu Dec. 26, 2023, 8:46 a.m. UTC | #20
On 2023/12/22 19:59, Liu, Yi L wrote:
> 
>> On Dec 22, 2023, at 15:12, Tian, Kevin <kevin.tian@intel.com> wrote:
>>
>> 
>>>
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Friday, December 22, 2023 3:02 PM
>>>
>>>
>>>>> On Dec 22, 2023, at 14:47, Tian, Kevin <kevin.tian@intel.com> wrote:
>>>>
>>>>
>>>>>
>>>>> From: Yang, Weijiang <weijiang.yang@intel.com>
>>>>> Sent: Friday, December 22, 2023 11:56 AM
>>>>>> +
>>>>>> +    xa_for_each(&domain->iommu_array, i, info) {
>>>>>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr,
>>>>> npages, 0);
>>>>>> +
>>>>>> +        if (domain->has_iotlb_device)
>>>>>> +            continue;
>>>>>
>>>>> Shouldn't this be if (!domain->has_iotlb_device)?
>>>>
>>>> yes that is wrong.
>>>>
>>>> actually it's weird to put domain check in a loop of domain->iommu_array.
>>>>
>>>> that check along with devtlb flush should be done out of that loop.
>>>
>>> Maybe adding a bool, set it out of the loop, check the bool in the loop.
>>
>> the point is that dev iotlb doesn't rely on info->iommu:
>>
>>     nested_flush_dev_iotlb(domain, addr, mask, &fault);
>>
>> then why do it in the loop of info->iommu?
> 
> yes. It should have another device loop instead.

Let me move the device-TLB related code out of the info->iommu loop.
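The agreed restructuring can be sketched as a toy skeleton (the struct,
counters, and names here are illustrative stand-ins, not driver code):
the per-IOMMU IOTLB flush stays inside the iommu_array loop, while the
device-TLB flush runs once after it, gated on has_iotlb_device with the
condition un-inverted:

```c
#include <assert.h>

struct toy_domain {
	int niommus;		/* stand-in for the iommu_array entries */
	int has_iotlb_device;	/* domain has ATS-enabled devices */
	int iotlb_flushes;	/* counts per-IOMMU IOTLB flushes */
	int devtlb_flushes;	/* counts device-TLB flush passes */
};

/* The device-TLB flush does not depend on which iommu is being
 * iterated, so it runs once after the loop, iterating devices there. */
static void toy_flush_cache(struct toy_domain *d)
{
	for (int i = 0; i < d->niommus; i++)
		d->iotlb_flushes++;	/* nested_flush_pasid_iotlb(...) */

	if (d->has_iotlb_device)
		d->devtlb_flushes++;	/* loop over devices, not iommus */
}
```

Note that the posted code both kept this inside the iommu loop and
skipped it precisely when has_iotlb_device was set; the sketch shows the
corrected placement and polarity.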
Tian, Kevin Dec. 26, 2023, 9:21 a.m. UTC | #21
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, December 26, 2023 4:44 PM
> 
> On 2023/12/26 14:15, Yi Liu wrote:
> >
> >
> > On 2023/12/26 12:13, Tian, Kevin wrote:
> >>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>> Sent: Tuesday, December 26, 2023 12:03 PM
> >>>
> >>> On 2023/12/22 12:23, Tian, Kevin wrote:
> >>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>> Sent: Thursday, December 21, 2023 11:40 PM
> >>>>>
> >>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
> >>>>> +    if (fault) {
> >>>>> +        if (fsts)
> >>>>> +            *fsts |= fault;
> >>>>
> >>>> do we expect the fault to be accumulated? otherwise it's clearer to
> >>>> just do direct assignment instead of asking for the caller to clear
> >>>> the variable before invocation.
> >>>
> >>> not quite get. do you mean the fault should not be cleared in the caller
> >>> side?
> >>>
> >>
> >> I meant:
> >>
> >>     if (fsts)
> >>         *fsts = fault;
> >>
> >> unless there is a reason to *OR* the original value.
> >
> > I guess no such a reason. :) let me modify it.
> 
> hmmm, replied too soon. The qi_check_fault() would be called multiple
> times in one invalidation circle as qi_submit_sync() needs to see if any
> fault happened before the hw writes back QI_DONE to the wait descriptor.
> There can be ICE which may eventually result in ITE. So caller of
> qi_check_fault()
> would continue to wait for QI_DONE. So qi_check_fault() returns 0 to let
> qi_submit_sync() go on though ICE detected. If we use '*fsts = fault;',
> then ICE would be missed since the input fsts pointer is the same in
> one qi_submit_sync() call.
> 

ok, that makes sense then.
Yi Liu Dec. 26, 2023, 12:35 p.m. UTC | #22
On 2023/12/26 12:51, Yi Liu wrote:
> On 2023/12/22 14:57, Tian, Kevin wrote:
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>
>>> +
>>> +static void intel_nested_flush_cache(struct dmar_domain *domain, u64
>>> addr,
>>> +                     unsigned long npages, u32 *error)
>>> +{
>>> +    struct iommu_domain_info *info;
>>> +    unsigned long i;
>>> +    unsigned mask;
>>> +    u32 fault = 0;
>>> +
>>> +    if (npages == U64_MAX)
>>> +        mask = 64 - VTD_PAGE_SHIFT;
>>> +    else
>>> +        mask = ilog2(__roundup_pow_of_two(npages));
>>> +
>>> +    xa_for_each(&domain->iommu_array, i, info) {
>>> +        nested_flush_pasid_iotlb(info->iommu, domain, addr,
>>> npages, 0);
>>
>> so IOMMU_VTD_INV_FLAGS_LEAF is defined but ignored?
> 
> yeah... it is. It is named as ih in the driver code. But it appears only
> the below code is set ih. When calling iommu_flush_iotlb_psi(), the 5th
> parameter (ih) may be true.
> 
> static int intel_iommu_memory_notifier(struct notifier_block *nb,
>                         unsigned long val, void *v)
> {
>      struct memory_notify *mhp = v;
>      unsigned long start_vpfn = mm_to_dma_pfn(mhp->start_pfn);
>      unsigned long last_vpfn = mm_to_dma_pfn(mhp->start_pfn +
>              mhp->nr_pages - 1);
> 
>      switch (val) {
>      case MEM_GOING_ONLINE:
>          if (iommu_domain_identity_map(si_domain,
>                            start_vpfn, last_vpfn)) {
>              pr_warn("Failed to build identity map for [%lx-%lx]\n",
>                  start_vpfn, last_vpfn);
>              return NOTIFY_BAD;
>          }
>          break;
> 
>      case MEM_OFFLINE:
>      case MEM_CANCEL_ONLINE:
>          {
>              struct dmar_drhd_unit *drhd;
>              struct intel_iommu *iommu;
>              LIST_HEAD(freelist);
> 
>              domain_unmap(si_domain, start_vpfn, last_vpfn, &freelist);
> 
>              rcu_read_lock();
>              for_each_active_iommu(iommu, drhd)
>                  iommu_flush_iotlb_psi(iommu, si_domain,
>                      start_vpfn, mhp->nr_pages,
>                      list_empty(&freelist), 0);
>              rcu_read_unlock();
>              put_pages_list(&freelist);
>          }
>          break;
>      }
> 
>      return NOTIFY_OK;
> }

I now pass this flag to intel_nested_flush_cache(), as the helper accepts
an ih parameter.
Zhenzhong Duan Dec. 27, 2023, 9:06 a.m. UTC | #23
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Sent: Tuesday, December 26, 2023 4:44 PM
>Subject: Re: [PATCH v7 7/9] iommu/vt-d: Allow qi_submit_sync() to return
>the QI faults
>
>On 2023/12/26 14:15, Yi Liu wrote:
>>
>>
>> On 2023/12/26 12:13, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Tuesday, December 26, 2023 12:03 PM
>>>>
>>>> On 2023/12/22 12:23, Tian, Kevin wrote:
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>>>>
>>>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>>>>>> +    if (fault) {
>>>>>> +        if (fsts)
>>>>>> +            *fsts |= fault;
>>>>>
>>>>> do we expect the fault to be accumulated? otherwise it's clearer to
>>>>> just do direct assignment instead of asking for the caller to clear
>>>>> the variable before invocation.
>>>>
>>>> not quite get. do you mean the fault should not be cleared in the caller
>>>> side?
>>>>
>>>
>>> I meant:
>>>
>>>     if (fsts)
>>>         *fsts = fault;
>>>
>>> unless there is a reason to *OR* the original value.
>>
>> I guess no such a reason. :) let me modify it.
>
>hmmm, replied too soon. The qi_check_fault() would be called multiple
>times in one invalidation circle as qi_submit_sync() needs to see if any
>fault happened before the hw writes back QI_DONE to the wait descriptor.
>There can be ICE which may eventually result in ITE. So caller of
>qi_check_fault()
>would continue to wait for QI_DONE. So qi_check_fault() returns 0 to let
>qi_submit_sync() go on though ICE detected. If we use '*fsts = fault;',
>then ICE would be missed since the input fsts pointer is the same in
>one qi_submit_sync() call.

Is it necessary to return the fault to the user if qi_check_fault()
returns -EAGAIN and a restarted run then succeeds?

Thanks
Zhenzhong
Ethan Zhao Dec. 27, 2023, 9:33 a.m. UTC | #24
On 12/27/2023 5:06 PM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Tuesday, December 26, 2023 4:44 PM
>> Subject: Re: [PATCH v7 7/9] iommu/vt-d: Allow qi_submit_sync() to return
>> the QI faults
>>
>> On 2023/12/26 14:15, Yi Liu wrote:
>>>
>>> On 2023/12/26 12:13, Tian, Kevin wrote:
>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Sent: Tuesday, December 26, 2023 12:03 PM
>>>>>
>>>>> On 2023/12/22 12:23, Tian, Kevin wrote:
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>>>>>
>>>>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>>>>>>> +    if (fault) {
>>>>>>> +        if (fsts)
>>>>>>> +            *fsts |= fault;
>>>>>> do we expect the fault to be accumulated? otherwise it's clearer to
>>>>>> just do direct assignment instead of asking for the caller to clear
>>>>>> the variable before invocation.
>>>>> not quite get. do you mean the fault should not be cleared in the caller
>>>>> side?
>>>>>
>>>> I meant:
>>>>
>>>>      if (fsts)
>>>>          *fsts = fault;
>>>>
>>>> unless there is a reason to *OR* the original value.
>>> I guess no such a reason. :) let me modify it.
>> hmmm, replied too soon. The qi_check_fault() would be called multiple
>> times in one invalidation circle as qi_submit_sync() needs to see if any
>> fault happened before the hw writes back QI_DONE to the wait descriptor.
>> There can be ICE which may eventually result in ITE. So caller of
>> qi_check_fault()
>> would continue to wait for QI_DONE. So qi_check_fault() returns 0 to let
>> qi_submit_sync() go on though ICE detected. If we use '*fsts = fault;',
>> then ICE would be missed since the input fsts pointer is the same in
>> one qi_submit_sync() call.
> Is it necessary to return fault to user if qi_check_fault() return -EAGAIN and
> a restart run succeeds?

If a device-TLB invalidation is issued to a non-responsive device, there
is a possibility of being trapped in the loop waiting on the ITE, never
returning.

Thanks,

Ethan

> Thanks
> Zhenzhong
Yi Liu Dec. 27, 2023, 2:12 p.m. UTC | #25
On 2023/12/27 17:33, Ethan Zhao wrote:
> 
> On 12/27/2023 5:06 PM, Duan, Zhenzhong wrote:
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Sent: Tuesday, December 26, 2023 4:44 PM
>>> Subject: Re: [PATCH v7 7/9] iommu/vt-d: Allow qi_submit_sync() to return
>>> the QI faults
>>>
>>> On 2023/12/26 14:15, Yi Liu wrote:
>>>>
>>>> On 2023/12/26 12:13, Tian, Kevin wrote:
>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>> Sent: Tuesday, December 26, 2023 12:03 PM
>>>>>>
>>>>>> On 2023/12/22 12:23, Tian, Kevin wrote:
>>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>>> Sent: Thursday, December 21, 2023 11:40 PM
>>>>>>>>
>>>>>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
>>>>>>>> +    if (fault) {
>>>>>>>> +        if (fsts)
>>>>>>>> +            *fsts |= fault;
>>>>>>> do we expect the fault to be accumulated? otherwise it's clearer to
>>>>>>> just do direct assignment instead of asking for the caller to clear
>>>>>>> the variable before invocation.
>>>>>> not quite get. do you mean the fault should not be cleared in the caller
>>>>>> side?
>>>>>>
>>>>> I meant:
>>>>>
>>>>>      if (fsts)
>>>>>          *fsts = fault;
>>>>>
>>>>> unless there is a reason to *OR* the original value.
>>>> I guess no such a reason. :) let me modify it.
>>> hmmm, replied too soon. The qi_check_fault() would be called multiple
>>> times in one invalidation circle as qi_submit_sync() needs to see if any
>>> fault happened before the hw writes back QI_DONE to the wait descriptor.
>>> There can be ICE which may eventually result in ITE. So caller of
>>> qi_check_fault()
>>> would continue to wait for QI_DONE. So qi_check_fault() returns 0 to let
>>> qi_submit_sync() go on though ICE detected. If we use '*fsts = fault;',
>>> then ICE would be missed since the input fsts pointer is the same in
>>> one qi_submit_sync() call.
>> Is it necessary to return fault to user if qi_check_fault() return 
>> -EAGAIN and
>> a restart run succeeds?

No need if a restart succeeds. I will add a '*fault = 0' at the restart.

> 
> Issue a device-TLB invalidation to no response device there is possibility
> 
> will be trapped there loop for ITE , never get return.

Yes, that is the implementation today. In the future I think we may need
some kind of timeout mechanism, so that it can return and report the
error to the user. Conceptually, in nested translation the page table is
owned by userspace, so it makes more sense to let userspace know about
the failure and take proper action.
Tian, Kevin Dec. 28, 2023, 5:39 a.m. UTC | #26
> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, December 27, 2023 10:13 PM
> 
> On 2023/12/27 17:33, Ethan Zhao wrote:
> >
> > On 12/27/2023 5:06 PM, Duan, Zhenzhong wrote:
> >>
> >>> -----Original Message-----
> >>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>> Sent: Tuesday, December 26, 2023 4:44 PM
> >>> Subject: Re: [PATCH v7 7/9] iommu/vt-d: Allow qi_submit_sync() to
> return
> >>> the QI faults
> >>>
> >>> On 2023/12/26 14:15, Yi Liu wrote:
> >>>>
> >>>> On 2023/12/26 12:13, Tian, Kevin wrote:
> >>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>> Sent: Tuesday, December 26, 2023 12:03 PM
> >>>>>>
> >>>>>> On 2023/12/22 12:23, Tian, Kevin wrote:
> >>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>>>> Sent: Thursday, December 21, 2023 11:40 PM
> >>>>>>>>
> >>>>>>>> +    fault &= DMA_FSTS_IQE | DMA_FSTS_ITE | DMA_FSTS_ICE;
> >>>>>>>> +    if (fault) {
> >>>>>>>> +        if (fsts)
> >>>>>>>> +            *fsts |= fault;
> >>>>>>> do we expect the fault to be accumulated? otherwise it's clearer to
> >>>>>>> just do direct assignment instead of asking for the caller to clear
> >>>>>>> the variable before invocation.
> >>>>>> not quite get. do you mean the fault should not be cleared in the
> caller
> >>>>>> side?
> >>>>>>
> >>>>> I meant:
> >>>>>
> >>>>>      if (fsts)
> >>>>>          *fsts = fault;
> >>>>>
> >>>>> unless there is a reason to *OR* the original value.
> >>>> I guess no such a reason. :) let me modify it.
> >>> hmmm, replied too soon. The qi_check_fault() would be called multiple
> >>> times in one invalidation circle as qi_submit_sync() needs to see if any
> >>> fault happened before the hw writes back QI_DONE to the wait
> descriptor.
> >>> There can be ICE which may eventually result in ITE. So caller of
> >>> qi_check_fault()
> >>> would continue to wait for QI_DONE. So qi_check_fault() returns 0 to let
> >>> qi_submit_sync() go on though ICE detected. If we use '*fsts = fault;',
> >>> then ICE would be missed since the input fsts pointer is the same in
> >>> one qi_submit_sync() call.
> >> Is it necessary to return fault to user if qi_check_fault() return
> >> -EAGAIN and
> >> a restart run succeeds?
> 
> no need if a restart succeeds. I would add a *fault = 0 per the restart.
> 
> >
> > Issue a device-TLB invalidation to no response device there is possibility
> >
> > will be trapped there loop for ITE , never get return.
> 
> yes. This the implementation today, in future I think we may need a kind
> of timeout mechanism, so that it can return and report the error to user.
> In concept, in nested translation, the page table is owned by userspace, so
> it makes more sense to let userspace know it and take proper action.
> 

It doesn't make sense to retry upon an invalidation request from
userspace; if a retry is required, that is the policy of the guest iommu
driver. Also, it's not good to introduce a uAPI flag which won't be set
by the current driver.

this can be solved by a simple change in qi_check_fault():

	if (qi->desc_status[wait_index] == QI_ABORT)
- 		return -EAGAIN;
+		return fsts ? -ETIMEDOUT : -EAGAIN;

Because if the caller wants to know the fault reason, the implication is
that the caller will decide how to cope with the fault; it is incorrect
for qi_check_fault() to decide.
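The suggested change can be modeled in isolation (a toy reproduction of
the control flow, not the real qi_check_fault(); the QI_DONE/QI_ABORT
values and the function shape are simplified stand-ins): a caller that
passed a fault pointer gets -ETIMEDOUT on an aborted wait descriptor and
copes with it itself, while a legacy caller passing NULL keeps the old
internal -EAGAIN retry behavior:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum { QI_DONE, QI_ABORT };	/* illustrative status values */

/* Toy model of the proposed hunk: the presence of a fault pointer
 * signals that the caller owns fault handling, so don't retry. */
static int toy_qi_check_fault(int desc_status, uint32_t *fsts)
{
	if (desc_status == QI_ABORT)
		return fsts ? -ETIMEDOUT : -EAGAIN;
	return 0;
}
```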