mbox series

[v6,0/8] Add Intel VT-d nested translation (part 1/2)

Message ID 20231020093246.17015-1-yi.l.liu@intel.com
Headers show
Series Add Intel VT-d nested translation (part 1/2) | expand

Message

Yi Liu Oct. 20, 2023, 9:32 a.m. UTC
This is the first part to add Intel VT-d nested translation based on IOMMUFD
nesting infrastructure. As the iommufd nesting infrastructure series[1],
iommu core supports new ops to allocate domains with user data. For nesting,
the user data is vendor-specific, IOMMU_HWPT_DATA_VTD_S1 is defined for
the Intel VT-d stage-1 page table, it will be used in the stage-1 domain
allocation path. struct iommu_hwpt_vtd_s1 is defined to pass user_data
for the Intel VT-d stage-1 domain allocation. This series does not have
the cache invalidation path, it would be added in part 2/2.

The first Intel platform supporting nested translation is Sapphire
Rapids which, unfortunately, has a hardware errata [2] requiring special
treatment. This errata happens when a stage-1 page table page (either
level) is located in a stage-2 read-only region. In that case the IOMMU
hardware may ignore the stage-2 RO permission and still set the A/D bit
in stage-1 page table entries during page table walking.

A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
this errata to userspace. With that restriction the user should either
disable nested translation to favor RO stage-2 mappings or ensure no
RO stage-2 mapping to enable nested translation.

Intel-iommu driver is armed with necessary checks to prevent such mix
in patch8 of this series.

Qemu currently does add RO mappings though. The vfio agent in Qemu
simply maps all valid regions in the GPA address space which certainly
includes RO regions e.g. vbios.

In reality we don't know a usage relying on DMA reads from the BIOS
region. Hence finding a way to skip RO regions (e.g. via a discard manager)
in Qemu might be an acceptable tradeoff. The actual change needs more
discussion in Qemu community. For now we just hacked Qemu to test.

Complete code can be found in [3], corresponding QEMU could can be found
in [4].

[1] https://lore.kernel.org/linux-iommu/20231020091946.12173-1-yi.l.liu@intel.com/
[2] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting
[4] https://github.com/yiliu1765/qemu/tree/zhenzhong/wip/iommufd_nesting_rfcv1

Change log:

v6:
 - Add Kevin's r-b for patch 1 and 8
 - Drop Kevin's r-b for patch 7
 - Address comments from Kevin
 - Split the VT-d nesting series into two parts 1/2 and 2/2

v5: https://lore.kernel.org/linux-iommu/20230921075431.125239-1-yi.l.liu@intel.com/
 - Add Kevin's r-b for patch 2, 3 ,5 8, 10
 - Drop enforce_cache_coherency callback from the nested type domain ops (Kevin)
 - Remove duplicate agaw check in patch 04 (Kevin)
 - Remove duplicate domain_update_iommu_cap() in patch 06 (Kevin)
 - Check parent's force_snooping to set pgsnp in the pasid entry (Kevin)
 - uapi data structure check (Kevin)
 - Simplify the errata handling as user can allocate nested parent domain

v4: https://lore.kernel.org/linux-iommu/20230724111335.107427-1-yi.l.liu@intel.com/
 - Remove ascii art tables (Jason)
 - Drop EMT (Tina, Jason)
 - Drop MTS and related definitions (Kevin)
 - Rename macro IOMMU_VTD_PGTBL_ to IOMMU_VTD_S1_ (Kevin)
 - Rename struct iommu_hwpt_intel_vtd_ to iommu_hwpt_vtd_ (Kevin)
 - Rename struct iommu_hwpt_intel_vtd to iommu_hwpt_vtd_s1 (Kevin)
 - Put the vendor specific hwpt alloc data structure before enuma iommu_hwpt_type (Kevin)
 - Do not trim the higher page levels of S2 domain in nested domain attachment as the
   S2 domain may have been used independently. (Kevin)
 - Remove the first-stage pgd check against the maximum address of s2_domain as hw
   can check it anyhow. It makes sense to check every pfns used in the stage-1 page
   table. But it cannot make it. So just leave it to hw. (Kevin)
 - Split the iotlb flush part into an order of uapi, helper and callback implementation (Kevin)
 - Change the policy of VT-d nesting errata, disallow RO mapping once a domain is used
   as parent domain of a nested domain. This removes the nested_users counting. (Kevin)
 - Minor fix for "make htmldocs"

v3: https://lore.kernel.org/linux-iommu/20230511145110.27707-1-yi.l.liu@intel.com/
 - Further split the patches into an order of adding helpers for nested
   domain, iotlb flush, nested domain attachment and nested domain allocation
   callback, then report the hw_info to userspace.
 - Add batch support in cache invalidation from userspace
 - Disallow nested translation usage if RO mappings exists in stage-2 domain
   due to errata on readonly mappings on Sapphire Rapids platform.

v2: https://lore.kernel.org/linux-iommu/20230309082207.612346-1-yi.l.liu@intel.com/
 - The iommufd infrastructure is split to be separate series.

v1: https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.com/

Regards,
	Yi Liu

Lu Baolu (5):
  iommu/vt-d: Extend dmar_domain to support nested domain
  iommu/vt-d: Add helper for nested domain allocation
  iommu/vt-d: Add helper to setup pasid nested translation
  iommu/vt-d: Add nested domain allocation
  iommu/vt-d: Disallow read-only mappings to nest parent domain

Yi Liu (3):
  iommufd: Add data structure for Intel VT-d stage-1 domain allocation
  iommu/vt-d: Make domain attach helpers to be extern
  iommu/vt-d: Set the nested domain to a device

 drivers/iommu/intel/Makefile |   2 +-
 drivers/iommu/intel/iommu.c  |  63 +++++++++++++-------
 drivers/iommu/intel/iommu.h  |  46 ++++++++++++--
 drivers/iommu/intel/nested.c | 109 ++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.c  | 112 +++++++++++++++++++++++++++++++++++
 drivers/iommu/intel/pasid.h  |   2 +
 include/uapi/linux/iommufd.h |  42 ++++++++++++-
 7 files changed, 348 insertions(+), 28 deletions(-)
 create mode 100644 drivers/iommu/intel/nested.c

Comments

Baolu Lu Oct. 21, 2023, 3:24 a.m. UTC | #1
On 10/20/23 5:32 PM, Yi Liu wrote:
> From: Lu Baolu <baolu.lu@linux.intel.com>
> 
> When remapping hardware is configured by system software in scalable mode
> as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry,
> it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled)
> in first-stage page-table entries even when second-stage mappings indicate
> that corresponding first-stage page-table is Read-Only.
> 
> As the result, contents of pages designated by VMM as Read-Only can be
> modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of
> address translation process due to DMAs issued by Guest.
> 
> This disallows read-only mappings in the domain that is supposed to be used
> as nested parent. Reference from Sapphire Rapids Specification Update [1],
> errata details, SPR17. Userspace should know this limitation by checking
> the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag reported in the IOMMU_GET_HW_INFO
> ioctl.
> 
> [1] https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   drivers/iommu/intel/iommu.c  |  9 +++++++++
>   drivers/iommu/intel/iommu.h  |  1 +
>   include/uapi/linux/iommufd.h | 12 +++++++++++-
>   3 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index c7704e7efd4a..a0341a069fbf 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -2193,6 +2193,11 @@ __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
>   	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
>   		return -EINVAL;
>   
> +	if (!(prot & DMA_PTE_WRITE) && domain->is_nested_parent) {
> +		pr_err_ratelimited("Read-only mapping is disallowed on the domain which serves as the parent in a nested configuration, due to HW errata (ERRATA_772415_SPR17)\n");
> +		return -EINVAL;
> +	}
> +
>   	attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
>   	attr |= DMA_FL_PTE_PRESENT;
>   	if (domain->use_first_level) {
> @@ -4101,6 +4106,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32 flags,
>   		domain = iommu_domain_alloc(dev->bus);
>   		if (!domain)
>   			return ERR_PTR(-ENOMEM);
> +		container_of(domain,
> +			     struct dmar_domain,
> +			     domain)->is_nested_parent = request_nest_parent;

How about
		to_dmar_domain(domain)->is_nested_parent = ...;
?

I would also prefer to introduce is_nested_parent_domain to the user
domain allocation patch (patch 7/8). This field should be checked when
allocating a nested user domain.

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8f81a5c9fcc0..d3f6bc1f6590 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4121,6 +4121,8 @@ intel_iommu_domain_alloc_user(struct device *dev, 
u32 flags,
                 return ERR_PTR(-EINVAL);
         if (request_nest_parent)
                 return ERR_PTR(-EINVAL);
+       if (!to_dmar_domain(parent)->is_nested_parent)
+               return ERR_PTR(-EINVAL);

         return intel_nested_domain_alloc(parent, user_data);
  }

Best regards,
baolu
Yi Liu Oct. 23, 2023, 11 a.m. UTC | #2
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Friday, October 20, 2023 7:49 PM
> 
> On 2023/10/20 17:32, Yi Liu wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > This adds helper for accepting user parameters and allocate a nested
> > domain.
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >   drivers/iommu/intel/Makefile |  2 +-
> >   drivers/iommu/intel/iommu.h  |  2 ++
> >   drivers/iommu/intel/nested.c | 55 ++++++++++++++++++++++++++++++++++++
> >   3 files changed, 58 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/iommu/intel/nested.c
> >
> > diff --git a/drivers/iommu/intel/Makefile b/drivers/iommu/intel/Makefile
> > index 7af3b8a4f2a0..5dabf081a779 100644
> > --- a/drivers/iommu/intel/Makefile
> > +++ b/drivers/iommu/intel/Makefile
> > @@ -1,6 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0
> >   obj-$(CONFIG_DMAR_TABLE) += dmar.o
> > -obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o
> > +obj-$(CONFIG_INTEL_IOMMU) += iommu.o pasid.o nested.o
> >   obj-$(CONFIG_DMAR_TABLE) += trace.o cap_audit.o
> >   obj-$(CONFIG_DMAR_PERF) += perf.o
> >   obj-$(CONFIG_INTEL_IOMMU_DEBUGFS) += debugfs.o
> > diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
> > index a91077a063ee..ff55184456dd 100644
> > --- a/drivers/iommu/intel/iommu.h
> > +++ b/drivers/iommu/intel/iommu.h
> > @@ -866,6 +866,8 @@ void *alloc_pgtable_page(int node, gfp_t gfp);
> >   void free_pgtable_page(void *vaddr);
> >   void iommu_flush_write_buffer(struct intel_iommu *iommu);
> >   struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn);
> > +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain
> *s2_domain,
> > +					       const struct iommu_user_data *user_data);
> >
> >   #ifdef CONFIG_INTEL_IOMMU_SVM
> >   void intel_svm_check(struct intel_iommu *iommu);
> > diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
> > new file mode 100644
> > index 000000000000..5a2920a98e47
> > --- /dev/null
> > +++ b/drivers/iommu/intel/nested.c
> > @@ -0,0 +1,55 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * nested.c - nested mode translation support
> > + *
> > + * Copyright (C) 2023 Intel Corporation
> > + *
> > + * Author: Lu Baolu <baolu.lu@linux.intel.com>
> > + *         Jacob Pan <jacob.jun.pan@linux.intel.com>
> > + *         Yi Liu <yi.l.liu@intel.com>
> > + */
> > +
> > +#define pr_fmt(fmt)	"DMAR: " fmt
> > +
> > +#include <linux/iommu.h>
> > +
> > +#include "iommu.h"
> > +
> > +static void intel_nested_domain_free(struct iommu_domain *domain)
> > +{
> > +	kfree(to_dmar_domain(domain));
> > +}
> > +
> > +static const struct iommu_domain_ops intel_nested_domain_ops = {
> > +	.free			= intel_nested_domain_free,
> > +};
> > +
> > +struct iommu_domain *intel_nested_domain_alloc(struct iommu_domain
> *s2_domain,
> > +					       const struct iommu_user_data *user_data)
> > +{
> > +	struct iommu_hwpt_vtd_s1 vtd;
> > +	struct dmar_domain *domain;
> > +	int ret;
> > +
> > +	ret = iommu_copy_struct_from_user(&vtd, user_data,
> > +					  IOMMU_HWPT_DATA_VTD_S1, __reserved);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> > +
> > +	domain = kzalloc(sizeof(*domain), GFP_KERNEL_ACCOUNT);
> > +	if (!domain)
> > +		return NULL;
> 
> 	return ERR_PTR(-ENOMEM);
> ?

Yes, good catch!

> > +
> > +	domain->use_first_level = true;
> > +	domain->s2_domain = to_dmar_domain(s2_domain);
> > +	domain->s1_pgtbl = vtd.pgtbl_addr;
> > +	domain->s1_cfg = vtd;
> > +	domain->domain.ops = &intel_nested_domain_ops;
> > +	domain->domain.type = IOMMU_DOMAIN_NESTED;
> > +	INIT_LIST_HEAD(&domain->devices);
> > +	INIT_LIST_HEAD(&domain->dev_pasids);
> > +	spin_lock_init(&domain->lock);
> > +	xa_init(&domain->iommu_array);
> > +
> > +	return &domain->domain;
> > +}
> 
> Best regards,
> baolu
Yi Liu Oct. 23, 2023, 11:15 a.m. UTC | #3
> From: Baolu Lu <baolu.lu@linux.intel.com>
> Sent: Saturday, October 21, 2023 11:24 AM
> 
> On 10/20/23 5:32 PM, Yi Liu wrote:
> > From: Lu Baolu <baolu.lu@linux.intel.com>
> >
> > When remapping hardware is configured by system software in scalable mode
> > as Nested (PGTT=011b) and with PWSNP field Set in the PASID-table-entry,
> > it may Set Accessed bit and Dirty bit (and Extended Access bit if enabled)
> > in first-stage page-table entries even when second-stage mappings indicate
> > that corresponding first-stage page-table is Read-Only.
> >
> > As the result, contents of pages designated by VMM as Read-Only can be
> > modified by IOMMU via PML5E (PML4E for 4-level tables) access as part of
> > address translation process due to DMAs issued by Guest.
> >
> > This disallows read-only mappings in the domain that is supposed to be used
> > as nested parent. Reference from Sapphire Rapids Specification Update [1],
> > errata details, SPR17. Userspace should know this limitation by checking
> > the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag reported in the
> IOMMU_GET_HW_INFO
> > ioctl.
> >
> > [1] https://www.intel.com/content/www/us/en/content-details/772415/content-
> details.html
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >   drivers/iommu/intel/iommu.c  |  9 +++++++++
> >   drivers/iommu/intel/iommu.h  |  1 +
> >   include/uapi/linux/iommufd.h | 12 +++++++++++-
> >   3 files changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index c7704e7efd4a..a0341a069fbf 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -2193,6 +2193,11 @@ __domain_mapping(struct dmar_domain *domain,
> unsigned long iov_pfn,
> >   	if ((prot & (DMA_PTE_READ|DMA_PTE_WRITE)) == 0)
> >   		return -EINVAL;
> >
> > +	if (!(prot & DMA_PTE_WRITE) && domain->is_nested_parent) {
> > +		pr_err_ratelimited("Read-only mapping is disallowed on the domain
> which serves as the parent in a nested configuration, due to HW errata
> (ERRATA_772415_SPR17)\n");
> > +		return -EINVAL;
> > +	}
> > +
> >   	attr = prot & (DMA_PTE_READ | DMA_PTE_WRITE | DMA_PTE_SNP);
> >   	attr |= DMA_FL_PTE_PRESENT;
> >   	if (domain->use_first_level) {
> > @@ -4101,6 +4106,9 @@ intel_iommu_domain_alloc_user(struct device *dev, u32
> flags,
> >   		domain = iommu_domain_alloc(dev->bus);
> >   		if (!domain)
> >   			return ERR_PTR(-ENOMEM);
> > +		container_of(domain,
> > +			     struct dmar_domain,
> > +			     domain)->is_nested_parent = request_nest_parent;
> 
> How about
> 		to_dmar_domain(domain)->is_nested_parent = ...;
> ?

Yes.

> 
> I would also prefer to introduce is_nested_parent_domain to the user
> domain allocation patch (patch 7/8). This field should be checked when
> allocating a nested user domain.

A ctually, no need. This should be a common check, so iommufd core already
has the check. So the parent should be a nest parent domain, otherwise already
returned in iommufd.

+	if (!parent->nest_parent)
+		return ERR_PTR(-EINVAL);

https://lore.kernel.org/linux-iommu/20231020091946.12173-8-yi.l.liu@intel.com/

> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 8f81a5c9fcc0..d3f6bc1f6590 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4121,6 +4121,8 @@ intel_iommu_domain_alloc_user(struct device *dev,
> u32 flags,
>                  return ERR_PTR(-EINVAL);
>          if (request_nest_parent)
>                  return ERR_PTR(-EINVAL);
> +       if (!to_dmar_domain(parent)->is_nested_parent)
> +               return ERR_PTR(-EINVAL);
> 
>          return intel_nested_domain_alloc(parent, user_data);
>   }
> 
> Best regards,
> baolu
Baolu Lu Oct. 23, 2023, 12:18 p.m. UTC | #4
On 2023/10/23 19:15, Liu, Yi L wrote:
>> I would also prefer to introduce is_nested_parent_domain to the user
>> domain allocation patch (patch 7/8). This field should be checked when
>> allocating a nested user domain.
> A ctually, no need. This should be a common check, so iommufd core already
> has the check. So the parent should be a nest parent domain, otherwise already
> returned in iommufd.
> 
> +	if (!parent->nest_parent)
> +		return ERR_PTR(-EINVAL);

I know this will not cause errors in the code. But since you are
introducing is_parent property in the vt-d driver. The integrity of the
property should be ensured. In this way, it will make the code more
readable and maintainable.

Best regards,
baolu