[0/8] Initial support for SMMUv3 nested translation

Message ID 0-v1-54e734311a7f+14f72-smmuv3_nesting_jgg@nvidia.com

Message

Jason Gunthorpe Aug. 6, 2024, 11:41 p.m. UTC
This brings support for the IOMMUFD ioctls:

 - IOMMU_GET_HW_INFO
 - IOMMU_HWPT_ALLOC_NEST_PARENT
 - IOMMU_DOMAIN_NESTED
 - ops->enforce_cache_coherency()

This is quite straightforward as the nested STE can just be built in the
special NESTED domain op and fed through the generic update machinery.

The design allows the user provided STE fragment to control several
aspects of the translation, including putting the STE into a "virtual
bypass" or an aborting state. This duplicates functionality available by
other means, but it allows trivially preserving the VMID in the STE as we
eventually move towards the VIOMMU owning the VMID.

Nesting support requires the system to either support S2FWB or the
stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
cache and view incoherent data; currently VFIO lacks any cache flushing
that would make this safe.

Yan has a series to add some of the needed infrastructure for VFIO cache
flushing here:

 https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/

Which may someday allow relaxing this further.

Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded
by this.

This is the first series in what will be several to complete nesting
support. At least:
 - IOMMU_RESV_SW_MSI related fixups
 - VIOMMU object support to allow ATS invalidations
 - vCMDQ hypervisor support for direct invalidation queue assignment
 - KVM pinned VMID using VIOMMU for vBTM
 - Cross instance S2 sharing
 - Virtual Machine Structure using VIOMMU (for vMPAM?)
 - Fault forwarding support through IOMMUFD's fault fd for vSVA

It is enough to allow significant amounts of qemu work to progress.

This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting

Jason Gunthorpe (5):
  vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  iommu/arm-smmu-v3: Use S2FWB when available
  iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED

Nicolin Chen (3):
  ACPI/IORT: Support CANWBS memory access flag
  iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
    arm_smmu_hw_info
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

 drivers/acpi/arm64/iort.c                   |  13 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 398 ++++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
 drivers/iommu/io-pgtable-arm.c              |  24 +-
 drivers/iommu/iommu.c                       |  10 -
 drivers/iommu/iommufd/vfio_compat.c         |   7 +-
 drivers/vfio/vfio_iommu_type1.c             |  12 +-
 include/acpi/actbl2.h                       |   1 +
 include/linux/io-pgtable.h                  |   2 +
 include/linux/iommu.h                       |  54 ++-
 include/uapi/linux/iommufd.h                |  79 ++++
 include/uapi/linux/vfio.h                   |   2 +-
 13 files changed, 572 insertions(+), 73 deletions(-)


base-commit: e5e288d94186b266b062b3e44c82c285dfe68712

Comments

Shameerali Kolothum Thodi Aug. 9, 2024, 2:26 p.m. UTC | #1
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 7, 2024 12:41 AM
> To: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>
> Cc: Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory and deny cachable access otherwise.
> 
> This is not especially meaningful for simple S2 domains, it apparently
> doesn't even force PCI no-snoop access to be coherent.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> cache. Consistent with KVM we wish to deny the guest the ability to become
> incoherent with cached memory the hypervisor believes is cachable so we
> don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  6 ++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  drivers/iommu/io-pgtable-arm.c              | 24 +++++++++++++++++----
>  include/linux/io-pgtable.h                  |  2 ++
>  4 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 531125f231b662..7fe1e27d11586c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct
> arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
> 
> +	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
>  	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
>  		target->data[1] |=
> cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> 
> STRTAB_STE_1_SHCFG_INCOMING));
> @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct
> arm_smmu_domain *smmu_domain,
>  		pgtbl_cfg.oas = smmu->oas;
>  		fmt = ARM_64_LPAE_S2;
>  		finalise_stage_fn = arm_smmu_domain_finalise_s2;
> +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;

This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks check.
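
To make the concern concrete, here is a minimal standalone sketch. It assumes the stage-2 allocator rejects any quirk bit it does not recognise; that is an assumption about arm_64_lpae_alloc_pgtable_s2(), not something shown in the quoted diff. The sk_ names are hypothetical, and only the BIT(8) value is taken from the patch.

```c
#include <stdbool.h>

/* BIT(8), the value the patch assigns to IO_PGTABLE_QUIRK_ARM_S2FWB. */
#define SK_QUIRK_ARM_S2FWB (1ul << 8)

/*
 * Hypothetical stage-2 quirk filter: accept the request only if every
 * quirk bit set is one the stage-2 format knows how to handle.
 */
static bool sk_s2_quirks_ok(unsigned long quirks)
{
	return (quirks & ~SK_QUIRK_ARM_S2FWB) == 0;
}
```

With a filter of this shape, a request carrying only the S2FWB quirk is accepted, while a stricter "no quirks at all" check would fail the allocation as soon as the SMMU driver sets the new bit.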

Thanks,
Shameer
Jason Gunthorpe Aug. 9, 2024, 3:12 p.m. UTC | #2
On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> 
> This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks check.

Yep, fixed. I was hoping you had HW to test this..

Thanks,
Jason
Shameerali Kolothum Thodi Aug. 15, 2024, 4:14 p.m. UTC | #3
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 9, 2024 4:12 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev
> Subject: Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> >
> > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks
> check.
> 
> Yep, fixed I was hoping you had HW to test this..

Let me see if I can get hold of a test setup that supports S2FWB.

I do have another concern about the hardware we have that doesn't
support S2FWB but can claim CANWBS. The problem is that a BIOS update is
not a well-liked or feasible solution for already deployed systems. But we
can probably add an option/quirk in the SMMUv3 driver for those platforms
(based on ACPI_IORT_SMMU_V3_HISILICON_HI161X). I hope this is fine.

Thanks,
Shameer
Jason Gunthorpe Aug. 15, 2024, 4:18 p.m. UTC | #4
On Thu, Aug 15, 2024 at 04:14:22PM +0000, Shameerali Kolothum Thodi wrote:
> > On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > > > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > > > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> > >
> > > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks
> > check.
> > 
> > Yep, fixed I was hoping you had HW to test this..
> 
> Let me see if I can get hold of a test setup that supports S2FWB.

Thanks!
 
> I do have another concern with respect to the hardware we have which doesn't
> support S2FWB, but those can claim CANWBS. The problem is, BIOS update is not
> a very liked/feasible solution to already deployed ones. But we can probably add 
> an option/quirk in SMMUv3 driver for those platforms(based on 
> ACPI_IORT_SMMU_V3_HISILICON_HI161X).  I hope this is fine.

I don't have an issue with doing that. If you can reliably identify
the platform in some way, a kernel quirk seems reasonable.

Thanks,
Jason
Mostafa Saleh Aug. 20, 2024, 8:20 a.m. UTC | #5
Hi Jason,

On Tue, Aug 06, 2024 at 08:41:13PM -0300, Jason Gunthorpe wrote:
> This brings support for the IOMMFD ioctls:
> 
>  - IOMMU_GET_HW_INFO
>  - IOMMU_HWPT_ALLOC_NEST_PARENT
>  - IOMMU_DOMAIN_NESTED
>  - ops->enforce_cache_coherency()
> 
> This is quite straightforward as the nested STE can just be built in the
> special NESTED domain op and fed through the generic update machinery.
> 
> The design allows the user provided STE fragment to control several
> aspects of the translation, including putting the STE into a "virtual
> bypass" or a aborting state. This duplicates functionality available by
> other means, but it allows trivially preserving the VMID in the STE as we
> eventually move towards the VIOMMU owning the VMID.
> 
> Nesting support requires the system to either support S2FWB or the
> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> cache and view incoherent data, currently VFIO lacks any cache flushing
> that would make this safe.
> 
> Yan has a series to add some of the needed infrastructure for VFIO cache
> flushing here:
> 
>  https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> 
> Which may someday allow relaxing this further.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> this.
> 
> This is the first series in what will be several to complete nesting
> support. At least:
>  - IOMMU_RESV_SW_MSI related fixups
>  - VIOMMU object support to allow ATS invalidations
>  - vCMDQ hypervisor support for direct invalidation queue assignment
>  - KVM pinned VMID using VIOMMU for vBTM
>  - Cross instance S2 sharing
>  - Virtual Machine Structure using VIOMMU (for vMPAM?)
>  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> 
> It is enough to allow significant amounts of qemu work to progress.
> 
Are there any qemu patches to test this?
I am confused by some of the user space bits and that would help.

Thanks,
Mostafa

> This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> 
> Jason Gunthorpe (5):
>   vfio: Remove VFIO_TYPE1_NESTING_IOMMU
>   iommu/arm-smmu-v3: Use S2FWB when available
>   iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
>   iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
>   iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
> 
> Nicolin Chen (3):
>   ACPI/IORT: Support CANWBS memory access flag
>   iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
>     arm_smmu_hw_info
>   iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
> 
>  drivers/acpi/arm64/iort.c                   |  13 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 398 ++++++++++++++++++--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 ++
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
>  drivers/iommu/io-pgtable-arm.c              |  24 +-
>  drivers/iommu/iommu.c                       |  10 -
>  drivers/iommu/iommufd/vfio_compat.c         |   7 +-
>  drivers/vfio/vfio_iommu_type1.c             |  12 +-
>  include/acpi/actbl2.h                       |   1 +
>  include/linux/io-pgtable.h                  |   2 +
>  include/linux/iommu.h                       |  54 ++-
>  include/uapi/linux/iommufd.h                |  79 ++++
>  include/uapi/linux/vfio.h                   |   2 +-
>  13 files changed, 572 insertions(+), 73 deletions(-)
> 
> 
> base-commit: e5e288d94186b266b062b3e44c82c285dfe68712
> -- 
> 2.46.0
>
Mostafa Saleh Aug. 20, 2024, 8:30 a.m. UTC | #6
Hi Jason,

On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory and deny cachable access otherwise.
> 
> This is not especially meaningful for simple S2 domains, it apparently
> doesn't even force PCI no-snoop access to be coherent.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> cache. Consistent with KVM we wish to deny the guest the ability to become
> incoherent with cached memory the hypervisor believes is cachable so we
> don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.

I have been looking into this recently from the KVM side, as it will
use FWB for the CPU stage-2 unconditionally for guests (if supported).
However, that breaks for non-coherent devices when assigned, and
limiting assigned devices to be coherent seems too restrictive.
I have been looking into ways to notify KVM from VFIO as early as
possible so it can configure the page table properly.

But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
is DMA coherent?

Thanks,
Mostafa

> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  6 ++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  drivers/iommu/io-pgtable-arm.c              | 24 +++++++++++++++++----
>  include/linux/io-pgtable.h                  |  2 ++
>  4 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 531125f231b662..7fe1e27d11586c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
>  
> +	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
>  	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
>  		target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>  							  STRTAB_STE_1_SHCFG_INCOMING));
> @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
>  		pgtbl_cfg.oas = smmu->oas;
>  		fmt = ARM_64_LPAE_S2;
>  		finalise_stage_fn = arm_smmu_domain_finalise_s2;
> +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
>  		break;
>  	default:
>  		return -EINVAL;
> @@ -4189,6 +4193,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  
>  	/* IDR3 */
>  	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	if (FIELD_GET(IDR3_FWB, reg))
> +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
>  	if (FIELD_GET(IDR3_RIL, reg))
>  		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
>  
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 8851a7abb5f0f3..7e8d2f36faebf3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -55,6 +55,7 @@
>  #define IDR1_SIDSIZE			GENMASK(5, 0)
>  
>  #define ARM_SMMU_IDR3			0xc
> +#define IDR3_FWB			(1 << 8)
>  #define IDR3_RIL			(1 << 10)
>  
>  #define ARM_SMMU_IDR5			0x14
> @@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>  #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
>  
>  #define STRTAB_STE_1_S1STALLD		(1UL << 27)
> +#define STRTAB_STE_1_S2FWB		(1UL << 25)
>  
>  #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
>  #define STRTAB_STE_1_EATS_ABT		0UL
> @@ -700,6 +702,7 @@ struct arm_smmu_device {
>  #define ARM_SMMU_FEAT_ATTR_TYPES_OVR	(1 << 20)
>  #define ARM_SMMU_FEAT_HA		(1 << 21)
>  #define ARM_SMMU_FEAT_HD		(1 << 22)
> +#define ARM_SMMU_FEAT_S2FWB		(1 << 23)
>  	u32				features;
>  
>  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f5d9fd1f45bf49..62bbb6037e1686 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -106,6 +106,18 @@
>  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
>  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +/*
> + * For !FWB these code to:
> + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> + *         Permit S1 to override
> + *  0101 = Normal Non-cachable / Inner Non-cachable
> + *  0001 = Device / Device-nGnRE
> + * For S2FWB these code:
> + *  0110 Force Normal Write Back
> + *  0101 Normal* is forced Normal-NC, Device unchanged
> + *  0001 Force Device-nGnRE
> + */
> +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
> @@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>  	 */
>  	if (data->iop.fmt == ARM_64_LPAE_S2 ||
>  	    data->iop.fmt == ARM_32_LPAE_S2) {
> -		if (prot & IOMMU_MMIO)
> +		if (prot & IOMMU_MMIO) {
>  			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> -		else if (prot & IOMMU_CACHE)
> -			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> -		else
> +		} else if (prot & IOMMU_CACHE) {
> +			if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB)
> +				pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB;
> +			else
> +				pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +		} else {
>  			pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +		}
>  	} else {
>  		if (prot & IOMMU_MMIO)
>  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index f9a81761bfceda..aff9b020b6dcc7 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>  	 *	attributes set in the TCR for a non-coherent page-table walker.
>  	 *
>  	 * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
> +	 * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits
>  	 */
>  	#define IO_PGTABLE_QUIRK_ARM_NS			BIT(0)
>  	#define IO_PGTABLE_QUIRK_NO_PERMS		BIT(1)
> @@ -95,6 +96,7 @@ struct io_pgtable_cfg {
>  	#define IO_PGTABLE_QUIRK_ARM_TTBR1		BIT(5)
>  	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA		BIT(6)
>  	#define IO_PGTABLE_QUIRK_ARM_HD			BIT(7)
> +	#define IO_PGTABLE_QUIRK_ARM_S2FWB		BIT(8)
>  	unsigned long			quirks;
>  	unsigned long			pgsize_bitmap;
>  	unsigned int			ias;
> -- 
> 2.46.0
>
Jason Gunthorpe Aug. 20, 2024, 12:01 p.m. UTC | #7
On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> Hi Jason,
> 
> On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > 
> > This is not especially meaningful for simple S2 domains, it apparently
> > doesn't even force PCI no-snoop access to be coherent.
> > 
> > However, when used with a nested S1, FWB has the effect of preventing the
> > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > cache. Consistent with KVM we wish to deny the guest the ability to become
> > incoherent with cached memory the hypervisor believes is cachable so we
> > don't have to flush it.
> > 
> > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > mappings.
> 
> I have been looking into this recently from the KVM side as it will
> use FWB for the CPU stage-2 unconditionally for guests(if supported),
> however that breaks for non-coherent devices when assigned, and
> limiting assigned devices to be coherent seems too restrictive.

KVM's CPU S2 doesn't care about non-DMA-coherent devices though? That
concept is only relevant to the SMMU.

The issue on the KVM side is you can't put device MMIO into the CPU S2
using S2FWB and Normal Cachable, it will break the MMIO programming
model. That isn't "coherency" though.

It has to be Normal-NC, which this patch does:

https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com

> But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
> is DMA coherent?

Sure, that seems to be a weird corner. Let's add this:

@@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
        /* IDR3 */
        reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
-       if (FIELD_GET(IDR3_FWB, reg))
+       /*
+        * If for some reason the HW does not support DMA coherency then using
+        * S2FWB won't work. This will also disable nesting support.
+        */
+       if (FIELD_GET(IDR3_FWB, reg) &&
+           (smmu->features & ARM_SMMU_FEAT_COHERENCY))
                smmu->features |= ARM_SMMU_FEAT_S2FWB;
        if (FIELD_GET(IDR3_RIL, reg))
                smmu->features |= ARM_SMMU_FEAT_RANGE_INV;

IMHO it would be weird to make HW that has S2FWB but not coherency,
but sure let's check it.
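
The gating in the hunk above can be sketched as a standalone function. This is only an illustration: the sk_ names are hypothetical, the IDR3 and S2FWB bit positions are copied from the patch, and the bit used for the coherency feature is an assumed value.

```c
#include <stdint.h>

#define SK_IDR3_FWB       (1u << 8)  /* IDR3_FWB, from the patch */
#define SK_FEAT_COHERENCY (1u << 8)  /* assumed value, for illustration */
#define SK_FEAT_S2FWB     (1u << 23) /* ARM_SMMU_FEAT_S2FWB, from the patch */

/*
 * Advertise S2FWB only when the hardware reports FWB in IDR3 and the
 * SMMU was already found to be coherent. Returns the updated feature
 * mask; all other feature bits pass through unchanged.
 */
static uint32_t sk_probe_s2fwb(uint32_t idr3, uint32_t features)
{
	if ((idr3 & SK_IDR3_FWB) && (features & SK_FEAT_COHERENCY))
		features |= SK_FEAT_S2FWB;
	return features;
}
```

An SMMU that reports FWB but was not marked coherent keeps its feature mask unchanged, which per the comment in the hunk also disables nesting.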

Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
so we won't even get a chance to ask for a S2 domain.

Jason
Jason Gunthorpe Aug. 20, 2024, 1:01 p.m. UTC | #8
On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:

> Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> so we won't even get a chance to ask for a S2 domain.

And I should also say that without iommufd something like the DMA API
could select a S2 with S2FWB enabled, but all that does is change the
encoding of the memattr bits. Requests for !IOMMU_CACHE will still map
to non-cacheable IO PTEs like before - just with a different encoding.

The only thing at issue is nesting which will end up in the guest as
forced cachable - however since VFIO doesn't support non-DMA-coherent
devices at all this is not a problem.
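
As a rough illustration of the encoding point, here is a user-space sketch that mirrors the stage-2 branch of arm_lpae_prot_to_pte() from patch 2. The MemAttr values and the shift are taken from that patch; the sk_ names and the protection flag values are hypothetical stand-ins for IOMMU_CACHE and IOMMU_MMIO.

```c
#include <stdint.h>

/* Hypothetical stand-ins for IOMMU_CACHE / IOMMU_MMIO; values are illustrative. */
#define SK_PROT_CACHE (1u << 2)
#define SK_PROT_MMIO  (1u << 4)

/* MemAttr values placed at PTE bits [5:2], as defined in the patch. */
#define SK_MEMATTR_FWB_WB (UINT64_C(0x6) << 2) /* S2FWB: force Normal Write Back */
#define SK_MEMATTR_OIWB   (UINT64_C(0xf) << 2) /* !FWB: Outer/Inner Write Back */
#define SK_MEMATTR_NC     (UINT64_C(0x5) << 2) /* Normal Non-cachable */
#define SK_MEMATTR_DEV    (UINT64_C(0x1) << 2) /* Device-nGnRE */

/* Pick the stage-2 MemAttr bits for a mapping request. */
static uint64_t sk_s2_memattr(unsigned int prot, int s2fwb)
{
	if (prot & SK_PROT_MMIO)
		return SK_MEMATTR_DEV;
	if (prot & SK_PROT_CACHE)
		return s2fwb ? SK_MEMATTR_FWB_WB : SK_MEMATTR_OIWB;
	return SK_MEMATTR_NC;
}
```

In this sketch only the cachable case selects a different value when S2FWB is on; a request without the cache flag still ends up with the non-cachable attribute either way.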

Jason
Nicolin Chen Aug. 20, 2024, 3:24 p.m. UTC | #9
On Tue, Aug 20, 2024 at 08:20:32AM +0000, Mostafa Saleh wrote:
> > This is the first series in what will be several to complete nesting
> > support. At least:
> >  - IOMMU_RESV_SW_MSI related fixups
> >  - VIOMMU object support to allow ATS invalidations
> >  - vCMDQ hypervisor support for direct invalidation queue assignment
> >  - KVM pinned VMID using VIOMMU for vBTM
> >  - Cross instance S2 sharing
> >  - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >
> > It is enough to allow significant amounts of qemu work to progress.
> >

> Are there any qemu patches to tests this?
> As I am confused with some of the user space bits and that would help.

I have the qemu patches, but I am working through a backlog to keep them
updated, and don't have a branch that exactly fits this series for testing.

I collected a few remarks from Jason regarding the VIOMMU series,
and I am reworking it. I plan to post a testable QEMU branch
with the next VIOMMU version. Will CC you and more folks.

Thanks
Nicolin
Mostafa Saleh Aug. 20, 2024, 7:52 p.m. UTC | #10
On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > Hi Jason,
> > 
> > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > 
> > > This is not especially meaningful for simple S2 domains, it apparently
> > > doesn't even force PCI no-snoop access to be coherent.
> > > 
> > > However, when used with a nested S1, FWB has the effect of preventing the
> > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > incoherent with cached memory the hypervisor believes is cachable so we
> > > don't have to flush it.
> > > 
> > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > mappings.
> > 
> > I have been looking into this recently from the KVM side as it will
> > use FWB for the CPU stage-2 unconditionally for guests(if supported),
> > however that breaks for non-coherent devices when assigned, and
> > limiting assigned devices to be coherent seems too restrictive.
> 
> kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> concept is only relevant to the SMMU.
> 

Why not? That would be a problem if a device is not DMA coherent,
and the VM knows that and maps its DMA memory as non-cacheable.
But it would be overridden by FWB in stage-2 to be cacheable,
which would lead to coherency issues.

> The issue on the KVM side is you can't put device MMIO into the CPU S2
> using S2FWB and Normal Cachable, it will break the MMIO programming
> model. That isn't "coherency" though.
> 
> It has to be Normal-NC, which this patch does:
> 
> https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com

Yes, that also breaks (although I think this is an easier problem to
solve)

> 
> > But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
> > is DMA coherent?
> 
> Sure, that seems to be a weird corner. Lets add this:
> 
> @@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  
>         /* IDR3 */
>         reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> -       if (FIELD_GET(IDR3_FWB, reg))
> +       /*
> +        * If for some reason the HW does not support DMA coherency then using
> +        * S2FWB won't work. This will also disable nesting support.
> +        */
> +       if (FIELD_GET(IDR3_FWB, reg) &&
> +           (smmu->features & ARM_SMMU_FEAT_COHERENCY))
>                 smmu->features |= ARM_SMMU_FEAT_S2FWB;
>         if (FIELD_GET(IDR3_RIL, reg))
>                 smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
> 
> IMHO it would be weird to make HW that has S2FWB but not coherency,
> but sure let's check it.
> 
What I mean is the master itself not the SMMU (the SID basically),
so in that case the STE shouldn’t have FWB enabled.

> Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> so we won't even get a chance to ask for a S2 domain.

Oh, I think that is only for the SMMU, not for the master, the
SMMU can be coherent (for pte, ste …) but the master can still be
non coherent. Looking at how VFIO uses it, that seems to be a bug?

Thanks,
Mostafa

> 
> Jason
Jason Gunthorpe Aug. 20, 2024, 8:21 p.m. UTC | #11
On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote:
> On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > > Hi Jason,
> > > 
> > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > > 
> > > > This is not especially meaningful for simple S2 domains, it apparently
> > > > doesn't even force PCI no-snoop access to be coherent.
> > > > 
> > > > However, when used with a nested S1, FWB has the effect of preventing the
> > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > > incoherent with cached memory the hypervisor believes is cachable so we
> > > > don't have to flush it.
> > > > 
> > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > > mappings.
> > > 
> > > I have been looking into this recently from the KVM side as it will
> > > use FWB for the CPU stage-2 unconditionally for guests(if supported),
> > > however that breaks for non-coherent devices when assigned, and
> > > limiting assigned devices to be coherent seems too restrictive.
> > 
> > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> > concept is only relevant to the SMMU.
>
> Why not? That would be a problem if a device is not dma coherent,
> and the VM knows that and maps it’s DMA memory as non cacheable.
> But it would be overridden by FWB in stage-2 to be cacheable,
> it would lead to coherency issues.

Oh, from that perspective yes, but the entire point of S2FWB is that
VMs cannot create non-coherent access, so it is a bit nonsensical to ask
for S2FWB and also try to assign a non-DMA-coherent device.

> Yes, that also breaks (although I think this is an easier problem to
> solve)

Well, it is easy to solve, just don't use S2FWB and manually flush the
caches before the hypervisor touches any memory. :)

> What I mean is the master itself not the SMMU (the SID basically),
> so in that case the STE shouldn’t have FWB enabled.

That doesn't matter, those cases will not pass in IOMMU_CACHE and they
will work fine with S2FWB turned on.

> > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> > so we won't even get a chance to ask for a S2 domain.
> 
> Oh, I think that is only for the SMMU, not for the master, the
> SMMU can be coherent (for pte, ste …) but the master can still be
> non coherent. Looking at how VFIO uses it, that seems to be a bug?

If there are mixes of SMMU feature and dev_is_dma_coherent() then it
would be a bug yes..

I recall we started out trying to use dev_is_dma_coherent() but
Christoph explained it doesn't work that generally:

https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/

Seems we sort of gave up on it, too complicated. Robin had a nice
observation of the complexity:

    Disregarding the complete disaster of PCIe No Snoop on Arm-Based 
    systems, there's the more interesting effectively-opposite scenario 
    where an SMMU bridges non-coherent devices to a coherent interconnect. 
    It's not something we take advantage of yet in Linux, and it can only be 
    properly described in ACPI, but there do exist situations where 
    IOMMU_CACHE is capable of making the device's traffic snoop, but 
    dev_is_dma_coherent() - and device_get_dma_attr() for external users - 
    would still say non-coherent because they can't assume that the SMMU is 
    enabled and programmed in just the right way.

Anyhow, for the purposes of KVM and VFIO, devices that don't work with
IOMMU_CACHE are not allowed. From an API perspective
IOMMU_CAP_CACHE_COHERENCY is supposed to report whether the struct device
can use IOMMU_CACHE.

The corner case where we have an ARM_SMMU_FEAT_COHERENCY SMMU but
somehow specific devices don't support IOMMU_CACHE is not properly
reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that,
and we've been ignoring it for a long time now :)

Jason
Mostafa Saleh Aug. 21, 2024, 9:53 a.m. UTC | #12
On Tue, Aug 20, 2024 at 05:21:38PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote:
> > On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > > > Hi Jason,
> > > > 
> > > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > > > 
> > > > > This is not especially meaningful for simple S2 domains, it apparently
> > > > > doesn't even force PCI no-snoop access to be coherent.
> > > > > 
> > > > > However, when used with a nested S1, FWB has the effect of preventing the
> > > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > > > incoherent with cached memory the hypervisor believes is cachable so we
> > > > > don't have to flush it.
> > > > > 
> > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > > > mappings.
> > > > 
> > > > I have been looking into this recently from the KVM side as it will
> > > > use FWB for the CPU stage-2 unconditionally for guests(if supported),
> > > > however that breaks for non-coherent devices when assigned, and
> > > > limiting assigned devices to be coherent seems too restrictive.
> > > 
> > > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> > > concept is only relevant to the SMMU.
> >
> > Why not? That would be a problem if a device is not dma coherent,
> > and the VM knows that and maps its DMA memory as non-cacheable.
> > But it would be overridden by FWB in stage-2 to be cacheable,
> > it would lead to coherency issues.
> 
> Oh, from that perspective yes, but the entire point of S2FWB is that
> VM's can not create non-coherent access so it is a bit nonsense to ask
> for both S2FWB and try to assign a non-DMA coherent device.

Yes, but KVM sets FWB unconditionally and would use cacheable mapping
for stage-2, and I expect the same for the nested SMMU.

> 
> > Yes, that also breaks (although I think this is an easier problem to
> > solve)
> 
> Well, it is easy to solve, just don't use S2FWB and manually flush the
> caches before the hypervisor touches any memory. :)

Yes, although that means virtualized devices would have worse
performance :/ but I guess there is nothing more to do here.

I have some ideas about that, I can send patches to the kvm list
as an RFC.

> 
> > What I mean is the master itself not the SMMU (the SID basically),
> > so in that case the STE shouldn’t have FWB enabled.
> 
> That doesn't matter, those cases will not pass in IOMMU_CACHE and they
> will work fine with S2FWB turned on.
> 

But that won’t be the case in nested? Otherwise why do we use FWB in the
first place?

> > > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> > > so we won't even get a chance to ask for a S2 domain.
> > 
> > Oh, I think that is only for the SMMU, not for the master, the
> > SMMU can be coherent (for pte, ste …) but the master can still be
> > non coherent. Looking at how VFIO uses it, that seems to be a bug?
> 
> If there are mixes of SMMU feature and dev_is_dma_coherent() then it
> would be a bug yes..
> 

I think there is a bug, I was able to assign a “non-coherent” device with
VFIO with no issues, and it allows it as long as the SMMU is coherent.

> I recall we started out trying to use dev_is_dma_coherent() but
> Christoph explained it doesn't work that generally:
> 
> https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/
> 
> Seems we sort of gave up on it, too complicated. Robin had a nice
> observation of the complexity:
> 
>     Disregarding the complete disaster of PCIe No Snoop on Arm-Based 
>     systems, there's the more interesting effectively-opposite scenario 
>     where an SMMU bridges non-coherent devices to a coherent interconnect. 
>     It's not something we take advantage of yet in Linux, and it can only be 
>     properly described in ACPI, but there do exist situations where 
>     IOMMU_CACHE is capable of making the device's traffic snoop, but 
>     dev_is_dma_coherent() - and device_get_dma_attr() for external users - 
>     would still say non-coherent because they can't assume that the SMMU is 
>     enabled and programmed in just the right way.
> 
> Anyhow, for the purposes of KVM and VFIO, devices that don't work with
> IOMMU_CACHE are not allowed. From an API perspective
> IOMMU_CAP_CACHE_COHERENCY is supposed to return if the struct device
> can use IOMMU_CACHE.
> 
> The corner case where we have a ARM_SMMU_FEAT_COHERENCY SMMU but
> somehow specific devices don't support IOMMU_CACHE is not properly
> reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that,
> and we've been ignoring it for a long time now :)

Thanks a lot for the extra context!

Maybe the SMMUv3 .capable should be changed to check if the device is
coherent (instead of using dev_is_dma_coherent(), it can use lower-level
functions from the supported buses).

Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can
support it but the device is still not coherent.

Thanks,
Mostafa

> 
> Jason
Jason Gunthorpe Aug. 21, 2024, 12:06 p.m. UTC | #13
On Wed, Aug 21, 2024 at 09:53:33AM +0000, Mostafa Saleh wrote:
> > Oh, from that perspective yes, but the entire point of S2FWB is that
> > VM's can not create non-coherent access so it is a bit nonsense to ask
> > for both S2FWB and try to assign a non-DMA coherent device.
> 
> Yes, but KVM sets FWB unconditionally and would use cacheable mapping
> for stage-2, and I expect the same for the nested SMMU.

Yes, you'd need some kind of handshake like Intel built for their GPU
cache incoherence to turn that off.

> > > What I mean is the master itself not the SMMU (the SID basically),
> > > so in that case the STE shouldn’t have FWB enabled.
> > 
> > That doesn't matter, those cases will not pass in IOMMU_CACHE and they
> > will work fine with S2FWB turned on.
> 
> But that won’t be the case in nested? Otherwise why we use FWB in the
> first place.

Right, without KVM support for guest cacheability selection and cache
flushing in VFIO, it is infeasible to allow non-coherent devices. It is a
medium-sized problem if someone wants to tackle it.

> Maybe the SMMUv3 .capable, should be changed to check if the device is
> coherent (instead of using dev_is_dma_coherent, it can use lower level
> functions from the supported buses)

That would be the fix I expect. Either SMMUv3 does it, or the core
code adds it on top in the .capable wrapper. It makes sense to me that
the iommu driver should be aware of per-master coherence capability.
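
A rough sketch of that check, written as a standalone C model rather
than real driver code. Everything in it is hypothetical: struct
model_smmu, struct model_master and model_capable_cache_coherency() are
invented names, and how the per-master coherence bit would actually be
discovered (firmware tables, bus code) is left open:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the SMMU instance; not the driver's struct. */
struct model_smmu {
	bool feat_coherency;	/* models ARM_SMMU_FEAT_COHERENCY */
};

/* Hypothetical stand-in for one master (one SID) behind that SMMU. */
struct model_master {
	const struct model_smmu *smmu;
	bool master_coherent;	/* assumed to come from firmware or bus code */
};

/*
 * Models what IOMMU_CAP_CACHE_COHERENCY is supposed to mean: this
 * particular device can use IOMMU_CACHE with no cache maintenance.
 * A coherent SMMU on its own is not sufficient.
 */
static inline bool
model_capable_cache_coherency(const struct model_master *master)
{
	if (!master->smmu->feat_coherency)
		return false;
	return master->master_coherent;
}

/* Self-check covering the corner case: coherent SMMU, non-coherent master. */
static inline void model_capable_selftest(void)
{
	struct model_smmu smmu = { .feat_coherency = true };
	struct model_master good = { .smmu = &smmu, .master_coherent = true };
	struct model_master bad = { .smmu = &smmu, .master_coherent = false };

	assert(model_capable_cache_coherency(&good));
	assert(!model_capable_cache_coherency(&bad));
}
```

In this model a caller like VFIO would refuse the "bad" master even
though the SMMU itself is coherent, which is the case that slips
through when only the SMMU feature is checked.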

> Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can
> support it but the device is still not coherent.

IOMMU_CACHE is defined as requiring no cache maintenance on that
memory.

If specific devices can't guarantee that then IOMMU_CACHE should not
be used on those devices and IOMMU_CAP_CACHE_COHERENCY for that device
should be false.

That is what I mean by support.

Anyhow, I'm going to continue to leave this problem alone for
nesting. Nothing gets worse by adding nesting on top of this. Even if
we wrongly permit VFIO to open non-coherent devices they won't
actually work correctly (VFIO forces IOMMU_CACHE and S2FWB). Most
likely anything trying to use them will just crash/malfunction due to
missing cache flushing.
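
To illustrate the attribute behaviour behind that, here is a
deliberately simplified C model. It is a sketch under stated
assumptions, not the architecture's full MemAttr combining rules:
model_dma_is_cacheable() and its boolean parameters are invented, and
memory types are reduced to "cacheable" versus "not cacheable". With
S2FWB the S2 IOPTE alone decides; without it the guest's S1 choice can
downgrade the access:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the final cacheability of a nested DMA access.
 *
 * s2fwb:          S2FWB is supported and enabled for the S2 domain
 * s2_iommu_cache: the S2 mapping was created with IOMMU_CACHE
 * s1_cacheable:   the guest-owned S1 asked for a cacheable attribute
 */
static inline bool model_dma_is_cacheable(bool s2fwb, bool s2_iommu_cache,
					  bool s1_cacheable)
{
	/* With FWB the S2 entry forces the result and S1 cannot weaken it. */
	if (s2fwb)
		return s2_iommu_cache;

	/* Without FWB the more restrictive of the two stages wins. */
	return s1_cacheable && s2_iommu_cache;
}

/* Self-check: only the non-FWB case lets the guest bypass the cache. */
static inline void model_fwb_selftest(void)
{
	/* FWB + IOMMU_CACHE: guest asks for non-cacheable, still cacheable. */
	assert(model_dma_is_cacheable(true, true, false));
	/* No FWB: the same guest request makes the access non-cacheable. */
	assert(!model_dma_is_cacheable(false, true, false));
}
```

The second assertion is the situation nesting wants to avoid: the
hypervisor believes the memory is cachable and does not flush it, while
the guest has steered DMA around the cache.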

Jason