Message ID: 20210627143405.77298-1-sven@svenpeter.dev
Series: Apple M1 DART IOMMU driver
On 27.06.21 16:34, Sven Peter wrote:
>
> Apple's DART iommu uses a pagetable format that shares some
> similarities with the ones already implemented by io-pgtable.c.
> Add a new format variant to support the required differences
> so that we don't have to duplicate the pagetable handling code.
>
> Signed-off-by: Sven Peter <sven@svenpeter.dev>
> ---
>  drivers/iommu/io-pgtable-arm.c | 62 ++++++++++++++++++++++++++++++++++
>  drivers/iommu/io-pgtable.c     |  1 +
>  include/linux/io-pgtable.h     |  7 ++++
>  3 files changed, 70 insertions(+)
>
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 87def58e79b5..1dd5c45b4b5b 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -127,6 +127,9 @@
>  #define ARM_MALI_LPAE_MEMATTR_IMP_DEF	0x88ULL
>  #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC	0x8DULL
>
> +#define APPLE_DART_PTE_PROT_NO_WRITE	(1<<7)
> +#define APPLE_DART_PTE_PROT_NO_READ	(1<<8)
> +
>  /* IOPTE accessors */
>  #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
>
> @@ -381,6 +384,15 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>  {
>  	arm_lpae_iopte pte;
>
> +	if (data->iop.fmt == ARM_APPLE_DART) {
> +		pte = 0;
> +		if (!(prot & IOMMU_WRITE))
> +			pte |= APPLE_DART_PTE_PROT_NO_WRITE;
> +		if (!(prot & IOMMU_READ))
> +			pte |= APPLE_DART_PTE_PROT_NO_READ;
> +		return pte;

What about the other bits, such as shareability, XN, etc? Do they not
exist on DART? Or have they not been reverse engineered and 0s happen
to "just work"?
> +	}
> +
>  	if (data->iop.fmt == ARM_64_LPAE_S1 ||
>  	    data->iop.fmt == ARM_32_LPAE_S1) {
>  		pte = ARM_LPAE_PTE_nG;
> @@ -1043,6 +1055,51 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
>  	return NULL;
>  }
>
> +static struct io_pgtable *
> +apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
> +{
> +	struct arm_lpae_io_pgtable *data;
> +	int i;
> +
> +	if (cfg->oas > 36)
> +		return NULL;
> +
> +	data = arm_lpae_alloc_pgtable(cfg);
> +	if (!data)
> +		return NULL;
> +
> +	/*
> +	 * Apple's DART always requires three levels with the first level being
> +	 * stored in four MMIO registers. We always concatenate the first and
> +	 * second level so that we only have to setup the MMIO registers once.
> +	 * This results in an effective two level pagetable.
> +	 */
> +	if (data->start_level < 1)
> +		return NULL;
> +	if (data->start_level == 1 && data->pgd_bits > 2)
> +		return NULL;
> +	if (data->start_level > 1)
> +		data->pgd_bits = 0;
> +	data->start_level = 2;
> +	cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;

Maybe add a BUG_ON if n_ttbrs > ARRAY_SIZE(ttbr)? Or alternatively, do a
normal runtime check and bail out then.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
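The inverted permission encoding Alex is asking about (DART stores "deny" bits rather than "allow" bits) can be modelled outside the kernel. The following is a standalone user-space sketch of the ARM_APPLE_DART branch of arm_lpae_prot_to_pte() from the patch, with the IOMMU prot flag values copied from include/linux/iommu.h; it is an illustration, not the patch itself:

```c
#include <assert.h>
#include <stdint.h>

/* IOMMU API prot flags (values as in include/linux/iommu.h) */
#define IOMMU_READ	(1 << 0)
#define IOMMU_WRITE	(1 << 1)

/* DART expresses permissions as "deny" bits, per the patch above */
#define APPLE_DART_PTE_PROT_NO_WRITE	(1ULL << 7)
#define APPLE_DART_PTE_PROT_NO_READ	(1ULL << 8)

/* Standalone model of the ARM_APPLE_DART branch of arm_lpae_prot_to_pte() */
static uint64_t dart_prot_to_pte(int prot)
{
	uint64_t pte = 0;

	/* a permission that is NOT requested sets the matching deny bit */
	if (!(prot & IOMMU_WRITE))
		pte |= APPLE_DART_PTE_PROT_NO_WRITE;
	if (!(prot & IOMMU_READ))
		pte |= APPLE_DART_PTE_PROT_NO_READ;
	return pte;
}
```

A fully readable/writable mapping therefore leaves both bits clear, which is consistent with all other attribute bits (shareability, XN, ...) simply not existing in this format.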
On Mon, Jun 28, 2021, at 12:54, Alexander Graf wrote:
> On 27.06.21 16:34, Sven Peter wrote:
> >
> > Apple's DART iommu uses a pagetable format that shares some
> > similarities with the ones already implemented by io-pgtable.c.
> > Add a new format variant to support the required differences
> > so that we don't have to duplicate the pagetable handling code.
[...]
> > +	if (data->iop.fmt == ARM_APPLE_DART) {
> > +		pte = 0;
> > +		if (!(prot & IOMMU_WRITE))
> > +			pte |= APPLE_DART_PTE_PROT_NO_WRITE;
> > +		if (!(prot & IOMMU_READ))
> > +			pte |= APPLE_DART_PTE_PROT_NO_READ;
> > +		return pte;
>
> What about the other bits, such as shareability, XN, etc? Do they not
> exist on DART? Or have they not been reverse engineered and 0s happen
> to "just work"?

I'm fairly certain they don't exist (or are at least not used by XNU).

The co-processors that can run code also either use an entirely separate
iommu (e.g. the GPU) or only use DART as a "second stage" and have their
own MMU which e.g. handles XN (e.g. the SEP or AOP).
> > +static struct io_pgtable *
> > +apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
> > +{
[...]
> > +	if (data->start_level < 1)
> > +		return NULL;
> > +	if (data->start_level == 1 && data->pgd_bits > 2)
> > +		return NULL;
> > +	if (data->start_level > 1)
> > +		data->pgd_bits = 0;
> > +	data->start_level = 2;
> > +	cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;
>
> Maybe add a BUG_ON if n_ttbrs > ARRAY_SIZE(ttbr)? Or alternatively, do a
> normal runtime check and bail out then.

n_ttbrs can't actually be larger than 4 at this point already due to the
previous checks.
I can add a BUG_ON though just to make it explicit and be safe in case
those checks or the array size ever change.


Sven
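Sven's argument that n_ttbrs is already bounded by 4 follows directly from the checks in the quoted hunk. A standalone model of just that arithmetic (hypothetical user-space C, names invented for illustration, not kernel code):

```c
#include <assert.h>

/*
 * Standalone model of the start_level/pgd_bits checks in
 * apple_dart_alloc_pgtable(). Returns the resulting n_ttbrs, or -1
 * where the kernel code would bail out and return NULL.
 */
static int dart_n_ttbrs(int start_level, int pgd_bits)
{
	if (start_level < 1)		/* DART always needs three levels */
		return -1;
	if (start_level == 1 && pgd_bits > 2)
		return -1;		/* would need more than 4 TTBRs */
	if (start_level > 1)
		pgd_bits = 0;		/* single table covers everything */
	return 1 << pgd_bits;
}
```

Every path that survives leaves pgd_bits <= 2, so n_ttbrs = 1 << pgd_bits can never exceed 4, the size of the ttbr register array.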
On 29.06.21 09:37, Sven Peter wrote:
> On Mon, Jun 28, 2021, at 12:54, Alexander Graf wrote:
>> On 27.06.21 16:34, Sven Peter wrote:
>>>
>>> Apple's DART iommu uses a pagetable format that shares some
>>> similarities with the ones already implemented by io-pgtable.c.
[...]
>> What about the other bits, such as shareability, XN, etc? Do they not
>> exist on DART? Or have they not been reverse engineered and 0s happen
>> to "just work"?
>
> I'm fairly certain they don't exist (or are at least not used by XNU).
>
> The co-processors that can run code also either use an entirely separate
> iommu (e.g. the GPU) or only use DART as a "second stage" and have their
> own MMU which e.g. handles XN (e.g. the SEP or AOP).

Ok :).

[...]
>>> +	if (data->start_level < 1)
>>> +		return NULL;
>>> +	if (data->start_level == 1 && data->pgd_bits > 2)
>>> +		return NULL;
>>> +	if (data->start_level > 1)
>>> +		data->pgd_bits = 0;
>>> +	data->start_level = 2;
>>> +	cfg->apple_dart_cfg.n_ttbrs = 1 << data->pgd_bits;
>>
>> Maybe add a BUG_ON if n_ttbrs > ARRAY_SIZE(ttbr)? Or alternatively, do a
>> normal runtime check and bail out then.
>
> n_ttbrs can't actually be larger than 4 at this point already due to the
> previous checks.
> I can add a BUG_ON though just to make it explicit and be safe in case
> those checks or the array size ever change.

Ah, now I see it too. No worries then - I agree that you have all cases
covered.

Reviewed-by: Alexander Graf <graf@amazon.com>


Alex
On 2021-06-27 15:34, Sven Peter wrote:
[...]
> In the long term, I'd like to extend the dma-iommu framework itself to
> support iommu pagesizes with a larger granule than the CPU pagesize if
> that is something you agree with.

BTW this isn't something we can fully support in general. IOMMU API
users may expect this to work:

	iommu_map(domain, iova, page_to_phys(p1), PAGE_SIZE, prot);
	iommu_map(domain, iova + PAGE_SIZE, page_to_phys(p2), PAGE_SIZE, prot);

Although they do in principle have visibility of pgsize_bitmap, I still
doubt anyone is really prepared for CPU-page-aligned mappings to fail.
Even at the DMA API level you could hide *some* of it (at the cost of
effectively only having 1/4 of the usable address space), but there are
still cases like v4l2, which has a hard requirement that a page-aligned
scatterlist can be mapped into a contiguous region of DMA addresses.

> This would be important to later support the thunderbolt DARTs since I
> would be very uncomfortable to have these running in (software or
> hardware) bypass mode.

Funnily enough that's the one case that would be relatively workable,
since untrusted devices are currently subject to bounce-buffering of the
entire DMA request, so it doesn't matter so much how the bounce buffer
itself is mapped. Even with the possible future optimisation of only
bouncing the non-page-aligned start and end parts of a buffer I think it
still works (the physical alignment just has to be considered in terms
of the IOMMU granule).

Robin.
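Why does Robin's two-call iommu_map() example break when the IOMMU granule is larger than the CPU page? Each mapping request must be aligned, in iova, physical address and size, to some page size the hardware advertises in pgsize_bitmap. A toy model of that check (an assumed simplification of what the generic mapping path does, not actual kernel code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of the per-mapping alignment check an IOMMU driver applies:
 * iova, paddr and size must all be aligned to some page size advertised
 * in pgsize_bitmap. On the M1 DART that bitmap only contains 16K.
 */
static int toy_iommu_map(unsigned long pgsize_bitmap,
			 unsigned long iova, unsigned long paddr, size_t size)
{
	unsigned long pgsize;

	for (pgsize = 1UL << 31; pgsize; pgsize >>= 1) {
		if (!(pgsize_bitmap & pgsize))
			continue;
		if (!(iova % pgsize) && !(paddr % pgsize) && !(size % pgsize))
			return 0;	/* this granule fits */
	}
	return -1;	/* no supported page size works: the map fails */
}
```

With a 16K-only bitmap, a 4K mapping at a 4K-aligned (but not 16K-aligned) iova has no granule to fall back on, which is exactly the failure mode IOMMU API users are not prepared for.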
On Wed, Jul 14, 2021 at 8:21 PM Robin Murphy <robin.murphy@arm.com> wrote:
> On 2021-06-27 15:34, Sven Peter wrote:
> [...]
> > In the long term, I'd like to extend the dma-iommu framework itself to
> > support iommu pagesizes with a larger granule than the CPU pagesize if
> > that is something you agree with.
>
> BTW this isn't something we can fully support in general. IOMMU API
> users may expect this to work:
>
> 	iommu_map(domain, iova, page_to_phys(p1), PAGE_SIZE, prot);
> 	iommu_map(domain, iova + PAGE_SIZE, page_to_phys(p2), PAGE_SIZE, prot);
>
> Although they do in principle have visibility of pgsize_bitmap, I still
> doubt anyone is really prepared for CPU-page-aligned mappings to fail.
[...]

I think that was the same conclusion we had earlier: the dma-mapping
interfaces should be possible for large iotlb pages, but any driver
directly using the IOMMU API, such as VFIO, would not work.

The question is how we can best allow one but not the other.

        Arnd
On Wed, Jul 14, 2021 at 10:51:34PM +0200, Arnd Bergmann wrote:
> The question is how we can best allow one but not the other.
By only allowing the allocation of IDENTITY and DMA domains, and failing
to allocate UNMANAGED domains.
Regards,
Joerg
On Wed, Jul 14, 2021 at 07:19:50PM +0100, Robin Murphy wrote:
> Even at the DMA API level you could hide *some* of it (at the cost of
> effectively only having 1/4 of the usable address space), but there are
> still cases like where v4l2 has a hard requirement that a page-aligned
> scatterlist can be mapped into a contiguous region of DMA addresses.

Where does v4l2 make that broken assumption? Plenty of dma mapping
implementations including dma-direct do not support that. Drivers need
to call dma_get_merge_boundary() to check for that kind of behavior.
On 2021-07-16 07:24, Christoph Hellwig wrote:
> On Wed, Jul 14, 2021 at 07:19:50PM +0100, Robin Murphy wrote:
>> Even at the DMA API level you could hide *some* of it (at the cost of
>> effectively only having 1/4 of the usable address space), but there are
>> still cases like where v4l2 has a hard requirement that a page-aligned
>> scatterlist can be mapped into a contiguous region of DMA addresses.
>
> Where does v4l2 make that broken assumption? Plenty of dma mapping
> implementations including dma-direct do not support that.

See vb2_dc_get_contiguous_size() and its callers. I still remember
spending an entire work day on writing one email at the culmination of
this discussion:

https://lore.kernel.org/linux-iommu/56409B6D.5090903@arm.com/

809eac54cdd6 was framed as an efficiency improvement because it
technically was one (and something I had wanted to implement anyway),
but it was also very much to save myself from any further email debates
or customer calls about "regressing" code ported from 32-bit
platforms...

Robin.
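The vb2_dc_get_contiguous_size() check Robin points to walks a mapped scatterlist and measures how many bytes from the start are contiguous in DMA-address space; the contiguous-DMA allocator then rejects the buffer if that is less than the buffer size. A toy user-space version of that walk (an illustrative simplification, not the actual videobuf2 code):

```c
#include <assert.h>
#include <stddef.h>

/* a mapped scatterlist entry, reduced to the two fields that matter here */
struct dma_seg {
	unsigned long dma_addr;
	size_t len;
};

/*
 * Toy version of vb2_dc_get_contiguous_size(): returns how many bytes
 * from the start of the mapped list are contiguous in DMA space.
 */
static size_t contiguous_size(const struct dma_seg *segs, int nents)
{
	size_t size = 0;
	unsigned long expected = 0;
	int i;

	for (i = 0; i < nents; i++) {
		/* stop at the first gap between consecutive DMA segments */
		if (i > 0 && segs[i].dma_addr != expected)
			break;
		expected = segs[i].dma_addr + segs[i].len;
		size += segs[i].len;
	}
	return size;
}
```

This is the hard requirement in question: if the IOMMU cannot map a page-aligned scatterlist into one contiguous DMA range, the walk stops early and v4l2 fails the buffer.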