
[0/7] add non-strict mode support for arm-smmu-v3

Message ID 1527752569-18020-1-git-send-email-thunder.leizhen@huawei.com

Message

Zhen Lei May 31, 2018, 7:42 a.m. UTC
In general, an IOMMU unmap operation follows the steps below:
1. remove the mapping from the page table for the specified IOVA range
2. execute a TLBI command to invalidate the mapping cached in the TLB
3. wait for the TLBI operation to finish
4. free the IOVA resource
5. free the physical memory resource

This can become a problem when unmaps are very frequent: the combination
of TLBI and wait operations consumes a lot of time. A feasible method is
to defer the TLBI and IOVA-free operations; once a certain number has
accumulated, or a specified time has elapsed, execute a single tlbi_all
command to clean up the TLB, then free the batched IOVAs. We call this
non-strict mode.

But it must be noted that, although the mapping has already been removed
from the page table, it may still exist in the TLB, and the freed
physical memory may be reused by others. An attacker could therefore
keep accessing memory through the just-freed IOVA, to obtain sensitive
data or corrupt memory. So VFIO should always use strict mode.

One might consider deferring the physical memory free as well, which
would preserve strict-mode semantics. But in the map_sg cases the memory
allocation is not controlled by the IOMMU APIs, so this is not
enforceable.

Fortunately, Intel and AMD have already implemented non-strict mode and
placed the queue_iova() operation in the common file dma-iommu.c, and my
work is based on that. The difference is that the arm-smmu-v3 driver
calls the common IOMMU APIs to unmap, while the Intel and AMD IOMMU
drivers do not.

Below is the performance data of strict vs non-strict for an NVMe device:
Random Read  IOPS: 146K (strict) vs 573K (non-strict)
Random Write IOPS: 143K (strict) vs 513K (non-strict)


Zhen Lei (7):
  iommu/dma: fix trival coding style mistake
  iommu/arm-smmu-v3: fix the implementation of flush_iotlb_all hook
  iommu: prepare for the non-strict mode support
  iommu/amd: make sure TLB to be flushed before IOVA freed
  iommu/dma: add support for non-strict mode
  iommu/io-pgtable-arm: add support for non-strict mode
  iommu/arm-smmu-v3: add support for non-strict mode

 drivers/iommu/amd_iommu.c          |  2 +-
 drivers/iommu/arm-smmu-v3.c        | 16 ++++++++++++---
 drivers/iommu/arm-smmu.c           |  2 +-
 drivers/iommu/dma-iommu.c          | 41 ++++++++++++++++++++++++++++++--------
 drivers/iommu/io-pgtable-arm-v7s.c |  6 +++---
 drivers/iommu/io-pgtable-arm.c     | 28 ++++++++++++++------------
 drivers/iommu/io-pgtable.h         |  2 +-
 drivers/iommu/ipmmu-vmsa.c         |  2 +-
 drivers/iommu/msm_iommu.c          |  2 +-
 drivers/iommu/mtk_iommu.c          |  2 +-
 drivers/iommu/qcom_iommu.c         |  2 +-
 include/linux/iommu.h              |  5 +++++
 12 files changed, 76 insertions(+), 34 deletions(-)

-- 
1.8.3

Comments

Robin Murphy May 31, 2018, 11:24 a.m. UTC | #1
On 31/05/18 08:42, Zhen Lei wrote:
> In general, an IOMMU unmap operation follows the steps below:
> 1. remove the mapping from the page table for the specified IOVA range
> 2. execute a TLBI command to invalidate the mapping cached in the TLB
> 3. wait for the TLBI operation to finish
> 4. free the IOVA resource
> 5. free the physical memory resource
> 
> This can become a problem when unmaps are very frequent: the combination
> of TLBI and wait operations consumes a lot of time. A feasible method is
> to defer the TLBI and IOVA-free operations; once a certain number has
> accumulated, or a specified time has elapsed, execute a single tlbi_all
> command to clean up the TLB, then free the batched IOVAs. We call this
> non-strict mode.
> 
> But it must be noted that, although the mapping has already been removed
> from the page table, it may still exist in the TLB, and the freed
> physical memory may be reused by others. An attacker could therefore
> keep accessing memory through the just-freed IOVA, to obtain sensitive
> data or corrupt memory. So VFIO should always use strict mode.
> 
> One might consider deferring the physical memory free as well, which
> would preserve strict-mode semantics. But in the map_sg cases the memory
> allocation is not controlled by the IOMMU APIs, so this is not
> enforceable.
> 
> Fortunately, Intel and AMD have already implemented non-strict mode and
> placed the queue_iova() operation in the common file dma-iommu.c, and my
> work is based on that. The difference is that the arm-smmu-v3 driver
> calls the common IOMMU APIs to unmap, while the Intel and AMD IOMMU
> drivers do not.
> 
> Below is the performance data of strict vs non-strict for an NVMe device:
> Random Read  IOPS: 146K (strict) vs 573K (non-strict)
> Random Write IOPS: 143K (strict) vs 513K (non-strict)


What hardware is this on? If it's SMMUv3 without MSIs (e.g. D05), then 
you'll still be using the rubbish globally-blocking sync implementation. 
If that is the case, I'd be very interested to see how much there is to 
gain from just improving that - I've had a patch kicking around for a 
while[1] (also on a rebased branch at [2]), but don't have the means for 
serious performance testing.

Robin.

[1] 
https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg20576.html
[2] git://linux-arm.org/linux-rm iommu/smmu

> 
> 
> Zhen Lei (7):
>    iommu/dma: fix trival coding style mistake
>    iommu/arm-smmu-v3: fix the implementation of flush_iotlb_all hook
>    iommu: prepare for the non-strict mode support
>    iommu/amd: make sure TLB to be flushed before IOVA freed
>    iommu/dma: add support for non-strict mode
>    iommu/io-pgtable-arm: add support for non-strict mode
>    iommu/arm-smmu-v3: add support for non-strict mode
> 
>   drivers/iommu/amd_iommu.c          |  2 +-
>   drivers/iommu/arm-smmu-v3.c        | 16 ++++++++++++---
>   drivers/iommu/arm-smmu.c           |  2 +-
>   drivers/iommu/dma-iommu.c          | 41 ++++++++++++++++++++++++++++++--------
>   drivers/iommu/io-pgtable-arm-v7s.c |  6 +++---
>   drivers/iommu/io-pgtable-arm.c     | 28 ++++++++++++++------------
>   drivers/iommu/io-pgtable.h         |  2 +-
>   drivers/iommu/ipmmu-vmsa.c         |  2 +-
>   drivers/iommu/msm_iommu.c          |  2 +-
>   drivers/iommu/mtk_iommu.c          |  2 +-
>   drivers/iommu/qcom_iommu.c         |  2 +-
>   include/linux/iommu.h              |  5 +++++
>   12 files changed, 76 insertions(+), 34 deletions(-)
>
Robin Murphy May 31, 2018, 1:04 p.m. UTC | #2
On 31/05/18 08:42, Zhen Lei wrote:
> 1. Save the related domain pointer in struct iommu_dma_cookie, so that
>    iovad can call domain->ops->flush_iotlb_all to flush the TLB.
> 2. Define a new iommu capability: IOMMU_CAP_NON_STRICT, used to indicate
>    that the iommu domain supports non-strict mode.
> 3. During the iommu domain initialization phase, call capable() to check
>    whether it supports non-strict mode. If so, call init_iova_flush_queue
>    to register the iovad->flush_cb callback.
> 4. All unmap (including iova-free) APIs finally invoke __iommu_dma_unmap
>    -->iommu_dma_free_iova. Use iovad->flush_cb to check whether the
>    related iommu supports non-strict mode, and use IOMMU_DOMAIN_IS_STRICT
>    to make sure IOMMU_DOMAIN_UNMANAGED domains always follow strict mode.


Once again, this is a whole load of complexity for a property which 
could just be statically encoded at allocation, e.g. in the cookie type.

> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>   drivers/iommu/dma-iommu.c | 29 ++++++++++++++++++++++++++---
>   include/linux/iommu.h     |  3 +++
>   2 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 4e885f7..2e116d9 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -55,6 +55,7 @@ struct iommu_dma_cookie {
>   	};
>   	struct list_head		msi_page_list;
>   	spinlock_t			msi_lock;
> +	struct iommu_domain		*domain;
>   };
>   
>   static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
> @@ -64,7 +65,8 @@ static inline size_t cookie_msi_granule(struct iommu_dma_cookie *cookie)
>   	return PAGE_SIZE;
>   }
>   
> -static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
> +static struct iommu_dma_cookie *cookie_alloc(struct iommu_domain *domain,
> +					     enum iommu_dma_cookie_type type)
>   {
>   	struct iommu_dma_cookie *cookie;
>   
> @@ -73,6 +75,7 @@ static struct iommu_dma_cookie *cookie_alloc(enum iommu_dma_cookie_type type)
>   		spin_lock_init(&cookie->msi_lock);
>   		INIT_LIST_HEAD(&cookie->msi_page_list);
>   		cookie->type = type;
> +		cookie->domain = domain;
>   	}
>   	return cookie;
>   }
> @@ -94,7 +97,7 @@ int iommu_get_dma_cookie(struct iommu_domain *domain)
>   	if (domain->iova_cookie)
>   		return -EEXIST;
>   
> -	domain->iova_cookie = cookie_alloc(IOMMU_DMA_IOVA_COOKIE);
> +	domain->iova_cookie = cookie_alloc(domain, IOMMU_DMA_IOVA_COOKIE);
>   	if (!domain->iova_cookie)
>   		return -ENOMEM;
>   
> @@ -124,7 +127,7 @@ int iommu_get_msi_cookie(struct iommu_domain *domain, dma_addr_t base)
>   	if (domain->iova_cookie)
>   		return -EEXIST;
>   
> -	cookie = cookie_alloc(IOMMU_DMA_MSI_COOKIE);
> +	cookie = cookie_alloc(domain, IOMMU_DMA_MSI_COOKIE);
>   	if (!cookie)
>   		return -ENOMEM;
>   
> @@ -261,6 +264,17 @@ static int iova_reserve_iommu_regions(struct device *dev,
>   	return ret;
>   }
>   
> +static void iova_flush_iotlb_all(struct iova_domain *iovad)


iommu_dma_flush...

> +{
> +	struct iommu_dma_cookie *cookie;
> +	struct iommu_domain *domain;
> +
> +	cookie = container_of(iovad, struct iommu_dma_cookie, iovad);
> +	domain = cookie->domain;
> +
> +	domain->ops->flush_iotlb_all(domain);
> +}
> +
>   /**
>    * iommu_dma_init_domain - Initialise a DMA mapping domain
>    * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
> @@ -276,6 +290,7 @@ static int iova_reserve_iommu_regions(struct device *dev,
>   int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
>   		u64 size, struct device *dev)
>   {
> +	const struct iommu_ops *ops = domain->ops;
>   	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>   	struct iova_domain *iovad = &cookie->iovad;
>   	unsigned long order, base_pfn, end_pfn;
> @@ -313,6 +328,11 @@ int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base,
>   
>   	init_iova_domain(iovad, 1UL << order, base_pfn);
>   
> +	if (ops->capable && ops->capable(IOMMU_CAP_NON_STRICT)) {
> +		BUG_ON(!ops->flush_iotlb_all);
> +		init_iova_flush_queue(iovad, iova_flush_iotlb_all, NULL);
> +	}
> +
>   	return iova_reserve_iommu_regions(dev, domain);
>   }
>   EXPORT_SYMBOL(iommu_dma_init_domain);
> @@ -392,6 +412,9 @@ static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
>   	/* The MSI case is only ever cleaning up its most recent allocation */
>   	if (cookie->type == IOMMU_DMA_MSI_COOKIE)
>   		cookie->msi_iova -= size;
> +	else if (!IOMMU_DOMAIN_IS_STRICT(cookie->domain) && iovad->flush_cb)
> +		queue_iova(iovad, iova_pfn(iovad, iova),
> +				size >> iova_shift(iovad), 0);
>   	else
>   		free_iova_fast(iovad, iova_pfn(iovad, iova),
>   				size >> iova_shift(iovad));
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 39b3150..01ff569 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -87,6 +87,8 @@ struct iommu_domain_geometry {
>   				 __IOMMU_DOMAIN_DMA_API)
>   
>   #define IOMMU_STRICT		1
> +#define IOMMU_DOMAIN_IS_STRICT(domain)	\
> +		(domain->type == IOMMU_DOMAIN_UNMANAGED)
>   
>   struct iommu_domain {
>   	unsigned type;
> @@ -103,6 +105,7 @@ enum iommu_cap {
>   					   transactions */
>   	IOMMU_CAP_INTR_REMAP,		/* IOMMU supports interrupt isolation */
>   	IOMMU_CAP_NOEXEC,		/* IOMMU_NOEXEC flag */
> +	IOMMU_CAP_NON_STRICT,		/* IOMMU supports non-strict mode */


This isn't a property of the IOMMU, it depends purely on the driver 
implementation. I think it also doesn't matter anyway - if a caller asks 
for lazy unmapping on their domain but the IOMMU driver just does strict 
unmaps anyway because that's all it supports, there's no actual harm done.

Robin.

>   };
>   
>   /*
>
Robin Murphy May 31, 2018, 2:25 p.m. UTC | #3
On 31/05/18 14:49, Hanjun Guo wrote:
> Hi Robin,
> 
> On 2018/5/31 19:24, Robin Murphy wrote:
>> On 31/05/18 08:42, Zhen Lei wrote:
>>> In general, an IOMMU unmap operation follows the steps below:
>>> 1. remove the mapping from the page table for the specified IOVA range
>>> 2. execute a TLBI command to invalidate the mapping cached in the TLB
>>> 3. wait for the TLBI operation to finish
>>> 4. free the IOVA resource
>>> 5. free the physical memory resource
>>>
>>> This can become a problem when unmaps are very frequent: the combination
>>> of TLBI and wait operations consumes a lot of time. A feasible method is
>>> to defer the TLBI and IOVA-free operations; once a certain number has
>>> accumulated, or a specified time has elapsed, execute a single tlbi_all
>>> command to clean up the TLB, then free the batched IOVAs. We call this
>>> non-strict mode.
>>>
>>> But it must be noted that, although the mapping has already been removed
>>> from the page table, it may still exist in the TLB, and the freed
>>> physical memory may be reused by others. An attacker could therefore
>>> keep accessing memory through the just-freed IOVA, to obtain sensitive
>>> data or corrupt memory. So VFIO should always use strict mode.
>>>
>>> One might consider deferring the physical memory free as well, which
>>> would preserve strict-mode semantics. But in the map_sg cases the memory
>>> allocation is not controlled by the IOMMU APIs, so this is not
>>> enforceable.
>>>
>>> Fortunately, Intel and AMD have already implemented non-strict mode and
>>> placed the queue_iova() operation in the common file dma-iommu.c, and my
>>> work is based on that. The difference is that the arm-smmu-v3 driver
>>> calls the common IOMMU APIs to unmap, while the Intel and AMD IOMMU
>>> drivers do not.
>>>
>>> Below is the performance data of strict vs non-strict for an NVMe device:
>>> Random Read  IOPS: 146K (strict) vs 573K (non-strict)
>>> Random Write IOPS: 143K (strict) vs 513K (non-strict)
>>
>> What hardware is this on? If it's SMMUv3 without MSIs (e.g. D05), then
>> you'll still be using the rubbish globally-blocking sync implementation.
>> If that is the case, I'd be very interested to see how much there is to
>> gain from just improving that - I've had a patch kicking around for a
>> while[1] (also on a rebased branch at [2]), but don't have the means for
>> serious performance testing.
> 
> The hardware is the new D06, whose SMMU has MSIs,


Cool! Now that profiling is fairly useful since we got rid of most of 
the locks, are you able to get an idea of how the overhead in the normal 
case is distributed between arm_smmu_cmdq_insert_cmd() and 
__arm_smmu_sync_poll_msi()? We're always trying to improve our 
understanding of where command-queue-related overheads turn out to be in 
practice, and there's still potentially room to do nicer things than 
TLBI_NH_ALL ;)

Robin.

> it's not D05 :)
> 
> Thanks
> Hanjun
>