diff mbox series

[v3,3/4] scsi: core: Cap shost max_sectors according to DMA optimum mapping limits

Message ID 1654507822-168026-4-git-send-email-john.garry@huawei.com
State Superseded
Headers show
Series DMA mapping changes for SCSI core | expand

Commit Message

John Garry June 6, 2022, 9:30 a.m. UTC
Streaming DMA mappings may be considerably slower when mappings go through
an IOMMU and the total mapping length is somewhat long. This is because the
IOMMU IOVA code allocates and free an IOVA for each mapping, which may
affect performance.

For performance reasons set the request_queue max_sectors from
dma_opt_mapping_size(), which knows this mapping limit.

In addition, the shost->max_sectors is repeatedly set for each sdev in
__scsi_init_queue(). This is unnecessary, so set once when adding the
host.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
---
 drivers/scsi/hosts.c    | 5 +++++
 drivers/scsi/scsi_lib.c | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

Comments

Bart Van Assche June 8, 2022, 5:33 p.m. UTC | #1
On 6/6/22 02:30, John Garry wrote:
> +	if (dma_dev->dma_mask) {
> +		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
> +				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
> +	}

Since IOVA_RANGE_CACHE_MAX_SIZE = 6 this limits max_sectors to 2**6 * 
PAGE_SIZE or 256 KiB if the page size is 4 KiB. I think that's too 
small. Some (SRP) storage arrays require much larger transfers to 
achieve optimal performance.

Thanks,

Bart.
John Garry June 8, 2022, 5:50 p.m. UTC | #2
On 08/06/2022 18:33, Bart Van Assche wrote:
> On 6/6/22 02:30, John Garry wrote:
>> +    if (dma_dev->dma_mask) {
>> +        shost->max_sectors = min_t(unsigned int, shost->max_sectors,
>> +                dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
>> +    }
> 
> Since IOVA_RANGE_CACHE_MAX_SIZE = 6 this limits max_sectors to 2**6 * 
> PAGE_SIZE or 256 KiB if the page size is 4 KiB.

It's actually 128K for 4K page size, as any IOVA size is roundup to 
power-of-2 when testing if we may cache it, which means anything >128K 
would roundup to 256K and cannot be cached.

> I think that's too 
> small. Some (SRP) storage arrays require much larger transfers to 
> achieve optimal performance.

Have you tried this achieve this optimal performance with an IOMMU enabled?

Please note that this limit only applies if we have an IOMMU enabled for 
the scsi host dma device. Otherwise we are limited by dma direct or 
swiotlb max mapping size, as before.

Thanks,
John
Bart Van Assche June 8, 2022, 9:07 p.m. UTC | #3
On 6/8/22 10:50, John Garry wrote:
> Please note that this limit only applies if we have an IOMMU enabled for 
> the scsi host dma device. Otherwise we are limited by dma direct or 
> swiotlb max mapping size, as before.

SCSI host bus adapters that support 64-bit DMA may support much larger 
transfer sizes than 128 KiB.

Thanks,

Bart.
John Garry June 9, 2022, 8 a.m. UTC | #4
On 08/06/2022 22:07, Bart Van Assche wrote:
> On 6/8/22 10:50, John Garry wrote:
>> Please note that this limit only applies if we have an IOMMU enabled 
>> for the scsi host dma device. Otherwise we are limited by dma direct 
>> or swiotlb max mapping size, as before.
> 
> SCSI host bus adapters that support 64-bit DMA may support much larger 
> transfer sizes than 128 KiB.

Indeed, and that is my problem today, as my storage controller is 
generating DMA mapping lengths which exceeds 128K and they slow 
everything down.

If you say that SRP enjoys best peformance with larger transfers then 
can you please test this with an IOMMU enabled (iommu group type DMA or 
DMA-FQ)?

Thanks,
John
Bart Van Assche June 9, 2022, 5:18 p.m. UTC | #5
On 6/9/22 01:00, John Garry wrote:
> On 08/06/2022 22:07, Bart Van Assche wrote:
>> On 6/8/22 10:50, John Garry wrote:
>>> Please note that this limit only applies if we have an IOMMU enabled 
>>> for the scsi host dma device. Otherwise we are limited by dma direct 
>>> or swiotlb max mapping size, as before.
>>
>> SCSI host bus adapters that support 64-bit DMA may support much larger 
>> transfer sizes than 128 KiB.
> 
> Indeed, and that is my problem today, as my storage controller is 
> generating DMA mapping lengths which exceeds 128K and they slow 
> everything down.
> 
> If you say that SRP enjoys best peformance with larger transfers then 
> can you please test this with an IOMMU enabled (iommu group type DMA or 
> DMA-FQ)?

Hmm ... what exactly do you want me to test? Do you perhaps want me to 
measure how much performance drops with an IOMMU enabled? I don't have 
access anymore to the SRP setup I referred to in my previous email. But 
I do have access to devices that boot from UFS storage. For these 
devices we need to transfer 2 MiB per request to achieve full bandwidth.

Thanks,

Bart.
John Garry June 9, 2022, 5:54 p.m. UTC | #6
On 09/06/2022 18:18, Bart Van Assche wrote:
>>>
>>> SCSI host bus adapters that support 64-bit DMA may support much 
>>> larger transfer sizes than 128 KiB.
>>
>> Indeed, and that is my problem today, as my storage controller is 
>> generating DMA mapping lengths which exceeds 128K and they slow 
>> everything down.
>>
>> If you say that SRP enjoys best peformance with larger transfers then 
>> can you please test this with an IOMMU enabled (iommu group type DMA 
>> or DMA-FQ)?
> 
> Hmm ... what exactly do you want me to test? Do you perhaps want me to 
> measure how much performance drops with an IOMMU enabled? 

Yes, I would like to know of any performance change with an IOMMU 
enabled and then with an IOMMU enabled and including my series.

> I don't have 
> access anymore to the SRP setup I referred to in my previous email. But 
> I do have access to devices that boot from UFS storage. For these 
> devices we need to transfer 2 MiB per request to achieve full bandwidth.

ok, but do you have a system where the UFS host controller is behind an 
IOMMU? I had the impression that UFS controllers would be mostly found 
in embedded systems and IOMMUs are not as common on there.

Thanks,
John
Bart Van Assche June 9, 2022, 8:34 p.m. UTC | #7
On 6/9/22 10:54, John Garry wrote:
> ok, but do you have a system where the UFS host controller is behind an 
> IOMMU? I had the impression that UFS controllers would be mostly found 
> in embedded systems and IOMMUs are not as common on there.

Modern phones have an IOMMU. Below one can find an example from a Pixel 
6 phone. The UFS storage controller is not controller by the IOMMU as 
far as I can see but I wouldn't be surprised if the security team would 
ask us one day to enable the IOMMU for the UFS controller.

# (cd /sys/class/iommu && ls */devices)
1a090000.sysmmu/devices:
19000000.aoc

1a510000.sysmmu/devices:
1a440000.lwis_csi

1a540000.sysmmu/devices:
1aa40000.lwis_pdp

1a880000.sysmmu/devices:
1a840000.lwis_g3aa

1ad00000.sysmmu/devices:
1ac40000.lwis_ipp  1ac80000.lwis_gtnr_align

1b080000.sysmmu/devices:
1b450000.lwis_itp

1b780000.sysmmu/devices:

1b7b0000.sysmmu/devices:
1b760000.lwis_mcsc

1b7e0000.sysmmu/devices:

1baa0000.sysmmu/devices:
1a4e0000.lwis_votf  1ba40000.lwis_gdc

1bad0000.sysmmu/devices:
1ba60000.lwis_gdc

1bb00000.sysmmu/devices:
1ba80000.lwis_scsc

1bc70000.sysmmu/devices:
1bc40000.lwis_gtnr_merge

1bca0000.sysmmu/devices:

1bcd0000.sysmmu/devices:

1bd00000.sysmmu/devices:

1bd30000.sysmmu/devices:

1c100000.sysmmu/devices:
1c300000.drmdecon  1c302000.drmdecon

1c110000.sysmmu/devices:

1c120000.sysmmu/devices:

1c660000.sysmmu/devices:
1c640000.g2d

1c690000.sysmmu/devices:

1c710000.sysmmu/devices:
1c700000.smfc

1c870000.sysmmu/devices:
1c8d0000.MFC-0  mfc

1c8a0000.sysmmu/devices:

1ca40000.sysmmu/devices:
1cb00000.bigocean

1cc40000.sysmmu/devices:
1ce00000.abrolhos

Bart.
John Garry June 10, 2022, 3:37 p.m. UTC | #8
On 09/06/2022 21:34, Bart Van Assche wrote:
> On 6/9/22 10:54, John Garry wrote:
>> ok, but do you have a system where the UFS host controller is behind 
>> an IOMMU? I had the impression that UFS controllers would be mostly 
>> found in embedded systems and IOMMUs are not as common on there.
> 
> Modern phones have an IOMMU. Below one can find an example from a Pixel 
> 6 phone. The UFS storage controller is not controller by the IOMMU as 
> far as I can see but I wouldn't be surprised if the security team would 
> ask us one day to enable the IOMMU for the UFS controller.

OK, then unfortunately it seems that you have no method to test. I might 
be able to test USB MSC but I am not even sure if I can even get DMA 
mappings who length exceeds the IOVA rcache limit there.

Thanks,
John
John Garry June 23, 2022, 8:36 a.m. UTC | #9
On 10/06/2022 16:37, John Garry via iommu wrote:
> 
>> On 6/9/22 10:54, John Garry wrote:
>>> ok, but do you have a system where the UFS host controller is behind 
>>> an IOMMU? I had the impression that UFS controllers would be mostly 
>>> found in embedded systems and IOMMUs are not as common on there.
>>
>> Modern phones have an IOMMU. Below one can find an example from a 
>> Pixel 6 phone. The UFS storage controller is not controller by the 
>> IOMMU as far as I can see but I wouldn't be surprised if the security 
>> team would ask us one day to enable the IOMMU for the UFS controller.
> 
> OK, then unfortunately it seems that you have no method to test. I might 
> be able to test USB MSC but I am not even sure if I can even get DMA 
> mappings who length exceeds the IOVA rcache limit there.

I was able to do some testing on USB MSC for an XHCI controller. The 
result is that limiting the max HW sectors there does not affect 
performance in normal conditions.

However if I hack the USB driver and fiddle with request queue settings 
then it can:
- lift max_sectors limit in usb_stor_host_template 120KB -> 256KB
- lift request queue read_ahead_kb 128KB -> 256KB

In this scenario I can get 42.5MB/s read throughput, as opposed to 
39.5MB/s in normal conditions. Since .can_queue=1 for that host it would 
not fall foul of some issues I experience in IOVA allocator performance, 
so limiting max_sectors would not be required for that reason.

So this is an artificial test, but it may be worth considering only 
applying this DMA mapping optimal max_sectors limit to SAS controllers 
which I know can benefit.

Christoph, any opinion?

thanks,
John
diff mbox series

Patch

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 8352f90d997d..ea1a207634d1 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -236,6 +236,11 @@  int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (dma_dev->dma_mask) {
+		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
+				dma_opt_mapping_size(dma_dev) >> SECTOR_SHIFT);
+	}
+
 	error = scsi_mq_setup_tags(shost);
 	if (error)
 		goto fail;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6ffc9e4258a8..6ce8acea322a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1884,10 +1884,6 @@  void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 		blk_queue_max_integrity_segments(q, shost->sg_prot_tablesize);
 	}
 
-	if (dev->dma_mask) {
-		shost->max_sectors = min_t(unsigned int, shost->max_sectors,
-				dma_max_mapping_size(dev) >> SECTOR_SHIFT);
-	}
 	blk_queue_max_hw_sectors(q, shost->max_sectors);
 	blk_queue_segment_boundary(q, shost->dma_boundary);
 	dma_set_seg_boundary(dev, shost->dma_boundary);