diff mbox series

[v1] mmc: sdhci: fix max req size based on spec

Message ID TY3P286MB2611D07641D842BD373FD3A8987F2@TY3P286MB2611.JPNP286.PROD.OUTLOOK.COM
State New
Headers show
Series [v1] mmc: sdhci: fix max req size based on spec | expand

Commit Message

Qu Shengyu Jan. 28, 2024, 10:01 a.m. UTC
For almost 2 decades, the max allowed requests were limited to 512KB
because of SDMA's max 512KiB boundary limit.

ADMA2 and ADMA3 do not have such limit and were effectively made so any
kind of block count would not impose interrupt and managing stress to the
host.

By limiting that to 512KiB, it effectively downgrades these DMA modes to
SDMA.

Fix that by actually following the spec:
When ADMA is selected tuning mode is advised. On lesser modes, 4MiB
transfer is selected as max, so re-tuning if timer trigger or if requested
by host interrupt, can be done in time. Otherwise, the only limit is the
variable size of types used. In this implementation, 16MiB is used as
maximum since tests showed that after that point, there are diminishing
returns.

Also 16MiB in worst case scenarios, when card is eMMC and its max speed is
a generous 350MiB/s, will generate interrupts every 45ms on huge data
transfers.

A new `adma_get_req_limit` sdhci host function was also introduced, to let
vendors override imposed limits by the generic implementation if needed.

For example, on local tests with rigorous CPU/GPU burn-in tests and abrupt
cut-offs to generate huge temperature changes (upwards/downwards) to the
card, tested host was fine up to 128MB/s transfers on slow cards that used
SDR104 bus timing without re-tuning.
In that case the 4MiB limit was overridden with a more than safe 8MiB
value.

In all testing cases and boards, that change brought the following:

Depending on bus timing and eMMC/SD specs:
* Max Read throughput increased by 2-20%
* Max Write throughput increased by 50-200%
Depending on CPU frequency and transfer sizes:
* Reduced mmcqd cpu core usage by 4-50%

Above commit message comes from original author whose id is CTCaer, with
SoB email address ctcaer@gmail.com. I tried to contact with the author 1
month ago to ask for sending it to mainline or get the authority to submit
by myself, but I didn't get any reply, so I decided to send this patch by
myself. Original commit is here[1].

The author also has a patch[2] applied after this patch, which overrides
adma size on tegra device from device tree property. But I don't have a
tegra device to actually test that, so it is not sent, and
adma_get_req_limit part is not included in this version of patch.

[1]: https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/fa86ebbd56d30b3b6af26e1d1c3f9c411a47e98e
[2]: https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/385f9335b9a60ce471ac3291f202b1326212be3e

Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
---
 drivers/mmc/host/sdhci.c | 17 ++++++++++++-----
 drivers/mmc/host/sdhci.h |  4 ++--
 2 files changed, 14 insertions(+), 7 deletions(-)

Comments

Adrian Hunter Jan. 30, 2024, 6:54 a.m. UTC | #1
On 28/01/24 12:01, Shengyu Qu wrote:
> For almost 2 decades, the max allowed requests were limited to 512KB
> because of SDMA's max 512KiB boundary limit.

It is not limited by SDMA.  It is limited by choice.

> 
> ADMA2 and ADMA3 do not have such limit and were effectively made so any
> kind of block count would not impose interrupt and managing stress to the
> host.

The main benefit of ADMA is that it provides scatter/gather and so does
not need a bounce buffer.

> 
> By limiting that to 512KiB, it effectively downgrades these DMA modes to
> SDMA.

Not really.

> 
> Fix that by actually following the spec:
> When ADMA is selected tuning mode is advised. On lesser modes, 4MiB
> transfer is selected as max, so re-tuning if timer trigger or if requested
> by host interrupt, can be done in time. Otherwise, the only limit is the
> variable size of types used. In this implementation, 16MiB is used as
> maximum since tests showed that after that point, there are diminishing
> returns.
> 
> Also 16MiB in worst case scenarios, when card is eMMC and its max speed is
> a generous 350MiB/s, will generate interrupts every 45ms on huge data
> transfers.
> 
> A new `adma_get_req_limit` sdhci host function was also introduced, to let
> vendors override imposed limits by the generic implementation if needed.

Not in this patch?

> 
> For example, on local tests with rigorous CPU/GPU burn-in tests and abrupt
> cut-offs to generate huge temperature changes (upwards/downwards) to the
> card, tested host was fine up to 128MB/s transfers on slow cards that used
> SDR104 bus timing without re-tuning.
> In that case the 4MiB limit was overridden with a more than safe 8MiB
> value.
> 
> In all testing cases and boards, that change brought the following:

"all testing cases and boards" doesn't mean much to anyone else. You
need to be more explicit.

> 
> Depending on bus timing and eMMC/SD specs:
> * Max Read throughput increased by 2-20%
> * Max Write throughput increased by 50-200%
> Depending on CPU frequency and transfer sizes:
> * Reduced mmcqd cpu core usage by 4-50%

The main issue with increasing the request size is that it introduces much
more latency for synchronous reads e.g. a synchronous read may have to wait
for a large write operation.  Generally, that is probably a show-stopper
for unconditionally increasing the maximum request size.

> 
> Above commit message comes from original author whose id is CTCaer, with
> SoB email address ctcaer@gmail.com. I tried to contact with the author 1
> month ago to ask for sending it to mainline or get the authority to submit
> by myself, but I didn't get any reply, so I decided to send this patch by
> myself. Original commit is here[1].

Ok, so it is not your patch and the original author is out of touch.

Is there a particular reason you wanted this patch?

> 
> The author also has a patch[2] applied after this patch, which overrides
> adma size on tegra device from device tree property. But I don't have a
> tegra device to actually test that, so it is not sent, and
> adma_get_req_limit part is not included in this version of patch.

Does that mean you haven't tested this patch yourself at all?

> 
> [1]: https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/fa86ebbd56d30b3b6af26e1d1c3f9c411a47e98e
> [2]: https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/385f9335b9a60ce471ac3291f202b1326212be3e
> 
> Signed-off-by: Shengyu Qu <wiagn233@outlook.com>
> ---
>  drivers/mmc/host/sdhci.c | 17 ++++++++++++-----
>  drivers/mmc/host/sdhci.h |  4 ++--
>  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> index c79f73459915..f546b675c7b9 100644
> --- a/drivers/mmc/host/sdhci.c
> +++ b/drivers/mmc/host/sdhci.c
> @@ -1081,7 +1081,7 @@ static void sdhci_initialize_data(struct sdhci_host *host,
>  	WARN_ON(host->data);
>  
>  	/* Sanity checks */
> -	BUG_ON(data->blksz * data->blocks > 524288);
> +	BUG_ON(data->blksz * data->blocks > host->mmc->max_req_size);
>  	BUG_ON(data->blksz > host->mmc->max_blk_size);
>  	BUG_ON(data->blocks > 65535);
>  
> @@ -4690,11 +4690,18 @@ int sdhci_setup_host(struct sdhci_host *host)
>  
>  	/*
>  	 * Maximum number of sectors in one transfer. Limited by SDMA boundary
> -	 * size (512KiB). Note some tuning modes impose a 4MiB limit, but this
> -	 * is less anyway.
> +	 * size and by tuning modes on ADMA. On tuning mode 3 16MiB is more than
> +	 * enough to cover big data transfers.
>  	 */
> -	mmc->max_req_size = 524288;
> -
> +	if (host->flags & SDHCI_USE_ADMA) {
> +		if (host->tuning_mode != SDHCI_TUNING_MODE_3)
> +			mmc->max_req_size = SZ_4M;
> +		else
> +			mmc->max_req_size = SZ_16M;
> +	} else {
> +		/* On PIO/SDMA use SDMA boundary size (512KiB). */
> +		mmc->max_req_size = SZ_512K;
> +	}
>  	/*
>  	 * Maximum number of segments. Depends on if the hardware
>  	 * can do scatter/gather or not.
> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
> index a20864fc0641..98252c427feb 100644
> --- a/drivers/mmc/host/sdhci.h
> +++ b/drivers/mmc/host/sdhci.h
> @@ -346,11 +346,11 @@ struct sdhci_adma2_64_desc {
>  #define ADMA2_END		0x2
>  
>  /*
> - * Maximum segments assuming a 512KiB maximum requisition size and a minimum
> + * Maximum segments assuming a 16MiB maximum requisition size and a minimum
>   * 4KiB page size. Note this also allows enough for multiple descriptors in
>   * case of PAGE_SIZE >= 64KiB.
>   */
> -#define SDHCI_MAX_SEGS		128
> +#define SDHCI_MAX_SEGS		4096
>  
>  /* Allow for a command request and a data request at the same time */
>  #define SDHCI_MAX_MRQS		2
Qu Shengyu Feb. 2, 2024, 6:36 p.m. UTC | #2
Hello Adrian,

Sorry for my late reply. I was quite busy this week.


在 2024/1/30 14:54, Adrian Hunter 写道:
> On 28/01/24 12:01, Shengyu Qu wrote:
>> For almost 2 decades, the max allowed requests were limited to 512KB
>> because of SDMA's max 512KiB boundary limit.
> It is not limited by SDMA.  It is limited by choice.
>
>> ADMA2 and ADMA3 do not have such limit and were effectively made so any
>> kind of block count would not impose interrupt and managing stress to the
>> host.
> The main benefit of ADMA is that it provides scatter/gather and so does
> not need a bounce buffer.
>
>> By limiting that to 512KiB, it effectively downgrades these DMA modes to
>> SDMA.
> Not really.
So maybe I could just delete this part of the commit message?
>> Fix that by actually following the spec:
>> When ADMA is selected tuning mode is advised. On lesser modes, 4MiB
>> transfer is selected as max, so re-tuning if timer trigger or if requested
>> by host interrupt, can be done in time. Otherwise, the only limit is the
>> variable size of types used. In this implementation, 16MiB is used as
>> maximum since tests showed that after that point, there are diminishing
>> returns.
>>
>> Also 16MiB in worst case scenarios, when card is eMMC and its max speed is
>> a generous 350MiB/s, will generate interrupts every 45ms on huge data
>> transfers.
>>
>> A new `adma_get_req_limit` sdhci host function was also introduced, to let
>> vendors override imposed limits by the generic implementation if needed.
> Not in this patch?
Yes, this part is removed from my version of patch, original version has a
"adma_get_req_limit" host function and implements it to tegra. And there's
another patch adding a new device tree item to tegra chips to set the limit
from device tree config.
>> For example, on local tests with rigorous CPU/GPU burn-in tests and abrupt
>> cut-offs to generate huge temperature changes (upwards/downwards) to the
>> card, tested host was fine up to 128MB/s transfers on slow cards that used
>> SDR104 bus timing without re-tuning.
>> In that case the 4MiB limit was overridden with a more than safe 8MiB
>> value.
>>
>> In all testing cases and boards, that change brought the following:
> "all testing cases and boards" doesn't mean much to anyone else. You
> need to be more explicit.
>
>> Depending on bus timing and eMMC/SD specs:
>> * Max Read throughput increased by 2-20%
>> * Max Write throughput increased by 50-200%
>> Depending on CPU frequency and transfer sizes:
>> * Reduced mmcqd cpu core usage by 4-50%
> The main issue with increasing the request size is that it introduces much
> more latency for synchronous reads e.g. a synchronous read may have to wait
> for a large write operation.  Generally, that is probably a show-stopper
> for unconditionally increasing the maximum request size.
>> Above commit message comes from original author whose id is CTCaer, with
>> SoB email addressctcaer@gmail.com. I tried to contact with the author 1
>> month ago to ask for sending it to mainline or get the authority to submit
>> by myself, but I didn't get any reply, so I decided to send this patch by
>> myself. Original commit is here[1].
> Ok, so it is not your patch and the original author is out of touch.
>
> Is there a particular reason you wanted this patch?
>
>> The author also has a patch[2] applied after this patch, which overrides
>> adma size on tegra device from device tree property. But I don't have a
>> tegra device to actually test that, so it is not sent, and
>> adma_get_req_limit part is not included in this version of patch.
> Does that mean you haven't tested this patch yourself at all?
Yes, I have tested that on my Raspberry Pi 5b and Starfive Visionfive 2. On
pi5B, It brings about 23% speed increase in sequential write. So there's a
huge improvement in improving write performance that I think that's enough
to send this patch to upstream. On Visionfive 2, performance is mainly
limited by low mmc clock and cannot be increased, so I can't see obvious
improvement on that.

Best regards,
Shengyu

>> [1]:https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/fa86ebbd56d30b3b6af26e1d1c3f9c411a47e98e
>> [2]:https://github.com/CTCaer/switch-l4t-kernel-4.9/commit/385f9335b9a60ce471ac3291f202b1326212be3e
>>
>> Signed-off-by: Shengyu Qu<wiagn233@outlook.com>
>> ---
>>   drivers/mmc/host/sdhci.c | 17 ++++++++++++-----
>>   drivers/mmc/host/sdhci.h |  4 ++--
>>   2 files changed, 14 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>> index c79f73459915..f546b675c7b9 100644
>> --- a/drivers/mmc/host/sdhci.c
>> +++ b/drivers/mmc/host/sdhci.c
>> @@ -1081,7 +1081,7 @@ static void sdhci_initialize_data(struct sdhci_host *host,
>>   	WARN_ON(host->data);
>>   
>>   	/* Sanity checks */
>> -	BUG_ON(data->blksz * data->blocks > 524288);
>> +	BUG_ON(data->blksz * data->blocks > host->mmc->max_req_size);
>>   	BUG_ON(data->blksz > host->mmc->max_blk_size);
>>   	BUG_ON(data->blocks > 65535);
>>   
>> @@ -4690,11 +4690,18 @@ int sdhci_setup_host(struct sdhci_host *host)
>>   
>>   	/*
>>   	 * Maximum number of sectors in one transfer. Limited by SDMA boundary
>> -	 * size (512KiB). Note some tuning modes impose a 4MiB limit, but this
>> -	 * is less anyway.
>> +	 * size and by tuning modes on ADMA. On tuning mode 3 16MiB is more than
>> +	 * enough to cover big data transfers.
>>   	 */
>> -	mmc->max_req_size = 524288;
>> -
>> +	if (host->flags & SDHCI_USE_ADMA) {
>> +		if (host->tuning_mode != SDHCI_TUNING_MODE_3)
>> +			mmc->max_req_size = SZ_4M;
>> +		else
>> +			mmc->max_req_size = SZ_16M;
>> +	} else {
>> +		/* On PIO/SDMA use SDMA boundary size (512KiB). */
>> +		mmc->max_req_size = SZ_512K;
>> +	}
>>   	/*
>>   	 * Maximum number of segments. Depends on if the hardware
>>   	 * can do scatter/gather or not.
>> diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
>> index a20864fc0641..98252c427feb 100644
>> --- a/drivers/mmc/host/sdhci.h
>> +++ b/drivers/mmc/host/sdhci.h
>> @@ -346,11 +346,11 @@ struct sdhci_adma2_64_desc {
>>   #define ADMA2_END		0x2
>>   
>>   /*
>> - * Maximum segments assuming a 512KiB maximum requisition size and a minimum
>> + * Maximum segments assuming a 16MiB maximum requisition size and a minimum
>>    * 4KiB page size. Note this also allows enough for multiple descriptors in
>>    * case of PAGE_SIZE >= 64KiB.
>>    */
>> -#define SDHCI_MAX_SEGS		128
>> +#define SDHCI_MAX_SEGS		4096
>>   
>>   /* Allow for a command request and a data request at the same time */
>>   #define SDHCI_MAX_MRQS		2
diff mbox series

Patch

diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index c79f73459915..f546b675c7b9 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -1081,7 +1081,7 @@  static void sdhci_initialize_data(struct sdhci_host *host,
 	WARN_ON(host->data);
 
 	/* Sanity checks */
-	BUG_ON(data->blksz * data->blocks > 524288);
+	BUG_ON(data->blksz * data->blocks > host->mmc->max_req_size);
 	BUG_ON(data->blksz > host->mmc->max_blk_size);
 	BUG_ON(data->blocks > 65535);
 
@@ -4690,11 +4690,18 @@  int sdhci_setup_host(struct sdhci_host *host)
 
 	/*
 	 * Maximum number of sectors in one transfer. Limited by SDMA boundary
-	 * size (512KiB). Note some tuning modes impose a 4MiB limit, but this
-	 * is less anyway.
+	 * size and by tuning modes on ADMA. On tuning mode 3 16MiB is more than
+	 * enough to cover big data transfers.
 	 */
-	mmc->max_req_size = 524288;
-
+	if (host->flags & SDHCI_USE_ADMA) {
+		if (host->tuning_mode != SDHCI_TUNING_MODE_3)
+			mmc->max_req_size = SZ_4M;
+		else
+			mmc->max_req_size = SZ_16M;
+	} else {
+		/* On PIO/SDMA use SDMA boundary size (512KiB). */
+		mmc->max_req_size = SZ_512K;
+	}
 	/*
 	 * Maximum number of segments. Depends on if the hardware
 	 * can do scatter/gather or not.
diff --git a/drivers/mmc/host/sdhci.h b/drivers/mmc/host/sdhci.h
index a20864fc0641..98252c427feb 100644
--- a/drivers/mmc/host/sdhci.h
+++ b/drivers/mmc/host/sdhci.h
@@ -346,11 +346,11 @@  struct sdhci_adma2_64_desc {
 #define ADMA2_END		0x2
 
 /*
- * Maximum segments assuming a 512KiB maximum requisition size and a minimum
+ * Maximum segments assuming a 16MiB maximum requisition size and a minimum
  * 4KiB page size. Note this also allows enough for multiple descriptors in
  * case of PAGE_SIZE >= 64KiB.
  */
-#define SDHCI_MAX_SEGS		128
+#define SDHCI_MAX_SEGS		4096
 
 /* Allow for a command request and a data request at the same time */
 #define SDHCI_MAX_MRQS		2