mbox series

[v3,0/3] spi: Add DMA mode support to spi-qcom-qspi

Message ID 1681481153-24036-1-git-send-email-quic_vnivarth@quicinc.com
Headers show
Series spi: Add DMA mode support to spi-qcom-qspi | expand

Message

Vijaya Krishna Nivarthi April 14, 2023, 2:05 p.m. UTC
There are large number of QSPI irqs that fire during boot/init and later
on every suspend/resume.
This could be made faster by doing DMA instead of PIO.
Below is comparison for number of interrupts raised in 2 acenarios...
Boot up and stabilise
Suspend/Resume

Sequence   PIO    DMA
=======================
Boot-up    69088  19284
S/R        5066   3430

Though we have not made measurements for speed, power we expect
the performance to be better with DMA mode and no regressions were
encountered in testing.

Vijaya Krishna Nivarthi (3):
  spi: dt-bindings: qcom,spi-qcom-qspi: Add iommus
  arm64: dts: qcom: sc7280: Add stream-id of qspi to iommus
  spi: spi-qcom-qspi: Add DMA mode support
---
v2 -> v3:
- Modified commit messages
- Made a change to driver based on re-review

v1 -> v2:
- Added documentation file to the series
- Made changes to driver based on HPG re-review
---
 .../bindings/spi/qcom,spi-qcom-qspi.yaml           |   3 +
 arch/arm64/boot/dts/qcom/sc7280.dtsi               |   1 +
 drivers/spi/spi-qcom-qspi.c                        | 434 +++++++++++++++++++--
 3 files changed, 407 insertions(+), 31 deletions(-)

Comments

Doug Anderson April 14, 2023, 3:48 p.m. UTC | #1
Hi,

On Fri, Apr 14, 2023 at 7:06 AM Vijaya Krishna Nivarthi
<quic_vnivarth@quicinc.com> wrote:
>
> There are large number of QSPI irqs that fire during boot/init and later
> on every suspend/resume.
> This could be made faster by doing DMA instead of PIO.
> Below is comparison for number of interrupts raised in 2 acenarios...

s/acenarios/scenarios

> Boot up and stabilise
> Suspend/Resume
>
> Sequence   PIO    DMA
> =======================
> Boot-up    69088  19284
> S/R        5066   3430
>
> Though we have not made measurements for speed, power we expect
> the performance to be better with DMA mode and no regressions were
> encountered in testing.

Measuring the speed isn't really very hard, so I gave it a shot.

I used a truly terrible python script to do this on a Chromebook:

--

import os
import time

os.system("""
stop ui
stop powerd

cd /sys/devices/system/cpu/cpufreq
for policy in policy*; do
  cat ${policy}/cpuinfo_max_freq > ${policy}/scaling_min_freq
done
""")

all_times = []
for i in range(1000):
  start = time.time()
  os.system("flashrom -p host -r /tmp/foo.bin")
  end = time.time()

  all_times.append(end - start)
  print("Iteration %d, min=%.2f, max=%.2f, avg=%.2f" % (
      i, min(all_times), max(all_times), sum(all_times) / len(all_times)))

--

The good news is that after applying your patches the loop runs _much_ faster.

The bad news is that it runs much faster because it very quickly fails
and errors out. flashrom just keeps reporting:

Opened /dev/mtd0 successfully
Found Programmer flash chip "Opaque flash chip" (8192 kB,
Programmer-specific) on host.
Reading flash... Cannot read 0x001000 bytes at 0x000000: Connection timed out
read_flash: failed to read (00000000..0x7fffff).
Read operation failed!
FAILED.
FAILED

I went back and tried v1, v2, and v3 and all three versions fail.
Vijaya Krishna Nivarthi April 14, 2023, 5:01 p.m. UTC | #2
On 4/14/2023 10:12 PM, Doug Anderson wrote:
> Hi,
>
> On Fri, Apr 14, 2023 at 8:48 AM Doug Anderson <dianders@chromium.org> wrote:
>> Hi,
>>
>> On Fri, Apr 14, 2023 at 7:06 AM Vijaya Krishna Nivarthi
>> <quic_vnivarth@quicinc.com> wrote:
>>> There are large number of QSPI irqs that fire during boot/init and later
>>> on every suspend/resume.
>>> This could be made faster by doing DMA instead of PIO.
>>> Below is comparison for number of interrupts raised in 2 acenarios...
>> s/acenarios/scenarios
>>
>>> Boot up and stabilise
>>> Suspend/Resume
>>>
>>> Sequence   PIO    DMA
>>> =======================
>>> Boot-up    69088  19284
>>> S/R        5066   3430
>>>
>>> Though we have not made measurements for speed, power we expect
>>> the performance to be better with DMA mode and no regressions were
>>> encountered in testing.
>> Measuring the speed isn't really very hard, so I gave it a shot.
>>
>> I used a truly terrible python script to do this on a Chromebook:
>>
>> --
>>
>> import os
>> import time
>>
>> os.system("""
>> stop ui
>> stop powerd
>>
>> cd /sys/devices/system/cpu/cpufreq
>> for policy in policy*; do
>>    cat ${policy}/cpuinfo_max_freq > ${policy}/scaling_min_freq
>> done
>> """)
>>
>> all_times = []
>> for i in range(1000):
>>    start = time.time()
>>    os.system("flashrom -p host -r /tmp/foo.bin")
>>    end = time.time()
>>
>>    all_times.append(end - start)
>>    print("Iteration %d, min=%.2f, max=%.2f, avg=%.2f" % (
>>        i, min(all_times), max(all_times), sum(all_times) / len(all_times)))
>>
>> --
>>
>> The good news is that after applying your patches the loop runs _much_ faster.
>>
>> The bad news is that it runs much faster because it very quickly fails
>> and errors out. flashrom just keeps reporting:
>>
>> Opened /dev/mtd0 successfully
>> Found Programmer flash chip "Opaque flash chip" (8192 kB,
>> Programmer-specific) on host.
>> Reading flash... Cannot read 0x001000 bytes at 0x000000: Connection timed out
>> read_flash: failed to read (00000000..0x7fffff).
>> Read operation failed!
>> FAILED.
>> FAILED
>>
>> I went back and tried v1, v2, and v3 and all three versions fail.
> Ah, I see what's likely the problem. Your patch series only adds the
> "iommus" for sc7280 but I'm testing on sc7180. That means:
>
> 1. You need to add the iommus to _all_ the boards that have qspi. That
> means sc7280, sc7180, and sdm845.
>
> 2. Ideally the code should still be made to work (it should fall back
> to PIO mode) if DMA isn't properly enabled. That would keep old device
> trees working, which we're supposed to do.


Thank you very much for the review, script, test and quick debug.
Will check same and update a v4.

-Vijay/


> -Doug
Krzysztof Kozlowski April 15, 2023, 8:53 a.m. UTC | #3
On 14/04/2023 16:05, Vijaya Krishna Nivarthi wrote:
> Add iommus binding for DMA mode support
> 
> Signed-off-by: Vijaya Krishna Nivarthi <quic_vnivarth@quicinc.com>
> ---

Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>

Best regards,
Krzysztof
Mark Brown April 17, 2023, 12:12 p.m. UTC | #4
On Fri, Apr 14, 2023 at 03:05:58PM -0700, Doug Anderson wrote:

> Having alignment requirements like this doesn't seem like it should be
> that unusual, though, and that's why it feels like the logic belongs
> in the SPI core. In fact, it seems like this is _supposed_ to be
> handled in the SPI core, but it isn't? In "spi.h" I see
> "dma_alignment" that claims to be exactly what you need. As far as I
> can tell, though, the core doesn't use this? ...so I'm kinda confused.
> As far as I can tell this doesn't do anything and thus anyone setting
> it today is broken?

SPI consumers should only be providing dmaable buffers.
Doug Anderson April 17, 2023, 2:07 p.m. UTC | #5
Hi,

On Mon, Apr 17, 2023 at 5:12 AM Mark Brown <broonie@kernel.org> wrote:
>
> On Fri, Apr 14, 2023 at 03:05:58PM -0700, Doug Anderson wrote:
>
> > Having alignment requirements like this doesn't seem like it should be
> > that unusual, though, and that's why it feels like the logic belongs
> > in the SPI core. In fact, it seems like this is _supposed_ to be
> > handled in the SPI core, but it isn't? In "spi.h" I see
> > "dma_alignment" that claims to be exactly what you need. As far as I
> > can tell, though, the core doesn't use this? ...so I'm kinda confused.
> > As far as I can tell this doesn't do anything and thus anyone setting
> > it today is broken?
>
> SPI consumers should only be providing dmaable buffers.

Ah, I think I see.

1. In "struct spi_transfer" the @tx_buf and @rx_buf are documented to
have "dma-safe memory".

2. On ARM64 anyway, I see "ARCH_DMA_MINALIGN" is 128.

So there is no reason to do any special rules to force alignment to
32-bytes because that's already guaranteed. Presumably that means you
can drop a whole pile of code and things will still work fine.

-Doug
Vijaya Krishna Nivarthi April 17, 2023, 3:57 p.m. UTC | #6
Hi,


On 4/17/2023 7:37 PM, Doug Anderson wrote:
> Hi,
>
> On Mon, Apr 17, 2023 at 5:12 AM Mark Brown <broonie@kernel.org> wrote:
>> On Fri, Apr 14, 2023 at 03:05:58PM -0700, Doug Anderson wrote:
>>
>>> Having alignment requirements like this doesn't seem like it should be
>>> that unusual, though, and that's why it feels like the logic belongs
>>> in the SPI core. In fact, it seems like this is _supposed_ to be
>>> handled in the SPI core, but it isn't? In "spi.h" I see
>>> "dma_alignment" that claims to be exactly what you need. As far as I
>>> can tell, though, the core doesn't use this? ...so I'm kinda confused.
>>> As far as I can tell this doesn't do anything and thus anyone setting
>>> it today is broken?
>> SPI consumers should only be providing dmaable buffers.
> Ah, I think I see.
>
> 1. In "struct spi_transfer" the @tx_buf and @rx_buf are documented to
> have "dma-safe memory".
>
> 2. On ARM64 anyway, I see "ARCH_DMA_MINALIGN" is 128.
>
> So there is no reason to do any special rules to force alignment to
> 32-bytes because that's already guaranteed. Presumably that means you
> can drop a whole pile of code and things will still work fine.
>
> -Doug


Thank you very much Mark and Doug for review and inputs.


spi_map_buf is taking into consideration max_dma_len (in spi.h) when it 
is set.

For example if set to 1024 instead of 65536(the actual max_dma_len of 
HW), 4 entries are created in the sg_table for a buffer of size 4096.

However, Like Doug pointed, dma_alignment is not being used by core.

Some drivers seem to be setting both of these in probe() but 
dma_alignment is probably not taking effect.

Is it up to the SPI consumers to read this and ensure they are providing 
dmaable buffers of required alignment?


The dma_addresses coming from core are aligned for larger sized buffers 
but for small ones like 1 and 3 bytes they are not aligned.

Hence if the code handing the alignment part is not present, smaller 
transfers fail.

For example, Below debug patch does

a) all transfers in DMA

b) prints DMA addresses and lengths

====== patch start ======

diff --git a/drivers/spi/spi-qcom-qspi.c b/drivers/spi/spi-qcom-qspi.c
index 60c4f554..8d24022 100644
--- a/drivers/spi/spi-qcom-qspi.c
+++ b/drivers/spi/spi-qcom-qspi.c
@@ -438,8 +438,14 @@ static int qcom_qspi_setup_dma_desc(struct 
qcom_qspi *ctrl,
                 return -EINVAL;
         }

-       for (i = 0; i < sgt->nents; i++)
+       for (i = 0; i < sgt->nents; i++) {
+               dma_addr_t temp_addr;
                 sg_total_len += sg_dma_len(sgt->sgl + i);
+
+               temp_addr = sg_dma_address(sgt->sgl + i);
+               dev_err_ratelimited(ctrl->dev, "%s pilli-20230417 i-%d, 
nents-%d, len-%d, dma_address-%pad\n",
+                               __func__, i, sgt->nents, 
sg_dma_len(sgt->sgl + i), &temp_addr);
+       }
         if (sg_total_len != xfer->len) {
                 dev_err(ctrl->dev, "Data lengths mismatch\n");
                 return -EINVAL;
@@ -517,6 +523,7 @@ static void qcom_qspi_dma_xfer(struct qcom_qspi *ctrl)
  static bool qcom_qspi_can_dma(struct spi_controller *ctlr,
                          struct spi_device *slv, struct spi_transfer *xfer)
  {
+       return true;
         return xfer->len > QSPI_MAX_BYTES_FIFO ? true : false;
  }

====== patch end =======

and below is sample outcome...

[   23.620397] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-1, dma_address-0x00000000fff3e000
[   23.638392] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-3, dma_address-0x00000000fff3f001
[   23.650238] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-1, dma_address-0x00000000fff40004
[   23.662039] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-8, dma_address-0x00000000fff44080
[   23.673965] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-1, dma_address-0x00000000fff44000
[   23.685749] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-3, dma_address-0x00000000fff40001
[   23.697630] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-1, dma_address-0x00000000fff3f004
[   23.709552] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-8, dma_address-0x00000000fff3e600
[   23.721460] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-1, dma_address-0x00000000fff3e000
[   23.733257] qcom_qspi 88dc000.spi: qcom_qspi_setup_dma_desc 
pilli-20230417 i-0, nents-1, len-3, dma_address-0x00000000fff3f001

If we use the dma_address from sg table as is, transfers fail.

I have not checked the spi-nor driver, but is it the consumer driver's 
job to ensure required alignment in all cases?

Can you Please comment?

Thank you,

-Vijay/
Mark Brown April 17, 2023, 4:39 p.m. UTC | #7
On Mon, Apr 17, 2023 at 09:27:16PM +0530, Vijaya Krishna Nivarthi wrote:

> However, Like Doug pointed, dma_alignment is not being used by core.

The core will use kmalloc() for any new buffers it allocates, this is
guaranteed to satisfy DMA constraints.

> Is it up to the SPI consumers to read this and ensure they are providing
> dmaable buffers of required alignment?

If they're doing anything fun for allocation.  Or they can just use
kmalloc() themselves.

> The dma_addresses coming from core are aligned for larger sized buffers but
> for small ones like 1 and 3 bytes they are not aligned.

In theory even buffers that small should be DMA safe, in practice that
rarely matters since it's vanishingly rare that it's sensible to DMA
such tiny buffers rather than PIOing them so drivers will tend to never
actually try to do so and I'd expect bugs.  It is likely worth checking
that DMA makes sense for this hardware.

> I have not checked the spi-nor driver, but is it the consumer driver's job
> to ensure required alignment in all cases?

Yes.