diff mbox series

[v4,5/9] spi: spi-mem: Mark dummy transfers by setting dummy_data bit

Message ID 1608236927-28701-6-git-send-email-skomatineni@nvidia.com
State Superseded
Headers show
Series Add Tegra Quad SPI driver | expand

Commit Message

Sowjanya Komatineni Dec. 17, 2020, 8:28 p.m. UTC
This patch marks dummy transfer by setting dummy_data bit to 1.

Controllers supporting dummy transfer by hardware use this bit field
to skip software transfer of dummy bytes and use hardware dummy bytes
transfer.

Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>
---
 drivers/spi/spi-mem.c   | 1 +
 include/linux/spi/spi.h | 2 ++
 2 files changed, 3 insertions(+)

Comments

Pratyush Yadav Dec. 18, 2020, 9:21 a.m. UTC | #1
Hi Sowjanya,

On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
> This patch marks dummy transfer by setting dummy_data bit to 1.
> 
> Controllers supporting dummy transfer by hardware use this bit field
> to skip software transfer of dummy bytes and use hardware dummy bytes
> transfer.

What is the benefit you get from this change? You add complexity in 
spi-mem and the controller driver, so that must come with some benefits. 
Here I don't see any. The transfer will certainly take the same amount 
of time because the number or period of the dummy cycles has not 
changed. So why is this needed?
 
> Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>
> ---
>  drivers/spi/spi-mem.c   | 1 +
>  include/linux/spi/spi.h | 2 ++
>  2 files changed, 3 insertions(+)
> 
> diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
> index f3a3f19..c64371c 100644
> --- a/drivers/spi/spi-mem.c
> +++ b/drivers/spi/spi-mem.c
> @@ -354,6 +354,7 @@ int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
>  		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;
>  		xfers[xferpos].len = op->dummy.nbytes;
>  		xfers[xferpos].tx_nbits = op->dummy.buswidth;
> +		xfers[xferpos].dummy_data = 1;
>  		spi_message_add_tail(&xfers[xferpos], &msg);
>  		xferpos++;
>  		totalxferlen += op->dummy.nbytes;
> diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> index aa09fdc..708f2f5 100644
> --- a/include/linux/spi/spi.h
> +++ b/include/linux/spi/spi.h
> @@ -827,6 +827,7 @@ extern void spi_res_release(struct spi_controller *ctlr,
>   *      transfer. If 0 the default (from @spi_device) is used.
>   * @bits_per_word: select a bits_per_word other than the device default
>   *      for this transfer. If 0 the default (from @spi_device) is used.
> + * @dummy_data: indicates transfer is dummy bytes transfer.
>   * @cs_change: affects chipselect after this transfer completes
>   * @cs_change_delay: delay between cs deassert and assert when
>   *      @cs_change is set and @spi_transfer is not the last in @spi_message
> @@ -939,6 +940,7 @@ struct spi_transfer {
>  	struct sg_table tx_sg;
>  	struct sg_table rx_sg;
>  
> +	unsigned	dummy_data:1;
>  	unsigned	cs_change:1;
>  	unsigned	tx_nbits:3;
>  	unsigned	rx_nbits:3;
> -- 
> 2.7.4
>
Boris Brezillon Dec. 18, 2020, 9:57 a.m. UTC | #2
On Fri, 18 Dec 2020 14:51:08 +0530
Pratyush Yadav <p.yadav@ti.com> wrote:

> Hi Sowjanya,
> 
> On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
> > This patch marks dummy transfer by setting dummy_data bit to 1.
> > 
> > Controllers supporting dummy transfer by hardware use this bit field
> > to skip software transfer of dummy bytes and use hardware dummy bytes
> > transfer.  
> 
> What is the benefit you get from this change? You add complexity in 
> spi-mem and the controller driver, so that must come with some benefits. 
> Here I don't see any. The transfer will certainly take the same amount 
> of time because the number or period of the dummy cycles has not 
> changed. So why is this needed?

Well, you don't have to queue TX bytes if you use HW-based dummy
cycles, but I agree, I'd expect the overhead to be negligible,
especially since we're talking about emitting a few bytes, not hundreds.
This being said, the complexity added to the core is reasonable IMHO,
so if it really helps reducing the CPU overhead (we might need some
numbers to prove that), I guess it's okay.

>  
> > Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>
> > ---
> >  drivers/spi/spi-mem.c   | 1 +
> >  include/linux/spi/spi.h | 2 ++
> >  2 files changed, 3 insertions(+)
> > 
> > diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
> > index f3a3f19..c64371c 100644
> > --- a/drivers/spi/spi-mem.c
> > +++ b/drivers/spi/spi-mem.c
> > @@ -354,6 +354,7 @@ int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
> >  		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;
> >  		xfers[xferpos].len = op->dummy.nbytes;
> >  		xfers[xferpos].tx_nbits = op->dummy.buswidth;
> > +		xfers[xferpos].dummy_data = 1;
> >  		spi_message_add_tail(&xfers[xferpos], &msg);
> >  		xferpos++;
> >  		totalxferlen += op->dummy.nbytes;
> > diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
> > index aa09fdc..708f2f5 100644
> > --- a/include/linux/spi/spi.h
> > +++ b/include/linux/spi/spi.h
> > @@ -827,6 +827,7 @@ extern void spi_res_release(struct spi_controller *ctlr,
> >   *      transfer. If 0 the default (from @spi_device) is used.
> >   * @bits_per_word: select a bits_per_word other than the device default
> >   *      for this transfer. If 0 the default (from @spi_device) is used.
> > + * @dummy_data: indicates transfer is dummy bytes transfer.
> >   * @cs_change: affects chipselect after this transfer completes
> >   * @cs_change_delay: delay between cs deassert and assert when
> >   *      @cs_change is set and @spi_transfer is not the last in @spi_message
> > @@ -939,6 +940,7 @@ struct spi_transfer {
> >  	struct sg_table tx_sg;
> >  	struct sg_table rx_sg;
> >  
> > +	unsigned	dummy_data:1;
> >  	unsigned	cs_change:1;
> >  	unsigned	tx_nbits:3;
> >  	unsigned	rx_nbits:3;
> > -- 
> > 2.7.4
> >   
>
Sowjanya Komatineni Dec. 18, 2020, 6:09 p.m. UTC | #3
On 12/18/20 1:57 AM, Boris Brezillon wrote:
> On Fri, 18 Dec 2020 14:51:08 +0530
> Pratyush Yadav <p.yadav@ti.com> wrote:
>
>> Hi Sowjanya,
>>
>> On 17/12/20 12:28PM, Sowjanya Komatineni wrote:
>>> This patch marks dummy transfer by setting dummy_data bit to 1.
>>>
>>> Controllers supporting dummy transfer by hardware use this bit field
>>> to skip software transfer of dummy bytes and use hardware dummy bytes
>>> transfer.
>> What is the benefit you get from this change? You add complexity in
>> spi-mem and the controller driver, so that must come with some benefits.
>> Here I don't see any. The transfer will certainly take the same amount
>> of time because the number or period of the dummy cycles has not
>> changed. So why is this needed?
> Well, you don't have to queue TX bytes if you use HW-based dummy
> cycles, but I agree, I'd expect the overhead to be negligible,
> especially since we're talking about emitting a few bytes, not hundreds.
> This being said, the complexity added to the core is reasonable IMHO,
> so if it really helps reducing the CPU overhead (we might need some
> numbers to prove that), I guess it's okay.

Hardware dummy cycles feature of Tegra QSPI is to save SW transfer cycle 
of dummy bytes by filling FIFO.

I don't have numbers as we always use hardware dummy cycles with Tegra QSPI.

>>   
>>> Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>
>>> ---
>>>   drivers/spi/spi-mem.c   | 1 +
>>>   include/linux/spi/spi.h | 2 ++
>>>   2 files changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
>>> index f3a3f19..c64371c 100644
>>> --- a/drivers/spi/spi-mem.c
>>> +++ b/drivers/spi/spi-mem.c
>>> @@ -354,6 +354,7 @@ int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
>>>   		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;
>>>   		xfers[xferpos].len = op->dummy.nbytes;
>>>   		xfers[xferpos].tx_nbits = op->dummy.buswidth;
>>> +		xfers[xferpos].dummy_data = 1;
>>>   		spi_message_add_tail(&xfers[xferpos], &msg);
>>>   		xferpos++;
>>>   		totalxferlen += op->dummy.nbytes;
>>> diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
>>> index aa09fdc..708f2f5 100644
>>> --- a/include/linux/spi/spi.h
>>> +++ b/include/linux/spi/spi.h
>>> @@ -827,6 +827,7 @@ extern void spi_res_release(struct spi_controller *ctlr,
>>>    *      transfer. If 0 the default (from @spi_device) is used.
>>>    * @bits_per_word: select a bits_per_word other than the device default
>>>    *      for this transfer. If 0 the default (from @spi_device) is used.
>>> + * @dummy_data: indicates transfer is dummy bytes transfer.
>>>    * @cs_change: affects chipselect after this transfer completes
>>>    * @cs_change_delay: delay between cs deassert and assert when
>>>    *      @cs_change is set and @spi_transfer is not the last in @spi_message
>>> @@ -939,6 +940,7 @@ struct spi_transfer {
>>>   	struct sg_table tx_sg;
>>>   	struct sg_table rx_sg;
>>>   
>>> +	unsigned	dummy_data:1;
>>>   	unsigned	cs_change:1;
>>>   	unsigned	tx_nbits:3;
>>>   	unsigned	rx_nbits:3;
>>> -- 
>>> 2.7.4
>>>
Pratyush Yadav Dec. 18, 2020, 7:19 p.m. UTC | #4
On 18/12/20 10:09AM, Sowjanya Komatineni wrote:
> 

> On 12/18/20 1:57 AM, Boris Brezillon wrote:

> > On Fri, 18 Dec 2020 14:51:08 +0530

> > Pratyush Yadav <p.yadav@ti.com> wrote:

> > 

> > > Hi Sowjanya,

> > > 

> > > On 17/12/20 12:28PM, Sowjanya Komatineni wrote:

> > > > This patch marks dummy transfer by setting dummy_data bit to 1.

> > > > 

> > > > Controllers supporting dummy transfer by hardware use this bit field

> > > > to skip software transfer of dummy bytes and use hardware dummy bytes

> > > > transfer.

> > > What is the benefit you get from this change? You add complexity in

> > > spi-mem and the controller driver, so that must come with some benefits.

> > > Here I don't see any. The transfer will certainly take the same amount

> > > of time because the number or period of the dummy cycles has not

> > > changed. So why is this needed?

> > Well, you don't have to queue TX bytes if you use HW-based dummy

> > cycles, but I agree, I'd expect the overhead to be negligible,

> > especially since we're talking about emitting a few bytes, not hundreds.

> > This being said, the complexity added to the core is reasonable IMHO,

> > so if it really helps reducing the CPU overhead (we might need some

> > numbers to prove that), I guess it's okay.

> 

> Hardware dummy cycles feature of Tegra QSPI is to save SW transfer cycle of

> dummy bytes by filling FIFO.

> 

> I don't have numbers as we always use hardware dummy cycles with Tegra QSPI.


Most flashes use somewhere around 8-10 dummy cycles. Some of the faster 
ones working at 166 MHz or above use 20-25. From what I understand, the 
only time hardware transfer of dummy cycles will reduce CPU load 
significantly will be when multiple small transfers are made in quick 
succession. Unless you have such a usecase I don't know how much benefit 
you will get by it.

Anyway, if the SPI maintainers think this is worth it, I won't object.
 
> > > > Signed-off-by: Sowjanya Komatineni <skomatineni@nvidia.com>

> > > > ---

> > > >   drivers/spi/spi-mem.c   | 1 +

> > > >   include/linux/spi/spi.h | 2 ++

> > > >   2 files changed, 3 insertions(+)

> > > > 

> > > > diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c

> > > > index f3a3f19..c64371c 100644

> > > > --- a/drivers/spi/spi-mem.c

> > > > +++ b/drivers/spi/spi-mem.c

> > > > @@ -354,6 +354,7 @@ int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)

> > > >   		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;

> > > >   		xfers[xferpos].len = op->dummy.nbytes;

> > > >   		xfers[xferpos].tx_nbits = op->dummy.buswidth;

> > > > +		xfers[xferpos].dummy_data = 1;

> > > >   		spi_message_add_tail(&xfers[xferpos], &msg);

> > > >   		xferpos++;

> > > >   		totalxferlen += op->dummy.nbytes;

> > > > diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h

> > > > index aa09fdc..708f2f5 100644

> > > > --- a/include/linux/spi/spi.h

> > > > +++ b/include/linux/spi/spi.h

> > > > @@ -827,6 +827,7 @@ extern void spi_res_release(struct spi_controller *ctlr,

> > > >    *      transfer. If 0 the default (from @spi_device) is used.

> > > >    * @bits_per_word: select a bits_per_word other than the device default

> > > >    *      for this transfer. If 0 the default (from @spi_device) is used.

> > > > + * @dummy_data: indicates transfer is dummy bytes transfer.

> > > >    * @cs_change: affects chipselect after this transfer completes

> > > >    * @cs_change_delay: delay between cs deassert and assert when

> > > >    *      @cs_change is set and @spi_transfer is not the last in @spi_message

> > > > @@ -939,6 +940,7 @@ struct spi_transfer {

> > > >   	struct sg_table tx_sg;

> > > >   	struct sg_table rx_sg;

> > > > +	unsigned	dummy_data:1;

> > > >   	unsigned	cs_change:1;

> > > >   	unsigned	tx_nbits:3;

> > > >   	unsigned	rx_nbits:3;

> > > > -- 

> > > > 2.7.4


-- 
Regards,
Pratyush Yadav
Texas Instruments India
Mark Brown Dec. 18, 2020, 8:41 p.m. UTC | #5
On Sat, Dec 19, 2020 at 12:49:38AM +0530, Pratyush Yadav wrote:

> Anyway, if the SPI maintainers think this is worth it, I won't object.

This gets kind of circular, for me it's a question of if there's some
meaningful benefit from using the feature vs the cost to support it and
from the sounds of it we don't have numbers on the benefits from using
it at present.
Mark Brown Dec. 18, 2020, 8:44 p.m. UTC | #6
On Fri, Dec 18, 2020 at 08:41:02PM +0000, Mark Brown wrote:
> On Sat, Dec 19, 2020 at 12:49:38AM +0530, Pratyush Yadav wrote:


> > Anyway, if the SPI maintainers think this is worth it, I won't object.


> This gets kind of circular, for me it's a question of if there's some

> meaningful benefit from using the feature vs the cost to support it and

> from the sounds of it we don't have numbers on the benefits from using

> it at present.


...although I do have to say looking at the implementation that the cost
seems low, it's just a flag set on an existing transfer.  The only issue
is if we'd get more win from coalesing the entire transaction (or entire
transmit) into a single transfer that could be DMAed and/or requires
fewer trips through the stack which does make it seem like an unclear
tradeoff from the point of view of client drivers
Sowjanya Komatineni Dec. 18, 2020, 10:01 p.m. UTC | #7
On 12/18/20 12:44 PM, Mark Brown wrote:
> On Fri, Dec 18, 2020 at 08:41:02PM +0000, Mark Brown wrote:

>> On Sat, Dec 19, 2020 at 12:49:38AM +0530, Pratyush Yadav wrote:

>>> Anyway, if the SPI maintainers think this is worth it, I won't object.

>> This gets kind of circular, for me it's a question of if there's some

>> meaningful benefit from using the feature vs the cost to support it and

>> from the sounds of it we don't have numbers on the benefits from using

>> it at present.

> ...although I do have to say looking at the implementation that the cost

> seems low, it's just a flag set on an existing transfer.  The only issue

> is if we'd get more win from coalesing the entire transaction (or entire

> transmit) into a single transfer that could be DMAed and/or requires

> fewer trips through the stack which does make it seem like an unclear

> tradeoff from the point of view of client drivers


Using HW dummy cycles save extra software cycle of transfer which 
involves transfer setup register writes, writing dummy bytes to TX FIFO, 
interrupt processing.

Implementation wise it just a single bit field added to spi_transfer and 
on Tegra controller driver programming dummy cycles with prior transfer 
and skipping sw dummy transfer which is actually not complex.

 From quick check, I see HW dummy cycles transfer of 128KB shows 18 Mb/s 
while SW transfer of bytes shows 17.3 MB/s on average.

When back-to-back read commands are executed using HW dummy cycles will 
definitely save cycles.
Mark Brown Dec. 21, 2020, 4:50 p.m. UTC | #8
On Fri, Dec 18, 2020 at 02:01:56PM -0800, Sowjanya Komatineni wrote:

> From quick check, I see HW dummy cycles transfer of 128KB shows 18 Mb/s
> while SW transfer of bytes shows 17.3 MB/s on average.

OK, it's not going to revolutionize the world or anything but that's
definitely a speedup.
diff mbox series

Patch

diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
index f3a3f19..c64371c 100644
--- a/drivers/spi/spi-mem.c
+++ b/drivers/spi/spi-mem.c
@@ -354,6 +354,7 @@  int spi_mem_exec_op(struct spi_mem *mem, const struct spi_mem_op *op)
 		xfers[xferpos].tx_buf = tmpbuf + op->addr.nbytes + 1;
 		xfers[xferpos].len = op->dummy.nbytes;
 		xfers[xferpos].tx_nbits = op->dummy.buswidth;
+		xfers[xferpos].dummy_data = 1;
 		spi_message_add_tail(&xfers[xferpos], &msg);
 		xferpos++;
 		totalxferlen += op->dummy.nbytes;
diff --git a/include/linux/spi/spi.h b/include/linux/spi/spi.h
index aa09fdc..708f2f5 100644
--- a/include/linux/spi/spi.h
+++ b/include/linux/spi/spi.h
@@ -827,6 +827,7 @@  extern void spi_res_release(struct spi_controller *ctlr,
  *      transfer. If 0 the default (from @spi_device) is used.
  * @bits_per_word: select a bits_per_word other than the device default
  *      for this transfer. If 0 the default (from @spi_device) is used.
+ * @dummy_data: indicates transfer is dummy bytes transfer.
  * @cs_change: affects chipselect after this transfer completes
  * @cs_change_delay: delay between cs deassert and assert when
  *      @cs_change is set and @spi_transfer is not the last in @spi_message
@@ -939,6 +940,7 @@  struct spi_transfer {
 	struct sg_table tx_sg;
 	struct sg_table rx_sg;
 
+	unsigned	dummy_data:1;
 	unsigned	cs_change:1;
 	unsigned	tx_nbits:3;
 	unsigned	rx_nbits:3;