
[v1,0/7] Add Tegra QSPI driver

Message ID 1606857168-5839-1-git-send-email-skomatineni@nvidia.com

Message

Sowjanya Komatineni Dec. 1, 2020, 9:12 p.m. UTC
This series adds the Tegra210, Tegra186, and Tegra194 QSPI driver and
enables QSPI on Jetson Nano and Jetson Xavier NX.

The QSPI controller is available on Tegra210, Tegra186, and Tegra194.

Tegra186 and Tegra194 have an additional combined sequence mode, in which
the command, address, and data phases can all be sent in a single transfer.

Combined sequence mode is useful with DMA mode transfer.

This series does not include the combined sequence mode feature, as the
Tegra186/Tegra194 GPCDMA driver is not upstream yet.

This series includes
- dt-binding document
- QSPI driver for Tegra210/Tegra186/Tegra194
- QSPI enablement on Jetson Nano and Jetson Xavier NX

Sowjanya Komatineni (7):
  MAINTAINERS: Add Tegra QSPI driver section
  dt-bindings: spi: Add Tegra QSPI device tree binding
  spi: qspi-tegra: Add support for Tegra210 QSPI controller
  spi: qspi-tegra: Add Tegra186 and Tegra194 QSPI compatibles
  arm64: tegra: Enable QSPI on Jetson Nano
  arm64: tegra: Add QSPI nodes on Tegra194
  arm64: tegra: Enable QSPI on Jetson Xavier NX

 .../devicetree/bindings/spi/nvidia,tegra-qspi.yaml |   77 ++
 MAINTAINERS                                        |    8 +
 .../dts/nvidia/tegra194-p3509-0000+p3668-0000.dts  |   12 +
 arch/arm64/boot/dts/nvidia/tegra194.dtsi           |   26 +
 arch/arm64/boot/dts/nvidia/tegra210-p3450-0000.dts |   12 +
 drivers/spi/Kconfig                                |    9 +
 drivers/spi/Makefile                               |    1 +
 drivers/spi/qspi-tegra.c                           | 1420 ++++++++++++++++++++
 8 files changed, 1565 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/spi/nvidia,tegra-qspi.yaml
 create mode 100644 drivers/spi/qspi-tegra.c

Comments

Mark Brown Dec. 2, 2020, 5:27 p.m. UTC | #1
On Tue, Dec 01, 2020 at 01:12:44PM -0800, Sowjanya Komatineni wrote:
> Tegra SoC has a Quad SPI controller starting from Tegra210.

> 

> This patch adds support for Tegra210 QSPI controller.


This looks pretty clean but I've got a few questions below about how
this integrates with the frameworks as well as some more minor issues.

> +config QSPI_TEGRA

> +	tristate "Nvidia Tegra QSPI Controller"


Everything else in this file is SPI_, even the qspi controllers.

> +++ b/drivers/spi/qspi-tegra.c

> @@ -0,0 +1,1418 @@

> +// SPDX-License-Identifier: GPL-2.0-only

> +/*

> + * Copyright (C) 2020 NVIDIA CORPORATION.  All rights reserved.

> + */


Please make the entire comment a C++ one.  It also appears that the "All
rights reserved" here conflicts with the GPL-2.0-only SPDX statement...

> +static void

> +tegra_qspi_copy_client_txbuf_to_qspi_txbuf(struct tegra_qspi_data *tqspi,

> +					   struct spi_transfer *t)

> +{

> +	/* Make the dma buffer to read by cpu */

> +	dma_sync_single_for_cpu(tqspi->dev, tqspi->tx_dma_phys,

> +				tqspi->dma_buf_size, DMA_TO_DEVICE);

> +

> +	if (tqspi->is_packed) {

> +		unsigned int len = tqspi->curr_dma_words *

> +				   tqspi->bytes_per_word;

> +

> +		memcpy(tqspi->tx_dma_buf, t->tx_buf + tqspi->cur_pos, len);

> +		tqspi->cur_tx_pos += tqspi->curr_dma_words *

> +				     tqspi->bytes_per_word;


It seems weird that this device needs us to do a memcpy() to do DMA,
most devices are able to DMA directly from the buffers provided by the
SPI API (and let the SPI core sync things).  What is going on here?

> +	tegra_qspi_writel(tqspi, status, QSPI_FIFO_STATUS);

> +	while ((status & QSPI_FIFO_EMPTY) != QSPI_FIFO_EMPTY) {

> +		status = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);

> +		if (time_after(jiffies, timeout)) {

> +			dev_err(tqspi->dev,

> +				"timeout waiting for fifo flush\n");

> +			return -EIO;

> +		}

> +

> +		udelay(1);

> +	}


It'd be good to put a cpu_relax() in the busy loop.
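
For illustration, a sketch of the flush loop with cpu_relax() added,
reusing the names from the quoted driver code (a sketch only, not the
posted implementation):

    while ((status & QSPI_FIFO_EMPTY) != QSPI_FIFO_EMPTY) {
            status = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
            if (time_after(jiffies, timeout)) {
                    dev_err(tqspi->dev,
                            "timeout waiting for fifo flush\n");
                    return -EIO;
            }
            /* hint to the CPU that we are spinning on hardware state */
            cpu_relax();
            udelay(1);
    }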

> +static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi,

> +					 struct spi_transfer *t,

> +					 bool is_first_of_msg)

> +{


> +		/* toggle cs to active state */

> +		if (spi->mode & SPI_CS_HIGH)

> +			command1 |= QSPI_CS_SW_VAL;

> +		else

> +			command1 &= ~QSPI_CS_SW_VAL;

> +		tegra_qspi_writel(tqspi, command1, QSPI_COMMAND1);


This is worrying, the client device might be confused if /CS is doing
things outside of the standard handling.

> +	of_property_read_u32(slave_np, "nvidia,tx-clk-tap-delay",

> +			     &cdata->tx_clk_tap_delay);

> +	of_property_read_u32(slave_np, "nvidia,rx-clk-tap-delay",

> +			     &cdata->rx_clk_tap_delay);


These properties are not mentioned in the binding document.

> +static int tegra_qspi_setup(struct spi_device *spi)

> +{


> +	if (cdata && cdata->tx_clk_tap_delay)

> +		tx_tap = cdata->tx_clk_tap_delay;

> +	if (cdata && cdata->rx_clk_tap_delay)

> +		rx_tap = cdata->rx_clk_tap_delay;

> +	tqspi->def_command2_reg = QSPI_TX_TAP_DELAY(tx_tap) |

> +				  QSPI_RX_TAP_DELAY(rx_tap);

> +	tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);


The setup for one device shouldn't be able to affect the operation of
another, already running, device so either these need to be configured
as part of the controller probe or these configurations need to be
deferred until we're actually doing a transfer.
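
One possible shape of that deferral, as a sketch only (the client-data
struct name, the cached fields, and the helper are assumed here; only
tegra_qspi_setup(), cdata, and the QSPI_*_TAP_DELAY macros come from the
posted patch): cache the per-device tap delays in setup() and write
QSPI_COMMAND2 from the transfer path, so setup() never touches a register
another device may be using:

    static int tegra_qspi_setup(struct spi_device *spi)
    {
            struct tegra_qspi_data *tqspi = spi_master_get_devdata(spi->master);
            struct tegra_qspi_client_data *cdata = spi->controller_data;

            /* remember the values only; do not touch the hardware here */
            if (cdata) {
                    tqspi->tx_clk_tap_delay = cdata->tx_clk_tap_delay;
                    tqspi->rx_clk_tap_delay = cdata->rx_clk_tap_delay;
            }

            return 0;
    }

    /* hypothetical helper, called from the transfer path for the active device */
    static void tegra_qspi_program_tap_delays(struct tegra_qspi_data *tqspi)
    {
            u32 command2 = QSPI_TX_TAP_DELAY(tqspi->tx_clk_tap_delay) |
                           QSPI_RX_TAP_DELAY(tqspi->rx_clk_tap_delay);

            tegra_qspi_writel(tqspi, command2, QSPI_COMMAND2);
    }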

> +	/*

> +	 * Tegra QSPI hardware support dummy bytes transfer based on the

> +	 * programmed dummy clock cyles in QSPI register.

> +	 * So, get the total dummy bytes from the dummy bytes transfer in

> +	 * spi_messages and convert to dummy clock cyles.

> +	 */

> +	list_for_each_entry(xfer, &msg->transfers, transfer_list) {

> +		if (ntransfers == DUMMY_BYTES_XFER &&

> +		    !(list_is_last(&xfer->transfer_list, &msg->transfers)))

> +			dummy_cycles = xfer->len * 8 / xfer->tx_nbits;

> +		ntransfers++;

> +	}


This seems weird, there's some hard coded assumption about particular
patterns that the client device is going to send.  What's going on here?
I don't really understand what this is trying to do.

> +static irqreturn_t tegra_qspi_isr(int irq, void *context_data)

> +{

> +	struct tegra_qspi_data *tqspi = context_data;

> +

> +	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);

> +	if (tqspi->cur_direction & DATA_DIR_TX)

> +		tqspi->tx_status = tqspi->status_reg &

> +				   (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);

> +

> +	if (tqspi->cur_direction & DATA_DIR_RX)

> +		tqspi->rx_status = tqspi->status_reg &

> +				   (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);

> +	tegra_qspi_mask_clear_irq(tqspi);

> +

> +	return IRQ_WAKE_THREAD;

> +}


It's a bit unclear to me the value we gain from having this handler - if
we don't specify a handler genirq will already mask the interrupt until
we get to the thread anyway and we could just read the status in the
threaded handler.  OTOH it doesn't do any harm, just struck me as a bit
odd.
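
For comparison, a sketch of the alternative described above, with no hard
handler so that genirq keeps the line masked until the thread runs (the
thread function name and the irq field are assumed, not taken from the
posted patch):

    ret = devm_request_threaded_irq(&pdev->dev, tqspi->irq,
                                    NULL,                  /* no hard IRQ handler */
                                    tegra_qspi_isr_thread, /* reads QSPI_FIFO_STATUS itself */
                                    IRQF_ONESHOT,          /* IRQ stays masked until the thread returns */
                                    dev_name(&pdev->dev), tqspi);
    if (ret < 0)
            return ret;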

> +	master = spi_alloc_master(&pdev->dev, sizeof(*tqspi));

> +	if (!master) {

> +		dev_err(&pdev->dev, "master allocation failed\n");

> +		return -ENOMEM;

> +	}


Please switch to using the devm_ version of the API to allocate
controller, it makes things much more robust.
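
A sketch of the devm_ variant; devm_spi_alloc_master() ties the
controller's lifetime to the platform device, so the error paths no longer
need an explicit spi_master_put():

    master = devm_spi_alloc_master(&pdev->dev, sizeof(*tqspi));
    if (!master)
            return -ENOMEM;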

> +	if (of_property_read_u32(pdev->dev.of_node, "spi-max-frequency",

> +				 &master->max_speed_hz))

> +		master->max_speed_hz = QSPI_MAX_SPEED;


The core will do this for you.
Sowjanya Komatineni Dec. 2, 2020, 7:17 p.m. UTC | #2
On 12/2/20 9:27 AM, Mark Brown wrote:
> On Tue, Dec 01, 2020 at 01:12:44PM -0800, Sowjanya Komatineni wrote:

>> Tegra SoC has a Quad SPI controller starting from Tegra210.

>>

>> This patch adds support for Tegra210 QSPI controller.

> This looks pretty clean but I've got a few questions below about how

> this integrates with the frameworks as well as some more minor issues.

>

>> +config QSPI_TEGRA

>> +	tristate "Nvidia Tegra QSPI Controller"

> Everything else in this file is SPI_, even the qspi controllers.

Will rename in v2
>> +++ b/drivers/spi/qspi-tegra.c

>> @@ -0,0 +1,1418 @@

>> +// SPDX-License-Identifier: GPL-2.0-only

>> +/*

>> + * Copyright (C) 2020 NVIDIA CORPORATION.  All rights reserved.

>> + */

> Please make the entire comment a C++ one.  It also appears that the "All

> rights reserved" here conflicts with the GPL-2.0-only SPDX statement...

Will fix in v2
>

>> +static void

>> +tegra_qspi_copy_client_txbuf_to_qspi_txbuf(struct tegra_qspi_data *tqspi,

>> +					   struct spi_transfer *t)

>> +{

>> +	/* Make the dma buffer to read by cpu */

>> +	dma_sync_single_for_cpu(tqspi->dev, tqspi->tx_dma_phys,

>> +				tqspi->dma_buf_size, DMA_TO_DEVICE);

>> +

>> +	if (tqspi->is_packed) {

>> +		unsigned int len = tqspi->curr_dma_words *

>> +				   tqspi->bytes_per_word;

>> +

>> +		memcpy(tqspi->tx_dma_buf, t->tx_buf + tqspi->cur_pos, len);

>> +		tqspi->cur_tx_pos += tqspi->curr_dma_words *

>> +				     tqspi->bytes_per_word;

> It seems weird that this device needs us to do a memcpy() to do DMA,

> most devices are able to DMA directly from the buffers provided by the

> SPI API (and let the SPI core sync things).  What is going on here?


For transfers larger than the maximum DMA transfer limit, the data transfer
happens in multiple iterations, with each iteration transferring up to the
maximum DMA transfer limit.

So the driver uses separate DMA buffers and, on every iteration, copies
between them and the SPI-core-provided tx/rx buffers.

The data transfer logic in this driver is similar to the Tegra SPI driver,
apart from register changes and some QSPI-specific register programming.

>

>> +	tegra_qspi_writel(tqspi, status, QSPI_FIFO_STATUS);

>> +	while ((status & QSPI_FIFO_EMPTY) != QSPI_FIFO_EMPTY) {

>> +		status = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);

>> +		if (time_after(jiffies, timeout)) {

>> +			dev_err(tqspi->dev,

>> +				"timeout waiting for fifo flush\n");

>> +			return -EIO;

>> +		}

>> +

>> +		udelay(1);

>> +	}

> It'd be good to put a cpu_relax() in the busy loop.

Will update in v2.
>

>> +static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi,

>> +					 struct spi_transfer *t,

>> +					 bool is_first_of_msg)

>> +{

>> +		/* toggle cs to active state */

>> +		if (spi->mode & SPI_CS_HIGH)

>> +			command1 |= QSPI_CS_SW_VAL;

>> +		else

>> +			command1 &= ~QSPI_CS_SW_VAL;

>> +		tegra_qspi_writel(tqspi, command1, QSPI_COMMAND1);

> This is worrying, the client device might be confused if /CS is doing

> things outside of the standard handling.


Do you mean to honor spi_transfer cs_change flag?

Tegra QSPI is a master and is used only with QSPI flash devices. Looking
at the SPI NOR driver, I see QSPI flash commands are executed with one flash
command per spi_message, and I don't see cs_change flag usage w.r.t. QSPI
flash. So, using SW-based CS control for QSPI.

Please correct me if I'm missing something here.

Also, Tegra186 and later QSPI controllers support combined sequence mode,
where the command, address, and data phases can be combined in a single go.

This saves some cycles per transfer, and for that we need to use SW-based
CS control.


>> +	of_property_read_u32(slave_np, "nvidia,tx-clk-tap-delay",

>> +			     &cdata->tx_clk_tap_delay);

>> +	of_property_read_u32(slave_np, "nvidia,rx-clk-tap-delay",

>> +			     &cdata->rx_clk_tap_delay);

> These properties are not mentioned in the binding document.

Thanks Mark. Missed them. Will add in v2.
>

>> +static int tegra_qspi_setup(struct spi_device *spi)

>> +{

>> +	if (cdata && cdata->tx_clk_tap_delay)

>> +		tx_tap = cdata->tx_clk_tap_delay;

>> +	if (cdata && cdata->rx_clk_tap_delay)

>> +		rx_tap = cdata->rx_clk_tap_delay;

>> +	tqspi->def_command2_reg = QSPI_TX_TAP_DELAY(tx_tap) |

>> +				  QSPI_RX_TAP_DELAY(rx_tap);

>> +	tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);

> The setup for one device shouldn't be able to affect the operation of

> another, already running, device so either these need to be configured

> as part of the controller probe or these configurations need to be

> deferred until we're actually doing a transfer.

We will only have 1 device on QSPI as we only support single chip select.
>

>> +	/*

>> +	 * Tegra QSPI hardware support dummy bytes transfer based on the

>> +	 * programmed dummy clock cyles in QSPI register.

>> +	 * So, get the total dummy bytes from the dummy bytes transfer in

>> +	 * spi_messages and convert to dummy clock cyles.

>> +	 */

>> +	list_for_each_entry(xfer, &msg->transfers, transfer_list) {

>> +		if (ntransfers == DUMMY_BYTES_XFER &&

>> +		    !(list_is_last(&xfer->transfer_list, &msg->transfers)))

>> +			dummy_cycles = xfer->len * 8 / xfer->tx_nbits;

>> +		ntransfers++;

>> +	}

> This seems weird, there's some hard coded assumption about particular

> patterns that the client device is going to send.  What's going on here?

> I don't really understand what this is trying to do.


QSPI flash needs dummy cycles for data read operations; these make up the
initial read latency, and the number of dummy cycles required is vendor
specific.

The SPI NOR driver derives the required dummy cycles from the mode clock
cycles and the wait state clock cycles.

During read operations, spi_nor_spimem_read_data() converts dummy cycles
to a number of dummy bytes.

The Tegra QSPI controller has a dummy clock cycles register; when it is
programmed, the QSPI controller sends the dummy bytes itself rather than
SW spending extra cycles transferring them.

The above code converts these dummy bytes back to dummy clock cycles to
program into the QSPI register, avoiding a manual SW transfer of the dummy
bytes.

>

>> +static irqreturn_t tegra_qspi_isr(int irq, void *context_data)

>> +{

>> +	struct tegra_qspi_data *tqspi = context_data;

>> +

>> +	tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);

>> +	if (tqspi->cur_direction & DATA_DIR_TX)

>> +		tqspi->tx_status = tqspi->status_reg &

>> +				   (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);

>> +

>> +	if (tqspi->cur_direction & DATA_DIR_RX)

>> +		tqspi->rx_status = tqspi->status_reg &

>> +				   (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);

>> +	tegra_qspi_mask_clear_irq(tqspi);

>> +

>> +	return IRQ_WAKE_THREAD;

>> +}

> It's a bit unclear to me the value we gain from having this handler - if

> we don't specify a handler genirq will already mask the interrupt until

> we get to the thread anyway and we could just read the status in the

> threaded handler.  OTOH it doesn't do any harm, just struck me as a bit

> odd.


I started the QSPI driver from the Tegra SPI driver, as the data transfer
and interrupt handling are similar.

So I kept this handler for clearing the status registers and masking
interrupts, as I did not see anything wrong with it.

>

>> +	master = spi_alloc_master(&pdev->dev, sizeof(*tqspi));

>> +	if (!master) {

>> +		dev_err(&pdev->dev, "master allocation failed\n");

>> +		return -ENOMEM;

>> +	}

> Please switch to using the devm_ version of the API to allocate

> controller, it makes things much more robust.

Will update in v2
>

>> +	if (of_property_read_u32(pdev->dev.of_node, "spi-max-frequency",

>> +				 &master->max_speed_hz))

>> +		master->max_speed_hz = QSPI_MAX_SPEED;

> The core will do this for you.


Will remove this in v2.

Thanks

Sowjanya
Sowjanya Komatineni Dec. 4, 2020, 12:22 a.m. UTC | #3
On 12/2/20 11:17 AM, Sowjanya Komatineni wrote:
>
> On 12/2/20 9:27 AM, Mark Brown wrote:
>> On Tue, Dec 01, 2020 at 01:12:44PM -0800, Sowjanya Komatineni wrote:
>>> Tegra SoC has a Quad SPI controller starting from Tegra210.
>>>
>>> This patch adds support for Tegra210 QSPI controller.
>> This looks pretty clean but I've got a few questions below about how
>> this integrates with the frameworks as well as some more minor issues.
>>
>>> +config QSPI_TEGRA
>>> +    tristate "Nvidia Tegra QSPI Controller"
>> Everything else in this file is SPI_, even the qspi controllers.
> Will rename in v2
>>> +++ b/drivers/spi/qspi-tegra.c
>>> @@ -0,0 +1,1418 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Copyright (C) 2020 NVIDIA CORPORATION.  All rights reserved.
>>> + */
>> Please make the entire comment a C++ one.  It also appears that the "All
>> rights reserved" here conflicts with the GPL-2.0-only SPDX statement...
> Will fix in v2
>>
>>> +static void
>>> +tegra_qspi_copy_client_txbuf_to_qspi_txbuf(struct tegra_qspi_data 
>>> *tqspi,
>>> +                       struct spi_transfer *t)
>>> +{
>>> +    /* Make the dma buffer to read by cpu */
>>> +    dma_sync_single_for_cpu(tqspi->dev, tqspi->tx_dma_phys,
>>> +                tqspi->dma_buf_size, DMA_TO_DEVICE);
>>> +
>>> +    if (tqspi->is_packed) {
>>> +        unsigned int len = tqspi->curr_dma_words *
>>> +                   tqspi->bytes_per_word;
>>> +
>>> +        memcpy(tqspi->tx_dma_buf, t->tx_buf + tqspi->cur_pos, len);
>>> +        tqspi->cur_tx_pos += tqspi->curr_dma_words *
>>> +                     tqspi->bytes_per_word;
>> It seems weird that this device needs us to do a memcpy() to do DMA,
>> most devices are able to DMA directly from the buffers provided by the
>> SPI API (and let the SPI core sync things).  What is going on here?
>
> For transfers of size more than max DMA transfer limit, data transfer 
> happens in multiple iterations with each iteration transferring up to 
> max DMA transfer limit.
>
> So using separate dma buffers and on every iteration copying them to 
> SPI core provided tx/rx buffers.
Also, unpacked mode needs to manually put the bytes together from the read
data into the SPI core rx buffer.
>
> Transferring data logic in this driver is similar as Tegra SPI driver 
> except register changes and some QSPI specific register programming.
>
>>
>>> +    tegra_qspi_writel(tqspi, status, QSPI_FIFO_STATUS);
>>> +    while ((status & QSPI_FIFO_EMPTY) != QSPI_FIFO_EMPTY) {
>>> +        status = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
>>> +        if (time_after(jiffies, timeout)) {
>>> +            dev_err(tqspi->dev,
>>> +                "timeout waiting for fifo flush\n");
>>> +            return -EIO;
>>> +        }
>>> +
>>> +        udelay(1);
>>> +    }
>> It'd be good to put a cpu_relax() in the busy loop.
> Will update in v2.
>>
>>> +static u32 tegra_qspi_setup_transfer_one(struct spi_device *spi,
>>> +                     struct spi_transfer *t,
>>> +                     bool is_first_of_msg)
>>> +{
>>> +        /* toggle cs to active state */
>>> +        if (spi->mode & SPI_CS_HIGH)
>>> +            command1 |= QSPI_CS_SW_VAL;
>>> +        else
>>> +            command1 &= ~QSPI_CS_SW_VAL;
>>> +        tegra_qspi_writel(tqspi, command1, QSPI_COMMAND1);
>> This is worrying, the client device might be confused if /CS is doing
>> things outside of the standard handling.
>
> Do you mean to honor spi_transfer cs_change flag?
>
> Tegra QSPI is master and is used only with QSPI flash devices. Looking 
> at SPI NOR driver, I see QSPI Flash commands are executed with one 
> flash command per spi_message and I dont see cs_change flag usage 
> w.r.t QSPI flash. So, using SW based CS control for QSPI.
>
> Please correct me if I miss something to understand here.
>
> Also Tegra186 and later QSPI controller supports combined sequence 
> mode where command, address, data phases can be combined in a single GO.
>
> This saves some cycles in transfer and for this we need to use SW 
> based CS control only.
>
>
>>> +    of_property_read_u32(slave_np, "nvidia,tx-clk-tap-delay",
>>> +                 &cdata->tx_clk_tap_delay);
>>> +    of_property_read_u32(slave_np, "nvidia,rx-clk-tap-delay",
>>> +                 &cdata->rx_clk_tap_delay);
>> These properties are not mentioned in the binding document.
> Thanks Mark. Missed them. Will add in v2.
>>
>>> +static int tegra_qspi_setup(struct spi_device *spi)
>>> +{
>>> +    if (cdata && cdata->tx_clk_tap_delay)
>>> +        tx_tap = cdata->tx_clk_tap_delay;
>>> +    if (cdata && cdata->rx_clk_tap_delay)
>>> +        rx_tap = cdata->rx_clk_tap_delay;
>>> +    tqspi->def_command2_reg = QSPI_TX_TAP_DELAY(tx_tap) |
>>> +                  QSPI_RX_TAP_DELAY(rx_tap);
>>> +    tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);
>> The setup for one device shouldn't be able to affect the operation of
>> another, already running, device so either these need to be configured
>> as part of the controller probe or these configurations need to be
>> deferred until we're actually doing a transfer.
> We will only have 1 device on QSPI as we only support single chip select.
>>
>>> +    /*
>>> +     * Tegra QSPI hardware support dummy bytes transfer based on the
>>> +     * programmed dummy clock cyles in QSPI register.
>>> +     * So, get the total dummy bytes from the dummy bytes transfer in
>>> +     * spi_messages and convert to dummy clock cyles.
>>> +     */
>>> +    list_for_each_entry(xfer, &msg->transfers, transfer_list) {
>>> +        if (ntransfers == DUMMY_BYTES_XFER &&
>>> +            !(list_is_last(&xfer->transfer_list, &msg->transfers)))
>>> +            dummy_cycles = xfer->len * 8 / xfer->tx_nbits;
>>> +        ntransfers++;
>>> +    }
>> This seems weird, there's some hard coded assumption about particular
>> patterns that the client device is going to send.  What's going on here?
>> I don't really understand what this is trying to do.
>
> QSPI flash needs dummy cycles for data read operation which is 
> actually the initial read latency and no. of dummy cycles required are 
> vendor specific.
>
> SPI NOR driver gets required dummy cycles based on mode clock cycles 
> and wait state clock cycles.
>
> During read operations, spi_nor_spimem_read_data() converts dummy 
> cycles to number of dummy bytes.
>
> Tegra QSPI controller supports dummy clock cycles register and when 
> programmed QSPI controller sends dummy bytes rather than SW handling 
> extra cycles for transferring dummy bytes.
>
> Above equation converts this dummy bytes back to dummy clock cycles to 
> program into QSPI register and avoid manual SW transfer of dummy bytes.
>
>>
>>> +static irqreturn_t tegra_qspi_isr(int irq, void *context_data)
>>> +{
>>> +    struct tegra_qspi_data *tqspi = context_data;
>>> +
>>> +    tqspi->status_reg = tegra_qspi_readl(tqspi, QSPI_FIFO_STATUS);
>>> +    if (tqspi->cur_direction & DATA_DIR_TX)
>>> +        tqspi->tx_status = tqspi->status_reg &
>>> +                   (QSPI_TX_FIFO_UNF | QSPI_TX_FIFO_OVF);
>>> +
>>> +    if (tqspi->cur_direction & DATA_DIR_RX)
>>> +        tqspi->rx_status = tqspi->status_reg &
>>> +                   (QSPI_RX_FIFO_OVF | QSPI_RX_FIFO_UNF);
>>> +    tegra_qspi_mask_clear_irq(tqspi);
>>> +
>>> +    return IRQ_WAKE_THREAD;
>>> +}
>> It's a bit unclear to me the value we gain from having this handler - if
>> we don't specify a handler genirq will already mask the interrupt until
>> we get to the thread anyway and we could just read the status in the
>> threaded handler.  OTOH it doesn't do any harm, just struck me as a bit
>> odd.
>
> I started QSPI driver by taking SPI driver as data transfer and 
> interrupt handling are similar.
>
> So kept this handler for clearing status registers and masking 
> interrupts as I did not see anything wrong with this.
>
>>
>>> +    master = spi_alloc_master(&pdev->dev, sizeof(*tqspi));
>>> +    if (!master) {
>>> +        dev_err(&pdev->dev, "master allocation failed\n");
>>> +        return -ENOMEM;
>>> +    }
>> Please switch to using the devm_ version of the API to allocate
>> controller, it makes things much more robust.
> Will update in v2
>>
>>> +    if (of_property_read_u32(pdev->dev.of_node, "spi-max-frequency",
>>> +                 &master->max_speed_hz))
>>> +        master->max_speed_hz = QSPI_MAX_SPEED;
>> The core will do this for you.
>
> Will remove this in v2.
>
> Thanks
>
> Sowjanya
>
Thierry Reding Dec. 4, 2020, 12:17 p.m. UTC | #4
On Wed, Dec 02, 2020 at 05:27:21PM +0000, Mark Brown wrote:
> On Tue, Dec 01, 2020 at 01:12:44PM -0800, Sowjanya Komatineni wrote:

> > Tegra SoC has a Quad SPI controller starting from Tegra210.

> > 

> > This patch adds support for Tegra210 QSPI controller.

> 

> This looks pretty clean but I've got a few questions below about how

> this integrates with the frameworks as well as some more minor issues.

> 

> > +config QSPI_TEGRA

> > +	tristate "Nvidia Tegra QSPI Controller"

> 

> Everything else in this file is SPI_, even the qspi controllers.

> 

> > +++ b/drivers/spi/qspi-tegra.c

> > @@ -0,0 +1,1418 @@

> > +// SPDX-License-Identifier: GPL-2.0-only

> > +/*

> > + * Copyright (C) 2020 NVIDIA CORPORATION.  All rights reserved.

> > + */

> 

> Please make the entire comment a C++ one.  It also appears that the "All

> rights reserved" here conflicts with the GPL-2.0-only SPDX statement...


My understanding is that this phrase is pretty much irrelevant these
days. Furthermore, I don't think this is to be interpreted as a claim to
any rights other than the copyright, so it's mostly equivalent to the
"Copyright (C)" on that same line. Any license terms associated with the
file do still apply regardless.

That said, I'm not a lawyer, so don't take this as legal advice. FWIW,
there's something on the order of 8000 occurrences of that phrase in the
Linux kernel sources, so I think with or without the phrase we should be
okay.

Thierry
Thierry Reding Dec. 4, 2020, 12:26 p.m. UTC | #5
On Wed, Dec 02, 2020 at 11:17:18AM -0800, Sowjanya Komatineni wrote:
> On 12/2/20 9:27 AM, Mark Brown wrote:
> > On Tue, Dec 01, 2020 at 01:12:44PM -0800, Sowjanya Komatineni wrote:
[...]
> > > +static int tegra_qspi_setup(struct spi_device *spi)
> > > +{
> > > +	if (cdata && cdata->tx_clk_tap_delay)
> > > +		tx_tap = cdata->tx_clk_tap_delay;
> > > +	if (cdata && cdata->rx_clk_tap_delay)
> > > +		rx_tap = cdata->rx_clk_tap_delay;
> > > +	tqspi->def_command2_reg = QSPI_TX_TAP_DELAY(tx_tap) |
> > > +				  QSPI_RX_TAP_DELAY(rx_tap);
> > > +	tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);
> > The setup for one device shouldn't be able to affect the operation of
> > another, already running, device so either these need to be configured
> > as part of the controller probe or these configurations need to be
> > deferred until we're actually doing a transfer.
> We will only have 1 device on QSPI as we only support single chip select.

Even so we could make the driver operate as if there were multiple
devices. This has the advantage of setting a better example for someone
who might be reading this code as reference, and it might come in handy
if for whatever reason we ever end up with a QSPI controller that does
support multiple chip selects.

If that's overly complicated, maybe a compromise would be to document
very explicitly that this only works because Tegra QSPI supports a
single chip select?

Thierry
Mark Brown Dec. 4, 2020, 6:52 p.m. UTC | #6
On Thu, Dec 03, 2020 at 04:22:54PM -0800, Sowjanya Komatineni wrote:
> On 12/2/20 11:17 AM, Sowjanya Komatineni wrote:


> > > It seems weird that this device needs us to do a memcpy() to do DMA,

> > > most devices are able to DMA directly from the buffers provided by the

> > > SPI API (and let the SPI core sync things).  What is going on here?


> > For transfers of size more than max DMA transfer limit, data transfer

> > happens in multiple iterations with each iteration transferring up to

> > max DMA transfer limit.


> > So using separate dma buffers and on every iteration copying them to SPI

> > core provided tx/rx buffers.


I don't understand this - there's no restriction on where DMA transfers
can be done from within a DMA mapped region, the driver can do multiple
transfers from different chunks of the source buffer without having to
copy anything.  That's a very common scenario.
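
As a sketch of the point being made (names are illustrative only, assuming
t->tx_dma is the DMA address of the SPI-core-mapped tx buffer and
max_dma_len the controller's per-transfer limit): the driver can walk the
mapped buffer in max-sized chunks without copying anything:

    dma_addr_t src = t->tx_dma;
    unsigned int remain = t->len;

    while (remain) {
            unsigned int chunk = min_t(unsigned int, remain, tqspi->max_dma_len);

            /* hypothetical helper issuing one DMA transfer of up to the HW limit */
            tegra_qspi_start_tx_dma_chunk(tqspi, src, chunk);

            src += chunk;
            remain -= chunk;
    }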

> Also unpack mode needs to manually put the bytes together from read data to

> SPI core rx buffer.


Could you be more explicit here, I don't know what "unpack mode" is?

> > > This is worrying, the client device might be confused if /CS is doing

> > > things outside of the standard handling.


> > Do you mean to honor spi_transfer cs_change flag?


At least, yes - more generally if there's any feature to do with chip
select then the driver will need to open code it.  The driver should at
least be double checking that what it's doing matches what it was told
to do, though just letting this be handled by the generic code if
there's no limitation on the hardware tends to be easier all round.

> > Tegra QSPI is master and is used only with QSPI flash devices. Looking

> > at SPI NOR driver, I see QSPI Flash commands are executed with one flash

> > command per spi_message and I dont see cs_change flag usage w.r.t QSPI

> > flash. So, using SW based CS control for QSPI.


> > Please correct me if I miss something to understand here.


Someone might build a system that does something different, they may see
a spare SPI controller and decide they can use it for something else or
there may be some future change with the flash code which does something
different.

> > > > +    tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);


> > > The setup for one device shouldn't be able to affect the operation of

> > > another, already running, device so either these need to be configured

> > > as part of the controller probe or these configurations need to be

> > > deferred until we're actually doing a transfer.


> > We will only have 1 device on QSPI as we only support single chip select.


It's quite common for people to do things like add additional devices
with GPIO chip selects.

> > > > +    /*

> > > > +     * Tegra QSPI hardware support dummy bytes transfer based on the

> > > > +     * programmed dummy clock cyles in QSPI register.

> > > > +     * So, get the total dummy bytes from the dummy bytes transfer in

> > > > +     * spi_messages and convert to dummy clock cyles.

> > > > +     */

> > > > +    list_for_each_entry(xfer, &msg->transfers, transfer_list) {

> > > > +        if (ntransfers == DUMMY_BYTES_XFER &&

> > > > +            !(list_is_last(&xfer->transfer_list, &msg->transfers)))

> > > > +            dummy_cycles = xfer->len * 8 / xfer->tx_nbits;

> > > > +        ntransfers++;

> > > > +    }


> > > This seems weird, there's some hard coded assumption about particular

> > > patterns that the client device is going to send.  What's going on here?

> > > I don't really understand what this is trying to do.


> > QSPI flash needs dummy cycles for data read operation which is actually

> > the initial read latency and no. of dummy cycles required are vendor

> > specific.


> > SPI NOR driver gets required dummy cycles based on mode clock cycles and

> > wait state clock cycles.


> > During read operations, spi_nor_spimem_read_data() converts dummy cycles

> > to number of dummy bytes.


> > Tegra QSPI controller supports dummy clock cycles register and when

> > programmed QSPI controller sends dummy bytes rather than SW handling

> > extra cycles for transferring dummy bytes.


> > Above equation converts this dummy bytes back to dummy clock cycles to

> > program into QSPI register and avoid manual SW transfer of dummy bytes.


This is not a good idea, attempting to reverse engineer the message and
guess at the contents isn't going to be robust and if it's useful it
will if nothing else lead to a bunch of duplicated code in drivers as
every device that has this feature will need to reimplement it.  Instead
we should extend the framework so there's explicit support for
specifying transfers that are padding bytes, then there's no guesswork
that can go wrong and no duplicated code between drivers.  A flag in the
transfer struct might work?
Sowjanya Komatineni Dec. 4, 2020, 9:04 p.m. UTC | #7
On 12/4/20 10:52 AM, Mark Brown wrote:
> On Thu, Dec 03, 2020 at 04:22:54PM -0800, Sowjanya Komatineni wrote:
>> On 12/2/20 11:17 AM, Sowjanya Komatineni wrote:
>>>> It seems weird that this device needs us to do a memcpy() to do DMA,
>>>> most devices are able to DMA directly from the buffers provided by the
>>>> SPI API (and let the SPI core sync things).  What is going on here?
>>> For transfers of size more than max DMA transfer limit, data transfer
>>> happens in multiple iterations with each iteration transferring up to
>>> max DMA transfer limit.
>>> So using separate dma buffers and on every iteration copying them to SPI
>>> core provided tx/rx buffers.
> I don't understand this - there's no restriction on where DMA transfers
> can be done from within a DMA mapped region, the driver can do multiple
> transfers from different chunks of the source buffer without having to
> copy anything.  That's a very common scenario.
>
>> Also unpack mode needs to manually put the bytes together from read data to
>> SPI core rx buffer.
> Could you be more explicit here, I don't know what "unpack mode" is?

The Tegra SPI/QSPI controllers support packed mode and unpacked mode based
on the bits per word of a transfer.

Packed mode: when enabled, all 32 bits of data in the FIFO contain valid
data packets of 8-bit/16-bit/32-bit length.

Non-packed mode: for transfers of, for example, 24-bit data, we disable
packed mode; only 24 bits of each FIFO word are valid and the other bits
are 0's. So during TX when filling the FIFO, and during receive when the
FIFO data is read, SW needs to skip the invalid bits and align the byte
order from/to the SPI core tx/rx buffers.
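
As a sketch of what the non-packed-mode receive fixup amounts to
(illustrative only; rx_fifo_buf and words are assumed, not from the posted
code): each 32-bit FIFO word carries one sub-32-bit data word, and SW
copies only the valid bytes into the SPI core rx buffer:

    unsigned int bytes_per_word = DIV_ROUND_UP(t->bits_per_word, 8);
    u8 *rx = t->rx_buf;
    unsigned int i, j;

    for (i = 0; i < words; i++) {
            u32 fifo_word = rx_fifo_buf[i];   /* hypothetical buffer of raw FIFO words */

            /* keep only the valid low-order bytes, drop the zero padding */
            for (j = 0; j < bytes_per_word; j++)
                    *rx++ = (fifo_word >> (j * 8)) & 0xff;
    }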

>
>>>> This is worrying, the client device might be confused if /CS is doing
>>>> things outside of the standard handling.
>>> Do you mean to honor spi_transfer cs_change flag?
> At least, yes - more generally just if there's any feature to with chip
> select then the driver will need to open code it.  The driver should at
> least be double checking that what it's doing matches what it was told
> to do, though just letting this be handled by the generic code if
> there's no limitation on the hardware tends to be easier all round.
OK. Will honor the cs_change flag at the end of a transfer and program the
CS state based on that.
>
>>> Tegra QSPI is master and is used only with QSPI flash devices. Looking
>>> at SPI NOR driver, I see QSPI Flash commands are executed with one flash
>>> command per spi_message and I dont see cs_change flag usage w.r.t QSPI
>>> flash. So, using SW based CS control for QSPI.
>>> Please correct me if I miss something to understand here.
> Someone might build a system that does something different, they may see
> a spare SPI controller and decide they can use it for something else or
> there may be some future change with the flash code which does something
> different.
>
>>>>> +    tegra_qspi_writel(tqspi, tqspi->def_command2_reg, QSPI_COMMAND2);
>>>> The setup for one device shouldn't be able to affect the operation of
>>>> another, already running, device so either these need to be configured
>>>> as part of the controller probe or these configurations need to be
>>>> deferred until we're actually doing a transfer.
>>> We will only have 1 device on QSPI as we only support single chip select.
> It's quite common for people to do things like add additional devices
> with GPIO chip selects.
Will move the tap delay programming to happen during the SPI transfer.
>
>>>>> +    /*
>>>>> +     * Tegra QSPI hardware support dummy bytes transfer based on the
>>>>> +     * programmed dummy clock cyles in QSPI register.
>>>>> +     * So, get the total dummy bytes from the dummy bytes transfer in
>>>>> +     * spi_messages and convert to dummy clock cyles.
>>>>> +     */
>>>>> +    list_for_each_entry(xfer, &msg->transfers, transfer_list) {
>>>>> +        if (ntransfers == DUMMY_BYTES_XFER &&
>>>>> +            !(list_is_last(&xfer->transfer_list, &msg->transfers)))
>>>>> +            dummy_cycles = xfer->len * 8 / xfer->tx_nbits;
>>>>> +        ntransfers++;
>>>>> +    }
>>>> This seems weird, there's some hard coded assumption about particular
>>>> patterns that the client device is going to send.  What's going on here?
>>>> I don't really understand what this is trying to do.
>>> QSPI flash needs dummy cycles for data read operation which is actually
>>> the initial read latency and no. of dummy cycles required are vendor
>>> specific.
>>> SPI NOR driver gets required dummy cycles based on mode clock cycles and
>>> wait state clock cycles.
>>> During read operations, spi_nor_spimem_read_data() converts dummy cycles
>>> to number of dummy bytes.
>>> Tegra QSPI controller supports dummy clock cycles register and when
>>> programmed QSPI controller sends dummy bytes rather than SW handling
>>> extra cycles for transferring dummy bytes.
>>> Above equation converts this dummy bytes back to dummy clock cycles to
>>> program into QSPI register and avoid manual SW transfer of dummy bytes.
> This is not a good idea, attempting to reverse engineer the message and
> guess at the contents isn't going to be robust and if it's useful it
> will if nothing else lead to a bunch of duplicated code in drivers as
> every device that has this feature will need to reimplment it.  Instead
> we should extend the framework so there's explicit support for
> specifying transfers that are padding bytes, then there's no guesswork
> that can go wrong and no duplicated code between drivers.  A flag in the
> transfer struct might work?

As per the QSPI spec, the dummy bytes for the initial read latency are
always FF's, so it's not like guessing the contents.

The Tegra QSPI controller HW supports transferring the dummy bytes (sending
FF's after the address) based on the programmed dummy clock cycles.

To let the Tegra QSPI HW transfer the dummy bytes directly, the controller
driver needs the number of dummy bytes / the actual dummy clock cycles,
which the core driver gets from the flash SFDP read data.

So, we can add a flag to the transfer and, based on this flag, skip filling
the dummy bytes in spi_mem_exec_op() when the controller HW supports it,
but the controller driver still needs the dummy_cycles value. So probably,
along with the flag, do you agree to have dummy_cycles as part of the
transfer struct, which can be set to nor->read_dummy?
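
A sketch of that direction, purely to make the proposal concrete (the
dummy_cycles field and the helper are hypothetical, pending whatever the
SPI maintainers prefer): spi_mem_exec_op() would annotate the dummy-phase
transfer, and a controller that can clock dummy cycles in hardware would
consume the value instead of shifting out FF's:

    /* in spi_mem_exec_op(), when building the dummy-phase transfer */
    if (op->dummy.nbytes) {
            xfers[xferpos].len = op->dummy.nbytes;
            xfers[xferpos].tx_nbits = op->dummy.buswidth;
            /* hypothetical new spi_transfer field */
            xfers[xferpos].dummy_cycles = op->dummy.nbytes * 8 / op->dummy.buswidth;
            xferpos++;
    }

    /* in the controller driver's transfer path */
    if (xfer->dummy_cycles)
            tegra_qspi_program_dummy_cycles(tqspi, xfer->dummy_cycles); /* hypothetical helper */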


Thanks

Sowjanya
Mark Brown Dec. 4, 2020, 10:46 p.m. UTC | #8
On Fri, Dec 04, 2020 at 01:04:46PM -0800, Sowjanya Komatineni wrote:
> On 12/4/20 10:52 AM, Mark Brown wrote:
> > On Thu, Dec 03, 2020 at 04:22:54PM -0800, Sowjanya Komatineni wrote:

> > > Also unpack mode needs to manually put the bytes together from read data to
> > > SPI core rx buffer.
> > Could you be more explicit here, I don't know what "unpack mode" is?

> Tegra SPI/QSPI controller support packed mode and unpacked mode based on
> bits per word in a transfer.

> Packed Mode: When enabled, all 32-bits of data in FIFO contains valid data
> packets of 8-bit/16-bit/32-bit length.

> Non packed mode: For transfers like 24-bit data for example we disable
> packed mode and only 24-bits of FIFO data are valid and other bits are 0's.
> So during TX for FIFO filling and during receive when FIFO data is read, SW
> need to skip invalid bits and should align order from/to SPI core tx/rx
> buffers.

That's pretty surprising - is it really worth the overhead of using
non-packed mode compared to just doing the transfer in 8 bit mode?  In
any case it seems better to only do the memcpy() stuff in the cases
where it's actually required since it looks like fairly obvious overhead
otherwise, and the code could use some comments explaining why we're
doing this.  It may actually be that the implementation is already doing
the most sensible thing and it just needs more comments explaining why
that's the case.

> > This is not a good idea, attempting to reverse engineer the message and
> > guess at the contents isn't going to be robust and if it's useful it
> > will if nothing else lead to a bunch of duplicated code in drivers as
> > every device that has this feature will need to reimplment it.  Instead
> > we should extend the framework so there's explicit support for
> > specifying transfers that are padding bytes, then there's no guesswork
> > that can go wrong and no duplicated code between drivers.  A flag in the
> > transfer struct might work?

> As per QSPI spec, Dummy bytes for initial read latency are always FF's. So
> its not like guessing the contents.

The guesswork I was thinking of was deciding to do this rather than the
pattern being output - the bit where the driver figures out that the
intent of that transfer is to provide dummy bytes.

> Tegra QSPI controller HW supports transferring dummy bytes (sending FF's
> after address) based on dummy clock cycles programmed.

> To allow Tegra QSPI HW transfer dummy bytes directly, controller driver need
> number of dummy bytes / actual dummy clock cycles which core driver gets
> from flash sfdp read data.

Sure, the use case makes sense.

> So, we can add flag to transfer and based on this flag if controller HW
> supports then we can ignore filling dummy bytes in spi_mem_exec_op but
> controller driver still needs dummy_cycles value. So probably along with
> flag, do you agree to have dummy_cycle as part of transfer struct which can
> be set to nor->read_dummy value?

Yeah, or given that perhaps just skip the flag and do this by specifying
dummy_cycles.  Or if this is always a multiple of 8 (which I guess it
must be to do it using normal byte transfers) perhaps just have the flag
and use the existing length field to infer the number of cycles?  I've
not actually looked at the details at all so one or both of those
suggestions may not actually make sense, please take them with a grain
of salt.

I'd recommend doing this as a followup to introducing the base driver,
start off with the less efficient explicit writes and then add the API
and add the support in the driver - that way the new API can be
reviewed without it holding up the rest of the driver.
Sowjanya Komatineni Dec. 5, 2020, 4:10 a.m. UTC | #9
On 12/4/20 2:46 PM, Mark Brown wrote:
> On Fri, Dec 04, 2020 at 01:04:46PM -0800, Sowjanya Komatineni wrote:

>> On 12/4/20 10:52 AM, Mark Brown wrote:

>>> On Thu, Dec 03, 2020 at 04:22:54PM -0800, Sowjanya Komatineni wrote:

>>>> Also unpack mode needs to manually put the bytes together from read data to

>>>> SPI core rx buffer.

>>> Could you be more explicit here, I don't know what "unpack mode" is?

>> Tegra SPI/QSPI controller support packed mode and unpacked mode based on

>> bits per word in a transfer.

>> Packed Mode: When enabled, all 32-bits of data in FIFO contains valid data

>> packets of 8-bit/16-bit/32-bit length.

>> Non packed mode: For transfers like 24-bit data for example we disable

>> packed mode and only 24-bits of FIFO data are valid and other bits are 0's.

>> So during TX for FIFO filling and during receive when FIFO data is read, SW

>> need to skip invalid bits and should align order from/to SPI core tx/rx

>> buffers.

> That's pretty surprising - is it really worth the overhead of using

> non-packed mode compared to just doing the transfer in 8 bit mode?  In

> any case it seems better to only do the memcpy() stuff in the cases

> where it's actually required since it looks like fairly obvious overhead

> otherwise, and the code could use some comments explaining why we're

> doing this.  It may actually be that the implementation is already doing

> the most sensible thing and it just needs more comments explaining why

> that's the case.


I understand the overhead, but if any device-specific transfers use/need
24 bits per word, then without non-packed mode we would have to fail the
transfer.

Tegra HW has non-packed mode for such cases.

OK. Will use dma_map/unmap for packed-mode transfers, and for non-packed
mode will use a DMA buffer for the FIFO data and then fill the SPI core
rx_buf with the valid bytes from the DMA buffer contents.

Sure, will add comments for the non-packed mode logic.

>>> This is not a good idea, attempting to reverse engineer the message and

>>> guess at the contents isn't going to be robust and if it's useful it

>>> will if nothing else lead to a bunch of duplicated code in drivers as

>>> every device that has this feature will need to reimplment it.  Instead

>>> we should extend the framework so there's explicit support for

>>> specifying transfers that are padding bytes, then there's no guesswork

>>> that can go wrong and no duplicated code between drivers.  A flag in the

>>> transfer struct might work?

>> As per QSPI spec, Dummy bytes for initial read latency are always FF's. So

>> its not like guessing the contents.

> The guesswork I was thinking of was deciding to do this rather than the

> pattern being output - the bit where the driver figures out that the

> intent of that transfer is to provide dummy bytes.

>

>> Tegra QSPI controller HW supports transferring dummy bytes (sending FF's

>> after address) based on dummy clock cycles programmed.

>> To allow Tegra QSPI HW transfer dummy bytes directly, controller driver need

>> number of dummy bytes / actual dummy clock cycles which core driver gets

>> from flash sfdp read data.

> Sure, the use case makes sense.

>

>> So, we can add flag to transfer and based on this flag if controller HW

>> supports then we can ignore filling dummy bytes in spi_mem_exec_op but

>> controller driver still needs dummy_cycles value. So probably along with

>> flag, do you agree to have dummy_cycle as part of transfer struct which can

>> be set to nor->read_dummy value?

> Yeah, or given that perhaps just skip the flag and do this by specifying

> dummy_cycles.  Or if this is always a multiple of 8 (which I guess it

> must be to do it using normal byte transfers) perhaps just have the flag

> and use the existing length field to infer the number of cycles?  I've

> not actually looked at the details at all so one or both of those

> suggestions may not actually make sense, please take them with a grain

> of salt.

>

> I'd recommend doing this as a followup to introducing the base driver,

> start off with the less efficient explicit writes and then add the API

> and add the support in the driver - that way the new API can be

> reviewed without it holding up the rest of the driver.


OK, I think adding dummy_cycles to the transfer is enough, without a flag.

If dummy_cycles is 0, there is definitely no dummy bytes transfer.

So I will get rid of the code that detects the dummy bytes transfer phase
from the list of transfers.


Thanks Mark.


Regards,

Sowjanya