[v3,1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case

Message ID 20201217142842.v3.1.I99ee04f0cb823415df59bd4f550d6ff5756e43d6@changeid
State Superseded

Commit Message

Doug Anderson Dec. 17, 2020, 10:29 p.m. UTC
In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
variable") we changed handle_fifo_timeout() so that we set
"mas->cur_xfer" to NULL to make absolutely sure that we don't mess
with the buffers from the previous transfer in the timeout case.

Unfortunately, this caused the IRQ handler to dereference NULL in some
cases.  One case:

  CPU0                           CPU1
  ----                           ----
                                 setup_fifo_xfer()
                                  geni_se_setup_m_cmd()
                                 <hardware starts transfer>
                                 <transfer completes in hardware>
                                 <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq>
                                 ...
                                 handle_fifo_timeout()
                                  spin_lock_irq(mas->lock)
                                  mas->cur_xfer = NULL
                                  geni_se_cancel_m_cmd()
                                  spin_unlock_irq(mas->lock)

  geni_spi_isr()
   spin_lock(mas->lock)
   if (m_irq & M_RX_FIFO_WATERMARK_EN)
    geni_spi_handle_rx()
     mas->cur_xfer NULL dereference!

tl;dr: Seriously delayed interrupts for RX/TX can lead to timeout
handling setting mas->cur_xfer to NULL.

Let's check for the NULL transfer in the TX and RX cases and reset the
watermark or clear out the fifo respectively to put the hardware back
into a sane state.
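
As a rough user-space illustration of the two guards described above (not the driver code itself), the sketch below models the handlers with hypothetical mock types. The names mock_spi, mock_xfer, mock_handle_tx() and mock_handle_rx() are assumptions made for this example; the plain struct fields stand in for the hardware registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver state and the hardware registers. */
struct mock_xfer {
	uint32_t *rx_buf;             /* caller guarantees it is large enough */
};

struct mock_spi {
	struct mock_xfer *cur_xfer;   /* NULL once the timeout path has run */
	uint32_t tx_watermark;        /* models SE_GENI_TX_WATERMARK_REG */
	uint32_t rx_fifo[16];         /* models the RX FIFO contents */
	unsigned int rx_head;         /* index of the next word to pop */
	unsigned int rx_words;        /* words currently held in the FIFO */
};

/* Pop one word, as a readl() of the RX FIFO register would. */
static uint32_t mock_rx_pop(struct mock_spi *mas)
{
	assert(mas->rx_words > 0);
	mas->rx_words--;
	return mas->rx_fifo[mas->rx_head++];
}

/* Returns true if the TX path should keep running, false otherwise. */
static bool mock_handle_tx(struct mock_spi *mas)
{
	if (!mas->cur_xfer) {
		mas->tx_watermark = 0;  /* stop further watermark interrupts */
		return false;
	}
	/* A real handler would fill the TX FIFO here. */
	return true;
}

/* Returns the number of words stored into the current transfer's buffer. */
static unsigned int mock_handle_rx(struct mock_spi *mas, unsigned int words)
{
	unsigned int i;

	if (!mas->cur_xfer) {
		for (i = 0; i < words; i++)
			(void)mock_rx_pop(mas);  /* drain and discard */
		return 0;
	}

	for (i = 0; i < words; i++)
		mas->cur_xfer->rx_buf[i] = mock_rx_pop(mas);
	return words;
}
```

With cur_xfer set to NULL, mock_handle_tx() zeroes the watermark and mock_handle_rx() empties the FIFO without touching any buffer, which is the behaviour the two NULL checks aim for.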

NOTE: things still could get confused if we get timeouts all the way
through handle_fifo_timeout() and then start a new transfer because
interrupts from the old transfer / cancel / abort could still be
pending.  A future patch will help this corner case.

Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Stephen Boyd <swboyd@chromium.org>
---

Changes in v3:
- (ptr == NULL) => (!ptr), take 2.
- while loop => for loop

Changes in v2:
- (ptr == NULL) => (!ptr).
- Addressed loop nits in geni_spi_handle_rx().
- Commit message rewording from Stephen.

 drivers/spi/spi-geni-qcom.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Stephen Boyd Dec. 18, 2020, 2:54 a.m. UTC | #1
Quoting Douglas Anderson (2020-12-17 14:29:12)
> If we got a timeout when trying to send an abort command then it means
> that we just got 3 timeouts in a row:
> 
> 1. The original timeout that caused handle_fifo_timeout() to be
>    called.
> 2. A one second timeout waiting for the cancel command to finish.
> 3. A one second timeout waiting for the abort command to finish.
> 
> SPI is clocked by the controller, so nothing (aside from a hardware
> fault or a totally broken sequencer) should be causing the actual
> commands to fail in hardware.  However, even though the hardware
> itself is not expected to fail (and it'd be hard to predict how we
> should handle things if it did), it's easy to hit the timeout case by
> simply blocking our interrupt handler from running for a long period
> of time.  Obviously the system is in pretty bad shape if an interrupt
> handler is blocked for > 2 seconds, but there are certainly bugs (even
> bugs in other unrelated drivers) that can make this happen.
> 
> Let's make things a bit more robust against this case.  If we fail to
> abort we'll set a flag and then we'll block all future transfers until
> we have no more interrupts pending.
> 
> Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Reviewed-by: Stephen Boyd <swboyd@chromium.org>
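
The gating idea in the quoted commit message (set a flag when the abort fails, then refuse new transfers until no interrupts are pending) could be sketched in user-space C roughly as follows. All names here (mock_ctrl, mock_note_abort_failure(), mock_prepare_transfer()) and the choice of -EBUSY are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical controller state; m_irq_status models pending IRQ bits. */
struct mock_ctrl {
	bool abort_failed;      /* set when the abort command itself timed out */
	uint32_t m_irq_status;  /* non-zero while stale interrupts are pending */
};

/* Called from the timeout path once cancel and abort have both timed out. */
static void mock_note_abort_failure(struct mock_ctrl *c)
{
	assert(c != NULL);
	c->abort_failed = true;
}

/* Returns 0 if a new transfer may start, -EBUSY while stale IRQs remain. */
static int mock_prepare_transfer(struct mock_ctrl *c)
{
	assert(c != NULL);
	if (c->abort_failed) {
		if (c->m_irq_status != 0)
			return -EBUSY;
		c->abort_failed = false;  /* hardware is quiet again */
	}
	return 0;
}
```

The flag is only cleared once the pending-interrupt status reads as zero, so a transfer started afterwards cannot be confused by interrupts left over from the old transfer, cancel or abort.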
Stephen Boyd Dec. 18, 2020, 2:54 a.m. UTC | #2
Quoting Douglas Anderson (2020-12-17 14:29:13)
> If we get a timeout sending then this happens:
> 
> spi_transfer_one_message()
>  ->transfer_one() AKA spi_geni_transfer_one()
>   setup_fifo_xfer()
>    mas->cur_xfer = non-NULL
>  spi_transfer_wait() => TIMES OUT
>  if (msg->status != -EINPROGRESS)
>   goto out
>  if (ret != 0 ...)
>   spi_set_cs()
>    ->set_cs AKA spi_geni_set_cs()
>     # mas->cur_xfer is non-NULL
> 
> The above happens _before_ the SPI core calls ->handle_err() AKA
> handle_fifo_timeout().
> 
> Unfortunately that won't work so well on geni.  If we got a timeout
> transferring then it's likely that our interrupt handler is blocked,
> but we need that same interrupt handler to run and the command channel
> to be unblocked in order to adjust the chip select.  Trying to set the
> chip select doesn't crash us but ends up confusing our state machine
> and leads to messages like: Premature done. rx_rem = 32 bpw8
> 
> Let's just drop the chip select request in this case.  We can detect
> the case because cur_xfer is non-NULL--it would have been set to NULL
> in the interrupt handler if the previous transfer had finished.  Sure,
> we might leave the chip select in the wrong state but it's likely it
> was going to fail anyway and this avoids getting the driver even more
> confused about what it's doing.
> 
> The SPI core in general assumes that setting chip select is a simple
> operation that doesn't fail.  Yet another reason to just reconfigure
> the chip select line as GPIOs.
> 
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

Reviewed-by: Stephen Boyd <swboyd@chromium.org>
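
The "drop the chip select request while a transfer is still pending" check from the quoted commit message amounts to a small guard. The sketch below uses hypothetical names (mock_cs_ctrl, mock_set_cs()) and a plain boolean in place of the real command sequence; it only illustrates the decision, not how the driver programs the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state: cur_xfer stays non-NULL until a transfer completes. */
struct mock_cs_ctrl {
	const void *cur_xfer;
	bool cs_asserted;
};

/* Returns true if the chip select changed, false if the request was dropped. */
static bool mock_set_cs(struct mock_cs_ctrl *c, bool enable)
{
	assert(c != NULL);
	if (c->cur_xfer)
		return false;  /* a transfer never finished; leave CS alone */
	c->cs_asserted = enable;
	return true;
}
```

Dropping the request may leave the line in the wrong state, as the commit message concedes, but it keeps the state machine from being asked to run a second command on a channel that is still blocked.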
Mark Brown Dec. 18, 2020, 6:29 p.m. UTC | #3
On Thu, 17 Dec 2020 14:29:11 -0800, Douglas Anderson wrote:
> In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
> variable") we changed handle_fifo_timeout() so that we set
> "mas->cur_xfer" to NULL to make absolutely sure that we don't mess
> with the buffers from the previous transfer in the timeout case.
> 
> Unfortunately, this caused the IRQ handler to dereference NULL in some
> cases.  One case:
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
      commit: 4aa1464acbe3697710279a4bd65cb4801ed30425
[2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
      commit: 690d8b917bbe64772cb0b652311bcd50908aea6b
[3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
      commit: 3d7d916f9bc98ce88272b3e4405c7c685afbfcd6
[4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS
      commit: 17fa81aa702ec118f2b835715897041675b06336

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark
patchwork-bot+linux-arm-msm@kernel.org March 1, 2021, 7:59 p.m. UTC | #4
Hello:

This series was applied to qcom/linux.git (refs/heads/for-next):

On Thu, 17 Dec 2020 14:29:11 -0800 you wrote:
> In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state
> variable") we changed handle_fifo_timeout() so that we set
> "mas->cur_xfer" to NULL to make absolutely sure that we don't mess
> with the buffers from the previous transfer in the timeout case.
> 
> Unfortunately, this caused the IRQ handler to dereference NULL in some
> cases.  One case:
> 
> [...]


Here is the summary with links:
  - [v3,1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
    https://git.kernel.org/qcom/c/4aa1464acbe3
  - [v3,2/4] spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
    https://git.kernel.org/qcom/c/690d8b917bbe
  - [v3,3/4] spi: spi-geni-qcom: Don't try to set CS if an xfer is pending
    https://git.kernel.org/qcom/c/3d7d916f9bc9
  - [v3,4/4] spi: spi-geni-qcom: Print an error when we timeout setting the CS
    https://git.kernel.org/qcom/c/17fa81aa702e

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

Patch

diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c
index 25810a7eef10..6939c6cabe91 100644
--- a/drivers/spi/spi-geni-qcom.c
+++ b/drivers/spi/spi-geni-qcom.c
@@ -354,6 +354,12 @@  static bool geni_spi_handle_tx(struct spi_geni_master *mas)
 	unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas);
 	unsigned int i = 0;
 
+	/* Stop the watermark IRQ if nothing to send */
+	if (!mas->cur_xfer) {
+		writel(0, se->base + SE_GENI_TX_WATERMARK_REG);
+		return false;
+	}
+
 	max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word;
 	if (mas->tx_rem_bytes < max_bytes)
 		max_bytes = mas->tx_rem_bytes;
@@ -396,6 +402,14 @@  static void geni_spi_handle_rx(struct spi_geni_master *mas)
 		if (rx_last_byte_valid && rx_last_byte_valid < 4)
 			rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid;
 	}
+
+	/* Clear out the FIFO and bail if nowhere to put it */
+	if (!mas->cur_xfer) {
+		for (i = 0; i < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word); i++)
+			readl(se->base + SE_GENI_RX_FIFOn);
+		return;
+	}
+
 	if (mas->rx_rem_bytes < rx_bytes)
 		rx_bytes = mas->rx_rem_bytes;
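
For reference, the drain loop in the RX hunk above reads DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word) words, so a partially filled last FIFO word is read too. The stand-alone sketch below reproduces that arithmetic; the macro is redefined locally for positive operands, and fifo_words_to_drain() is a hypothetical helper name used only for this example.

```c
#include <assert.h>

/* Local definition for this example: integer division rounding up. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of FIFO word reads needed to consume rx_bytes bytes. */
static unsigned int fifo_words_to_drain(unsigned int rx_bytes,
					unsigned int bytes_per_fifo_word)
{
	assert(bytes_per_fifo_word > 0);
	return DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word);
}
```

For example, 10 pending bytes with 4 bytes per FIFO word take 3 reads, while 8 bytes take exactly 2.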