Message ID | 20201216144114.v2.1.I99ee04f0cb823415df59bd4f550d6ff5756e43d6@changeid |
---|---|
State | Superseded |
Headers | show |
Series | [v2,1/4] spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case | expand |
Quoting Douglas Anderson (2020-12-16 14:41:49) > In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state > variable") we changed handle_fifo_timeout() so that we set > "mas->cur_xfer" to NULL to make absolutely sure that we don't mess > with the buffers from the previous transfer in the timeout case. > > Unfortunately, this caused the IRQ handler to dereference NULL in some > cases. One case: > > CPU0 CPU1 > ---- ---- > setup_fifo_xfer() > geni_se_setup_m_cmd() > <hardware starts transfer> > <transfer completes in hardware> > <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq> > ... > handle_fifo_timeout() > spin_lock_irq(mas->lock) > mas->cur_xfer = NULL > geni_se_cancel_m_cmd() > spin_unlock_irq(mas->lock) > > geni_spi_isr() > spin_lock(mas->lock) > if (m_irq & M_RX_FIFO_WATERMARK_EN) > geni_spi_handle_rx() > mas->cur_xfer NULL dereference! > > tl;dr: Seriously delayed interrupts for RX/TX can lead to timeout > handling setting mas->cur_xfer to NULL. > > Let's check for the NULL transfer in the TX and RX cases and reset the > watermark or clear out the fifo respectively to put the hardware back > into a sane state. > > NOTE: things still could get confused if we get timeouts all the way > through handle_fifo_timeout() and then start a new transfer because > interrupts from the old transfer / cancel / abort could still be > pending. A future patch will help this corner case. > > Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- Minor nits below. Reviewed-by: Stephen Boyd <swboyd@chromium.org> > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > index 25810a7eef10..bf55abbd39f1 100644 > --- a/drivers/spi/spi-geni-qcom.c > +++ b/drivers/spi/spi-geni-qcom.c > @@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas) > unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas); > unsigned int i = 0; > > + /* Stop the watermark IRQ if nothing to send */ > + if (!mas->cur_xfer) { > + writel(0, se->base + SE_GENI_TX_WATERMARK_REG); > + return false; > + } > + > max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word; > if (mas->tx_rem_bytes < max_bytes) > max_bytes = mas->tx_rem_bytes; > @@ -396,6 +402,14 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas) > if (rx_last_byte_valid && rx_last_byte_valid < 4) > rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid; > } > + > + /* Clear out the FIFO and bail if nowhere to put it */ > + if (mas->cur_xfer == NULL) { This is still == NULL though. :( > + while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word)) A for loop is also fine, but I think it would push the 80 character limit which is probably OK. I'm happy either way, but a for loop is usually easier to reason about. I suggested while to fit within 80 characters :/ for (i = 0; i < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word); i++) > + readl(se->base + SE_GENI_RX_FIFOn); > + return; > + }
Quoting Douglas Anderson (2020-12-16 14:41:51) > If we get a timeout sending then this happens: > * spi_transfer_wait() will get a timeout. > * We'll set the chip select > * We'll call handle_err() => handle_fifo_timeout(). > > Unfortunately that won't work so well on geni. If we got a timeout > transferring then it's likely that our interrupt handler is blocked, > but we need that same interrupt handler to adjust the chip select. > Trying to set the chip select doesn't crash us but ends up confusing > our state machine and leads to messages like: > Premature done. rx_rem = 32 bpw8 > > Let's just drop the chip select request in this case. Sure, we might > leave the chip select in the wrong state but it's likely it was going > to fail anyway and this avoids getting the driver even more confused > about what it's doing. > > The SPI core in general assumes that setting chip select is a simple > operation that doesn't fail. Yet another reason to just reconfigure > the chip select line as GPIOs. Indeed. > > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > Changes in v2: > - ("spi: spi-geni-qcom: Don't try to set CS if an xfer is pending") new for v2. > > drivers/spi/spi-geni-qcom.c | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c > index d988463e606f..0e4fa52ac017 100644 > --- a/drivers/spi/spi-geni-qcom.c > +++ b/drivers/spi/spi-geni-qcom.c > @@ -204,9 +204,14 @@ static void spi_geni_set_cs(struct spi_device *slv, bool set_flag) > goto exit; > } > > - mas->cs_flag = set_flag; > - > spin_lock_irq(&mas->lock); > + if (mas->cur_xfer) { How is it possible that cs change happens when cur_xfer is non-NULL? > + dev_err(mas->dev, "Can't set CS when prev xfter running\n"); xfer? or xfter? > + spin_unlock_irq(&mas->lock); > + goto exit; > + } > + > + mas->cs_flag = set_flag; > reinit_completion(&mas->cs_done);
Quoting Douglas Anderson (2020-12-16 14:41:51) > If we get a timeout sending then this happens: > * spi_transfer_wait() will get a timeout. > * We'll set the chip select > * We'll call handle_err() => handle_fifo_timeout(). > > Unfortunately that won't work so well on geni. If we got a timeout > transferring then it's likely that our interrupt handler is blocked, > but we need that same interrupt handler to adjust the chip select. > Trying to set the chip select doesn't crash us but ends up confusing > our state machine and leads to messages like: > Premature done. rx_rem = 32 bpw8 > > Let's just drop the chip select request in this case. Sure, we might > leave the chip select in the wrong state but it's likely it was going > to fail anyway and this avoids getting the driver even more confused > about what it's doing. > > The SPI core in general assumes that setting chip select is a simple > operation that doesn't fail. Yet another reason to just reconfigure > the chip select line as GPIOs. BTW, we could peek at the irq bit for the CS change and ignore the irq handler entirely. That would be one way to make sure the cs change went through, and would avoid an irq delay/scheduling problem for this simple operation. Maybe using the irq path is worse in general here?
On Wed, Dec 16, 2020 at 08:25:18PM -0800, Stephen Boyd wrote: > > spin_lock_irq(&mas->lock); > > + if (mas->cur_xfer) { > > How is it possible that cs change happens when cur_xfer is non-NULL? We will set the initial chip select state during controller setup.
Hi, On Wed, Dec 16, 2020 at 8:29 PM Stephen Boyd <swboyd@chromium.org> wrote: > > > The SPI core in general assumes that setting chip select is a simple > > operation that doesn't fail. Yet another reason to just reconfigure > > the chip select line as GPIOs. > > BTW, we could peek at the irq bit for the CS change and ignore the irq > handler entirely. That would be one way to make sure the cs change went > through, and would avoid an irq delay/scheduling problem for this simple > operation. Maybe using the irq path is worse in general here? Yes, when I was in my SPI optimization phase I actually coded this up and thought about it. It can be made to work and is probably marginally faster at the expense of more cpu cycles to poll the interrupt line. I also don't think it fixes this issue nor does it simplify things... :( 1. If there are already commands pending we still have to do something about them. Said another way: there's not a separate channel just for setting the chip select, so if the single command channel is gummed up then using polling mode to handle the chip select won't really un-gum it up. 2. In order to use polling mode to set the chip select we have to do _something_ to temporarily disable our interrupt handler. If we don't do that then the interrupt handler will fire for the "DONE" when we send the chip select command. If we wanted to truly make this driver super robust against ridiculous interrupt latencies then, presumably, we could handle the SPI timeout ourselves but before timing out we could check to see if the interrupts were pending. Then we could disable our interrupts, synchronize our interrupt handler, handle the interrupt directly, and then re-enable interrupts. If we did this then transfers could continue to eek their way through even if interrupts were completely blocked. IMO, it's not worth it. I'm satisfied with not crashing and not getting the state machine too out-of-whack. -Doug
Quoting Doug Anderson (2020-12-17 13:35:08) > > If we wanted to truly make this driver super robust against ridiculous > interrupt latencies then, presumably, we could handle the SPI timeout > ourselves but before timing out we could check to see if the > interrupts were pending. Then we could disable our interrupts, > synchronize our interrupt handler, handle the interrupt directly, and > then re-enable interrupts. If we did this then transfers could > continue to eek their way through even if interrupts were completely > blocked. IMO, it's not worth it. I'm satisfied with not crashing and > not getting the state machine too out-of-whack. > Ok that's fair. If it's not worth the effort then let's drop this idea.
diff --git a/drivers/spi/spi-geni-qcom.c b/drivers/spi/spi-geni-qcom.c index 25810a7eef10..bf55abbd39f1 100644 --- a/drivers/spi/spi-geni-qcom.c +++ b/drivers/spi/spi-geni-qcom.c @@ -354,6 +354,12 @@ static bool geni_spi_handle_tx(struct spi_geni_master *mas) unsigned int bytes_per_fifo_word = geni_byte_per_fifo_word(mas); unsigned int i = 0; + /* Stop the watermark IRQ if nothing to send */ + if (!mas->cur_xfer) { + writel(0, se->base + SE_GENI_TX_WATERMARK_REG); + return false; + } + max_bytes = (mas->tx_fifo_depth - mas->tx_wm) * bytes_per_fifo_word; if (mas->tx_rem_bytes < max_bytes) max_bytes = mas->tx_rem_bytes; @@ -396,6 +402,14 @@ static void geni_spi_handle_rx(struct spi_geni_master *mas) if (rx_last_byte_valid && rx_last_byte_valid < 4) rx_bytes -= bytes_per_fifo_word - rx_last_byte_valid; } + + /* Clear out the FIFO and bail if nowhere to put it */ + if (mas->cur_xfer == NULL) { + while (i++ < DIV_ROUND_UP(rx_bytes, bytes_per_fifo_word)) + readl(se->base + SE_GENI_RX_FIFOn); + return; + } + if (mas->rx_rem_bytes < rx_bytes) rx_bytes = mas->rx_rem_bytes;
In commit 7ba9bdcb91f6 ("spi: spi-geni-qcom: Don't keep a local state variable") we changed handle_fifo_timeout() so that we set "mas->cur_xfer" to NULL to make absolutely sure that we don't mess with the buffers from the previous transfer in the timeout case. Unfortunately, this caused the IRQ handler to dereference NULL in some cases. One case: CPU0 CPU1 ---- ---- setup_fifo_xfer() geni_se_setup_m_cmd() <hardware starts transfer> <transfer completes in hardware> <hardware sets M_RX_FIFO_WATERMARK_EN in m_irq> ... handle_fifo_timeout() spin_lock_irq(mas->lock) mas->cur_xfer = NULL geni_se_cancel_m_cmd() spin_unlock_irq(mas->lock) geni_spi_isr() spin_lock(mas->lock) if (m_irq & M_RX_FIFO_WATERMARK_EN) geni_spi_handle_rx() mas->cur_xfer NULL dereference! tl;dr: Seriously delayed interrupts for RX/TX can lead to timeout handling setting mas->cur_xfer to NULL. Let's check for the NULL transfer in the TX and RX cases and reset the watermark or clear out the fifo respectively to put the hardware back into a sane state. NOTE: things still could get confused if we get timeouts all the way through handle_fifo_timeout() and then start a new transfer because interrupts from the old transfer / cancel / abort could still be pending. A future patch will help this corner case. Fixes: 561de45f72bd ("spi: spi-geni-qcom: Add SPI driver support for GENI based QUP") Signed-off-by: Douglas Anderson <dianders@chromium.org> --- Changes in v2: - (ptr == NULL) => (!ptr). - Addressed loop nits in geni_spi_handle_rx(). - Commit message rewording from Stephen. drivers/spi/spi-geni-qcom.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)