[0/5] spi: zynqmp-gqspi: Improve error recovery by resetting

Message ID	20250116225521.2688224-1-sean.anderson@linux.dev
Headers	show Received: from out-172.mta0.migadu.com (out-172.mta0.migadu.com [91.218.175.172]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6DE8D18A6A8 for <linux-spi@vger.kernel.org>; Thu, 16 Jan 2025 22:55:34 +0000 (UTC) From: Sean Anderson <sean.anderson@linux.dev> To: Mark Brown <broonie@kernel.org>, Michal Simek <michal.simek@amd.com>, linux-spi@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Jinjie Ruan <ruanjinjie@huawei.com>, linux-arm-kernel@lists.infradead.org, Amit Kumar Mahapatra <amit.kumar-mahapatra@amd.com>, Miquel Raynal <miquel.raynal@bootlin.com>, Sean Anderson <sean.anderson@linux.dev>, Conor Dooley <conor+dt@kernel.org>, Krzysztof Kozlowski <krzk+dt@kernel.org>, Rob Herring <robh@kernel.org>, devicetree@vger.kernel.org Subject: [PATCH 0/5] spi: zynqmp-gqspi: Improve error recovery by resetting Date: Thu, 16 Jan 2025 17:55:16 -0500 Message-Id: <20250116225521.2688224-1-sean.anderson@linux.dev> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	spi: zynqmp-gqspi: Improve error recovery by resetting \| expand [0/5] spi: zynqmp-gqspi: Improve error recovery by resetting [1/5] dt-bindings: spi: zynqmp-qspi: Add reset [2/5] spi: zynqmp-gqspi: Reset device in probe [3/5] spi: zynqmp-gqspi: Abort operations on timeout [4/5] spi: zynqmp-gqspi: Allow interrupting operations [5/5] ARM64: xilinx: zynqmp: Add QSPI reset

Message ID

20250116225521.2688224-1-sean.anderson@linux.dev

Headers

From: Sean Anderson <sean.anderson@linux.dev>
To: Mark Brown <broonie@kernel.org>,
	Michal Simek <michal.simek@amd.com>,
	linux-spi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	Jinjie Ruan <ruanjinjie@huawei.com>,
	linux-arm-kernel@lists.infradead.org,
	Amit Kumar Mahapatra <amit.kumar-mahapatra@amd.com>,
	Miquel Raynal <miquel.raynal@bootlin.com>,
	Sean Anderson <sean.anderson@linux.dev>,
	Conor Dooley <conor+dt@kernel.org>,
	Krzysztof Kozlowski <krzk+dt@kernel.org>,
	Rob Herring <robh@kernel.org>,
	devicetree@vger.kernel.org
Subject: [PATCH 0/5] spi: zynqmp-gqspi: Improve error recovery by resetting
Date: Thu, 16 Jan 2025 17:55:16 -0500
Message-Id: <20250116225521.2688224-1-sean.anderson@linux.dev>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Series

spi: zynqmp-gqspi: Improve error recovery by resetting | expand

Message

Sean Anderson Jan. 16, 2025, 10:55 p.m. UTC

This series adds support for resetting the QSPI controller if we have a
timeout. I find this greatly improves the stability of the device, which
would tend to break after any timeout.


Sean Anderson (5):
  dt-bindings: spi: zynqmp-qspi: Add reset
  spi: zynqmp-gqspi: Reset device in probe
  spi: zynqmp-gqspi: Abort operations on timeout
  spi: zynqmp-gqspi: Allow interrupting operations
  ARM64: xilinx: zynqmp: Add QSPI reset

 .../bindings/spi/spi-zynqmp-qspi.yaml         |  6 ++
 arch/arm64/boot/dts/xilinx/zynqmp.dtsi        |  1 +
 drivers/spi/spi-zynqmp-gqspi.c                | 64 +++++++++++++++----
 3 files changed, 59 insertions(+), 12 deletions(-)

Comments

Miquel Raynal Jan. 17, 2025, 6:31 p.m. UTC | #1

On 17/01/2025 at 13:21:58 GMT, Mark Brown <broonie@kernel.org> wrote:

> On Thu, Jan 16, 2025 at 05:55:16PM -0500, Sean Anderson wrote:
>> This series adds support for resetting the QSPI controller if we have a
>> timeout. I find this greatly improves the stability of the device, which
>> would tend to break after any timeout.
>
> If you're hitting a timeout that tends to indicate there's already a
> serious stability problem...

Yes, unless the timeout is reached for "good reasons", ie. you request
substantial amounts of data (typically from a memory device) and the
timeout is too short compared to the theoretical time spent in the
transfer. A loaded machine can also increase the number of false
positives I guess.

Cheers,
Miquèl

Sean Anderson Jan. 17, 2025, 9:46 p.m. UTC | #2

On 1/17/25 13:41, Mark Brown wrote:
> On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote:
>> On 17/01/2025 at 13:21:58 GMT, Mark Brown <broonie@kernel.org> wrote:
> 
>> > If you're hitting a timeout that tends to indicate there's already a
>> > serious stability problem...
> 
>> Yes, unless the timeout is reached for "good reasons", ie. you request
>> substantial amounts of data (typically from a memory device) and the
>> timeout is too short compared to the theoretical time spent in the
>> transfer. A loaded machine can also increase the number of false
>> positives I guess.
> 
> I'd argue that all of those are bad reasons, I'd only expect us to time
> out when there's a bug - choosing too low a timeout or doing things in a
> way that generates timeouts under load is a problem.

There's no transmit DMA for this device. So if you are under high load
and make a long transfer, it's possible to time out. I don't know if
it's possible to fix that very easily. The timeout calculation assumes
that data is being transferred at the SPI bus rate.

That said, in the common case (NOR flash) writes don't work like that.
To write a flash, we make a short transfer (such as an eraseblock) and
then poll the status register before making another transfer. With short
transfers there is less probability of timing out because the extra time
makes up more of the duration.

--Sean

Mark Brown Jan. 20, 2025, 1:49 p.m. UTC | #3

On Fri, Jan 17, 2025 at 04:46:23PM -0500, Sean Anderson wrote:
> On 1/17/25 13:41, Mark Brown wrote:
> > On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote:

> >> Yes, unless the timeout is reached for "good reasons", ie. you request
> >> substantial amounts of data (typically from a memory device) and the
> >> timeout is too short compared to the theoretical time spent in the
> >> transfer. A loaded machine can also increase the number of false
> >> positives I guess.

> > I'd argue that all of those are bad reasons, I'd only expect us to time
> > out when there's a bug - choosing too low a timeout or doing things in a
> > way that generates timeouts under load is a problem.

> There's no transmit DMA for this device. So if you are under high load
> and make a long transfer, it's possible to time out. I don't know if
> it's possible to fix that very easily. The timeout calculation assumes
> that data is being transferred at the SPI bus rate.

In that case I wouldn't expect the timeout to apply to the whole
operation, or I'd expect a timeout applied waiting for something
interrupt driven to not to be fired unless we stop making forward
progress.

Sean Anderson Jan. 21, 2025, 4:51 p.m. UTC | #4

On 1/20/25 08:49, Mark Brown wrote:
> On Fri, Jan 17, 2025 at 04:46:23PM -0500, Sean Anderson wrote:
>> On 1/17/25 13:41, Mark Brown wrote:
>> > On Fri, Jan 17, 2025 at 07:31:08PM +0100, Miquel Raynal wrote:
> 
>> >> Yes, unless the timeout is reached for "good reasons", ie. you request
>> >> substantial amounts of data (typically from a memory device) and the
>> >> timeout is too short compared to the theoretical time spent in the
>> >> transfer. A loaded machine can also increase the number of false
>> >> positives I guess.
> 
>> > I'd argue that all of those are bad reasons, I'd only expect us to time
>> > out when there's a bug - choosing too low a timeout or doing things in a
>> > way that generates timeouts under load is a problem.
> 
>> There's no transmit DMA for this device. So if you are under high load
>> and make a long transfer, it's possible to time out. I don't know if
>> it's possible to fix that very easily. The timeout calculation assumes
>> that data is being transferred at the SPI bus rate.
> 
> In that case I wouldn't expect the timeout to apply to the whole
> operation, or I'd expect a timeout applied waiting for something
> interrupt driven to not to be fired unless we stop making forward
> progress.  

I don't know if there are any helpers we can use for this. To implement
this we'd need something like schedule_timeout() but where the interrupt
handler calls mod_timer() whenever it does work.

--Sean