diff mbox series

ath9k: Add RX inactivity detection and reset chip when it occurs

Message ID 20241106-ath9k-deaf-detection-v1-1-736a150d2425@redhat.com
State New
Headers show
Series ath9k: Add RX inactivity detection and reset chip when it occurs | expand

Commit Message

Toke Høiland-Jørgensen Nov. 6, 2024, 12:41 p.m. UTC
Some ath9k chips can, seemingly at random, end up in a state which can
be described as "deaf". No or nearly no interrupts are generated anymore
for incoming packets. Existing links either break down after a while and
new links will not be established.

The circumstances leading to this "deafness" is still unclear, but some
particular chips (especially 2-stream 11n SoCs, but also others) can go
'deaf' when running AP or mesh (or both) after some time. It's probably
a hardware issue, and doing a channel scan to trigger a chip
reset (which one normally can't do on an AP interface) recovers the
hardware.

The only way the driver can detect this state, is by detecting if there
has been no RX activity for a while. In this case we can proactively
reset the chip (which only takes a small number of milliseconds, so
shouldn't interrupt things too much if it has been idle for several
seconds), which functions as a workaround.

OpenWrt, and various derivatives, have been carrying versions of this
workaround for years, that were never upstreamed. One version[0],
written by Felix Fietkau, used a simple counter and only reset if there
was precisely zero RX activity for a long period of time. This had the
problem that in some cases a small number of interrupts would appear
even if the device was otherwise not responsive. For this reason,
another version[1], written by Simon Wunderlich and Sven Eckelmann, used
a time-based approach to calculate the average number of RX interrupts
over a longer (four-second) interval, and reset the chip when seeing
less than one interrupt per second over this period. However, that
version relied on debugfs counters to keep track of the number of
interrupts, which means it didn't work at all if debugfs was not
enabled.

This patch unifies the two versions: it uses the same approach as Felix'
patch to count the number of RX handler invocations, but uses the same
time-based windowing approach as Simon and Sven's patch to still handle
the case where occasional interrupts appear but the device is otherwise
deaf.

Since this is based on ideas by all three people, but not actually
directly derived from any of the patches, I'm including Suggested-by
tags from Simon, Sven and Felix below, which should hopefully serve as
proper credit.

[0] https://patchwork.kernel.org/project/linux-wireless/patch/20170125163654.66431-3-nbd@nbd.name/
[1] https://patchwork.kernel.org/project/linux-wireless/patch/20161117083614.19188-2-sven.eckelmann@open-mesh.com/

Suggested-by: Simon Wunderlich <sw@simonwunderlich.de>
Suggested-by: Sven Eckelmann <se@simonwunderlich.de>
Suggested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 drivers/net/wireless/ath/ath9k/ath9k.h |  2 ++
 drivers/net/wireless/ath/ath9k/debug.c |  1 +
 drivers/net/wireless/ath/ath9k/debug.h |  1 +
 drivers/net/wireless/ath/ath9k/link.c  | 33 +++++++++++++++++++++++++++++++--
 drivers/net/wireless/ath/ath9k/main.c  |  1 +
 5 files changed, 36 insertions(+), 2 deletions(-)


---
base-commit: c33f9c2728d0ccc7472e6239346c0fb3de556e0f
change-id: 20241105-ath9k-deaf-detection-0e26bb3243a6

Comments

Sven Eckelmann Nov. 6, 2024, 1:39 p.m. UTC | #1
Hi,

Thank you for submitting the patch.

On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote:
> Since this is based on ideas by all three people, but not actually
> directly derived from any of the patches, I'm including Suggested-by
> tags from Simon, Sven and Felix below, which should hopefully serve as
> proper credit.

At least for me, this is more than enough. Thanks.

I don't have the setup at the moment to test it again - maybe Issam can do 
this. One concern I would have (because I don't find the notes regarding this 
problem), is whether this check is now breaking because we count more things. 
In the past, rxlp/rxok was used for the check. And now I don't know whether 
the count for the other ones were still increasing.

* RXHP (rather sure that "high priority frame" wasn't increasing)
* RXEOL ("no RX descriptors available" - I would guess no, but I can't say for
  sure)
* RXORN ("FIFO overrun" I would guess no, but I can't say for sure)

Reviewed-by: Sven Eckelmann <se@simonwunderlich.de>

Kind regards,
	Sven
Toke Høiland-Jørgensen Nov. 6, 2024, 2:12 p.m. UTC | #2
Sven Eckelmann <se@simonwunderlich.de> writes:

> Hi,
>
> Thank you for submitting the patch.
>
> On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote:
>> Since this is based on ideas by all three people, but not actually
>> directly derived from any of the patches, I'm including Suggested-by
>> tags from Simon, Sven and Felix below, which should hopefully serve as
>> proper credit.
>
> At least for me, this is more than enough. Thanks.
>
> I don't have the setup at the moment to test it again - maybe Issam can do 
> this. One concern I would have (because I don't find the notes regarding this 
> problem), is whether this check is now breaking because we count more things. 
> In the past, rxlp/rxok was used for the check. And now I don't know whether 
> the count for the other ones were still increasing.
>
> * RXHP (rather sure that "high priority frame" wasn't increasing)
> * RXEOL ("no RX descriptors available" - I would guess no, but I can't say for
>   sure)
> * RXORN ("FIFO overrun" I would guess no, but I can't say for sure)
>
> Reviewed-by: Sven Eckelmann <se@simonwunderlich.de>

Great, thanks for the review! I'll let it sit in patchwork for a little
while to give people a chance to test it out before sending it over to
Kalle to be applied :)

-Toke
Simon Wunderlich Nov. 6, 2024, 4:03 p.m. UTC | #3
On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote:
> Sven Eckelmann <se@simonwunderlich.de> writes:
> > Hi,
> > 
> > Thank you for submitting the patch.
> > 
> > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote:
> >> Since this is based on ideas by all three people, but not actually
> >> directly derived from any of the patches, I'm including Suggested-by
> >> tags from Simon, Sven and Felix below, which should hopefully serve as
> >> proper credit.
> > 
> > At least for me, this is more than enough. Thanks.
> > 
> > I don't have the setup at the moment to test it again - maybe Issam can do
> > this. One concern I would have (because I don't find the notes regarding
> > this problem), is whether this check is now breaking because we count
> > more things. In the past, rxlp/rxok was used for the check. And now I
> > don't know whether the count for the other ones were still increasing.
> > 
> > * RXHP (rather sure that "high priority frame" wasn't increasing)
> > * RXEOL ("no RX descriptors available" - I would guess no, but I can't say
> > for> 
> >   sure)
> > 
> > * RXORN ("FIFO overrun" I would guess no, but I can't say for sure)
> > 
> > Reviewed-by: Sven Eckelmann <se@simonwunderlich.de>
> 
> Great, thanks for the review! I'll let it sit in patchwork for a little
> while to give people a chance to test it out before sending it over to
> Kalle to be applied :)
> 
> -Toke

Hi Toke,

this looks good to me in general. I'm not sure either about the particular RX 
interrupts. We can test this by putting the AP in a shield box and verify that 
the counters are actually increasing, and that should be good enough.

Acked-by: Simon Wunderlich <sw@simonwunderlich.de>

Thank you!
      Simon
Toke Høiland-Jørgensen Nov. 7, 2024, 9:36 a.m. UTC | #4
Simon Wunderlich <sw@simonwunderlich.de> writes:

> On Wednesday, November 6, 2024 3:12:59 PM CET Toke Høiland-Jørgensen wrote:
>> Sven Eckelmann <se@simonwunderlich.de> writes:
>> > Hi,
>> > 
>> > Thank you for submitting the patch.
>> > 
>> > On Wednesday, 6 November 2024 13:41:44 CET Toke Høiland-Jørgensen wrote:
>> >> Since this is based on ideas by all three people, but not actually
>> >> directly derived from any of the patches, I'm including Suggested-by
>> >> tags from Simon, Sven and Felix below, which should hopefully serve as
>> >> proper credit.
>> > 
>> > At least for me, this is more than enough. Thanks.
>> > 
>> > I don't have the setup at the moment to test it again - maybe Issam can do
>> > this. One concern I would have (because I don't find the notes regarding
>> > this problem), is whether this check is now breaking because we count
>> > more things. In the past, rxlp/rxok was used for the check. And now I
>> > don't know whether the count for the other ones were still increasing.
>> > 
>> > * RXHP (rather sure that "high priority frame" wasn't increasing)
>> > * RXEOL ("no RX descriptors available" - I would guess no, but I can't say
>> > for> 
>> >   sure)
>> > 
>> > * RXORN ("FIFO overrun" I would guess no, but I can't say for sure)
>> > 
>> > Reviewed-by: Sven Eckelmann <se@simonwunderlich.de>
>> 
>> Great, thanks for the review! I'll let it sit in patchwork for a little
>> while to give people a chance to test it out before sending it over to
>> Kalle to be applied :)
>> 
>> -Toke
>
> Hi Toke,
>
> this looks good to me in general. I'm not sure either about the particular RX 
> interrupts. We can test this by putting the AP in a shield box and verify that 
> the counters are actually increasing, and that should be good enough.
>
> Acked-by: Simon Wunderlich <sw@simonwunderlich.de>

Great, thanks! Would be awesome if you could test it out an report back! :)

-Toke
diff mbox series

Patch

diff --git a/drivers/net/wireless/ath/ath9k/ath9k.h b/drivers/net/wireless/ath/ath9k/ath9k.h
index 29ca65a732a66c9452b19ad184f513f92e4c8c2c..bcfc8df0efe5b2c5510230e7f59592d51df8365e 100644
--- a/drivers/net/wireless/ath/ath9k/ath9k.h
+++ b/drivers/net/wireless/ath/ath9k/ath9k.h
@@ -1018,6 +1018,8 @@  struct ath_softc {
 
 	u8 gtt_cnt;
 	u32 intrstatus;
+	u32 rx_active_check_time;
+	u32 rx_active_count;
 	u16 ps_flags; /* PS_* */
 	bool ps_enabled;
 	bool ps_idle;
diff --git a/drivers/net/wireless/ath/ath9k/debug.c b/drivers/net/wireless/ath/ath9k/debug.c
index 51abc470125b3cc4e773f5272e2394c8f790a32e..a1e3bcf796f27fd58f5ab307a198927274f8f321 100644
--- a/drivers/net/wireless/ath/ath9k/debug.c
+++ b/drivers/net/wireless/ath/ath9k/debug.c
@@ -750,6 +750,7 @@  static int read_file_reset(struct seq_file *file, void *data)
 		[RESET_TYPE_CALIBRATION] = "Calibration error",
 		[RESET_TX_DMA_ERROR] = "Tx DMA stop error",
 		[RESET_RX_DMA_ERROR] = "Rx DMA stop error",
+		[RESET_TYPE_RX_INACTIVE] = "Rx path inactive",
 	};
 	int i;
 
diff --git a/drivers/net/wireless/ath/ath9k/debug.h b/drivers/net/wireless/ath/ath9k/debug.h
index 389459c04d14a5215783b916519d05419b970c20..cb3e75969875afed282b09ff2458dd8ca286d5b1 100644
--- a/drivers/net/wireless/ath/ath9k/debug.h
+++ b/drivers/net/wireless/ath/ath9k/debug.h
@@ -53,6 +53,7 @@  enum ath_reset_type {
 	RESET_TYPE_CALIBRATION,
 	RESET_TX_DMA_ERROR,
 	RESET_RX_DMA_ERROR,
+	RESET_TYPE_RX_INACTIVE,
 	__RESET_TYPE_MAX
 };
 
diff --git a/drivers/net/wireless/ath/ath9k/link.c b/drivers/net/wireless/ath/ath9k/link.c
index d1e5767aab3cbc2ee91018edadf89a223cf1d759..d078a59d7d3cdbe623b507d29cc974b801f2deaf 100644
--- a/drivers/net/wireless/ath/ath9k/link.c
+++ b/drivers/net/wireless/ath/ath9k/link.c
@@ -50,7 +50,36 @@  static bool ath_tx_complete_check(struct ath_softc *sc)
 		"tx hung, resetting the chip\n");
 	ath9k_queue_reset(sc, RESET_TYPE_TX_HANG);
 	return false;
+}
+
+#define RX_INACTIVE_CHECK_INTERVAL (4 * MSEC_PER_SEC)
+
+static bool ath_hw_rx_inactive_check(struct ath_softc *sc)
+{
+	struct ath_common *common = ath9k_hw_common(sc->sc_ah);
+	u32 interval, count;
+
+	interval = jiffies_to_msecs(jiffies - sc->rx_active_check_time);
+	count = sc->rx_active_count;
+
+	if (interval < RX_INACTIVE_CHECK_INTERVAL)
+		return true; /* too soon to check */
 
+	sc->rx_active_count = 0;
+	sc->rx_active_check_time = jiffies;
+
+	/* Need at least one interrupt per second, and we should only react if
+	 * we are within a factor two of the expected interval
+	 */
+	if (interval > RX_INACTIVE_CHECK_INTERVAL * 2 ||
+	    count >= interval / MSEC_PER_SEC)
+		return true;
+
+	ath_dbg(common, RESET,
+		"RX inactivity detected. Schedule chip reset\n");
+	ath9k_queue_reset(sc, RESET_TYPE_RX_INACTIVE);
+
+	return false;
 }
 
 void ath_hw_check_work(struct work_struct *work)
@@ -58,8 +87,8 @@  void ath_hw_check_work(struct work_struct *work)
 	struct ath_softc *sc = container_of(work, struct ath_softc,
 					    hw_check_work.work);
 
-	if (!ath_hw_check(sc) ||
-	    !ath_tx_complete_check(sc))
+	if (!ath_hw_check(sc) || !ath_tx_complete_check(sc) ||
+	    !ath_hw_rx_inactive_check(sc))
 		return;
 
 	ieee80211_queue_delayed_work(sc->hw, &sc->hw_check_work,
diff --git a/drivers/net/wireless/ath/ath9k/main.c b/drivers/net/wireless/ath/ath9k/main.c
index b92c89dad8deac7281e0165c9e1a566ba99ece65..998f717a1a86eee539058f6f4d9318dd459fc8dd 100644
--- a/drivers/net/wireless/ath/ath9k/main.c
+++ b/drivers/net/wireless/ath/ath9k/main.c
@@ -453,6 +453,7 @@  void ath9k_tasklet(struct tasklet_struct *t)
 			ath_rx_tasklet(sc, 0, true);
 
 		ath_rx_tasklet(sc, 0, false);
+		sc->rx_active_count++;
 	}
 
 	if (status & ATH9K_INT_TX) {