[RESEND,v3,0/2] Improve ath10k flush queue mechanism

Message ID cover.1732293922.git.repk@triplefau.lt

Message

Remi Pommarel Nov. 22, 2024, 4:48 p.m. UTC
It has been reported [0] that 3-4 seconds (actually up to 5 seconds) of
radio silence can be observed, followed by the error below, on ath10k
devices:

 ath10k_pci 0000:04:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0

This is due to how the TX queues are flushed in ath10k. When a STA is
removed, mac80211 needs to flush queues [1], but because ath10k does not
have a lightweight .flush_sta operation, ieee80211_flush_queues() is
called instead, effectively blocking the whole queue during the drain
and causing this radio silence. Also, because ath10k_flush() waits for
all queues to be emptied, not only the flushed ones, it can more easily
take up to 5 seconds to finish, making the whole situation worse.

The first patch of this series adds a .flush_sta operation to flush only
specific STA traffic, avoiding the need to stop whole queues, and should
be enough in itself to fix the reported issue.
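
As a rough illustration of the fallback described above, here is a minimal
userspace sketch in C. It is not mac80211 code: struct driver_ops,
remove_sta() and count_flush() are hypothetical names made up for this
example, and the only behaviour taken from the description is "use
.flush_sta when the driver provides it, otherwise drain every queue".

```c
#include <assert.h>
#include <stddef.h>

/* Which path a STA removal ended up taking in this model. */
enum flush_path { FLUSH_STA_ONLY, FLUSH_ALL_QUEUES };

/* Hypothetical, heavily reduced stand-in for a driver ops table. */
struct driver_ops {
	void (*flush_sta)(void *sta); /* optional, may be NULL */
};

/* Example callback: counts how many times it was invoked. */
static void count_flush(void *sta)
{
	(*(int *)sta)++;
}

/* Model of STA removal: prefer the per-STA flush when it exists. */
static enum flush_path remove_sta(const struct driver_ops *ops, void *sta)
{
	assert(ops);

	if (ops->flush_sta) {
		ops->flush_sta(sta);
		return FLUSH_STA_ONLY;
	}

	/* Fallback: every queue is stopped and drained. */
	return FLUSH_ALL_QUEUES;
}
```

With flush_sta left NULL, remove_sta() reports FLUSH_ALL_QUEUES, which
corresponds to the path that silences the radio for every station; once the
callback is set, only the station being removed is touched.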

The second patch of this series is a proposal to improve ath10k_flush() so
that it is less likely to time out waiting for unrelated queues to
drain.
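
The difference between the two wait conditions can be sketched in plain
userspace C as follows. This is only a model under assumptions, not the
driver code: struct tx_state, NUM_QUEUES and both helper functions are
hypothetical, and only the array name num_pending_per_queue[] is borrowed
from the changelog below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_QUEUES 4 /* hypothetical queue count for this sketch */

/* Hypothetical stand-in for the per-queue TX accounting. */
struct tx_state {
	unsigned int num_pending_per_queue[NUM_QUEUES];
};

/* Broad condition: every queue must be drained before the flush ends. */
static bool all_queues_empty(const struct tx_state *tx)
{
	assert(tx);

	for (int q = 0; q < NUM_QUEUES; q++)
		if (tx->num_pending_per_queue[q])
			return false;
	return true;
}

/* Narrow condition: only queues whose bit is set in @queues matter. */
static bool requested_queues_empty(const struct tx_state *tx, uint32_t queues)
{
	bool empty = true; /* flushing no queue at all is trivially done */

	assert(tx);

	for (int q = 0; q < NUM_QUEUES; q++) {
		if (!(queues & (UINT32_C(1) << q)))
			continue; /* not requested, ignore its backlog */
		if (tx->num_pending_per_queue[q])
			empty = false;
	}
	return empty;
}
```

If queue 1 still holds frames while only queue 0 was requested, the narrow
condition is already satisfied, whereas the broad one would keep waiting
until the timeout.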

The above kernel warning could still be observed (e.g. when flushing a
dead STA) but should now be harmless.

[0]: https://lore.kernel.org/all/CA+Xfe4FjUmzM5mvPxGbpJsF3SvSdE5_wgxvgFJ0bsdrKODVXCQ@mail.gmail.com/
[1]: commit 0b75a1b1e42e ("wifi: mac80211: flush queues on STA removal")

V3:
  - Initialize empty to true to fix smatch error

V2:
  - Add Closes tag
  - Use atomic instead of spinlock for per sta pending frame counter
  - Call ath10k_htt_tx_sta_dec_pending within rcu
  - Rename pending_per_queue[] to num_pending_per_queue[]
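
To give an idea of what a lock-free per-STA pending frame counter looks
like, here is a small userspace sketch using C11 <stdatomic.h>. The kernel
uses its own atomic_t helpers rather than C11 atomics, and struct sta_state
and the three functions below are hypothetical names for illustration only,
not the ones from the patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-station state; not the driver's actual structure. */
struct sta_state {
	atomic_int num_pending; /* frames submitted but not yet completed */
};

/* Called when a frame for this station is handed to the hardware. */
static void sta_inc_pending(struct sta_state *sta)
{
	atomic_fetch_add(&sta->num_pending, 1);
}

/* Called on TX completion; true if this drained the last pending frame. */
static bool sta_dec_pending(struct sta_state *sta)
{
	int old = atomic_fetch_sub(&sta->num_pending, 1);

	assert(old > 0); /* a completion without a matching submit is a bug */
	return old == 1;
}

/* Condition a per-STA flush would wait on. */
static bool sta_is_flushed(struct sta_state *sta)
{
	return atomic_load(&sta->num_pending) == 0;
}
```

Because each update is a single atomic read-modify-write, the submit and
completion paths do not need to take a lock just to keep the count
consistent.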

Remi Pommarel (2):
  wifi: ath10k: Implement ieee80211 flush_sta callback
  wifi: ath10k: Flush only requested txq in ath10k_flush()

 drivers/net/wireless/ath/ath10k/core.h   |  2 +
 drivers/net/wireless/ath/ath10k/htt.h    | 11 +++-
 drivers/net/wireless/ath/ath10k/htt_tx.c | 49 +++++++++++++++-
 drivers/net/wireless/ath/ath10k/mac.c    | 75 ++++++++++++++++++++----
 drivers/net/wireless/ath/ath10k/txrx.c   | 11 ++--
 5 files changed, 127 insertions(+), 21 deletions(-)

Comments

James Prestwood Nov. 26, 2024, 12:57 p.m. UTC | #1
Hi Remi,

I saw that the original report indicated this happened only in AP mode,
but after seeing this and checking some of our clients I found that it
also happens in station mode. I only have clients on 6.2 and 6.8. I can
confirm it's not occurring on 6.2, but it is on 6.8. I also tried your
set of patches but did not notice any difference in behavior with or
without them. When it happens, it's always just after a roam scan: ~4
seconds go by and we get the failure, followed by a "Connection to AP
<mac> lost". Oddly, the MAC address is all zeros.

Nov 25 09:09:50 iwd[16256]: src/station.c:station_start_roam() Using cached neighbor report for roam
Nov 25 09:09:54 kernel: ath10k_pci 0000:02:00.0: failed to flush transmit queue (skip 0 ar-state 1): 0
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_link_notify() event 16 on ifindex 7
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_deauthenticate_event()
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
Nov 25 09:09:54 iwd[16256]: src/netdev.c:netdev_disconnect_event()
Nov 25 09:09:54 iwd[16256]: Received Deauthentication event, reason: 4, from_ap: false
Nov 25 09:09:54 kernel: wlan0: Connection to AP 00:00:00:00:00:00 lost

Other times, the above logs are preceded by this:

Nov 26 00:25:25 kernel: ath10k_pci 0000:02:00.0: failed to flush sta txq (sta ca:55:b8:7a:91:4b skip 0 ar-state 1): 0

Note, the above logs are with your patches applied. Maybe this is a
separate issue? Or do you think it's related?

Thanks,

James

James Prestwood Nov. 26, 2024, 12:59 p.m. UTC | #2
Forgot to mention, this is on the QCA6174 hw 3.2:

firmware ver WLAN.RM.4.4.1-00288- api 6 features wowlan,ignore-otp,mfp crc32 bf907c7c

Remi Pommarel Nov. 29, 2024, 4:31 p.m. UTC | #3
Hi James,

On Tue, Nov 26, 2024 at 04:57:36AM -0800, James Prestwood wrote:
> Note, the above logs are with your patches applied. Maybe this is a separate
> issue? Or do you think it's related?

Thanks for the test. Yes, this patchset is here only to fix the issue in
AP mode (where it caused the AP to stall all traffic for every STA
connected to it). So while this issue is interesting, it is not addressed
by this patchset.

Out of curiosity I tried to reproduce it by roaming an ath10k STA back
and forth between two APs (same SSID/PSK, different channels) and wasn't
able to reproduce it with wpa_supplicant; I didn't try with iwd though.
Or maybe the AP the STA is roaming away from has stopped responding; in
that case I don't know what can be done here, as it does not seem we
want to drop pending frames (we would prefer to deauth cleanly from the
AP in the main case).

In any case, I still think this is a separate issue, and it is also far
less critical than the AP one (one STA can cause a ~4 second DoS for the
entire BSS, versus a STA taking more time to roam away if the AP crashed).

Thanks,
James Prestwood Dec. 2, 2024, 3:25 p.m. UTC | #4
Hi Remi,

On 11/29/24 8:31 AM, Remi Pommarel wrote:
> Thanks for the test. Yes, this patchset is here only to fix the issue in
> AP mode (where it caused the AP to stall all traffic for every STA
> connected to it). So while this issue is interesting, it is not addressed
> by this patchset.
Thanks for the clarification.
>
> Out of curiosity I tried to reproduce it by roaming an ath10k STA back
> and forth between two APs (same SSID/PSK, different channels) and wasn't
> able to reproduce it with wpa_supplicant; I didn't try with iwd though.
> Or maybe the AP the STA is roaming away from has stopped responding; in
> that case I don't know what can be done here, as it does not seem we
> want to drop pending frames (we would prefer to deauth cleanly from the
> AP in the main case).
We have quite a lot of clients on ath10k and the issue is rare(ish). But
you may be right that it stems from the AP not responding. I need to
dig in more to see if there is anything to be done on the client side; I
just figured implementing the flush queue op would apply to both station
and AP mode.
>
> In any case, I still think this is a separate issue, and it is also far
> less critical than the AP one (one STA can cause a ~4 second DoS for the
> entire BSS, versus a STA taking more time to roam away if the AP crashed).

So for my company's use case, a 4 second DoS to an individual BSS can be
potentially bad. This doesn't really differ from an outright disconnect,
but I'm still trying to limit any lapse in connectivity if at all
possible. If I can gather more info I'll report back.

Thanks,

James
