Message ID | 20220915130946.302803-1-alexander@wetzel-home.de |
---|---|
State | New |
Headers | show |
Series | mac80211: Ensure vif queues are operational after start | expand |
On 15.09.22 15:09, Alexander Wetzel wrote: > Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync. > > When a new vif is created the queues may end up in an inconsistent state > and be inoperable: > Communication not using iTXQ will work, allowing to e.g. complete the > association. But the 4-way handshake will time out. The sta will not > send out any skbs queued in iTXQs. > > All normal attempts to start the queues will fail when reaching this > state. > local->queue_stop_reasons will have marked all queues as operational but > vif.txqs_stopped will still be set, creating an inconsistent internal > state. > > In reality this seems to be race between the mac80211 function > ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet: > Depending on the driver and the timing the queues may end up to be > operational or not. > > Cc: stable@vger.kernel.org > Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped") > Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> Acked-by: Felix Fietkau <nbd@nbd.name> Thanks for figuring this one out, - Felix
On 15.09.22 18:18, Felix Fietkau wrote: > > On 15.09.22 15:09, Alexander Wetzel wrote: >> Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync. >> >> When a new vif is created the queues may end up in an inconsistent state >> and be inoperable: >> Communication not using iTXQ will work, allowing to e.g. complete the >> association. But the 4-way handshake will time out. The sta will not >> send out any skbs queued in iTXQs. >> >> All normal attempts to start the queues will fail when reaching this >> state. >> local->queue_stop_reasons will have marked all queues as operational but >> vif.txqs_stopped will still be set, creating an inconsistent internal >> state. >> >> In reality this seems to be race between the mac80211 function >> ieee80211_do_open() setting SDATA_STATE_RUNNING and the >> wake_txqs_tasklet: >> Depending on the driver and the timing the queues may end up to be >> operational or not. >> >> Cc: stable@vger.kernel.org >> Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that >> is being stopped") >> Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> > > Acked-by: Felix Fietkau <nbd@nbd.name> > I've got some doubts that my fix is correct... While it fixes the problem in my tests it looks like we'll need another queue restart to get the queues working again. After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has been set by __ieee80211_stop_queue(). I'll update the patch and skip setting vif.txqs_stopped when SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX set looks harmless, having it set when it should less problematic... Alexander
> > I've got some doubts that my fix is correct... > While it fixes the problem in my tests it looks like we'll need another > queue restart to get the queues working again. > > After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has > been set by __ieee80211_stop_queue(). > > I'll update the patch and skip setting vif.txqs_stopped when > SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX > set looks harmless, having it set when it should less problematic... Scratch that: The patch should be ok as it is: IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok as it is. Sorry for the noise. Alexander
On 9/15/22 1:06 PM, Alexander Wetzel wrote: >> >> I've got some doubts that my fix is correct... >> While it fixes the problem in my tests it looks like we'll need another queue restart to get the queues working again. >> >> After all IEEE80211_TXQ_STOP_NETIF_TX will not be cleared when it has been set by __ieee80211_stop_queue(). >> >> I'll update the patch and skip setting vif.txqs_stopped when SDATA_STATE_RUNNING is not set. Not having IEEE80211_TXQ_STOP_NETIF_TX set looks harmless, having >> it set when it should less problematic... > > Scratch that: The patch should be ok as it is: IEEE80211_TXQ_STOP_NETIF_TX is not set on stop, the patch should be ok as it is. > > Sorry for the noise. > > Alexander > To add to the noise... From reading the original patch description, it was to stop an NPE when AP was stopped. I have been testing this patch below and it fixes the problems I saw with multiple vdevs. I was worried that the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not. diff --git a/net/mac80211/util.c b/net/mac80211/util.c index c768e583aad4..2b5dafe9f4cc 100644 --- a/net/mac80211/util.c +++ b/net/mac80211/util.c @@ -370,12 +370,6 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac) local_bh_disable(); spin_lock(&fq->lock); - if (!test_bit(SDATA_STATE_RUNNING, &sdata->state)) - goto out; - - if (sdata->vif.type == NL80211_IFTYPE_AP) - ps = &sdata->bss->ps; - sdata->vif.txqs_stopped[ac] = false; list_for_each_entry_rcu(sta, &local->sta_list, list) { @@ -408,6 +402,10 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac) txqi = to_txq_info(vif->txq); + if (test_bit(SDATA_STATE_RUNNING, &sdata->state)) + if (sdata->vif.type == NL80211_IFTYPE_AP) + ps = &sdata->bss->ps; + if (!test_and_clear_bit(IEEE80211_TXQ_STOP_NETIF_TX, &txqi->flags) || (ps && atomic_read(&ps->num_sta_ps)) || ac != vif->txq->ac) goto out; Thanks, Ben
Alexander Wetzel <alexander@wetzel-home.de> writes: > Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync. > > When a new vif is created the queues may end up in an inconsistent state > and be inoperable: > Communication not using iTXQ will work, allowing to e.g. complete the > association. But the 4-way handshake will time out. The sta will not > send out any skbs queued in iTXQs. > > All normal attempts to start the queues will fail when reaching this > state. > local->queue_stop_reasons will have marked all queues as operational but > vif.txqs_stopped will still be set, creating an inconsistent internal > state. > > In reality this seems to be race between the mac80211 function > ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet: > Depending on the driver and the timing the queues may end up to be > operational or not. > > Cc: stable@vger.kernel.org > Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped") > Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> > --- > net/mac80211/util.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) The title is missing "wifi:", but no need to resend because of this. I assume Johannes will fix it during commit.
On 15.09.22 22:39, Ben Greear wrote: > From reading the original patch description, it was to stop an NPE when > AP was stopped. I have been testing > this patch below and it fixes the problems I saw with multiple vdevs. I > was worried that > the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' > might still need to run > to keep everything in sync (and my patch allows that to happen), but I > do not know if that is true or not. > I'm not sure if it's still save to wake the queues for the vif we are tearing down and assumed the intend was to skip those, too. But it looks like all stations for the vif are deleted prior to setting sdata->bss = NULL, so the outcome should be the same. Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags, reducing the risk that a queue restart during vif setup ends up with inoperable queues. But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during ieee80211_tx_dequeue(). Which should not be possible to be called as long as SDATA_STATE_RUNNING was never set for the vif. But I'm on thin ice here:-) Alexander
On 9/16/22 9:38 AM, Alexander Wetzel wrote: > On 15.09.22 22:39, Ben Greear wrote: > >> From reading the original patch description, it was to stop an NPE when AP was stopped. I have been testing >> this patch below and it fixes the problems I saw with multiple vdevs. I was worried that >> the code in the 'list_for_each_entry_rcu(sta, &local->sta_list, list) {' might still need to run >> to keep everything in sync (and my patch allows that to happen), but I do not know if that is true or not. >> > > I'm not sure if it's still save to wake the queues for the vif we are tearing down and assumed the intend was to skip those, too. > > But it looks like all stations for the vif are deleted prior to setting sdata->bss = NULL, so the outcome should be the same. > > Your solution removes potentially set IEEE80211_TXQ_STOP_NETIF_TX flags, reducing the risk that a queue restart during vif setup ends up with inoperable queues. > But the only way to set IEEE80211_TXQ_STOP_NETIF_TX seems to be during ieee80211_tx_dequeue(). Which should not be possible to be called as long as > SDATA_STATE_RUNNING was never set for the vif. > > But I'm on thin ice here:-) I spent two days debugging this and have only barest understanding of how all of the atf/fq/txq logic is supposed to work. But, I think my test case was a bit different from yours, and in my test case, before my patch, it failed 100% of the time: create two station vdevs on same (mtk7916) radio admin one up (this works) admin second one up (this fails, tx path is hung, because sdata->vif.txqs_stopped[ac] was true). In general, I'd like to keep start/stop state as in-sync everywhere as possible, and I think my patch might be better at that than yours since it goes through the sta list (and, maybe too, I'm completely wrong about that). Potentially, static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac) should just be called each time a vdev goes admin up. But since my original test case failed so reliably, the __ieee80211_wake_txqs method must not actually be called when secondary vdevs go admin up. I am not sure if that is per design or some other bug. Hoping the 2-3 people who understand this logic well will chime in :) Thanks, Ben
diff --git a/net/mac80211/util.c b/net/mac80211/util.c index 0ea5d50091dc..bf7461c41bef 100644 --- a/net/mac80211/util.c +++ b/net/mac80211/util.c @@ -301,14 +301,14 @@ static void __ieee80211_wake_txqs(struct ieee80211_sub_if_data *sdata, int ac) local_bh_disable(); spin_lock(&fq->lock); + sdata->vif.txqs_stopped[ac] = false; + if (!test_bit(SDATA_STATE_RUNNING, &sdata->state)) goto out; if (sdata->vif.type == NL80211_IFTYPE_AP) ps = &sdata->bss->ps; - sdata->vif.txqs_stopped[ac] = false; - list_for_each_entry_rcu(sta, &local->sta_list, list) { if (sdata != sta->sdata) continue;
Make sure local->queue_stop_reasons and vif.txqs_stopped stay in sync. When a new vif is created the queues may end up in an inconsistent state and be inoperable: Communication not using iTXQ will work, allowing to e.g. complete the association. But the 4-way handshake will time out. The sta will not send out any skbs queued in iTXQs. All normal attempts to start the queues will fail when reaching this state. local->queue_stop_reasons will have marked all queues as operational but vif.txqs_stopped will still be set, creating an inconsistent internal state. In reality this seems to be race between the mac80211 function ieee80211_do_open() setting SDATA_STATE_RUNNING and the wake_txqs_tasklet: Depending on the driver and the timing the queues may end up to be operational or not. Cc: stable@vger.kernel.org Fixes: f856373e2f31 ("wifi: mac80211: do not wake queues on a vif that is being stopped") Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> --- net/mac80211/util.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)