diff mbox series

[v2] wifi: mac80211: Fix performance issue with mutex_lock

Message ID 20220915043527.737133-1-venkatch@gmail.com
State Superseded
Headers show
Series [v2] wifi: mac80211: Fix performance issue with mutex_lock | expand

Commit Message

Venkat Ch Sept. 15, 2022, 4:35 a.m. UTC
From: Venkat Chimata <venkata@shasta.cloud>

OpenWiFi's ucentral agent periodically (typically 120 seconds)
issues nl80211 call to get associated client list from the
WLAN driver. Somehow this operation was causing tx/rx delays
sometimes and the video calls on connected clients are experiencing
jitter. The associated client list was protected by
a mutex lock. I saw that ieee80211_check_fast_xmit_all uses
rcu_read_lock and rcu_read_unlock to iterate through sta_list.
I took it as a refernce and changed the lock to rcu_read lock
from mutex.
Also saw this this comment just above sta_mutex declaration.
        /* Station data */
        /*
         * The mutex only protects the list, hash table and
         * counter, reads are done with RCU.
         */
        struct mutex sta_mtx;

Hence tried changing mutex_lock(/unlock) in ieee80211_dump_station
to rcu_read_lock(/unlock) and it resolved the jitter issue in the
video calls.

Tests:
We had this issue show up consistently and the patch fixed the issue.
We spent a good part of 2 weeks following up on this and with this
fix, the video calls are smooth.

Also tested if this could cause any crashes with below mentioned
process.

1. Connect 3 clients
2. A script running dumping clients in an infinite loop
3. Continuously disconnect / connect one client to see if
   there is any crash. No crash was observed.

---
v2:
* Resolve warning
 net/mac80211/cfg.c:858:33: warning:
 unused variable 'local' [-Wunused-variable]
 ---

Signed-off-by: Venkat Chimata <venkata@shasta.cloud>
---
 net/mac80211/cfg.c      | 7 ++-----
 net/mac80211/sta_info.c | 3 +--
 2 files changed, 3 insertions(+), 7 deletions(-)

Comments

Felix Fietkau Sept. 20, 2022, 8:33 a.m. UTC | #1
On 15.09.22 06:35, venkatch@gmail.com wrote:
> From: Venkat Chimata <venkata@shasta.cloud>
> 
> OpenWiFi's ucentral agent periodically (typically 120 seconds)
> issues nl80211 call to get associated client list from the
> WLAN driver. Somehow this operation was causing tx/rx delays
> sometimes and the video calls on connected clients are experiencing
> jitter. The associated client list was protected by
> a mutex lock. I saw that ieee80211_check_fast_xmit_all uses
> rcu_read_lock and rcu_read_unlock to iterate through sta_list.
> I took it as a refernce and changed the lock to rcu_read lock
> from mutex.
> Also saw this this comment just above sta_mutex declaration.
>          /* Station data */
>          /*
>           * The mutex only protects the list, hash table and
>           * counter, reads are done with RCU.
>           */
>          struct mutex sta_mtx;
> 
> Hence tried changing mutex_lock(/unlock) in ieee80211_dump_station
> to rcu_read_lock(/unlock) and it resolved the jitter issue in the
> video calls.
> 
> Tests:
> We had this issue show up consistently and the patch fixed the issue.
> We spent a good part of 2 weeks following up on this and with this
> fix, the video calls are smooth.
> 
> Also tested if this could cause any crashes with below mentioned
> process.
> 
> 1. Connect 3 clients
> 2. A script running dumping clients in an infinite loop
> 3. Continuously disconnect / connect one client to see if
>     there is any crash. No crash was observed.
The drv_sta_statistics call (called from sta_set_sinfo) can sleep, so 
you're not allowed to use RCU for iterating over the station list.
Also, please analyze and explain the actual root cause before proposing 
another fix.

- Felix
Felix Fietkau Sept. 20, 2022, 9:10 a.m. UTC | #2
On 20.09.22 11:05, Venkat Ch wrote:
> Hi Felix,
> Thanks for the comments. What part of  drv_sta_statistics could make it 
> sleep? I might be missing something. Please explain.
It's just a wrapper around the driver .sta_statistics op.
The documentation in include/net/mac80211.h states that this callback is 
allowed to sleep, and it does so in a few drivers.

- Felix
Felix Fietkau Sept. 20, 2022, 6:51 p.m. UTC | #3
On 20.09.22 20:27, Venkat Ch wrote:
> Hi Felix,
> 
>   Thank you. I browsed through the code. There seems to be sleep
> operations wcn36xx platform.ath11k  does n't seem to have the sleep
> operations in sta_statistics. We are using ath11k based chipset. Will
> it impact things if we apply this patch for ath11k boards only as a
> platform specific patch?
I think it's a bad idea to keep this hack without understanding why it 
helps at all.

- Felix
Venkat Ch Sept. 21, 2022, 8:31 a.m. UTC | #4
Hi Felix,

 Thanks again for patiently discussing things. I clearly understand
what you said about locks. Now that we know the background of the
problem, do you suggest any potential solution or any potential
direction that I should start investigating? Please let me know.

Thanks & Regards
Venkat

On Wed, 21 Sept 2022 at 13:35, Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 20.09.22 21:23, Venkat Ch wrote:
> > Hi Felix,
> >
> >   Following is the background of the problem, how I traced to
> > mutex_lock and why I propose rcu locks.
> >
> > Issue:
> >   On a 10Mbps upload / 50 Mbps download connection, the following issue reported.
> >
> > Video periodically freezes and/or appears delayed when on Zoom or Teams.
> > 1. Video will freeze for 10 or 15 seconds periodically when on a call,
> > but audio continues and the session doesn't drop.
> > 2. The video  still works but it appears delay (I move, but the video
> > of my movement is noticeably delay by a second or so)
> >
> > Tracing to mutex_lock(sta_mutex):
> >
> >   When I investigated, I found that the ucentral agent in openwifi
> > fetches the station list periodically.  Without the station list
> > fetch, the video quality is just fine. I investigated the station list
> > path and found this mutex_lock. I also see that earlier it was
> > rcu_lock which protected the station list in this path. In this
> > commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9,
> > rcu lock was changed to mutex lock without providing any reason.
> The reason seems clear to me, even though it was not explicitly stated
> in the commit message: in sta_set_sinfo it introduces a call to a driver
> op that is allowed to sleep.
>
> > I also saw this comment just above the sta_mutex declaration.
> >          /* Station data */
> >          /*
> >           * The mutex only protects the list, hash table and
> >           * counter, reads are done with RCU.
> >           */
> >          struct mutex sta_mtx;
> >
> > So I reverted back the mutex_lock with rcu_lock and it just worked
> > fine. We tested for more than 2 weeks before concluding this analysis.
> >
> > I think the usage of mutex_lock is impacting the tx / rx path
> > somewhere and hence the issue. It is a challenge to trace the exact
> > function though.
>
> I don't see any critical part in the tx/rx path which depends on the
> sta_mtx lock. My guess is that for some reason your change is simply
> accidentally papering over the real bug, which may be somewhere else
> entirely, maybe even in the driver. A freeze for 10-15 seconds
> definitely does not sound like simple lock contention on the mutex,
> since a single station dump will be much faster than that.
>
> - Felix
diff mbox series

Patch

diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 6a8350d..a005364 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -790,20 +790,17 @@  static int ieee80211_dump_station(struct wiphy *wiphy, struct net_device *dev,
 				  int idx, u8 *mac, struct station_info *sinfo)
 {
 	struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev);
-	struct ieee80211_local *local = sdata->local;
 	struct sta_info *sta;
 	int ret = -ENOENT;
 
-	mutex_lock(&local->sta_mtx);
-
+	rcu_read_lock();
 	sta = sta_info_get_by_idx(sdata, idx);
 	if (sta) {
 		ret = 0;
 		memcpy(mac, sta->sta.addr, ETH_ALEN);
 		sta_set_sinfo(sta, sinfo, true);
 	}
-
-	mutex_unlock(&local->sta_mtx);
+	rcu_read_unlock();
 
 	return ret;
 }
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 550a610..af6fa75 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -231,8 +231,7 @@  struct sta_info *sta_info_get_by_idx(struct ieee80211_sub_if_data *sdata,
 	struct sta_info *sta;
 	int i = 0;
 
-	list_for_each_entry_rcu(sta, &local->sta_list, list,
-				lockdep_is_held(&local->sta_mtx)) {
+	list_for_each_entry_rcu(sta, &local->sta_list, list) {
 		if (sdata != sta->sdata)
 			continue;
 		if (i < idx) {