Message ID | 20220915043527.737133-1-venkatch@gmail.com |
---|---|
State | Superseded |
Headers | show |
Series | [v2] wifi: mac80211: Fix performance issue with mutex_lock | expand |
On 15.09.22 06:35, venkatch@gmail.com wrote: > From: Venkat Chimata <venkata@shasta.cloud> > > OpenWiFi's ucentral agent periodically (typically 120 seconds) > issues nl80211 call to get associated client list from the > WLAN driver. Somehow this operation was causing tx/rx delays > sometimes and the video calls on connected clients are experiencing > jitter. The associated client list was protected by > a mutex lock. I saw that ieee80211_check_fast_xmit_all uses > rcu_read_lock and rcu_read_unlock to iterate through sta_list. > I took it as a refernce and changed the lock to rcu_read lock > from mutex. > Also saw this this comment just above sta_mutex declaration. > /* Station data */ > /* > * The mutex only protects the list, hash table and > * counter, reads are done with RCU. > */ > struct mutex sta_mtx; > > Hence tried changing mutex_lock(/unlock) in ieee80211_dump_station > to rcu_read_lock(/unlock) and it resolved the jitter issue in the > video calls. > > Tests: > We had this issue show up consistently and the patch fixed the issue. > We spent a good part of 2 weeks following up on this and with this > fix, the video calls are smooth. > > Also tested if this could cause any crashes with below mentioned > process. > > 1. Connect 3 clients > 2. A script running dumping clients in an infinite loop > 3. Continuously disconnect / connect one client to see if > there is any crash. No crash was observed. The drv_sta_statistics call (called from sta_set_sinfo) can sleep, so you're not allowed to use RCU for iterating over the station list. Also, please analyze and explain the actual root cause before proposing another fix. - Felix
On 20.09.22 11:05, Venkat Ch wrote: > Hi Felix, > Thanks for the comments. What part of drv_sta_statistics could make it > sleep? I might be missing something. Please explain. It's just a wrapper around the driver .sta_statistics op. The documentation in include/net/mac80211.h states that this callback is allowed to sleep, and it does so in a few drivers. - Felix
On 20.09.22 20:27, Venkat Ch wrote: > Hi Felix, > > Thank you. I browsed through the code. There seems to be sleep > operations wcn36xx platform.ath11k does n't seem to have the sleep > operations in sta_statistics. We are using ath11k based chipset. Will > it impact things if we apply this patch for ath11k boards only as a > platform specific patch? I think it's a bad idea to keep this hack without understanding why it helps at all. - Felix
Hi Felix, Thanks again for patiently discussing things. I clearly understand what you said about locks. Now that we know the background of the problem, do you suggest any potential solution or any potential direction that I should start investigating? Please let me know. Thanks & Regards Venkat On Wed, 21 Sept 2022 at 13:35, Felix Fietkau <nbd@nbd.name> wrote: > > > On 20.09.22 21:23, Venkat Ch wrote: > > Hi Felix, > > > > Following is the background of the problem, how I traced to > > mutex_lock and why I propose rcu locks. > > > > Issue: > > On a 10Mbps upload / 50 Mbps download connection, the following issue reported. > > > > Video periodically freezes and/or appears delayed when on Zoom or Teams. > > 1. Video will freeze for 10 or 15 seconds periodically when on a call, > > but audio continues and the session doesn't drop. > > 2. The video still works but it appears delay (I move, but the video > > of my movement is noticeably delay by a second or so) > > > > Tracing to mutex_lock(sta_mutex): > > > > When I investigated, I found that the ucentral agent in openwifi > > fetches the station list periodically. Without the station list > > fetch, the video quality is just fine. I investigated the station list > > path and found this mutex_lock. I also see that earlier it was > > rcu_lock which protected the station list in this path. In this > > commit, https://github.com/torvalds/linux/commit/66572cfc30a4b764150c83ee5d842a3ce17991c9, > > rcu lock was changed to mutex lock without providing any reason. > The reason seems clear to me, even though it was not explicitly stated > in the commit message: in sta_set_sinfo it introduces a call to a driver > op that is allowed to sleep. > > > I also saw this comment just above the sta_mutex declaration. > > /* Station data */ > > /* > > * The mutex only protects the list, hash table and > > * counter, reads are done with RCU. > > */ > > struct mutex sta_mtx; > > > > So I reverted back the mutex_lock with rcu_lock and it just worked > > fine. We tested for more than 2 weeks before concluding this analysis. > > > > I think the usage of mutex_lock is impacting the tx / rx path > > somewhere and hence the issue. It is a challenge to trace the exact > > function though. > > I don't see any critical part in the tx/rx path which depends on the > sta_mtx lock. My guess is that for some reason your change is simply > accidentally papering over the real bug, which may be somewhere else > entirely, maybe even in the driver. A freeze for 10-15 seconds > definitely does not sound like simple lock contention on the mutex, > since a single station dump will be much faster than that. > > - Felix
diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c index 6a8350d..a005364 100644 --- a/net/mac80211/cfg.c +++ b/net/mac80211/cfg.c @@ -790,20 +790,17 @@ static int ieee80211_dump_station(struct wiphy *wiphy, struct net_device *dev, int idx, u8 *mac, struct station_info *sinfo) { struct ieee80211_sub_if_data *sdata = IEEE80211_DEV_TO_SUB_IF(dev); - struct ieee80211_local *local = sdata->local; struct sta_info *sta; int ret = -ENOENT; - mutex_lock(&local->sta_mtx); - + rcu_read_lock(); sta = sta_info_get_by_idx(sdata, idx); if (sta) { ret = 0; memcpy(mac, sta->sta.addr, ETH_ALEN); sta_set_sinfo(sta, sinfo, true); } - - mutex_unlock(&local->sta_mtx); + rcu_read_unlock(); return ret; } diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c index 550a610..af6fa75 100644 --- a/net/mac80211/sta_info.c +++ b/net/mac80211/sta_info.c @@ -231,8 +231,7 @@ struct sta_info *sta_info_get_by_idx(struct ieee80211_sub_if_data *sdata, struct sta_info *sta; int i = 0; - list_for_each_entry_rcu(sta, &local->sta_list, list, - lockdep_is_held(&local->sta_mtx)) { + list_for_each_entry_rcu(sta, &local->sta_list, list) { if (sdata != sta->sdata) continue; if (i < idx) {