Message ID | 20210515024227.2159311-1-briannorris@chromium.org |
---|---|
State | New |
Headers | show |
Series | [5.13] mwifiex: bring down link before deleting interface | expand |
On 5/15/21 4:42 AM, Brian Norris wrote: > We can deadlock when rmmod'ing the driver or going through firmware > reset, because the cfg80211_unregister_wdev() has to bring down the link > for us, ... which then grab the same wiphy lock. > > nl80211_del_interface() already handles a very similar case, with a nice > description: > > /* > * We hold RTNL, so this is safe, without RTNL opencount cannot > * reach 0, and thus the rdev cannot be deleted. > * > * We need to do it for the dev_close(), since that will call > * the netdev notifiers, and we need to acquire the mutex there > * but don't know if we get there from here or from some other > * place (e.g. "ip link set ... down"). > */ > mutex_unlock(&rdev->wiphy.mtx); > ... > > Do similarly for mwifiex teardown, by ensuring we bring the link down > first. > > Sample deadlock trace: > > [ 247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds. > [ 247.110630] Not tainted 5.12.4 #5 > [ 247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 247.124557] task:rmmod state:D stack: 0 pid: 2119 ppid: 2114 flags:0x00400208 > [ 247.133905] Call trace: > [ 247.136644] __switch_to+0x130/0x170 > [ 247.140643] __schedule+0x714/0xa0c > [ 247.144548] schedule_preempt_disabled+0x88/0xf4 > [ 247.149714] __mutex_lock_common+0x43c/0x750 > [ 247.154496] mutex_lock_nested+0x5c/0x68 > [ 247.158884] cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211] > [ 247.165769] raw_notifier_call_chain+0x4c/0x78 > [ 247.170742] call_netdevice_notifiers_info+0x68/0xa4 > [ 247.176305] __dev_close_many+0x7c/0x138 > [ 247.180693] dev_close_many+0x7c/0x10c > [ 247.184893] unregister_netdevice_many+0xfc/0x654 > [ 247.190158] unregister_netdevice_queue+0xb4/0xe0 > [ 247.195424] _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211] > [ 247.201816] cfg80211_unregister_wdev+0x20/0x2c [cfg80211] > [ 247.208016] mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex] > [ 247.214174] mwifiex_uninit_sw+0x158/0x1b0 [mwifiex] > [ 247.219747] mwifiex_remove_card+0x38/0xa0 [mwifiex] > [ 247.225316] mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie] > [ 247.231451] pci_device_remove+0x50/0xe0 > [ 247.235849] device_release_driver_internal+0x110/0x1b0 > [ 247.241701] driver_detach+0x5c/0x9c > [ 247.245704] bus_remove_driver+0x84/0xb8 > [ 247.250095] driver_unregister+0x3c/0x60 > [ 247.254486] pci_unregister_driver+0x2c/0x90 > [ 247.259267] cleanup_module+0x18/0xcdc [mwifiex_pcie] > > Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") > Cc: stable@vger.kernel.org > Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ > Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ > Reported-by: Maximilian Luz <luzmaximilian@gmail.com> > Reported-by: dave@bewaar.me > Cc: Johannes Berg <johannes@sipsolutions.net> > Signed-off-by: Brian Norris <briannorris@chromium.org> Thanks! Tested-by: Maximilian Luz <luzmaximilian@gmail.com> > --- > drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c > index 529dfd8b7ae8..17399d4aa129 100644 > --- a/drivers/net/wireless/marvell/mwifiex/main.c > +++ b/drivers/net/wireless/marvell/mwifiex/main.c > @@ -1445,11 +1445,18 @@ static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter) > if (!priv) > continue; > rtnl_lock(); > - wiphy_lock(adapter->wiphy); > if (priv->netdev && > - priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) > + priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) { > + /* > + * Close the netdev now, because if we do it later, the > + * netdev notifiers will need to acquire the wiphy lock > + * again --> deadlock. > + */ > + dev_close(priv->wdev.netdev); > + wiphy_lock(adapter->wiphy); > mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev); > - wiphy_unlock(adapter->wiphy); > + wiphy_unlock(adapter->wiphy); > + } > rtnl_unlock(); > } > >
On 2021-05-15 04:42, Brian Norris wrote: > We can deadlock when rmmod'ing the driver or going through firmware > reset, because the cfg80211_unregister_wdev() has to bring down the > link > for us, ... which then grab the same wiphy lock. > > nl80211_del_interface() already handles a very similar case, with a > nice > description: > > /* > * We hold RTNL, so this is safe, without RTNL opencount cannot > * reach 0, and thus the rdev cannot be deleted. > * > * We need to do it for the dev_close(), since that will call > * the netdev notifiers, and we need to acquire the mutex there > * but don't know if we get there from here or from some other > * place (e.g. "ip link set ... down"). > */ > mutex_unlock(&rdev->wiphy.mtx); > ... > > Do similarly for mwifiex teardown, by ensuring we bring the link down > first. > > Sample deadlock trace: > > [ 247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds. > [ 247.110630] Not tainted 5.12.4 #5 > [ 247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [ 247.124557] task:rmmod state:D stack: 0 pid: 2119 > ppid: 2114 flags:0x00400208 > [ 247.133905] Call trace: > [ 247.136644] __switch_to+0x130/0x170 > [ 247.140643] __schedule+0x714/0xa0c > [ 247.144548] schedule_preempt_disabled+0x88/0xf4 > [ 247.149714] __mutex_lock_common+0x43c/0x750 > [ 247.154496] mutex_lock_nested+0x5c/0x68 > [ 247.158884] cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211] > [ 247.165769] raw_notifier_call_chain+0x4c/0x78 > [ 247.170742] call_netdevice_notifiers_info+0x68/0xa4 > [ 247.176305] __dev_close_many+0x7c/0x138 > [ 247.180693] dev_close_many+0x7c/0x10c > [ 247.184893] unregister_netdevice_many+0xfc/0x654 > [ 247.190158] unregister_netdevice_queue+0xb4/0xe0 > [ 247.195424] _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211] > [ 247.201816] cfg80211_unregister_wdev+0x20/0x2c [cfg80211] > [ 247.208016] mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex] > [ 247.214174] mwifiex_uninit_sw+0x158/0x1b0 [mwifiex] > [ 247.219747] mwifiex_remove_card+0x38/0xa0 [mwifiex] > [ 247.225316] mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie] > [ 247.231451] pci_device_remove+0x50/0xe0 > [ 247.235849] device_release_driver_internal+0x110/0x1b0 > [ 247.241701] driver_detach+0x5c/0x9c > [ 247.245704] bus_remove_driver+0x84/0xb8 > [ 247.250095] driver_unregister+0x3c/0x60 > [ 247.254486] pci_unregister_driver+0x2c/0x90 > [ 247.259267] cleanup_module+0x18/0xcdc [mwifiex_pcie] > > Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the > driver") > Cc: stable@vger.kernel.org > Link: > https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ > Link: > https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ > Reported-by: Maximilian Luz <luzmaximilian@gmail.com> > Reported-by: Dave Olsthoorn <dave@bewaar.me> Thanks! The firmware still seems to crash quicker than previously, but that's a unrelated problem. Tested-by: Dave Olsthoorn <dave@bewaar.me> > Cc: Johannes Berg <johannes@sipsolutions.net> > Signed-off-by: Brian Norris <briannorris@chromium.org> > --- > drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++--- > 1 file changed, 10 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/wireless/marvell/mwifiex/main.c > b/drivers/net/wireless/marvell/mwifiex/main.c > index 529dfd8b7ae8..17399d4aa129 100644 > --- a/drivers/net/wireless/marvell/mwifiex/main.c > +++ b/drivers/net/wireless/marvell/mwifiex/main.c > @@ -1445,11 +1445,18 @@ static void mwifiex_uninit_sw(struct > mwifiex_adapter *adapter) > if (!priv) > continue; > rtnl_lock(); > - wiphy_lock(adapter->wiphy); > if (priv->netdev && > - priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) > + priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) { > + /* > + * Close the netdev now, because if we do it later, the > + * netdev notifiers will need to acquire the wiphy lock > + * again --> deadlock. > + */ > + dev_close(priv->wdev.netdev); > + wiphy_lock(adapter->wiphy); > mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev); > - wiphy_unlock(adapter->wiphy); > + wiphy_unlock(adapter->wiphy); > + } > rtnl_unlock(); > }
Hello! On Saturday 15 May 2021 18:32:30 Dave Olsthoorn wrote: > Hi, > > On 2021-05-15 17:40, Pali Rohár wrote: > > On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote: > > > The firmware still seems to crash quicker than previously, but > > > that's a > > > unrelated problem. > > > > Hello! Do you have some more details (or links) about mentioned firmware > > crash? > > Sure, firmware crashes have always been a problem on the Surface devices. What wifi chip you have on these devices? Because very similar firmware crashes I see on 88W8997 chip (also with mwifiex) when wifi card is configured in SDIO mode (not PCIe). I know that there are new version of firmwares for these 88W8xxx chips, but they are available only under NXP NDA and only for NXP customers. So it looks like that end users with NXP wifi chips are out of luck. > They seem to be related, at least for some of the crashes, to power > management. For this reason I disabled powersaving in NetworkManager which > used to make it at least stable enough for me, in 5.13 this trick does not > seem to work. > > The dmesg log attached shows a firmware crash happening, the card does not > work even after a reset or remove & rescan on the pci(e) bus. Similar issue, card start working again only after whole system restart. So this is something which can be resolved only in NXP. Sharvari, could you look at this issue? It looks like there are many mwifiex driver or 88W8xxx firmware issues which are common both PCIe and SDIO modes. Also could you please look at way how NXP 88W8xxx firmware files could be updated in linux-firmware repository? > There are patches [1] which have not been submitted yet and where developed > as part of the linux-surface effort [2]. From my experience these patches > resolve most if not all of the firmware crashes. Is somebody going to cleanup these patches and send them for inclusion into mainline kernel? I see that most of them are PCIe related, but due to seeing same issues also on SDIO bus, I guess adding similar hooks also for SDIO could make also SDIO more stable... > Regards, > Dave Olsthoorn > > [1]: https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch > [2]: https://github.com/linux-surface > > dmesg.log: > [ 127.367628] mwifiex_pcie 0000:01:00.0: info: successfully disconnected > from xx:xx:xx:xx:xx:xx: reason code 3 > [ 127.371059] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.371083] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 127.371086] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.371088] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 127.371091] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.371093] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 127.371096] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.371099] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 127.371101] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.371103] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 127.374001] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.374028] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.374103] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.374113] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.375041] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.375311] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.375445] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.375903] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW in reset state > [ 127.379980] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... > [ 127.448638] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 127.457009] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 127.494749] mwifiex_pcie 0000:01:00.0: performing cancel_work_sync()... > [ 127.494756] mwifiex_pcie 0000:01:00.0: cancel_work_sync() done > [ 155.308155] mwifiex_pcie: PCI memory map Virt0: 00000000ff7d3684 PCI > memory map Virt2: 000000004aa1ebbb > [ 155.357139] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW > dnld > [ 155.357146] mwifiex_pcie 0000:01:00.0: WLAN FW is active > [ 155.510296] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex > 1.0 (15.68.19.p21) > [ 155.510319] mwifiex_pcie 0000:01:00.0: driver_version = mwifiex 1.0 > (15.68.19.p21) > [ 155.560724] mwifiex_pcie 0000:01:00.0 wlp1s0: renamed from mlan0 > [ 161.848939] mwifiex_pcie 0000:01:00.0: info: trying to associate to > bssid xx:xx:xx:xx:xx:xx > [ 162.027982] mwifiex_pcie 0000:01:00.0: info: associated to bssid > xx:xx:xx:xx:xx:xx successfully > [ 162.063062] IPv6: ADDRCONF(NETDEV_CHANGE): wlp1s0: link becomes ready > [ 173.452786] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout > cmd id = 0x10, act = 0x1 > [ 173.452867] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 > [ 173.452873] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 > [ 173.452877] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 > [ 173.452881] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 > [ 173.452885] mwifiex_pcie 0000:01:00.0: last_cmd_index = 1 > [ 173.452889] mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 28 00 a4 > 00 7f 00 > [ 173.452894] mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 13 00 > 00 00 00 00 > [ 173.452898] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 > [ 173.452905] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 28 > 80 a4 80 7f 80 > [ 173.452911] mwifiex_pcie 0000:01:00.0: last_event_index = 3 > [ 173.452918] mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 33 > 00 58 00 > [ 173.452925] mwifiex_pcie 0000:01:00.0: data_sent=1 cmd_sent=1 > [ 173.452930] mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 > [ 173.455878] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump > start=== > [ 173.455914] mwifiex_pcie 0000:01:00.0: info: MWIFIEX VERSION: mwifiex > 1.0 (15.68.19.p21) > [ 173.455925] mwifiex_pcie 0000:01:00.0: PCIE register dump start > [ 173.456014] mwifiex_pcie 0000:01:00.0: pcie scratch register: > [ 173.456026] mwifiex_pcie 0000:01:00.0: reg:0xcf0, value=0xfedcba00 > reg:0xcf8, value=0x9141cc > reg:0xcfc, value=0x28c8c00 > [ 173.456030] mwifiex_pcie 0000:01:00.0: PCIE register dump end > [ 173.456173] mwifiex_pcie 0000:01:00.0: ===mwifiex driverinfo dump end=== > [ 173.456177] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump start == > [ 177.548244] mwifiex_pcie 0000:01:00.0: cmd_wait_q terminated: -110 > [ 177.548341] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 177.548708] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 177.548712] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 182.816381] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 182.816389] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 183.100502] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 183.100524] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 183.463055] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 183.463059] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 189.465526] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 189.465538] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 195.464760] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 195.464770] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 201.467820] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 201.467826] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 207.463143] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 207.463154] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 213.462237] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 213.462244] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 219.464173] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 219.464180] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 223.116948] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 225.468213] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 225.468219] mwifiex_pcie 0000:01:00.0: failed to get signal information > [ 229.349824] mwifiex_pcie 0000:01:00.0: Firmware dump Finished! > [ 229.349835] mwifiex_pcie 0000:01:00.0: == mwifiex firmware dump end == > [ 229.350968] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to > /sys/class/devcoredump start > [ 229.354949] mwifiex_pcie 0000:01:00.0: == mwifiex dump information to > /sys/class/devcoredump end > [ 229.354981] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 229.356825] mwifiex_pcie 0000:01:00.0: PREP_CMD: FW is in bad state > [ 229.357175] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357184] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.357190] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357195] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.357200] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357205] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.357210] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357215] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.357219] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357224] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.357228] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.357233] mwifiex_pcie 0000:01:00.0: deleting the crypto keys > [ 229.359375] mwifiex_pcie 0000:01:00.0: info: shutdown mwifiex... > [ 229.362422] mwifiex_pcie 0000:01:00.0: rx_pending=0, tx_pending=1, > cmd_pending=0 > [ 229.493680] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.493700] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 229.509094] mwifiex_pcie 0000:01:00.0: PREP_CMD: card is removed > [ 230.653787] mwifiex_pcie 0000:01:00.0: WLAN FW already running! Skip FW > dnld > [ 230.653794] mwifiex_pcie 0000:01:00.0: WLAN FW is active > [ 241.037922] mwifiex_pcie 0000:01:00.0: mwifiex_cmd_timeout_func: Timeout > cmd id = 0xfa, act = 0x4000 > [ 241.038005] mwifiex_pcie 0000:01:00.0: num_data_h2c_failure = 0 > [ 241.038031] mwifiex_pcie 0000:01:00.0: num_cmd_h2c_failure = 0 > [ 241.038048] mwifiex_pcie 0000:01:00.0: is_cmd_timedout = 1 > [ 241.038067] mwifiex_pcie 0000:01:00.0: num_tx_timeout = 0 > [ 241.038086] mwifiex_pcie 0000:01:00.0: last_cmd_index = 2 > [ 241.038104] mwifiex_pcie 0000:01:00.0: last_cmd_id: 16 00 10 00 fa 00 a4 > 00 7f 00 > [ 241.038126] mwifiex_pcie 0000:01:00.0: last_cmd_act: 00 00 01 00 00 40 > 00 00 00 00 > [ 241.038147] mwifiex_pcie 0000:01:00.0: last_cmd_resp_index = 0 > [ 241.038168] mwifiex_pcie 0000:01:00.0: last_cmd_resp_id: 16 80 10 80 28 > 80 a4 80 7f 80 > [ 241.038187] mwifiex_pcie 0000:01:00.0: last_event_index = 3 > [ 241.038207] mwifiex_pcie 0000:01:00.0: last_event: 6a 00 17 00 2b 00 33 > 00 58 00 > [ 241.038226] mwifiex_pcie 0000:01:00.0: data_sent=0 cmd_sent=1 > [ 241.038246] mwifiex_pcie 0000:01:00.0: ps_mode=0 ps_state=0 > [ 241.040764] mwifiex_pcie 0000:01:00.0: info: _mwifiex_fw_dpc: unregister > device > [ 241.057936] mwifiex_pcie 0000:01:00.0: skipped cancel_work_sync() > because we're in card reset failure path > [ 241.060488] mwifiex: Failed to bring up adapter: -5 > [ 241.060504] mwifiex_pcie 0000:01:00.0: reinit failed: -5
Hi, I'll drop some of the people since this is a sub-thread of the original, I'll keep the lists for access to this using lore.kernel.org. On 2021-05-15 18:53, Pali Rohár wrote: > Hello! > > On Saturday 15 May 2021 18:32:30 Dave Olsthoorn wrote: >> Hi, >> >> On 2021-05-15 17:40, Pali Rohár wrote: >> > On Saturday 15 May 2021 17:10:31 Dave Olsthoorn wrote: >> > > The firmware still seems to crash quicker than previously, but >> > > that's a >> > > unrelated problem. >> > >> > Hello! Do you have some more details (or links) about mentioned firmware >> > crash? >> >> Sure, firmware crashes have always been a problem on the Surface >> devices. > > What wifi chip you have on these devices? Because very similar firmware > crashes I see on 88W8997 chip (also with mwifiex) when wifi card is > configured in SDIO mode (not PCIe). > The Surface Pro 2017 has an 88W8897. > I know that there are new version of firmwares for these 88W8xxx chips, > but they are available only under NXP NDA and only for NXP customers. > So it looks like that end users with NXP wifi chips are out of luck. > >> They seem to be related, at least for some of the crashes, to power >> management. For this reason I disabled powersaving in NetworkManager >> which >> used to make it at least stable enough for me, in 5.13 this trick does >> not >> seem to work. >> >> The dmesg log attached shows a firmware crash happening, the card does >> not >> work even after a reset or remove & rescan on the pci(e) bus. > > Similar issue, card start working again only after whole system > restart. > > So this is something which can be resolved only in NXP. After a conversation with the author of the patches, the problem is not the power management itself (for most hardware revisions [1]) but a race where pci commands are being written while the device is being put to sleep. A fix for this problem is included in the patches which make all pci commands synchronous instead of asynchronous [2]. After that a the wakeup patch seems relevant [3]. <snip> >> There are patches [1] which have not been submitted yet and where >> developed >> as part of the linux-surface effort [2]. From my experience these >> patches >> resolve most if not all of the firmware crashes. > > Is somebody going to cleanup these patches and send them for inclusion > into mainline kernel? I see that most of them are PCIe related, but due > to seeing same issues also on SDIO bus, I guess adding similar hooks > also for SDIO could make also SDIO more stable... The author plans to upstream them, he just hasn't gotten around to it. Regards, Dave [1]: https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L2237-L2338 [2]: https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L1152-L1207 [3]: https://github.com/linux-surface/linux-surface/blob/master/patches/5.12/0002-mwifiex.patch#L1992-L2079
On Fri, May 14, 2021 at 7:45 PM Brian Norris <briannorris@chromium.org> wrote: > > We can deadlock when rmmod'ing the driver or going through firmware > reset, because the cfg80211_unregister_wdev() has to bring down the link > for us, ... which then grab the same wiphy lock. ... > Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") > Cc: stable@vger.kernel.org > Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ > Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ > Reported-by: Maximilian Luz <luzmaximilian@gmail.com> > Reported-by: dave@bewaar.me > Cc: Johannes Berg <johannes@sipsolutions.net> > Signed-off-by: Brian Norris <briannorris@chromium.org> Ping - is this going to get merged? It's a 5.12 regression, and we have multiple people complaining about it (and they tested the fix too!). Thanks, Brian
Brian Norris <briannorris@chromium.org> writes: > On Fri, May 14, 2021 at 7:45 PM Brian Norris <briannorris@chromium.org> wrote: >> >> We can deadlock when rmmod'ing the driver or going through firmware >> reset, because the cfg80211_unregister_wdev() has to bring down the link >> for us, ... which then grab the same wiphy lock. > ... >> Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") >> Cc: stable@vger.kernel.org >> Link: >> https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ >> Link: >> https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ >> Reported-by: Maximilian Luz <luzmaximilian@gmail.com> >> Reported-by: dave@bewaar.me >> Cc: Johannes Berg <johannes@sipsolutions.net> >> Signed-off-by: Brian Norris <briannorris@chromium.org> > > Ping - is this going to get merged? It's a 5.12 regression, and we > have multiple people complaining about it (and they tested the fix > too!). Thanks for the ping, this got piled up under all the -next patches and I missed it. I'll look at it now.
Brian Norris <briannorris@chromium.org> wrote: > We can deadlock when rmmod'ing the driver or going through firmware > reset, because the cfg80211_unregister_wdev() has to bring down the link > for us, ... which then grab the same wiphy lock. > > nl80211_del_interface() already handles a very similar case, with a nice > description: > > /* > * We hold RTNL, so this is safe, without RTNL opencount cannot > * reach 0, and thus the rdev cannot be deleted. > * > * We need to do it for the dev_close(), since that will call > * the netdev notifiers, and we need to acquire the mutex there > * but don't know if we get there from here or from some other > * place (e.g. "ip link set ... down"). > */ > mutex_unlock(&rdev->wiphy.mtx); > ... > > Do similarly for mwifiex teardown, by ensuring we bring the link down > first. > > Sample deadlock trace: > > [ 247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds. > [ 247.110630] Not tainted 5.12.4 #5 > [ 247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 247.124557] task:rmmod state:D stack: 0 pid: 2119 ppid: 2114 flags:0x00400208 > [ 247.133905] Call trace: > [ 247.136644] __switch_to+0x130/0x170 > [ 247.140643] __schedule+0x714/0xa0c > [ 247.144548] schedule_preempt_disabled+0x88/0xf4 > [ 247.149714] __mutex_lock_common+0x43c/0x750 > [ 247.154496] mutex_lock_nested+0x5c/0x68 > [ 247.158884] cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211] > [ 247.165769] raw_notifier_call_chain+0x4c/0x78 > [ 247.170742] call_netdevice_notifiers_info+0x68/0xa4 > [ 247.176305] __dev_close_many+0x7c/0x138 > [ 247.180693] dev_close_many+0x7c/0x10c > [ 247.184893] unregister_netdevice_many+0xfc/0x654 > [ 247.190158] unregister_netdevice_queue+0xb4/0xe0 > [ 247.195424] _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211] > [ 247.201816] cfg80211_unregister_wdev+0x20/0x2c [cfg80211] > [ 247.208016] mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex] > [ 247.214174] mwifiex_uninit_sw+0x158/0x1b0 [mwifiex] > [ 247.219747] mwifiex_remove_card+0x38/0xa0 [mwifiex] > [ 247.225316] mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie] > [ 247.231451] pci_device_remove+0x50/0xe0 > [ 247.235849] device_release_driver_internal+0x110/0x1b0 > [ 247.241701] driver_detach+0x5c/0x9c > [ 247.245704] bus_remove_driver+0x84/0xb8 > [ 247.250095] driver_unregister+0x3c/0x60 > [ 247.254486] pci_unregister_driver+0x2c/0x90 > [ 247.259267] cleanup_module+0x18/0xcdc [mwifiex_pcie] > > Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") > Cc: stable@vger.kernel.org > Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ > Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ > Reported-by: Maximilian Luz <luzmaximilian@gmail.com> > Reported-by: dave@bewaar.me > Cc: Johannes Berg <johannes@sipsolutions.net> > Signed-off-by: Brian Norris <briannorris@chromium.org> > Tested-by: Maximilian Luz <luzmaximilian@gmail.com> > Tested-by: Dave Olsthoorn <dave@bewaar.me> Patch applied to wireless-drivers.git, thanks. 1f9482aa8d41 mwifiex: bring down link before deleting interface
diff --git a/drivers/net/wireless/marvell/mwifiex/main.c b/drivers/net/wireless/marvell/mwifiex/main.c index 529dfd8b7ae8..17399d4aa129 100644 --- a/drivers/net/wireless/marvell/mwifiex/main.c +++ b/drivers/net/wireless/marvell/mwifiex/main.c @@ -1445,11 +1445,18 @@ static void mwifiex_uninit_sw(struct mwifiex_adapter *adapter) if (!priv) continue; rtnl_lock(); - wiphy_lock(adapter->wiphy); if (priv->netdev && - priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) + priv->wdev.iftype != NL80211_IFTYPE_UNSPECIFIED) { + /* + * Close the netdev now, because if we do it later, the + * netdev notifiers will need to acquire the wiphy lock + * again --> deadlock. + */ + dev_close(priv->wdev.netdev); + wiphy_lock(adapter->wiphy); mwifiex_del_virtual_intf(adapter->wiphy, &priv->wdev); - wiphy_unlock(adapter->wiphy); + wiphy_unlock(adapter->wiphy); + } rtnl_unlock(); }
We can deadlock when rmmod'ing the driver or going through firmware reset, because the cfg80211_unregister_wdev() has to bring down the link for us, ... which then grab the same wiphy lock. nl80211_del_interface() already handles a very similar case, with a nice description: /* * We hold RTNL, so this is safe, without RTNL opencount cannot * reach 0, and thus the rdev cannot be deleted. * * We need to do it for the dev_close(), since that will call * the netdev notifiers, and we need to acquire the mutex there * but don't know if we get there from here or from some other * place (e.g. "ip link set ... down"). */ mutex_unlock(&rdev->wiphy.mtx); ... Do similarly for mwifiex teardown, by ensuring we bring the link down first. Sample deadlock trace: [ 247.103516] INFO: task rmmod:2119 blocked for more than 123 seconds. [ 247.110630] Not tainted 5.12.4 #5 [ 247.115796] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.124557] task:rmmod state:D stack: 0 pid: 2119 ppid: 2114 flags:0x00400208 [ 247.133905] Call trace: [ 247.136644] __switch_to+0x130/0x170 [ 247.140643] __schedule+0x714/0xa0c [ 247.144548] schedule_preempt_disabled+0x88/0xf4 [ 247.149714] __mutex_lock_common+0x43c/0x750 [ 247.154496] mutex_lock_nested+0x5c/0x68 [ 247.158884] cfg80211_netdev_notifier_call+0x280/0x4e0 [cfg80211] [ 247.165769] raw_notifier_call_chain+0x4c/0x78 [ 247.170742] call_netdevice_notifiers_info+0x68/0xa4 [ 247.176305] __dev_close_many+0x7c/0x138 [ 247.180693] dev_close_many+0x7c/0x10c [ 247.184893] unregister_netdevice_many+0xfc/0x654 [ 247.190158] unregister_netdevice_queue+0xb4/0xe0 [ 247.195424] _cfg80211_unregister_wdev+0xa4/0x204 [cfg80211] [ 247.201816] cfg80211_unregister_wdev+0x20/0x2c [cfg80211] [ 247.208016] mwifiex_del_virtual_intf+0xc8/0x188 [mwifiex] [ 247.214174] mwifiex_uninit_sw+0x158/0x1b0 [mwifiex] [ 247.219747] mwifiex_remove_card+0x38/0xa0 [mwifiex] [ 247.225316] mwifiex_pcie_remove+0xd0/0xe0 [mwifiex_pcie] [ 247.231451] pci_device_remove+0x50/0xe0 [ 247.235849] device_release_driver_internal+0x110/0x1b0 [ 247.241701] driver_detach+0x5c/0x9c [ 247.245704] bus_remove_driver+0x84/0xb8 [ 247.250095] driver_unregister+0x3c/0x60 [ 247.254486] pci_unregister_driver+0x2c/0x90 [ 247.259267] cleanup_module+0x18/0xcdc [mwifiex_pcie] Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-wireless/98392296-40ee-6300-369c-32e16cff3725@gmail.com/ Link: https://lore.kernel.org/linux-wireless/ab4d00ce52f32bd8e45ad0448a44737e@bewaar.me/ Reported-by: Maximilian Luz <luzmaximilian@gmail.com> Reported-by: dave@bewaar.me Cc: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Brian Norris <briannorris@chromium.org> --- drivers/net/wireless/marvell/mwifiex/main.c | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) --