Message ID | 20200108031957.22308-1-wgong@codeaurora.org |
---|---|
Headers | show |
Series | start recovery process when payload length overflow for sdio | expand |
Wen Gong <wgong@codeaurora.org> writes: > When it has more than one restart_work queued meanwhile, the 2nd > restart_work is very esay to break the 1st restart work and lead > recovery fail. > > Add a ref count to allow only one restart work running untill > device successfully recovered. > > Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00029. > > Signed-off-by: Wen Gong <wgong@codeaurora.org> > --- > drivers/net/wireless/ath/ath10k/core.c | 13 +++++++++++++ > drivers/net/wireless/ath/ath10k/core.h | 2 ++ > drivers/net/wireless/ath/ath10k/mac.c | 1 + > 3 files changed, 16 insertions(+) > > diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c > index 91f131b87efc..0e31846e6c89 100644 > --- a/drivers/net/wireless/ath/ath10k/core.c > +++ b/drivers/net/wireless/ath/ath10k/core.c > @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct work_struct *work) > { > struct ath10k *ar = container_of(work, struct ath10k, restart_work); > int ret; > + int restart_count; > + > + restart_count = atomic_add_return(1, &ar->restart_count); > + if (restart_count > 1) { > + ath10k_warn(ar, "can not restart, count: %d\n", restart_count); > + atomic_dec(&ar->restart_count); > + return; > + } I have been thinking a different approach for this. I think another option is to have a function like this: ath10k_core_firmware_crashed() { queue_work(ar->workqueue, &ar->restart_work); } In patch 1 we would convert all existing callers to call that function instead of queue_work() directly. In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet which one is better. Now the function would do: ath10k_core_firmware_crashed() { if (test_bit(flag)) return set_bit(flag) queue_work(ar->workqueue, &ar->restart_work); } That way restart_work queue would be called only one time. Though I'm not sure how ATH10K_STATE_WEDGED would behave after this change, it might get broken. Ah, actually I think even this patch breaks the WEDGED state. This firmware restart is tricky, difficult to say what is the best approach. Michal, are you reading? :) Any ideas? And after looking more about this patch I don't see the need for the new ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH would do the same thing AFAICS. And related to this, (in a separate patch) I think we should utilise ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to not even try to send a WMI command if the flag is set. Basically all hardware access should be disabled except what is needed to restart the firmware. -- https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
On 2020-08-15 01:19, Kalle Valo wrote: > Wen Gong <wgong@codeaurora.org> writes: > ... >> diff --git a/drivers/net/wireless/ath/ath10k/core.c >> b/drivers/net/wireless/ath/ath10k/core.c >> index 91f131b87efc..0e31846e6c89 100644 >> --- a/drivers/net/wireless/ath/ath10k/core.c >> +++ b/drivers/net/wireless/ath/ath10k/core.c >> @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct >> work_struct *work) >> { >> struct ath10k *ar = container_of(work, struct ath10k, restart_work); >> int ret; >> + int restart_count; >> + >> + restart_count = atomic_add_return(1, &ar->restart_count); >> + if (restart_count > 1) { >> + ath10k_warn(ar, "can not restart, count: %d\n", restart_count); >> + atomic_dec(&ar->restart_count); >> + return; >> + } > > I have been thinking a different approach for this. I think another > option is to have a function like this: > > ath10k_core_firmware_crashed() > { > queue_work(ar->workqueue, &ar->restart_work); > } > > In patch 1 we would convert all existing callers to call that > function instead of queue_work() directly. > > In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe > should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet > which one is better. Now the function would do: > > ath10k_core_firmware_crashed() > { > if (test_bit(flag)) > return > > set_bit(flag) > queue_work(ar->workqueue, &ar->restart_work); > } > > That way restart_work queue would be called only one time. > This is not muti-thread-safe, for example, if 2 thread entered to the test_bit(flag) meanwhile and both check pass, then it will have 2 restart. atomic_add_return is muti-thread-safe, if 2 thread entered it, only 1 thread can pass the check, another will fail and return. The "payload length exceeds max htc length for sdio" happened many times in a very short time, so I add this check for it. > Though I'm not sure how ATH10K_STATE_WEDGED would behave after this > change, it might get broken. Ah, actually I think even this patch > breaks > the WEDGED state. This firmware restart is tricky, difficult to say > what > is the best approach. Michal, are you reading? :) Any ideas? > > And after looking more about this patch I don't see the need for the > new > ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH > would do the same thing AFAICS. > > And related to this, (in a separate patch) I think we should utilise > ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to > not even try to send a WMI command if the flag is set. Basically all > hardware access should be disabled except what is needed to restart the > firmware.
On 2020-08-15 01:19, Kalle Valo wrote: ... > > I have been thinking a different approach for this. I think another > option is to have a function like this: > > ath10k_core_firmware_crashed() > { > queue_work(ar->workqueue, &ar->restart_work); > } > > In patch 1 we would convert all existing callers to call that > function instead of queue_work() directly. > > In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe > should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet > which one is better. Now the function would do: > > ath10k_core_firmware_crashed() > { > if (test_bit(flag)) > return > > set_bit(flag) > queue_work(ar->workqueue, &ar->restart_work); > } > It is better to clear_bit ATH10K_FLAG_CRASH_FLUSH/new flag in ath10k_reconfig_complete instead of ath10k_core_start because ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do many things and drv_start is 1st thing and drv_reconfig_complete is last thing. > That way restart_work queue would be called only one time. > > Though I'm not sure how ATH10K_STATE_WEDGED would behave after this > change, it might get broken. Ah, actually I think even this patch > breaks > the WEDGED state. This firmware restart is tricky, difficult to say > what > is the best approach. Michal, are you reading? :) Any ideas? > > And after looking more about this patch I don't see the need for the > new > ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH > would do the same thing AFAICS. > > And related to this, (in a separate patch) I think we should utilise > ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to > not even try to send a WMI command if the flag is set. Basically all > hardware access should be disabled except what is needed to restart the > firmware.
On 2020-08-15 01:19, Kalle Valo wrote: ... > > I have been thinking a different approach for this. I think another > option is to have a function like this: > > ath10k_core_firmware_crashed() > { > queue_work(ar->workqueue, &ar->restart_work); > } > > In patch 1 we would convert all existing callers to call that > function instead of queue_work() directly. > > In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe > should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet > which one is better. Now the function would do: I thinks we can use test_and_set_bit for atomic operation athough it is same with restart_count. and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, if still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it from ath10k_core_start to ath10k_reconfig_complete,because ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do many things and drv_start is 1st thing and drv_reconfig_complete is last thing, drv_reconfig_complete done means the restart finished. I will send patch v5 with above changes if not other advise. > > ath10k_core_firmware_crashed() > { > if (test_bit(flag)) > return > > set_bit(flag) > queue_work(ar->workqueue, &ar->restart_work); > } > > That way restart_work queue would be called only one time. > > Though I'm not sure how ATH10K_STATE_WEDGED would behave after this > change, it might get broken. Ah, actually I think even this patch > breaks > the WEDGED state. This firmware restart is tricky, difficult to say > what > is the best approach. Michal, are you reading? :) Any ideas? > > And after looking more about this patch I don't see the need for the > new > ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH > would do the same thing AFAICS. > > And related to this, (in a separate patch) I think we should utilise > ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to > not even try to send a WMI command if the flag is set. Basically all > hardware access should be disabled except what is needed to restart the > firmware.
On 2020-08-20 17:18, Wen Gong wrote: ... > I thinks we can use test_and_set_bit for atomic operation athough it > is same with restart_count. > and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, > if still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it > from > ath10k_core_start to ath10k_reconfig_complete,because > ieee80211_reconfig(called by > ieee80211_restart_work) > of mac80211 do many things and drv_start is 1st thing and > drv_reconfig_complete is last thing, drv_reconfig_complete done means > the restart > finished. > > I will send patch v5 with above changes if not other advise. v5 sent: https://patchwork.kernel.org/patch/11728101/ ...
Wen Gong <wgong@codeaurora.org> writes: > On 2020-08-15 01:19, Kalle Valo wrote: >> Wen Gong <wgong@codeaurora.org> writes: >> > ... >>> diff --git a/drivers/net/wireless/ath/ath10k/core.c >>> b/drivers/net/wireless/ath/ath10k/core.c >>> index 91f131b87efc..0e31846e6c89 100644 >>> --- a/drivers/net/wireless/ath/ath10k/core.c >>> +++ b/drivers/net/wireless/ath/ath10k/core.c >>> @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct >>> work_struct *work) >>> { >>> struct ath10k *ar = container_of(work, struct ath10k, restart_work); >>> int ret; >>> + int restart_count; >>> + >>> + restart_count = atomic_add_return(1, &ar->restart_count); >>> + if (restart_count > 1) { >>> + ath10k_warn(ar, "can not restart, count: %d\n", restart_count); >>> + atomic_dec(&ar->restart_count); >>> + return; >>> + } >> >> I have been thinking a different approach for this. I think another >> option is to have a function like this: >> >> ath10k_core_firmware_crashed() >> { >> queue_work(ar->workqueue, &ar->restart_work); >> } >> >> In patch 1 we would convert all existing callers to call that >> function instead of queue_work() directly. >> >> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe >> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet >> which one is better. Now the function would do: >> >> ath10k_core_firmware_crashed() >> { >> if (test_bit(flag)) >> return >> >> set_bit(flag) >> queue_work(ar->workqueue, &ar->restart_work); >> } >> >> That way restart_work queue would be called only one time. > > This is not muti-thread-safe, for example, if 2 thread entered to the > test_bit(flag) meanwhile and both check pass, then it will have 2 > restart. Good point, this was racy. And I see that you found test_and_set_bit() already to fix the race. > atomic_add_return is muti-thread-safe, if 2 thread entered it, only 1 > thread can pass the check, another will fail and return. I'm not going to add new state variables unless the justification is REALLY strong and sound. This firmware restart is complicated as is already, there's no reason to complicate it even more. -- https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Wen Gong <wgong@codeaurora.org> writes: > On 2020-08-15 01:19, Kalle Valo wrote: > ... >> >> I have been thinking a different approach for this. I think another >> option is to have a function like this: >> >> ath10k_core_firmware_crashed() >> { >> queue_work(ar->workqueue, &ar->restart_work); >> } >> >> In patch 1 we would convert all existing callers to call that >> function instead of queue_work() directly. >> >> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe >> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet >> which one is better. Now the function would do: > > I thinks we can use test_and_set_bit for atomic operation athough it > is same with restart_count. I didn't quite understand what you mean here, but using test_and_set_bit() is the correct methdo. > and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, if > still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it > from ath10k_core_start to ath10k_reconfig_complete,because > ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do > many things and drv_start is 1st thing and drv_reconfig_complete is > last thing, drv_reconfig_complete done means the restart finished. Ok, let's discuss about that in v5. I hope you added the analysis to the commit log. -- https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
On 2020-09-07 23:55, Kalle Valo wrote: > Wen Gong <wgong@codeaurora.org> writes: > >> On 2020-08-15 01:19, Kalle Valo wrote: >> ... >>> >>> I have been thinking a different approach for this. I think another >>> option is to have a function like this: >>> >>> ath10k_core_firmware_crashed() >>> { >>> queue_work(ar->workqueue, &ar->restart_work); >>> } >>> >>> In patch 1 we would convert all existing callers to call that >>> function instead of queue_work() directly. >>> >>> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe >>> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet >>> which one is better. Now the function would do: >> >> I thinks we can use test_and_set_bit for atomic operation athough it >> is same with restart_count. > > I didn't quite understand what you mean here, but using > test_and_set_bit() is the correct methdo. > >> and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, if >> still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it >> from ath10k_core_start to ath10k_reconfig_complete,because >> ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do >> many things and drv_start is 1st thing and drv_reconfig_complete is >> last thing, drv_reconfig_complete done means the restart finished. > > Ok, let's discuss about that in v5. I hope you added the analysis to > the > commit log. yes, v5 have alread sent in https://patchwork.kernel.org/patch/11728101/ [v5] ath10k: add atomic protection for device recovery