[v4,0/2] start recovery process when payload length overflow for sdio

Message ID	20200108031957.22308-1-wgong@codeaurora.org
Headers	show Return-Path: <SRS0=curT=25=vger.kernel.org=linux-wireless-owner@kernel.org> MIME-Version: Message-Id: Date: Subject: Cc: To: From: Sender; bh=F7JyZ/oMEmFgVBHR7Mu5PKIJmLDZQX7TveUow6wRTL4=; b=VCRJ7Agy7+SpJnDoizu4KAyx/gvjvAr9kZxZkBqJU8EFWNfZfpwOpeQVBMk/IPyHo0Do8g24 Jo+7aTpRgpeJ6up6YCqK/jkIbknDJTZeJihWA2tsX3ie+6lz4ifbBP6rsGl5IJnRQ7ufSPyX jM5i2Na+DeS9qEfgcUhZLG2whNU= sender: wgong) by smtp.codeaurora.org (Postfix) with ESMTPSA id CA437C433CB; Wed, 8 Jan 2020 03:20:05 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org CA437C433CB From: Wen Gong <wgong@codeaurora.org> To: ath10k@lists.infradead.org Cc: linux-wireless@vger.kernel.org Subject: [PATCH v4 0/2] start recovery process when payload length overflow for sdio Date: Wed, 8 Jan 2020 11:19:55 +0800 Message-Id: <20200108031957.22308-1-wgong@codeaurora.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-wireless-owner@vger.kernel.org Precedence: bulk
Series	start recovery process when payload length overflow for sdio \| expand [v4,0/2] start recovery process when payload length overflow for sdio [v4,2/2] ath10k: start recovery process when payload length exceeds max htc length for sdio

Wen Gong Jan. 8, 2020, 3:19 a.m. UTC

when it happened payload length exceeds max htc length, start recovery process

v4: add atomic_dec(&ar->restart_count) if state is not on

v3: change atomic_inc_and_test to atomic_add_return, remove check of ATH10K_STATE_ON

v2: add "add refcount for ath10k_core_restart" and remove ar_sdio->can_recovery

Wen Gong (2):
  ath10k: add refcount for ath10k_core_restart
  ath10k: start recovery process when payload length exceeds max htc
    length for sdio

 drivers/net/wireless/ath/ath10k/core.c | 13 +++++++++++++
 drivers/net/wireless/ath/ath10k/core.h |  2 ++
 drivers/net/wireless/ath/ath10k/mac.c  |  1 +
 drivers/net/wireless/ath/ath10k/sdio.c |  4 ++++
 4 files changed, 20 insertions(+)

Kalle Valo Aug. 14, 2020, 5:19 p.m. UTC | #1

Wen Gong <wgong@codeaurora.org> writes:

> When it has more than one restart_work queued meanwhile, the 2nd

> restart_work is very esay to break the 1st restart work and lead

> recovery fail.

>

> Add a ref count to allow only one restart work running untill

> device successfully recovered.

>

> Tested with QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00029.

>

> Signed-off-by: Wen Gong <wgong@codeaurora.org>

> ---

>  drivers/net/wireless/ath/ath10k/core.c | 13 +++++++++++++

>  drivers/net/wireless/ath/ath10k/core.h |  2 ++

>  drivers/net/wireless/ath/ath10k/mac.c  |  1 +

>  3 files changed, 16 insertions(+)

>

> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c

> index 91f131b87efc..0e31846e6c89 100644

> --- a/drivers/net/wireless/ath/ath10k/core.c

> +++ b/drivers/net/wireless/ath/ath10k/core.c

> @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct work_struct *work)

>  {

>  	struct ath10k *ar = container_of(work, struct ath10k, restart_work);

>  	int ret;

> +	int restart_count;

> +

> +	restart_count = atomic_add_return(1, &ar->restart_count);

> +	if (restart_count > 1) {

> +		ath10k_warn(ar, "can not restart, count: %d\n", restart_count);

> +		atomic_dec(&ar->restart_count);

> +		return;

> +	}

I have been thinking a different approach for this. I think another
option is to have a function like this:

ath10k_core_firmware_crashed()
{
        queue_work(ar->workqueue, &ar->restart_work);
}

In patch 1 we would convert all existing callers to call that
function instead of queue_work() directly.

In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe
should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet
which one is better. Now the function would do:

ath10k_core_firmware_crashed()
{
        if (test_bit(flag))
                return

        set_bit(flag)                                
	queue_work(ar->workqueue, &ar->restart_work);
}

That way restart_work queue would be called only one time.

Though I'm not sure how ATH10K_STATE_WEDGED would behave after this
change, it might get broken. Ah, actually I think even this patch breaks
the WEDGED state. This firmware restart is tricky, difficult to say what
is the best approach. Michal, are you reading? :) Any ideas?

And after looking more about this patch I don't see the need for the new
ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH
would do the same thing AFAICS.

And related to this, (in a separate patch) I think we should utilise
ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to
not even try to send a WMI command if the flag is set. Basically all
hardware access should be disabled except what is needed to restart the
firmware.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Wen Gong Aug. 18, 2020, 8:39 a.m. UTC | #2

On 2020-08-15 01:19, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:

> 

...
>> diff --git a/drivers/net/wireless/ath/ath10k/core.c 

>> b/drivers/net/wireless/ath/ath10k/core.c

>> index 91f131b87efc..0e31846e6c89 100644

>> --- a/drivers/net/wireless/ath/ath10k/core.c

>> +++ b/drivers/net/wireless/ath/ath10k/core.c

>> @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct 

>> work_struct *work)

>>  {

>>  	struct ath10k *ar = container_of(work, struct ath10k, restart_work);

>>  	int ret;

>> +	int restart_count;

>> +

>> +	restart_count = atomic_add_return(1, &ar->restart_count);

>> +	if (restart_count > 1) {

>> +		ath10k_warn(ar, "can not restart, count: %d\n", restart_count);

>> +		atomic_dec(&ar->restart_count);

>> +		return;

>> +	}

> 

> I have been thinking a different approach for this. I think another

> option is to have a function like this:

> 

> ath10k_core_firmware_crashed()

> {

>         queue_work(ar->workqueue, &ar->restart_work);

> }

> 

> In patch 1 we would convert all existing callers to call that

> function instead of queue_work() directly.

> 

> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

> which one is better. Now the function would do:

> 

> ath10k_core_firmware_crashed()

> {

>         if (test_bit(flag))

>                 return

> 

>         set_bit(flag)

> 	queue_work(ar->workqueue, &ar->restart_work);

> }

> 

> That way restart_work queue would be called only one time.

> 

This is not muti-thread-safe, for example, if 2 thread entered to the 
test_bit(flag) meanwhile
and both check pass, then it will have 2 restart.

atomic_add_return is muti-thread-safe, if 2 thread entered it, only 1 
thread can pass
the check, another will fail and return.

The "payload length exceeds max htc length for sdio" happened many times 
in a very short time,
so I add this check for it.

> Though I'm not sure how ATH10K_STATE_WEDGED would behave after this

> change, it might get broken. Ah, actually I think even this patch 

> breaks

> the WEDGED state. This firmware restart is tricky, difficult to say 

> what

> is the best approach. Michal, are you reading? :) Any ideas?

> 

> And after looking more about this patch I don't see the need for the 

> new

> ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH

> would do the same thing AFAICS.

> 

> And related to this, (in a separate patch) I think we should utilise

> ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to

> not even try to send a WMI command if the flag is set. Basically all

> hardware access should be disabled except what is needed to restart the

> firmware.

Wen Gong Aug. 19, 2020, 12:01 p.m. UTC | #3

On 2020-08-15 01:19, Kalle Valo wrote:
...
> 

> I have been thinking a different approach for this. I think another

> option is to have a function like this:

> 

> ath10k_core_firmware_crashed()

> {

>         queue_work(ar->workqueue, &ar->restart_work);

> }

> 

> In patch 1 we would convert all existing callers to call that

> function instead of queue_work() directly.

> 

> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

> which one is better. Now the function would do:

> 

> ath10k_core_firmware_crashed()

> {

>         if (test_bit(flag))

>                 return

> 

>         set_bit(flag)

> 	queue_work(ar->workqueue, &ar->restart_work);

> }

> 

It is better to clear_bit ATH10K_FLAG_CRASH_FLUSH/new flag in 
ath10k_reconfig_complete
instead of ath10k_core_start because ieee80211_reconfig(called by 
ieee80211_restart_work)
of mac80211 do many things and drv_start is 1st thing and 
drv_reconfig_complete is last thing.

> That way restart_work queue would be called only one time.

> 

> Though I'm not sure how ATH10K_STATE_WEDGED would behave after this

> change, it might get broken. Ah, actually I think even this patch 

> breaks

> the WEDGED state. This firmware restart is tricky, difficult to say 

> what

> is the best approach. Michal, are you reading? :) Any ideas?

> 

> And after looking more about this patch I don't see the need for the 

> new

> ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH

> would do the same thing AFAICS.

> 

> And related to this, (in a separate patch) I think we should utilise

> ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to

> not even try to send a WMI command if the flag is set. Basically all

> hardware access should be disabled except what is needed to restart the

> firmware.

Wen Gong Aug. 20, 2020, 9:18 a.m. UTC | #4

On 2020-08-15 01:19, Kalle Valo wrote:
...
> 

> I have been thinking a different approach for this. I think another

> option is to have a function like this:

> 

> ath10k_core_firmware_crashed()

> {

>         queue_work(ar->workqueue, &ar->restart_work);

> }

> 

> In patch 1 we would convert all existing callers to call that

> function instead of queue_work() directly.

> 

> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

> which one is better. Now the function would do:

I thinks we can use test_and_set_bit for atomic operation athough it is 
same with restart_count.
and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush,
if still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it 
from
ath10k_core_start to ath10k_reconfig_complete,because 
ieee80211_reconfig(called by
ieee80211_restart_work)
of mac80211 do many things and drv_start is 1st thing and
drv_reconfig_complete is last thing, drv_reconfig_complete done means 
the restart
finished.

I will send patch v5 with above changes if not other advise.
> 

> ath10k_core_firmware_crashed()

> {

>         if (test_bit(flag))

>                 return

> 

>         set_bit(flag)

> 	queue_work(ar->workqueue, &ar->restart_work);

> }

> 

> That way restart_work queue would be called only one time.

> 

> Though I'm not sure how ATH10K_STATE_WEDGED would behave after this

> change, it might get broken. Ah, actually I think even this patch 

> breaks

> the WEDGED state. This firmware restart is tricky, difficult to say 

> what

> is the best approach. Michal, are you reading? :) Any ideas?

> 

> And after looking more about this patch I don't see the need for the 

> new

> ar->restart_count atomic variable. Checking for ATH10K_FLAG_CRASH_FLUSH

> would do the same thing AFAICS.

> 

> And related to this, (in a separate patch) I think we should utilise

> ATH10K_FLAG_CRASH_FLUSH more. For example in ath10k_wmi_cmd_send() to

> not even try to send a WMI command if the flag is set. Basically all

> hardware access should be disabled except what is needed to restart the

> firmware.

Wen Gong Aug. 24, 2020, 4:36 a.m. UTC | #5

On 2020-08-20 17:18, Wen Gong wrote:
...
> I thinks we can use test_and_set_bit for atomic operation athough it

> is same with restart_count.

> and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush,

> if still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it 

> from

> ath10k_core_start to ath10k_reconfig_complete,because

> ieee80211_reconfig(called by

> ieee80211_restart_work)

> of mac80211 do many things and drv_start is 1st thing and

> drv_reconfig_complete is last thing, drv_reconfig_complete done means

> the restart

> finished.

> 

> I will send patch v5 with above changes if not other advise.

v5 sent: https://patchwork.kernel.org/patch/11728101/
...

Kalle Valo Sept. 7, 2020, 3:52 p.m. UTC | #6

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-08-15 01:19, Kalle Valo wrote:

>> Wen Gong <wgong@codeaurora.org> writes:

>>

> ...

>>> diff --git a/drivers/net/wireless/ath/ath10k/core.c

>>> b/drivers/net/wireless/ath/ath10k/core.c

>>> index 91f131b87efc..0e31846e6c89 100644

>>> --- a/drivers/net/wireless/ath/ath10k/core.c

>>> +++ b/drivers/net/wireless/ath/ath10k/core.c

>>> @@ -2199,6 +2199,14 @@ static void ath10k_core_restart(struct

>>> work_struct *work)

>>>  {

>>>  	struct ath10k *ar = container_of(work, struct ath10k, restart_work);

>>>  	int ret;

>>> +	int restart_count;

>>> +

>>> +	restart_count = atomic_add_return(1, &ar->restart_count);

>>> +	if (restart_count > 1) {

>>> +		ath10k_warn(ar, "can not restart, count: %d\n", restart_count);

>>> +		atomic_dec(&ar->restart_count);

>>> +		return;

>>> +	}

>>

>> I have been thinking a different approach for this. I think another

>> option is to have a function like this:

>>

>> ath10k_core_firmware_crashed()

>> {

>>         queue_work(ar->workqueue, &ar->restart_work);

>> }

>>

>> In patch 1 we would convert all existing callers to call that

>> function instead of queue_work() directly.

>>

>> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

>> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

>> which one is better. Now the function would do:

>>

>> ath10k_core_firmware_crashed()

>> {

>>         if (test_bit(flag))

>>                 return

>>

>>         set_bit(flag)

>> 	queue_work(ar->workqueue, &ar->restart_work);

>> }

>>

>> That way restart_work queue would be called only one time.

>

> This is not muti-thread-safe, for example, if 2 thread entered to the

> test_bit(flag) meanwhile and both check pass, then it will have 2

> restart.


Good point, this was racy. And I see that you found test_and_set_bit()
already to fix the race.

> atomic_add_return is muti-thread-safe, if 2 thread entered it, only 1

> thread can pass the check, another will fail and return.


I'm not going to add new state variables unless the justification is
REALLY strong and sound. This firmware restart is complicated as is
already, there's no reason to complicate it even more.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Kalle Valo Sept. 7, 2020, 3:55 p.m. UTC | #7

Wen Gong <wgong@codeaurora.org> writes:

> On 2020-08-15 01:19, Kalle Valo wrote:

> ...

>>

>> I have been thinking a different approach for this. I think another

>> option is to have a function like this:

>>

>> ath10k_core_firmware_crashed()

>> {

>>         queue_work(ar->workqueue, &ar->restart_work);

>> }

>>

>> In patch 1 we would convert all existing callers to call that

>> function instead of queue_work() directly.

>>

>> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

>> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

>> which one is better. Now the function would do:

>

> I thinks we can use test_and_set_bit for atomic operation athough it

> is same with restart_count.


I didn't quite understand what you mean here, but using
test_and_set_bit() is the correct methdo.

> and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, if

> still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it

> from ath10k_core_start to ath10k_reconfig_complete,because

> ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do

> many things and drv_start is 1st thing and drv_reconfig_complete is

> last thing, drv_reconfig_complete done means the restart finished.


Ok, let's discuss about that in v5. I hope you added the analysis to the
commit log.

-- 
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

Wen Gong Sept. 8, 2020, 3:47 a.m. UTC | #8

On 2020-09-07 23:55, Kalle Valo wrote:
> Wen Gong <wgong@codeaurora.org> writes:

> 

>> On 2020-08-15 01:19, Kalle Valo wrote:

>> ...

>>> 

>>> I have been thinking a different approach for this. I think another

>>> option is to have a function like this:

>>> 

>>> ath10k_core_firmware_crashed()

>>> {

>>>         queue_work(ar->workqueue, &ar->restart_work);

>>> }

>>> 

>>> In patch 1 we would convert all existing callers to call that

>>> function instead of queue_work() directly.

>>> 

>>> In patch 2 we would add a new flag to enum ath10k_dev_flags, or maybe

>>> should actually use existing ATH10K_FLAG_CRASH_FLUSH? Don't know yet

>>> which one is better. Now the function would do:

>> 

>> I thinks we can use test_and_set_bit for atomic operation athough it

>> is same with restart_count.

> 

> I didn't quite understand what you mean here, but using

> test_and_set_bit() is the correct methdo.

> 

>> and add a new flag, ATH10K_FLAG_CRASH_FLUSH is used for flush, if

>> still use ATH10K_FLAG_CRASH_FLUSH, it should change clear_bit of it

>> from ath10k_core_start to ath10k_reconfig_complete,because

>> ieee80211_reconfig(called by ieee80211_restart_work) of mac80211 do

>> many things and drv_start is 1st thing and drv_reconfig_complete is

>> last thing, drv_reconfig_complete done means the restart finished.

> 

> Ok, let's discuss about that in v5. I hope you added the analysis to 

> the

> commit log.

yes, v5 have alread sent in https://patchwork.kernel.org/patch/11728101/
[v5] ath10k: add atomic protection for device recovery

[v4,0/2] start recovery process when payload length overflow for sdio

Message

Comments