mbox series

[0/2] mwifiex: Work around firmware bugs on 88W8897 chip

Message ID 20210830123704.221494-1-verdre@v0yd.nl
Headers show
Series mwifiex: Work around firmware bugs on 88W8897 chip | expand

Message

Jonas Dreßler Aug. 30, 2021, 12:37 p.m. UTC
Two patches which should fix all the random resets of the Marvell 88W8897 PCIe chip.
The first one works around a bug in the firmware that causes it to crash, the second
one makes us try harder to wake up the firmware before we consider it non-responsive.
More explanations in the commit messages.

Jonas Dreßler (2):
  mwifiex: Use non-posted PCI register writes
  mwifiex: Try waking the firmware until we get an interrupt

 drivers/net/wireless/marvell/mwifiex/pcie.c | 35 +++++++++++++++++----
 1 file changed, 29 insertions(+), 6 deletions(-)

Comments

Jonas Dreßler Sept. 1, 2021, 2:01 p.m. UTC | #1
On 8/30/21 2:49 PM, Andy Shevchenko wrote:
 > On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

 >>

 >> On the 88W8897 card it's very important the TX ring write pointer is

 >> updated correctly to its new value before setting the TX ready

 >> interrupt, otherwise the firmware appears to crash (probably because

 >> it's trying to DMA-read from the wrong place).

 >>

 >> Since PCI uses "posted writes" when writing to a register, it's not

 >> guaranteed that a write will happen immediately. That means the pointer

 >> might be outdated when setting the TX ready interrupt, leading to

 >> firmware crashes especially when ASPM L1 and L1 substates are enabled

 >> (because of the higher link latency, the write will probably take

 >> longer).

 >>

 >> So fix those firmware crashes by always forcing non-posted writes. We do

 >> that by simply reading back the register after writing it, just as a lot

 >> of other drivers do.

 >>

 >> There are two reproducers that are fixed with this patch:

 >>

 >> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled

 >> substates are platform dependent), the firmware crashes and eventually a

 >> command timeout appears in the logs. That crash is fixed by using a

 >> non-posted write in mwifiex_pcie_send_data().

 >>

 >> 2) When sending lots of commands to the card, waking it up from sleep in

 >> very quick intervals, the firmware eventually crashes. That crash

 >> appears to be fixed by some other non-posted write included here.

 >

 > Thanks for all this work!

 >

 > Nevertheless, do we have any commits that may be a good candidate to

 > be in the Fixes tag here?

 >


I don't think there's any commit we could point to, given that the bug 
is probably somewhere in the firmware code.

 >> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>

 >> ---

 >>   drivers/net/wireless/marvell/mwifiex/pcie.c | 6 ++++++

 >>   1 file changed, 6 insertions(+)

 >>

 >> diff --git a/drivers/net/wireless/marvell/mwifiex/pcie.c 

b/drivers/net/wireless/marvell/mwifiex/pcie.c
 >> index c6ccce426b49..bfd6e135ed99 100644

 >> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c

 >> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c

 >> @@ -237,6 +237,12 @@ static int mwifiex_write_reg(struct 

mwifiex_adapter *adapter, int reg, u32 data)
 >>

 >>          iowrite32(data, card->pci_mmap1 + reg);

 >>

 >> +       /* Do a read-back, which makes the write non-posted, 

ensuring the
 >> +        * completion before returning.

 >

 >> +        * The firmware of the 88W8897 card is buggy and this avoids 

crashes.
 >

 > Any firmware version reference? Would be nice to have just for the

 > sake of record.

 >


Pretty sure the crash is present in every firmware that has been 
released, I've tried most of them. FTR, the current firmware version is 
15.68.19.p21.

 >> +        */

 >> +       ioread32(card->pci_mmap1 + reg);

 >> +

 >>          return 0;

 >>   }

 >

 >
Heiner Kallweit Sept. 1, 2021, 7:07 p.m. UTC | #2
On 01.09.2021 19:07, Johannes Berg wrote:
> On Wed, 2021-09-01 at 18:51 +0200, Heiner Kallweit wrote:

>> On 01.09.2021 17:51, Pali Rohár wrote:

>>> On Wednesday 01 September 2021 16:01:54 Jonas Dreßler wrote:

>>>> On 8/30/21 2:49 PM, Andy Shevchenko wrote:

>>>>> On Mon, Aug 30, 2021 at 3:38 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

>>>>>>

>>>>>> On the 88W8897 card it's very important the TX ring write pointer is

>>>>>> updated correctly to its new value before setting the TX ready

>>>>>> interrupt, otherwise the firmware appears to crash (probably because

>>>>>> it's trying to DMA-read from the wrong place).

>>>>>>

>>

>> This sounds somehow like the typical case where you write DMA descriptors

>> and then ring the doorbell. This normally requires a dma_wmb().

>> Maybe something like that is missing here?

> 

> But it looks like this "TX ring write pointer" is actually the register?

> 

> However, I would agree that doing it in mwifiex_write_reg() is possibly

> too big a hammer - could be done only for reg->tx_wrptr, not all the

> registers?

> 

> Actually, can two writes actually cross on PCI?

> 

> johannes

> 


In case we're talking about the following piece of code both register
writes are IOMEM writes that are ordered. Maybe the writes arrive properly
ordered but some chip-internal delays cause the issue? Then the read-back
would be something like an ordinary udelay()?
Instead of always reading back register writes, is it sufficient to read
an arbitrary register after mwifiex_write_reg(adapter, reg->tx_wrptr ?


		/* Write the TX ring write pointer in to reg->tx_wrptr */
		if (mwifiex_write_reg(adapter, reg->tx_wrptr,
				      card->txbd_wrptr | rx_val)) {
			mwifiex_dbg(adapter, ERROR,
				    "SEND DATA: failed to write reg->tx_wrptr\n");
			ret = -1;
			goto done_unmap;
		}
		if ((mwifiex_pcie_txbd_not_full(card)) &&
		    tx_param->next_pkt_len) {
			/* have more packets and TxBD still can hold more */
			mwifiex_dbg(adapter, DATA,
				    "SEND DATA: delay dnld-rdy interrupt.\n");
			adapter->data_sent = false;
		} else {
			/* Send the TX ready interrupt */
			if (mwifiex_write_reg(adapter, PCIE_CPU_INT_EVENT,
					      CPU_INTR_DNLD_RDY)) {
Brian Norris Sept. 1, 2021, 7:40 p.m. UTC | #3
On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>
> On the 88W8897 card it's very important the TX ring write pointer is
> updated correctly to its new value before setting the TX ready
> interrupt, otherwise the firmware appears to crash (probably because
> it's trying to DMA-read from the wrong place).
>
> Since PCI uses "posted writes" when writing to a register, it's not
> guaranteed that a write will happen immediately. That means the pointer
> might be outdated when setting the TX ready interrupt, leading to
> firmware crashes especially when ASPM L1 and L1 substates are enabled
> (because of the higher link latency, the write will probably take
> longer).
>
> So fix those firmware crashes by always forcing non-posted writes. We do
> that by simply reading back the register after writing it, just as a lot
> of other drivers do.
>
> There are two reproducers that are fixed with this patch:
>
> 1) During rx/tx traffic and with ASPM L1 substates enabled (the enabled
> substates are platform dependent), the firmware crashes and eventually a
> command timeout appears in the logs. That crash is fixed by using a
> non-posted write in mwifiex_pcie_send_data().
>
> 2) When sending lots of commands to the card, waking it up from sleep in
> very quick intervals, the firmware eventually crashes. That crash
> appears to be fixed by some other non-posted write included here.
>
> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>

This might be good for many cases, but please read this commit:

https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3
mwifiex: pcie: use posted write to wake up firmware

It's very much intentional that this is a posted write in some cases.

Without ensuring this doesn't regress, NAK from me.

Brian
Andy Shevchenko Sept. 1, 2021, 8:40 p.m. UTC | #4
On Wed, Sep 1, 2021 at 11:25 PM Brian Norris <briannorris@chromium.org> wrote:
> On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:


...

> This might be good for many cases, but please read this commit:

>

> https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3

> mwifiex: pcie: use posted write to wake up firmware

>

> It's very much intentional that this is a posted write in some cases.

>

> Without ensuring this doesn't regress, NAK from me.


Can you ensure that from Chrome / Google perspective, please?

-- 
With Best Regards,
Andy Shevchenko
Brian Norris Sept. 1, 2021, 9:04 p.m. UTC | #5
On Wed, Sep 1, 2021 at 1:41 PM Andy Shevchenko
<andy.shevchenko@gmail.com> wrote:
>

> On Wed, Sep 1, 2021 at 11:25 PM Brian Norris <briannorris@chromium.org> wrote:

> > On Mon, Aug 30, 2021 at 5:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:

>

> ...

>

> > This might be good for many cases, but please read this commit:

> >

> > https://git.kernel.org/linus/062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3

> > mwifiex: pcie: use posted write to wake up firmware

> >

> > It's very much intentional that this is a posted write in some cases.

> >

> > Without ensuring this doesn't regress, NAK from me.

>

> Can you ensure that from Chrome / Google perspective, please?


I mean, the context link should be pretty obvious. But sure: with this patch:

(a) all my least-favorite audio issues to debug return (audio is
especially latency-sensitive on the Rockchip RK3399 systems I have)
(b) latency spikes to ~6ms:
# trace-cmd record -p function_graph -O funcgraph-abstime -l
mwifiex_pm_wakeup_card
# trace-cmd report
   kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      #
6219.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      #
6267.625 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      #
6184.250 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      #
6553.166 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      #
6212.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      #
6230.292 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      #
6451.667 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      #
6204.042 us |  mwifiex_pm_wakeup_card();

whereas it used to look more like:

   kworker/u13:1-173   [005]   212.230542: funcgraph_entry:
7.000 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:
9.334 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      +
11.375 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:
5.542 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      +
13.125 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      +
11.959 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:
8.459 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      +
15.166 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      +
11.083 us  |  mwifiex_pm_wakeup_card();

I think it would be enough to only modify mwifiex_pm_wakeup_card() to
retain the posted version, but I'm not sure how that fits in with the
rest of the author's work. I suppose I can give a later version a run
if needed.

Brian
Brian Norris Sept. 1, 2021, 9:07 p.m. UTC | #6
Apologies for the brain-dead mailer. I forget that I should only reply
via web when I _want_ text wrapping:

On Wed, Sep 01, 2021 at 02:04:04PM -0700, Brian Norris wrote:
> (b) latency spikes to ~6ms:
> # trace-cmd record -p function_graph -O funcgraph-abstime -l
> mwifiex_pm_wakeup_card
> # trace-cmd report
>    kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      #
> 6219.500 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      #
> 6267.625 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      #
> 6184.250 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      #
> 6553.166 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      #
> 6212.500 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      #
> 6230.292 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      #
> 6451.667 us |  mwifiex_pm_wakeup_card();
>    kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      #
> 6204.042 us |  mwifiex_pm_wakeup_card();
> 
> whereas it used to look more like:
> 
>    kworker/u13:1-173   [005]   212.230542: funcgraph_entry:
> 7.000 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:
> 9.334 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      +
> 11.375 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:
> 5.542 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      +
> 13.125 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      +
> 11.959 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:
> 8.459 us   |  mwifiex_pm_wakeup_card();
>    kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      +
> 15.166 us  |  mwifiex_pm_wakeup_card();
>    kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      +
> 11.083 us  |  mwifiex_pm_wakeup_card();

That should read:

# trace-cmd record -p function_graph -O funcgraph-abstime -l mwifiex_pm_wakeup_card
# trace-cmd report
   kworker/u13:0-199   [003]   348.987306: funcgraph_entry:      # 6219.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [003]   349.316312: funcgraph_entry:      # 6267.625 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [001]   352.238530: funcgraph_entry:      # 6184.250 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [002]   356.626366: funcgraph_entry:      # 6553.166 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.709389: funcgraph_entry:      # 6212.500 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [002]   356.847215: funcgraph_entry:      # 6230.292 us |  mwifiex_pm_wakeup_card();
   kworker/u13:3-4057  [000]   356.897576: funcgraph_entry:      # 6451.667 us |  mwifiex_pm_wakeup_card();
   kworker/u13:0-199   [004]   357.175025: funcgraph_entry:      # 6204.042 us |  mwifiex_pm_wakeup_card();

vs.

   kworker/u13:1-173   [005]   212.230542: funcgraph_entry:        7.000 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   213.886063: funcgraph_entry:        9.334 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [002]   214.473273: funcgraph_entry:      + 11.375 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   214.530705: funcgraph_entry:        5.542 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.050168: funcgraph_entry:      + 13.125 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [002]   215.106492: funcgraph_entry:      + 11.959 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [005]   215.484807: funcgraph_entry:        8.459 us   |  mwifiex_pm_wakeup_card();
   kworker/u13:1-173   [003]   215.515238: funcgraph_entry:      + 15.166 us  |  mwifiex_pm_wakeup_card();
   kworker/u13:3-1768  [001]   217.175691: funcgraph_entry:      + 11.083 us  |  mwifiex_pm_wakeup_card();

Brian
Jonas Dreßler Sept. 22, 2021, 12:50 p.m. UTC | #7
On 9/20/21 7:48 PM, Brian Norris wrote:
> On Sat, Sep 18, 2021 at 12:37 AM Jonas Dreßler <verdre@v0yd.nl> wrote:

>> Thanks for the pointer to that commit Brian, it turns out this is

>> actually the change that causes the "Firmware wakeup failed" issues that

>> I'm trying to fix with the second patch here.

> 

> Huh. That's interesting, although I guess it makes some sense given

> your theory of "dropped writes". FWIW, this strategy (post a single

> write, then wait for wakeup) is the same used by some other

> chips/drivers too (e.g., ath10k/pci), although in those cases card

> wakeup is much much faster. But if the bus was dropping writes

> somehow, those strategies would fail too.

> 

>> Also my approach is a lot messier than just reverting

>> 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 and also appears to be blocking

>> even longer...

> 

> For the record, in case you're talking about my data ("blocking even

> longer"): I was only testing patch 1. Patch 2 isn't really relevant to

> my particular systems (Rockchip RK3399 + Marvell 8997/PCIe), because

> (a) I'm pretty sure my system isn't "dropping" any reads or writes

> (b) all my delay is in the read-back; the Rockchip PCIe bus is waiting

> indefinitely for the card to wake up, instead of timing out and

> reporting all-1's like many x86 systems appear to do (I've tested

> this).

> 

> So, the 6ms delay is entirely sitting in the ioread32(), not a delay loop.

> 

> I haven't yet tried your version 2 (which avoids the blocking read to

> wake up; good!), but it sounds like in theory it could solve your

> problem while avoiding 6ms delays for me. I intend to test your v2

> this week.

> 


With "blocking even longer" I meant that (on my system) the delay-loop 
blocks even longer than waking up the card via mwifiex_read_reg() (both 
are in the orders of milliseconds). And given that in certain cases the 
card wakeup (or a write getting through to the card, I have no idea) can 
take extremely long, I'd feel more confident going with the 
mwifiex_read_reg() method to wake up the card.

Anyway, you know what's even weirder with all this: I've been testing 
the first commit of patch v2 (so just the single read-back instead of 
the big hammer) together with 062e008a6e83e7c4da7df0a9c6aefdbc849e2bb3 
reverted for a good week now and haven't seen any wakeup failure yet. 
Otoh I'm fairly sure the big hammer with reading back every write wasn't 
enough to fix the wakeup failures, otherwise I wouldn't even have 
started working on the second commit.

So that would mean there's a difference between writing and then reading 
back vs only reading to wake up the card: Only the latter fixes the 
wakeup failures.

>> Does anyone have an idea what could be the reason for the posted write

>> not going through, or could that also be a potential firmware bug in the

>> chip?

> 

> I have no clue about that. That does sound downright horrible, but so

> are many things when dealing with this family of hardware/firmware.

> I'm not sure how to prove out whether this is a host bus problem, or

> an endpoint/firmware problem, other than perhaps trying the same

> module/firmware on another system, if that's possible.

> 

> Anyway, to reiterate: I'm not fundamentally opposed to v2 (pending a

> test run here), even if it is a bit ugly and perhaps not 100%

> understood.

> 


I'm not 100% sure about all this yet, I think I'm gonna try to confirm 
my older findings once again now and then we'll see. FTR, would you be 
fine with using the mwifiex_read_reg() method to wake up the card and 
somehow quirking your system to use write_reg()?

> Brian

>
Pali Rohár Sept. 25, 2021, 5:32 p.m. UTC | #8
FYI: There is a new NXP driver for these ex-Marvell wifi chips:
https://source.codeaurora.org/external/nxpwifi/mwifiex/tree/mxm_wifiex/wlan_src/mlinux?h=caf/master

And NXP has there also some modified mwifiex driver:
https://source.codeaurora.org/external/nxpwifi/mwifiex/tree/mwifiex_8997/mlinux?h=caf/master&id=070397228c46e0a6afce4f627b032d3818cc1356

Maybe somebody could be interested in looking into those NXP driver
modifications... Maybe NXP fixed some issues here.

Also look at another new bug report related to mwifiex firmware:
https://lore.kernel.org/linux-wireless/edeb34bc-7c85-7f1d-79e4-e3e21df86334@gk2.net/
Jonas Dreßler Sept. 30, 2021, 3:38 p.m. UTC | #9
On 9/23/21 10:22 PM, Pali Rohár wrote:
> On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:

>> On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

>>> On 9/22/21 2:50 PM, Jonas Dreßler wrote:

>>

>> ...

>>

>>> - Just calling mwifiex_write_reg() once and then blocking until the card

>>> wakes up using my delay-loop doesn't fix the issue, it's actually

>>> writing multiple times that fixes the issue

>>>

>>> These observations sound a lot like writes (and even reads) are actually

>>> being dropped, don't they?

>>

>> It sounds like you're writing into a not ready (fully powered on) device.

> 

> This reminds me a discussion with Bjorn about CRS response returned

> after firmware crash / reset when device is not ready yet:

> https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/

> 

> Could not be this similar issue? You could check it via reading

> PCI_VENDOR_ID register from config space. And if it is not valid value

> then card is not really ready yet.

> 

>> To check this, try to put a busy loop for reading and check the value

>> till it gets 0.

>>

>> Something like

>>

>>    unsigned int count = 1000;

>>

>>    do {

>>      if (mwifiex_read_reg(...) == 0)

>>        break;

>>    } while (--count);

>>

>>

>> -- 

>> With Best Regards,

>> Andy Shevchenko


I've tried both reading PCI_VENDOR_ID and the firmware status using a 
busy loop now, but sadly none of them worked. It looks like the card 
always replies with the correct values even though it sometimes won't 
wake up after that.

I do have one new observation though, although I've no clue what could 
be happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we 
can "predict" the wakeup failure because exactly one (usually around the 
20th) of those 1000 reads will fail. Maybe the firmware actually tries 
to wake up, encounters an error somewhere in its wakeup routines and 
then goes down a special failure code path. That code path keeps the 
cards CPU so busy that at some point a PCI_VENDOR_ID request times out?

Or well, maybe the card actually wakes up fine, but we don't receive the 
interrupt on our end, so many possibilities...
Pali Rohár Sept. 30, 2021, 4:19 p.m. UTC | #10
On Thursday 30 September 2021 18:14:04 Jonas Dreßler wrote:
> On 9/30/21 5:42 PM, Pali Rohár wrote:

> > On Thursday 30 September 2021 17:38:43 Jonas Dreßler wrote:

> > > On 9/23/21 10:22 PM, Pali Rohár wrote:

> > > > On Thursday 23 September 2021 22:41:30 Andy Shevchenko wrote:

> > > > > On Thu, Sep 23, 2021 at 6:28 PM Jonas Dreßler <verdre@v0yd.nl> wrote:

> > > > > > On 9/22/21 2:50 PM, Jonas Dreßler wrote:

> > > > > 

> > > > > ...

> > > > > 

> > > > > > - Just calling mwifiex_write_reg() once and then blocking until the card

> > > > > > wakes up using my delay-loop doesn't fix the issue, it's actually

> > > > > > writing multiple times that fixes the issue

> > > > > > 

> > > > > > These observations sound a lot like writes (and even reads) are actually

> > > > > > being dropped, don't they?

> > > > > 

> > > > > It sounds like you're writing into a not ready (fully powered on) device.

> > > > 

> > > > This reminds me a discussion with Bjorn about CRS response returned

> > > > after firmware crash / reset when device is not ready yet:

> > > > https://lore.kernel.org/linux-pci/20210922164803.GA203171@bhelgaas/

> > > > 

> > > > Could not be this similar issue? You could check it via reading

> > > > PCI_VENDOR_ID register from config space. And if it is not valid value

> > > > then card is not really ready yet.

> > > > 

> > > > > To check this, try to put a busy loop for reading and check the value

> > > > > till it gets 0.

> > > > > 

> > > > > Something like

> > > > > 

> > > > >     unsigned int count = 1000;

> > > > > 

> > > > >     do {

> > > > >       if (mwifiex_read_reg(...) == 0)

> > > > >         break;

> > > > >     } while (--count);

> > > > > 

> > > > > 

> > > > > -- 

> > > > > With Best Regards,

> > > > > Andy Shevchenko

> > > 

> > > I've tried both reading PCI_VENDOR_ID and the firmware status using a busy

> > > loop now, but sadly none of them worked. It looks like the card always

> > > replies with the correct values even though it sometimes won't wake up after

> > > that.

> > > 

> > > I do have one new observation though, although I've no clue what could be

> > > happening here: When reading PCI_VENDOR_ID 1000 times to wakeup we can

> > > "predict" the wakeup failure because exactly one (usually around the 20th)

> > > of those 1000 reads will fail.

> > 

> > What does "fail" means here?

> 

> ioread32() returns all ones, that's interpreted as failure by

> mwifiex_read_reg().


Ok. And can you check if PCI Bridge above this card has enabled CRSSVE
bit (CRSVisible in RootCtl+RootCap in lspci output)? To determinate if
Bridge could convert CRS response to all-ones as failed transaction.

> > 

> > > Maybe the firmware actually tries to wake up,

> > > encounters an error somewhere in its wakeup routines and then goes down a

> > > special failure code path. That code path keeps the cards CPU so busy that

> > > at some point a PCI_VENDOR_ID request times out?

> > > 

> > > Or well, maybe the card actually wakes up fine, but we don't receive the

> > > interrupt on our end, so many possibilities...