mbox series

[v2,0/2] mwifiex: Work around firmware bugs on 88W8897 chip

Message ID 20210914114813.15404-1-verdre@v0yd.nl
Headers show
Series mwifiex: Work around firmware bugs on 88W8897 chip | expand

Message

Jonas Dreßler Sept. 14, 2021, 11:48 a.m. UTC
This is the second revision of the patch, the first one is here:
https://lore.kernel.org/linux-wireless/20210830123704.221494-1-verdre@v0yd.nl/

Changes between v1 and v2:
 - Only read-back the register write to the TX ring write pointer, not all writes
 - Mention firmware version in commit message+code comment for future reference
 - Use -EIO return value in second patch
 - Use definitions for waiting intervals in second patch

Jonas Dreßler (2):
  mwifiex: Use non-posted PCI write when setting TX ring write pointer
  mwifiex: Try waking the firmware until we get an interrupt

 drivers/net/wireless/marvell/mwifiex/pcie.c | 59 +++++++++++++++++----
 1 file changed, 50 insertions(+), 9 deletions(-)

Comments

Jonas Dreßler Sept. 22, 2021, 12:08 p.m. UTC | #1
On 9/22/21 1:17 PM, Andy Shevchenko wrote:
> On Tue, Sep 14, 2021 at 01:48:12PM +0200, Jonas Dreßler wrote:

>> On the 88W8897 card it's very important the TX ring write pointer is

>> updated correctly to its new value before setting the TX ready

>> interrupt, otherwise the firmware appears to crash (probably because

>> it's trying to DMA-read from the wrong place). The issue is present in

>> the latest firmware version 15.68.19.p21 of the pcie+usb card.

> 

> Please, be consistent in the commit message(s) and the code (esp. if the term

> comes from a specification).

> 

> Here, PCIe (same in the code, at least that I have noticed, but should be done

> everywhere).

> 

>> Since PCI uses "posted writes" when writing to a register, it's not

>> guaranteed that a write will happen immediately. That means the pointer

>> might be outdated when setting the TX ready interrupt, leading to

>> firmware crashes especially when ASPM L1 and L1 substates are enabled

>> (because of the higher link latency, the write will probably take

>> longer).

>>

>> So fix those firmware crashes by always using a non-posted write for

>> this specific register write. We do that by simply reading back the

>> register after writing it, just as a few other PCI drivers do.

>>

>> This fixes a bug where during rx/tx traffic and with ASPM L1 substates

> 

> Ditto. TX/RX.

> 

>> enabled (the enabled substates are platform dependent), the firmware

>> crashes and eventually a command timeout appears in the logs.

> 

> Should it have a Fixes tag?

> 


Don't think so, there's the infamous 
(https://bugzilla.kernel.org/show_bug.cgi?id=109681) Bugzilla bug it 
fixes though, I'll mention that in v3.

>> Cc: stable@vger.kernel.org

>> Signed-off-by: Jonas Dreßler <verdre@v0yd.nl>

> 

> ...

> 

>> -		/* Write the TX ring write pointer in to reg->tx_wrptr */

>> -		if (mwifiex_write_reg(adapter, reg->tx_wrptr,

>> -				      card->txbd_wrptr | rx_val)) {

>> +		/* Write the TX ring write pointer in to reg->tx_wrptr.

>> +		 * The firmware (latest version 15.68.19.p21) of the 88W8897

>> +		 * pcie+usb card seems to crash when getting the TX ready

>> +		 * interrupt but the TX ring write pointer points to an outdated

>> +		 * address, so it's important we do a non-posted write here to

>> +		 * force the completion of the write.

>> +		 */

>> +		if (mwifiex_write_reg_np(adapter, reg->tx_wrptr,

>> +				        card->txbd_wrptr | rx_val)) {

> 

>>   			mwifiex_dbg(adapter, ERROR,

>>   				    "SEND DATA: failed to write reg->tx_wrptr\n");

>>   			ret = -1;

> 

> I'm not sure how this is not a dead code.

> 

> On top of that, I would rather to call old function and explicitly put the

> dummy read after it

> 

> 		/* Write the TX ring write pointer in to reg->tx_wrptr */

> 		if (mwifiex_write_reg(adapter, reg->tx_wrptr,

> 				      card->txbd_wrptr | rx_val)) {

> 			...eliminate dead code in the following patch(es)...

> 		}

> 

> +		/* The firmware (latest version 15.68.19.p21) of the 88W8897

> +		 * pcie+usb card seems to crash when getting the TX ready

> +		 * interrupt but the TX ring write pointer points to an outdated

> +		 * address, so it's important we do a non-posted write here to

> +		 * force the completion of the write.

> +		 */

> 		mwifiex_read_reg(...);

> 

> Now, since I found the dummy read function to be present, perhaps you need to

> dive more into the code and understand why it exists.

> 


Interesting, I haven't noticed that mwifiex_write_reg() always returns 
0. So are you suggesting to remove that return value and get rid of all 
the "if (mwifiex_write_reg()) {}" checks in a separate commit?

As for why the dummy read/write functions exist, I have no idea. Looking 
at git history it seems they were always there (only change is that 
mwifiex_read_reg() started to handle read errors with commit 
af05148392f50490c662dccee6c502d9fcba33e2). My bet would be that they 
were created to be consistent with sdio.c which is the oldest supported 
bus type in mwifiex.
David Laight Sept. 22, 2021, 3:54 p.m. UTC | #2
From: Pali Rohár

> Sent: 22 September 2021 15:27

> 

> On Wednesday 22 September 2021 14:03:25 David Laight wrote:

> > From: Jonas Dreßler

> > > Sent: 14 September 2021 12:48

> > >

> > > On the 88W8897 card it's very important the TX ring write pointer is

> > > updated correctly to its new value before setting the TX ready

> > > interrupt, otherwise the firmware appears to crash (probably because

> > > it's trying to DMA-read from the wrong place). The issue is present in

> > > the latest firmware version 15.68.19.p21 of the pcie+usb card.

> > >

> > > Since PCI uses "posted writes" when writing to a register, it's not

> > > guaranteed that a write will happen immediately. That means the pointer

> > > might be outdated when setting the TX ready interrupt, leading to

> > > firmware crashes especially when ASPM L1 and L1 substates are enabled

> > > (because of the higher link latency, the write will probably take

> > > longer).

> > >

> > > So fix those firmware crashes by always using a non-posted write for

> > > this specific register write. We do that by simply reading back the

> > > register after writing it, just as a few other PCI drivers do.

> > >

> > > This fixes a bug where during rx/tx traffic and with ASPM L1 substates

> > > enabled (the enabled substates are platform dependent), the firmware

> > > crashes and eventually a command timeout appears in the logs.

> >

> > I think you need to change your terminology.

> > PCIe does have some non-posted write transactions - but I can't

> > remember when they are used.

> 

> In PCIe are all memory write requests as posted.

> 

> Non-posted writes in PCIe are used only for IO and config requests. But

> this is not case for proposed patch change as it access only card's

> memory space.

> 

> Technically this patch does not use non-posted memory write (as PCIe

> does not support / provide it), just adds something like a barrier and

> I'm not sure if it is really correct (you already wrote more details

> about it, so I will let it be).

> 

> I'm not sure what is the correct terminology, I do not know how this

> kind of write-followed-by-read "trick" is correctly called.


I think it is probably best to say:
   "flush the posted write when setting the TX ring write pointer".

The write can get posted in any/all of the following places:
1) The cpu store buffer.
2) The PCIe host bridge.
3) Any other PCIe bridges.
4) The PCIe slave logic in the target.
   There could be separate buffers for each BAR,
5) The actual target logic for that address block.
   The target (probably) will look a bit like an old fashioned cpu
   motherboard with the PCIe slave logic as the main bus master.

The readback forces all the posted write buffers be flushed.

In this case I suspect it is either flushing (5) or the extra
delay of the read TLP processing that 'fixes' the problem.

Note that depending on the exact code and host cpu the second
write may not need to wait for the response to the read TLP.
So the write, readback, write TLP may be back to back on the
actual PCIe link.

Although I don't have access to an actual PCIe monitor we
do have the ability to trace 'data' TLP into fpga memory
on one of our systems.
This is near real-time but they are slightly munged.
Watching the TLP can be illuminating!

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Brian Norris Sept. 27, 2021, 8:30 p.m. UTC | #3
On Tue, Sep 14, 2021 at 4:48 AM Jonas Dreßler <verdre@v0yd.nl> wrote:
>

> This is the second revision of the patch, the first one is here:

> https://lore.kernel.org/linux-wireless/20210830123704.221494-1-verdre@v0yd.nl/

>

> Changes between v1 and v2:

>  - Only read-back the register write to the TX ring write pointer, not all writes

>  - Mention firmware version in commit message+code comment for future reference

>  - Use -EIO return value in second patch

>  - Use definitions for waiting intervals in second patch


I tested this version, and it doesn't have the same issues v1 had
(regarding long-blocking reads, causing audio dropouts, etc.), so:

Tested-by: Brian Norris <briannorris@chromium.org>


As suggested elsewhere, this polling loop approach is a little slower
than just waiting for an interrupt instead (and that proves out; the
wakeup latency seems to increase by ~1 "short" polling interval; so
about half a millisecond). It seems like that could be optimized if
needed, because you *are* still waiting for an interrupt anyway. But I
haven't tried benchmarking anything that would really matter for this;
when we're already waiting 6+ ms, another 0.5ms isn't the end of the
world.

This doesn't really count as Reviewed-by. There are probably better
improvements to the poling loop (e.g., Andy's existing suggestions);
and frankly, I'd rather see if the dropped writes can be fixed
somehow. But I'm not holding my breath for the latter, and don't have
any good suggestions. So if this is the best we can do, so be it.

Regards,
Brian