net: ftgmac100: Fix missing TX-poll issue

Message ID 20201019073908.32262-1-dylan_hung@aspeedtech.com
State New

Commit Message

Dylan Hung Oct. 19, 2020, 7:39 a.m. UTC
The cpu accesses the register and the memory via different bus/path on
aspeed soc.  So we can not guarantee that the tx-poll command
(register access) is always behind the tx descriptor (memory).  In other
words, the HW may start working even the data is not yet ready.  By
adding a dummy read after the last data write, we can ensure the data
are pushed to the memory, then guarantee the processing sequence

Signed-off-by: Dylan Hung <dylan_hung@aspeedtech.com>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Joel Stanley Oct. 19, 2020, 8:57 a.m. UTC | #1
On Mon, 19 Oct 2020 at 07:39, Dylan Hung <dylan_hung@aspeedtech.com> wrote:
>

> The cpu accesses the register and the memory via different bus/path on

> aspeed soc.  So we can not guarantee that the tx-poll command


Just the 2600, or other versions too?

> (register access) is always behind the tx descriptor (memory).  In other

> words, the HW may start working even the data is not yet ready.  By


even if the

> adding a dummy read after the last data write, we can ensure the data

> are pushed to the memory, then guarantee the processing sequence

>

> Signed-off-by: Dylan Hung <dylan_hung@aspeedtech.com>

> ---

>  drivers/net/ethernet/faraday/ftgmac100.c | 3 ++-

>  1 file changed, 2 insertions(+), 1 deletion(-)

>

> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c

> index 00024dd41147..9a99a87f29f3 100644

> --- a/drivers/net/ethernet/faraday/ftgmac100.c

> +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> @@ -804,7 +804,8 @@ static netdev_tx_t ftgmac100_hard_start_xmit(struct sk_buff *skb,

>          * before setting the OWN bit on the first descriptor.

>          */

>         dma_wmb();

> -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> +       READ_ONCE(first->txdes0);


I understand what you're trying to do here, but I'm not sure that this
is the correct way to go about it.

It does cause the compiler to produce a store and then a load.

>

>         /* Update next TX pointer */

>         priv->tx_pointer = pointer;

> --

> 2.17.1

>
Dylan Hung Oct. 19, 2020, 9:19 a.m. UTC | #2
Hi Joel,

> -----Original Message-----

> From: Joel Stanley [mailto:joel@jms.id.au]

> Sent: Monday, October 19, 2020 4:57 PM

> To: Dylan Hung <dylan_hung@aspeedtech.com>; Benjamin Herrenschmidt

> <benh@kernel.crashing.org>

> Cc: David S . Miller <davem@davemloft.net>; Jakub Kicinski

> <kuba@kernel.org>; netdev@vger.kernel.org; Linux Kernel Mailing List

> <linux-kernel@vger.kernel.org>; Po-Yu Chuang <ratbert@faraday-tech.com>;

> linux-aspeed <linux-aspeed@lists.ozlabs.org>; OpenBMC Maillist

> <openbmc@lists.ozlabs.org>; BMC-SW <BMC-SW@aspeedtech.com>

> Subject: Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

> 

> On Mon, 19 Oct 2020 at 07:39, Dylan Hung <dylan_hung@aspeedtech.com>

> wrote:

> >

> > The cpu accesses the register and the memory via different bus/path on

> > aspeed soc.  So we can not guarantee that the tx-poll command

> 

> Just the 2600, or other versions too?


Just the 2600, and this issue has only been seen on the Ethernet MAC.

> 

> > (register access) is always behind the tx descriptor (memory).  In

> > other words, the HW may start working even the data is not yet ready.

> > By

> 

> even if the

> 

> > adding a dummy read after the last data write, we can ensure the data

> > are pushed to the memory, then guarantee the processing sequence

> >

> > Signed-off-by: Dylan Hung <dylan_hung@aspeedtech.com>

> > ---

> >  drivers/net/ethernet/faraday/ftgmac100.c | 3 ++-

> >  1 file changed, 2 insertions(+), 1 deletion(-)

> >

> > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c

> > b/drivers/net/ethernet/faraday/ftgmac100.c

> > index 00024dd41147..9a99a87f29f3 100644

> > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > @@ -804,7 +804,8 @@ static netdev_tx_t

> ftgmac100_hard_start_xmit(struct sk_buff *skb,

> >          * before setting the OWN bit on the first descriptor.

> >          */

> >         dma_wmb();

> > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > +       READ_ONCE(first->txdes0);

> 

> I understand what you're trying to do here, but I'm not sure that this is the

> correct way to go about it.

> 

> It does cause the compiler to produce a store and then a load.

> 

> >

> >         /* Update next TX pointer */

> >         priv->tx_pointer = pointer;

> > --

> > 2.17.1

> >
Jakub Kicinski Oct. 19, 2020, 7 p.m. UTC | #3
On Mon, 19 Oct 2020 08:57:03 +0000 Joel Stanley wrote:
> > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c

> > index 00024dd41147..9a99a87f29f3 100644

> > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > @@ -804,7 +804,8 @@ static netdev_tx_t ftgmac100_hard_start_xmit(struct sk_buff *skb,

> >          * before setting the OWN bit on the first descriptor.

> >          */

> >         dma_wmb();

> > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > +       READ_ONCE(first->txdes0);  

> 

> I understand what you're trying to do here, but I'm not sure that this

> is the correct way to go about it.

> 

> It does cause the compiler to produce a store and then a load.


+1 @first is system memory from dma_alloc_coherent(), right?

You shouldn't have to do this. Is coherent DMA memory broken 
on your platform?
Benjamin Herrenschmidt Oct. 19, 2020, 11:23 p.m. UTC | #4
On Mon, 2020-10-19 at 12:00 -0700, Jakub Kicinski wrote:
> On Mon, 19 Oct 2020 08:57:03 +0000 Joel Stanley wrote:

> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c

> > > b/drivers/net/ethernet/faraday/ftgmac100.c

> > > index 00024dd41147..9a99a87f29f3 100644

> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > > @@ -804,7 +804,8 @@ static netdev_tx_t

> > > ftgmac100_hard_start_xmit(struct sk_buff *skb,

> > >          * before setting the OWN bit on the first descriptor.

> > >          */

> > >         dma_wmb();

> > > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > > +       READ_ONCE(first->txdes0);  

> > 

> > I understand what you're trying to do here, but I'm not sure that

> > this

> > is the correct way to go about it.

> > 

> > It does cause the compiler to produce a store and then a load.

> 

> +1 @first is system memory from dma_alloc_coherent(), right?

> 

> You shouldn't have to do this. Is coherent DMA memory broken 

> on your platform?


I suspect the problem is that the HW (and yes this would be a HW bug)
doesn't order the CPU -> memory and the CPU -> MMIO path.

What I think happens is that the store to txdes0 is potentially still in
a buffer somewhere on its way to memory, gets bypassed by the store to
MMIO, causing the MAC to try to read the descriptor, and getting the
"old" data from memory.

It's ... fishy, but that isn't the first time an Aspeed chip has that
type of bug (there's a similar one in the USB device controller iirc).

Cheers,
Ben.
Jakub Kicinski Oct. 20, 2020, 2:57 a.m. UTC | #5
On Tue, 20 Oct 2020 10:23:41 +1100 Benjamin Herrenschmidt wrote:
> On Mon, 2020-10-19 at 12:00 -0700, Jakub Kicinski wrote:

> > On Mon, 19 Oct 2020 08:57:03 +0000 Joel Stanley wrote:  

> > > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c

> > > > b/drivers/net/ethernet/faraday/ftgmac100.c

> > > > index 00024dd41147..9a99a87f29f3 100644

> > > > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > > > @@ -804,7 +804,8 @@ static netdev_tx_t

> > > > ftgmac100_hard_start_xmit(struct sk_buff *skb,

> > > >          * before setting the OWN bit on the first descriptor.

> > > >          */

> > > >         dma_wmb();

> > > > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > > > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > > > +       READ_ONCE(first->txdes0);    

> > > 

> > > I understand what you're trying to do here, but I'm not sure that

> > > this

> > > is the correct way to go about it.

> > > 

> > > It does cause the compiler to produce a store and then a load.  

> > 

> > +1 @first is system memory from dma_alloc_coherent(), right?

> > 

> > You shouldn't have to do this. Is coherent DMA memory broken 

> > on your platform?  

> 

> I suspect the problem is that the HW (and yes this would be a HW bug)

> doesn't order the CPU -> memory and the CPU -> MMIO path.

> 

> What I think happens is that the store to txde0 is potentially still in

> a buffer somewhere on its way to memory, gets bypassed by the store to

> MMIO, causing the MAC to try to read the descriptor, and getting the

> "old" data from memory.


I see, but in general this sort of a problem should be resolved by
adding an appropriate memory barrier. And in fact such barrier should
(these days) be implied by a writel (I'm not 100% clear on why this
driver uses iowrite, and if it matters).

> It's ... fishy, but that isn't the first time an Aspeed chip has that

> type of bug (there's a similar one in the USB device controler iirc).


Argh.
Dylan Hung Oct. 20, 2020, 6:14 a.m. UTC | #6
> -----Original Message-----

> From: Jakub Kicinski [mailto:kuba@kernel.org]

> Sent: Tuesday, October 20, 2020 3:01 AM

> To: Joel Stanley <joel@jms.id.au>

> Cc: Dylan Hung <dylan_hung@aspeedtech.com>; Benjamin Herrenschmidt

> <benh@kernel.crashing.org>; David S . Miller <davem@davemloft.net>;

> netdev@vger.kernel.org; Linux Kernel Mailing List

> <linux-kernel@vger.kernel.org>; Po-Yu Chuang <ratbert@faraday-tech.com>;

> linux-aspeed <linux-aspeed@lists.ozlabs.org>; OpenBMC Maillist

> <openbmc@lists.ozlabs.org>; BMC-SW <BMC-SW@aspeedtech.com>

> Subject: Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

> 

> On Mon, 19 Oct 2020 08:57:03 +0000 Joel Stanley wrote:

> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c

> > > b/drivers/net/ethernet/faraday/ftgmac100.c

> > > index 00024dd41147..9a99a87f29f3 100644

> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > > @@ -804,7 +804,8 @@ static netdev_tx_t

> ftgmac100_hard_start_xmit(struct sk_buff *skb,

> > >          * before setting the OWN bit on the first descriptor.

> > >          */

> > >         dma_wmb();

> > > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > > +       READ_ONCE(first->txdes0);

> >

> > I understand what you're trying to do here, but I'm not sure that this

> > is the correct way to go about it.

> >

> > It does cause the compiler to produce a store and then a load.


Yes, the load instruction here is to guarantee the previous store is indeed pushed onto the physical memory.

> 

> +1 @first is system memory from dma_alloc_coherent(), right?

> 

> You shouldn't have to do this. Is coherent DMA memory broken on your

> platform?


It is about arbitration in the DRAM controller.  There are two queues in the DRAM controller: one for CPU accesses and the other for the HW engines.
When the CPU issues a store, the DRAM controller just acknowledges the request and pushes it into the queue.  The CPU then triggers the HW MAC engine, and the engine starts to fetch the DMA memory.
But since the CPU's request may still be sitting in the queue, the HW engine may fetch stale data.
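
To make the race concrete, here is a toy user-space model of the arrangement described above. It is only a sketch under stated assumptions: the names (dram_model, cpu_store, cpu_load, dev_fetch), the queue depth, and the idea that a CPU load on the same port drains the queued stores are all illustrative and are not taken from AST2600 documentation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS   8
#define QUEUE_DEPTH 4

/* Toy model: DRAM plus a posted-write queue on the CPU port (hypothetical). */
struct dram_model {
	uint32_t mem[MEM_WORDS];          /* words actually committed to DRAM */
	struct {
		size_t addr;
		uint32_t val;
	} q[QUEUE_DEPTH];                 /* stores acknowledged but not committed */
	size_t q_len;
};

/* CPU store: acknowledged at once, but only queued. */
static bool cpu_store(struct dram_model *d, size_t addr, uint32_t val)
{
	if (addr >= MEM_WORDS || d->q_len == QUEUE_DEPTH)
		return false;
	d->q[d->q_len].addr = addr;
	d->q[d->q_len].val = val;
	d->q_len++;
	return true;
}

/* Commit every queued store to DRAM in FIFO order. */
static void drain(struct dram_model *d)
{
	for (size_t i = 0; i < d->q_len; i++)
		d->mem[d->q[i].addr] = d->q[i].val;
	d->q_len = 0;
}

/*
 * CPU load through the same port. Assumption: it is ordered behind the
 * queued stores, so the queue drains before the value is returned.
 * The caller must pass addr < MEM_WORDS.
 */
static uint32_t cpu_load(struct dram_model *d, size_t addr)
{
	drain(d);
	return d->mem[addr];
}

/* HW engine fetch through the other port: it never sees the CPU queue. */
static uint32_t dev_fetch(const struct dram_model *d, size_t addr)
{
	return d->mem[addr];
}
```

In this model, a dev_fetch() issued right after cpu_store() returns the old word, while the same fetch issued after a cpu_load() of that word returns the new one, which is the effect the dummy read in the patch relies on.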
Benjamin Herrenschmidt Oct. 20, 2020, 6:15 a.m. UTC | #7
On Mon, 2020-10-19 at 19:57 -0700, Jakub Kicinski wrote:
> > I suspect the problem is that the HW (and yes this would be a HW bug)

> > doesn't order the CPU -> memory and the CPU -> MMIO path.

> > 

> > What I think happens is that the store to txde0 is potentially still in

> > a buffer somewhere on its way to memory, gets bypassed by the store to

> > MMIO, causing the MAC to try to read the descriptor, and getting the

> > "old" data from memory.

> 

> I see, but in general this sort of a problem should be resolved by

> adding an appropriate memory barrier. And in fact such barrier should

> (these days) be implied by a writel (I'm not 100% clear on why this

> driver uses iowrite, and if it matters).


No, a barrier won't solve this I think.

This is a coherency problem at the fabric/interconnect level. It has to
do with the way they implemented the DMA path from memory to the
ethernet controller using a different "port" of the memory controller
than the one used by the CPU, separately from the MMIO path, with no
proper ordering between those busses. Old school design .... and
broken.

By doing a read back, they probably force the previous write to memory
to get past the point where it will be visible to a subsequent DMA read
by the ethernet controller.

> > It's ... fishy, but that isn't the first time an Aspeed chip has that

> > type of bug (there's a similar one in the USB device controler iirc).


Cheers,
Ben.
David Laight Oct. 20, 2020, 1:15 p.m. UTC | #8
From: Dylan Hung

> Sent: 20 October 2020 07:15

> 

> > -----Original Message-----

> > From: Jakub Kicinski [mailto:kuba@kernel.org]

> >

> > On Mon, 19 Oct 2020 08:57:03 +0000 Joel Stanley wrote:

> > > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c

> > > > b/drivers/net/ethernet/faraday/ftgmac100.c

> > > > index 00024dd41147..9a99a87f29f3 100644

> > > > --- a/drivers/net/ethernet/faraday/ftgmac100.c

> > > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c

> > > > @@ -804,7 +804,8 @@ static netdev_tx_t

> > ftgmac100_hard_start_xmit(struct sk_buff *skb,

> > > >          * before setting the OWN bit on the first descriptor.

> > > >          */

> > > >         dma_wmb();

> > > > -       first->txdes0 = cpu_to_le32(f_ctl_stat);

> > > > +       WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));

> > > > +       READ_ONCE(first->txdes0);

> > >

> > > I understand what you're trying to do here, but I'm not sure that this

> > > is the correct way to go about it.

> > >

> > > It does cause the compiler to produce a store and then a load.

> 

> Yes, the load instruction here is to guarantee the previous store is indeed

> pushed onto the physical memory.


That rather depends where the data is 'stuck'.

An old sparc cpu would flush the cpu store buffer before the read.
But a modern x86 cpu will satisfy the read from the store buffer
for cached data.

If the write is 'posted' on a PCI(e) bus then the read can't overtake it.
But that is a memory access so shouldn't be to a PCI(e) address.

Shouldn't dma_wmb() actually force your 'cpu to dram' queue to be flushed?
In which case you need one after writing the ring descriptor and
before the poke of the mac engine.

The barrier before the descriptor write only needs to guarantee
ordering of the writes - it can probably be a lighter barrier?

It might be that your dma_wmb() needs to do a write+read of
an uncached DRAM location in order to empty the cpu to dram queue.

	David

Jakub Kicinski Oct. 20, 2020, 5:24 p.m. UTC | #9
On Tue, 20 Oct 2020 17:15:42 +1100 Benjamin Herrenschmidt wrote:
> On Mon, 2020-10-19 at 19:57 -0700, Jakub Kicinski wrote:

> > > I suspect the problem is that the HW (and yes this would be a HW bug)

> > > doesn't order the CPU -> memory and the CPU -> MMIO path.

> > > 

> > > What I think happens is that the store to txde0 is potentially still in

> > > a buffer somewhere on its way to memory, gets bypassed by the store to

> > > MMIO, causing the MAC to try to read the descriptor, and getting the

> > > "old" data from memory.  

> > 

> > I see, but in general this sort of a problem should be resolved by

> > adding an appropriate memory barrier. And in fact such barrier should

> > (these days) be implied by a writel (I'm not 100% clear on why this

> > driver uses iowrite, and if it matters).  

> 

> No, a barrier won't solve this I think.

> 

> This is a coherency problem at the fabric/interconnect level. I has to

> do with the way they implemented the DMA path from memory to the

> ethernet controller using a different "port" of the memory controller

> than the one used by the CPU, separately from the MMIO path, with no

> proper ordering between those busses. Old school design .... and

> broken.

> 

> By doing a read back, they probably force the previous write to memory

> to get past the point where it will be visible to a subsequent DMA read

> by the ethernet controller.


Thanks for the explanation. How wonderful :/

It'd still be highly, highly preferable if the platform conformed
to the Linux memory model. IO accessors (iowrite32 / writel) must
ensure previous DRAM writes have completed. For performance-sensitive
ops that don't require ordering, we have writel_relaxed etc.

I assume the DRAM controller queue is a straight FIFO and we don't have
to worry about hitting the same address, so how about we add a read
of some known uncached address in iowrite32 / writel?
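
A rough user-space sketch of that idea follows. Everything in it is hypothetical: flush_word stands in for "some known uncached address", mmio_write32_flushed() is not a kernel function, and the counter exists only so the behaviour can be checked. It also assumes, as stated above, that the queue is a plain FIFO so a read of any address pushes out all earlier stores.

```c
#include <stdint.h>

/* Hypothetical stand-in for a known uncached DRAM location. */
static volatile uint32_t flush_word;

/* Counts dummy reads; present only to make the sketch testable. */
static unsigned int flush_reads;

/*
 * Hypothetical MMIO write helper: do a dummy read first, then write the
 * register. On the real platform the read would be what forces earlier
 * DRAM stores out of the controller queue; here it is only modelled.
 */
static void mmio_write32_flushed(volatile uint32_t *reg, uint32_t val)
{
	uint32_t dummy = flush_word;  /* read assumed to drain the queue */

	(void)dummy;
	flush_reads++;
	*reg = val;                   /* the actual poke of the device */
}
```

The cost is one extra uncached read on every non-relaxed MMIO write, which is why the relaxed accessors would stay untouched in such a scheme.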
Benjamin Herrenschmidt Oct. 20, 2020, 10:05 p.m. UTC | #10
On Tue, 2020-10-20 at 13:15 +0000, David Laight wrote:
> That rather depends where the data is 'stuck'.

> 

> An old sparc cpu would flush the cpu store buffer before the read.

> But a modern x86 cpu will satisfy the read from the store buffer

> for cached data.

> 

> If the write is 'posted' on a PCI(e) bus then the read can't overtake it.

> But that is a memory access so shouldn't be to a PCI(e) address.

> 

> Shouldn't dma_wb() actually force your 'cpu to dram' queue be flushed?

> In which case you need one after writing the ring descriptor and

> before the poke of the mac engine.

> 

> The barrier before the descriptor write only needs to guarantee

> ordering of the writes - it can probably be a lighter barrier?

> 

> It might be that your dma_wmb() needs to do a write+read of

> an uncached DRAM location in order to empty the cpu to dram queue.


This is a specific bug with how a specific IP block is hooked up in
those SOCs, I wouldn't bloat the global dma_wmb for that. The read back
in the driver with appropriate comment should be enough.
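
For illustration, the commented read back might look roughly like the sketch below. It is a user-space approximation, not the driver code: WRITE_ONCE_U32()/READ_ONCE_U32() are stand-ins for the kernel's WRITE_ONCE()/READ_ONCE(), struct txdes_sketch and publish_first_descriptor() are hypothetical names, and the dma_wmb() and cpu_to_le32() from the real hunk are left out.

```c
#include <stdint.h>

/* User-space stand-ins for the kernel's WRITE_ONCE()/READ_ONCE(), u32 only. */
#define WRITE_ONCE_U32(x, v) (*(volatile uint32_t *)&(x) = (v))
#define READ_ONCE_U32(x)     (*(volatile uint32_t *)&(x))

/* Reduced descriptor: only the control/status word matters here. */
struct txdes_sketch {
	uint32_t txdes0;
};

static uint32_t publish_first_descriptor(struct txdes_sketch *first,
					 uint32_t f_ctl_stat)
{
	/* Hand the descriptor to the MAC by setting the OWN bit last. */
	WRITE_ONCE_U32(first->txdes0, f_ctl_stat);

	/*
	 * Workaround sketch: on the affected SoC the store above may still
	 * be queued in the DRAM controller when the later MMIO poke reaches
	 * the MAC. Reading the word back is assumed to force the store out
	 * to memory before the caller triggers the TX poll.
	 */
	return READ_ONCE_U32(first->txdes0);
}
```

The volatile accesses are what stop the compiler from dropping the apparently useless load; whether the load really drains the hardware queue is a property of the SoC, not of the C code.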

Cheers,
Ben.
Benjamin Herrenschmidt Oct. 20, 2020, 10:10 p.m. UTC | #11
On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung@aspeedtech.com> wrote:

> > > +1 @first is system memory from dma_alloc_coherent(), right?

> > > 

> > > You shouldn't have to do this. Is coherent DMA memory broken on your

> > > platform?

> > 

> > It is about the arbitration on the DRAM controller.  There are two queues in the dram controller, one is for the CPU access and the other is for the HW engines.

> > When CPU issues a store command, the dram controller just acknowledges cpu's request and pushes the request into the queue.  Then CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA memory.

> > But since the cpu's request may still stay in the queue, the HW engine may fetch the wrong data.


Actually, I take back what I said earlier, the above seems to imply
this is more generic.

Dylan, please confirm, does this affect *all* DMA capable devices ? If
yes, then it's a really really bad design bug in your chips
unfortunately and the proper fix is indeed to make dma_wmb() do a dummy
read of some sort (what address though ? would any dummy non-cachable
page do ?) to force the data out as *all* drivers will potentially be
affected.

I was under the impression that it was a specific timing issue in the
vhub and ethernet parts, but if it's more generic then it needs to be
fixed globally.

> There is still something missing in the explanation: The iowrite32()

> only tells the

> device that it should check the queue, but not where the data is. I would expect

> the device to either see the correct data that was marked valid by the

> 'dma_wmb();first->txdes0 = cpu_to_le32(f_ctl_stat);' operation, or it would see

> the old f_ctl_stat value telling it that the data is not yet valid and

> not look at

> the rest of the descriptor. In the second case you would see the data

> not getting sent out until the next start_xmit(), but the device should not

> fetch wrong data.

> 

> There are two possible scenarios in which your patch would still help:

> 

> a) the dma_wmb() does not serialize the stores as seen by DMA the

>     way it is supposed to, so the device can observe the new value of txdec0

>     before it observes the correct data.

> 

> b) The txdes0 field sometimes contains stale data that marks the

>     descriptor as valid before the correct data is written. This field

>     should have been set in ftgmac100_tx_complete_packet() earlier

> 

> If either of the two is the case, then the READ_ONCE() would just

> introduce a long delay before the iowrite32() that makes it more likely

> that the data is there, but the inconsistent state would still be observable

> by the device if it is still working on previous frames.


I think it just gets stuck until we try another packet, ie, it doesn't
see the new descriptor valid bit. But Dylan can elaborate.

Cheers,
Ben.
Andrew Jeffery Oct. 20, 2020, 10:25 p.m. UTC | #12
On Wed, 21 Oct 2020, at 08:40, Benjamin Herrenschmidt wrote:
> On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:

> > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung@aspeedtech.com> wrote:

> > > > +1 @first is system memory from dma_alloc_coherent(), right?

> > > > 

> > > > You shouldn't have to do this. Is coherent DMA memory broken on your

> > > > platform?

> > > 

> > > It is about the arbitration on the DRAM controller.  There are two queues in the dram controller, one is for the CPU access and the other is for the HW engines.

> > > When CPU issues a store command, the dram controller just acknowledges cpu's request and pushes the request into the queue.  Then CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA memory.

> > > But since the cpu's request may still stay in the queue, the HW engine may fetch the wrong data.

> 

> Actually, I take back what I said earlier, the above seems to imply

> this is more generic.

> 

> Dylan, please confirm, does this affect *all* DMA capable devices ? If

> yes, then it's a really really bad design bug in your chips

> unfortunately and the proper fix is indeed to make dma_wmb() do a dummy

> read of some sort (what address though ? would any dummy non-cachable

> page do ?) to force the data out as *all* drivers will potentially be

> affected.

> 

> I was under the impression that it was a specific timing issue in the

> vhub and ethernet parts, but if it's more generic then it needs to be

> fixed globally.

> 


We see a similar issue in the XDMA engine where it can transfer stale data to 
the host. I think the driver ended up using memcpy_toio() to work around that 
despite using a DMA reserved memory region.
Arnd Bergmann Oct. 21, 2020, 12:11 p.m. UTC | #13
(replying to my own mail from a different address to deal with the
regular one being blacklisted somewhere, sorry for any duplicates)

On Wed, Oct 21, 2020 at 9:16 AM Arnd Bergmann <arnd@arndb.de> wrote:
>

> On Wed, Oct 21, 2020 at 12:10 AM Benjamin Herrenschmidt

> <benh@kernel.crashing.org> wrote:

> > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:

> > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung@aspeedtech.com> wrote:

> > > > > +1 @first is system memory from dma_alloc_coherent(), right?

> > > > >

> > > > > You shouldn't have to do this. Is coherent DMA memory broken on your

> > > > > platform?

> > > >

> > > > It is about the arbitration on the DRAM controller.  There are two queues in the dram controller, one is for the CPU access and the other is for the HW engines.

> > > > When CPU issues a store command, the dram controller just acknowledges cpu's request and pushes the request into the queue.  Then CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA memory.

> > > > But since the cpu's request may still stay in the queue, the HW engine may fetch the wrong data.

> >

> > Actually, I take back what I said earlier, the above seems to imply

> > this is more generic.

> >

> > Dylan, please confirm, does this affect *all* DMA capable devices ? If

> > yes, then it's a really really bad design bug in your chips

> > unfortunately and the proper fix is indeed to make dma_wmb() do a dummy

> > read of some sort (what address though ? would any dummy non-cachable

> > page do ?) to force the data out as *all* drivers will potentially be

> > affected.

> >

> > I was under the impression that it was a specific timing issue in the

> > vhub and ethernet parts, but if it's more generic then it needs to be

> > fixed globally.

>

> We have CONFIG_ARM_HEAVY_MB for SoCs with similar problems,

> it turns mb() and wmb() into a platform specific function call, though it

> doesn't do that for dma_wmb() and smp_wmb(), which should not

> be affected if the problem is only missing serialization between DMA

> and CPU writes.

>

> > > If either of the two is the case, then the READ_ONCE() would just

> > > introduce a long delay before the iowrite32() that makes it more likely

> > > that the data is there, but the inconsistent state would still be observable

> > > by the device if it is still working on previous frames.

> >

> > I think it just get stuck until we try another packet, ie, it doesn't

> > see the new descriptor valid bit. But Dylan can elaborate.

>

> Ok, that would point to an insufficient barrier in iowrite32() as well,

> not in dma_wmb().

>

> At the moment, the only chips that need the heavy barrier are

> omap4 and mstar_v7, and early l2 cache controllers (not the one

> on Cortex-A7) have another synchronization callback that IIRC

> is used for streaming mappings.

>

> These are the two implementations of soc_mb() we have:

>

> /*
>  * This may need locking to deal with situations where an interrupt
>  * happens while we are in here and mb() gets called by the interrupt handler.
>  *
>  * The vendor code did have a spin lock but it doesn't seem to be needed and
>  * removing it hasn't caused any side effects so far.
>  *
>  * [writel|readl]_relaxed have to be used here because otherwise
>  * we'd end up right back in here.
>  */
> static void mstarv7_mb(void)
> {
>        /* toggle the flush miu pipe fire bit */
>        writel_relaxed(0, l3bridge + MSTARV7_L3BRIDGE_FLUSH);
>        writel_relaxed(MSTARV7_L3BRIDGE_FLUSH_TRIGGER, l3bridge
>                        + MSTARV7_L3BRIDGE_FLUSH);
>        while (!(readl_relaxed(l3bridge + MSTARV7_L3BRIDGE_STATUS)
>                        & MSTARV7_L3BRIDGE_STATUS_DONE)) {
>                /* wait for flush to complete */
>        }
> }
>
> /*
>  * OMAP4 interconnect barrier which is called for each mb() and wmb().
>  * This is to ensure that normal paths to DRAM (normal memory, cacheable
>  * accesses) are properly synchronised with writes to DMA coherent memory
>  * (normal memory, uncacheable) and device writes.
>  *
>  * The mb() and wmb() barriers only operate only on the MPU->MA->EMIF
>  * path, as we need to ensure that data is visible to other system
>  * masters prior to writes to those system masters being seen.
>  *
>  * Note: the SRAM path is not synchronised via mb() and wmb().
>  */
> static void omap4_mb(void)
> {
>        if (dram_sync)
>                writel_relaxed(0, dram_sync);
> }

>

> Obviously, adding one of these for ast2600 would slow down every

> mb() and writel() a lot, but if it is a chip-wide problem rather than

> one isolated to the network device, it would be the correct solution,

> provided that a correct code sequence can be found.

>

>       Arnd
Benjamin Herrenschmidt Oct. 22, 2020, 7:40 a.m. UTC | #14
On Wed, 2020-10-21 at 14:11 +0200, Arnd Bergmann wrote:
> (replying to my own mail from a different address to deal with the

> regular one being blacklisted somewhere, sorry for any duplicates)

> 

> On Wed, Oct 21, 2020 at 9:16 AM Arnd Bergmann <arnd@arndb.de> wrote:

> > 

> > On Wed, Oct 21, 2020 at 12:10 AM Benjamin Herrenschmidt

> > <benh@kernel.crashing.org> wrote:

> > > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:

> > > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung@aspeedtech.com> wrote:

> > > > > > +1 @first is system memory from dma_alloc_coherent(), right?

> > > > > > 

> > > > > > You shouldn't have to do this. Is coherent DMA memory broken on your

> > > > > > platform?

> > > > > 

> > > > > It is about the arbitration on the DRAM controller.  There are two queues in the dram controller, one is for the CPU access and the other is for the HW engines.

> > > > > When CPU issues a store command, the dram controller just acknowledges cpu's request and pushes the request into the queue.  Then CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA memory.

> > > > > But since the cpu's request may still stay in the queue, the HW engine may fetch the wrong data.

> > > 

> > > Actually, I take back what I said earlier, the above seems to imply

> > > this is more generic.

> > > 

> > > Dylan, please confirm, does this affect *all* DMA capable devices ? If

> > > yes, then it's a really really bad design bug in your chips

> > > unfortunately and the proper fix is indeed to make dma_wmb() do a dummy

> > > read of some sort (what address though ? would any dummy non-cachable

> > > page do ?) to force the data out as *all* drivers will potentially be

> > > affected.

> > > 

> > > I was under the impression that it was a specific timing issue in the

> > > vhub and ethernet parts, but if it's more generic then it needs to be

> > > fixed globally.

> > 

> > We have CONFIG_ARM_HEAVY_MB for SoCs with similar problems,

> > it turns mb() and wmb() into a platform specific function call, though it

> > doesn't do that for dma_wmb() and smp_wmb(), which should not

> > be affected if the problem is only missing serialization between DMA

> > and CPU writes.


Right. I got mixed up by David's mention of dma_wmb, it's not the issue
here, it's indeed the ordering of writes to "coherent" memory vs
iowrite.

> > > > If either of the two is the case, then the READ_ONCE() would just

> > > > introduce a long delay before the iowrite32() that makes it more likely

> > > > that the data is there, but the inconsistent state would still be observable

> > > > by the device if it is still working on previous frames.

> > > 

> > > I think it just get stuck until we try another packet, ie, it doesn't

> > > see the new descriptor valid bit. But Dylan can elaborate.

> > 

> > Ok, that would point to an insufficient barrier in iowrite32() as well,

> > not in dma_wmb().


Correct.

> > At the moment, the only chips that need the heavy barrier are

> > omap4 and mstar_v7, and early l2 cache controllers (not the one

> > on Cortex-A7) have another synchronization callback that IIRC

> > is used for streaming mappings.


 .../...

> > Obviously, adding one of these for ast2600 would slow down every

> > mb() and writel() a lot, but if it is a chip-wide problem rather than

> > one isolated to the network device, it would be the correct solution,

> > provided that a correct code sequence can be found.


I'm surprised that problem doesn't already exist on the ast2400 and
2500 and I thus worry about the performance impact of such a workaround
applied generally to every MMIO write....

But we did kill mmiowb so ... ;-)

Cheers,
Ben.
Arnd Bergmann Oct. 23, 2020, 8:39 a.m. UTC | #15
On Thu, Oct 22, 2020 at 9:41 AM Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2020-10-21 at 14:11 +0200, Arnd Bergmann wrote:
>
> > > At the moment, the only chips that need the heavy barrier are
> > > omap4 and mstar_v7, and early l2 cache controllers (not the one
> > > on Cortex-A7) have another synchronization callback that IIRC
> > > is used for streaming mappings.
>
>  .../...
>
> > > Obviously, adding one of these for ast2600 would slow down every
> > > mb() and writel() a lot, but if it is a chip-wide problem rather than
> > > one isolated to the network device, it would be the correct solution,
> > > provided that a correct code sequence can be found.
>
> I'm surprised that problem doesn't already exist on the ast2400 and
> 2500 and I thus worry about the performance impact of such a workaround
> applied generally to every MMIO writes....
>
> But we did kill mmiowb so ... ;-)

The real cost would have to be measured of course, and it depends a
lot on how it's done. The read-from-uncached-memory as in the 1/4
patch here seems fairly expensive, the mstarv7_mb() method (spinning
on an mmio read) seems worse, but the omap4 method (a posted write
to a mmio address in the memory controller to enforce a barrier between
the two ports) doesn't seem that bad and would correspond to what
the chip should be doing in the first place.

       Arnd
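
The platform hook discussed above can be pictured with a small userspace
model. This is a sketch under stated assumptions, not kernel code: the
names model_heavy_mb(), model_writel(), ast2600_a0_mb() and dummy_uncached
are hypothetical, the callback pointer is only modelled loosely on the
CONFIG_ARM_HEAVY_MB mechanism, and the dummy read stands in for whatever
sequence would actually be needed on the SoC.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model of a platform barrier hook; all names are illustrative. */
static void (*soc_mb)(void);              /* platform callback, NULL if unused */
static volatile uint32_t dummy_uncached;  /* stand-in for an uncached word */
static unsigned int heavy_mb_calls;       /* counts how often the hook ran */

/* Hypothetical callback: a dummy read meant to push earlier stores out. */
static void ast2600_a0_mb(void)
{
	(void)dummy_uncached;
	heavy_mb_calls++;
}

/* Model of mb()/wmb(): CPU barrier first, then the platform hook if set. */
static void model_heavy_mb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	if (soc_mb)
		soc_mb();
}

/* Model of writel(): heavy barrier, then the register store. */
static void model_writel(uint32_t val, volatile uint32_t *reg)
{
	model_heavy_mb();
	*reg = val;
}
```

With soc_mb left NULL, model_writel() costs only the CPU barrier. Once
soc_mb points at ast2600_a0_mb(), every register write in the model pays
for the extra read, which is the per-write cost being weighed in this part
of the thread.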
Dylan Hung Oct. 23, 2020, 1:08 p.m. UTC | #16
> -----Original Message-----
> From: Andrew Jeffery [mailto:andrew@aj.id.au]
> Sent: Wednesday, October 21, 2020 6:26 AM
> To: Benjamin Herrenschmidt <benh@kernel.crashing.org>; Arnd Bergmann
> <arnd@arndb.de>; Dylan Hung <dylan_hung@aspeedtech.com>
> Cc: BMC-SW <BMC-SW@aspeedtech.com>; linux-aspeed
> <linux-aspeed@lists.ozlabs.org>; Po-Yu Chuang <ratbert@faraday-tech.com>;
> netdev <netdev@vger.kernel.org>; OpenBMC Maillist
> <openbmc@lists.ozlabs.org>; Linux Kernel Mailing List
> <linux-kernel@vger.kernel.org>; Jakub Kicinski <kuba@kernel.org>; David
> Miller <davem@davemloft.net>
> Subject: Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue
>
> On Wed, 21 Oct 2020, at 08:40, Benjamin Herrenschmidt wrote:
> > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung <dylan_hung@aspeedtech.com> wrote:
> > > > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > > >
> > > > > You shouldn't have to do this. Is coherent DMA memory broken on
> > > > > your platform?
> > > >
> > > > It is about the arbitration on the DRAM controller.  There are two
> > > > queues in the dram controller, one is for the CPU access and the other
> > > > is for the HW engines.
> > > > When CPU issues a store command, the dram controller just
> > > > acknowledges cpu's request and pushes the request into the queue.  Then
> > > > CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA
> > > > memory.
> > > > But since the cpu's request may still stay in the queue, the HW engine
> > > > may fetch the wrong data.
> >
> > Actually, I take back what I said earlier, the above seems to imply
> > this is more generic.
> >
> > Dylan, please confirm, does this affect *all* DMA capable devices ? If
> > yes, then it's a really really bad design bug in your chips
> > unfortunately and the proper fix is indeed to make dma_wmb() do a
> > dummy read of some sort (what address though ? would any dummy
> > non-cachable page do ?) to force the data out as *all* drivers will
> > potentially be affected.
> >

The issue was found on our test chip (ast2600 version A0), which is just for testing and won't be mass-produced.  This HW bug has been fixed on ast2600 A1 and later versions.

To verify the HW fix, I ran overnight iperf and kvm tests on ast2600 A1 without this patch and got stable results without hanging.
So I think we can discard this patch.

> > I was under the impression that it was a specific timing issue in the
> > vhub and ethernet parts, but if it's more generic then it needs to be
> > fixed globally.
> >
>
> We see a similar issue in the XDMA engine where it can transfer stale data to
> the host. I think the driver ended up using memcpy_toio() to work around that
> despite using a DMA reserved memory region.
Benjamin Herrenschmidt Oct. 26, 2020, 10:21 p.m. UTC | #17
On Fri, 2020-10-23 at 13:08 +0000, Dylan Hung wrote:
> The issue was found on our test chip (ast2600 version A0) which is
> just for testing and won't be mass-produced.  This HW bug has been
> fixed on ast2600 A1 and later versions.
>
> To verify the HW fix, I run overnight iperf and kvm tests on
> ast2600A1 without this patch, and get stable result without hanging.
> So I think we can discard this patch.

This is great news. Thanks !

Cheers,
Ben.
Joel Stanley Oct. 27, 2020, 2:18 a.m. UTC | #18
On Mon, 26 Oct 2020 at 22:22, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> On Fri, 2020-10-23 at 13:08 +0000, Dylan Hung wrote:
> > The issue was found on our test chip (ast2600 version A0) which is
> > just for testing and won't be mass-produced.  This HW bug has been
> > fixed on ast2600 A1 and later versions.
> >
> > To verify the HW fix, I run overnight iperf and kvm tests on
> > ast2600A1 without this patch, and get stable result without hanging.
> > So I think we can discard this patch.
>
> This is great news. Thanks !

That is excellent news. I agree; we do not need fixes for A0 issues to
be kept in the mainline kernel. Thanks for updating us Dylan.

Cheers,

Joel

Patch

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 00024dd41147..9a99a87f29f3 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -804,7 +804,8 @@  static netdev_tx_t ftgmac100_hard_start_xmit(struct sk_buff *skb,
 	 * before setting the OWN bit on the first descriptor.
 	 */
 	dma_wmb();
-	first->txdes0 = cpu_to_le32(f_ctl_stat);
+	WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));
+	READ_ONCE(first->txdes0);
 
 	/* Update next TX pointer */
 	priv->tx_pointer = pointer;