diff mbox series

[v2] i2c: designware: Fix corrupted memory seen in the ISR

Message ID 20230913232938.420423-1-janb@os.amperecomputing.com
State Superseded
Headers show
Series [v2] i2c: designware: Fix corrupted memory seen in the ISR | expand

Commit Message

Jan Bottorff Sept. 13, 2023, 11:29 p.m. UTC
Errors were happening in the ISR that looked like corrupted
memory. This was because memory writes from the core enabling
interrupts were not yet visible to the core running the ISR. The
kernel log would get the message "i2c_designware APMC0D0F:00:
controller timed out" during in-band IPMI SSIF stress tests.

Add a write barrier before enabling interrupts to assure data written
by the current core is visible to all cores before the interrupt fires.

The ARM Barrier Litmus Tests and Cookbook has an example under
Sending Interrupts and Barriers that matches the usage in this
driver. That document says a DSB barrier is required.

Signed-off-by: Jan Bottorff <janb@os.amperecomputing.com>
Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com>
Tested-by: Yann Sionneau <ysionneau@kalrayinc.com>
---
 drivers/i2c/busses/i2c-designware-master.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Andy Shevchenko Sept. 14, 2023, 6:46 p.m. UTC | #1
On Wed, Sep 13, 2023 at 04:29:38PM -0700, Jan Bottorff wrote:
> Errors were happening in the ISR that looked like corrupted
> memory. This was because memory writes from the core enabling
> interrupts were not yet visible to the core running the ISR. The
> kernel log would get the message "i2c_designware APMC0D0F:00:
> controller timed out" during in-band IPMI SSIF stress tests.
> 
> Add a write barrier before enabling interrupts to assure data written
> by the current core is visible to all cores before the interrupt fires.
> 
> The ARM Barrier Litmus Tests and Cookbook has an example under
> Sending Interrupts and Barriers that matches the usage in this
> driver. That document says a DSB barrier is required.
> 
> Signed-off-by: Jan Bottorff <janb@os.amperecomputing.com>
> Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com>
> Tested-by: Yann Sionneau <ysionneau@kalrayinc.com>
> ---

Changelog?
Andy Shevchenko Sept. 14, 2023, 6:47 p.m. UTC | #2
On Thu, Sep 14, 2023 at 09:46:34PM +0300, Andy Shevchenko wrote:
> On Wed, Sep 13, 2023 at 04:29:38PM -0700, Jan Bottorff wrote:

> > ---
> 
> Changelog?

No need to resend, just reply with a bullet list of what has been done
in v1 --> v2.
Jan Bottorff Sept. 14, 2023, 8:52 p.m. UTC | #3
On 9/14/2023 11:47 AM, Andy Shevchenko wrote:
> On Thu, Sep 14, 2023 at 09:46:34PM +0300, Andy Shevchenko wrote:
>> On Wed, Sep 13, 2023 at 04:29:38PM -0700, Jan Bottorff wrote:
> 
>>> ---
>>
>> Changelog?
> 
> No need to resend, just reply with a bullet list of what has been done
> in v1 --> v2.
> 

Hi Andy,

No code changes between v1 and v2, just improvements to the commit 
message based on feedback.

Thanks,
Jan
Jarkko Nikula Sept. 15, 2023, 12:44 p.m. UTC | #4
On 9/14/23 02:29, Jan Bottorff wrote:
> Errors were happening in the ISR that looked like corrupted
> memory. This was because memory writes from the core enabling
> interrupts were not yet visible to the core running the ISR. The
> kernel log would get the message "i2c_designware APMC0D0F:00:
> controller timed out" during in-band IPMI SSIF stress tests.
> 
> Add a write barrier before enabling interrupts to assure data written
> by the current core is visible to all cores before the interrupt fires.
> 
> The ARM Barrier Litmus Tests and Cookbook has an example under
> Sending Interrupts and Barriers that matches the usage in this
> driver. That document says a DSB barrier is required.
> 
> Signed-off-by: Jan Bottorff <janb@os.amperecomputing.com>
> Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com>
> Tested-by: Yann Sionneau <ysionneau@kalrayinc.com>
> ---
>   drivers/i2c/busses/i2c-designware-master.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
Acked-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Serge Semin Sept. 15, 2023, 3:21 p.m. UTC | #5
Hi Jan

On Wed, Sep 13, 2023 at 04:29:38PM -0700, Jan Bottorff wrote:
> Errors were happening in the ISR that looked like corrupted
> memory. This was because memory writes from the core enabling
> interrupts were not yet visible to the core running the ISR. The
> kernel log would get the message "i2c_designware APMC0D0F:00:
> controller timed out" during in-band IPMI SSIF stress tests.
> 
> Add a write barrier before enabling interrupts to assure data written
> by the current core is visible to all cores before the interrupt fires.
> 
> The ARM Barrier Litmus Tests and Cookbook has an example under
> Sending Interrupts and Barriers that matches the usage in this
> driver. That document says a DSB barrier is required.
> 
> Signed-off-by: Jan Bottorff <janb@os.amperecomputing.com>
> Reviewed-by: Yann Sionneau <ysionneau@kalrayinc.com>
> Tested-by: Yann Sionneau <ysionneau@kalrayinc.com>
> ---
>  drivers/i2c/busses/i2c-designware-master.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
> index ca1035e010c7..1694ac6bb592 100644
> --- a/drivers/i2c/busses/i2c-designware-master.c
> +++ b/drivers/i2c/busses/i2c-designware-master.c
> @@ -248,6 +248,14 @@ static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
>  	/* Dummy read to avoid the register getting stuck on Bay Trail */
>  	regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
>  
> +	/*
> +	 * To guarantee data written by the current core is visible to
> +	 * all cores, a write barrier is required. This needs to be
> +	 * before an interrupt causes execution on another core.
> +	 * For ARM processors, this needs to be a DSB barrier.
> +	 */

> +	wmb();

Based on the patch log and the comment, smp_wmb() seems to be more
suitable here since the problem looks like SMP-specific. Most
importantly the smp_wmb() will get to be just the compiler barrier on
the UP system, so no cache and pipeline flushes in that case.
Meanwhile 

I am not ARM expert, but based on the problem and the DMB/DSB barriers
descriptions using DMB should be enough in your case since you only
need memory syncs.

-Serge(y)

> +
>  	/* Clear and enable interrupts */
>  	regmap_read(dev->map, DW_IC_CLR_INTR, &dummy);
>  	regmap_write(dev->map, DW_IC_INTR_MASK, DW_IC_INTR_MASTER_MASK);
> -- 
> 2.41.0
>
Jan Bottorff Sept. 16, 2023, 1:47 a.m. UTC | #6
On 9/15/2023 8:21 AM, Serge Semin wrote:
...
> 
> Based on the patch log and the comment, smp_wmb() seems to be more
> suitable here since the problem looks like SMP-specific. Most
> importantly the smp_wmb() will get to be just the compiler barrier on
> the UP system, so no cache and pipeline flushes in that case.
> Meanwhile
> 
> I am not ARM expert, but based on the problem and the DMB/DSB barriers
> descriptions using DMB should be enough in your case since you only
> need memory syncs.
> 
Hi Serge,

I looked at the definition of smp_wmb, and it looks like on arm64 it 
uses a DMB barrier not a DSB barrier.

In /arch/arm64/include/asm/barrier.h:
...
#define __arm_heavy_mb(x...) dsb(x)
...
#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
...
#define wmb()		__arm_heavy_mb(st)
...
#define __smp_wmb()	dmb(ishst)

And then in /include/asm-generic/barrier.h it says:
#ifdef CONFIG_SMP
...
#ifndef smp_wmb
#define smp_wmb()	do { kcsan_wmb(); __smp_wmb(); } while (0)
#endif

This looks like wmb() is a DSB and smp_wmb() is a DMB on SMP systems, so 
the two functions are not equivalent on SMP systems.

So lets explore if we think DMB or DSB is the correct barrier.

The ARM barrier docs I referred to has a specific example that says this:

"In some message passing systems, it is common for one observer to 
update memory and then send an interrupt using a mailbox of some sort to 
a second observer to indicate that memory has been updated and the new
contents have been read. Even though the sending of the interrupt using 
a mailbox might be initiated using a memory access, a DSB barrier
must be used to ensure the completion of previous memory accesses.

Therefore the following sequence is needed to ensure that P2 sees the 
updated value.

P1:
  STR R5, [R1] ; message stored to shared memory location
  DSB [ST]
  STR R1, [R4] ; R4 contains the address of a mailbox

P2:
  ; interrupt service routine
  LDR R5, [R1]

Even if R4 is a pointer to Strongly-Ordered memory, the update to R1 
might not be visible without the DSB executed by P1.
It should be appreciated that these rules are required in connection to 
the ARM Generic Interrupt Controller (GIC).
"

I don't positivly understand why it needs to be a DSB and not just a 
DMB, but this example matches what happens in the driver. The ARM docs 
do some hand waving that DSB is required because of the GIC.

Unless we can come up with a reason why this example in the ARM Barrier 
docs is not a match for what happens in the i2c driver, then ARM is 
saying it has to be a DSB not a DMB. If it needs to be a DSB then 
smb_wmb is insufficient.

Does anybody else have a different interpretation of this section in the 
ARM barrier docs? They use the word mailbox, and show a shared memory 
write, an interrupt triggering write, and a read of shared memory on a 
different core. Some would describe that as a software mailbox.

I did read someplace (although don't have a specific reference I can 
give) that ordering applied to normal memory writes are in a different 
group than ordering applied between strongly ordered accesses. The 
excerpt from the ARM barrier document above does say "Even if R4 is a 
pointer to Strongly-Ordered memory, the update to R1 might not be 
visible without the DSB executed by P1", which implies a DMB is 
insufficient to cause ordering between normal memory writes and 
strongly-ordered device memory writes.

I know currently on ARM64 Windows, the low-level kernel device MMIO 
access functions (like WRITE_REGISTER_ULONG) all have a DSB before the 
MMIO memory access. That seems a little heavy handed to me, but it also 
may be that was required to get all the current driver code written for 
AMD/Intel processors to work correctly on ARM64 without adding barriers 
in the drivers. There are also non-barrier variants that can be used if 
a driver wants to optimize performance. Defaulting to correct operation 
with minimal code changes would reduce the risk to delivery schedules.

Linux doesn't seem to make any attempt to have barriers in the low level 
MMIO access functions. If Linux had chosen to do that on ARM64, this 
patch would not have been required. For a low speed device like an i2c 
controller, optimizing barriers likely make little difference in 
performance.

Let's look at it from a risk analysis viewpoint. Say a DMB is sufficient 
and we use the stronger DSB variant, the downside is a few cpu cycles 
will be wasted in i2c transfers. Say we use a DMB when a DSB is required 
for correct operation, the downside is i2c operations may malfunction. 
In this case, using a few extra cpu cycles for an operation that does 
not happen at high frequency is lower risk than failures in i2c 
transfers. If there is any uncertainty in what barrier type to use, 
picking DSB over DMB would be better. We determined from the include 
fragments above that wmb() give the DSB and smp_wmb() does not.

Based on the above info, I think wmb() is still the correct function, 
and a change to smp_wmb() would not be correct.

Sorry for the long message, I know some of you will be inspired to think 
deeply about barriers, and some will be annoyed that I spent this much 
space to explain how I came to the choice of wmb().

Thanks,
Jan
Serge Semin Sept. 17, 2023, 12:01 a.m. UTC | #7
To += Catalin, Will

Could you please join the discussion and clarify some ARM64 barriers
aspects?

On Fri, Sep 15, 2023 at 06:47:55PM -0700, Jan Bottorff wrote:
> On 9/15/2023 8:21 AM, Serge Semin wrote:
> ...
> > > diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
> > > index ca1035e010c7..1694ac6bb592 100644
> > > --- a/drivers/i2c/busses/i2c-designware-master.c
> > > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > > @@ -248,6 +248,14 @@ static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
> > >       /* Dummy read to avoid the register getting stuck on Bay Trail */
> > >       regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
> > > 
> > > +     /*
> > > +      * To guarantee data written by the current core is visible to
> > > +      * all cores, a write barrier is required. This needs to be
> > > +      * before an interrupt causes execution on another core.
> > > +      * For ARM processors, this needs to be a DSB barrier.
> > > +      */
> >
> > > +     wmb();
> > 
> > Based on the patch log and the comment, smp_wmb() seems to be more
> > suitable here since the problem looks like SMP-specific. Most
> > importantly the smp_wmb() will get to be just the compiler barrier on
> > the UP system, so no cache and pipeline flushes in that case.
> > Meanwhile
> > 
> > I am not ARM expert, but based on the problem and the DMB/DSB barriers
> > descriptions using DMB should be enough in your case since you only
> > need memory syncs.
> > 
> Hi Serge,
> 
> I looked at the definition of smp_wmb, and it looks like on arm64 it uses a
> DMB barrier not a DSB barrier.
> 
> In /arch/arm64/include/asm/barrier.h:
> ...
> #define __arm_heavy_mb(x...) dsb(x)
> ...
> #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
> ...
> #define wmb()		__arm_heavy_mb(st)
> ...
> #define __smp_wmb()	dmb(ishst)
> 
> And then in /include/asm-generic/barrier.h it says:
> #ifdef CONFIG_SMP
> ...
> #ifndef smp_wmb
> #define smp_wmb()	do { kcsan_wmb(); __smp_wmb(); } while (0)
> #endif
> 

> This looks like wmb() is a DSB and smp_wmb() is a DMB on SMP systems, so the
> two functions are not equivalent on SMP systems.

Right. They aren't. That's why I added a note regarding the DMB
instruction. Anyway see further for detailed explantation of my point.

> 
> So lets explore if we think DMB or DSB is the correct barrier.
> 
> The ARM barrier docs I referred to has a specific example that says this:
> 
> "In some message passing systems, it is common for one observer to update
> memory and then send an interrupt using a mailbox of some sort to a second
> observer to indicate that memory has been updated and the new
> contents have been read. Even though the sending of the interrupt using a
> mailbox might be initiated using a memory access, a DSB barrier
> must be used to ensure the completion of previous memory accesses.
> 
> Therefore the following sequence is needed to ensure that P2 sees the
> updated value.
> 
> P1:
>  STR R5, [R1] ; message stored to shared memory location
>  DSB [ST]
>  STR R1, [R4] ; R4 contains the address of a mailbox
> 
> P2:
>  ; interrupt service routine
>  LDR R5, [R1]
> 
> Even if R4 is a pointer to Strongly-Ordered memory, the update to R1 might
> not be visible without the DSB executed by P1.
> It should be appreciated that these rules are required in connection to the
> ARM Generic Interrupt Controller (GIC).
> "
> 
> I don't positivly understand why it needs to be a DSB and not just a DMB,
> but this example matches what happens in the driver. The ARM docs do some
> hand waving that DSB is required because of the GIC.
> 
> Unless we can come up with a reason why this example in the ARM Barrier docs
> is not a match for what happens in the i2c driver, then ARM is saying it has
> to be a DSB not a DMB. If it needs to be a DSB then smb_wmb is insufficient.
> 
> Does anybody else have a different interpretation of this section in the ARM
> barrier docs? They use the word mailbox, and show a shared memory write, an
> interrupt triggering write, and a read of shared memory on a different core.
> Some would describe that as a software mailbox.
> 
> I did read someplace (although don't have a specific reference I can give)
> that ordering applied to normal memory writes are in a different group than
> ordering applied between strongly ordered accesses. The excerpt from the ARM
> barrier document above does say "Even if R4 is a pointer to Strongly-Ordered
> memory, the update to R1 might not be visible without the DSB executed by
> P1", which implies a DMB is insufficient to cause ordering between normal
> memory writes and strongly-ordered device memory writes.
> 
> I know currently on ARM64 Windows, the low-level kernel device MMIO access
> functions (like WRITE_REGISTER_ULONG) all have a DSB before the MMIO memory
> access. That seems a little heavy handed to me, but it also may be that was
> required to get all the current driver code written for AMD/Intel processors
> to work correctly on ARM64 without adding barriers in the drivers. There are
> also non-barrier variants that can be used if a driver wants to optimize
> performance. Defaulting to correct operation with minimal code changes would
> reduce the risk to delivery schedules.
> 

> Linux doesn't seem to make any attempt to have barriers in the low level
> MMIO access functions. If Linux had chosen to do that on ARM64, this patch
> would not have been required. For a low speed device like an i2c controller,
> optimizing barriers likely make little difference in performance.

* AFAICS it does for the write(b|w|l|q)() accessors. See __dma_wb(),
* __io_bw() and __raw_write*() macros. The former one is converted
* to DMB.

> 
> Let's look at it from a risk analysis viewpoint. Say a DMB is sufficient and
> we use the stronger DSB variant, the downside is a few cpu cycles will be
> wasted in i2c transfers. Say we use a DMB when a DSB is required for correct
> operation, the downside is i2c operations may malfunction. In this case,
> using a few extra cpu cycles for an operation that does not happen at high
> frequency is lower risk than failures in i2c transfers. If there is any
> uncertainty in what barrier type to use, picking DSB over DMB would be
> better. We determined from the include fragments above that wmb() give the
> DSB and smp_wmb() does not.
> 
> Based on the above info, I think wmb() is still the correct function, and a
> change to smp_wmb() would not be correct.
> 
> Sorry for the long message, I know some of you will be inspired to think
> deeply about barriers, and some will be annoyed that I spent this much space
> to explain how I came to the choice of wmb().

Thank you very much for the very-very-very detailed justification of
your point. I well understand why you insist on using the mandatory
barrier on your platform. The thing is that your patch concerns the
generic driver which is also executed on another archs. Thus we need
to be very careful with the barrier selection since it may cause
unpleasant side effects there. For instance some time ago I met a
problem with using memory barriers on the MMIO accesses on the MIPS
arch. Full mem access barrier caused the program execution stalling
for too long so it failed to fetch data from a device Rx FIFO on time.
FIFO got overrun, data got lost and communications were aborted with an
error returned. I am not saying that the same problem may happen here,
but just pointing out that selecting a correct barrier is important.

Since you are fixing a generic driver code we should make the
decisions based on the problem description and the barriers semantic
defined by the kernel. If for some reason the solution turns to be not
working, then it might as well indicate that the barrier isn't working
as expected by the kernel. Thorough studying the platform-specific
barrier implementation will be necessary then (what you've already
done).

Here is what you say regarding the found problem:

"Errors were happening in the ISR that looked like corrupted memory.
This was because memory writes from the core enabling interrupts were
not yet visible to the core running the ISR...  Add a write barrier
before enabling interrupts to assure data written by the current core
is visible to all cores before the interrupt fires."

Based on that, I can infer that the problem is relevant for the
SMP-systems only and the root of it is in one CPU/core not seeing data
written by another CPU/core. Indeed adding a barrier shall fix it.
Seeing neither uni-processor systems nor any peripheral devices are
affected, SMP-conditional barrier shall be enough. Here is what [1]
says regarding the mandatory (mb/rmb/wmb) and SMP-conditional barriers
(smp_mb, smp_rmb, smp_wmb):

"Note that SMP memory barriers _must_ be used to control the ordering
of references to shared memory on SMP systems, though the use of
locking instead is sufficient. ... Mandatory barriers should not be
used to control SMP effects, since mandatory barriers impose
unnecessary overhead on both SMP and UP systems. They may, however, be
used to control MMIO effects on accesses through relaxed memory I/O
windows. ... SMP memory barriers are reduced to compiler barriers on
uniprocessor compiled systems because it is assumed that a CPU will
appear to be self-consistent, and will order overlapping accesses
correctly with respect to itself."

[1] "CPU MEMORY BARRIERS", Documentation/memory-barriers.txt

(note [1] also contains an example of using the smp_rmb()/smp_wmb()
barriers in a case similar to yours but involving two tasks instead of
a task and ISR)

Based on that description, the mandatory and SMP-conditional barriers
are supposed to similarly function when it comes to ordering the
shared memory accesses in the SMP systems. Meanwhile the former ones
cause additional overhead on UPs and MMIO which is out of the defined
problem scope.

Thus this also indicate that smp_wmb() is your choice here. But adding
it didn't solve the problem meanwhile using wmb() did. And here we are
getting to these barriers implementation on ARM64:
wmb() -> DSB
smp_wmb() -> DMB
Again I am not the ARM expert, but based on the text cited in your
message and what can be found in the Internet I can guess that DMB
doesn't guarantee the memory write _completion_, but instead make sure
that the accesses are just orderly executed on the core pipeline (for
instance just by fetching and dispatching these instructions within
different core cycles). The writes _completion_ is guaranteed by the
DSB barrier. Seeing in order to solve the problem you described all
the writes before the IRQ is raised _must_ be finished for sure to be
visible on another core executing an ISR, the barrier you need is DSB.

Unless I am mistaken in some aspects all of the considerations above
make me thinking that perhaps the smp_mb/smp_rmb/smp_wmb barriers
implementations on ARM64 are incorrect in using DMB and instead should
be converted to using DSB. Then you'll be able to freely utilize the
smp_wmb() barrier in the i2c-driver.

Catalin, Will could you please clarify whether what is stated above is
wrong or correct? Could you give your opinion regarding the issue
here?

-Serge(y)

> 
> Thanks,
> Jan
> 
>
Yann Sionneau Sept. 17, 2023, 8:08 p.m. UTC | #8
Hi all,

Le 17/09/2023 à 02:01, Serge Semin a écrit :
> To += Catalin, Will
>
> Could you please join the discussion and clarify some ARM64 barriers
> aspects?
>
> On Fri, Sep 15, 2023 at 06:47:55PM -0700, Jan Bottorff wrote:
>> On 9/15/2023 8:21 AM, Serge Semin wrote:
>> ...
>>>> diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
>>>> index ca1035e010c7..1694ac6bb592 100644
>>>> --- a/drivers/i2c/busses/i2c-designware-master.c
>>>> +++ b/drivers/i2c/busses/i2c-designware-master.c
>>>> @@ -248,6 +248,14 @@ static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
>>>>        /* Dummy read to avoid the register getting stuck on Bay Trail */
>>>>        regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
>>>>
>>>> +     /*
>>>> +      * To guarantee data written by the current core is visible to
>>>> +      * all cores, a write barrier is required. This needs to be
>>>> +      * before an interrupt causes execution on another core.
>>>> +      * For ARM processors, this needs to be a DSB barrier.
>>>> +      */
>>>> +     wmb();
>>> Based on the patch log and the comment, smp_wmb() seems to be more
>>> suitable here since the problem looks like SMP-specific. Most
>>> importantly the smp_wmb() will get to be just the compiler barrier on
>>> the UP system, so no cache and pipeline flushes in that case.
>>> Meanwhile
>>>
>>> I am not ARM expert, but based on the problem and the DMB/DSB barriers
>>> descriptions using DMB should be enough in your case since you only
>>> need memory syncs.
>>>
>> Hi Serge,
>>
>> I looked at the definition of smp_wmb, and it looks like on arm64 it uses a
>> DMB barrier not a DSB barrier.
>>
>> In /arch/arm64/include/asm/barrier.h:
>> ...
>> #define __arm_heavy_mb(x...) dsb(x)
>> ...
>> #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
>> ...
>> #define wmb()		__arm_heavy_mb(st)
>> ...
>> #define __smp_wmb()	dmb(ishst)
>>
>> And then in /include/asm-generic/barrier.h it says:
>> #ifdef CONFIG_SMP
>> ...
>> #ifndef smp_wmb
>> #define smp_wmb()	do { kcsan_wmb(); __smp_wmb(); } while (0)
>> #endif
>>
>> This looks like wmb() is a DSB and smp_wmb() is a DMB on SMP systems, so the
>> two functions are not equivalent on SMP systems.
> Right. They aren't. That's why I added a note regarding the DMB
> instruction. Anyway see further for detailed explantation of my point.
>
>> So lets explore if we think DMB or DSB is the correct barrier.
>>
>> The ARM barrier docs I referred to has a specific example that says this:
>>
>> "In some message passing systems, it is common for one observer to update
>> memory and then send an interrupt using a mailbox of some sort to a second
>> observer to indicate that memory has been updated and the new
>> contents have been read. Even though the sending of the interrupt using a
>> mailbox might be initiated using a memory access, a DSB barrier
>> must be used to ensure the completion of previous memory accesses.
>>
>> Therefore the following sequence is needed to ensure that P2 sees the
>> updated value.
>>
>> P1:
>>   STR R5, [R1] ; message stored to shared memory location
>>   DSB [ST]
>>   STR R1, [R4] ; R4 contains the address of a mailbox
>>
>> P2:
>>   ; interrupt service routine
>>   LDR R5, [R1]
>>
>> Even if R4 is a pointer to Strongly-Ordered memory, the update to R1 might
>> not be visible without the DSB executed by P1.
>> It should be appreciated that these rules are required in connection to the
>> ARM Generic Interrupt Controller (GIC).
>> "
>>
>> I don't positivly understand why it needs to be a DSB and not just a DMB,
>> but this example matches what happens in the driver. The ARM docs do some
>> hand waving that DSB is required because of the GIC.
>>
>> Unless we can come up with a reason why this example in the ARM Barrier docs
>> is not a match for what happens in the i2c driver, then ARM is saying it has
>> to be a DSB not a DMB. If it needs to be a DSB then smb_wmb is insufficient.
>>
>> Does anybody else have a different interpretation of this section in the ARM
>> barrier docs? They use the word mailbox, and show a shared memory write, an
>> interrupt triggering write, and a read of shared memory on a different core.
>> Some would describe that as a software mailbox.
>>
>> I did read someplace (although don't have a specific reference I can give)
>> that ordering applied to normal memory writes are in a different group than
>> ordering applied between strongly ordered accesses. The excerpt from the ARM
>> barrier document above does say "Even if R4 is a pointer to Strongly-Ordered
>> memory, the update to R1 might not be visible without the DSB executed by
>> P1", which implies a DMB is insufficient to cause ordering between normal
>> memory writes and strongly-ordered device memory writes.
>>
>> I know currently on ARM64 Windows, the low-level kernel device MMIO access
>> functions (like WRITE_REGISTER_ULONG) all have a DSB before the MMIO memory
>> access. That seems a little heavy handed to me, but it also may be that was
>> required to get all the current driver code written for AMD/Intel processors
>> to work correctly on ARM64 without adding barriers in the drivers. There are
>> also non-barrier variants that can be used if a driver wants to optimize
>> performance. Defaulting to correct operation with minimal code changes would
>> reduce the risk to delivery schedules.
>>
>> Linux doesn't seem to make any attempt to have barriers in the low level
>> MMIO access functions. If Linux had chosen to do that on ARM64, this patch
>> would not have been required. For a low speed device like an i2c controller,
>> optimizing barriers likely make little difference in performance.
> * AFAICS it does for the write(b|w|l|q)() accessors. See __dma_wb(),
> * __io_bw() and __raw_write*() macros. The former one is converted
> * to DMB.
>
>> Let's look at it from a risk analysis viewpoint. Say a DMB is sufficient and
>> we use the stronger DSB variant, the downside is a few cpu cycles will be
>> wasted in i2c transfers. Say we use a DMB when a DSB is required for correct
>> operation, the downside is i2c operations may malfunction. In this case,
>> using a few extra cpu cycles for an operation that does not happen at high
>> frequency is lower risk than failures in i2c transfers. If there is any
>> uncertainty in what barrier type to use, picking DSB over DMB would be
>> better. We determined from the include fragments above that wmb() give the
>> DSB and smp_wmb() does not.
>>
>> Based on the above info, I think wmb() is still the correct function, and a
>> change to smp_wmb() would not be correct.
>>
>> Sorry for the long message, I know some of you will be inspired to think
>> deeply about barriers, and some will be annoyed that I spent this much space
>> to explain how I came to the choice of wmb().
> Thank you very much for the very-very-very detailed justification of
> your point. I well understand why you insist on using the mandatory
> barrier on your platform. The thing is that your patch concerns the
> generic driver which is also executed on another archs. Thus we need
> to be very careful with the barrier selection since it may cause
> unpleasant side effects there. For instance some time ago I met a
> problem with using memory barriers on the MMIO accesses on the MIPS
> arch. Full mem access barrier caused the program execution stalling
> for too long so it failed to fetch data from a device Rx FIFO on time.
> FIFO got overrun, data got lost and communications were aborted with an
> error returned. I am not saying that the same problem may happen here,
> but just pointing out that selecting a correct barrier is important.
>
> Since you are fixing a generic driver code we should make the
> decisions based on the problem description and the barriers semantic
> defined by the kernel. If for some reason the solution turns to be not
> working, then it might as well indicate that the barrier isn't working
> as expected by the kernel. Thorough studying the platform-specific
> barrier implementation will be necessary then (what you've already
> done).
>
> Here is what you say regarding the found problem:
>
> "Errors were happening in the ISR that looked like corrupted memory.
> This was because memory writes from the core enabling interrupts were
> not yet visible to the core running the ISR...  Add a write barrier
> before enabling interrupts to assure data written by the current core
> is visible to all cores before the interrupt fires."
>
> Based on that, I can infer that the problem is relevant for the
> SMP-systems only and the root of it is in one CPU/core not seeing data
> written by another CPU/core. Indeed adding a barrier shall fix it.
> Seeing neither uni-processor systems nor any peripheral devices are
> affected, SMP-conditional barrier shall be enough. Here is what [1]
> says regarding the mandatory (mb/rmb/wmb) and SMP-conditional barriers
> (smp_mb, smp_rmb, smp_wmb):
>
> "Note that SMP memory barriers _must_ be used to control the ordering
> of references to shared memory on SMP systems, though the use of
> locking instead is sufficient. ... Mandatory barriers should not be
> used to control SMP effects, since mandatory barriers impose
> unnecessary overhead on both SMP and UP systems. They may, however, be
> used to control MMIO effects on accesses through relaxed memory I/O
> windows. ... SMP memory barriers are reduced to compiler barriers on
> uniprocessor compiled systems because it is assumed that a CPU will
> appear to be self-consistent, and will order overlapping accesses
> correctly with respect to itself."
>
> [1] "CPU MEMORY BARRIERS", Documentation/memory-barriers.txt
>
> (note [1] also contains an example of using the smp_rmb()/smp_wmb()
> barriers in a case similar to yours but involving two tasks instead of
> a task and ISR)
>
> Based on that description, the mandatory and SMP-conditional barriers
> are supposed to similarly function when it comes to ordering the
> shared memory accesses in the SMP systems. Meanwhile the former ones
> cause additional overhead on UPs and MMIO which is out of the defined
> problem scope.
>
> Thus this also indicate that smp_wmb() is your choice here. But adding
> it didn't solve the problem meanwhile using wmb() did. And here we are
> getting to these barriers implementation on ARM64:
> wmb() -> DSB
> smp_wmb() -> DMB
> Again I am not the ARM expert, but based on the text cited in your
> message and what can be found in the Internet I can guess that DMB
> doesn't guarantee the memory write _completion_, but instead make sure
> that the accesses are just orderly executed on the core pipeline (for
> instance just by fetching and dispatching these instructions within
> different core cycles). The writes _completion_ is guaranteed by the
> DSB barrier. Seeing in order to solve the problem you described all
> the writes before the IRQ is raised _must_ be finished for sure to be
> visible on another core executing an ISR, the barrier you need is DSB.
>
> Unless I am mistaken in some aspects all of the considerations above
> make me thinking that perhaps the smp_mb/smp_rmb/smp_wmb barriers
> implementations on ARM64 are incorrect in using DMB and instead should
> be converted to using DSB. Then you'll be able to freely utilize the
> smp_wmb() barrier in the i2c-driver.
>
> Catalin, Will could you please clarify whether what is stated above is
> wrong or correct? Could you give your opinion regarding the issue
> here?

Indeed I agree with all that's been said by Serge here.

I'm just adding some piece of information here to help understand the 
issue and then decide what's to be done.

* some ARM blog tips about when to use wmb() and smp_wmb() : 
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel

* some details about ARMv7 barrier instructions: isb, dmb, dsb: 
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-3---memory-access-ordering-in-the-arm-architecture

* The kernel doc about memory barriers, even if I guess everybody here 
knows this resource already: 
https://www.kernel.org/doc/Documentation/memory-barriers.txt

If I understand correctly what I've read in the 2nd link, indeed the DMB 
does not guarantee that the write *completes*. So if the write was for 
instance targeting DDR: it does not guarantee that the write reached the 
DDR before another core could be running the ISR and checks for the data.

*But*, it guarantees that "All data accesses by this processor/core 
before the DMB will be visible to all other masters within the specified 
shareability domain before any of the data accesses after it."

In other word (If I understand correctly): maybe it won't have reached 
the DDR yet *but* the cache coherency mechanism is done and all other 
cores would read the new data if they load at this address.

That seems to me to do the job for our use case.

The difference between DMB and DSB is that the DSB, on top of doing what 
the DMB does, stalls *all* instructions (and not just stores) until the 
synchronization mechanism is done.

That's my understanding but let's wait for the experts to enlighten us :)

Regards,
Jan Bottorff Sept. 19, 2023, 3:45 a.m. UTC | #9
On 9/18/2023 4:14 PM, Serge Semin wrote:
> On Sun, Sep 17, 2023 at 10:08:47PM +0200, Yann Sionneau wrote:
>> Hi all,
>>
>> Le 17/09/2023 à 02:01, Serge Semin a écrit :
>>> To += Catalin, Will
>>>
>>> Could you please join the discussion and clarify some ARM64 barriers
>>> aspects?
>>>
>>> On Fri, Sep 15, 2023 at 06:47:55PM -0700, Jan Bottorff wrote:
>>>> On 9/15/2023 8:21 AM, Serge Semin wrote:
>>>> ...
>>>>>> diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
>>>>>> index ca1035e010c7..1694ac6bb592 100644
>>>>>> --- a/drivers/i2c/busses/i2c-designware-master.c
>>>>>> +++ b/drivers/i2c/busses/i2c-designware-master.c
>>>>>> @@ -248,6 +248,14 @@ static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
>>>>>>         /* Dummy read to avoid the register getting stuck on Bay Trail */
>>>>>>         regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
>>>>>>
>>>>>> +     /*
>>>>>> +      * To guarantee data written by the current core is visible to
>>>>>> +      * all cores, a write barrier is required. This needs to be
>>>>>> +      * before an interrupt causes execution on another core.
>>>>>> +      * For ARM processors, this needs to be a DSB barrier.
>>>>>> +      */
>>>>>> +     wmb();
>>>>> Based on the patch log and the comment, smp_wmb() seems to be more
>>>>> suitable here since the problem looks like SMP-specific. Most
>>>>> importantly the smp_wmb() will get to be just the compiler barrier on
>>>>> the UP system, so no cache and pipeline flushes in that case.
>>>>> Meanwhile
>>>>>
>>>>> I am not ARM expert, but based on the problem and the DMB/DSB barriers
>>>>> descriptions using DMB should be enough in your case since you only
>>>>> need memory syncs.
>>>>>
>>>> Hi Serge,
>>>>
>>>> I looked at the definition of smp_wmb, and it looks like on arm64 it uses a
>>>> DMB barrier not a DSB barrier.
>>>>
>>>> In /arch/arm64/include/asm/barrier.h:
>>>> ...
>>>> #define __arm_heavy_mb(x...) dsb(x)
>>>> ...
>>>> #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
>>>> ...
>>>> #define wmb()		__arm_heavy_mb(st)
>>>> ...
>>>> #define __smp_wmb()	dmb(ishst)
>>>>
>>>> And then in /include/asm-generic/barrier.h it says:
>>>> #ifdef CONFIG_SMP
>>>> ...
>>>> #ifndef smp_wmb
>>>> #define smp_wmb()	do { kcsan_wmb(); __smp_wmb(); } while (0)
>>>> #endif
>>>>
>>>> This looks like wmb() is a DSB and smp_wmb() is a DMB on SMP systems, so the
>>>> two functions are not equivalent on SMP systems.
>>>
>>> Right. They aren't. That's why I added a note regarding the DMB
>>> instruction. Anyway see further for detailed explantation of my point.
>>>
>>>> So lets explore if we think DMB or DSB is the correct barrier.
>>>>
>>>> The ARM barrier docs I referred to has a specific example that says this:
>>>>
>>>> "In some message passing systems, it is common for one observer to update
>>>> memory and then send an interrupt using a mailbox of some sort to a second
>>>> observer to indicate that memory has been updated and the new
>>>> contents have been read. Even though the sending of the interrupt using a
>>>> mailbox might be initiated using a memory access, a DSB barrier
>>>> must be used to ensure the completion of previous memory accesses.
>>>>
>>>> Therefore the following sequence is needed to ensure that P2 sees the
>>>> updated value.
>>>>
>>>> P1:
>>>>    STR R5, [R1] ; message stored to shared memory location
>>>>    DSB [ST]
>>>>    STR R1, [R4] ; R4 contains the address of a mailbox
>>>>
>>>> P2:
>>>>    ; interrupt service routine
>>>>    LDR R5, [R1]
>>>>
>>>> Even if R4 is a pointer to Strongly-Ordered memory, the update to R1 might
>>>> not be visible without the DSB executed by P1.
>>>> It should be appreciated that these rules are required in connection to the
>>>> ARM Generic Interrupt Controller (GIC).
>>>> "
>>>>
>>>> I don't positivly understand why it needs to be a DSB and not just a DMB,
>>>> but this example matches what happens in the driver. The ARM docs do some
>>>> hand waving that DSB is required because of the GIC.
>>>>
>>>> Unless we can come up with a reason why this example in the ARM Barrier docs
>>>> is not a match for what happens in the i2c driver, then ARM is saying it has
>>>> to be a DSB not a DMB. If it needs to be a DSB then smb_wmb is insufficient.
>>>>
>>>> Does anybody else have a different interpretation of this section in the ARM
>>>> barrier docs? They use the word mailbox, and show a shared memory write, an
>>>> interrupt triggering write, and a read of shared memory on a different core.
>>>> Some would describe that as a software mailbox.
>>>>
>>>> I did read someplace (although don't have a specific reference I can give)
>>>> that ordering applied to normal memory writes are in a different group than
>>>> ordering applied between strongly ordered accesses. The excerpt from the ARM
>>>> barrier document above does say "Even if R4 is a pointer to Strongly-Ordered
>>>> memory, the update to R1 might not be visible without the DSB executed by
>>>> P1", which implies a DMB is insufficient to cause ordering between normal
>>>> memory writes and strongly-ordered device memory writes.
>>>>
>>>> I know currently on ARM64 Windows, the low-level kernel device MMIO access
>>>> functions (like WRITE_REGISTER_ULONG) all have a DSB before the MMIO memory
>>>> access. That seems a little heavy handed to me, but it also may be that was
>>>> required to get all the current driver code written for AMD/Intel processors
>>>> to work correctly on ARM64 without adding barriers in the drivers. There are
>>>> also non-barrier variants that can be used if a driver wants to optimize
>>>> performance. Defaulting to correct operation with minimal code changes would
>>>> reduce the risk to delivery schedules.
>>>>
>>>> Linux doesn't seem to make any attempt to have barriers in the low level
>>>> MMIO access functions. If Linux had chosen to do that on ARM64, this patch
>>>> would not have been required. For a low speed device like an i2c controller,
>>>> optimizing barriers likely make little difference in performance.
>>>
>>> * AFAICS it does for the write(b|w|l|q)() accessors. See __dma_wb(),
>>> * __io_bw() and __raw_write*() macros. The former one is converted
>>> * to DMB.
>>>
>>>> Let's look at it from a risk analysis viewpoint. Say a DMB is sufficient and
>>>> we use the stronger DSB variant, the downside is a few cpu cycles will be
>>>> wasted in i2c transfers. Say we use a DMB when a DSB is required for correct
>>>> operation, the downside is i2c operations may malfunction. In this case,
>>>> using a few extra cpu cycles for an operation that does not happen at high
>>>> frequency is lower risk than failures in i2c transfers. If there is any
>>>> uncertainty in what barrier type to use, picking DSB over DMB would be
>>>> better. We determined from the include fragments above that wmb() give the
>>>> DSB and smp_wmb() does not.
>>>>
>>>> Based on the above info, I think wmb() is still the correct function, and a
>>>> change to smp_wmb() would not be correct.
>>>>
>>>> Sorry for the long message, I know some of you will be inspired to think
>>>> deeply about barriers, and some will be annoyed that I spent this much space
>>>> to explain how I came to the choice of wmb().
>>>
>>> Thank you very much for the very-very-very detailed justification of
>>> your point. I well understand why you insist on using the mandatory
>>> barrier on your platform. The thing is that your patch concerns the
>>> generic driver which is also executed on another archs. Thus we need
>>> to be very careful with the barrier selection since it may cause
>>> unpleasant side effects there. For instance some time ago I met a
>>> problem with using memory barriers on the MMIO accesses on the MIPS
>>> arch. Full mem access barrier caused the program execution stalling
>>> for too long so it failed to fetch data from a device Rx FIFO on time.
>>> FIFO got overrun, data got lost and communications were aborted with an
>>> error returned. I am not saying that the same problem may happen here,
>>> but just pointing out that selecting a correct barrier is important.
>>>
>>> Since you are fixing a generic driver code we should make the
>>> decisions based on the problem description and the barriers semantic
>>> defined by the kernel. If for some reason the solution turns to be not
>>> working, then it might as well indicate that the barrier isn't working
>>> as expected by the kernel. Thorough studying the platform-specific
>>> barrier implementation will be necessary then (what you've already
>>> done).
>>>
>>> Here is what you say regarding the found problem:
>>>
>>> "Errors were happening in the ISR that looked like corrupted memory.
>>> This was because memory writes from the core enabling interrupts were
>>> not yet visible to the core running the ISR...  Add a write barrier
>>> before enabling interrupts to assure data written by the current core
>>> is visible to all cores before the interrupt fires."
>>>
>>> Based on that, I can infer that the problem is relevant for the
>>> SMP-systems only and the root of it is in one CPU/core not seeing data
>>> written by another CPU/core. Indeed adding a barrier shall fix it.
>>> Seeing neither uni-processor systems nor any peripheral devices are
>>> affected, SMP-conditional barrier shall be enough. Here is what [1]
>>> says regarding the mandatory (mb/rmb/wmb) and SMP-conditional barriers
>>> (smp_mb, smp_rmb, smp_wmb):
>>>
>>> "Note that SMP memory barriers _must_ be used to control the ordering
>>> of references to shared memory on SMP systems, though the use of
>>> locking instead is sufficient. ... Mandatory barriers should not be
>>> used to control SMP effects, since mandatory barriers impose
>>> unnecessary overhead on both SMP and UP systems. They may, however, be
>>> used to control MMIO effects on accesses through relaxed memory I/O
>>> windows. ... SMP memory barriers are reduced to compiler barriers on
>>> uniprocessor compiled systems because it is assumed that a CPU will
>>> appear to be self-consistent, and will order overlapping accesses
>>> correctly with respect to itself."
>>>
>>> [1] "CPU MEMORY BARRIERS", Documentation/memory-barriers.txt
>>>
>>> (note [1] also contains an example of using the smp_rmb()/smp_wmb()
>>> barriers in a case similar to yours but involving two tasks instead of
>>> a task and ISR)
>>>
>>> Based on that description, the mandatory and SMP-conditional barriers
>>> are supposed to similarly function when it comes to ordering the
>>> shared memory accesses in the SMP systems. Meanwhile the former ones
>>> cause additional overhead on UPs and MMIO which is out of the defined
>>> problem scope.
>>>
>>> Thus this also indicate that smp_wmb() is your choice here. But adding
>>> it didn't solve the problem meanwhile using wmb() did. And here we are
>>> getting to these barriers implementation on ARM64:
>>> wmb() -> DSB
>>> smp_wmb() -> DMB
>>> Again I am not the ARM expert, but based on the text cited in your
>>> message and what can be found in the Internet I can guess that DMB
>>> doesn't guarantee the memory write _completion_, but instead make sure
>>> that the accesses are just orderly executed on the core pipeline (for
>>> instance just by fetching and dispatching these instructions within
>>> different core cycles). The writes _completion_ is guaranteed by the
>>> DSB barrier. Seeing in order to solve the problem you described all
>>> the writes before the IRQ is raised _must_ be finished for sure to be
>>> visible on another core executing an ISR, the barrier you need is DSB.
>>>
>>> Unless I am mistaken in some aspects all of the considerations above
>>> make me thinking that perhaps the smp_mb/smp_rmb/smp_wmb barriers
>>> implementations on ARM64 are incorrect in using DMB and instead should
>>> be converted to using DSB. Then you'll be able to freely utilize the
>>> smp_wmb() barrier in the i2c-driver.
>>>
>>> Catalin, Will could you please clarify whether what is stated above is
>>> wrong or correct? Could you give your opinion regarding the issue
>>> here?
>>
>> Indeed I agree with all that's been said by Serge here.
>>
>> I'm just adding some piece of information here to help understand the issue
>> and then decide what's to be done.
>>
>> * some ARM blog tips about when to use wmb() and smp_wmb() : https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-2---barriers-and-the-linux-kernel
>>
>> * some details about ARMv7 barrier instructions: isb, dmb, dsb: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-3---memory-access-ordering-in-the-arm-architecture
> 
> Thanks for the links. Especially for the last one. Reading it gave the
> idea to me of another possible reason of the problem, which looks as
> more probable in this case.
> 
>>
>> * The kernel doc about memory barriers, even if I guess everybody here knows
>> this resource already:
>> https://www.kernel.org/doc/Documentation/memory-barriers.txt
>>
>> If I understand correctly what I've read in the 2nd link, indeed the DMB
>> does not guarantee that the write *completes*. So if the write was for
>> instance targeting DDR: it does not guarantee that the write reached the DDR
>> before another core could be running the ISR and checks for the data.
>>
>> *But*, it guarantees that "All data accesses by this processor/core before
>> the DMB will be visible to all other masters within the specified
>> shareability domain before any of the data accesses after it."
>>
>> In other word (If I understand correctly): maybe it won't have reached the
>> DDR yet *but* the cache coherency mechanism is done and all other cores
>> would read the new data if they load at this address.
>>
>> That seems to me to do the job for our use case.
> 
> Actually there is no need in the load and store to reach RAM as long as
> the CPU is cache-coherent. In general it means what happens on the
> per-core or per-cluster caches is visible to all CPU cores. I doubt
> any of the modern systems lack of such ability. ARM64 certainly
> doesn't lack it. But based on what is said in the second link
> the dmb/dsb barriers semantic is configurable in that regard. The
> barriers work within the specified Shareability domain. I don't
> know at what stage the domains are supposed to be configured
> (at the SoC design or at runtime), but if a barrier doesn't involve
> all the domains which need to have the preceding operations visible
> the problem described by Jan may happen.
> 
>>
>> The difference between DMB and DSB is that the DSB, on top of doing what the
>> DMB does, stalls *all* instructions (and not just stores) until the
>> synchronization mechanism is done.
> 
> Yes, that's the main difference. But (r|w)?mb() and smp_(r|w)?mb()
> barriers are converted to the DSB and DMB instructions executed for
> different domains:
> 
> #define __mb()          dsb(sy)
> #define __rmb()         dsb(ld)
> #define __wmb()         dsb(st)
> 
> #define __smp_mb()      dmb(ish)
> #define __smp_rmb()     dmb(ishld)
> #define __smp_wmb()     dmb(ishst)
> 
> The Mandatory barriers affect the System Shareability domain, The
> SMP-conditional barriers - the Inner Shareability domain. So if for
> some reason the CPU cores on the Jan's system are split up into
> several Inner domains, AFAIU the SMP barrier will involve only the one
> on which the barrier is executed. So as said on the ARM64 doc:
> "Outside of this domain, observers might not see the same order of
> memory accesses as inside it."
> 
> So Jan could you please try the next out:
> 1. Check whether all the CPU cores in your system are on the same
> Inner Shareability domain.
> 2. If not (or in anyway) please redefine SMP-barriers like this:
> #define __smp_mb()      dmb(osh)
> #define __smp_rmb()     dmb(oshld)
> #define __smp_wmb()     dmb(oshst)
> 3. If it didn't help like this
> #define __smp_mb()      dmb(sy)
> #define __smp_rmb()     dmb(ld)
> #define __smp_wmb()     dmb(st)
> 
> -Serge(y)

Hi Serge,

My understanding is all cores running the same OS instance will be in 
the same inner sharable domain. The ARM64 domain seems to be closely 
related to or maybe is just another name for cache-coherency domain. 
This is the case on the system showing this issue.

I first got involved when we saw the issue on a 2 socket 160 core per 
socket AmpereOne CPU system (all 320 cores are cache coherent, in the 
same inner sharable domain) where the ISR was running on a different 
socket than the i2c transfer setup code. It was trivial to reproduce on 
that system. It then later turned out a customer that I believe was 
running a single socket system also had seen the issue, and this patch 
fixed it on both systems.

Not only did we see the timeout messages, I saw one debug log where the 
developer logged many values from inside i2c_dw_xfer_msg, and it showed 
values like dev->msg_read_idx changing magically. Some months ago I had 
debugged a missing barrier bug on a different OS, and values changing 
magically was an indicator of a missing barrier. We had another bug 
report where it looked like dev->msgs was null on the first transfer 
after booting, which is plausibly the same issue.

My understanding is we now have had three large customers see the issue 
with their internal tests, and have reported the issue goes way with the 
barrier patch.

I did some more hunting on what ARM docs say:

On the page 
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/memory-access-ordering-part-3---memory-access-ordering-in-the-arm-architecture 
in the section "Device memory and Strongly-ordered memory" is the sentence

"There is however no guarantee about ordering between memory accesses to 
different devices, or usually between accesses of different memory types".

I interpret that to mean memory mapped as strong-ordering only applies 
within a single device, and there is no ordering guarantee between 
different strong-ordering mapped regions or regions of other memory 
types like normal memory. I could not find any info on how DSB or DMB 
barriers interact with groups of accesses to different memory types. I 
could not find if a DMB after normal memory writes guarantees strongly 
ordered writes (device memory) after the barrier are delayed until the 
normal memory writes before the DMB barrier are fully observable by all 
observers.

The ARM docs saying a DSB barrier is required after normal memory writes 
and before a write that triggers an interrupt makes me believe DMB will 
not cause an ordering barrier between writes to normal memory and writes 
to strongly-ordered device memory.

I did find the below text in the Arm Architectural Reference Manual (DDI 
0487I.a) section K13.4 "Using a mailbox to send an interrupt". It was 
nearly the same wording as the ARM barrier document I previously 
referenced at https://developer.arm.com/documentation/genc007826/latest/ 
This too says a DSB barrier is required for memory updates to be 
observable in the ISR.

"
K13.4 Using a mailbox to send an interrupt
   In some message passing systems, it is common for one observer to 
update memory and then notify a second observer of the update by sending 
an interrupt, using a mailbox. Although a memory access might be made to 
initiate the sending of the mailbox interrupt, a DSB instruction is
required to ensure the completion of previous memory accesses.

Therefore, the following sequence is required to ensure that P2 observes 
the updated value:

AArch32
P1
   STR R5, [R1] ; message stored to shared memory location
   DSB ST
   STR R0, [R4] ; R4 contains the address of a mailbox
P2
   ; interrupt service routine
   LDR R5, [R1]

AArch64
P1
   STR W5, [X1] ; message stored to shared memory location
   DSB ST
   STR W0, [X4] ; R4 contains the address of a mailbox
P2
   ; interrupt service routine
   LDR W5, [X1]
"

I hear your concern about how this barrier in platform portable code may 
impact platforms other than the one I'm trying to fix. It almost seems 
like there is some missing type of barrier macro that on ARM64 does what 
is required for cases like this and on other platforms does whatever is 
appropriate for that platform, often nothing.

Thanks,
Jan

> 
>>
>> That's my understanding but let's wait for the experts to enlighten us :)
>>
>> Regards,
>>
>> -- 
>>
>> Yann
>>
Catalin Marinas Sept. 19, 2023, 9:55 a.m. UTC | #10
On Mon, Sep 18, 2023 at 08:45:42PM -0700, Jan Bottorff wrote:
> > > > > > > diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
> > > > > > > index ca1035e010c7..1694ac6bb592 100644
> > > > > > > --- a/drivers/i2c/busses/i2c-designware-master.c
> > > > > > > +++ b/drivers/i2c/busses/i2c-designware-master.c
> > > > > > > @@ -248,6 +248,14 @@ static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
> > > > > > >         /* Dummy read to avoid the register getting stuck on Bay Trail */
> > > > > > >         regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
> > > > > > > 
> > > > > > > +     /*
> > > > > > > +      * To guarantee data written by the current core is visible to
> > > > > > > +      * all cores, a write barrier is required. This needs to be
> > > > > > > +      * before an interrupt causes execution on another core.
> > > > > > > +      * For ARM processors, this needs to be a DSB barrier.
> > > > > > > +      */
> > > > > > > +     wmb();
[...]
> I did find the below text in the Arm Architectural Reference Manual (DDI
> 0487I.a) section K13.4 "Using a mailbox to send an interrupt". It was nearly
> the same wording as the ARM barrier document I previously referenced at
> https://developer.arm.com/documentation/genc007826/latest/ This too says a
> DSB barrier is required for memory updates to be observable in the ISR.
> 
> "
> K13.4 Using a mailbox to send an interrupt
>   In some message passing systems, it is common for one observer to update
> memory and then notify a second observer of the update by sending an
> interrupt, using a mailbox. Although a memory access might be made to
> initiate the sending of the mailbox interrupt, a DSB instruction is
> required to ensure the completion of previous memory accesses.
> 
> Therefore, the following sequence is required to ensure that P2 observes the
> updated value:
> 
> AArch32
> P1
>   STR R5, [R1] ; message stored to shared memory location
>   DSB ST
>   STR R0, [R4] ; R4 contains the address of a mailbox
> P2
>   ; interrupt service routine
>   LDR R5, [R1]
> 
> AArch64
> P1
>   STR W5, [X1] ; message stored to shared memory location
>   DSB ST
>   STR W0, [X4] ; R4 contains the address of a mailbox
> P2
>   ; interrupt service routine
>   LDR W5, [X1]
> "

Will convinced me in the past that a DMB is sufficient here unless the
peripheral is CPU-local. The Arm ARM is not entirely clear here.

> I hear your concern about how this barrier in platform portable code may
> impact platforms other than the one I'm trying to fix. It almost seems like
> there is some missing type of barrier macro that on ARM64 does what is
> required for cases like this and on other platforms does whatever is
> appropriate for that platform, often nothing.

I also agree that a wmb() in the i2c driver is not the more elegant fix.
For similar reasons, we hid barriers in the write*() macros, drivers
need to stay architecture-agnostic as much as possible.

Where does the regmap_write() end up? I think the barrier should be
somewhere down this path.
Wolfram Sang Sept. 19, 2023, 10:19 a.m. UTC | #11
> I also agree that a wmb() in the i2c driver is not the more elegant fix.
> For similar reasons, we hid barriers in the write*() macros, drivers
> need to stay architecture-agnostic as much as possible.

Exactly my thinking. I wanted to read this patch discussion later this
week. But from glimpsing at it so far, I already wondered why there
isn't a memory barrier in the final accessor to the register.
Yann Sionneau Sept. 19, 2023, 12:38 p.m. UTC | #12
Hi,

On 9/19/23 12:19, Wolfram Sang wrote:
>> I also agree that a wmb() in the i2c driver is not the more elegant fix.
>> For similar reasons, we hid barriers in the write*() macros, drivers
>> need to stay architecture-agnostic as much as possible.
> Exactly my thinking. I wanted to read this patch discussion later this
> week. But from glimpsing at it so far, I already wondered why there
> isn't a memory barrier in the final accessor to the register.

The regmap accessors used by the designware driver end up calling 
writel_relaxed() and readl_relaxed() : 
https://elixir.bootlin.com/linux/v6.6-rc2/source/drivers/i2c/busses/i2c-designware-common.c#L71

Those usually end up just being volatile accesses, making some kind of 
compiler barrier preventing the memory mapped register accesses to be 
moved/removed/merged.

I kind of think it's OK. It starts not being OK when you want some 
ordering between those and some memory accesses in DDR like the problem 
discussed here.

In those cases I would say the smp_* barriers are what we are supposed 
to use, isn't it?

Regards,
Catalin Marinas Sept. 19, 2023, 2:51 p.m. UTC | #13
On Tue, Sep 19, 2023 at 02:38:22PM +0200, Yann Sionneau wrote:
> Hi,
> 
> On 9/19/23 12:19, Wolfram Sang wrote:
> > > I also agree that a wmb() in the i2c driver is not the more elegant fix.
> > > For similar reasons, we hid barriers in the write*() macros, drivers
> > > need to stay architecture-agnostic as much as possible.
> > Exactly my thinking. I wanted to read this patch discussion later this
> > week. But from glimpsing at it so far, I already wondered why there
> > isn't a memory barrier in the final accessor to the register.
> 
> The regmap accessors used by the designware driver end up calling
> writel_relaxed() and readl_relaxed() : https://elixir.bootlin.com/linux/v6.6-rc2/source/drivers/i2c/busses/i2c-designware-common.c#L71

OK, since it ends up with the *_relaxed() accessors, there are no
barriers here. I wonder whether the regmap API should have both standard
and relaxed variants. If a regmap driver does not populate the
.reg_write_relaxed etc. members, a regmap_write_relaxed() would just
fall back to regmap_write().

We went through similar discussions many years ago around the I/O
accessors and decided to add the barriers to readl/writel() even if they
become more expensive, correctness should be first. The relaxed variants
were added as optimisations if specific memory ordering was not
required. I think the regmap API should follow the same semantics, go
for correctness first as you can't tell what the side-effect of a
regmap_write() is (e.g. kicking off DMA or causing an interrupt on
another CPU).

> In those cases I would say the smp_* barriers are what we are supposed to
> use, isn't it?

While smp_* is ok, it really depends on what the regmap_write() does. Is
it a write to a shared peripheral (if not, you may need a DSB)? Does the
regmap_write() caller know this? That's why I think having the barrier
in dw_reg_write() is better.

If you do want to stick to a fix in i2c_dw_xfer_init(), you could go for
dma_wmb(). While this is not strictly DMA, it's sharing data with
another coherent agent (a different CPU in this instance). The smp_wmb()
is more about communication via memory not involving I/O. But this still
assumes that the caller knows regmap_write() ends up with an I/O
write*() (potentially relaxed).
Wolfram Sang Sept. 19, 2023, 2:55 p.m. UTC | #14
> OK, since it ends up with the *_relaxed() accessors, there are no
> barriers here. I wonder whether the regmap API should have both standard
> and relaxed variants. If a regmap driver does not populate the
> .reg_write_relaxed etc. members, a regmap_write_relaxed() would just
> fall back to regmap_write().
> 
> We went through similar discussions many years ago around the I/O
> accessors and decided to add the barriers to readl/writel() even if they
> become more expensive, correctness should be first. The relaxed variants
> were added as optimisations if specific memory ordering was not
> required. I think the regmap API should follow the same semantics, go
> for correctness first as you can't tell what the side-effect of a
> regmap_write() is (e.g. kicking off DMA or causing an interrupt on
> another CPU).

Again, I am all with Catalin here. Safety first, optimizations a la
*_relaxed should be opt-in.
Jan Bottorff Sept. 19, 2023, 6:54 p.m. UTC | #15
On 9/19/2023 7:51 AM, Catalin Marinas wrote:
> 
> While smp_* is ok, it really depends on what the regmap_write() does. Is
> it a write to a shared peripheral (if not, you may need a DSB)? Does the
> regmap_write() caller know this? That's why I think having the barrier
> in dw_reg_write() is better.
> 
> If you do want to stick to a fix in i2c_dw_xfer_init(), you could go for
> dma_wmb(). While this is not strictly DMA, it's sharing data with
> another coherent agent (a different CPU in this instance). The smp_wmb()
> is more about communication via memory not involving I/O. But this still
> assumes that the caller knows regmap_write() ends up with an I/O
> write*() (potentially relaxed).

If we wanted maximum correctness wouldn't we need something like 
writel_triggers_interrupt/regmap_write_triggers_interrupt or maybe 
preinterrupt_wmb?

The ARM docs do have a specific example case where the device write 
triggers an interrupt, and that example specifically says a DSB barrier 
is needed.

If I look at the ARM GIC IPI send function gic_ipi_send_mask in 
https://elixir.bootlin.com/linux/v6.6-rc2/source/drivers/irqchip/irq-gic-v3.c#L1354 
is says:

         /*
	 * Ensure that stores to Normal memory are visible to the
	 * other CPUs before issuing the IPI.
	 */
	dsb(ishst);

I would think the IPI send code is very carefully tuned for performance, 
and would not use a barrier any stronger than required.

I believe dma_wmb maps to DMB on ARM64.

- Jan
Serge Semin Sept. 19, 2023, 9:05 p.m. UTC | #16
Hi all

On Tue, Sep 19, 2023 at 11:54:10AM -0700, Jan Bottorff wrote:
> On 9/19/2023 7:51 AM, Catalin Marinas wrote:
> > 
> > While smp_* is ok, it really depends on what the regmap_write() does. Is
> > it a write to a shared peripheral (if not, you may need a DSB)? Does the
> > regmap_write() caller know this? That's why I think having the barrier
> > in dw_reg_write() is better.
> > 
> > If you do want to stick to a fix in i2c_dw_xfer_init(), you could go for
> > dma_wmb(). While this is not strictly DMA, it's sharing data with
> > another coherent agent (a different CPU in this instance). The smp_wmb()
> > is more about communication via memory not involving I/O. But this still
> > assumes that the caller knows regmap_write() ends up with an I/O
> > write*() (potentially relaxed).

Catalin, thank you very much for your messages. The situation is much
clearer now. I should have noted that we indeed talking about
different memory types: Normal and Device memories. Anyway to sum it up
AFAICS the next situation is happening:
1. some data is updated,
2. IRQ is activated by means of writel_relaxed() to the
DW_IC_INTR_MASK register.
3. IRQ is raised and being handled, but the data updated in 1. looked
as corrupted.

(Jan, correct me if I'm wrong.)

Since ARM doesn't "guarantee ordering between memory accesses to
different devices, or usually between accesses of different memory
types", most likely the CPU core changes 1. and 2. places
occasionally. So in that case the IRQ handler just doesn't see the
updated data. What needs to be done is to make sure that 2. always
happens after 1. is completed. Outer Shareability domain write-barrier
(DMB(oshst)) does that. Am I right? That's why it's utilized in the
__io_bw() and __dma_wmb() macros implementation.

Wolfram, Jan seeing the root cause of the problem is in using the
'_relaxed' accessors for the controller CSRs I assume that the problem
might be more generic than just for ARMs, since based on [1] these
accessors "do not guarantee ordering with respect to locking, _normal_
memory accesses or delay() loops." So theoretically the problem might
get to be met on any other arch if it implements the semantic with the
relaxed normal and device memory accesses execution.

[1] "KERNEL I/O BARRIER EFFECTS" Documentation/memory-barriers.txt

If so we need to have give up from using the _relaxed accessors at for
the CSRs which may cause a side effect like IRQ raising. Instead the
normal IO write should be utilized which "wait for the completion of
all prior writes to memory either issued by, or propagated to, the
same thread." [1] Thus I'd suggest the next fix for the problem:

--- a/drivers/i2c/busses/i2c-designware-common.c
+++ b/drivers/i2c/busses/i2c-designware-common.c
@@ -72,7 +72,10 @@ static int dw_reg_write(void *context, unsigned int reg, unsigned int val)
 {
 	struct dw_i2c_dev *dev = context;
 
-	writel_relaxed(val, dev->base + reg);
+	if (reg == DW_IC_INTR_MASK)
+		writel(val, dev->base + reg);
+	else
+		writel_relaxed(val, dev->base + reg);
 
 	return 0;
 }

(and similar changes for dw_reg_write_swab() and dw_reg_write_word().)

What do you think?

> 
> If we wanted maximum correctness wouldn't we need something like
> writel_triggers_interrupt/regmap_write_triggers_interrupt or maybe
> preinterrupt_wmb?
> 
> The ARM docs do have a specific example case where the device write triggers
> an interrupt, and that example specifically says a DSB barrier is needed.
> 
> If I look at the ARM GIC IPI send function gic_ipi_send_mask in https://elixir.bootlin.com/linux/v6.6-rc2/source/drivers/irqchip/irq-gic-v3.c#L1354
> is says:
> 
>         /*
> 	 * Ensure that stores to Normal memory are visible to the
> 	 * other CPUs before issuing the IPI.
> 	 */
> 	dsb(ishst);
> 
> I would think the IPI send code is very carefully tuned for performance, and
> would not use a barrier any stronger than required.
> 
> I believe dma_wmb maps to DMB on ARM64.

Jan. Yes, but it looks like DMB() for the Outer-shareable domains
should be enough (see my comment above). DSB() seems like overkill
here. We don't raise IPIs or use mailboxes in this case, but merely
update the CSR flags.

-Serge(y)

> 
> - Jan
> 
> 
> 
> 
> 
>
Wolfram Sang Sept. 20, 2023, 9:08 a.m. UTC | #17
> same thread." [1] Thus I'd suggest the next fix for the problem:
> 
> --- a/drivers/i2c/busses/i2c-designware-common.c
> +++ b/drivers/i2c/busses/i2c-designware-common.c
> @@ -72,7 +72,10 @@ static int dw_reg_write(void *context, unsigned int reg, unsigned int val)
>  {
>  	struct dw_i2c_dev *dev = context;
>  
> -	writel_relaxed(val, dev->base + reg);
> +	if (reg == DW_IC_INTR_MASK)
> +		writel(val, dev->base + reg);
> +	else
> +		writel_relaxed(val, dev->base + reg);
>  
>  	return 0;
>  }
> 
> (and similar changes for dw_reg_write_swab() and dw_reg_write_word().)
> 
> What do you think?

To me, this looks reasonable and much more what I would have expected as
a result (from a high level point of view). Let's hope it works. I am
optimistic, though...
Catalin Marinas Sept. 20, 2023, 10:44 a.m. UTC | #18
On Tue, Sep 19, 2023 at 11:54:10AM -0700, Jan Bottorff wrote:
> On 9/19/2023 7:51 AM, Catalin Marinas wrote:
> > While smp_* is ok, it really depends on what the regmap_write() does. Is
> > it a write to a shared peripheral (if not, you may need a DSB)? Does the
> > regmap_write() caller know this? That's why I think having the barrier
> > in dw_reg_write() is better.
> > 
> > If you do want to stick to a fix in i2c_dw_xfer_init(), you could go for
> > dma_wmb(). While this is not strictly DMA, it's sharing data with
> > another coherent agent (a different CPU in this instance). The smp_wmb()
> > is more about communication via memory not involving I/O. But this still
> > assumes that the caller knows regmap_write() ends up with an I/O
> > write*() (potentially relaxed).
> 
> If we wanted maximum correctness wouldn't we need something like
> writel_triggers_interrupt/regmap_write_triggers_interrupt or maybe
> preinterrupt_wmb?

Well, if you want to have an API for all things that can be triggered
(interrupts, device DMA), you can try but I think it would make things
more confusing and driver writers won't bother (if, say, they only test
on x86 and never see a problem). The other way around - barriers by
default and only relax if you see a performance issue - seems more
sensible. But I don't maintain these drivers, so it's up to you guys.

> The ARM docs do have a specific example case where the device write triggers
> an interrupt, and that example specifically says a DSB barrier is needed.

Yeah, the Arm ARM is not very precise here on what the mailbox is,
whether it's a local or shared peripheral and they went for the
stronger DMB. Will added a good explanation on why a DMB is sufficient
in commit 22ec71615d82 ("arm64: io: Relax implicit barriers in default
I/O accessors"). It talks about DMA but it applies equally to another
CPU accessing the memory. It's pretty subtle though.

> If I look at the ARM GIC IPI send function gic_ipi_send_mask in
> https://elixir.bootlin.com/linux/v6.6-rc2/source/drivers/irqchip/irq-gic-v3.c#L1354
> is says:
> 
>         /*
> 	 * Ensure that stores to Normal memory are visible to the
> 	 * other CPUs before issuing the IPI.
> 	 */
> 	dsb(ishst);
> 
> I would think the IPI send code is very carefully tuned for performance, and
> would not use a barrier any stronger than required.

That's why I mentioned in my previous reply that it really depends on
what the regmap_write() does, where the I/O go shared peripheral or some
local CPU interface). In the GIC example above, there's not even an I/O
access but a system register write (MSR, see gic_write_sgi1r()), hence
the DSB. If you look at gic_ipi_send_mask() in irq-gic.c (GICv2), there
is a dmb(ishst) since the interrupt is sent with an I/O write to the GIC
distributor (shared peripheral).

> I believe dma_wmb maps to DMB on ARM64.

Yes, it does.
Catalin Marinas Sept. 20, 2023, 11:03 a.m. UTC | #19
On Wed, Sep 20, 2023 at 12:05:50AM +0300, Serge Semin wrote:
> On Tue, Sep 19, 2023 at 11:54:10AM -0700, Jan Bottorff wrote:
> > On 9/19/2023 7:51 AM, Catalin Marinas wrote:
> > > While smp_* is ok, it really depends on what the regmap_write() does. Is
> > > it a write to a shared peripheral (if not, you may need a DSB)? Does the
> > > regmap_write() caller know this? That's why I think having the barrier
> > > in dw_reg_write() is better.
> > > 
> > > If you do want to stick to a fix in i2c_dw_xfer_init(), you could go for
> > > dma_wmb(). While this is not strictly DMA, it's sharing data with
> > > another coherent agent (a different CPU in this instance). The smp_wmb()
> > > is more about communication via memory not involving I/O. But this still
> > > assumes that the caller knows regmap_write() ends up with an I/O
> > > write*() (potentially relaxed).
> 
> Catalin, thank you very much for your messages. The situation is much
> clearer now. I should have noted that we indeed talking about
> different memory types: Normal and Device memories. Anyway to sum it up
> AFAICS the next situation is happening:
> 1. some data is updated,
> 2. IRQ is activated by means of writel_relaxed() to the
> DW_IC_INTR_MASK register.
> 3. IRQ is raised and being handled, but the data updated in 1. looked
> as corrupted.
> 
> (Jan, correct me if I'm wrong.)
> 
> Since ARM doesn't "guarantee ordering between memory accesses to
> different devices, or usually between accesses of different memory
> types", most likely the CPU core changes 1. and 2. places
> occasionally. So in that case the IRQ handler just doesn't see the
> updated data. What needs to be done is to make sure that 2. always
> happens after 1. is completed. Outer Shareability domain write-barrier
> (DMB(oshst)) does that. Am I right? That's why it's utilized in the
> __io_bw() and __dma_wmb() macros implementation.

Yes.

> Wolfram, Jan seeing the root cause of the problem is in using the
> '_relaxed' accessors for the controller CSRs I assume that the problem
> might be more generic than just for ARMs, since based on [1] these
> accessors "do not guarantee ordering with respect to locking, _normal_
> memory accesses or delay() loops." So theoretically the problem might
> get to be met on any other arch if it implements the semantic with the
> relaxed normal and device memory accesses execution.
> 
> [1] "KERNEL I/O BARRIER EFFECTS" Documentation/memory-barriers.txt
> 
> If so we need to have give up from using the _relaxed accessors at for
> the CSRs which may cause a side effect like IRQ raising. Instead the
> normal IO write should be utilized which "wait for the completion of
> all prior writes to memory either issued by, or propagated to, the
> same thread." [1] Thus I'd suggest the next fix for the problem:
> 
> --- a/drivers/i2c/busses/i2c-designware-common.c
> +++ b/drivers/i2c/busses/i2c-designware-common.c
> @@ -72,7 +72,10 @@ static int dw_reg_write(void *context, unsigned int reg, unsigned int val)
>  {
>  	struct dw_i2c_dev *dev = context;
>  
> -	writel_relaxed(val, dev->base + reg);
> +	if (reg == DW_IC_INTR_MASK)
> +		writel(val, dev->base + reg);
> +	else
> +		writel_relaxed(val, dev->base + reg);
>  
>  	return 0;
>  }
> 
> (and similar changes for dw_reg_write_swab() and dw_reg_write_word().)

This should work as well.
Catalin Marinas Sept. 20, 2023, 11:05 a.m. UTC | #20
On Wed, Sep 20, 2023 at 11:44:58AM +0100, Catalin Marinas wrote:
> On Tue, Sep 19, 2023 at 11:54:10AM -0700, Jan Bottorff wrote:
> > The ARM docs do have a specific example case where the device write triggers
> > an interrupt, and that example specifically says a DSB barrier is needed.
> 
> Yeah, the Arm ARM is not very precise here on what the mailbox is,
> whether it's a local or shared peripheral and they went for the
> stronger DMB. Will added a good explanation on why a DMB is sufficient
           ^^^
	   DSB

(fixing typo in my reply)
Yann Sionneau Sept. 20, 2023, 1:27 p.m. UTC | #21
Hi,

On 20/09/2023 11:08, Wolfram Sang wrote:
>> same thread." [1] Thus I'd suggest the next fix for the problem:
>>
>> --- a/drivers/i2c/busses/i2c-designware-common.c
>> +++ b/drivers/i2c/busses/i2c-designware-common.c
>> @@ -72,7 +72,10 @@ static int dw_reg_write(void *context, unsigned int reg, unsigned int val)
>>   {
>>   	struct dw_i2c_dev *dev = context;
>>   
>> -	writel_relaxed(val, dev->base + reg);
>> +	if (reg == DW_IC_INTR_MASK)
>> +		writel(val, dev->base + reg);
>> +	else
>> +		writel_relaxed(val, dev->base + reg);
>>   
>>   	return 0;
>>   }
>>
>> (and similar changes for dw_reg_write_swab() and dw_reg_write_word().)
>>
>> What do you think?
> To me, this looks reasonable and much more what I would have expected as
> a result (from a high level point of view). Let's hope it works. I am
> optimistic, though...
>
It works if we make sure all the other register accesses to the 
designware i2c IP can't generate IRQ.

Meaning that all register accesses that can trigger an IRQ are enclosed 
in between a call to i2c_dw_disable_int() and a call to 
regmap_write(dev->map, DW_IC_INTR_MASK, DW_IC_INTR_MASTER_MASK); or 
equivalent.

It seems to be the case, I'm not sure what's the best way to make sure 
it will stay that way.

Moreover, maybe writes to IC_ENABLE register should also use the 
non-relaxed writel() version?

Since one could do something like:

[ IP is currently disabled ]

1/ enable interrupts in DW_IC_INTR_MASK

2/ update some variable in dev-> structure in DDR

3/ enable the device by writing to IC_ENABLE, thus triggering for 
instance the TX_FIFO_EMPTY irq.

Regards,
Jan Bottorff Sept. 20, 2023, 7:14 p.m. UTC | #22
On 9/20/2023 6:27 AM, Yann Sionneau wrote:
> Hi,
> 
> On 20/09/2023 11:08, Wolfram Sang wrote:
>>> same thread." [1] Thus I'd suggest the next fix for the problem:
>>>
>>> --- a/drivers/i2c/busses/i2c-designware-common.c
>>> +++ b/drivers/i2c/busses/i2c-designware-common.c
>>> @@ -72,7 +72,10 @@ static int dw_reg_write(void *context, unsigned 
>>> int reg, unsigned int val)
>>>   {
>>>       struct dw_i2c_dev *dev = context;
>>> -    writel_relaxed(val, dev->base + reg);
>>> +    if (reg == DW_IC_INTR_MASK)
>>> +        writel(val, dev->base + reg);
>>> +    else
>>> +        writel_relaxed(val, dev->base + reg);
>>>       return 0;
>>>   }
>>>
>>> (and similar changes for dw_reg_write_swab() and dw_reg_write_word().)
>>>
>>> What do you think?
>> To me, this looks reasonable and much more what I would have expected as
>> a result (from a high level point of view). Let's hope it works. I am
>> optimistic, though...
>>
> It works if we make sure all the other register accesses to the 
> designware i2c IP can't generate IRQ.
> 
> Meaning that all register accesses that can trigger an IRQ are enclosed 
> in between a call to i2c_dw_disable_int() and a call to 
> regmap_write(dev->map, DW_IC_INTR_MASK, DW_IC_INTR_MASTER_MASK); or 
> equivalent.
> 
> It seems to be the case, I'm not sure what's the best way to make sure 
> it will stay that way.
> 
> Moreover, maybe writes to IC_ENABLE register should also use the 
> non-relaxed writel() version?
> 
> Since one could do something like:
> 
> [ IP is currently disabled ]
> 
> 1/ enable interrupts in DW_IC_INTR_MASK
> 
> 2/ update some variable in dev-> structure in DDR
> 
> 3/ enable the device by writing to IC_ENABLE, thus triggering for 
> instance the TX_FIFO_EMPTY irq.
> 

It does seem like there are a variety of register write combinations 
that could immediately cause an interrupt, so would need a barrier.

I understand barriers a bit better now, although still wonder about some 
cases. Like say you write to some driver memory and then write the DW 
command fifo register, and it does not immediately cause an interrupt, 
but will in the future. Even without the barrier the memory write would 
typically become visible to other cores after some small amount of time, 
but don't see that's it's architecturally guaranteed to be visible. This 
implies the barrier is perhaps needed on many/all of the registers.

Jan
Jan Bottorff Sept. 25, 2023, 7:39 p.m. UTC | #23
On 9/25/2023 5:54 AM, Serge Semin wrote:
> On Wed, Sep 20, 2023 at 12:14:17PM -0700, Jan Bottorff wrote:
>> On 9/20/2023 6:27 AM, Yann Sionneau wrote:
>>> Hi,
>>>
>>> On 20/09/2023 11:08, Wolfram Sang wrote:
>>>>> same thread." [1] Thus I'd suggest the next fix for the problem:
>>>>>
>>>>> --- a/drivers/i2c/busses/i2c-designware-common.c
>>>>> +++ b/drivers/i2c/busses/i2c-designware-common.c
>>>>> @@ -72,7 +72,10 @@ static int dw_reg_write(void *context,
>>>>> unsigned int reg, unsigned int val)
>>>>>    {
>>>>>        struct dw_i2c_dev *dev = context;
>>>>> -    writel_relaxed(val, dev->base + reg);
>>>>> +    if (reg == DW_IC_INTR_MASK)
>>>>> +        writel(val, dev->base + reg);
>>>>> +    else
>>>>> +        writel_relaxed(val, dev->base + reg);
>>>>>        return 0;
>>>>>    }
>>>>>
>>>>> (and similar changes for dw_reg_write_swab() and dw_reg_write_word().)
>>>>>
>>>>> What do you think?
>>>> To me, this looks reasonable and much more what I would have expected as
>>>> a result (from a high level point of view). Let's hope it works. I am
>>>> optimistic, though...
>>>>
>>> It works if we make sure all the other register accesses to the
>>> designware i2c IP can't generate IRQ.
>>>
>>> Meaning that all register accesses that can trigger an IRQ are enclosed
>>> in between a call to i2c_dw_disable_int() and a call to
>>> regmap_write(dev->map, DW_IC_INTR_MASK, DW_IC_INTR_MASTER_MASK); or
>>> equivalent.
>>>
>>> It seems to be the case, I'm not sure what's the best way to make sure
>>> it will stay that way.
>>>
>>> Moreover, maybe writes to IC_ENABLE register should also use the
>>> non-relaxed writel() version?
>>>
>>> Since one could do something like:
>>>
>>> [ IP is currently disabled ]
>>>
>>> 1/ enable interrupts in DW_IC_INTR_MASK
>>>
>>> 2/ update some variable in dev-> structure in DDR
>>>
>>> 3/ enable the device by writing to IC_ENABLE, thus triggering for
>>> instance the TX_FIFO_EMPTY irq.
>>>
>>
>> It does seem like there are a variety of register write combinations that
>> could immediately cause an interrupt, so would need a barrier.
> 
> My suggestion was based on your fix. If it won't work or if it won't
> completely solve the problem, then perhaps one of the next option
> shall do it:
> 1. Add the non-relaxed IO call for the IC_ENABLE CSR too.
> 2. Completely convert the IO accessors to using the non-relaxed
> methods especially seeing Wolfram already noted: "Again, I am all with
> Catalin here. Safety first, optimizations a la *_relaxed should be
> opt-in."
> https://lore.kernel.org/linux-i2c/ZQm2Ydt%2F0jRW4crK@shikoro/
> 3. Find all the places where the memory writes need to be fully
> visible after a subsequent IO-write causing an IRQ raise and just
> place dma_wmb() there (though just wmb() would look a bit more
> relevant).
> 
> IMO in the worst case solution 2. must be enough at least in the
> master mode seeing the ISR uses the completion variable to indicate
> the cmd execution completion, which also implies the complete memory
> barrier. Moreover i2c bus isn't that performant for us to be that much
> concerned about the optimizations like the pipeline stalls in between
> the MMIO accesses.
> 

I did stress testing for a few days on our processor of the proposed fix 
that makes writes to DW_IC_INTR_MASK use writel instead of 
writel_relaxed in dw_reg_write. The problem we were seeing is fixed. On 
our system, the problem was occurring when many ssif (ipmi over i2c) 
transfers were done. The stress test was running "ipmitool sdr elist" in 
a loop. Without the change, multiple errors per day from the driver are 
seen in the kernel log.

I'm good with a change that just has that one change. Also applying 
non-relaxed to dw_reg_write_swab and dw_reg_write_word was also 
suggested for completeness.

Does anybody have concerns about other cases that may not get fixed by 
this change? We did have hypothetical cases, like with IC_ENABLE, that 
could have the same issue.

So my next question, is the change to dw_reg_write something that I 
should write and submit, or should someone else submit something more 
generalized, like option 2 above? I don't own the i2c driver, I'm just 
trying to fix one issue on one processor with minimal risk of breaking 
something. I don't have the broader view of what's optimal for the whole 
DesignWare i2c driver. I also don't have any way to test changes on 
other models of processors.
Jarkko Nikula Sept. 29, 2023, 8:48 a.m. UTC | #24
On 9/27/23 22:38, Wolfram Sang wrote:
> 
>> So my next question, is the change to dw_reg_write something that I should
>> write and submit, or should someone else submit something more generalized,
>> like option 2 above? I don't own the i2c driver, I'm just trying to fix one
>> issue on one processor with minimal risk of breaking something. I don't have
>> the broader view of what's optimal for the whole DesignWare i2c driver. I
>> also don't have any way to test changes on other models of processors.
> 
> Well, I guess this is a question for the designware maintainers: do we
> want this one conversion from *_relaxed to non-relaxed. Or are we
> playing safe by using non-relaxed all the time. I would suggest the
> latter because the drivers I look after hardly write registers in a hot
> path (and not many of them at a time). But you guys know your driver
> better...
> 
Well I don't have any preference (read enough knowledge) either here and 
I hardly think performance becomes issue in any configuration.

Not a showstopper to this fix nor necessarily need to cover either but 
one another memory barrier case might be in i2c-slave flows:

1. I2C bus read/write from another host
2. Interrupt to i2c-designware IP
    i2c-designware-slave.c: i2c_dw_isr_slave()
    i2c-core-slave.c: i2c_slave_event()
    -> irq handler goes to slave backend like i2c-slave-eeprom
    i2c-slave-eeprom.c: i2c_slave_eeprom_slave_cb()
3. Shared data between irq handler and process context
    struct eeprom_data is accessed both from irq handler via 
i2c_slave_eeprom_slave_cb() and process context via sysfs node handlers 
i2c_slave_eeprom_bin_read() and i2c_slave_eeprom_bin_write()
Wolfram Sang Oct. 26, 2023, 11:18 a.m. UTC | #25
On Fri, Sep 29, 2023 at 11:48:10AM +0300, Jarkko Nikula wrote:
> On 9/27/23 22:38, Wolfram Sang wrote:
> > 
> > > So my next question, is the change to dw_reg_write something that I should
> > > write and submit, or should someone else submit something more generalized,
> > > like option 2 above? I don't own the i2c driver, I'm just trying to fix one
> > > issue on one processor with minimal risk of breaking something. I don't have
> > > the broader view of what's optimal for the whole DesignWare i2c driver. I
> > > also don't have any way to test changes on other models of processors.
> > 
> > Well, I guess this is a question for the designware maintainers: do we
> > want this one conversion from *_relaxed to non-relaxed. Or are we
> > playing safe by using non-relaxed all the time. I would suggest the
> > latter because the drivers I look after hardly write registers in a hot
> > path (and not many of them at a time). But you guys know your driver
> > better...
> > 
> Well I don't have any preference (read enough knowledge) either here and I
> hardly think performance becomes issue in any configuration.

So, someone wants to come up with a patch to move to non-relaxed io
accessors?
Jan Bottorff Oct. 31, 2023, 12:12 a.m. UTC | #26
On 10/26/2023 4:18 AM, Wolfram Sang wrote:
> So, someone wants to come up with a patch to move to non-relaxed io
> accessors?
> 
Is the current thinking to just make writes to DW_IC_INTR_MASK use the 
non-relaxed variant or something more broad?

 From a safest functioning viewpoint, we talked about making all 
accessors default to non-relaxed variants. A couple of pretty good 
arguments from knowledgeable people favored this. I know there also was 
some concerns about potential performance impact this might have 
although the counter argument was this is a pretty low speed device so 
some extra cpu cycles on register accesses were not likely to degrade 
overall performance.

I could make the patch if we have consensus (or maintainers decision) on 
which way to go: 1) only writes to DW_IC_INTR_MASK are non-relaxed, 2) 
make all read/write accessors use the non-relaxed version.

I'm personally in camp #2, safety first, performance fine tuning later 
if needed. Latent missing barrier bugs are difficult and time consuming 
to find.

- Jan
Wolfram Sang Oct. 31, 2023, 5:51 a.m. UTC | #27
> I'm personally in camp #2, safety first, performance fine tuning later if
> needed. Latent missing barrier bugs are difficult and time consuming to
> find.

I agree.
Yann Sionneau Oct. 31, 2023, 8:44 a.m. UTC | #28
Le 31/10/2023 à 01:12, Jan Bottorff a écrit :
> On 10/26/2023 4:18 AM, Wolfram Sang wrote:
>> So, someone wants to come up with a patch to move to non-relaxed io
>> accessors?
>>
> Is the current thinking to just make writes to DW_IC_INTR_MASK use the 
> non-relaxed variant or something more broad?
>
> From a safest functioning viewpoint, we talked about making all 
> accessors default to non-relaxed variants. A couple of pretty good 
> arguments from knowledgeable people favored this. I know there also 
> was some concerns about potential performance impact this might have 
> although the counter argument was this is a pretty low speed device so 
> some extra cpu cycles on register accesses were not likely to degrade 
> overall performance.
>
> I could make the patch if we have consensus (or maintainers decision) 
> on which way to go: 1) only writes to DW_IC_INTR_MASK are non-relaxed, 
> 2) make all read/write accessors use the non-relaxed version.
>
> I'm personally in camp #2, safety first, performance fine tuning later 
> if needed. Latent missing barrier bugs are difficult and time 
> consuming to find.

Fine with me, let's go for #2 :)

Regards,
Jarkko Nikula Oct. 31, 2023, 12:10 p.m. UTC | #29
On 10/31/23 10:44, Yann Sionneau wrote:
> 
> Le 31/10/2023 à 01:12, Jan Bottorff a écrit :
>> On 10/26/2023 4:18 AM, Wolfram Sang wrote:
>>> So, someone wants to come up with a patch to move to non-relaxed io
>>> accessors?
>>>
>> Is the current thinking to just make writes to DW_IC_INTR_MASK use the 
>> non-relaxed variant or something more broad?
>>
>> From a safest functioning viewpoint, we talked about making all 
>> accessors default to non-relaxed variants. A couple of pretty good 
>> arguments from knowledgeable people favored this. I know there also 
>> was some concerns about potential performance impact this might have 
>> although the counter argument was this is a pretty low speed device so 
>> some extra cpu cycles on register accesses were not likely to degrade 
>> overall performance.
>>
>> I could make the patch if we have consensus (or maintainers decision) 
>> on which way to go: 1) only writes to DW_IC_INTR_MASK are non-relaxed, 
>> 2) make all read/write accessors use the non-relaxed version.
>>
>> I'm personally in camp #2, safety first, performance fine tuning later 
>> if needed. Latent missing barrier bugs are difficult and time 
>> consuming to find.
> 
> Fine with me, let's go for #2 :)
> 
Also simplicity votes for #2.
Jan Bottorff Nov. 1, 2023, 4:51 p.m. UTC | #30
On 10/31/2023 6:06 AM, Serge Semin wrote:
> On Tue, Oct 31, 2023 at 02:10:13PM +0200, Jarkko Nikula wrote:
>> On 10/31/23 10:44, Yann Sionneau wrote:
>>>
>>> Le 31/10/2023 à 01:12, Jan Bottorff a écrit :
>>>> On 10/26/2023 4:18 AM, Wolfram Sang wrote:
>>>>> So, someone wants to come up with a patch to move to non-relaxed io
>>>>> accessors?
>>>>>
...
>>>> I could make the patch if we have consensus (or maintainers
>>>> decision) on which way to go: 1) only writes to DW_IC_INTR_MASK are
>>>> non-relaxed, 2) make all read/write accessors use the non-relaxed
>>>> version.
>>>
>>> Fine with me, let's go for #2 :)
>>>
>> Also simplicity votes for #2.
> 
> +1 for the option #2. Let's do it and be finally over with this
> patch.)
> 
> -Serge(y)

I think we have agreement to make the dw i2c driver regmap accessors use 
the non-relaxed low-level functions, like writel instead of 
writel_relaxed. These low level functions have memory barriers on 
platforms that require them.

I'll work on a patch for this.

- Jan
diff mbox series

Patch

diff --git a/drivers/i2c/busses/i2c-designware-master.c b/drivers/i2c/busses/i2c-designware-master.c
index ca1035e010c7..1694ac6bb592 100644
--- a/drivers/i2c/busses/i2c-designware-master.c
+++ b/drivers/i2c/busses/i2c-designware-master.c
@@ -248,6 +248,14 @@  static void i2c_dw_xfer_init(struct dw_i2c_dev *dev)
 	/* Dummy read to avoid the register getting stuck on Bay Trail */
 	regmap_read(dev->map, DW_IC_ENABLE_STATUS, &dummy);
 
+	/*
+	 * To guarantee data written by the current core is visible to
+	 * all cores, a write barrier is required. This needs to be
+	 * before an interrupt causes execution on another core.
+	 * For ARM processors, this needs to be a DSB barrier.
+	 */
+	wmb();
+
 	/* Clear and enable interrupts */
 	regmap_read(dev->map, DW_IC_CLR_INTR, &dummy);
 	regmap_write(dev->map, DW_IC_INTR_MASK, DW_IC_INTR_MASTER_MASK);