Message ID | 1453918924-27606-1-git-send-email-will.deacon@arm.com |
---|---|
State | New |
Headers | show |
On Wed, Jan 27, 2016 at 07:25:51PM +0100, Peter Zijlstra wrote: > On Wed, Jan 27, 2016 at 06:22:04PM +0000, Will Deacon wrote: > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > definitions with similar guarantees, the grim reality is that this isn't > > even possible on x86 (thanks to Paul for bringing us crashing down to > > Earth). > > > > This patch handles the issue by introducing a new barrier macro, > > smp_mb__after_release_acquire, that can be placed after an ACQUIRE that > > either reads from a RELEASE or is in program-order after a RELEASE. The > > barrier upgrades the RELEASE-ACQUIRE pair to a full memory barrier, > > implying global transitivity. At the moment, it doesn't have any users, > > so its existence serves mainly as a documentation aid and a potential > > stepping stone to the reintroduction of smp_mb__after_unlock_lock() used > > by RCU. > > > > Documentation/memory-barriers.txt is updated to describe more clearly > > the ACQUIRE and RELEASE ordering in this area and to show some examples > > of the new barrier in action. > > So the obvious question is: do we have a use-case? We have a use-case for smp_mb__after_unlock_lock, so I think we should either strengthen our locking guarantees so that smp_mb__after_unlock_lock isn't needed or introduce smp_mb__after_release_acquire to close the gap. As it stands, we've got an inconsistency (despite it being hidden inside RCU). The main advantage of this patch is a documentation aid, in my opinion (hell, we talk about smp_mb__after_unlock_lock already when reasoning about this stuff). Will
On Thu, Jan 28, 2016 at 10:46:23AM +0800, Boqun Feng wrote: > On Wed, Jan 27, 2016 at 06:22:04PM +0000, Will Deacon wrote: > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > definitions with similar guarantees, the grim reality is that this isn't > > even possible on x86 (thanks to Paul for bringing us crashing down to > > Earth). > > > > This patch handles the issue by introducing a new barrier macro, > > smp_mb__after_release_acquire, that can be placed after an ACQUIRE that > > either reads from a RELEASE or is in program-order after a RELEASE. The > > barrier upgrades the RELEASE-ACQUIRE pair to a full memory barrier, > > implying global transitivity. At the moment, it doesn't have any users, > > so its existence serves mainly as a documentation aid and a potential > > stepping stone to the reintroduction of smp_mb__after_unlock_lock() used > > by RCU. > > > > Documentation/memory-barriers.txt is updated to describe more clearly > > the ACQUIRE and RELEASE ordering in this area and to show some examples > > of the new barrier in action. > > > > Maybe also add an entry in "CPU MEMORY BARRIERS" section of > memory-barriers.txt? Something like (copy and paste from you commit log > ;-)): > > (*) smp_mb__after_release_acquire(); > > Placed after an ACQUIRE that either reads from a RELEASE or is in > program-order after a RELEASE. The barrier upgrades the > RELEASE-ACQUIRE pair to a full memory barrier, implying global > transitivity. > > This could give the readers an overview of the usage of this barrier. Thanks, I'll add that. > > diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h > > index 5c8db3ce61c8..ee31da604b11 100644 > > --- a/arch/s390/include/asm/barrier.h > > +++ b/arch/s390/include/asm/barrier.h > > @@ -45,6 +45,7 @@ do { \ > > ___p1; \ > > }) > > > > +#define __smp_mb__release_acquire() __smp_mb() > > Should be __smp_mb__after_release_acquire(), so is the title of this > patch ;-) Well spotted. That's a hangover from v2, which I'll fix. Cheers, Will
On Wed, Jan 27, 2016 at 03:00:11PM -0800, Paul E. McKenney wrote: > On Wed, Jan 27, 2016 at 06:39:23PM +0000, Will Deacon wrote: > > On Wed, Jan 27, 2016 at 07:25:51PM +0100, Peter Zijlstra wrote: > > > On Wed, Jan 27, 2016 at 06:22:04PM +0000, Will Deacon wrote: > > > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > > > definitions with similar guarantees, the grim reality is that this isn't > > > > even possible on x86 (thanks to Paul for bringing us crashing down to > > > > Earth). > > > > > > > > This patch handles the issue by introducing a new barrier macro, > > > > smp_mb__after_release_acquire, that can be placed after an ACQUIRE that > > > > either reads from a RELEASE or is in program-order after a RELEASE. The > > > > barrier upgrades the RELEASE-ACQUIRE pair to a full memory barrier, > > > > implying global transitivity. At the moment, it doesn't have any users, > > > > so its existence serves mainly as a documentation aid and a potential > > > > stepping stone to the reintroduction of smp_mb__after_unlock_lock() used > > > > by RCU. > > > > > > > > Documentation/memory-barriers.txt is updated to describe more clearly > > > > the ACQUIRE and RELEASE ordering in this area and to show some examples > > > > of the new barrier in action. > > > > > > So the obvious question is: do we have a use-case? > > > > We have a use-case for smp_mb__after_unlock_lock, so I think we should > > either strengthen our locking guarantees so that smp_mb__after_unlock_lock > > isn't needed or introduce smp_mb__after_release_acquire to close the gap. > > As it stands, we've got an inconsistency (despite it being hidden inside > > RCU). > > > > The main advantage of this patch is a documentation aid, in my opinion > > (hell, we talk about smp_mb__after_unlock_lock already when reasoning > > about this stuff). > > But wasn't there an x86 potential use case that required placing the > strengthening macro after the release and before the acquire? Or is > this a case of old age striking again? The proposal here doesn't order the release/acquire operations with each other -- it just says that they combine with smp_mb__after_release_acquire() to make a full barrier, so I don't think it should matter for the intra-thread case, which is the one that x86 cares out iiuc. Will
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 2dc4ba7c8d4d..62ce096b88fa 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -454,16 +454,22 @@ And a couple of implicit varieties: The use of ACQUIRE and RELEASE operations generally precludes the need for other sorts of memory barrier (but note the exceptions mentioned in the subsection "MMIO write barrier"). In addition, a RELEASE+ACQUIRE - pair is -not- guaranteed to act as a full memory barrier. However, after + pair is -not- guaranteed to act as a full memory barrier without an + explicit smp_mb__after_release_acquire() barrier. However, after an ACQUIRE on a given variable, all memory accesses preceding any prior RELEASE on that same variable are guaranteed to be visible. In other - words, within a given variable's critical section, all accesses of all - previous critical sections for that variable are guaranteed to have - completed. + words, for a CPU executing within a given variable's critical section, + all accesses of all previous critical sections for that variable are + guaranteed to be visible to that CPU. This means that ACQUIRE acts as a minimal "acquire" operation and RELEASE acts as a minimal "release" operation. +A subset of the atomic operations described in atomic_ops.txt have ACQUIRE +and RELEASE variants in addition to fully-ordered and relaxed (no barrier +semantics) definitions. For compound atomics performing both a load and +a store, ACQUIRE semantics apply only to the load and RELEASE semantics +apply only to the store portion of the operation. Memory barriers are only required where there's a possibility of interaction between two CPUs or between a CPU and a device. If it can be guaranteed that @@ -1357,7 +1363,7 @@ However, the transitivity of release-acquire is local to the participating CPUs and does not apply to cpu3(). Therefore, the following outcome is possible: - r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 + r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1 Although cpu0(), cpu1(), and cpu2() will see their respective reads and writes in order, CPUs not involved in the release-acquire chain might @@ -1369,10 +1375,27 @@ store to u as happening -after- cpu1()'s load from v, even though both cpu0() and cpu1() agree that these two operations occurred in the intended order. +This can be forbidden by upgrading the release-acquire relationship +between cpu0() and cpu1() to a full barrier using +smp_mb__after_release_acquire() to enforce global transitivity: + + void cpu1(void) + { + r1 = smp_load_acquire(&y); + smp_mb__after_release_acquire(); + r4 = READ_ONCE(v); + r5 = READ_ONCE(u); + smp_store_release(&z, 1); + } + +With this addition, the previous result is forbidden and, as long as +r1 == 1, all CPUs must agree that cpu0()'s store to u happened before +cpu1()'s read from v. + However, please keep in mind that smp_load_acquire() is not magic. In particular, it simply reads from its argument with ordering. It does -not- ensure that any particular value will be read. Therefore, the -following outcome is possible: +following outcome is possible irrespective of any additional barriers: r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0 @@ -1971,6 +1994,34 @@ the RELEASE would simply complete, thereby avoiding the deadlock. a sleep-unlock race, but the locking primitive needs to resolve such races properly in any case. +Where the RELEASE and ACQUIRE operations are performed by the same CPU +to different addresses, ordering can be enforced by an +smp_mb__after_release_acquire() barrier: + + *A = a; + RELEASE M + ACQUIRE N + smp_mb__after_release_acquire(); + *B = b; + +in which case, the only permitted orderings are: + + STORE *A, RELEASE M, ACQUIRE N, STORE *B + STORE *A, ACQUIRE N, RELEASE M, STORE *B + +Similarly, smp_mb__after_release_acquire() enforces full, +globally-transitive ordering in the case that an ACQUIRE operation on +one CPU reads from a RELEASE operation on another: + + CPU 1 CPU 2 CPU 3 + ======================= ======================= =============== + { X = 0, Y = 0, M = 0 } + STORE X=1 ACQUIRE M==1 STORE Y=1 + RELEASE M=1 smp_mb__after_release_acquire() smp_mb(); + LOAD Y==0 LOAD X=0 + +This outcome is forbidden (i.e. CPU 3 must read X == 1). + Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so cannot be counted on in such a situation to actually achieve anything at all - especially with respect to I/O accesses - unless combined diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h index 588f1614cafc..ed803c84bd19 100644 --- a/arch/ia64/include/asm/barrier.h +++ b/arch/ia64/include/asm/barrier.h @@ -67,6 +67,8 @@ do { \ ___p1; \ }) +#define __smp_mb__after_release_acquire() __smp_mb() + /* * The group barrier in front of the rsm & ssm are necessary to ensure * that none of the previous instructions in the same group are diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index c0deafc212b8..b12860b8c7e4 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -74,7 +74,8 @@ do { \ ___p1; \ }) -#define smp_mb__before_spinlock() smp_mb() +#define smp_mb__after_release_acquire() __smp_mb() +#define smp_mb__before_spinlock() smp_mb() #include <asm-generic/barrier.h> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h index 5c8db3ce61c8..ee31da604b11 100644 --- a/arch/s390/include/asm/barrier.h +++ b/arch/s390/include/asm/barrier.h @@ -45,6 +45,7 @@ do { \ ___p1; \ }) +#define __smp_mb__release_acquire() __smp_mb() #define __smp_mb__before_atomic() barrier() #define __smp_mb__after_atomic() barrier() diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h index c9f6ee64f41d..68c9e931a933 100644 --- a/arch/sparc/include/asm/barrier_64.h +++ b/arch/sparc/include/asm/barrier_64.h @@ -52,8 +52,9 @@ do { \ ___p1; \ }) -#define __smp_mb__before_atomic() barrier() -#define __smp_mb__after_atomic() barrier() +#define __smp_mb__after_release_acquire() __smp_mb() +#define __smp_mb__before_atomic() barrier() +#define __smp_mb__after_atomic() barrier() #include <asm-generic/barrier.h> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index a584e1c50918..b24dcb7e806a 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -77,6 +77,8 @@ do { \ #endif +#define __smp_mb__after_release_acquire() __smp_mb() + /* Atomic operations are already serializing on x86 */ #define __smp_mb__before_atomic() barrier() #define __smp_mb__after_atomic() barrier() diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h index 1cceca146905..895f5993d341 100644 --- a/include/asm-generic/barrier.h +++ b/include/asm-generic/barrier.h @@ -139,6 +139,10 @@ do { \ }) #endif +#ifndef __smp_mb__after_release_acquire +#define __smp_mb__after_release_acquire() do { } while (0) +#endif + #ifdef CONFIG_SMP #ifndef smp_store_mb @@ -161,6 +165,10 @@ do { \ #define smp_load_acquire(p) __smp_load_acquire(p) #endif +#ifndef smp_mb__after_release_acquire +#define smp_mb__after_release_acquire() __smp_mb__after_release_acquire() +#endif + #else /* !CONFIG_SMP */ #ifndef smp_store_mb @@ -194,6 +202,10 @@ do { \ }) #endif +#ifndef smp_mb__after_release_acquire +#define smp_mb__after_release_acquire() do { } while (0) +#endif + #endif /* Barriers for virtual machine guests when talking to an SMP host */ @@ -206,6 +218,7 @@ do { \ #define virt_mb__after_atomic() __smp_mb__after_atomic() #define virt_store_release(p, v) __smp_store_release(p, v) #define virt_load_acquire(p) __smp_load_acquire(p) +#define virt_mb__after_release_acquire() __smp_mb__after_release_acquire() #endif /* !__ASSEMBLY__ */ #endif /* __ASM_GENERIC_BARRIER_H */
As much as we'd like to live in a world where RELEASE -> ACQUIRE is always cheaply ordered and can be used to construct UNLOCK -> LOCK definitions with similar guarantees, the grim reality is that this isn't even possible on x86 (thanks to Paul for bringing us crashing down to Earth). This patch handles the issue by introducing a new barrier macro, smp_mb__after_release_acquire, that can be placed after an ACQUIRE that either reads from a RELEASE or is in program-order after a RELEASE. The barrier upgrades the RELEASE-ACQUIRE pair to a full memory barrier, implying global transitivity. At the moment, it doesn't have any users, so its existence serves mainly as a documentation aid and a potential stepping stone to the reintroduction of smp_mb__after_unlock_lock() used by RCU. Documentation/memory-barriers.txt is updated to describe more clearly the ACQUIRE and RELEASE ordering in this area and to show some examples of the new barrier in action. Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> --- Based on Paul's patch to describe local vs global transitivity: http://lkml.kernel.org/r/20160115173912.GU3818@linux.vnet.ibm.com Documentation/memory-barriers.txt | 63 +++++++++++++++++++++++++++++++++---- arch/ia64/include/asm/barrier.h | 2 ++ arch/powerpc/include/asm/barrier.h | 3 +- arch/s390/include/asm/barrier.h | 1 + arch/sparc/include/asm/barrier_64.h | 5 +-- arch/x86/include/asm/barrier.h | 2 ++ include/asm-generic/barrier.h | 13 ++++++++ 7 files changed, 80 insertions(+), 9 deletions(-) -- 2.1.4