Message ID: 1444215568-24732-1-git-send-email-will.deacon@arm.com
State: New
On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote:
> As much as we'd like to live in a world where RELEASE -> ACQUIRE is
> always cheaply ordered and can be used to construct UNLOCK -> LOCK
> definitions with similar guarantees, the grim reality is that this isn't
> even possible on x86 (thanks to Paul for bringing us crashing down to
> Earth).
>
> This patch handles the issue by introducing a new barrier macro,
> smp_mb__release_acquire, that can be placed between a RELEASE and a
> subsequent ACQUIRE operation in order to upgrade them to a full memory
> barrier. At the moment, it doesn't have any users, so its existence
> serves mainly as a documentation aid.

Do we want to go revert 12d560f4ea87 ("rcu,locking: Privatize
smp_mb__after_unlock_lock()") for that same reason?

> Documentation/memory-barriers.txt is updated to describe more clearly
> the ACQUIRE and RELEASE ordering in this area and to show an example of
> the new barrier in action.

The only nit I have is that if we revert the above it might make sense
to more clearly call out the distinction between the two.
Hi Peter, Thanks for the headache ;) On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote: > On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote: > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > definitions with similar guarantees, the grim reality is that this isn't > > even possible on x86 (thanks to Paul for bringing us crashing down to > > Earth). > > > > This patch handles the issue by introducing a new barrier macro, > > smp_mb__release_acquire, that can be placed between a RELEASE and a > > subsequent ACQUIRE operation in order to upgrade them to a full memory > > barrier. At the moment, it doesn't have any users, so its existence > > serves mainly as a documentation aid. > > Does we want to go revert 12d560f4ea87 ("rcu,locking: Privatize > smp_mb__after_unlock_lock()") for that same reason? I don't think we want a straight revert. smp_mb__after_unlock_lock could largely die if PPC strengthened its locks, whereas smp_mb__release_acquire is needed by quite a few architectures. > > Documentation/memory-barriers.txt is updated to describe more clearly > > the ACQUIRE and RELEASE ordering in this area and to show an example of > > the new barrier in action. > > The only nit I have is that if we revert the above it might be make > sense to more clearly call out the distinction between the two. Right. Where I think we'd like to get to is: - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same variable and the ACQUIRE reads from the RELEASE - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same CPU and are interleaved with an smp_mb__release_acquire barrier. - RELEASE -> ACQUIRE ordering is transitive [only the transitivity part is missing in this patch, because I lost track of that discussion] We could then use these same guarantees for UNLOCK -> LOCK in RCU, defining smp_mb__after_unlock_lock to be the same as smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a slight relaxation of how it's defined at the moment (and I guess would need some work on PPC?), but it keeps things consistent which is especially important as core locking primitives are ported over to the ACQUIRE/RELEASE primitives. Thoughts? Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
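[Editorial note: a minimal sketch of the same-CPU case the proposed macro targets.
smp_store_release()/smp_load_acquire() are existing kernel primitives;
smp_mb__release_acquire() is only the macro proposed by the patch, and the
variables are made up for illustration.]

	/* CPU 0 */
	WRITE_ONCE(x, 1);
	smp_store_release(&flag, 1);		/* RELEASE */
	smp_mb__release_acquire();		/* proposed upgrade to a full barrier */
	r1 = smp_load_acquire(&other_flag);	/* ACQUIRE */

Without the extra barrier, the store to flag and the later load from other_flag
may be reordered on TSO architectures (and on PPC), which is the gap the patch
is documenting.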
On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote: > Hi Peter, > > Thanks for the headache ;) Most welcome :-) > > Does we want to go revert 12d560f4ea87 ("rcu,locking: Privatize > > smp_mb__after_unlock_lock()") for that same reason? > > I don't think we want a straight revert. smp_mb__after_unlock_lock could > largely die if PPC strengthened its locks, whereas smp_mb__release_acquire > is needed by quite a few architectures. Fair enough, lets wait for the benchmark results from the PPC people doing that. > > > Documentation/memory-barriers.txt is updated to describe more clearly > > > the ACQUIRE and RELEASE ordering in this area and to show an example of > > > the new barrier in action. > > > > The only nit I have is that if we revert the above it might be make > > sense to more clearly call out the distinction between the two. > > Right. Where I think we'd like to get to is: > > - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same > variable and the ACQUIRE reads from the RELEASE > > - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same > CPU and are interleaved with an smp_mb__release_acquire barrier. > > - RELEASE -> ACQUIRE ordering is transitive > > [only the transitivity part is missing in this patch, because I lost > track of that discussion] > > We could then use these same guarantees for UNLOCK -> LOCK in RCU, > defining smp_mb__after_unlock_lock to be the same as > smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a > slight relaxation of how it's defined at the moment (and I guess would > need some work on PPC?), but it keeps things consistent which is > especially important as core locking primitives are ported over to the > ACQUIRE/RELEASE primitives. > > Thoughts? /me like, although I'm too tired to see how those 3 rules combine to something weaker than the current after_unlock_lock thing for PPC. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote: > Hi Peter, > > Thanks for the headache ;) > > On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote: > > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > > definitions with similar guarantees, the grim reality is that this isn't > > > even possible on x86 (thanks to Paul for bringing us crashing down to > > > Earth). > > > > > > This patch handles the issue by introducing a new barrier macro, > > > smp_mb__release_acquire, that can be placed between a RELEASE and a > > > subsequent ACQUIRE operation in order to upgrade them to a full memory > > > barrier. At the moment, it doesn't have any users, so its existence > > > serves mainly as a documentation aid. > > > > Does we want to go revert 12d560f4ea87 ("rcu,locking: Privatize > > smp_mb__after_unlock_lock()") for that same reason? > > I don't think we want a straight revert. smp_mb__after_unlock_lock could > largely die if PPC strengthened its locks, whereas smp_mb__release_acquire > is needed by quite a few architectures. > > > > Documentation/memory-barriers.txt is updated to describe more clearly > > > the ACQUIRE and RELEASE ordering in this area and to show an example of > > > the new barrier in action. > > > > The only nit I have is that if we revert the above it might be make > > sense to more clearly call out the distinction between the two. > > Right. Where I think we'd like to get to is: > > - RELEASE -> ACQUIRE acts as a full barrier if they operate on the same > variable and the ACQUIRE reads from the RELEASE > > - RELEASE -> ACQUIRE acts as a full barrier if they execute on the same > CPU and are interleaved with an smp_mb__release_acquire barrier. > > - RELEASE -> ACQUIRE ordering is transitive I believe that these three are good. > [only the transitivity part is missing in this patch, because I lost > track of that discussion] > > We could then use these same guarantees for UNLOCK -> LOCK in RCU, > defining smp_mb__after_unlock_lock to be the same as > smp_mb__release_acquire, but only applying to UNLOCK -> LOCK. That's a > slight relaxation of how it's defined at the moment (and I guess would > need some work on PPC?), but it keeps things consistent which is > especially important as core locking primitives are ported over to the > ACQUIRE/RELEASE primitives. Currently, we do need smp_mb__after_unlock_lock() to be after the acquisition on PPC -- putting it between the unlock and the lock of course doesn't cut it for the cross-thread unlock/lock case. I am with Peter -- we do need the benchmark results for PPC. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > On Wed, Oct 07, 2015 at 02:23:17PM +0100, Will Deacon wrote: > > Hi Peter, > > > > Thanks for the headache ;) > > > > On Wed, Oct 07, 2015 at 01:19:15PM +0200, Peter Zijlstra wrote: > > > On Wed, Oct 07, 2015 at 11:59:28AM +0100, Will Deacon wrote: > > > > As much as we'd like to live in a world where RELEASE -> ACQUIRE is > > > > always cheaply ordered and can be used to construct UNLOCK -> LOCK > > > > definitions with similar guarantees, the grim reality is that this isn't > > > > even possible on x86 (thanks to Paul for bringing us crashing down to > > > > Earth). > > > > > > > > This patch handles the issue by introducing a new barrier macro, > > > > smp_mb__release_acquire, that can be placed between a RELEASE and a > > > > subsequent ACQUIRE operation in order to upgrade them to a full memory > > > > barrier. At the moment, it doesn't have any users, so its existence > > > > serves mainly as a documentation aid. > > > > > > Does we want to go revert 12d560f4ea87 ("rcu,locking: Privatize > > > smp_mb__after_unlock_lock()") for that same reason? > > > > I don't think we want a straight revert. smp_mb__after_unlock_lock could > > largely die if PPC strengthened its locks, whereas smp_mb__release_acquire > > is needed by quite a few architectures. > > Currently, we do need smp_mb__after_unlock_lock() to be after the > acquisition on PPC -- putting it between the unlock and the lock > of course doesn't cut it for the cross-thread unlock/lock case. > > I am with Peter -- we do need the benchmark results for PPC. Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not plentiful at the moment. If we do a straight lwsync -> sync conversion for unlock it looks like that will cost us ~4.2% on Anton's standard context switch benchmark. So that's not all that nice. But we also don't want to be the only arch that has the weird lock semantics and has to find all the bugs. cheers -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > acquisition on PPC -- putting it between the unlock and the lock
> > of course doesn't cut it for the cross-thread unlock/lock case.

This ^, that makes me think I don't understand
smp_mb__after_unlock_lock.

How is:

	UNLOCK x
	smp_mb__after_unlock_lock()
	LOCK y

a problem? That's still a full barrier.

> > I am with Peter -- we do need the benchmark results for PPC.
>
> Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> plentiful at the moment.
>
> If we do a straight lwsync -> sync conversion for unlock it looks like that
> will cost us ~4.2% on Anton's standard context switch benchmark.

And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
usage and would not be sufficient for the same (as of yet unexplained)
reason.

Why does it matter which of the LOCK or UNLOCK gets promoted to full
barrier on PPC in order to become RCsc?
On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote: > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote: > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the > > > acquisition on PPC -- putting it between the unlock and the lock > > > of course doesn't cut it for the cross-thread unlock/lock case. > > This ^, that makes me think I don't understand > smp_mb__after_unlock_lock. > > How is: > > UNLOCK x > smp_mb__after_unlock_lock() > LOCK y > > a problem? That's still a full barrier. I thought Paul was talking about something like this case: CPU A CPU B CPU C foo = 1 UNLOCK x LOCK x (RELEASE) bar = 1 ACQUIRE bar = 1 READ_ONCE foo = 0 but this looks the same as ISA2+lwsyncs/ISA2+lwsync+ctrlisync+lwsync, which are both forbidden on PPC, so now I'm also confused. The different-lock, same thread case is more straight-forward, I think. > > > I am with Peter -- we do need the benchmark results for PPC. > > > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not > > plentiful at the moment. > > > > If we do a straight lwsync -> sync conversion for unlock it looks like that > > will cost us ~4.2% on Anton's standard context switch benchmark. Thanks Michael! > And that does not seem to agree with Paul's smp_mb__after_unlock_lock() > usage and would not be sufficient for the same (as of yet unexplained) > reason. > > Why does it matter which of the LOCK or UNLOCK gets promoted to full > barrier on PPC in order to become RCsc? I think we need a PPC litmus test illustrating the inter-thread, same lock failure case when smp_mb__after_unlock_lock is not present so that we can reason about this properly. Paul? Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote: > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote: > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the > > > acquisition on PPC -- putting it between the unlock and the lock > > > of course doesn't cut it for the cross-thread unlock/lock case. > > This ^, that makes me think I don't understand > smp_mb__after_unlock_lock. > > How is: > > UNLOCK x > smp_mb__after_unlock_lock() > LOCK y > > a problem? That's still a full barrier. The problem is that I need smp_mb__after_unlock_lock() to give me transitivity even if the UNLOCK happened on one CPU and the LOCK on another. For that to work, the smp_mb__after_unlock_lock() needs to be either immediately after the acquire (the current choice) or immediately before the release (which would also work from a purely technical viewpoint, but I much prefer the current choice). Or am I missing your point? > > > I am with Peter -- we do need the benchmark results for PPC. > > > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not > > plentiful at the moment. > > > > If we do a straight lwsync -> sync conversion for unlock it looks like that > > will cost us ~4.2% on Anton's standard context switch benchmark. > > And that does not seem to agree with Paul's smp_mb__after_unlock_lock() > usage and would not be sufficient for the same (as of yet unexplained) > reason. > > Why does it matter which of the LOCK or UNLOCK gets promoted to full > barrier on PPC in order to become RCsc? You could do either. However, as I understand it, there is hardware for which bc;isync is faster than lwsync. For such hardware, it is cheaper to upgrade the unlock from lwsync to sync than to upgrade the lock from bc;isync to sync. If I recall correctly, the kernel rewrites itself at boot to select whichever of lwsync or bc;isync is better for the hardware at hand. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
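[Editorial note: a sketch of the placement Paul describes, modelled on how RCU
currently uses the barrier, i.e. immediately after the lock acquisition; the
lock and variable names are illustrative.]

	/* CPU A */
	WRITE_ONCE(foo, 1);
	spin_unlock(&x);

	/* CPU B -- the next acquirer of x */
	spin_lock(&x);
	smp_mb__after_unlock_lock();	/* after the acquire, per the current choice */
	r1 = READ_ONCE(bar);

With this placement the unlock+lock pair behaves as a full barrier even though
the unlock ran on CPU A and the lock on CPU B, which is what the cross-thread
case needs.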
On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote:
> On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote:
> > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote:
> > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote:
> > > >
> > > > Currently, we do need smp_mb__after_unlock_lock() to be after the
> > > > acquisition on PPC -- putting it between the unlock and the lock
> > > > of course doesn't cut it for the cross-thread unlock/lock case.
> >
> > This ^, that makes me think I don't understand
> > smp_mb__after_unlock_lock.
> >
> > How is:
> >
> >	UNLOCK x
> >	smp_mb__after_unlock_lock()
> >	LOCK y
> >
> > a problem? That's still a full barrier.
>
> I thought Paul was talking about something like this case:
>
>   CPU A                 CPU B                 CPU C
>   foo = 1
>   UNLOCK x
>                         LOCK x
>                         (RELEASE) bar = 1
>                                               ACQUIRE bar = 1
>                                               READ_ONCE foo = 0

More like this:

	CPU A                 CPU B                 CPU C
	WRITE_ONCE(foo, 1);
	UNLOCK x
	                      LOCK x
	                      r1 = READ_ONCE(bar);
	                                            WRITE_ONCE(bar, 1);
	                                            smp_mb();
	                                            r2 = READ_ONCE(foo);

This can result in r1==0 && r2==0.

> but this looks the same as ISA2+lwsyncs/ISA2+lwsync+ctrlisync+lwsync,
> which are both forbidden on PPC, so now I'm also confused.
>
> The different-lock, same thread case is more straight-forward, I think.

Indeed it is:

	CPU A                 CPU B
	WRITE_ONCE(foo, 1);
	UNLOCK x
	LOCK x
	r1 = READ_ONCE(bar);
	                      WRITE_ONCE(bar, 1);
	                      smp_mb();
	                      r2 = READ_ONCE(foo);

This also can result in r1==0 && r2==0.

> > > > I am with Peter -- we do need the benchmark results for PPC.
> > >
> > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not
> > > plentiful at the moment.
> > >
> > > If we do a straight lwsync -> sync conversion for unlock it looks like that
> > > will cost us ~4.2% on Anton's standard context switch benchmark.
>
> Thanks Michael!
>
> > And that does not seem to agree with Paul's smp_mb__after_unlock_lock()
> > usage and would not be sufficient for the same (as of yet unexplained)
> > reason.
> >
> > Why does it matter which of the LOCK or UNLOCK gets promoted to full
> > barrier on PPC in order to become RCsc?
>
> I think we need a PPC litmus test illustrating the inter-thread, same
> lock failure case when smp_mb__after_unlock_lock is not present so that
> we can reason about this properly. Paul?

Please see above.  ;-)

The corresponding litmus tests are below.

							Thanx, Paul

------------------------------------------------------------------------

PPC lock-2thread-WR-barrier.litmus
""
(*
 * Does 3.0 Linux-kernel Power lock-unlock provide local
 * barrier that orders prior stores against subsequent loads,
 * if the unlock and lock happen on different threads?
 * This version uses lwsync instead of isync.
 *)
(* 23-July-2013: ppcmem says "Sometimes" *)
{
l=1;
0:r1=1; 0:r4=x;          0:r10=0; 0:r12=l;
1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l;
2:r1=1;          2:r4=x; 2:r5=y;
}
 P0                | P1                 | P2            ;
 stw r1,0(r4)      | lwarx  r11,r10,r12 | stw r1,0(r5)  ;
 lwsync            | cmpwi  r11,0       | lwsync        ;
 stw r10,0(r12)    | bne    Fail1       | lwz r7,0(r4)  ;
                   | stwcx. r1,r10,r12  |               ;
                   | bne    Fail1       |               ;
                   | isync              |               ;
                   | lwz    r3,0(r5)    |               ;
                   | Fail1:             |               ;

exists
(1:r3=0 /\ 2:r7=0)

------------------------------------------------------------------------

PPC lock-1thread-WR-barrier.litmus
""
(*
 * Does 3.0 Linux-kernel Power lock-unlock provide local
 * barrier that orders prior stores against subsequent loads,
 * if the unlock and lock happen in the same thread?
 * This version uses lwsync instead of isync.
 *)
(* 8-Oct-2015: ppcmem says "Sometimes" *)
{
l=1;
0:r1=1; 0:r3=42; 0:r4=x; 0:r5=y; 0:r10=0; 0:r11=0; 0:r12=l;
1:r1=1; 1:r4=x; 1:r5=y;
}
 P0                 | P1            ;
 stw r1,0(r4)       | stw r1,0(r5)  ;
 lwsync             | lwsync        ;
 stw r10,0(r12)     | lwz r7,0(r4)  ;
 lwarx  r11,r10,r12 |               ;
 cmpwi  r11,0       |               ;
 bne    Fail1       |               ;
 stwcx. r1,r10,r12  |               ;
 bne    Fail1       |               ;
 isync              |               ;
 lwz    r3,0(r5)    |               ;
 Fail1:             |               ;

exists
(0:r3=0 /\ 1:r7=0)
On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote: > > > > I am with Peter -- we do need the benchmark results for PPC. > > > > > > Urgh, sorry guys. I have been slowly doing some benchmarks, but time is not > > > plentiful at the moment. > > > > > > If we do a straight lwsync -> sync conversion for unlock it looks like that > > > will cost us ~4.2% on Anton's standard context switch benchmark. > > > > And that does not seem to agree with Paul's smp_mb__after_unlock_lock() > > usage and would not be sufficient for the same (as of yet unexplained) > > reason. > > > > Why does it matter which of the LOCK or UNLOCK gets promoted to full > > barrier on PPC in order to become RCsc? > > You could do either. However, as I understand it, there is hardware for > which bc;isync is faster than lwsync. For such hardware, it is cheaper > to upgrade the unlock from lwsync to sync than to upgrade the lock from > bc;isync to sync. If I recall correctly, the kernel rewrites itself at > boot to select whichever of lwsync or bc;isync is better for the hardware > at hand. Fair enough. I'll go wake up and think about the other issue ;-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote: > On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote: > > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote: > > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > > > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the > > > > acquisition on PPC -- putting it between the unlock and the lock > > > > of course doesn't cut it for the cross-thread unlock/lock case. > > > > This ^, that makes me think I don't understand > > smp_mb__after_unlock_lock. > > > > How is: > > > > UNLOCK x > > smp_mb__after_unlock_lock() > > LOCK y > > > > a problem? That's still a full barrier. > > The problem is that I need smp_mb__after_unlock_lock() to give me > transitivity even if the UNLOCK happened on one CPU and the LOCK > on another. For that to work, the smp_mb__after_unlock_lock() needs > to be either immediately after the acquire (the current choice) or > immediately before the release (which would also work from a purely > technical viewpoint, but I much prefer the current choice). > > Or am I missing your point? So lots of little confusions added up to complete fail :-{ Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but again not against uninvolved CPUs). Which leads me to think I would like to suggest alternative rules for RELEASE/ACQUIRE (to replace those Will suggested; as I think those are partly responsible for my confusion). - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when they operate on the same variable and the ACQUIRE reads from the RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity. - RELEASE -> ACQUIRE can be upgraded to a full barrier (including transitivity) using smp_mb__release_acquire(), either before RELEASE or after ACQUIRE (but consistently [*]). - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality, (because each link is fully ordered) but are not transitive. And I think that in the past few weeks we've been using transitive ambiguously, the definition we have in Documentation/memory-barriers.txt is a _strong_ transitivity, where we can make guarantees about CPUs not directly involved. What we have here (due to RCpc) is a weak form of transitivity, which, while it preserves the natural concept of causality, does not extend to other CPUs. So we could go around and call them 'strong' and 'weak' transitivity, but I suspect its easier for everyone involved if we come up with separate terms (less room for error if we accidentally omit the 'strong/weak' qualifier). [*] Do we want to take that choice away and go for: smp_mb__after_release_acquire() ? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote: > On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote: > > On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote: > > > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote: > > > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > > > > > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the > > > > > acquisition on PPC -- putting it between the unlock and the lock > > > > > of course doesn't cut it for the cross-thread unlock/lock case. > > > > > > This ^, that makes me think I don't understand > > > smp_mb__after_unlock_lock. > > > > > > How is: > > > > > > UNLOCK x > > > smp_mb__after_unlock_lock() > > > LOCK y > > > > > > a problem? That's still a full barrier. > > > > The problem is that I need smp_mb__after_unlock_lock() to give me > > transitivity even if the UNLOCK happened on one CPU and the LOCK > > on another. For that to work, the smp_mb__after_unlock_lock() needs > > to be either immediately after the acquire (the current choice) or > > immediately before the release (which would also work from a purely > > technical viewpoint, but I much prefer the current choice). > > > > Or am I missing your point? > > So lots of little confusions added up to complete fail :-{ > > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but > again not against uninvolved CPUs). > > Which leads me to think I would like to suggest alternative rules for > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > partly responsible for my confusion). Yeah, sorry. I originally used the phrase "fully ordered" but changed it to "full barrier", which has stronger transitivity (newly understood definition) requirements that I didn't intend. RELEASE -> ACQUIRE should be used for message passing between two CPUs and not have ordering effects on other observers unless they're part of the RELEASE -> ACQUIRE chain. > - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when > they operate on the same variable and the ACQUIRE reads from the > RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity. Are we explicit about the difference between "fully ordered" and "full barrier" somewhere else, because this looks like it will confuse people. > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > transitivity) using smp_mb__release_acquire(), either before RELEASE > or after ACQUIRE (but consistently [*]). Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently using (for PPC only). Stepping back a second, I believe that there are three cases: RELEASE X -> ACQUIRE Y (same CPU) * Needs a barrier on TSO architectures for full ordering UNLOCK X -> LOCK Y (same CPU) * Needs a barrier on PPC for full ordering RELEASE X -> ACQUIRE X (different CPUs) UNLOCK X -> ACQUIRE X (different CPUs) * Fully ordered everywhere... * ... but needs a barrier on PPC to become a full barrier so maybe it makes more sense to split out the local and inter-cpu ordering with something like: smp_mb__after_release_acquire() smp_mb__after_release_acquire_local() then the first one directly replaces smp_mb__after_unlock_lock, and is only defined for PPC, whereas the second one is also defined for TSO archs. 
> - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality, > (because each link is fully ordered) but are not transitive. Yup, and that's the same for UNLOCK -> LOCK, too. > And I think that in the past few weeks we've been using transitive > ambiguously, the definition we have in Documentation/memory-barriers.txt > is a _strong_ transitivity, where we can make guarantees about CPUs not > directly involved. > > What we have here (due to RCpc) is a weak form of transitivity, which, > while it preserves the natural concept of causality, does not extend to > other CPUs. > > So we could go around and call them 'strong' and 'weak' transitivity, > but I suspect its easier for everyone involved if we come up with > separate terms (less room for error if we accidentally omit the > 'strong/weak' qualifier). Surely the general case is message passing and so "transitivity" should just refer to chains of RELEASE -> ACQUIRE? Then "strong transitivity" could refer to the far more complicated (imo) case that is synonymous with "full barrier". Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
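[Editorial note: a sketch of the cross-CPU RELEASE X -> ACQUIRE X handoff being
classified above, using the existing smp_store_release()/smp_load_acquire()
primitives; the variable names are illustrative.]

	/* CPU 0 */
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);

	/* CPU 1 */
	if (smp_load_acquire(&flag))
		r1 = READ_ONCE(data);	/* guaranteed to observe data == 1 */

The two CPUs in the chain see the accesses ordered; whether an uninvolved
observer also does is exactly the local-versus-global transitivity question
being debated here.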
Hi Paul, On Thu, Oct 08, 2015 at 03:17:16PM -0700, Paul E. McKenney wrote: > On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote: > > I thought Paul was talking about something like this case: > > > > CPU A CPU B CPU C > > foo = 1 > > UNLOCK x > > LOCK x > > (RELEASE) bar = 1 > > ACQUIRE bar = 1 > > READ_ONCE foo = 0 > > More like this: > > CPU A CPU B CPU C > WRITE_ONCE(foo, 1); > UNLOCK x > LOCK x > r1 = READ_ONCE(bar); > WRITE_ONCE(bar, 1); > smp_mb(); > r2 = READ_ONCE(foo); > > This can result in r1==0 && r2==0. Thank you, that is extremely enlightening :) > > I think we need a PPC litmus test illustrating the inter-thread, same > > lock failure case when smp_mb__after_unlock_lock is not present so that > > we can reason about this properly. Paul? > > Please see above. ;-) > > The corresponding litmus tests are below. How do people feel about including these in memory-barriers.txt? I find them considerably easier to read than our current kernel code + list of possible orderings + wall of text, but there's a good chance that my brain has been corrupted from staring at this stuff for too long. The only snag is the ppc assembly code, but it's not *too* horrific ;) > PPC lock-2thread-WR-barrier.litmus > "" > (* > * Does 3.0 Linux-kernel Power lock-unlock provide local > * barrier that orders prior stores against subsequent loads, > * if the unlock and lock happen on different threads? > * This version uses lwsync instead of isync. > *) > (* 23-July-2013: ppcmem says "Sometimes" *) > { > l=1; > 0:r1=1; 0:r4=x; 0:r10=0; 0:r12=l; > 1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l; > 2:r1=1; 2:r4=x; 2:r5=y; > } > P0 | P1 | P2; > stw r1,0(r4) | lwarx r11,r10,r12 | stw r1,0(r5) ; > lwsync | cmpwi r11,0 | lwsync ; > stw r10,0(r12) | bne Fail1 | lwz r7,0(r4) ; > | stwcx. r1,r10,r12 | ; > | bne Fail1 | ; > | isync | ; > | lwz r3,0(r5) | ; > | Fail1: | ; > > > exists > (1:r3=0 /\ 2:r7=0) We could also include a link to the ppcmem/herd web frontends and your lwn.net article. (ppcmem is already linked, but it's not obvious that you can run litmus tests in your browser). Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > Stepping back a second, I believe that there are three cases: > > > RELEASE X -> ACQUIRE Y (same CPU) > * Needs a barrier on TSO architectures for full ordering +PPC > UNLOCK X -> LOCK Y (same CPU) > * Needs a barrier on PPC for full ordering > RELEASE X -> ACQUIRE X (different CPUs) * Fully ordered everywhere... * ... but needs a barrier on TSO + PPC to become a full barrier > UNLOCK X -> ACQUIRE X (different CPUs) s/ACQUIRE/LOCK/ ? > * Fully ordered everywhere... > * ... but needs a barrier on PPC to become a full barrier If you really meant ACQUIRE, then x86 also needs a barrier in order to upgrade, seeing how our unlock is equivalent to smp_store_release(). Our LOCK otoh is far heavier than smp_load_acquire() and would result in different rules. And I'm not sure the "(different CPUs)" bit makes sense, as the same is true if they're on the same CPU. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > > Which leads me to think I would like to suggest alternative rules for > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > partly responsible for my confusion). > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > to "full barrier", which has stronger transitivity (newly understood > definition) requirements that I didn't intend. > Are we explicit about the difference between "fully ordered" and "full > barrier" somewhere else, because this looks like it will confuse people. I suspect we don't. > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > or after ACQUIRE (but consistently [*]). > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > using (for PPC only). No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC. UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for UNLOCK/LOCK. The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental difference between ACQUIRE and LOCK. Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a RmW and a control dependency. Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need to do that on the inside (either after ACQUIRE or before RELEASE), this is crucial (as per Paul's argument) for the case where the RELEASE and ACQUIRE happen on different CPUs. IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't matter and you can place the barrier in any of the 3 possible locations (before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
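[Editorial note: a sketch of the structural distinction Peter draws between
ACQUIRE and LOCK. The test-and-set loop is an illustrative toy lock, not the
kernel's actual spinlock implementation.]

	/* ACQUIRE: really just a load */
	r1 = smp_load_acquire(&flag);

	/* LOCK: fundamentally a read-modify-write plus a control dependency */
	while (atomic_cmpxchg(&lock_word, 0, 1) != 0)	/* RmW; branching on its result
							 * is the control dependency */
		cpu_relax();

The RmW and the branch on its result are what a LOCK operation has to work
with that a bare acquire load does not.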
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality,
> >   (because each link is fully ordered) but are not transitive.
>
> Yup, and that's the same for UNLOCK -> LOCK, too.

Agreed, except RELEASE/ACQUIRE is more RCpc than UNLOCK/LOCK.

IFF we can get UNLOCK/LOCK as RCsc the chains are strongly transitive,
unlike the RELEASE/ACQUIRE chains, which will be weakly so.
On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote:
> > The corresponding litmus tests are below.
>
> How do people feel about including these in memory-barriers.txt? I find
> them considerably easier to read than our current kernel code + list of
> possible orderings + wall of text, but there's a good chance that my
> brain has been corrupted from staring at this stuff for too long.

Your brain is corrupt (but then, so probably is mine, just differently
so).

I've not yet mastered the knack of reading those things; then again, I
suspect it's not too hard, just not something I've had the time to play
with.
On Fri, Oct 09, 2015 at 01:02:46PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote:
> > Stepping back a second, I believe that there are three cases:
> >
> >   RELEASE X -> ACQUIRE Y (same CPU)
> >     * Needs a barrier on TSO architectures for full ordering
>
> +PPC
>
> >   UNLOCK X -> LOCK Y (same CPU)
> >     * Needs a barrier on PPC for full ordering
> >
> >   RELEASE X -> ACQUIRE X (different CPUs)
>
> * Fully ordered everywhere...
> * ... but needs a barrier on TSO + PPC to become a full barrier
>
> >   UNLOCK X -> ACQUIRE X (different CPUs)
>
> s/ACQUIRE/LOCK/ ?

Yes, sorry.

Will
On Fri, Oct 09, 2015 at 01:12:02PM +0200, Peter Zijlstra wrote: > On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > > > Which leads me to think I would like to suggest alternative rules for > > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > > partly responsible for my confusion). > > > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > > to "full barrier", which has stronger transitivity (newly understood > > definition) requirements that I didn't intend. > > > Are we explicit about the difference between "fully ordered" and "full > > barrier" somewhere else, because this looks like it will confuse people. > > I suspect we don't. > > > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > > or after ACQUIRE (but consistently [*]). > > > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > > using (for PPC only). > > No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC. > > UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for > UNLOCK/LOCK. > > The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental > difference between ACQUIRE and LOCK. But they don't actually differ in the kernel memory model we have right now, thanks to PPC (we can't be stronger than the weakest implementation). That's the whole reason we've got this unlock_lock mess! > Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a > RmW and a control dependency. Have you checked that this is true for the recent RELEASE/ACQUIRE conversions in things like the qrwlock? In particular, we should annotate those control dependencies to make them glaringly obvious if we want to rely on sequentially-consistent locks (and also Alpha may need that). > Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need > to do that on the inside (either after ACQUIRE or before RELEASE), this > is crucial (as per Paul's argument) for the case where the RELEASE and > ACQUIRE happen on different CPUs. > > IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't > matter and you can place the barrier in any of the 3 possible locations > (before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE). Right, but these two need to be different barriers so that we don't penalise TSO when UNLOCK -> LOCK ordering is required. That's why I was proposing the local variant of smp_mb__after_release_acquire(). I think we're in agreement about the barriers we need, we just need to name them (and then I'll cook a patch and we can GOTO 10). Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 01:51:11PM +0100, Will Deacon wrote: > On Fri, Oct 09, 2015 at 01:12:02PM +0200, Peter Zijlstra wrote: > > On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > > > > Which leads me to think I would like to suggest alternative rules for > > > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > > > partly responsible for my confusion). > > > > > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > > > to "full barrier", which has stronger transitivity (newly understood > > > definition) requirements that I didn't intend. > > > > > Are we explicit about the difference between "fully ordered" and "full > > > barrier" somewhere else, because this looks like it will confuse people. > > > > I suspect we don't. > > > > > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > > > or after ACQUIRE (but consistently [*]). > > > > > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > > > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > > > using (for PPC only). > > > > No, we do need that. RELEASE/ACQUIRE is RCpc for TSO as well as PPC. > > > > UNLOCK/LOCK is only RCpc for PPC, the rest of the world has RCsc for > > UNLOCK/LOCK. > > > > The reason RELEASE/ACQUIRE differ from UNLOCK/LOCK is the fundamental > > difference between ACQUIRE and LOCK. > > But they don't actually differ in the kernel memory model we have right > now, thanks to PPC (we can't be stronger than the weakest implementation). > That's the whole reason we've got this unlock_lock mess! Correct, which is why I've suggested to separate UNLOCK/LOCK from RELEASE/ACQUIRE (again). Even if only PPC is RCpc for locks, this means we need to have different upgrade barriers (or suffer superfluous full barriers on TSO archs, which I think we all want to avoid). > > Where ACQUIRE really is just a LOAD, LOCK ends up fundamentally being a > > RmW and a control dependency. > > Have you checked that this is true for the recent RELEASE/ACQUIRE > conversions in things like the qrwlock? In particular, we should annotate > those control dependencies to make them glaringly obvious if we want to > rely on sequentially-consistent locks (and also Alpha may need that). I have not, let me make a note of that. > > Now, if you want to upgrade your RCpc RELEASE/ACQUIRE to RCsc, you need > > to do that on the inside (either after ACQUIRE or before RELEASE), this > > is crucial (as per Paul's argument) for the case where the RELEASE and > > ACQUIRE happen on different CPUs. > > > > IFF RELEASE and ACQUIRE happen on the _same_ CPU, then it doesn't > > matter and you can place the barrier in any of the 3 possible locations > > (before RELEASE, between RELEASE and ACQUIRE, after ACQUIRE). > > Right, but these two need to be different barriers so that we don't > penalise TSO when UNLOCK -> LOCK ordering is required. That's why I was > proposing the local variant of smp_mb__after_release_acquire(). > > I think we're in agreement about the barriers we need, we just need to > name them (and then I'll cook a patch and we can GOTO 10). smp_mb__after_unlock_lock() smp_mb__after_release_acquire() Would work, unless of course we can convince the PPC people to go RCsc on their locks -- which per the benchmark result posted is fairly painful :/ Then again, I do sympathise with them not wanting to find all the bugs for being the odd duck. 
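[Editorial note: a sketch of how the two upgrade barriers named above might sit
in code, assuming both keep the "after the acquiring operation" placement.
smp_mb__after_unlock_lock() exists today; smp_mb__after_release_acquire() is
only a name under discussion in this thread, and the variables are illustrative.]

	/* UNLOCK -> LOCK upgrade (only PPC needs a real barrier here): */
	spin_unlock(&x);
	spin_lock(&y);
	smp_mb__after_unlock_lock();

	/* RELEASE -> ACQUIRE upgrade (TSO architectures and PPC need it): */
	smp_store_release(&s, 1);
	r1 = smp_load_acquire(&t);
	smp_mb__after_release_acquire();

Keeping the two names separate means TSO architectures only pay for the second
one, which is the point of not folding everything into a single primitive.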
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote: > > On Thu, Oct 08, 2015 at 02:44:39PM -0700, Paul E. McKenney wrote: > > > On Thu, Oct 08, 2015 at 01:16:38PM +0200, Peter Zijlstra wrote: > > > > On Thu, Oct 08, 2015 at 02:50:36PM +1100, Michael Ellerman wrote: > > > > > On Wed, 2015-10-07 at 08:25 -0700, Paul E. McKenney wrote: > > > > > > > > > > Currently, we do need smp_mb__after_unlock_lock() to be after the > > > > > > acquisition on PPC -- putting it between the unlock and the lock > > > > > > of course doesn't cut it for the cross-thread unlock/lock case. > > > > > > > > This ^, that makes me think I don't understand > > > > smp_mb__after_unlock_lock. > > > > > > > > How is: > > > > > > > > UNLOCK x > > > > smp_mb__after_unlock_lock() > > > > LOCK y > > > > > > > > a problem? That's still a full barrier. > > > > > > The problem is that I need smp_mb__after_unlock_lock() to give me > > > transitivity even if the UNLOCK happened on one CPU and the LOCK > > > on another. For that to work, the smp_mb__after_unlock_lock() needs > > > to be either immediately after the acquire (the current choice) or > > > immediately before the release (which would also work from a purely > > > technical viewpoint, but I much prefer the current choice). > > > > > > Or am I missing your point? > > > > So lots of little confusions added up to complete fail :-{ I know that feeling! > > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I > > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are > > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but > > again not against uninvolved CPUs). > > > > Which leads me to think I would like to suggest alternative rules for > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > partly responsible for my confusion). > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > to "full barrier", which has stronger transitivity (newly understood > definition) requirements that I didn't intend. > > RELEASE -> ACQUIRE should be used for message passing between two CPUs > and not have ordering effects on other observers unless they're part of > the RELEASE -> ACQUIRE chain. > > > - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when > > they operate on the same variable and the ACQUIRE reads from the > > RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity. RELEASE->ACQUIRE has local transistivity/causality. However, it is more than just message passing: void thread0(void) { r1 = READ_ONCE(a); WRITE_ONCE(b, 1); smp_store_release(&c, 1); } void thread1(void) { r2 = smp_store_acquire(&c); WRITE_ONCE(a, 1); r3 = READ_ONCE(b); smp_store_release(&d, 1); } void thread2(void) { r4 = smp_load_acquire(&d); WRITE_ONCE(a, 2); r5 = READ_ONCE(b); } After the dust settles, if r2==1&&r4==1, we should also have r1==0&&r3==1&&r5==1. And here is the PowerPC ppcmem/herd model: ------------------------------------------------------------------------ PPC LinuxAcqRelLocal "" (* * Does a powerpc lwsync-based Acq-Rel chain provide local causality? * October 9, 2015: Yes. 
*) { a=0; b=0; c=0; d=0; 0:r0=1; 0:r11=a; 0:r12=b; 0:r13=c; 0:r14=d; 1:r0=1; 1:r11=a; 1:r12=b; 1:r13=c; 1:r14=d; 2:r0=2; 2:r11=a; 2:r12=b; 2:r13=c; 2:r14=d; } P0 | P1 | P2 ; lwz r1,0(r11) | lwz r2,0(r13) | lwz r4,0(r14) ; stw r0,0(r12) | lwsync | lwsync ; lwsync | stw r0,0(r11) | stw r0,0(r11) ; stw r0,0(r13) | lwz r3,0(r12) | lwz r5,0(r12) ; | lwsync | ; | stw r0,0(r14) | ; exists (1:r2=1 /\ 2:r4=1 /\ (0:r1=1 \/ 0:r1=2 \/ 1:r3=0 \/ 2:r5=0)) ------------------------------------------------------------------------ This confirms the local causality/transitivity. So let's add an uninvolved observer thread, which will also require adding another variable to avoid ordering via cache coherence: void thread0(void) { r1 = READ_ONCE(a); WRITE_ONCE(b, 1); smp_store_release(&c, 1); } void thread1(void) { r2 = smp_store_acquire(&c); WRITE_ONCE(a, 1); r3 = READ_ONCE(b); smp_store_release(&d, 1); } void thread2(void) { r4 = smp_load_acquire(&d); WRITE_ONCE(a, 2); r5 = READ_ONCE(b); r6 = READ_ONCE(e); } void thread3(void) /* uninvolved observer */ { WRITE_ONCE(e, 1); smp_mb(); r7 = READ_ONCE(b); } Can we see r2==1&&r4==1&&r6=0&&r7==0? Here is the PowerPC model: ------------------------------------------------------------------------ PPC LinuxAcqRelGlobal "" (* * Does a powerpc lwsync-based Acq-Rel chain provide global causality? * October 9, 2015: No. *) { a=0; b=0; c=0; d=0; e=0; 0:r0=1; 0:r11=a; 0:r12=b; 0:r13=c; 0:r14=d; 0:r15=e; 1:r0=1; 1:r11=a; 1:r12=b; 1:r13=c; 1:r14=d; 1:r15=e; 2:r0=2; 2:r11=a; 2:r12=b; 2:r13=c; 2:r14=d; 2:r15=e; 3:r0=1; 3:r11=a; 3:r12=b; 3:r13=c; 3:r14=d; 3:r15=e; } P0 | P1 | P2 | P3 ; lwz r1,0(r11) | lwz r2,0(r13) | lwz r4,0(r14) | stw r0,0(r15) ; stw r0,0(r12) | lwsync | lwsync | sync ; lwsync | stw r0,0(r11) | stw r0,0(r11) | lwz r7,0(r12) ; stw r0,0(r13) | lwz r3,0(r12) | lwz r5,0(r12) | ; | lwsync | lwz r6,0(r15) | ; | stw r0,0(r14) | | ; exists (1:r2=1 /\ 2:r4=1 /\ 2:r6=0 /\ 3:r7=0) ------------------------------------------------------------------------ And, as expected, P3 can see misordering because it is not involved in the acquire-release chain. > Are we explicit about the difference between "fully ordered" and "full > barrier" somewhere else, because this looks like it will confuse people. Probably not. We need something meaning "local transitivity" on the one hand and "global transitivity" on the other. I suppose we could use "causality" for the local case and "transitivity" for the global case, but I do not believe that normal formal/mathematical usage matches this. The term "single-copy atomic" could reasonably be used to mean "local transitivity" and "multi-copy atomic" to mean "global transitivity", but those are all too easy to confuse with each other, require a lot of knowledge to understand, and don't exactly roll off the tongue. The phrases "locally visible causality" and "globally visible causality" should be (relatively) easy to understand, but don't exactly roll off the tongue, either. Maybe LVC and GVC? LVC: Locally visible causality: If you are in the causal chain, you are guaranteed to see the ordering. Otherwise, you might not. GVC: Globally visible causality: You are guarantee to see the ordering whether or not you are in the causal chain. I could of course poll the memory-ordering academics and get more options. ;-) > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > or after ACQUIRE (but consistently [*]). 
> > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > using (for PPC only). If by "don't actually need this" you mean "don't currently know of a use case", I agree. For now, anyway. ;-) > Stepping back a second, I believe that there are three cases: > > > RELEASE X -> ACQUIRE Y (same CPU) > * Needs a barrier on TSO architectures for full ordering Agreed, assuming these restatements match your views: Compiler barrier() for TSO for ordering to be visible within the causal chain. Memory-barrier instruction for TSO for ordering to be visible globally. > UNLOCK X -> LOCK Y (same CPU) > * Needs a barrier on PPC for full ordering Where "full ordering" is what I have been calling "globally visible ordering" (or transitivity or causality or whatever), agreed. > RELEASE X -> ACQUIRE X (different CPUs) > UNLOCK X -> ACQUIRE X (different CPUs) > * Fully ordered everywhere... > * ... but needs a barrier on PPC to become a full barrier Agreed, assuming these restatements match your views: With compiler barrier (for TSO) or lwsync-capable memory-barrier instruction (for weakly ordered systems), ordering is visible within the causal chain. With compiler barrier (for TSO) or full memory-barrier instruction (for weakly ordered systems), ordering is visible globally. > so maybe it makes more sense to split out the local and inter-cpu ordering > with something like: > > smp_mb__after_release_acquire() > smp_mb__after_release_acquire_local() > > then the first one directly replaces smp_mb__after_unlock_lock, and is > only defined for PPC, whereas the second one is also defined for TSO archs. I don't see the need for the _local() variant. We already have smp_store_release() and smp_load_acquire(), both of which imply any barriers needed for the case of locally visible causality. > > - RELEASE -> ACQUIRE _chains_ (on shared variables) preserve causality, > > (because each link is fully ordered) but are not transitive. > > Yup, and that's the same for UNLOCK -> LOCK, too. Any transitive ordering is visible only locally, within the chain. For global visibility, you need smp_mb__after_release_acquire(). > > And I think that in the past few weeks we've been using transitive > > ambiguously, the definition we have in Documentation/memory-barriers.txt > > is a _strong_ transitivity, where we can make guarantees about CPUs not > > directly involved. > > > > What we have here (due to RCpc) is a weak form of transitivity, which, > > while it preserves the natural concept of causality, does not extend to > > other CPUs. > > > > So we could go around and call them 'strong' and 'weak' transitivity, > > but I suspect its easier for everyone involved if we come up with > > separate terms (less room for error if we accidentally omit the > > 'strong/weak' qualifier). > > Surely the general case is message passing and so "transitivity" should > just refer to chains of RELEASE -> ACQUIRE? Then "strong transitivity" > could refer to the far more complicated (imo) case that is synonymous > with "full barrier". We should have fun agreeing on the best words to describe all this, but agreed. One nit: RELEASE->ACQUIRE chains can do more than message passing. Anything before the RELEASE, be it a load or a store, is visible to anything after any ACQUIRE further down the chain. Here a prior load is "visible" to a subsequent store in the sense that the store cannot affect the value loaded. This is consistent with the C11/C++11 memory model. 
Seem reasonable? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote: > Hi Paul, > > On Thu, Oct 08, 2015 at 03:17:16PM -0700, Paul E. McKenney wrote: > > On Thu, Oct 08, 2015 at 01:59:38PM +0100, Will Deacon wrote: > > > I thought Paul was talking about something like this case: > > > > > > CPU A CPU B CPU C > > > foo = 1 > > > UNLOCK x > > > LOCK x > > > (RELEASE) bar = 1 > > > ACQUIRE bar = 1 > > > READ_ONCE foo = 0 > > > > More like this: > > > > CPU A CPU B CPU C > > WRITE_ONCE(foo, 1); > > UNLOCK x > > LOCK x > > r1 = READ_ONCE(bar); > > WRITE_ONCE(bar, 1); > > smp_mb(); > > r2 = READ_ONCE(foo); > > > > This can result in r1==0 && r2==0. > > Thank you, that is extremely enlightening :) > > > > I think we need a PPC litmus test illustrating the inter-thread, same > > > lock failure case when smp_mb__after_unlock_lock is not present so that > > > we can reason about this properly. Paul? > > > > Please see above. ;-) > > > > The corresponding litmus tests are below. > > How do people feel about including these in memory-barriers.txt? I find > them considerably easier to read than our current kernel code + list of > possible orderings + wall of text, but there's a good chance that my > brain has been corrupted from staring at this stuff for too long. > > The only snag is the ppc assembly code, but it's not *too* horrific ;) Maybe we should include them as separate files in Documentation/litmus or some such. We could then use a litmus-test style with Linux-kernel C code, and reference litmus tests for various architectures. Maybe Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on. For example, the first example in memory-barriers.txt is this: ------------------------------------------------------------------------ CPU 1 CPU 2 =============== =============== { A == 1; B == 2 } A = 3; x = B; B = 4; y = A; The set of accesses as seen by the memory system in the middle can be arranged in 24 different combinations: STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4 STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3 STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4 STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4 STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3 STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4 STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4 STORE B=4, ... ... and can thus result in four different combinations of values: x == 2, y == 1 x == 2, y == 3 x == 4, y == 1 x == 4, y == 3 ------------------------------------------------------------------------ Maybe this changes to: ------------------------------------------------------------------------ Linux MP "" { a=1; b=2; } P0 | P1 ; WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ; WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ; exists (2:r1=4 /\ 2:r2=3) ------------------------------------------------------------------------ We can then state that this assertion can fail. We could include either ppcmem or herd output along with the litmus tests, which would allow the curious to see a full list of the possible outcomes. > > PPC lock-2thread-WR-barrier.litmus > > "" > > (* > > * Does 3.0 Linux-kernel Power lock-unlock provide local > > * barrier that orders prior stores against subsequent loads, > > * if the unlock and lock happen on different threads? > > * This version uses lwsync instead of isync. 
> > *) > > (* 23-July-2013: ppcmem says "Sometimes" *) > > { > > l=1; > > 0:r1=1; 0:r4=x; 0:r10=0; 0:r12=l; > > 1:r1=1; 1:r3=42; 1:r4=x; 1:r5=y; 1:r10=0; 1:r11=0; 1:r12=l; > > 2:r1=1; 2:r4=x; 2:r5=y; > > } > > P0 | P1 | P2; > > stw r1,0(r4) | lwarx r11,r10,r12 | stw r1,0(r5) ; > > lwsync | cmpwi r11,0 | lwsync ; > > stw r10,0(r12) | bne Fail1 | lwz r7,0(r4) ; > > | stwcx. r1,r10,r12 | ; > > | bne Fail1 | ; > > | isync | ; > > | lwz r3,0(r5) | ; > > | Fail1: | ; > > > > > > exists > > (1:r3=0 /\ 2:r7=0) > > We could also include a link to the ppcmem/herd web frontends and your > lwn.net article. (ppcmem is already linked, but it's not obvious that > you can run litmus tests in your browser). I bet that the URLs for the web frontends are not stable long term. Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly number of years, but professors do occasionally move from one institution to another. For but one example, Susmit Sarkar is now at University of St. Andrews rather than at Cambridge. So to make this work, we probably need to be thinking in terms of asking the researchers for permission to include their ocaml code in the Linux-kernel source tree. I would be strongly in favor of this, actually. Thoughts? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
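For reference, the same unlock/lock scenario written as a sketch in the kernel-level litmus style floated above, with x a spinlock (the syntax is only the proposal from this thread, not something the current tools accept; the interesting executions are those in which P1 acquires x after P0 releases it, which the PPC test above enforces by having P0 initially hold the lock):

	Linux UNLOCK-LOCK
	""
	{
	foo=0; bar=0;
	}
	 P0                  | P1                  | P2                  ;
	 spin_lock(&x)       | spin_lock(&x)       | WRITE_ONCE(bar, 1)  ;
	 WRITE_ONCE(foo, 1)  | r1 = READ_ONCE(bar) | smp_mb()            ;
	 spin_unlock(&x)     | spin_unlock(&x)     | r2 = READ_ONCE(foo) ;

	exists (1:r1=0 /\ 2:r2=0)

Per the ppcmem "Sometimes" result above, the exists clause can be satisfied on PPC; that is the gap smp_mb__after_unlock_lock() is meant to close.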
On Fri, Oct 09, 2015 at 01:25:54PM +0200, Peter Zijlstra wrote: > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote: > > > The corresponding litmus tests are below. > > > > How do people feel about including these in memory-barriers.txt? I find > > them considerably easier to read than our current kernel code + list of > > possible orderings + wall of text, but there's a good chance that my > > brain has been corrupted from staring at this stuff for too long. > > Your brain is corrupt (but then, so probably is mine, just differently > so). > > I've not yet mastered the knack of reading those things; then again, I > suspect its not too hard, just not something I've had the time to play > with. It does take a bit of practice, but is worth the effort. Of course, I -would- say that. ;-) Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote: > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote: > > How do people feel about including these in memory-barriers.txt? I find > > them considerably easier to read than our current kernel code + list of > > possible orderings + wall of text, but there's a good chance that my > > brain has been corrupted from staring at this stuff for too long. > > > > The only snag is the ppc assembly code, but it's not *too* horrific ;) > > Maybe we should include them as separate files in Documentation/litmus > or some such. We could then use a litmus-test style with Linux-kernel > C code, and reference litmus tests for various architectures. Maybe > Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on. I think that would be useful, particularly if we had a way to "compile" a kernel litmus test into one for a particular architecture. > For example, the first example in memory-barriers.txt is this: > > ------------------------------------------------------------------------ > CPU 1 CPU 2 > =============== =============== > { A == 1; B == 2 } > A = 3; x = B; > B = 4; y = A; > > The set of accesses as seen by the memory system in the middle can be arranged > in 24 different combinations: > > STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4 > STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3 > STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4 > STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4 > STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3 > STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4 > STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4 > STORE B=4, ... > ... > > and can thus result in four different combinations of values: > > x == 2, y == 1 > x == 2, y == 3 > x == 4, y == 1 > x == 4, y == 3 > ------------------------------------------------------------------------ > > Maybe this changes to: > > ------------------------------------------------------------------------ > Linux MP > "" > { > a=1; b=2; > } > P0 | P1 ; > WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ; > WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ; > > exists (2:r1=4 /\ 2:r2=3) > ------------------------------------------------------------------------ > > We can then state that this assertion can fail. We could include > either ppcmem or herd output along with the litmus tests, which would > allow the curious to see a full list of the possible outcomes. More importantly, it would allow them to make small changes to the test and see what the outcome is, without us having to spawn another thread-of-death on LKML :) > > We could also include a link to the ppcmem/herd web frontends and your > > lwn.net article. (ppcmem is already linked, but it's not obvious that > > you can run litmus tests in your browser). > > I bet that the URLs for the web frontends are not stable long term. > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly > number of years, but professors do occasionally move from one institution > to another. For but one example, Susmit Sarkar is now at University > of St. Andrews rather than at Cambridge. > > So to make this work, we probably need to be thinking in terms of > asking the researchers for permission to include their ocaml code in the > Linux-kernel source tree. I would be strongly in favor of this, actually. > > Thoughts? I'm extremely hesitant to import a bunch of dubiously licensed, academic ocaml code into the kernel. Even if we did, who would maintain it? 
A better solution might be to host a mirror of the code on kernel.org, along with a web front-end for people to play with (the tests we're talking about here do seem to run ok in my browser). Will -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
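To illustrate the "compile a kernel litmus test into one for a particular architecture" idea, the Linux MP test above might map onto something like this for x86 (my best guess at the litmus x86 syntax, so treat it as a sketch):

	X86 MP
	""
	{ a=1; b=2; }
	 P0          | P1          ;
	 MOV [a],$3  | MOV EAX,[b] ;
	 MOV [b],$4  | MOV EBX,[a] ;

	exists (1:EAX=4 /\ 1:EBX=1)

Here the exists clause asks whether P1 can see the new value of b but the old value of a. TSO forbids that outcome (stores propagate in order and loads are not reordered with other loads), whereas the corresponding PPC test with no barriers allows it, which is exactly the per-architecture difference such generated tests would make visible.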
On Fri, Oct 09, 2015 at 07:33:28PM +0100, Will Deacon wrote: > On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote: > > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote: > > > How do people feel about including these in memory-barriers.txt? I find > > > them considerably easier to read than our current kernel code + list of > > > possible orderings + wall of text, but there's a good chance that my > > > brain has been corrupted from staring at this stuff for too long. > > > > > > The only snag is the ppc assembly code, but it's not *too* horrific ;) > > > > Maybe we should include them as separate files in Documentation/litmus > > or some such. We could then use a litmus-test style with Linux-kernel > > C code, and reference litmus tests for various architectures. Maybe > > Documentation/litmus/{arm,arm64,powerpc,x86}/ and so on. > > I think that would be useful, particularly if we had a way to "compile" > a kernel litmus test into one for a particular architecture. Very useful! Probably some limitations, especially for loops and lock acquisition, but what else is new? > > For example, the first example in memory-barriers.txt is this: > > > > ------------------------------------------------------------------------ > > CPU 1 CPU 2 > > =============== =============== > > { A == 1; B == 2 } > > A = 3; x = B; > > B = 4; y = A; > > > > The set of accesses as seen by the memory system in the middle can be arranged > > in 24 different combinations: > > > > STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4 > > STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3 > > STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4 > > STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4 > > STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3 > > STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4 > > STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4 > > STORE B=4, ... > > ... > > > > and can thus result in four different combinations of values: > > > > x == 2, y == 1 > > x == 2, y == 3 > > x == 4, y == 1 > > x == 4, y == 3 > > ------------------------------------------------------------------------ > > > > Maybe this changes to: > > > > ------------------------------------------------------------------------ > > Linux MP > > "" > > { > > a=1; b=2; > > } > > P0 | P1 ; > > WRITE_ONCE(a, 3) | r1 = READ_ONCE(b) ; > > WRITE_ONCE(b, 4) | r2 = READ_ONCE(a) ; > > > > exists (2:r1=4 /\ 2:r2=3) > > ------------------------------------------------------------------------ > > > > We can then state that this assertion can fail. We could include > > either ppcmem or herd output along with the litmus tests, which would > > allow the curious to see a full list of the possible outcomes. > > More importantly, it would allow them to make small changes to the test > and see what the outcome is, without us having to spawn another > thread-of-death on LKML :) Or perhaps we could at least hope for thread-of-tool-supported-death on LKML. ;-) > > > We could also include a link to the ppcmem/herd web frontends and your > > > lwn.net article. (ppcmem is already linked, but it's not obvious that > > > you can run litmus tests in your browser). > > > > I bet that the URLs for the web frontends are not stable long term. > > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly > > number of years, but professors do occasionally move from one institution > > to another. For but one example, Susmit Sarkar is now at University > > of St. Andrews rather than at Cambridge. 
> > > > So to make this work, we probably need to be thinking in terms of > > asking the researchers for permission to include their ocaml code in the > > Linux-kernel source tree. I would be strongly in favor of this, actually. > > > > Thoughts? > > I'm extremely hesitant to import a bunch of dubiously licensed, academic > ocaml code into the kernel. Even if we did, who would maintain it? > > A better solution might be to host a mirror of the code on kernel.org, > along with a web front-end for people to play with (the tests we're talking > about here do seem to run ok in my browser). I am not too worried about how this happens, but we should avoid constraining the work of our academic partners. The reason I was thinking in terms of in the kernel was to avoid version-synchronization issues. "Wait, this is Linux kernel v4.17, which means that you need to use version 8.3.5.1 of the tooling... And with these four patches as well." Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote: [snip] > > > > So lots of little confusions added up to complete fail :-{ > > > > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I > > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are > > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but > > again not against uninvolved CPUs). > > > > Which leads me to think I would like to suggest alternative rules for > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > partly responsible for my confusion). > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > to "full barrier", which has stronger transitivity (newly understood > definition) requirements that I didn't intend. > > RELEASE -> ACQUIRE should be used for message passing between two CPUs > and not have ordering effects on other observers unless they're part of > the RELEASE -> ACQUIRE chain. > > > - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when > > they operate on the same variable and the ACQUIRE reads from the > > RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity. > > Are we explicit about the difference between "fully ordered" and "full > barrier" somewhere else, because this looks like it will confuse people. > This is confusing me right now. ;-) Let's use a simple example for only one primitive, as I understand it, if we say a primitive A is "fully ordered", we actually mean: 1. The memory operations preceding(in program order) A can't be reordered after the memory operations following(in PO) A. and 2. The memory operation(s) in A can't be reordered before the memory operations preceding(in PO) A and after the memory operations following(in PO) A. If we say A is a "full barrier", we actually means: 1. The memory operations preceding(in program order) A can't be reordered after the memory operations following(in PO) A. and 2. The memory ordering guarantee in #1 is visible globally. Is that correct? Or "full barrier" is more strong than I understand, i.e. there is a third property of "full barrier": 3. The memory operation(s) in A can't be reordered before the memory operations preceding(in PO) A and after the memory operations following(in PO) A. IOW, is "full barrier" a more strong version of "fully ordered" or not? Regards, Boqun > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > or after ACQUIRE (but consistently [*]). > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > using (for PPC only). > > Stepping back a second, I believe that there are three cases: > > > RELEASE X -> ACQUIRE Y (same CPU) > * Needs a barrier on TSO architectures for full ordering > > UNLOCK X -> LOCK Y (same CPU) > * Needs a barrier on PPC for full ordering > > RELEASE X -> ACQUIRE X (different CPUs) > UNLOCK X -> ACQUIRE X (different CPUs) > * Fully ordered everywhere... > * ... but needs a barrier on PPC to become a full barrier > >
On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote: > This is confusing me right now. ;-) > > Let's use a simple example for only one primitive, as I understand it, > if we say a primitive A is "fully ordered", we actually mean: > > 1. The memory operations preceding(in program order) A can't be > reordered after the memory operations following(in PO) A. > > and > > 2. The memory operation(s) in A can't be reordered before the > memory operations preceding(in PO) A and after the memory > operations following(in PO) A. > > If we say A is a "full barrier", we actually means: > > 1. The memory operations preceding(in program order) A can't be > reordered after the memory operations following(in PO) A. > > and > > 2. The memory ordering guarantee in #1 is visible globally. > > Is that correct? Or "full barrier" is more strong than I understand, > i.e. there is a third property of "full barrier": > > 3. The memory operation(s) in A can't be reordered before the > memory operations preceding(in PO) A and after the memory > operations following(in PO) A. > > IOW, is "full barrier" a more strong version of "fully ordered" or not? Yes, that was how I used it. Now of course; the big question is do we want to promote this usage or come up with a different set of words describing this stuff. I think separating the ordering from the transitivity is useful, for we can then talk about and specify them independently. That is, we can say: LOAD-ACQUIRE: orders LOAD->{LOAD,STORE} weak transitivity (RCpc) MB: orders {LOAD,STORE}->{LOAD,STORE} (fully ordered) strong transitivity (RCsc) etc.. Also, in the above I used weak and strong transitivity, but that too is of course up for grabs. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Mon, Oct 19, 2015 at 12:23:24PM +0200, Peter Zijlstra wrote: > On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote: > > This is confusing me right now. ;-) > > > > Let's use a simple example for only one primitive, as I understand it, > > if we say a primitive A is "fully ordered", we actually mean: > > > > 1. The memory operations preceding(in program order) A can't be > > reordered after the memory operations following(in PO) A. > > > > and > > > > 2. The memory operation(s) in A can't be reordered before the > > memory operations preceding(in PO) A and after the memory > > operations following(in PO) A. > > > > If we say A is a "full barrier", we actually means: > > > > 1. The memory operations preceding(in program order) A can't be > > reordered after the memory operations following(in PO) A. > > > > and > > > > 2. The memory ordering guarantee in #1 is visible globally. > > > > Is that correct? Or "full barrier" is more strong than I understand, > > i.e. there is a third property of "full barrier": > > > > 3. The memory operation(s) in A can't be reordered before the > > memory operations preceding(in PO) A and after the memory > > operations following(in PO) A. > > > > IOW, is "full barrier" a more strong version of "fully ordered" or not? > > Yes, that was how I used it. > > Now of course; the big question is do we want to promote this usage or > come up with a different set of words describing this stuff. > > I think separating the ordering from the transitivity is useful, for we > can then talk about and specify them independently. > Great idea! > That is, we can say: > > LOAD-ACQUIRE: orders LOAD->{LOAD,STORE} > weak transitivity (RCpc) > > MB: orders {LOAD,STORE}->{LOAD,STORE} (fully ordered) > strong transitivity (RCsc) > It will be helpful if we have this kind of description for each primitive mentioned in memory-barriers.txt, which, IMO, is better than the description like the following: """ Any atomic operation that modifies some state in memory and returns information about the state (old or new) implies an SMP-conditional general memory barrier (smp_mb()) on each side of the actual operation (with the exception of """ I'm assuming that the arrow "->" stands for the program order, and word "orders" means that a primitive guarantees some program order becomes the memory operation order, so that the description above can be rewritten as: value-returning atomics: orders {LOAD,STORE}->RmW(atomic operation)->{LOAD,STORE} strong transitivity much simpler and clearer for discussion and reasoning Regards, Boqun > etc.. > > Also, in the above I used weak and strong transitivity, but that too is > of course up for grabs.
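Following that notation, a sketch of how some of the existing primitives might be annotated (one reading of the thread so far, not settled definitions):

	smp_load_acquire():   orders LOAD->{LOAD,STORE}
	                      weak transitivity (RCpc)

	smp_store_release():  orders {LOAD,STORE}->STORE
	                      weak transitivity (RCpc)

	smp_mb():             orders {LOAD,STORE}->{LOAD,STORE} (fully ordered)
	                      strong transitivity (RCsc)

	atomic_add_return():  orders {LOAD,STORE}->RmW->{LOAD,STORE}
	                      strong transitivity

	atomic_add():         no ordering guarantees beyond the atomicity of the
	                      RmW itself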
On Mon, Oct 12, 2015 at 04:30:48PM -0700, Paul E. McKenney wrote: > On Fri, Oct 09, 2015 at 07:33:28PM +0100, Will Deacon wrote: > > On Fri, Oct 09, 2015 at 10:43:27AM -0700, Paul E. McKenney wrote: > > > On Fri, Oct 09, 2015 at 10:51:29AM +0100, Will Deacon wrote: [snip] > > > > > We could also include a link to the ppcmem/herd web frontends and your > > > > lwn.net article. (ppcmem is already linked, but it's not obvious that > > > > you can run litmus tests in your browser). > > > > > > I bet that the URLs for the web frontends are not stable long term. > > > Don't get me wrong, PPCMEM/ARMMEM has been there for me for a goodly > > > number of years, but professors do occasionally move from one institution > > > to another. For but one example, Susmit Sarkar is now at University > > > of St. Andrews rather than at Cambridge. > > > > > > So to make this work, we probably need to be thinking in terms of > > > asking the researchers for permission to include their ocaml code in the > > > Linux-kernel source tree. I would be strongly in favor of this, actually. > > > > > > Thoughts? > > > > I'm extremely hesitant to import a bunch of dubiously licensed, academic > > ocaml code into the kernel. Even if we did, who would maintain it? > > > > A better solution might be to host a mirror of the code on kernel.org, > > along with a web front-end for people to play with (the tests we're talking > > about here do seem to run ok in my browser). > > I am not too worried about how this happens, but we should avoid > constraining the work of our academic partners. The reason I was thinking > in terms of in the kernel was to avoid version-synchronization issues. > "Wait, this is Linux kernel v4.17, which means that you need to use > version 8.3.5.1 of the tooling... And with these four patches as well." > Maybe including only the models' code (arm.cat, ppc.cat, etc.) in the kernel, rather than the whole code base, could also solve the version-synchronization problem to some degree and avoid having to maintain the whole tool? I'm assuming that modifying the verifier's code, as opposed to the models' code, is unlikely to change the result of a litmus test. Regards, Boqun
On Mon, Oct 19, 2015 at 09:17:18AM +0800, Boqun Feng wrote: > On Fri, Oct 09, 2015 at 10:40:39AM +0100, Will Deacon wrote: > > On Fri, Oct 09, 2015 at 10:31:38AM +0200, Peter Zijlstra wrote: > [snip] > > > > > > So lots of little confusions added up to complete fail :-{ > > > > > > Mostly I think it was the UNLOCK x + LOCK x are fully ordered (where I > > > forgot: but not against uninvolved CPUs) and RELEASE/ACQUIRE are > > > transitive (where I forgot: RELEASE/ACQUIRE _chains_ are transitive, but > > > again not against uninvolved CPUs). > > > > > > Which leads me to think I would like to suggest alternative rules for > > > RELEASE/ACQUIRE (to replace those Will suggested; as I think those are > > > partly responsible for my confusion). > > > > Yeah, sorry. I originally used the phrase "fully ordered" but changed it > > to "full barrier", which has stronger transitivity (newly understood > > definition) requirements that I didn't intend. > > > > RELEASE -> ACQUIRE should be used for message passing between two CPUs > > and not have ordering effects on other observers unless they're part of > > the RELEASE -> ACQUIRE chain. > > > > > - RELEASE -> ACQUIRE is fully ordered (but not a full barrier) when > > > they operate on the same variable and the ACQUIRE reads from the > > > RELEASE. Notable, RELEASE/ACQUIRE are RCpc and lack transitivity. > > > > Are we explicit about the difference between "fully ordered" and "full > > barrier" somewhere else, because this looks like it will confuse people. > > > > This is confusing me right now. ;-) > > Let's use a simple example for only one primitive, as I understand it, > if we say a primitive A is "fully ordered", we actually mean: > > 1. The memory operations preceding(in program order) A can't be > reordered after the memory operations following(in PO) A. > > and > > 2. The memory operation(s) in A can't be reordered before the > memory operations preceding(in PO) A and after the memory > operations following(in PO) A. > > If we say A is a "full barrier", we actually means: > > 1. The memory operations preceding(in program order) A can't be > reordered after the memory operations following(in PO) A. > > and > > 2. The memory ordering guarantee in #1 is visible globally. > > Is that correct? Or "full barrier" is more strong than I understand, > i.e. there is a third property of "full barrier": > > 3. The memory operation(s) in A can't be reordered before the > memory operations preceding(in PO) A and after the memory > operations following(in PO) A. > > IOW, is "full barrier" a more strong version of "fully ordered" or not? There is also the question of whether the barrier forces ordering of unrelated stores, everything initially zero and all accesses READ_ONCE() or WRITE_ONCE(): P0 P1 P2 P3 X = 1; Y = 1; r1 = X; r3 = Y; some_barrier(); some_barrier(); r2 = Y; r4 = X; P2's and P3's ordering could be globally visible without requiring P0's and P1's independent stores to be ordered, for example, if you used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() for barrier, everyone would agree on the order of P0's and P0's stores. There are actually a fair number of different combinations of aspects of memory ordering. We will need to choose wisely. ;-) My hope is that the store-ordering gets folded into the globally visible transitive level. Especially given that I have not (yet) seen any algorithms used in production that relied on the ordering of independent stores. 
Thanx, Paul > Regards, > Boqun > > > > - RELEASE -> ACQUIRE can be upgraded to a full barrier (including > > > transitivity) using smp_mb__release_acquire(), either before RELEASE > > > or after ACQUIRE (but consistently [*]). > > > > Hmm, but we don't actually need this for RELEASE -> ACQUIRE, afaict. This > > is just needed for UNLOCK -> LOCK, and is exactly what RCU is currently > > using (for PPC only). > > > > Stepping back a second, I believe that there are three cases: > > > > > > RELEASE X -> ACQUIRE Y (same CPU) > > * Needs a barrier on TSO architectures for full ordering > > > > UNLOCK X -> LOCK Y (same CPU) > > * Needs a barrier on PPC for full ordering > > > > RELEASE X -> ACQUIRE X (different CPUs) > > UNLOCK X -> ACQUIRE X (different CPUs) > > * Fully ordered everywhere... > > * ... but needs a barrier on PPC to become a full barrier > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
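The independent-stores example, written out in the kernel-level litmus style discussed earlier (again, the syntax is only the proposal from this thread; everything initially zero; this is the shape usually called IRIW):

	Linux IRIW
	""
	{
	x=0; y=0;
	}
	 P0                | P1                | P2                 | P3                 ;
	 WRITE_ONCE(x, 1)  | WRITE_ONCE(y, 1)  | r1 = READ_ONCE(x)  | r3 = READ_ONCE(y)  ;
	                   |                   | smp_mb()           | smp_mb()           ;
	                   |                   | r2 = READ_ONCE(y)  | r4 = READ_ONCE(x)  ;

	exists (2:r1=1 /\ 2:r2=0 /\ 3:r3=1 /\ 3:r4=0)

With smp_mb() on the readers the exists clause cannot be satisfied: as Paul says above, everyone must then agree on the order of the two stores. Replace the smp_mb()s with smp_rmb() and the clause becomes satisfiable on architectures whose read barriers do not force an ordering of other CPUs' stores.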
On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote: > There is also the question of whether the barrier forces ordering > of unrelated stores, everything initially zero and all accesses > READ_ONCE() or WRITE_ONCE(): > > P0 P1 P2 P3 > X = 1; Y = 1; r1 = X; r3 = Y; > some_barrier(); some_barrier(); > r2 = Y; r4 = X; > > P2's and P3's ordering could be globally visible without requiring > P0's and P1's independent stores to be ordered, for example, if you > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > for barrier, everyone would agree on the order of P0's and P0's stores. Oh!? > There are actually a fair number of different combinations of > aspects of memory ordering. We will need to choose wisely. ;-) > > My hope is that the store-ordering gets folded into the globally > visible transitive level. Especially given that I have not (yet) > seen any algorithms used in production that relied on the ordering of > independent stores. I would hope not, that's quite insane. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Paul E. McKenney > Sent: 21 October 2015 00:35 ... > There is also the question of whether the barrier forces ordering > of unrelated stores, everything initially zero and all accesses > READ_ONCE() or WRITE_ONCE(): > > P0 P1 P2 P3 > X = 1; Y = 1; r1 = X; r3 = Y; > some_barrier(); some_barrier(); > r2 = Y; r4 = X; > > P2's and P3's ordering could be globally visible without requiring > P0's and P1's independent stores to be ordered, for example, if you > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > for barrier, everyone would agree on the order of P0's and P0's stores. > > There are actually a fair number of different combinations of > aspects of memory ordering. We will need to choose wisely. ;-) My thoughts on this are that most code probably isn't performance critical enough to be using anything other than normal locks for inter-cpu synchronisation. Certainly most people are likely to get it wrong somewhere. So you want a big red sticker saying 'Don't try to be too clever'. Also without examples of why things go wrong (eg member_consumer() and alpha) it is difficult to understand the differences between all the barriers (etc). OTOH device driver code may need things slightly stronger than barrier() (which I think is asm(:::"memory")) to sequence accesses to hardware devices (and memory the hardware reads), but without having a strong barrier in every ioread/write() access. David
On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote: > On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote: > > There is also the question of whether the barrier forces ordering > > of unrelated stores, everything initially zero and all accesses > > READ_ONCE() or WRITE_ONCE(): > > > > P0 P1 P2 P3 > > X = 1; Y = 1; r1 = X; r3 = Y; > > some_barrier(); some_barrier(); > > r2 = Y; r4 = X; > > > > P2's and P3's ordering could be globally visible without requiring > > P0's and P1's independent stores to be ordered, for example, if you > > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > > for barrier, everyone would agree on the order of P0's and P0's stores. > > Oh!? Behold sequential consistency, worshipped fervently by a surprisingly large number of people! Something about legacy proof methods, as near as I can tell. ;-) > > There are actually a fair number of different combinations of > > aspects of memory ordering. We will need to choose wisely. ;-) > > > > My hope is that the store-ordering gets folded into the globally > > visible transitive level. Especially given that I have not (yet) > > seen any algorithms used in production that relied on the ordering of > > independent stores. > > I would hope not, that's quite insane. Your point being? ;-) Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Oct 21, 2015 at 04:04:04PM +0000, David Laight wrote: > From: Paul E. McKenney > > Sent: 21 October 2015 00:35 > ... > > There is also the question of whether the barrier forces ordering > > of unrelated stores, everything initially zero and all accesses > > READ_ONCE() or WRITE_ONCE(): > > > > P0 P1 P2 P3 > > X = 1; Y = 1; r1 = X; r3 = Y; > > some_barrier(); some_barrier(); > > r2 = Y; r4 = X; > > > > P2's and P3's ordering could be globally visible without requiring > > P0's and P1's independent stores to be ordered, for example, if you > > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > > for barrier, everyone would agree on the order of P0's and P0's stores. > > > > There are actually a fair number of different combinations of > > aspects of memory ordering. We will need to choose wisely. ;-) > > My thoughts on this are that most code probably isn't performance critical > enough to be using anything other than normal locks for inter-cpu > synchronisation. > Certainly most people are likely to get it wrong somewhere. > So you want a big red sticker saying 'Don't try to be too clever'. I am afraid that I would run out of red stickers rather quickly, given the large number of ways that one can shoot oneself in the foot, even when single-threaded. > Also without examples of why things go wrong (eg member_consumer() > and alpha) it is difficult to understand the differences between > all the barriers (etc). Not just the hardware peculiarities. It is also important to understand the common use cases. > OTOH device driver code may need things slightly stronger than > barrier() (which I think is asm(:::"memory")) to sequence accesses > to hardware devices (and memory the hardware reads), but without > having a strong barrier in every ioread/write() access. There are more memory models than you can shake a stick at, so yes, we do have to choose carefully. And yes, it does get more complex when you add MMIO, and no, I don't know of any formal model that takes MMIO into account. Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
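For the driver case David raises, barrier() is indeed just a compiler barrier:

	#define barrier() __asm__ __volatile__("" : : : "memory")

and the usual way to get device-visible ordering without a heavy barrier in every accessor is to combine the _relaxed MMIO accessors with an explicit barrier only where it matters, e.g. when publishing a DMA descriptor before ringing a doorbell (the register name and offset here are made up; the pattern is the point):

	desc->addr = dma_addr;     /* normal memory the device will read */
	desc->len  = len;
	wmb();                     /* order descriptor writes before the MMIO write */
	writel_relaxed(DOORBELL_GO, dev->mmio + DOORBELL_REG);

On many architectures plain writel() provides similar ordering implicitly, at the cost of a barrier on every MMIO write.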
On Wed, Oct 21, 2015 at 12:29:23PM -0700, Paul E. McKenney wrote: > On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote: > > On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote: > > > There is also the question of whether the barrier forces ordering > > > of unrelated stores, everything initially zero and all accesses > > > READ_ONCE() or WRITE_ONCE(): > > > > > > P0 P1 P2 P3 > > > X = 1; Y = 1; r1 = X; r3 = Y; > > > some_barrier(); some_barrier(); > > > r2 = Y; r4 = X; > > > > > > P2's and P3's ordering could be globally visible without requiring > > > P0's and P1's independent stores to be ordered, for example, if you > > > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > > > for barrier, everyone would agree on the order of P0's and P0's stores. > > > > Oh!? > > Behold sequential consistency, worshipped fervently by a surprisingly > large number of people! Something about legacy proof methods, as near > as I can tell. ;-) But how can smp_mb() guarantee anything about P[01]? There is but the single store, which can race against P[23] arbitrarily. There is nothing to order. Maybe I'm confused again.. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Oct 21, 2015 at 09:36:44PM +0200, Peter Zijlstra wrote: > On Wed, Oct 21, 2015 at 12:29:23PM -0700, Paul E. McKenney wrote: > > On Wed, Oct 21, 2015 at 10:24:52AM +0200, Peter Zijlstra wrote: > > > On Tue, Oct 20, 2015 at 04:34:51PM -0700, Paul E. McKenney wrote: > > > > There is also the question of whether the barrier forces ordering > > > > of unrelated stores, everything initially zero and all accesses > > > > READ_ONCE() or WRITE_ONCE(): > > > > > > > > P0 P1 P2 P3 > > > > X = 1; Y = 1; r1 = X; r3 = Y; > > > > some_barrier(); some_barrier(); > > > > r2 = Y; r4 = X; > > > > > > > > P2's and P3's ordering could be globally visible without requiring > > > > P0's and P1's independent stores to be ordered, for example, if you > > > > used smp_rmb() for some_barrier(). In contrast, if we used smp_mb() > > > > for barrier, everyone would agree on the order of P0's and P0's stores. > > > > > > Oh!? > > > > Behold sequential consistency, worshipped fervently by a surprisingly > > large number of people! Something about legacy proof methods, as near > > as I can tell. ;-) > > But how can smp_mb() guarantee anything about P[01]? There is but the > single store, which can race against P[23] arbitrarily. There is nothing > to order. Indeed, if your barrier is only acting locally, there is no way that you can order the two stores. However, some barriers act non-locally, and this non-local action can order the stores. This non-locality is "cumulativity" in the PowerPC docs. And x86 can also order P0's and P1's stores, courtesy of the "T" in "TSO". > Maybe I'm confused again.. They say that confusion is the most productive frame of mind... Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 2ba8461b0631..0f69f1dc74fc 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -459,11 +459,18 @@ And a couple of implicit varieties: RELEASE on that same variable are guaranteed to be visible. In other words, within a given variable's critical section, all accesses of all previous critical sections for that variable are guaranteed to have - completed. + completed. If the RELEASE and ACQUIRE operations act on independent + variables, an smp_mb__release_acquire() barrier can be placed between + them to upgrade the sequence to a full barrier. This means that ACQUIRE acts as a minimal "acquire" operation and RELEASE acts as a minimal "release" operation. +A subset of the atomic operations described in atomic_ops.txt have ACQUIRE +and RELEASE variants in addition to fully-ordered and relaxed (no barrier +semantics) definitions. For compound atomics performing both a load and +a store, ACQUIRE semantics apply only to the load and RELEASE semantics +only to the store portion of the operation. Memory barriers are only required where there's a possibility of interaction between two CPUs or between a CPU and a device. If it can be guaranteed that @@ -1895,6 +1902,23 @@ the RELEASE would simply complete, thereby avoiding the deadlock. a sleep-unlock race, but the locking primitive needs to resolve such races properly in any case. +Where the RELEASE and ACQUIRE operations are performed by the same CPU, +ordering can be enforced by use of an smp_mb__release_acquire() barrier: + + *A = a; + RELEASE M + smp_mb__release_acquire(); + ACQUIRE N + *B = b; + +in which case, the only permitted sequences are: + + STORE *A, RELEASE M, ACQUIRE N, STORE *B + STORE *A, ACQUIRE N, RELEASE M, STORE *B + +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE +operations performed by other CPUs. 
+ Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so cannot be counted on in such a situation to actually achieve anything at all - especially with respect to I/O accesses - unless combined diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h index df896a1c41d3..9dceee6c2f20 100644 --- a/arch/ia64/include/asm/barrier.h +++ b/arch/ia64/include/asm/barrier.h @@ -77,6 +77,7 @@ do { \ ___p1; \ }) +#define smp_mb__release_acquire() smp_mb() #define smp_store_mb(var, value) do { WRITE_ONCE(var, value); mb(); } while (0) /* diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h index 0eca6efc0631..919624634d0a 100644 --- a/arch/powerpc/include/asm/barrier.h +++ b/arch/powerpc/include/asm/barrier.h @@ -87,6 +87,7 @@ do { \ ___p1; \ }) +#define smp_mb__release_acquire() smp_mb() #define smp_mb__before_atomic() smp_mb() #define smp_mb__after_atomic() smp_mb() #define smp_mb__before_spinlock() smp_mb() diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h index d48fe0162331..0c150b5fdd1c 100644 --- a/arch/s390/include/asm/barrier.h +++ b/arch/s390/include/asm/barrier.h @@ -53,4 +53,6 @@ do { \ ___p1; \ }) +#define smp_mb__release_acquire() smp_mb() + #endif /* __ASM_BARRIER_H */ diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h index 14a928601657..4ae875cd9e78 100644 --- a/arch/sparc/include/asm/barrier_64.h +++ b/arch/sparc/include/asm/barrier_64.h @@ -71,7 +71,8 @@ do { \ ___p1; \ }) -#define smp_mb__before_atomic() barrier() -#define smp_mb__after_atomic() barrier() +#define smp_mb__release_acquire() smp_mb() +#define smp_mb__before_atomic() barrier() +#define smp_mb__after_atomic() barrier() #endif /* !(__SPARC64_BARRIER_H) */ diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index 0681d2532527..1c61ad251e0e 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -85,6 +85,8 @@ do { \ ___p1; \ }) +#define smp_mb__release_acquire() smp_mb() + #endif /* Atomic operations are already serializing on x86 */ diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h index b42afada1280..61ae95199397 100644 --- a/include/asm-generic/barrier.h +++ b/include/asm-generic/barrier.h @@ -119,5 +119,9 @@ do { \ ___p1; \ }) +#ifndef smp_mb__release_acquire +#define smp_mb__release_acquire() do { } while (0) +#endif + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_GENERIC_BARRIER_H */
As much as we'd like to live in a world where RELEASE -> ACQUIRE is always cheaply ordered and can be used to construct UNLOCK -> LOCK definitions with similar guarantees, the grim reality is that this isn't even possible on x86 (thanks to Paul for bringing us crashing down to Earth). This patch handles the issue by introducing a new barrier macro, smp_mb__release_acquire, that can be placed between a RELEASE and a subsequent ACQUIRE operation in order to upgrade them to a full memory barrier. At the moment, it doesn't have any users, so its existence serves mainly as a documentation aid. Documentation/memory-barriers.txt is updated to describe more clearly the ACQUIRE and RELEASE ordering in this area and to show an example of the new barrier in action. Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Will Deacon <will.deacon@arm.com> --- v1 -> v2: - Clarified that this barrier affects only access performed by the executing CPU - Definitions for all TSO architectures Documentation/memory-barriers.txt | 26 +++++++++++++++++++++++++- arch/ia64/include/asm/barrier.h | 1 + arch/powerpc/include/asm/barrier.h | 1 + arch/s390/include/asm/barrier.h | 2 ++ arch/sparc/include/asm/barrier_64.h | 5 +++-- arch/x86/include/asm/barrier.h | 2 ++ include/asm-generic/barrier.h | 4 ++++ 7 files changed, 38 insertions(+), 3 deletions(-)
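A usage sketch, mirroring the memory-barriers.txt example added above (a purely hypothetical caller, since as the changelog notes the barrier has no users yet; m and n stand in for the independent variables M and N):

	WRITE_ONCE(*A, a);
	smp_store_release(&m, 1);     /* RELEASE M */
	smp_mb__release_acquire();    /* upgrade the RELEASE -> ACQUIRE pair */
	r = smp_load_acquire(&n);     /* ACQUIRE N */
	WRITE_ONCE(*B, b);

On the TSO architectures, ia64 and powerpc this expands to smp_mb(); architectures whose RELEASE -> ACQUIRE sequences are already sufficiently ordered fall back to the empty asm-generic definition.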