Message ID | 20150916110706.GF28771@arm.com |
---|---|
State | New |
On Wed, Sep 16, 2015 at 12:07:06PM +0100, Will Deacon wrote:
> On Wed, Sep 16, 2015 at 11:43:14AM +0100, Peter Zijlstra wrote:
> > On Wed, Sep 16, 2015 at 11:29:08AM +0100, Will Deacon wrote:
> >
> > > > Indeed, that is a hole in the definition, that I think we should close.
> >
> > > I'm struggling to understand the hole, but here's my intuition. If an
> > > ACQUIRE on CPUx reads from a RELEASE by CPUy, then I'd expect CPUx to
> > > observe all memory accesses performed by CPUy prior to the RELEASE
> > > before it observes the RELEASE itself, regardless of this new barrier.
> > > I think this matches what we currently have in memory-barriers.txt (i.e.
> > > acquire/release are neither transitive nor multi-copy atomic).
> >
> > Ah agreed. I seem to have gotten my brain in a tangle.
> >
> > Basically where a program order release+acquire relies on an address
> > dependency, a cross cpu release+acquire relies on causality. If we
> > observe the release, we must also observe everything prior to it etc.
>
> Yes, and crucially, the "everything prior to it" only encompasses accesses
> made by the releasing CPU itself (in the absence of other barriers and
> synchronisation).
>

Just want to make sure I understand you correctly, do you mean that in
the following case:

    CPU 1                CPU 2                           CPU 3
    ==============       ============================    ===============
    { A = 0, B = 0 }
    WRITE_ONCE(A,1);     r1 = READ_ONCE(A);              r2 = smp_load_acquire(&B);
                         smp_store_release(&B, 1);       r3 = READ_ONCE(A);

r1 == 1 && r2 == 1 && r3 == 0 is not prohibited?

However, according to the discussion of Paul and Peter:

    https://lkml.org/lkml/2015/9/15/707

I think that's prohibited for sure on all architectures except s390. And
for s390, we are waiting for the maintainers to verify this. If s390
also prohibits this, then a release-acquire pair (on different CPUs) to
the same variable does guarantee transitivity.

Did I misunderstand you or miss something here?

> Given that we managed to get confused, it doesn't hurt to call this out
> explicitly in the doc, so I can add the following extra text.
>
> Will
>
> --->8
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 46a85abb77c6..794d102d06df 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
>      a sleep-unlock race, but the locking primitive needs to resolve
>      such races properly in any case.
>
> -If necessary, ordering can be enforced by use of an
> -smp_mb__release_acquire() barrier:
> +Where the RELEASE and ACQUIRE operations are performed by the same CPU,
> +ordering can be enforced by use of an smp_mb__release_acquire() barrier:
>
>      *A = a;
>      RELEASE M
> @@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
>      STORE *A, RELEASE M, ACQUIRE N, STORE *B
>      STORE *A, ACQUIRE N, RELEASE M, STORE *B
>
> +Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
> +operations performed by other CPUs, even if they are to the same variable.
> +In cases where transitivity is required, smp_mb() should be used explicitly.
> +

Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:

    If an ACQUIRE loads the value stored by a RELEASE, then on the CPU
    executing the ACQUIRE operation, all the memory operations after the
    ACQUIRE operation will perceive all the memory operations before the
    RELEASE operation on the CPU executing the RELEASE operation.

This could cover both the "on the same CPU" and "on different CPUs"
cases.

Of course, this may have nothing to do with smp_mb__release_acquire(),
but I think we can take this chance to document the memory order effect
of RELEASE+ACQUIRE well.

Regards,
Boqun

> Locks and semaphores may not provide any guarantee of ordering on UP compiled
> systems, and so cannot be counted on in such a situation to actually achieve
> anything at all - especially with respect to I/O accesses - unless combined
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
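For readers who want to poke at the three-CPU test above from userspace, here
is a minimal sketch in C11. It is only an analogue and not the kernel's own
semantics: it assumes WRITE_ONCE()/READ_ONCE() can be approximated by
memory_order_relaxed accesses and smp_store_release()/smp_load_acquire() by
memory_order_release/memory_order_acquire, and the helper name
count_forbidden_outcomes() is made up for illustration. Under the C11 model
the outcome r1 == 1 && r2 == 1 && r3 == 0 is ruled out (CPU 2's read of A
happens-before CPU 3's read of A, so read-read coherence applies), so the
counter is expected to stay at zero. A stress run like this can only fail to
find a counterexample; it does not prove anything about a given architecture.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared variables, named after the litmus test in the thread. */
static atomic_int A, B;
/* Results of one run; read by the caller only after pthread_join(). */
static int r1, r2, r3;

static void *cpu1(void *arg)
{
	(void)arg;
	/* Assumed stand-in for WRITE_ONCE(A, 1). */
	atomic_store_explicit(&A, 1, memory_order_relaxed);
	return NULL;
}

static void *cpu2(void *arg)
{
	(void)arg;
	/* Assumed stand-in for r1 = READ_ONCE(A). */
	r1 = atomic_load_explicit(&A, memory_order_relaxed);
	/* Assumed stand-in for smp_store_release(&B, 1). */
	atomic_store_explicit(&B, 1, memory_order_release);
	return NULL;
}

static void *cpu3(void *arg)
{
	(void)arg;
	/* Assumed stand-in for r2 = smp_load_acquire(&B). */
	r2 = atomic_load_explicit(&B, memory_order_acquire);
	/* Assumed stand-in for r3 = READ_ONCE(A). */
	r3 = atomic_load_explicit(&A, memory_order_relaxed);
	return NULL;
}

/*
 * Hypothetical helper: run the test `iterations` times and return how
 * often r1 == 1 && r2 == 1 && r3 == 0 was observed.
 */
int count_forbidden_outcomes(int iterations)
{
	void *(*body[3])(void *) = { cpu1, cpu2, cpu3 };
	int hits = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t[3];

		/* Reset the initial state { A = 0, B = 0 }. */
		atomic_store(&A, 0);
		atomic_store(&B, 0);

		for (int j = 0; j < 3; j++) {
			int err = pthread_create(&t[j], NULL, body[j], NULL);
			assert(err == 0);
			(void)err;
		}
		for (int j = 0; j < 3; j++)
			pthread_join(t[j], NULL);

		if (r1 == 1 && r2 == 1 && r3 == 0)
			hits++;
	}
	return hits;
}
```

The intended use is to call count_forbidden_outcomes() with a large iteration
count from a small main() and build with something like `cc -O2 -pthread`.
Creating fresh threads per iteration keeps the sketch short but gives the
threads little overlap, so a dedicated litmus tool will exercise the
hardware far harder than this does.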
On Thu, Sep 17, 2015 at 10:50:12AM +0800, Boqun Feng wrote:
> Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:
>
>     If an ACQUIRE loads the value stored by a RELEASE, then on the CPU
>     executing the ACQUIRE operation, all the memory operations after the
>     ACQUIRE operation will perceive all the memory operations before the
>     RELEASE operation on the CPU executing the RELEASE operation.
>

Ah.. I think I lost my mind while writing this. Should be:

    If an ACQUIRE loads the value stored by a RELEASE, then after the
    ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
    all the memory operations that have been perceived by the CPU executing
    the RELEASE operation before the RELEASE operation.

Which means a release+acquire pair to the same variable guarantees
transitivity.

Sorry for the misleading paragraph..

Regards,
Boqun

> This could cover both the "on the same CPU" and "on different CPUs"
> cases.
>
> Of course, this may have nothing to do with smp_mb__release_acquire(),
> but I think we can take this chance to document the memory order effect
> of RELEASE+ACQUIRE well.
On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> Just want to make sure I understand you correctly, do you mean that in
> the following case:
>
>     CPU 1                CPU 2                           CPU 3
>     ==============       ============================    ===============
>     { A = 0, B = 0 }
>     WRITE_ONCE(A,1);     r1 = READ_ONCE(A);              r2 = smp_load_acquire(&B);
>                          smp_store_release(&B, 1);       r3 = READ_ONCE(A);
>
> r1 == 1 && r2 == 1 && r3 == 0 is not prohibited?
>
> However, according to the discussion of Paul and Peter:
>
>     https://lkml.org/lkml/2015/9/15/707
>
> I think that's prohibited for sure on all architectures except s390. And
> for s390, we are waiting for the maintainers to verify this. If s390
> also prohibits this, then a release-acquire pair (on different CPUs) to
> the same variable does guarantee transitivity.
>
> Did I misunderstand you or miss something here?

That certainly works on arm and arm64, so if it works everywhere else too,
then we can strengthen this (but see below).

> Then, IIRC, the memory order effect of RELEASE+ACQUIRE should be:

[updated from your reply]

>     If an ACQUIRE loads the value stored by a RELEASE, then after the
>     ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
>     all the memory operations that have been perceived by the CPU executing
>     the RELEASE operation before the RELEASE operation.
>
> Which means a release+acquire pair to the same variable guarantees
> transitivity.

Almost, but on arm64 at least, "all the memory operations" above doesn't
include reads by other CPUs. I'm struggling to figure out whether that's
actually an issue.

Will
On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> >     If an ACQUIRE loads the value stored by a RELEASE, then after the
> >     ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> >     all the memory operations that have been perceived by the CPU executing
> >     the RELEASE operation before the RELEASE operation.
> >
> > Which means a release+acquire pair to the same variable guarantees
> > transitivity.
>
> Almost, but on arm64 at least, "all the memory operations" above doesn't
> include reads by other CPUs. I'm struggling to figure out whether that's
> actually an issue.
>

Ah.. that's indeed an issue! For example:

    CPU 0                    CPU 1                         CPU 2
    =====================    ==========================    ================
    {a = 0, b = 0, c = 0}
    r1 = READ_ONCE(a);       WRITE_ONCE(b, 1);             r3 = smp_load_acquire(&c);
    smp_rmb();               smp_store_release(&c, 1);     WRITE_ONCE(a, 1);
    r2 = READ_ONCE(b)

where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibited, at
least on POWER.

However, I think that doesn't mean a release+acquire pair to the same
variable doesn't guarantee transitivity, because the transitivity is
actually broken at the smp_rmb(). But yes, my document is incorrect.
How about:

    If an ACQUIRE loads the value stored by a RELEASE, then after the
    ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
    all the memory operations that have been perceived by the CPU executing
    the RELEASE operation *transitively* before the RELEASE operation.
    ("transitively before" means that a memory operation is either executed
    on the same CPU before the other, or guaranteed to be executed before
    the other by a transitive barrier).

Which means a release+acquire pair to the same variable guarantees
transitivity.

Maybe we can avoid using the term "transitively before" here, but it's not
bad to distinguish different kinds of "before"s.

Regards,
Boqun
On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > include reads by other CPUs. I'm struggling to figure out whether that's
> > actually an issue.
> >
>
> Ah.. that's indeed an issue! For example:
>
>     CPU 0                    CPU 1                         CPU 2
>     =====================    ==========================    ================
>     {a = 0, b = 0, c = 0}
>     r1 = READ_ONCE(a);       WRITE_ONCE(b, 1);             r3 = smp_load_acquire(&c);
>     smp_rmb();               smp_store_release(&c, 1);     WRITE_ONCE(a, 1);
>     r2 = READ_ONCE(b)
>
> where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibited, at
> least on POWER.
>

Oops.. I used the wrong litmus test here.. so this is prohibited on POWER.
Sorry for misleading you. How about the behavior of that on arm and arm64?
If it is prohibited there too, please ignore the text below.

Apologies again for that..

Regards,
Boqun

> However, I think that doesn't mean a release+acquire pair to the same
> variable doesn't guarantee transitivity, because the transitivity is
> actually broken at the smp_rmb(). But yes, my document is incorrect.
> How about:
>
>     If an ACQUIRE loads the value stored by a RELEASE, then after the
>     ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
>     all the memory operations that have been perceived by the CPU executing
>     the RELEASE operation *transitively* before the RELEASE operation.
>     ("transitively before" means that a memory operation is either executed
>     on the same CPU before the other, or guaranteed to be executed before
>     the other by a transitive barrier).
>
> Which means a release+acquire pair to the same variable guarantees
> transitivity.
>
> Maybe we can avoid using the term "transitively before" here, but it's not
> bad to distinguish different kinds of "before"s.
>
> Regards,
> Boqun
On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > On Thu, Sep 17, 2015 at 07:00:01PM +0100, Will Deacon wrote:
> > > On Thu, Sep 17, 2015 at 03:50:12AM +0100, Boqun Feng wrote:
> > > > If an ACQUIRE loads the value stored by a RELEASE, then after the
> > > > ACQUIRE operation, the CPU executing the ACQUIRE operation will perceive
> > > > all the memory operations that have been perceived by the CPU executing
> > > > the RELEASE operation before the RELEASE operation.
> > > >
> > > > Which means a release+acquire pair to the same variable guarantees
> > > > transitivity.
> > >
> > > Almost, but on arm64 at least, "all the memory operations" above doesn't
> > > include reads by other CPUs. I'm struggling to figure out whether that's
> > > actually an issue.
> > >
> >
> > Ah.. that's indeed an issue! For example:
> >
> >     CPU 0                    CPU 1                         CPU 2
> >     =====================    ==========================    ================
> >     {a = 0, b = 0, c = 0}
> >     r1 = READ_ONCE(a);       WRITE_ONCE(b, 1);             r3 = smp_load_acquire(&c);
> >     smp_rmb();               smp_store_release(&c, 1);     WRITE_ONCE(a, 1);
> >     r2 = READ_ONCE(b)
> >
> > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibited, at
> > least on POWER.
> >
>
> Oops.. I used the wrong litmus test here.. so this is prohibited on POWER.
> Sorry for misleading you. How about the behavior of that on arm and arm64?

That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
but if you rewrite it as (LDAR is acquire, STLR is release):

  {
  0:X1=x; 0:X3=y;
  1:X1=y; 1:X2=z;
  2:X1=z; 2:X3=x;
  }
   P0           | P1           | P2           ;
   LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1] ;
   LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1    ;
                | STLR W0,[X2] | STR W2,[X3]  ;

  Observed
   0:X0=1; 0:X2=0; 2:X0=1;

then it is permitted on arm64. Note that herd currently claims that this
is forbidden, but I'm talking to the authors about getting that fixed :)

Will
On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> but if you rewrite it as (LDAR is acquire, STLR is release):
>
>   {
>   0:X1=x; 0:X3=y;
>   1:X1=y; 1:X2=z;
>   2:X1=z; 2:X3=x;
>   }
>    P0           | P1           | P2           ;
>    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1] ;
>    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1    ;
>                 | STLR W0,[X2] | STR W2,[X3]  ;
>
>   Observed
>    0:X0=1; 0:X2=0; 2:X0=1;
>

X0 is W0, etc. Right?

> then it is permitted on arm64. Note that herd currently claims that this
> is forbidden, but I'm talking to the authors about getting that fixed :)
>

Good to know ;-)

I think this actually means two things:

1.	ACQUIRE doesn't provide transitivity itself and
2.	We still need a term like "transitively before".

Regards,
Boqun
On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> but if you rewrite it as (LDAR is acquire, STLR is release):
>
>   {
>   0:X1=x; 0:X3=y;
>   1:X1=y; 1:X2=z;
>   2:X1=z; 2:X3=x;
>   }
>    P0           | P1           | P2           ;
>    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1] ;
>    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1    ;
>                 | STLR W0,[X2] | STR W2,[X3]  ;
>
>   Observed
>    0:X0=1; 0:X2=0; 2:X0=1;
>
> then it is permitted on arm64. Note that herd currently claims that this
> is forbidden, but I'm talking to the authors about getting that fixed :)

But a pure store-release/load-acquire chain would be forbidden in
hardware as well as by herd, correct?

							Thanx, Paul
Hi Paul,

On Tue, Sep 22, 2015 at 04:22:41PM +0100, Paul E. McKenney wrote:
> On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> > That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> > but if you rewrite it as (LDAR is acquire, STLR is release):
> >
> >   {
> >   0:X1=x; 0:X3=y;
> >   1:X1=y; 1:X2=z;
> >   2:X1=z; 2:X3=x;
> >   }
> >    P0           | P1           | P2           ;
> >    LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1] ;
> >    LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1    ;
> >                 | STLR W0,[X2] | STR W2,[X3]  ;
> >
> >   Observed
> >    0:X0=1; 0:X2=0; 2:X0=1;
> >
> > then it is permitted on arm64. Note that herd currently claims that this
> > is forbidden, but I'm talking to the authors about getting that fixed :)
>
> But a pure store-release/load-acquire chain would be forbidden in
> hardware as well as by herd, correct?

Yup, and since that's likely the common use-case, I think that's precisely
the scenario where it makes sense for us to require transitivity in the
kernel.

Will
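The difference between the arm64 test quoted above and the "pure
store-release/load-acquire chain" can be sketched in userspace C11 as below.
This is an illustration under stated assumptions, not the kernel primitives
or the arm64 instructions themselves: LDAR/STLR are approximated by
memory_order_acquire/memory_order_release, plain LDR/STR by
memory_order_relaxed, and the "pure chain" is read here as the variant in
which P2's final store to x is also a release. The names run_chain_test() and
p2_store_is_release are hypothetical. In the C11 model the pure-chain variant
forbids the outcome (the two synchronizes-with edges compose into a
happens-before edge from P1's store to y to P0's load of y), while the
variant with a relaxed final store leaves it allowed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int x, y, z;
/* Results of one run; read by the caller only after pthread_join(). */
static int p0_r0, p0_r2, p2_r0;
/* Selects the ordering of P2's final store; set before threads start. */
static bool p2_store_is_release;

static void *p0(void *arg)
{
	(void)arg;
	p0_r0 = atomic_load_explicit(&x, memory_order_acquire); /* ~ LDAR */
	p0_r2 = atomic_load_explicit(&y, memory_order_relaxed); /* ~ LDR  */
	return NULL;
}

static void *p1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);     /* ~ STR  */
	atomic_store_explicit(&z, 1, memory_order_release);     /* ~ STLR */
	return NULL;
}

static void *p2(void *arg)
{
	(void)arg;
	p2_r0 = atomic_load_explicit(&z, memory_order_acquire); /* ~ LDAR */
	if (p2_store_is_release)
		/* Pure chain: every link is release -> acquire. */
		atomic_store_explicit(&x, 1, memory_order_release);
	else
		/* As in the quoted test: a plain store ends the chain. */
		atomic_store_explicit(&x, 1, memory_order_relaxed);
	return NULL;
}

/*
 * Hypothetical helper: run the test `iterations` times and return how
 * often P0 saw x == 1 and y == 0 while P2 saw z == 1.
 */
int run_chain_test(int iterations, bool pure_chain)
{
	void *(*body[3])(void *) = { p0, p1, p2 };
	int hits = 0;

	p2_store_is_release = pure_chain;
	for (int i = 0; i < iterations; i++) {
		pthread_t t[3];

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&z, 0);

		for (int j = 0; j < 3; j++) {
			int err = pthread_create(&t[j], NULL, body[j], NULL);
			assert(err == 0);
			(void)err;
		}
		for (int j = 0; j < 3; j++)
			pthread_join(t[j], NULL);

		if (p0_r0 == 1 && p0_r2 == 0 && p2_r0 == 1)
			hits++;
	}
	return hits;
}
```

With pure_chain set to true the count is expected to be zero on any
conforming C11 implementation. With pure_chain set to false a nonzero count
is allowed by the model but by no means guaranteed to show up in practice,
especially with threads created afresh on every iteration.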
On Tue, Sep 22, 2015 at 04:58:28PM +0100, Will Deacon wrote:
> Hi Paul,
>
> On Tue, Sep 22, 2015 at 04:22:41PM +0100, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 11:23:01PM +0100, Will Deacon wrote:
> > > On Mon, Sep 21, 2015 at 03:10:38PM +0100, Boqun Feng wrote:
> > > > On Mon, Sep 21, 2015 at 09:45:15PM +0800, Boqun Feng wrote:
> > > > >
> > > > > Ah.. that's indeed an issue! for example:
> > > > >
> > > > >   CPU 0                   CPU 1                       CPU 2
> > > > >   =====================   =========================   ==========================
> > > > >   {a = 0, b = 0, c = 0}
> > > > >   r1 = READ_ONCE(a);      WRITE_ONCE(b, 1);           r3 = smp_load_acquire(&c);
> > > > >   smp_rmb();              smp_store_release(&c, 1);   WRITE_ONCE(a, 1);
> > > > >   r2 = READ_ONCE(b)
> > > > >
> > > > > where r1 == 1 && r2 == 0 && r3 == 1 is actually not prohibited, at
> > > > > least on POWER.
> > > > >
> > > >
> > > > Oops.. I used the wrong litmus test here.. so this is prohibited on POWER.
> > > > Sorry for the confusion. How about the behavior of that on arm and arm64?
> > >
> > > That explicit test is forbidden on arm/arm64 because of the smp_rmb(),
> > > but if you rewrite it as (LDAR is acquire, STLR is release):
> > >
> > >
> > > {
> > > 0:X1=x; 0:X3=y;
> > > 1:X1=y; 1:X2=z;
> > > 2:X1=z; 2:X3=x;
> > > }
> > > P0           | P1           | P2           ;
> > > LDAR W0,[X1] | MOV W0,#1    | LDAR W0,[X1] ;
> > > LDR W2,[X3]  | STR W0,[X1]  | MOV W2,#1    ;
> > >              | STLR W0,[X2] | STR W2,[X3]  ;
> > >
> > > Observed
> > > 0:X0=1; 0:X2=0; 2:X0=1;
> > >
> > >
> > > then it is permitted on arm64. Note that herd currently claims that this
> > > is forbidden, but I'm talking to the authors about getting that fixed :)
> >
> > But a pure store-release/load-acquire chain would be forbidden in
> > hardware as well as by herd, correct?
>
> Yup, and since that's likely the common use-case, I think that's precisely
> the scenario where it makes sense for us to require transitivity in the
> kernel.

Agreed.  And again I believe that we need to err on the side of
restricting what the developer can expect.

							Thanx, Paul
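
For readers who want to experiment with the "pure store-release/load-acquire
chain" mentioned above, here is a minimal userspace sketch. It is not kernel
code and not taken from the thread: C11 atomics with memory_order_release and
memory_order_acquire are assumed as stand-ins for smp_store_release() and
smp_load_acquire(), and the function names (p0, p1, p2,
chain_outcome_is_forbidden, count_forbidden_outcomes) are hypothetical. It
differs from Will's arm64 litmus test in one place: P2 publishes x with a
store-release rather than a plain store, which is what makes the chain pure.
Under the C11 model the outcome "P0 saw x == 1, P0 saw y == 0, P2 saw
z == 1" should then never be counted.

```c
/*
 * Sketch only: C11 atomics stand in for the kernel's smp_store_release()
 * and smp_load_acquire(). All names below are illustrative.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y, z;
static int r_p0_x, r_p0_y, r_p2_z;	/* read back only after pthread_join() */

/* P1: relaxed store to y, then store-release to z. */
static void *p1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_store_explicit(&z, 1, memory_order_release);
	return NULL;
}

/* P2: load-acquire from z, then store-release (not a plain store) to x. */
static void *p2(void *arg)
{
	(void)arg;
	r_p2_z = atomic_load_explicit(&z, memory_order_acquire);
	atomic_store_explicit(&x, 1, memory_order_release);
	return NULL;
}

/* P0: load-acquire from x, then relaxed load from y. */
static void *p0(void *arg)
{
	(void)arg;
	r_p0_x = atomic_load_explicit(&x, memory_order_acquire);
	r_p0_y = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

/* Non-zero for the outcome that a transitive release/acquire chain forbids. */
int chain_outcome_is_forbidden(int p0_x, int p0_y, int p2_z)
{
	return p0_x == 1 && p0_y == 0 && p2_z == 1;
}

/* Run the three-thread test repeatedly and count forbidden outcomes. */
int count_forbidden_outcomes(int iterations)
{
	void *(*const body[3])(void *) = { p0, p1, p2 };
	int hits = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t[3];

		/* Reset shared state; thread creation orders these stores. */
		atomic_store(&x, 0);
		atomic_store(&y, 0);
		atomic_store(&z, 0);

		for (int k = 0; k < 3; k++) {
			int rc = pthread_create(&t[k], NULL, body[k], NULL);
			assert(rc == 0);	/* keep the sketch short */
			(void)rc;
		}
		for (int k = 0; k < 3; k++)
			pthread_join(t[k], NULL);

		if (chain_outcome_is_forbidden(r_p0_x, r_p0_y, r_p2_z))
			hits++;
	}
	return hits;
}
```

Calling count_forbidden_outcomes() from a small main() and building with
-pthread is enough to try it. A run that reports zero hits does not prove
anything about a particular CPU; the guarantee here comes from the C11 memory
model, whereas the thread is about what the kernel should promise on top of
hardware such as arm64 and POWER.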
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 46a85abb77c6..794d102d06df 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1902,8 +1902,8 @@ the RELEASE would simply complete, thereby avoiding the deadlock.
 	a sleep-unlock race, but the locking primitive needs to resolve
 	such races properly in any case.
 
-If necessary, ordering can be enforced by use of an
-smp_mb__release_acquire() barrier:
+Where the RELEASE and ACQUIRE operations are performed by the same CPU,
+ordering can be enforced by use of an smp_mb__release_acquire() barrier:
 
 	*A = a;
 	RELEASE M
@@ -1916,6 +1916,10 @@ in which case, the only permitted sequences are:
 	STORE *A, RELEASE M, ACQUIRE N, STORE *B
 	STORE *A, ACQUIRE N, RELEASE M, STORE *B
 
+Note that smp_mb__release_acquire() has no effect on ACQUIRE or RELEASE
+operations performed by other CPUs, even if they are to the same variable.
+In cases where transitivity is required, smp_mb() should be used explicitly.
+
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
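
The same-CPU case that the patch text describes (a RELEASE followed by an
ACQUIRE of a different variable) can be illustrated in userspace with a
store-buffering test. This is a sketch under stated assumptions, not the
kernel implementation: C11 has no direct counterpart to the proposed
smp_mb__release_acquire(), so atomic_thread_fence(memory_order_seq_cst) is
used as a stand-in. It is closer to smp_mb() and therefore likely stronger
than strictly needed. The names cpu0, cpu1 and count_both_zero are
hypothetical. Without the fence, each thread's acquire load may be satisfied
before its own release store becomes visible, so both loads can return 0; with
the fence in both threads, the C11 model rules that outcome out.

```c
/*
 * Sketch only: a seq_cst fence stands in for a full barrier between a
 * same-thread RELEASE and a later ACQUIRE. All names are illustrative.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int m, n;
static int r0, r1;	/* read back only after pthread_join() */

static void *cpu0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&m, 1, memory_order_release);	/* RELEASE M */
	atomic_thread_fence(memory_order_seq_cst);		/* full barrier */
	r0 = atomic_load_explicit(&n, memory_order_acquire);	/* ACQUIRE N */
	return NULL;
}

static void *cpu1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&n, 1, memory_order_release);	/* RELEASE N */
	atomic_thread_fence(memory_order_seq_cst);		/* full barrier */
	r1 = atomic_load_explicit(&m, memory_order_acquire);	/* ACQUIRE M */
	return NULL;
}

/* Count runs in which both acquire loads missed the other thread's store. */
int count_both_zero(int iterations)
{
	int hits = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t a, b;
		int rc;

		/* Reset shared state; thread creation orders these stores. */
		atomic_store(&m, 0);
		atomic_store(&n, 0);

		rc = pthread_create(&a, NULL, cpu0, NULL);
		assert(rc == 0);	/* keep the sketch short */
		rc = pthread_create(&b, NULL, cpu1, NULL);
		assert(rc == 0);
		(void)rc;

		pthread_join(a, NULL);
		pthread_join(b, NULL);

		if (r0 == 0 && r1 == 0)
			hits++;
	}
	return hits;
}
```

Removing the two fences turns this into the classic store-buffering shape, in
which r0 == 0 && r1 == 0 is allowed by C11 and observable on common hardware.
Note that the fence only orders accesses made by the thread that executes it,
which mirrors the patch's remark that the barrier has no effect on ACQUIRE or
RELEASE operations performed by other CPUs.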