
[ODP/PATCH] Remove race condition and simplify barrier implementation

Message ID 1395186958-5594-1-git-send-email-bill.fischofer@linaro.org
State Superseded, archived

Commit Message

Bill Fischofer March 18, 2014, 11:55 p.m. UTC
Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
---
 include/odp_barrier.h                       |  3 +-
 platform/linux-generic/source/odp_barrier.c | 46 +++++++++++++----------------
 2 files changed, 21 insertions(+), 28 deletions(-)

Comments

Petri Savolainen March 19, 2014, 7:47 a.m. UTC | #1
Hi,

It seems to work, except that those memory syncs are still needed. See 
comments below.


On Wednesday, 19 March 2014 01:55:58 UTC+2, Bill Fischofer wrote:

> Signed-off-by: Bill Fischofer <bill.fi...@linaro.org> 
> --- 
>  include/odp_barrier.h                       |  3 +- 
>  platform/linux-generic/source/odp_barrier.c | 46 
> +++++++++++++---------------- 
>  2 files changed, 21 insertions(+), 28 deletions(-) 
>
> diff --git a/include/odp_barrier.h b/include/odp_barrier.h 
> index bb4a6c5..0a1404b 100644 
> --- a/include/odp_barrier.h 
> +++ b/include/odp_barrier.h 
> @@ -28,8 +28,7 @@ extern "C" { 
>   */ 
>  typedef struct odp_barrier_t { 
>          int              count; 
> -        odp_atomic_int_t in; 
> -        odp_atomic_int_t out; 
> +        odp_atomic_int_t bar; 
>  } odp_barrier_t; 
>   
>   
> diff --git a/platform/linux-generic/source/odp_barrier.c 
> b/platform/linux-generic/source/odp_barrier.c 
> index 64fbdb9..9dc6fb5 100644 
> --- a/platform/linux-generic/source/odp_barrier.c 
> +++ b/platform/linux-generic/source/odp_barrier.c 
> @@ -11,40 +11,34 @@ 
>  void odp_barrier_init_count(odp_barrier_t *barrier, int count) 
>  { 
>          barrier->count = count; 
> -        barrier->in    = 0; 
> -        barrier->out   = count - 1; 
> -        odp_sync_stores(); 
>
Write sync needed to avoid race between init and first barrier_sync call.
 

> +        barrier->bar = 0; 
>  } 
>   
> +/* 
> + * Efficient barrier_sync - 
> + * 
> + *   Barriers are initialized with a count of the number of callers 
> + *   that must sync on the barrier before any may proceed. 
> + * 
> + *   To avoid race conditions and to permit the barrier to be fully 
> + *   reusable, the barrier value cycles between 0..2*count-1. When 
> + *   synchronizing the wasless variable simply tracks which half of 
> + *   the cycle the barrier was in upon entry.  Exit is when the 
> + *   barrier crosses to the other half of the cycle. 
> + */ 
>   
>  void odp_barrier_sync(odp_barrier_t *barrier) 
>  { 
>          int count; 
> +        int wasless; 
>   
> -        odp_sync_stores(); 
>
Write sync needed to ensure data ordering over the barrier (what happened 
before barrier vs. what happens after it).
 

> - 
> -        count = odp_atomic_fetch_inc_int(&barrier->in); 
> - 
> -        if (count == barrier->count - 1) { 
> -                /* If last thread, release others */ 
> -                barrier->in = 0; 
> -                odp_sync_stores(); 
> - 
> -                /* Wait for others to exit */ 
> -                while (barrier->out) 
> -                        odp_spin(); 
> - 
> -                /* Ready, reset out counter */ 
> -                barrier->out = barrier->count - 1; 
> -                odp_sync_stores(); 
> +        wasless = barrier->bar < barrier->count; 
> +        count = odp_atomic_fetch_inc_int(&barrier->bar); 
>   
> +        if (count == 2*barrier->count-1) { 
> +                barrier->bar = 0; 
>          } else { 
> -                /* Wait for the last thread*/ 
> -                while (barrier->in) 
> -                        odp_spin(); 
> - 
> -                /* Ready */ 
> -                odp_atomic_dec_int(&barrier->out); 
> -                odp_mem_barrier();


> +          while ((barrier->bar < barrier->count) == wasless) 
> +            odp_spin(); 
>          } 
>

odp_mem_barrier();
A memory barrier is needed at the end to stop the optimizer from moving
later code inside the barrier.
 
-Petri
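
If odp_mem_barrier() is, as Petri's comment suggests, at least a compiler
barrier, a minimal definition would look something like this -- a sketch,
not the verified linux-generic source:

    /* Hypothetical definition: pure compiler barrier, no HW fence.
     * The empty asm with a "memory" clobber stops the compiler from
     * reordering memory accesses across this point. */
    #define odp_mem_barrier()  __asm__ __volatile__ ("" : : : "memory")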
Bill Fischofer March 19, 2014, 11:20 a.m. UTC | #2
The odp_barrier_t definition declares the barrier fields as
odp_atomic_int_t, which is volatile.  Volatile already requires that the
compiler preserve these orderings, so the additional syncs are unnecessary.

The various memory sync instructions are designed to be used when
interacting with areas of memory referenced by HW, not other CPUs.  A
typical example where such instructions are needed is a HW interface
exposed as two MMRs, where a write to one location requests an operation
whose result is read from another location.  A memory sync is needed
between the write and the read to ensure that the write is propagated from
the CPU before the read is presented on the bus.
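
A sketch of that pattern (the register addresses and request value are
made up for illustration; odp_sync_stores() is the ODP helper discussed in
this thread):

    #include <stdint.h>

    /* Hypothetical MMR pair: a write to CMD_REG requests an operation,
     * RESULT_REG holds its result */
    #define CMD_REG    ((volatile uint32_t *)0xFFFE0000)
    #define RESULT_REG ((volatile uint32_t *)0xFFFE0004)
    #define OP_REQUEST 0x1u

    static uint32_t mmr_request(void)
    {
            *CMD_REG = OP_REQUEST;  /* post the request to the device     */
            odp_sync_stores();      /* push the write out before reading  */
            return *RESULT_REG;     /* now safe to read back the result   */
    }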

Inter-CPU race conditions are precisely what the fetch-and-op instructions
handle, and that is intrinsic to the definition of a barrier.  Volatile
protects against intra-CPU issues.

The only potential race between barrier_init() and barrier_sync() would be
if the application fires off threads that start doing barrier_sync() before
barrier_init() is called.  In this case a memory sync in barrier_init()
offers no protection since the ordering of operations performed by another
CPU cannot be determined by the action of one CPU.  That's an application
design issue: barriers must be initialized before they are used, e.g.,
using critical sections.  An application would normally initialize a
barrier before it begins parallel operations that reference it.  Again, a
correct implementation of volatile is sufficient to provide protection here.

Do we have specific conditions in mind that seem to require the extra
syncs, or is this a "just in case" sort of protection?

Bill


On Wed, Mar 19, 2014 at 2:47 AM, Petri Savolainen <
petri.savolainen@linaro.org> wrote:

> Hi,
>
> It seems to work, except that those memory syncs are still needed. See
> comments below.
>
>
> On Wednesday, 19 March 2014 01:55:58 UTC+2, Bill Fischofer wrote:
>
>> Signed-off-by: Bill Fischofer <bill.fi...@linaro.org>
>> ---
>>  include/odp_barrier.h                       |  3 +-
>>  platform/linux-generic/source/odp_barrier.c | 46
>> +++++++++++++----------------
>>  2 files changed, 21 insertions(+), 28 deletions(-)
>>
>> diff --git a/include/odp_barrier.h b/include/odp_barrier.h
>> index bb4a6c5..0a1404b 100644
>> --- a/include/odp_barrier.h
>> +++ b/include/odp_barrier.h
>> @@ -28,8 +28,7 @@ extern "C" {
>>   */
>>  typedef struct odp_barrier_t {
>>          int              count;
>> -        odp_atomic_int_t in;
>> -        odp_atomic_int_t out;
>> +        odp_atomic_int_t bar;
>>  } odp_barrier_t;
>>
>>
>> diff --git a/platform/linux-generic/source/odp_barrier.c
>> b/platform/linux-generic/source/odp_barrier.c
>> index 64fbdb9..9dc6fb5 100644
>> --- a/platform/linux-generic/source/odp_barrier.c
>> +++ b/platform/linux-generic/source/odp_barrier.c
>> @@ -11,40 +11,34 @@
>>  void odp_barrier_init_count(odp_barrier_t *barrier, int count)
>>  {
>>          barrier->count = count;
>> -        barrier->in    = 0;
>> -        barrier->out   = count - 1;
>> -        odp_sync_stores();
>>
> Write sync needed to avoid race between init and first barrier_sync call.
>
>
>> +        barrier->bar = 0;
>>  }
>>
>> +/*
>> + * Efficient barrier_sync -
>> + *
>> + *   Barriers are initialized with a count of the number of callers
>> + *   that must sync on the barrier before any may proceed.
>> + *
>> + *   To avoid race conditions and to permit the barrier to be fully
>> + *   reusable, the barrier value cycles between 0..2*count-1. When
>> + *   synchronizing the wasless variable simply tracks which half of
>> + *   the cycle the barrier was in upon entry.  Exit is when the
>> + *   barrier crosses to the other half of the cycle.
>> + */
>>
>>  void odp_barrier_sync(odp_barrier_t *barrier)
>>  {
>>          int count;
>> +        int wasless;
>>
>> -        odp_sync_stores();
>>
> Write sync needed to ensure data ordering over the barrier (what happened
> before barrier vs. what happens after it).
>
>
>> -
>> -        count = odp_atomic_fetch_inc_int(&barrier->in);
>> -
>> -        if (count == barrier->count - 1) {
>> -                /* If last thread, release others */
>> -                barrier->in = 0;
>> -                odp_sync_stores();
>> -
>> -                /* Wait for others to exit */
>> -                while (barrier->out)
>> -                        odp_spin();
>> -
>> -                /* Ready, reset out counter */
>> -                barrier->out = barrier->count - 1;
>> -                odp_sync_stores();
>> +        wasless = barrier->bar < barrier->count;
>> +        count = odp_atomic_fetch_inc_int(&barrier->bar);
>>
>> +        if (count == 2*barrier->count-1) {
>> +                barrier->bar = 0;
>>          } else {
>> -                /* Wait for the last thread*/
>> -                while (barrier->in)
>> -                        odp_spin();
>> -
>> -                /* Ready */
>> -                odp_atomic_dec_int(&barrier->out);
>> -                odp_mem_barrier();
>
>
>> +          while ((barrier->bar < barrier->count) == wasless)
>> +            odp_spin();
>>          }
>>
>
> odp_mem_barrier();
> Memory barrier needed at the end to ban optimizer to move other code
> inside the barrier.
>
> -Petri
>
>
>
Petri Savolainen March 19, 2014, 11:57 a.m. UTC | #3
Hi,

Believe me, those syncs are needed. Volatiles don't have anything to do 
with those syncs.

odp_sync_stores() is used to force data store ordering within a CPU;
otherwise you cannot be sure when those stores actually hit memory
(out-of-order execution + write buffering). E.g. without the store sync,
the init function's stores may hit memory _after_ other threads have been
created/launched and have already executed their first barrier syncs...

The same goes for the optimizer: C statements after the barrier sync may
crawl inside the barrier sync code. odp_mem_barrier() prevents that from
happening.


-Petri
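
In linux-generic, odp_sync_stores() presumably boils down to a store fence
or full fence. A sketch assuming the generic builtin fallback (the real
code may use arch-specific instructions such as sfence or dmb st instead):

    /* Hypothetical fallback implementation of the ODP helper */
    static inline void odp_sync_stores(void)
    {
            __sync_synchronize();   /* full HW memory fence */
    }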


On Wednesday, 19 March 2014 13:20:37 UTC+2, Bill Fischofer wrote:
> [...]
Alexandru Badicioiu March 19, 2014, 12:20 p.m. UTC | #4
At least on PPC, memory sync (sync, msync) instructions affect how other
CPUs perceive the order of memory operations. This is a quote from the
E-cores programming manual:
"If the programmer must ensure that cache or other instructions have been
performed with
respect to all other processors and system mechanisms, an msync must be
placed after those instructions."

Fetch-and-op primitives on PPC are translated to this kind of sequence:
loop:
    lwarx  r5,0,r3    # load and reserve
    add    r0,r4,r5   # increment word
    stwcx. r0,0,r3    # store new value if still reserved
    bc     4,2,loop   # loop if lost reservation

The order of lwarx and stwcx. is preserved by default, so there is no need
for memory barriers here.

Hope this helps,
Alex



On 19 March 2014 13:20, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> [...]
Bill Fischofer March 19, 2014, 3:53 p.m. UTC | #5
I'll buy the odp_sync_stores() in the init routine, but given that the new
barrier_sync() only references the barrier via fetch-and-op calls, it
should be safe as-is.  If the barrier reset (back to 0) is delayed for any
reason, that doesn't matter, since the code doesn't care when that store
is actually performed.

Bill


On Wed, Mar 19, 2014 at 6:57 AM, Petri Savolainen
<petri.savolainen@linaro.org> wrote:
> [...]
Petri Savolainen March 20, 2014, 8:14 a.m. UTC | #6
Hi,

OK, third time. What is the purpose of odp_barrier? It
synchronizes/separates execution of the code before the barrier from the
code following the barrier. If the barrier does not include store
synchronization, stores issued before the barrier may complete at an
arbitrary time, possibly after the barrier. E.g. if the barrier is used to
ensure that all cores have completed operation xyz (called before the
barrier), a barrier without store sync would not guarantee that. It's not
about the correctness of the barrier algorithm; it's about synchronization
of the user data that the barrier should protect.

The same goes for the memory barrier at the end. Without it, operation zyx
(after the barrier) may start executing inside the barrier - and again
user data synchronization would be lost.


-Petri
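
Putting the two points together, the variant Petri is describing would
presumably look like this -- a sketch with the sync calls reinstated on
top of the patch's algorithm, not a verified final version:

    void odp_barrier_init_count(odp_barrier_t *barrier, int count)
    {
            barrier->count = count;
            barrier->bar   = 0;
            odp_sync_stores();      /* publish init before any sync call */
    }

    void odp_barrier_sync(odp_barrier_t *barrier)
    {
            int count;
            int wasless;

            odp_sync_stores();      /* order stores made before the barrier */

            wasless = barrier->bar < barrier->count;
            count   = odp_atomic_fetch_inc_int(&barrier->bar);

            if (count == 2 * barrier->count - 1) {
                    /* last arrival wraps the cycle back to 0 */
                    barrier->bar = 0;
            } else {
                    /* wait until the barrier crosses into the other half */
                    while ((barrier->bar < barrier->count) == wasless)
                            odp_spin();
            }

            odp_mem_barrier();      /* keep following code out of the barrier */
    }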


 



On Wednesday, 19 March 2014 17:53:31 UTC+2, Bill Fischofer wrote:
> [...]

Patch

diff --git a/include/odp_barrier.h b/include/odp_barrier.h
index bb4a6c5..0a1404b 100644
--- a/include/odp_barrier.h
+++ b/include/odp_barrier.h
@@ -28,8 +28,7 @@  extern "C" {
  */
 typedef struct odp_barrier_t {
 	int              count;
-	odp_atomic_int_t in;
-	odp_atomic_int_t out;
+	odp_atomic_int_t bar;
 } odp_barrier_t;
 
 
diff --git a/platform/linux-generic/source/odp_barrier.c b/platform/linux-generic/source/odp_barrier.c
index 64fbdb9..9dc6fb5 100644
--- a/platform/linux-generic/source/odp_barrier.c
+++ b/platform/linux-generic/source/odp_barrier.c
@@ -11,40 +11,34 @@ 
 void odp_barrier_init_count(odp_barrier_t *barrier, int count)
 {
 	barrier->count = count;
-	barrier->in    = 0;
-	barrier->out   = count - 1;
-	odp_sync_stores();
+	barrier->bar = 0;
 }
 
+/*
+ * Efficient barrier_sync -
+ *
+ *   Barriers are initialized with a count of the number of callers
+ *   that must sync on the barrier before any may proceed.
+ *
+ *   To avoid race conditions and to permit the barrier to be fully
+ *   reusable, the barrier value cycles between 0..2*count-1. When
+ *   synchronizing the wasless variable simply tracks which half of
+ *   the cycle the barrier was in upon entry.  Exit is when the
+ *   barrier crosses to the other half of the cycle.
+ */
 
 void odp_barrier_sync(odp_barrier_t *barrier)
 {
 	int count;
+	int wasless;
 
-	odp_sync_stores();
-
-	count = odp_atomic_fetch_inc_int(&barrier->in);
-
-	if (count == barrier->count - 1) {
-		/* If last thread, release others */
-		barrier->in = 0;
-		odp_sync_stores();
-
-		/* Wait for others to exit */
-		while (barrier->out)
-			odp_spin();
-
-		/* Ready, reset out counter */
-		barrier->out = barrier->count - 1;
-		odp_sync_stores();
+	wasless = barrier->bar < barrier->count;
+	count = odp_atomic_fetch_inc_int(&barrier->bar);
 
+	if (count == 2*barrier->count-1) {
+		barrier->bar = 0;
 	} else {
-		/* Wait for the last thread*/
-		while (barrier->in)
-			odp_spin();
-
-		/* Ready */
-		odp_atomic_dec_int(&barrier->out);
-		odp_mem_barrier();
+	  while ((barrier->bar < barrier->count) == wasless)
+	    odp_spin();
 	}
 }