Message ID | 1412349086-11473-4-git-send-email-will.newton@linaro.org |
---|---|
State | Superseded |
Headers | show |
On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote: > Using the relaxed memory model for atomics when single-threaded allows > a reduction in the number of barriers (dmb) executed and an improvement in > single thread performance on the malloc benchtest: I'm aware that we do have catomic* functions and they are being used. However, I'm wondering whether they are the right tool for what we want to achieve. They simply allow to avoid some of the overhead of atomics (but not necessarily all). Wouldn't it be better to change the calling code to contain optimized paths for single-threaded execution? Also, calling code could either be reentrant or not. For the former, you could even avoid actual atomic accesses instead of just avoiding the barriers. Also, the compiler could perhaps generate more efficient code if it doesn't have to deal with (relaxed) atomics. > Before: 259.073 > After: 246.749 What is the performance number for a program that does have several threads but runs with your patch (i.e., has conditional execution but can't avoid the barriers)? Do you have numbers for a hacked malloc that uses no atomics?
On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote: > On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote: >> Using the relaxed memory model for atomics when single-threaded allows >> a reduction in the number of barriers (dmb) executed and an improvement in >> single thread performance on the malloc benchtest: > > I'm aware that we do have catomic* functions and they are being used. > However, I'm wondering whether they are the right tool for what we want > to achieve. They are kind of ugly and rather undocumented as to their precise semantics, so I share your general concerns about these functions. > They simply allow to avoid some of the overhead of atomics (but not > necessarily all). Wouldn't it be better to change the calling code to > contain optimized paths for single-threaded execution? How would you suggest implementing that? My first instinct is that the result would look a lot like what we have now, i.e. some kind of wrapper round atomic functions that special-cases the single-threaded case. malloc is the main performance critical subsystem using catomic, so it may be possible to do more of this work there and reduce the complexity of the generic atomic implementation (although I believe an earlier patch did do this to some extent but was rejected). > Also, calling code could either be reentrant or not. For the former, > you could even avoid actual atomic accesses instead of just avoiding the > barriers. Also, the compiler could perhaps generate more efficient code > if it doesn't have to deal with (relaxed) atomics. Yes, that would be ideal if we had that option. It's not clear to me what catomic_ actually means, it seems from the previous discussions that it has to be atomic wrt. signal handlers which is why the atomic operations remain (but the barriers can be dropped). malloc is generally not re-entrant or AS-safe so optimizing away the atomic instructions would be a bg win here... >> Before: 259.073 >> After: 246.749 > > What is the performance number for a program that does have several > threads but runs with your patch (i.e., has conditional execution but > can't avoid the barriers)? I don't have them, but I will look at that. > Do you have numbers for a hacked malloc that uses no atomics? No, I'll see what I can do.
On Mon, 2014-10-06 at 15:13 +0100, Will Newton wrote: > On 6 October 2014 14:43, Torvald Riegel <triegel@redhat.com> wrote: > > On Fri, 2014-10-03 at 16:11 +0100, Will Newton wrote: > >> Using the relaxed memory model for atomics when single-threaded allows > >> a reduction in the number of barriers (dmb) executed and an improvement in > >> single thread performance on the malloc benchtest: > > > > I'm aware that we do have catomic* functions and they are being used. > > However, I'm wondering whether they are the right tool for what we want > > to achieve. > > They are kind of ugly and rather undocumented as to their precise > semantics, so I share your general concerns about these functions. > > > They simply allow to avoid some of the overhead of atomics (but not > > necessarily all). Wouldn't it be better to change the calling code to > > contain optimized paths for single-threaded execution? > > How would you suggest implementing that? My first instinct is that the > result would look a lot like what we have now, i.e. some kind of > wrapper round atomic functions that special-cases the single-threaded > case. First, I'd differentiate between functions that can be reentrant and those that aren't. For the former, we'd use sequential code, which would probably also avoid a few loops and branches because there's nothing happening concurrently. For the latter, using catomic* or memory_order_relaxed atomics directly (after the transition to C11 atomics for this particular piece of code) would be the way to go I guess, because you still need atomic ops even though you don't need to synchronize with other threads (although the rules as to which atomics can really be used in signal handlers is still being discussed, at least in the C++ committee). > malloc is the main performance critical subsystem using catomic, so it > may be possible to do more of this work there and reduce the > complexity of the generic atomic implementation (although I believe an > earlier patch did do this to some extent but was rejected). Having a single-threaded malloc implementation sounds like the right thing to me. We'd have some additional code, but the sequential code would only be a subset of the program logic for the concurrent case (IOW, you just would leave out stuff that cares about other interleavings with other threads). > > Also, calling code could either be reentrant or not. For the former, > > you could even avoid actual atomic accesses instead of just avoiding the > > barriers. Also, the compiler could perhaps generate more efficient code > > if it doesn't have to deal with (relaxed) atomics. > > Yes, that would be ideal if we had that option. It's not clear to me > what catomic_ actually means, it seems from the previous discussions > that it has to be atomic wrt. signal handlers which is why the atomic > operations remain (but the barriers can be dropped). That's my understanding too. IIRC, it drops the "lock" prefix on x86, for example. > malloc is > generally not re-entrant or AS-safe so optimizing away the atomic > instructions would be a bg win here... > > >> Before: 259.073 > >> After: 246.749 > > > > What is the performance number for a program that does have several > > threads but runs with your patch (i.e., has conditional execution but > > can't avoid the barriers)? > > I don't have them, but I will look at that. > > > Do you have numbers for a hacked malloc that uses no atomics? > > No, I'll see what I can do. > Thanks.
diff --git a/sysdeps/arm/bits/atomic.h b/sysdeps/arm/bits/atomic.h index be314e4..0fbd82b 100644 --- a/sysdeps/arm/bits/atomic.h +++ b/sysdeps/arm/bits/atomic.h @@ -52,6 +52,19 @@ void __arm_link_error (void); a pattern to do this efficiently. */ #if __GNUC_PREREQ (4, 7) && defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 +# if defined IS_IN_libpthread || !defined NOT_IN_libc +# ifdef IS_IN_libpthread +extern int __pthread_multiple_threads attribute_hidden; +# define __atomic_is_single_thread (__pthread_multiple_threads == 0) +# else +extern int __libc_multiple_threads attribute_hidden; +# define __atomic_is_single_thread (__libc_multiple_threads == 0) +# endif +# else +# define __atomic_is_single_thread 0 +# endif + + /* Compare and exchange. For all "bool" routines, we return FALSE if exchange successful. */ @@ -180,7 +193,19 @@ void __arm_link_error (void); __atomic_val_bysize (__arch_exchange_and_add, int, mem, value, \ __ATOMIC_RELEASE) -# define catomic_exchange_and_add atomic_exchange_and_add +# define atomic_exchange_and_add_relaxed(mem, value) \ + __atomic_val_bysize (__arch_exchange_and_add, int, mem, value, \ + __ATOMIC_RELAXED) + +# define catomic_exchange_and_add(mem, value) \ + ({ \ + __typeof (*(mem)) __res; \ + if (__atomic_is_single_thread) \ + __res = atomic_exchange_and_add_relaxed (mem, value); \ + else \ + __res = atomic_exchange_and_add_acq (mem, value); \ + __res; \ + }) /* Atomically bitwise and value and return the previous value. */ @@ -200,9 +225,21 @@ void __arm_link_error (void); __atomic_val_bysize (__arch_exchange_and_and, int, mem, value, \ __ATOMIC_ACQUIRE) +# define atomic_and_relaxed(mem, value) \ + __atomic_val_bysize (__arch_exchange_and_and, int, mem, value, \ + __ATOMIC_RELAXED) + # define atomic_and_val atomic_and -# define catomic_and atomic_and +# define catomic_and(mem, value) \ + ({ \ + __typeof (*(mem)) __res; \ + if (__atomic_is_single_thread) \ + __res = atomic_and_relaxed (mem, value); \ + else \ + __res = atomic_and (mem, value); \ + __res; \ + }) /* Atomically bitwise or value and return the previous value. */ @@ -222,9 +259,21 @@ void __arm_link_error (void); __atomic_val_bysize (__arch_exchange_and_or, int, mem, value, \ __ATOMIC_ACQUIRE) +# define atomic_or_relaxed(mem, value) \ + __atomic_val_bysize (__arch_exchange_and_or, int, mem, value, \ + __ATOMIC_RELAXED) + # define atomic_or_val atomic_or -# define catomic_or atomic_or +# define catomic_or(mem, value) \ + ({ \ + __typeof (*(mem)) __res; \ + if (__atomic_is_single_thread) \ + __res = atomic_or_relaxed (mem, value); \ + else \ + __res = atomic_or (mem, value); \ + __res; \ + }) #elif defined __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4