Remove sparcv8 support

Message ID 20161107.113825.631166023186879199.davem@davemloft.net
State New

Commit Message

David Miller Nov. 7, 2016, 4:38 p.m. UTC
So the following attached is what I started playing around with this
weekend.

It implements software trap "0x23" to perform a CAS operation; the
operands are expected in registers %o0, %o1, and %o2.
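
As a sketch, the calling convention can be wrapped in a C helper like
the one below.  The wrapper name is illustrative; the inline asm
mirrors the glibc patch at the end of this page:

#include <stdint.h>

/* Emulated CAS: %o0 = address, %o1 = expected old value, %o2 = new
   value on entry and previous memory contents on return.  The CAS
   succeeded iff the return value equals oldval.  */
static inline uint32_t
cas_via_trap (uint32_t *mem, uint32_t oldval, uint32_t newval)
{
  register uint32_t *__mem __asm__ ("%o0") = mem;
  register uint32_t __old __asm__ ("%o1") = oldval;
  register uint32_t __new __asm__ ("%o2") = newval;

  __asm__ __volatile__ ("ta 0x23"
                        : "+r" (__new), "=m" (*__mem)
                        : "r" (__old), "m" (*__mem), "r" (__mem)
                        : "memory");
  return __new;
}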

Since it was easiest to test I implemented this first on sparc64 which
just executes the CAS instruction directly.  I'll start working on the
32-bit part in the background.

The capability will be advertised via the mask returned by the "get
kernel features" system call.  We could check this early in the
crt'ish code and cache the value in a variable which the atomics can
check.
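
Purely as a hypothetical sketch (the syscall wrapper and feature-bit
name below are placeholders, since that interface doesn't exist yet),
the caching could look like:

/* Set once from crt'ish startup code, read later by the atomics.  */
static int __cas_trap_available;

void
__init_cas_support (void)
{
  /* Hypothetical wrapper for the proposed "get kernel features"
     system call.  */
  unsigned long features = __get_kernel_features ();

  __cas_trap_available = (features & KERNEL_FEATURE_CAS_TRAP) != 0;
}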

Another kernel side change I have to do is advertise the LEON CAS
availability in the _dl_hwcaps so that we can use the LEON CAS in
glibc when available.

The first patch is the kernel side, and the second is the glibc side.
The whole NPTL testsuite passes for the plain 32-bit sparc target with
these changes.

Comments

Sam Ravnborg Nov. 7, 2016, 9:20 p.m. UTC | #1
On Mon, Nov 07, 2016 at 11:38:25AM -0500, David Miller wrote:
> 
> So the following attached is what I started playing around with this
> weekend.
> 
> It implements software trap "0x23" to perform a CAS operation; the
> operands are expected in registers %o0, %o1, and %o2.
> 
> Since it was easiest to test I implemented this first on sparc64 which
> just executes the CAS instruction directly.  I'll start working on the
> 32-bit part in the background.
> 
> The capability will be advertised via the mask returned by the "get
> kernel features" system call.  We could check this early in the
> crt'ish code and cache the value in a variable which the atomics can
> check.
> 
> Another kernel side change I have to do is advertise the LEON CAS
> availability in the _dl_hwcaps so that we can use the LEON CAS in
> glibc when available.
> 
> The first patch is the kernel side, and the second is the glibc side.
> The whole NPTL testsuite passes for the plain 32-bit sparc target with
> these changes.

Glad that you found some time to look into this!

> From fa1cad39df7318cdb46baea5774c340322cd74f2 Mon Sep 17 00:00:00 2001
> From: "David S. Miller" <davem@davemloft.net>
> Date: Mon, 7 Nov 2016 08:27:05 -0800
> Subject: [PATCH] sparc64: Add CAS emulation trap.
> 
> Older 32-bit sparc cpus (other than LEON) lack a CAS instruction, so
> we need to provide some kind of helper infrastructure in the kernel
> to emulate it.
> 
> This is the first part, which defines the basic infrastructure and
> the simplest implementation: just directly executing the instruction
> on sparc64.
> 
> We make use of the window fill/spill fault unwind facilities to make
> this as simple as possible.  When we take a full TSB miss, we check if
> the trap level is greater than one, and if so unwind the trap to one
> of the final 3 instructions of the interrupted trap handler's block.
> Which of the three to use is based upon whether this is a real fault,
> an unaligned access, or a data access exception (ie. bus error).
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---
>  arch/sparc/include/uapi/asm/unistd.h | 1 +
>  arch/sparc/kernel/Makefile           | 1 +
>  arch/sparc/kernel/sys_sparc_64.c     | 2 +-
>  arch/sparc/kernel/ttable_64.S        | 3 ++-
>  4 files changed, 5 insertions(+), 2 deletions(-)

casemul.S is missing.
So all the fun kernel stuff was not included in the patch...

	Sam
Torvald Riegel Nov. 9, 2016, 5:08 p.m. UTC | #2
On Mon, 2016-11-07 at 11:38 -0500, David Miller wrote:
> 
> So the following attached is what I started playing around with this
> weekend.
> 
> It implements software trap "0x23" to perform a CAS operation; the
> operands are expected in registers %o0, %o1, and %o2.
> 
> Since it was easiest to test I implemented this first on sparc64 which
> just executes the CAS instruction directly.  I'll start working on the
> 32-bit part in the background.
> 
> The capability will be advertised via the mask returned by the "get
> kernel features" system call.  We could check this early in the
> crt'ish code and cache the value in a variable which the atomics can
> check.
> 
> Another kernel side change I have to do is advertise the LEON CAS
> availability in the _dl_hwcaps so that we can use the LEON CAS in
> glibc when available.
> 
> The first patch is the kernel side, and the second is the glibc side.
> The whole NPTL testsuite passes for the plain 32-bit sparc target with
> these changes.

What approach are you going to use in the kernel to emulate the CAS if
the hardware doesn't offer one?  If you are not stopping all threads,
then there could be concurrent stores to the same memory location
targeted by the CAS; to make such stores atomic wrt. the CAS, you would
need to implement atomic stores in glibc to also use the kernel (eg, to
do a CAS).
I didn't see this in the glibc patch you sent, so I thought I'd ask.
David Miller Nov. 9, 2016, 5:15 p.m. UTC | #3
From: Torvald Riegel <triegel@redhat.com>
Date: Wed, 09 Nov 2016 09:08:15 -0800

> What approach are you going to use in the kernel to emulate the CAS if
> the hardware doesn't offer one?  If you are not stopping all threads,
> then there could be concurrent stores to the same memory location
> targeted by the CAS; to make such stores atomic wrt. the CAS, you would
> need to implement atomic stores in glibc to also use the kernel (eg, to
> do a CAS).

I keep hearing about this case, but as long as the CAS is atomic what
is the difference between the store being synchronized in some way
or not?

I think the ordering allowed for gives the same set of legal results.

In any possible case either the CAS "wins" or the async store "wins"
and that determines the final result written.  All combinations are
legal outcomes even with a hardware CAS implementation.

I really don't think such asynchronous stores are legal, nor should
they be explicitly accommodated in the CAS emulation support.  Either
the value is maintained in an atomic manner, or it is not.  And if it
is, updates must use CAS.  Straight stores are only legal during the
initialization of the word, before any CAS code paths can get to the
value.

I cannot think of any sane setup that can allow async stores
intermixed with CAS updates.
Torvald Riegel Nov. 10, 2016, 5:05 a.m. UTC | #4
On Wed, 2016-11-09 at 12:15 -0500, David Miller wrote:
> From: Torvald Riegel <triegel@redhat.com>
> Date: Wed, 09 Nov 2016 09:08:15 -0800
> 
> > What approach are you going to use in the kernel to emulate the CAS if
> > the hardware doesn't offer one?  If you are not stopping all threads,
> > then there could be concurrent stores to the same memory location
> > targeted by the CAS; to make such stores atomic wrt. the CAS, you would
> > need to implement atomic stores in glibc to also use the kernel (eg, to
> > do a CAS).
> 
> I keep hearing about this case, but as long as the CAS is atomic what
> is the difference between the store being synchronized in some way
> or not?
> 
> I think the ordering allowed for gives the same set of legal results.
> 
> In any possible case either the CAS "wins" or the async store "wins"
> and that determines the final result written.  All combinations are
> legal outcomes even with a hardware CAS implementation.

See this example, where a is initially 0:

Thread 1:
atomic_store_relaxed (&a, 1);
r = atomic_load_relaxed (&a);

Thread 2:
exp = 0;
atomic_compare_exchange_weak_relaxed (&a, &exp, 2); // succeeds

r should never equal 2.  But if the CAS is not atomic wrt. the store by
Thread 1, then the CAS can load 0, then Thread 1's store comes in, and
then Thread 2's CAS stores 2 because it thought the value of a would be
the expected value of 0; Thread 1's subsequent load can then observe
the impossible value 2.
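
Written out as a runnable C11 program (a transcription of the above,
not code from any posted patch), with a CAS that is atomic wrt. the
store the diagnostic below can never fire:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int a;   /* initially 0 */

static void *
thread1 (void *arg)
{
  atomic_store_explicit (&a, 1, memory_order_relaxed);
  int r = atomic_load_explicit (&a, memory_order_relaxed);
  /* r can be 1, but never 2: a successful CAS must have completed
     before the store, and a CAS after the store must fail.  */
  if (r == 2)
    printf ("broken CAS emulation: the store was lost\n");
  return NULL;
}

static void *
thread2 (void *arg)
{
  int exp = 0;
  atomic_compare_exchange_weak_explicit (&a, &exp, 2,
                                         memory_order_relaxed,
                                         memory_order_relaxed);
  return NULL;
}

int
main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, thread1, NULL);
  pthread_create (&t2, NULL, thread2, NULL);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  return 0;
}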

> I really don't think such asynchronous stores are legal, nor should
> they be explicitly accommodated in the CAS emulation support.  Either
> the value is maintained in an atomic manner, or it is not.  And if it
> is, updates must use CAS.

Yes, the implementation of atomic_store_* in glibc must use the CAS
emulation.  We do not care about plain stores because we consider them
data races in the context of the C11 model.  However, we still have
quite a few cases of plain stores that should be atomic stores in glibc;
so we might have a few problems until we've converted all of those.
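
Concretely, the converted atomic store becomes a retry loop over the
emulated CAS, along these lines (a sketch, reusing the illustrative
cas_via_trap wrapper from the first message):

static inline void
atomic32_store_relaxed (uint32_t *mem, uint32_t newval)
{
  uint32_t old = *mem;
  uint32_t prev;

  /* Keep CASing until newval is the value the CAS actually installs;
     that makes this store atomic wrt. every other emulated CAS.  */
  while ((prev = cas_via_trap (mem, old, newval)) != old)
    old = prev;
}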
Chris Metcalf Nov. 10, 2016, 4:41 p.m. UTC | #5
On 11/9/2016 12:15 PM, David Miller wrote:
> From: Torvald Riegel <triegel@redhat.com>
> Date: Wed, 09 Nov 2016 09:08:15 -0800
>
>> What approach are you going to use in the kernel to emulate the CAS if
>> the hardware doesn't offer one?  If you are not stopping all threads,
>> then there could be concurrent stores to the same memory location
>> targeted by the CAS; to make such stores atomic wrt. the CAS, you would
>> need to implement atomic stores in glibc to also use the kernel (eg, to
>> do a CAS).
> I keep hearing about this case, but as long as the CAS is atomic what
> is the difference between the store being synchronized in some way
> or not?
>
> I think the ordering allowed for gives the same set of legal results.
>
> In any possible case either the CAS "wins" or the async store "wins"
> and that determines the final result written.  All combinations are
> legal outcomes even with a hardware CAS implementation.

That's not actually true.  Suppose you have an initial zero value, and you race
with a store of 2 and a kernel CAS from 0 to 1.  The legal output is only 2:
either the store hit first and the CAS failed, or the CAS hit first and succeeded,
then was overwritten by the 2.  But if the kernel CAS starts first and loads the
zero, then the store hits and sets the value to 2, the CAS will still decide it was
successful and write the 1, thus leaving the value illegally set to 1.

> I really don't think such asynchronous stores are legal, nor should
> they be explicitly accommodated in the CAS emulation support.  Either
> the value is maintained in an atomic manner, or it is not.  And if it
> is, updates must use CAS.  Straight stores are only legal during the
> initialization of the word, before any CAS code paths can get to the
> value.
>
> I cannot think of any sane setup that can allow async stores
> intermixed with CAS updates.

So despite arguing above that mixing CAS and asynchronous store is safe,
here you are arguing that you shouldn't do it?  In any case yes, I think you
have come to the right conclusion, and you shouldn't do it.

If you're interested, I have some optimized code for the tilepro architecture to
handle this in arch/tile.  In kernel/intvec_32.S, the intvec_\vecname macro
does a fastpath check for negative syscalls and calls out to sys_cmpxchg, which
does some carefully optimized work to provide the atomics.
We actually support both 32 and 64-bit cmpxchg, as well as an "atomic_update"
that does (*mem & mask) + added, giving obvious implementations for
atomic_exchange, atomic_exchange_and_add, atomic_and_val, and atomic_or_val
(see glibc's sysdeps/tile/tilepro/atomic-machine.h).  There's some very hairy
stuff designed to handle the case of faulting with a bad user address here, since
we haven't set up the kernel stack yet.  But it works, and it's quite fast
(about 50 cycles to do the fast syscall).

We also hook into the same logic to support a more extended set of in-kernel
atomic operations; see arch/tile/lib/atomic*32* for that stuff.

The underlying locking is done by hashing into a lock table based on the low bits
of the address, which lets us support process-shared as well as process-private,
although it does mean that if multiple processes start up roughly
simultaneously and all try to lock the same process-private futex, they contend
with each other since they're using the same VA.  Oh well; we didn't come up
with a better solution that had good uncontended performance, but perhaps
there are better solutions to the hash function.
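
In outline, the scheme looks like the sketch below (simplified, not
the actual tile code: the spin loop is illustrative, and the real
kernel path additionally has to deal with interrupts and with faults
on bad user addresses):

#include <stdint.h>

#define NLOCKS 64
static volatile unsigned char lock_table[NLOCKS];

static unsigned int
lock_index (void *addr)
{
  /* Hash on the VA so process-shared mappings work too; the flip side
     is that private futexes at the same VA in different processes
     contend on the same lock.  */
  uintptr_t a = (uintptr_t) addr;
  return ((a >> 2) ^ (a >> 12)) & (NLOCKS - 1);
}

static uint32_t
emulated_cas (uint32_t *mem, uint32_t oldval, uint32_t newval)
{
  volatile unsigned char *lock = &lock_table[lock_index (mem)];
  uint32_t prev;

  while (__atomic_test_and_set (lock, __ATOMIC_ACQUIRE))
    ;   /* spin; ldstub on sparc32 */
  prev = *mem;
  if (prev == oldval)
    *mem = newval;
  __atomic_store_n (lock, 0, __ATOMIC_RELEASE);
  return prev;
}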

-- 
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com
Torvald Riegel Nov. 10, 2016, 5:08 p.m. UTC | #6
On Thu, 2016-11-10 at 11:41 -0500, Chris Metcalf wrote:
> On 11/9/2016 12:15 PM, David Miller wrote:
> > From: Torvald Riegel <triegel@redhat.com>
> > Date: Wed, 09 Nov 2016 09:08:15 -0800
> >
> >> What approach are you going to use in the kernel to emulate the CAS if
> >> the hardware doesn't offer one?  If you are not stopping all threads,
> >> then there could be concurrent stores to the same memory location
> >> targeted by the CAS; to make such stores atomic wrt. the CAS, you would
> >> need to implement atomic stores in glibc to also use the kernel (eg, to
> >> do a CAS).
> > I keep hearing about this case, but as long as the CAS is atomic what
> > is the difference between the store being synchronized in some way
> > or not?
> >
> > I think the ordering allowed for gives the same set of legal results.
> >
> > In any possible case either the CAS "wins" or the async store "wins"
> > and that determines the final result written.  All combinations are
> > legal outcomes even with a hardware CAS implementation.
> 
> That's not actually true.  Suppose you have an initial zero value, and you race
> with a store of 2 and a kernel CAS from 0 to 1.  The legal output is only 2:
> either the store hit first and the CAS failed, or the CAS hit first and succeeded,
> then was overwritten by the 2.  But if the kernel CAS starts first and loads the
> zero, then the store hits and sets the value to 2, the CAS will still decide it was
> successful and write the 1, thus leaving the value illegally set to 1.

Looking at tile's atomic-machine.h files again, it seems we're not
actually enforcing that atomic stores are atomic wrt. the CAS
implementation in the kernel.
The default implementation for atomic_store_relaxed in include/atomic.h
does a plain memory store instead of falling back to exchange.  This is
the right approach by default, I think, because that's what
pre-C11-concurrency code in glibc does (ie, there's no abstraction for
an atomic store at all, and plain memory accesses are used).

However, if we emulate CAS with locks or such in the kernel, atomic
stores need to synchronize with the CAS.  This would mean that all archs
that do this, such as tile or sparc, have to define atomic_store_relaxed
to fix it (at least for code converted to using C11 atomics; any
nonconverted code might still do the wrong thing).
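
In glibc terms that override is a one-liner in the affected arch's
atomic-machine.h, something like this sketch:

/* Plain stores could be lost against a lock-based CAS emulation, so
   route stores through the (emulated) atomic exchange.  */
# define atomic_store_relaxed(mem, val) \
  do { atomic_exchange_acq ((mem), (val)); } while (0)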
Torvald Riegel Nov. 10, 2016, 11:38 p.m. UTC | #7
On Thu, 2016-11-10 at 13:22 -0500, Chris Metcalf wrote:
> On 11/10/2016 12:08 PM, Torvald Riegel wrote:
> > Looking at tile's atomic-machine.h files again, it seems we're not
> > actually enforcing that atomic stores are atomic wrt. the CAS
> > implementation in the kernel.
> > The default implementation for atomic_store_relaxed in include/atomic.h
> > does a plain memory store instead of falling back to exchange.  This is
> > the right approach by default, I think, because that's what
> > pre-C11-concurrency code in glibc does (ie, there's no abstraction for
> > an atomic store at all, and plain memory accesses are used).
> >
> > However, if we emulate CAS with locks or such in the kernel, atomic
> > stores need to synchronize with the CAS.  This would mean that all archs
> > that do this, such as tile or sparc, have to define atomic_store_relaxed
> > to fix it (at least for code converted to using C11 atomics; any
> > nonconverted code might still do the wrong thing).
> 
> Note that our mainstream tilegx architecture has full atomic support, so
> this is only applicable to the older tilepro architecture.

LGTM, thanks.

Patch

From 9e4a9f69dd74c47a7d84c1233164acfae7602a9f Mon Sep 17 00:00:00 2001
From: "David S. Miller" <davem@davemloft.net>
Date: Sun, 6 Nov 2016 19:01:31 -0800
Subject: [PATCH] sparc: On 32-bit, provide atomics via kernel assist.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 sysdeps/sparc/sparc32/atomic-machine.h             | 313 ++-------------------
 sysdeps/sparc/sparc32/pthread_barrier_wait.c       |   1 -
 sysdeps/sparc/sparc32/sem_post.c                   |  82 ------
 sysdeps/sparc/sparc32/sem_waitcommon.c             | 146 ----------
 .../sparc/sparc32/sparcv9/pthread_barrier_wait.c   |   1 -
 sysdeps/sparc/sparc32/sparcv9/sem_post.c           |   1 -
 sysdeps/sparc/sparc32/sparcv9/sem_waitcommon.c     |   1 -
 .../unix/sysv/linux/sparc/sparc32/atomic-machine.h |  35 +++
 .../linux/sparc/sparc32/sparcv9/atomic-machine.h   |   1 +
 9 files changed, 55 insertions(+), 526 deletions(-)
 delete mode 100644 sysdeps/sparc/sparc32/pthread_barrier_wait.c
 delete mode 100644 sysdeps/sparc/sparc32/sem_post.c
 delete mode 100644 sysdeps/sparc/sparc32/sem_waitcommon.c
 delete mode 100644 sysdeps/sparc/sparc32/sparcv9/pthread_barrier_wait.c
 delete mode 100644 sysdeps/sparc/sparc32/sparcv9/sem_post.c
 delete mode 100644 sysdeps/sparc/sparc32/sparcv9/sem_waitcommon.c
 create mode 100644 sysdeps/unix/sysv/linux/sparc/sparc32/atomic-machine.h
 create mode 100644 sysdeps/unix/sysv/linux/sparc/sparc32/sparcv9/atomic-machine.h

diff --git a/sysdeps/sparc/sparc32/atomic-machine.h b/sysdeps/sparc/sparc32/atomic-machine.h
index d6e68f9..ceac729 100644
--- a/sysdeps/sparc/sparc32/atomic-machine.h
+++ b/sysdeps/sparc/sparc32/atomic-machine.h
@@ -50,311 +50,36 @@  typedef uintmax_t uatomic_max_t;
 #define __HAVE_64B_ATOMICS 0
 #define USE_ATOMIC_COMPILER_BUILTINS 0
 
+#define __arch_compare_and_exchange_val_8_acq(mem, newval, oldval) \
+  (abort (), (__typeof (*mem)) 0)
 
-/* We have no compare and swap, just test and set.
-   The following implementation contends on 64 global locks
-   per library and assumes no variable will be accessed using atomic.h
-   macros from two different libraries.  */
+#define __arch_compare_and_exchange_val_16_acq(mem, newval, oldval) \
+  (abort (), (__typeof (*mem)) 0)
 
-__make_section_unallocated
-  (".gnu.linkonce.b.__sparc32_atomic_locks, \"aw\", %nobits");
+# define __arch_compare_and_exchange_val_32_acq(mem, newval, oldval) \
+  __sparc_assisted_compare_and_exchange_val_32_acq ((mem), (newval), (oldval))
 
-volatile unsigned char __sparc32_atomic_locks[64]
-  __attribute__ ((nocommon, section (".gnu.linkonce.b.__sparc32_atomic_locks"
-				     __sec_comment),
-		  visibility ("hidden")));
+#define __arch_compare_and_exchange_val_64_acq(mem, newval, oldval) \
+  (abort (), (__typeof (*mem)) 0)
 
-#define __sparc32_atomic_do_lock(addr) \
-  do								      \
-    {								      \
-      unsigned int __old_lock;					      \
-      unsigned int __idx = (((long) addr >> 2) ^ ((long) addr >> 12)) \
-			   & 63;				      \
-      do							      \
-	__asm __volatile ("ldstub %1, %0"			      \
-			  : "=r" (__old_lock),			      \
-			    "=m" (__sparc32_atomic_locks[__idx])      \
-			  : "m" (__sparc32_atomic_locks[__idx])	      \
-			  : "memory");				      \
-      while (__old_lock);					      \
-    }								      \
-  while (0)
+#define atomic_compare_and_exchange_val_24_acq(mem, newval, oldval) \
+  atomic_compare_and_exchange_val_acq (mem, newval, oldval)
 
-#define __sparc32_atomic_do_unlock(addr) \
-  do								      \
-    {								      \
-      __sparc32_atomic_locks[(((long) addr >> 2)		      \
-			      ^ ((long) addr >> 12)) & 63] = 0;	      \
-      __asm __volatile ("" ::: "memory");			      \
-    }								      \
-  while (0)
-
-#define __sparc32_atomic_do_lock24(addr) \
-  do								      \
-    {								      \
-      unsigned int __old_lock;					      \
-      do							      \
-	__asm __volatile ("ldstub %1, %0"			      \
-			  : "=r" (__old_lock), "=m" (*(addr))	      \
-			  : "m" (*(addr))			      \
-			  : "memory");				      \
-      while (__old_lock);					      \
-    }								      \
-  while (0)
-
-#define __sparc32_atomic_do_unlock24(addr) \
-  do								      \
-    {								      \
-      __asm __volatile ("" ::: "memory");			      \
-      *(char *) (addr) = 0;					      \
-    }								      \
-  while (0)
-
-
-#ifndef SHARED
-# define __v9_compare_and_exchange_val_32_acq(mem, newval, oldval) \
-({union { __typeof (oldval) a; uint32_t v; } oldval_arg = { .a = (oldval) };  \
-  union { __typeof (newval) a; uint32_t v; } newval_arg = { .a = (newval) };  \
-  register uint32_t __acev_tmp __asm ("%g6");			              \
-  register __typeof (mem) __acev_mem __asm ("%g1") = (mem);		      \
-  register uint32_t __acev_oldval __asm ("%g5");		              \
-  __acev_tmp = newval_arg.v;						      \
-  __acev_oldval = oldval_arg.v;						      \
-  /* .word 0xcde05005 is cas [%g1], %g5, %g6.  Can't use cas here though,     \
-     because as will then mark the object file as V8+ arch.  */		      \
-  __asm __volatile (".word 0xcde05005"					      \
-		    : "+r" (__acev_tmp), "=m" (*__acev_mem)		      \
-		    : "r" (__acev_oldval), "m" (*__acev_mem),		      \
-		      "r" (__acev_mem) : "memory");			      \
-  (__typeof (oldval)) __acev_tmp; })
-#endif
-
-/* The only basic operation needed is compare and exchange.  */
-#define __v7_compare_and_exchange_val_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __acev_memp = (mem);			      \
-     __typeof (*mem) __acev_ret;				      \
-     __typeof (*mem) __acev_newval = (newval);			      \
-								      \
-     __sparc32_atomic_do_lock (__acev_memp);			      \
-     __acev_ret = *__acev_memp;					      \
-     if (__acev_ret == (oldval))				      \
-       *__acev_memp = __acev_newval;				      \
-     __sparc32_atomic_do_unlock (__acev_memp);			      \
-     __acev_ret; })
-
-#define __v7_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __aceb_memp = (mem);			      \
-     int __aceb_ret;						      \
-     __typeof (*mem) __aceb_newval = (newval);			      \
-								      \
-     __sparc32_atomic_do_lock (__aceb_memp);			      \
-     __aceb_ret = 0;						      \
-     if (*__aceb_memp == (oldval))				      \
-       *__aceb_memp = __aceb_newval;				      \
-     else							      \
-       __aceb_ret = 1;						      \
-     __sparc32_atomic_do_unlock (__aceb_memp);			      \
-     __aceb_ret; })
-
-#define __v7_exchange_acq(mem, newval) \
-  ({ __typeof (mem) __acev_memp = (mem);			      \
-     __typeof (*mem) __acev_ret;				      \
-     __typeof (*mem) __acev_newval = (newval);			      \
-								      \
-     __sparc32_atomic_do_lock (__acev_memp);			      \
-     __acev_ret = *__acev_memp;					      \
-     *__acev_memp = __acev_newval;				      \
-     __sparc32_atomic_do_unlock (__acev_memp);			      \
-     __acev_ret; })
-
-#define __v7_exchange_and_add(mem, value) \
-  ({ __typeof (mem) __acev_memp = (mem);			      \
-     __typeof (*mem) __acev_ret;				      \
-								      \
-     __sparc32_atomic_do_lock (__acev_memp);			      \
-     __acev_ret = *__acev_memp;					      \
-     *__acev_memp = __acev_ret + (value);			      \
-     __sparc32_atomic_do_unlock (__acev_memp);			      \
-     __acev_ret; })
-
-/* Special versions, which guarantee that top 8 bits of all values
-   are cleared and use those bits as the ldstub lock.  */
-#define __v7_compare_and_exchange_val_24_acq(mem, newval, oldval) \
-  ({ __typeof (mem) __acev_memp = (mem);			      \
-     __typeof (*mem) __acev_ret;				      \
-     __typeof (*mem) __acev_newval = (newval);			      \
-								      \
-     __sparc32_atomic_do_lock24 (__acev_memp);			      \
-     __acev_ret = *__acev_memp & 0xffffff;			      \
-     if (__acev_ret == (oldval))				      \
-       *__acev_memp = __acev_newval;				      \
-     else							      \
-       __sparc32_atomic_do_unlock24 (__acev_memp);		      \
-     __asm __volatile ("" ::: "memory");			      \
-     __acev_ret; })
-
-#define __v7_exchange_24_rel(mem, newval) \
-  ({ __typeof (mem) __acev_memp = (mem);			      \
-     __typeof (*mem) __acev_ret;				      \
-     __typeof (*mem) __acev_newval = (newval);			      \
-								      \
-     __sparc32_atomic_do_lock24 (__acev_memp);			      \
-     __acev_ret = *__acev_memp & 0xffffff;			      \
-     *__acev_memp = __acev_newval;				      \
-     __asm __volatile ("" ::: "memory");			      \
-     __acev_ret; })
-
-#ifdef SHARED
-
-/* When dynamically linked, we assume pre-v9 libraries are only ever
-   used on pre-v9 CPU.  */
-# define __atomic_is_v9 0
-
-# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
-  __v7_compare_and_exchange_val_acq (mem, newval, oldval)
-
-# define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  __v7_compare_and_exchange_bool_acq (mem, newval, oldval)
-
-# define atomic_exchange_acq(mem, newval) \
-  __v7_exchange_acq (mem, newval)
-
-# define atomic_exchange_and_add(mem, value) \
-  __v7_exchange_and_add (mem, value)
-
-# define atomic_compare_and_exchange_val_24_acq(mem, newval, oldval) \
-  ({								      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     __v7_compare_and_exchange_val_24_acq (mem, newval, oldval); })
-
-# define atomic_exchange_24_rel(mem, newval) \
-  ({								      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     __v7_exchange_24_rel (mem, newval); })
+#define atomic_exchange_24_rel(mem, newval) \
+  atomic_exchange_rel (mem, newval)
 
 # define atomic_full_barrier() __asm ("" ::: "memory")
 # define atomic_read_barrier() atomic_full_barrier ()
 # define atomic_write_barrier() atomic_full_barrier ()
 
-#else
-
-/* In libc.a/libpthread.a etc. we don't know if we'll be run on
-   pre-v9 or v9 CPU.  To be interoperable with dynamically linked
-   apps on v9 CPUs e.g. with process shared primitives, use cas insn
-   on v9 CPUs and ldstub on pre-v9.  */
-
-extern uint64_t _dl_hwcap __attribute__((weak));
-# define __atomic_is_v9 \
-  (__builtin_expect (&_dl_hwcap != 0, 1) \
-   && __builtin_expect (_dl_hwcap & HWCAP_SPARC_V9, HWCAP_SPARC_V9))
-
-# define atomic_compare_and_exchange_val_acq(mem, newval, oldval) \
-  ({								      \
-     __typeof (*mem) __acev_wret;				      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     if (__atomic_is_v9)					      \
-       __acev_wret						      \
-	 = __v9_compare_and_exchange_val_32_acq (mem, newval, oldval);\
-     else							      \
-       __acev_wret						      \
-	 = __v7_compare_and_exchange_val_acq (mem, newval, oldval);   \
-     __acev_wret; })
-
-# define atomic_compare_and_exchange_bool_acq(mem, newval, oldval) \
-  ({								      \
-     int __acev_wret;						      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     if (__atomic_is_v9)					      \
-       {							      \
-	 __typeof (oldval) __acev_woldval = (oldval);		      \
-	 __acev_wret						      \
-	   = __v9_compare_and_exchange_val_32_acq (mem, newval,	      \
-						   __acev_woldval)    \
-	     != __acev_woldval;					      \
-       }							      \
-     else							      \
-       __acev_wret						      \
-	 = __v7_compare_and_exchange_bool_acq (mem, newval, oldval);  \
-     __acev_wret; })
-
-# define atomic_exchange_rel(mem, newval) \
-  ({								      \
-     __typeof (*mem) __acev_wret;				      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     if (__atomic_is_v9)					      \
-       {							      \
-	 __typeof (mem) __acev_wmemp = (mem);			      \
-	 __typeof (*(mem)) __acev_wval = (newval);		      \
-	 do							      \
-	   __acev_wret = *__acev_wmemp;				      \
-	 while (__builtin_expect				      \
-		  (__v9_compare_and_exchange_val_32_acq (__acev_wmemp,\
-							 __acev_wval, \
-							 __acev_wret) \
-		   != __acev_wret, 0));				      \
-       }							      \
-     else							      \
-       __acev_wret = __v7_exchange_acq (mem, newval);		      \
-     __acev_wret; })
-
-# define atomic_compare_and_exchange_val_24_acq(mem, newval, oldval) \
-  ({								      \
-     __typeof (*mem) __acev_wret;				      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     if (__atomic_is_v9)					      \
-       __acev_wret						      \
-	 = __v9_compare_and_exchange_val_32_acq (mem, newval, oldval);\
-     else							      \
-       __acev_wret						      \
-	 = __v7_compare_and_exchange_val_24_acq (mem, newval, oldval);\
-     __acev_wret; })
-
-# define atomic_exchange_24_rel(mem, newval) \
-  ({								      \
-     __typeof (*mem) __acev_w24ret;				      \
-     if (sizeof (*mem) != 4)					      \
-       abort ();						      \
-     if (__atomic_is_v9)					      \
-       __acev_w24ret = atomic_exchange_rel (mem, newval);	      \
-     else							      \
-       __acev_w24ret = __v7_exchange_24_rel (mem, newval);	      \
-     __acev_w24ret; })
-
-#define atomic_full_barrier()						\
-  do {									\
-     if (__atomic_is_v9)						\
-       /* membar #LoadLoad | #LoadStore | #StoreLoad | #StoreStore */	\
-       __asm __volatile (".word 0x8143e00f" : : : "memory");		\
-     else								\
-       __asm __volatile ("" : : : "memory");				\
-  } while (0)
-
-#define atomic_read_barrier()						\
-  do {									\
-     if (__atomic_is_v9)						\
-       /* membar #LoadLoad | #LoadStore */				\
-       __asm __volatile (".word 0x8143e005" : : : "memory");		\
-     else								\
-       __asm __volatile ("" : : : "memory");				\
-  } while (0)
-
-#define atomic_write_barrier()						\
-  do {									\
-     if (__atomic_is_v9)						\
-       /* membar  #LoadStore | #StoreStore */				\
-       __asm __volatile (".word 0x8143e00c" : : : "memory");		\
-     else								\
-       __asm __volatile ("" : : : "memory");				\
-  } while (0)
+void __sparc_link_error (void);
 
+/* An OS-specific atomic-machine.h file will define this macro if
+   the OS can provide something.  If not, we'll fail to build
+   with a compiler that doesn't supply the operation.  */
+#ifndef __sparc_assisted_compare_and_exchange_val_32_acq
+# define __sparc_assisted_compare_and_exchange_val_32_acq(mem, newval, oldval) \
+  ({ __sparc_link_error (); oldval; })
 #endif
 
-#include <sysdep.h>
-
 #endif	/* atomic-machine.h */
diff --git a/sysdeps/sparc/sparc32/pthread_barrier_wait.c b/sysdeps/sparc/sparc32/pthread_barrier_wait.c
deleted file mode 100644
index e5ef911..0000000
--- a/sysdeps/sparc/sparc32/pthread_barrier_wait.c
+++ /dev/null
@@ -1 +0,0 @@ 
-#error No support for pthread barriers on pre-v9 sparc.
diff --git a/sysdeps/sparc/sparc32/sem_post.c b/sysdeps/sparc/sparc32/sem_post.c
deleted file mode 100644
index 415a3d5..0000000
--- a/sysdeps/sparc/sparc32/sem_post.c
+++ /dev/null
@@ -1,82 +0,0 @@ 
-/* sem_post -- post to a POSIX semaphore.  Generic futex-using version.
-   Copyright (C) 2003-2016 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Jakub Jelinek <jakub@redhat.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <atomic.h>
-#include <errno.h>
-#include <sysdep.h>
-#include <lowlevellock.h>
-#include <internaltypes.h>
-#include <semaphore.h>
-#include <futex-internal.h>
-
-#include <shlib-compat.h>
-
-
-/* See sem_wait for an explanation of the algorithm.  */
-int
-__new_sem_post (sem_t *sem)
-{
-  struct new_sem *isem = (struct new_sem *) sem;
-  int private = isem->private;
-  unsigned int v;
-
-  __sparc32_atomic_do_lock24 (&isem->pad);
-
-  v = isem->value;
-  if ((v >> SEM_VALUE_SHIFT) == SEM_VALUE_MAX)
-    {
-      __sparc32_atomic_do_unlock24 (&isem->pad);
-
-      __set_errno (EOVERFLOW);
-      return -1;
-    }
-  isem->value = v + (1 << SEM_VALUE_SHIFT);
-
-  __sparc32_atomic_do_unlock24 (&isem->pad);
-
-  if ((v & SEM_NWAITERS_MASK) != 0)
-    futex_wake (&isem->value, 1, private);
-
-  return 0;
-}
-versioned_symbol (libpthread, __new_sem_post, sem_post, GLIBC_2_1);
-
-
-#if SHLIB_COMPAT (libpthread, GLIBC_2_0, GLIBC_2_1)
-int
-attribute_compat_text_section
-__old_sem_post (sem_t *sem)
-{
-  int *futex = (int *) sem;
-
-  /* We must need to synchronize with consumers of this token, so the atomic
-     increment must have release MO semantics.  */
-  atomic_write_barrier ();
-  (void) atomic_increment_val (futex);
-  /* We always have to assume it is a shared semaphore.  */
-  int err = lll_futex_wake (futex, 1, LLL_SHARED);
-  if (__builtin_expect (err, 0) < 0)
-    {
-      __set_errno (-err);
-      return -1;
-    }
-  return 0;
-}
-compat_symbol (libpthread, __old_sem_post, sem_post, GLIBC_2_0);
-#endif
diff --git a/sysdeps/sparc/sparc32/sem_waitcommon.c b/sysdeps/sparc/sparc32/sem_waitcommon.c
deleted file mode 100644
index 5340f57..0000000
--- a/sysdeps/sparc/sparc32/sem_waitcommon.c
+++ /dev/null
@@ -1,146 +0,0 @@ 
-/* sem_waitcommon -- wait on a semaphore, shared code.
-   Copyright (C) 2003-2016 Free Software Foundation, Inc.
-   This file is part of the GNU C Library.
-   Contributed by Paul Mackerras <paulus@au.ibm.com>, 2003.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.	 See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#include <errno.h>
-#include <sysdep.h>
-#include <futex-internal.h>
-#include <internaltypes.h>
-#include <semaphore.h>
-#include <sys/time.h>
-
-#include <pthreadP.h>
-#include <shlib-compat.h>
-#include <atomic.h>
-
-
-static void
-__sem_wait_32_finish (struct new_sem *sem);
-
-static void
-__sem_wait_cleanup (void *arg)
-{
-  struct new_sem *sem = (struct new_sem *) arg;
-
-  __sem_wait_32_finish (sem);
-}
-
-/* Wait until at least one token is available, possibly with a timeout.
-   This is in a separate function in order to make sure gcc
-   puts the call site into an exception region, and thus the
-   cleanups get properly run.  TODO still necessary?  Other futex_wait
-   users don't seem to need it.  */
-static int
-__attribute__ ((noinline))
-do_futex_wait (struct new_sem *sem, const struct timespec *abstime)
-{
-  int err;
-
-  err = futex_abstimed_wait_cancelable (&sem->value, SEM_NWAITERS_MASK,
-					abstime, sem->private);
-
-  return err;
-}
-
-/* Fast path: Try to grab a token without blocking.  */
-static int
-__new_sem_wait_fast (struct new_sem *sem, int definitive_result)
-{
-  unsigned int v;
-  int ret = 0;
-
-  __sparc32_atomic_do_lock24(&sem->pad);
-
-  v = sem->value;
-  if ((v >> SEM_VALUE_SHIFT) == 0)
-    ret = -1;
-  else
-    sem->value = v - (1 << SEM_VALUE_SHIFT);
-
-  __sparc32_atomic_do_unlock24(&sem->pad);
-
-  return ret;
-}
-
-/* Slow path that blocks.  */
-static int
-__attribute__ ((noinline))
-__new_sem_wait_slow (struct new_sem *sem, const struct timespec *abstime)
-{
-  unsigned int v;
-  int err = 0;
-
-  __sparc32_atomic_do_lock24(&sem->pad);
-
-  sem->nwaiters++;
-
-  pthread_cleanup_push (__sem_wait_cleanup, sem);
-
-  /* Wait for a token to be available.  Retry until we can grab one.  */
-  v = sem->value;
-  do
-    {
-      if (!(v & SEM_NWAITERS_MASK))
-	sem->value = v | SEM_NWAITERS_MASK;
-
-      /* If there is no token, wait.  */
-      if ((v >> SEM_VALUE_SHIFT) == 0)
-	{
-	  __sparc32_atomic_do_unlock24(&sem->pad);
-
-	  err = do_futex_wait(sem, abstime);
-	  if (err == ETIMEDOUT || err == EINTR)
-	    {
-	      __set_errno (err);
-	      err = -1;
-	      goto error;
-	    }
-	  err = 0;
-
-	  __sparc32_atomic_do_lock24(&sem->pad);
-
-	  /* We blocked, so there might be a token now.  */
-	  v = sem->value;
-	}
-    }
-  /* If there is no token, we must not try to grab one.  */
-  while ((v >> SEM_VALUE_SHIFT) == 0);
-
-  sem->value = v - (1 << SEM_VALUE_SHIFT);
-
-  __sparc32_atomic_do_unlock24(&sem->pad);
-
-error:
-  pthread_cleanup_pop (0);
-
-  __sem_wait_32_finish (sem);
-
-  return err;
-}
-
-/* Stop being a registered waiter (non-64b-atomics code only).  */
-static void
-__sem_wait_32_finish (struct new_sem *sem)
-{
-  __sparc32_atomic_do_lock24(&sem->pad);
-
-  if (--sem->nwaiters == 0)
-    sem->value &= ~SEM_NWAITERS_MASK;
-
-  __sparc32_atomic_do_unlock24(&sem->pad);
-}
diff --git a/sysdeps/sparc/sparc32/sparcv9/pthread_barrier_wait.c b/sysdeps/sparc/sparc32/sparcv9/pthread_barrier_wait.c
deleted file mode 100644
index 246c8d4..0000000
--- a/sysdeps/sparc/sparc32/sparcv9/pthread_barrier_wait.c
+++ /dev/null
@@ -1 +0,0 @@ 
-#include <nptl/pthread_barrier_wait.c>
diff --git a/sysdeps/sparc/sparc32/sparcv9/sem_post.c b/sysdeps/sparc/sparc32/sparcv9/sem_post.c
deleted file mode 100644
index 6a2813c..0000000
--- a/sysdeps/sparc/sparc32/sparcv9/sem_post.c
+++ /dev/null
@@ -1 +0,0 @@ 
-#include <nptl/sem_post.c>
diff --git a/sysdeps/sparc/sparc32/sparcv9/sem_waitcommon.c b/sysdeps/sparc/sparc32/sparcv9/sem_waitcommon.c
deleted file mode 100644
index d4a1395..0000000
--- a/sysdeps/sparc/sparc32/sparcv9/sem_waitcommon.c
+++ /dev/null
@@ -1 +0,0 @@ 
-#include <nptl/sem_waitcommon.c>
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/atomic-machine.h b/sysdeps/unix/sysv/linux/sparc/sparc32/atomic-machine.h
new file mode 100644
index 0000000..4bb8aa4
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/atomic-machine.h
@@ -0,0 +1,35 @@ 
+/* Atomic operations.  SPARC/Linux version.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <stdint.h>
+
+#define __sparc_assisted_compare_and_exchange_val_32_acq(mem, newval, oldval)\
+({union { __typeof (oldval) a; uint32_t v; } oldval_arg = { .a = (oldval) }; \
+  union { __typeof (newval) a; uint32_t v; } newval_arg = { .a = (newval) }; \
+  register uint32_t __acev_tmp __asm ("%o2");			             \
+  register __typeof (mem) __acev_mem __asm ("%o0") = (mem);		     \
+  register uint32_t __acev_oldval __asm ("%o1");		             \
+  __acev_tmp = newval_arg.v;						     \
+  __acev_oldval = oldval_arg.v;						     \
+  __asm __volatile ("ta 0x23"					             \
+		    : "+r" (__acev_tmp), "=m" (*__acev_mem)		     \
+		    : "r" (__acev_oldval), "m" (*__acev_mem),		     \
+		      "r" (__acev_mem) : "memory");			     \
+  (__typeof (oldval)) __acev_tmp; })
+
+#include <sysdeps/sparc/sparc32/atomic-machine.h>
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/sparcv9/atomic-machine.h b/sysdeps/unix/sysv/linux/sparc/sparc32/sparcv9/atomic-machine.h
new file mode 100644
index 0000000..c5cf630
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/sparcv9/atomic-machine.h
@@ -0,0 +1 @@ 
+#include <sysdeps/sparc/sparc32/sparcv9/atomic-machine.h>
-- 
2.1.2.532.g19b5d50