Message ID | 1448624646-15863-1-git-send-email-will.deacon@arm.com
---|---
State | Accepted
Commit | d86b8da04dfa4771a68bdbad6c424d40f22f0d14
On Tue, Dec 01, 2015 at 08:40:17AM +0800, Boqun Feng wrote:
> Hi Will,

Hi Boqun,

> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> >
> > |   CPU 1                   CPU 2                     CPU 3
> > |   ==================      ====================      ==============
> > |                                                     spin_unlock(&lock);
> > |                           spin_lock(&lock):
> > |                             r1 = *lock; // r1 == 0;
> > |                           o = READ_ONCE(object); // reordered here
> > |   object = NULL;
> > |   smp_mb();
> > |   spin_unlock_wait(&lock);
> > |                           *lock = 1;
> > |   smp_mb();
> > |   o->dead = true;
> > |                           if (o) // true
> > |                             BUG_ON(o->dead); // true!!
> >
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
>
> I wonder whether upgrading it to a LOCK operation is necessary. Please
> see below.
>
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
> >
> > This patch implements spin_unlock_wait using an LL/SC sequence with
> > acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
>
> IIUC, you implement this with acquire semantics because a LOCK requires
> acquire semantics, right? I get that spin_unlock_wait() becoming a LOCK
> will surely simplify our analysis, because LOCK->LOCK is always globally
> ordered. But for this particular problem, it seems a relaxed LL/SC loop
> would suffice, and the use of spin_unlock_wait() in do_exit() only
> requires a control dependency, which a relaxed LL/SC loop can provide.
> So the acquire semantics may not be necessary here.
>
> Am I missing something subtle here which is the reason you want to
> upgrade spin_unlock_wait() to a LOCK?

There are two things going on here:

  (1) Having spin_unlock_wait do a complete LL/SC sequence (i.e. writing
      to the lock, like a LOCK operation would)

  (2) Giving it ACQUIRE semantics

(2) is not necessary to fix the problem you described. However, LOCK
operations do imply ACQUIRE and it's not clear to me what the ordering
semantics of spin_unlock_wait are supposed to be. I'd really like to
keep this as simple as possible. Introducing yet another magic barrier
macro that isn't well understood or widely used feels like a major step
backwards to me, so I opted to upgrade spin_unlock_wait to LOCK on arm64
so that I no longer have to worry about it.

Will
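The distinction Will draws between (1) and (2) is perhaps easiest to see in code. The following is an editorial sketch, not code from the thread: it uses the generic arch_spin_* API, the function names are invented, and real implementations (such as the arm64 patch at the bottom of this page) optimise the writeback rather than taking the lock outright.

```c
#include <linux/spinlock.h>

/*
 * Read-only wait: what arm64 had before this patch.  With an
 * acquire-only LL/SC spin_lock(), this can return while another CPU is
 * part-way through taking the lock.
 */
static inline void unlock_wait_readonly(arch_spinlock_t *lock)
{
	while (arch_spin_is_locked(lock))
		cpu_relax();
}

/*
 * "LOCK-like" wait, Will's point (1): serialise against concurrent
 * lockers by actually writing to the lock word.  The crudest form takes
 * and immediately releases the lock; point (2), ACQUIRE semantics, then
 * comes along for free because taking the lock is itself an ACQUIRE.
 */
static inline void unlock_wait_as_lock(arch_spinlock_t *lock)
{
	arch_spin_lock(lock);
	arch_spin_unlock(lock);
}
```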
On Mon, Nov 30, 2015 at 04:58:39PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 27, 2015 at 11:44:06AM +0000, Will Deacon wrote:
> > Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
> > on architectures implementing spin_lock with LL/SC sequences and acquire
> > semantics:
> >
> > |   CPU 1                   CPU 2                     CPU 3
> > |   ==================      ====================      ==============
> > |                                                     spin_unlock(&lock);
> > |                           spin_lock(&lock):
> > |                             r1 = *lock; // r1 == 0;
> > |                           o = READ_ONCE(object); // reordered here
> > |   object = NULL;
> > |   smp_mb();
> > |   spin_unlock_wait(&lock);
> > |                           *lock = 1;
> > |   smp_mb();
> > |   o->dead = true;
> > |                           if (o) // true
> > |                             BUG_ON(o->dead); // true!!
> >
> > The crux of the problem is that spin_unlock_wait(&lock) can return on
> > CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
> > resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
> > to serialise against a concurrent locker and giving it acquire semantics
> > in the process (although it is not at all clear whether this is needed -
> > different callers seem to assume different things about the barrier
> > semantics and architectures are similarly disjoint in their
> > implementations of the macro).
>
> Do we want to add a note to spin_unlock_wait() in
> include/linux/spinlock.h warning about these subtle issues, for the next
> arch that thinks this is a 'trivial' thing to implement?

Could do, but I still need agreement from Paul on the solution before I
can describe it in core code. At the moment, the semantics are,
unfortunately, arch-specific.

Paul -- did you have any more thoughts about this? I ended up at:

  https://lkml.org/lkml/2015/11/16/343

and then ran out of ideas.

Will
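A note of the kind Peter suggests might read roughly as follows. The wording is entirely hypothetical and offered only to make the suggestion concrete; it is not text from the kernel or from this thread.

```c
/*
 * spin_unlock_wait(lock) -- wait until nobody holds @lock.
 *
 * Architecture note: this is NOT trivial to implement.  If spin_lock()
 * is built from an LL/SC sequence with acquire semantics only, a
 * read-only wait loop can return while another CPU is part-way through
 * taking the lock, breaking callers that expect to be ordered against
 * that critical section.  Such architectures should make this behave
 * like a LOCK operation (see the arm64 implementation).
 */
```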
Hi Peter, Paul,

Firstly, thanks for writing that up. I agree that you have something
that can work in theory, but see below.

On Thu, Dec 03, 2015 at 02:28:39PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 04:11:41PM -0800, Paul E. McKenney wrote:
> > This looks architecture-agnostic to me:
> >
> > a.	TSO systems have smp_mb__after_unlock_lock() be a no-op, and
> > 	have a read-only implementation for spin_unlock_wait().
> >
> > b.	Small-scale weakly ordered systems can also have
> > 	smp_mb__after_unlock_lock() be a no-op, but must instead
> > 	have spin_unlock_wait() acquire the lock and immediately
> > 	release it, or some optimized implementation of this.
> >
> > c.	Large-scale weakly ordered systems are required to define
> > 	smp_mb__after_unlock_lock() as smp_mb(), but can have a
> > 	read-only implementation of spin_unlock_wait().
>
> This would still require all relevant spin_lock() sites to be annotated
> with smp_mb__after_unlock_lock(), which is going to be a painful (no
> warning when done wrong) exercise and expensive (added MBs all over the
> place).
>
> But yes, I think the proposal is technically sound, just not quite sure
> how far we'll want to push this.

When I said that the solution isn't architecture-agnostic, I was
referring to the fact that spin_unlock_wait acts as a LOCK operation on
all architectures other than those using case (c) above. You've resolved
this by requiring smp_mb__after_unlock_lock() after every LOCK, but I
share Peter's concerns that this isn't going to work in practice
because:

  1. The smp_mb__after_unlock_lock additions aren't necessarily in the
     code where the spin_unlock_wait() is being added, so it's going to
     require a lot of diligence from developers to get this right

  2. Only PowerPC is going to see the (very occasional) failures, so
     testing this is nigh on impossible :(

  3. We've now made the kernel memory model even more difficult to
     understand, so people might not even bother with this addition

Will
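To make Will's first concern concrete: under option (c) the ordering is supplied by an annotation at the lock side, potentially far away from the spin_unlock_wait() call it protects. The sketch below is editorial; the function and variable names are invented, and it assumes smp_mb__after_unlock_lock() is usable from generic code, which is itself part of what is being debated here.

```c
#include <linux/spinlock.h>

/*
 * A lock-side annotation of the kind option (c) would require.  Note
 * that forgetting it produces no warning -- the "painful exercise"
 * Peter and Will describe above.
 */
static void locker(spinlock_t *lock, int *data)
{
	spin_lock(lock);
	/*
	 * smp_mb() on large-scale weakly ordered systems (e.g. PowerPC),
	 * a no-op elsewhere; needed so that a remote, read-only
	 * spin_unlock_wait(lock) is ordered against this critical
	 * section.
	 */
	smp_mb__after_unlock_lock();
	*data = 1;
	spin_unlock(lock);
}
```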
On Fri, Dec 04, 2015 at 08:07:06AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 04, 2015 at 10:21:10AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 03, 2015 at 09:22:07AM -0800, Paul E. McKenney wrote:
> > > > 2. Only PowerPC is going to see the (very occasional) failures, so
> > > >    testing this is nigh on impossible :(
> > >
> > > Indeed, we clearly cannot rely on normal testing, witness rcutorture
> > > failing to find the missing smp_mb__after_unlock_lock() instances that
> > > Peter found by inspection. So I believe that augmented testing is
> > > required, perhaps as suggested above.
> >
> > To be fair, those were in debug code and not critical for correctness
> > per se. That is, at worst the debug print would've observed an incorrect
> > value.
>
> True enough, but there is still risk from people repurposing debug code
> for non-debug uses. Still, thank you, I don't feel -quite- so bad about
> rcutorture's failure to find these. ;-)

It's the ones it has yet to find that you should be worried about, and
the debug code is all fixed ;)

Will
On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> In conclusion, we have more than half of the uses working well already,
> and each of the fix-needed ones has only one related critical section
> and only one related data access in it. So on PPC, I think my proposal
> won't need more smp_mb() instances to fix all current use cases than
> adding smp_mb__after_unlock_lock() after the lock acquisition in each
> related lock critical section.
>
> Of course, my proposal needs the buy-in of both PPC and ARM64v8, so
> Paul and Will, what do you think? ;-)

I already queued the change promoting it to LOCK for arm64. It makes the
semantics easy to understand and I've failed to measure any difference
in performance. It's also robust against any future users of the macro
and matches what other architectures do.

Will
On Fri, Dec 11, 2015 at 05:42:26AM -0800, Paul E. McKenney wrote:
> On Fri, Dec 11, 2015 at 09:46:52AM +0000, Will Deacon wrote:
> > On Fri, Dec 11, 2015 at 04:09:11PM +0800, Boqun Feng wrote:
> > > In conclusion, we have more than half of the uses working well already,
> > > and each of the fix-needed ones has only one related critical section
> > > and only one related data access in it. So on PPC, I think my proposal
> > > won't need more smp_mb() instances to fix all current use cases than
> > > adding smp_mb__after_unlock_lock() after the lock acquisition in each
> > > related lock critical section.
> > >
> > > Of course, my proposal needs the buy-in of both PPC and ARM64v8, so
> > > Paul and Will, what do you think? ;-)
> >
> > I already queued the change promoting it to LOCK for arm64. It makes the
> > semantics easy to understand and I've failed to measure any difference
> > in performance. It's also robust against any future users of the macro
> > and matches what other architectures do.
>
> What size system did you do your performance testing on?

A tiny system by your standards (4 clusters of 2 CPUs), but my take for
arm64 is that either the wfe-based LL/SC loops scale sufficiently or you
build with the new atomic instructions (which don't have an issue here).
I have a bigger system (tens of cores) I can try with, but I don't
currently have the ability to run mainline on it.

Will
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index c85e96d174a5..fc9682bfe002 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,9 +26,28 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+	unsigned int tmp;
+	arch_spinlock_t lockval;
 
-#define arch_spin_unlock_wait(lock) \
-	do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+	asm volatile(
+"	sevl\n"
+"1:	wfe\n"
+"2:	ldaxr	%w0, %2\n"
+"	eor	%w1, %w0, %w0, ror #16\n"
+"	cbnz	%w1, 1b\n"
+	ARM64_LSE_ATOMIC_INSN(
+	/* LL/SC */
+"	stxr	%w1, %w0, %2\n"
+"	cbnz	%w1, 2b\n",	/* Serialise against any concurrent lockers */
+	/* LSE atomics */
+"	nop\n"
+"	nop\n")
+	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
+	:
+	: "memory");
+}
 
 #define arch_spin_lock_flags(lock, flags)   arch_spin_lock(lock)
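For readers who do not speak arm64 assembly, here is the new function again, LL/SC flavour only, with editorial comments walking through each step. The comments are mine, not part of the patch, and the ARM64_LSE_ATOMIC_INSN alternative is omitted for clarity.

```c
/* Annotated, editorial copy of the LL/SC path of the patch above. */
static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
	unsigned int tmp;
	arch_spinlock_t lockval;

	asm volatile(
	/* Set the local event register so the first wfe falls straight
	 * through; after that, wfe sleeps until an event such as the
	 * exclusive monitor being cleared by a write to the lock word. */
"	sevl\n"
"1:	wfe\n"
	/* Load-acquire exclusive of the lock word. */
"2:	ldaxr	%w0, %2\n"
	/* The ticket lock is free when its owner and next halves are
	 * equal: XOR the value with itself rotated by 16 bits and test
	 * for zero; if the lock is still held, go back to sleep. */
"	eor	%w1, %w0, %w0, ror #16\n"
"	cbnz	%w1, 1b\n"
	/* Write the unchanged value back.  The store-exclusive fails if
	 * anyone touched the lock word since the ldaxr, so a racing
	 * locker forces us to re-read -- this writeback is what makes
	 * the wait behave like a LOCK operation. */
"	stxr	%w1, %w0, %2\n"
"	cbnz	%w1, 2b\n"
	: "=&r" (lockval), "=&r" (tmp), "+Q" (*lock)
	:
	: "memory");
}
```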
Boqun Feng reported a rather nasty ordering issue with spin_unlock_wait
on architectures implementing spin_lock with LL/SC sequences and acquire
semantics:

 |   CPU 1                   CPU 2                     CPU 3
 |   ==================      ====================      ==============
 |                                                     spin_unlock(&lock);
 |                           spin_lock(&lock):
 |                             r1 = *lock; // r1 == 0;
 |                           o = READ_ONCE(object); // reordered here
 |   object = NULL;
 |   smp_mb();
 |   spin_unlock_wait(&lock);
 |                           *lock = 1;
 |   smp_mb();
 |   o->dead = true;
 |                           if (o) // true
 |                             BUG_ON(o->dead); // true!!

The crux of the problem is that spin_unlock_wait(&lock) can return on
CPU 1 whilst CPU 2 is in the process of taking the lock. This can be
resolved by upgrading spin_unlock_wait to a LOCK operation, forcing it
to serialise against a concurrent locker and giving it acquire semantics
in the process (although it is not at all clear whether this is needed -
different callers seem to assume different things about the barrier
semantics and architectures are similarly disjoint in their
implementations of the macro).

This patch implements spin_unlock_wait using an LL/SC sequence with
acquire semantics on arm64. For v8.1 systems with the LSE atomics, the
exclusive writeback is omitted, since the spin_lock operation is
indivisible and no intermediate state can be observed.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm64/include/asm/spinlock.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

-- 
2.1.4
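To make the litmus test in the commit message concrete, one way to read it as C is sketched below. This is an editorial illustration only: the struct, function and variable names are hypothetical, CPU 3's earlier spin_unlock() is assumed to have already happened, and the comments describe the pre-patch behaviour of an acquire-only LL/SC lock.

```c
#include <linux/spinlock.h>
#include <linux/bug.h>

struct obj {
	bool dead;
};

static struct obj *object;	/* the published pointer from the litmus test */
static DEFINE_SPINLOCK(lock);

/* CPU 1: unpublishes the object, then uses spin_unlock_wait() to wait
 * out any locker that might still be looking at it. */
static void cpu1(struct obj *o)
{
	object = NULL;
	smp_mb();
	spin_unlock_wait(&lock);	/* pre-patch: may return while CPU 2
					   is still mid-way through
					   spin_lock() */
	smp_mb();
	o->dead = true;
}

/* CPU 2: takes the lock (previously released by CPU 3) and dereferences
 * the object if it is still published.  With an acquire-only LL/SC lock,
 * the READ_ONCE() can be satisfied before CPU 2's write to the lock word
 * becomes visible, so CPU 2 can see the old pointer while CPU 1 sees the
 * lock as free -- and the BUG_ON() fires, as in the "true!!" line. */
static void cpu2(void)
{
	struct obj *o;

	spin_lock(&lock);
	o = READ_ONCE(object);
	if (o)
		BUG_ON(o->dead);
	spin_unlock(&lock);
}
```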