Message ID | 20190301140348.25175-2-will.deacon@arm.com
---|---
State | Superseded
Series | Remove Mysterious Macro Intended to Obscure Weird Behaviours (mmiowb())
Will Deacon's on March 2, 2019 12:03 am:
> In preparation for removing all explicit mmiowb() calls from driver
> code, implement a tracking system in asm-generic based loosely on the
> PowerPC implementation. This allows architectures with a non-empty
> mmiowb() definition to have the barrier automatically inserted in
> spin_unlock() following a critical section containing an I/O write.

Is there a reason to call this "mmiowb"? We already have wmb that
orders cacheable stores vs mmio stores don't we?

Yes ia64 "sn2" is broken in that case, but that can be fixed (if
anyone really cares about the platform any more). Maybe that's
orthogonal to what you're doing here, I just don't like seeing
"mmiowb" spread.

This series works for spin locks, but you would want a driver to
be able to use wmb() to order locks vs mmio when using a bit lock
or a mutex or whatever else. Calling your wmb-if-io-is-pending
version io_mb_before_unlock() would kind of match with existing
patterns.

> +static inline void mmiowb_set_pending(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +	ms->mmiowb_pending = ms->nesting_count;
> +}
> +
> +static inline void mmiowb_spin_lock(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +	ms->nesting_count++;
> +}
> +
> +static inline void mmiowb_spin_unlock(void)
> +{
> +	struct mmiowb_state *ms = __mmiowb_state();
> +
> +	if (unlikely(ms->mmiowb_pending)) {
> +		ms->mmiowb_pending = 0;
> +		mmiowb();
> +	}
> +
> +	ms->nesting_count--;
> +}

Humour me for a minute and tell me what this algorithm is doing, or
what was broken about the powerpc one, which is basically:

static inline void mmiowb_set_pending(void)
{
	struct mmiowb_state *ms = __mmiowb_state();
	ms->mmiowb_pending = 1;
}

static inline void mmiowb_spin_lock(void)
{
}

static inline void mmiowb_spin_unlock(void)
{
	struct mmiowb_state *ms = __mmiowb_state();

	if (unlikely(ms->mmiowb_pending)) {
		ms->mmiowb_pending = 0;
		mmiowb();
	}
}

> diff --git a/include/asm-generic/mmiowb_types.h b/include/asm-generic/mmiowb_types.h
> new file mode 100644
> index 000000000000..8eb0095655e7
> --- /dev/null
> +++ b/include/asm-generic/mmiowb_types.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_GENERIC_MMIOWB_TYPES_H
> +#define __ASM_GENERIC_MMIOWB_TYPES_H
> +
> +#include <linux/types.h>
> +
> +struct mmiowb_state {
> +	u16	nesting_count;
> +	u16	mmiowb_pending;
> +};

Really need more than 255 nested spin locks? I had the idea that 16
bit operations were a bit more costly than 8 bit on some CPUs... may
not be true, but at least the smaller size packs a bit better on
powerpc.

Thanks,
Nick
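As a concrete illustration of the change the series makes for drivers, here is a minimal before/after sketch; the device, lock, register and value names are made up for the example and are not from the series itself:

	/* Before the series: the driver must remember the explicit barrier
	 * so the MMIO write is seen before the lock is handed over. */
	spin_lock_irqsave(&dev->lock, flags);
	writel(val, dev->regs + CTRL_REG);	/* MMIO write in the critical section */
	mmiowb();				/* order it before releasing the lock */
	spin_unlock_irqrestore(&dev->lock, flags);

	/* After the series: the mmiowb() call is dropped; writel() marks an
	 * I/O write as pending and spin_unlock() issues the barrier itself
	 * on architectures that need it. */
	spin_lock_irqsave(&dev->lock, flags);
	writel(val, dev->regs + CTRL_REG);
	spin_unlock_irqrestore(&dev->lock, flags);
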
On Sat, Mar 2, 2019 at 5:43 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?

Sadly no it doesn't. Not on ia64, and people tried to make that the
new rule because of the platform breakage on what some people thought
would be a major platform.

Plain wmb() was only guaranteed to order regular memory against each
other (mostly useful for dma) on some of these platforms, because they
had such broken IO synchronization.

So mmiowb() is not a new name. It's been around for a while, and the
people who wanted it have happily become irrelevant. Will is making it
go away, but the name remains for historical reasons, even if Will's
new acronym explanation for the name is much better ;)

                Linus
Linus Torvalds's on March 3, 2019 12:18 pm:
> On Sat, Mar 2, 2019 at 5:43 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Is there a reason to call this "mmiowb"? We already have wmb that
>> orders cacheable stores vs mmio stores don't we?
>
> Sadly no it doesn't. Not on ia64, and people tried to make that the
> new rule because of the platform breakage on what some people thought
> would be a major platform.

Let me try this again, because I was babbling a train of thought
continuing from my past mails on the subject.

Kill mmiowb with fire. It was added for a niche platform that hasn't
been produced for 10 years for a CPU ISA that is no longer being
developed. Let's make mb/wmb great again (aka actually possible for
normal people to understand).

If something comes along again that reorders mmios from different CPUs
in the IO controller like the Altix did, they implement wmb the slow
and correct way. They can add a new faster primitive for the few
devices they care about in the couple of perf critical places that
matter.

It doesn't have to be done all at once with this series, obviously this
is a big improvement on its own. But why perpetuate the nomenclature
and concept for new code added now?

Thanks,
Nick
Nicholas Piggin <npiggin@gmail.com> writes:
> Will Deacon's on March 2, 2019 12:03 am:
>> In preparation for removing all explicit mmiowb() calls from driver
>> code, implement a tracking system in asm-generic based loosely on the
>> PowerPC implementation. This allows architectures with a non-empty
>> mmiowb() definition to have the barrier automatically inserted in
>> spin_unlock() following a critical section containing an I/O write.
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?
>
> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
> anyone really cares about the platform any more). Maybe that's
> orthogonal to what you're doing here, I just don't like seeing
> "mmiowb" spread.
>
> This series works for spin locks, but you would want a driver to
> be able to use wmb() to order locks vs mmio when using a bit lock
> or a mutex or whatever else. Calling your wmb-if-io-is-pending
> version io_mb_before_unlock() would kind of match with existing
> patterns.
>
>> +static inline void mmiowb_set_pending(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +	ms->mmiowb_pending = ms->nesting_count;
>> +}
>> +
>> +static inline void mmiowb_spin_lock(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +	ms->nesting_count++;
>> +}
>> +
>> +static inline void mmiowb_spin_unlock(void)
>> +{
>> +	struct mmiowb_state *ms = __mmiowb_state();
>> +
>> +	if (unlikely(ms->mmiowb_pending)) {
>> +		ms->mmiowb_pending = 0;
>> +		mmiowb();
>> +	}
>> +
>> +	ms->nesting_count--;
>> +}
>
> Humour me for a minute and tell me what this algorithm is doing, or
> what was broken about the powerpc one, which is basically:
>
> static inline void mmiowb_set_pending(void)
> {
> 	struct mmiowb_state *ms = __mmiowb_state();
> 	ms->mmiowb_pending = 1;
> }
>
> static inline void mmiowb_spin_lock(void)
> {
> }

The current powerpc code clears io_sync in spin_lock().

ie, it would be equivalent to:

static inline void mmiowb_spin_lock(void)
{
	ms->mmiowb_pending = 0;
}

Which means that:

	spin_lock(a);

	writel(x, y);

	spin_lock(b);
	...
	spin_unlock(b);

	spin_unlock(a);

Does no barrier.

cheers
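To make the difference concrete, here is an illustrative trace (comments only; the state values are hypothetical per-CPU bookkeeping) of how the nesting_count scheme in Will's patch behaves on the nested-lock sequence above, where the existing powerpc behaviour of clearing the flag on lock loses the barrier entirely:

	/* Per-CPU state starts as nesting_count = 0, mmiowb_pending = 0. */
	spin_lock(a);	/* mmiowb_spin_lock():   nesting_count = 1 */

	writel(x, y);	/* mmiowb_set_pending(): mmiowb_pending = nesting_count = 1 */

	spin_lock(b);	/* mmiowb_spin_lock():   nesting_count = 2 */
	...		/* no I/O here */
	spin_unlock(b);	/* mmiowb_spin_unlock(): pending != 0, so mmiowb() is issued
			 * and pending cleared; nesting_count drops back to 1 */

	spin_unlock(a);	/* mmiowb_spin_unlock(): pending already 0, no second
			 * barrier; nesting_count drops back to 0 */

So the writel() is still ordered before the outer lock is released, whereas the powerpc-equivalent code would have cleared the pending flag at spin_lock(b) and issued no barrier at all.
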
Linus Torvalds's on March 3, 2019 2:29 pm:
> On Sat, Mar 2, 2019, 19:34 Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> It doesn't have to be done all at once with this series, obviously this
>> is a big improvement on its own. But why perpetuate the nomenclature
>> and concept for new code added now?
>
> What nomenclature?
>
> Nobody will be using mmiowb(). That's the whole point of the patch series.
>
> It's now an entirely internal name, and nobody cares.

Why even bother with it at all, "internal" or not? Just get rid of
mmiowb, the concept is obsolete.

> And none of this has anything to do with wmb(), since it's about IO being
> ordered across cpu's by spin locks, not by barriers.
>
> So I'm not seeing what you're arguing about.

Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb
barriers order IO across CPUs with respect to their cacheable accesses.
Regardless of whether that cacheable access is a spin lock, a bit lock,
an atomic, a mutex... This is how it was before mmiowb came along.

Nothing wrong with this series to make spinlocks order mmio, but why
call it mmiowb? Another patch could rename ia64's mmiowb and then the
name can be removed from the tree completely.

Thanks,
Nick
Michael Ellerman's on March 3, 2019 7:26 pm:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Will Deacon's on March 2, 2019 12:03 am:
>>> In preparation for removing all explicit mmiowb() calls from driver
>>> code, implement a tracking system in asm-generic based loosely on the
>>> PowerPC implementation. This allows architectures with a non-empty
>>> mmiowb() definition to have the barrier automatically inserted in
>>> spin_unlock() following a critical section containing an I/O write.
>>
>> Is there a reason to call this "mmiowb"? We already have wmb that
>> orders cacheable stores vs mmio stores don't we?
>>
>> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
>> anyone really cares about the platform any more). Maybe that's
>> orthogonal to what you're doing here, I just don't like seeing
>> "mmiowb" spread.
>>
>> This series works for spin locks, but you would want a driver to
>> be able to use wmb() to order locks vs mmio when using a bit lock
>> or a mutex or whatever else. Calling your wmb-if-io-is-pending
>> version io_mb_before_unlock() would kind of match with existing
>> patterns.
>>
>>> +static inline void mmiowb_set_pending(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +	ms->mmiowb_pending = ms->nesting_count;
>>> +}
>>> +
>>> +static inline void mmiowb_spin_lock(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +	ms->nesting_count++;
>>> +}
>>> +
>>> +static inline void mmiowb_spin_unlock(void)
>>> +{
>>> +	struct mmiowb_state *ms = __mmiowb_state();
>>> +
>>> +	if (unlikely(ms->mmiowb_pending)) {
>>> +		ms->mmiowb_pending = 0;
>>> +		mmiowb();
>>> +	}
>>> +
>>> +	ms->nesting_count--;
>>> +}
>>
>> Humour me for a minute and tell me what this algorithm is doing, or
>> what was broken about the powerpc one, which is basically:
>>
>> static inline void mmiowb_set_pending(void)
>> {
>> 	struct mmiowb_state *ms = __mmiowb_state();
>> 	ms->mmiowb_pending = 1;
>> }
>>
>> static inline void mmiowb_spin_lock(void)
>> {
>> }
>
> The current powerpc code clears io_sync in spin_lock().
>
> ie, it would be equivalent to:
>
> static inline void mmiowb_spin_lock(void)
> {
> 	ms->mmiowb_pending = 0;
> }

Ah okay that's what I missed. How about we just not do that?

Thanks,
Nick
On Sun, Mar 3, 2019 at 2:05 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Why even bother with it at all, "internal" or not? Just get rid of
> mmiowb, the concept is obsolete.

It *is* gone, for chrissake! Only the name remains as an internal
detail of "this is what we need to do".

> Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb
> barriers order IO across CPUs with respect to their cacheable accesses.

Stop with the total red herring already.

THIS HAS NOTHING TO DO WITH mb()/wmb().

As long as you keep bringing those up, you're only showing that you're
talking about the wrong thing.

> Regardless of whether that cacheable access is a spin lock, a bit lock,
> an atomic, a mutex... This is how it was before mmiowb came along.

No.

Before mmiowb() came along, there was one rule: do what x86 does.

And x86 orders mmio inside spinlocks.

Seriously.

Notice how there's not a single "barrier" mentioned here anywhere in
the above. No "mb()", no "wmb()", no nothing. Only "spinlocks order
IO".

That's the fundamental rule (that we broke for ia64), and all that
matters for this patch series.

Stop talking about wmb(). It's irrelevant. A spinlock does not
*contain* a wmb(). Nobody even _cares_ about wmb(). They are entirely
irrelevant wrt IO, because IO is ordered on any particular CPU anyway
(which is what wmb() enforces).

Only when you do special things like __raw_writel() etc does wmb()
matter, but at that point this whole series is entirely irrelevant,
and once again, that's still about just ordering on a single CPU.

So as long as you talk about wmb(), all you show is that you're
talking about something entirely different FROM THIS WHOLE SERIES.

And like it or not, ia64 still exists. We support it. It doesn't
_matter_ and we don't much care any more, but it still exists. Which
is why we have that concept of mmiowb().

On other platforms, mmiowb() might be a wmb(). Or it might not. It
might be some other barrier, or it might be a no-op entirely without a
barrier at all. It doesn't matter. But mmiowb() exists, and is now
happily entirely hidden inside the rule of "spinlocks order MMIO
across CPU's".

                Linus
Nicholas Piggin <npiggin@gmail.com> writes:
> Michael Ellerman's on March 3, 2019 7:26 pm:
>> Nicholas Piggin <npiggin@gmail.com> writes:
...
>>> what was broken about the powerpc one, which is basically:
>>>
>>> static inline void mmiowb_set_pending(void)
>>> {
>>> 	struct mmiowb_state *ms = __mmiowb_state();
>>> 	ms->mmiowb_pending = 1;
>>> }
>>>
>>> static inline void mmiowb_spin_lock(void)
>>> {
>>> }
>>
>> The current powerpc code clears io_sync in spin_lock().
>>
>> ie, it would be equivalent to:
>>
>> static inline void mmiowb_spin_lock(void)
>> {
>> 	ms->mmiowb_pending = 0;
>> }
>
> Ah okay that's what I missed. How about we just not do that?

Yeah I thought of that too but it's not great. We'd start semi-randomly
executing the sync in unlock depending on whether someone had done IO
on that CPU prior to the spinlock.

eg.

	writel(x, y);		// sets paca->io_sync
	...

	<schedule>

	spin_lock(a);
	...
	// No IO in here
	...
	spin_unlock(a);		// sync() here because other task did writel().

Which wouldn't be *incorrect*, but would be kind of weird.

cheers
Nicholas Piggin <npiggin@gmail.com> writes:
> Will Deacon's on March 2, 2019 12:03 am:
>> In preparation for removing all explicit mmiowb() calls from driver
>> code, implement a tracking system in asm-generic based loosely on the
>> PowerPC implementation. This allows architectures with a non-empty
>> mmiowb() definition to have the barrier automatically inserted in
>> spin_unlock() following a critical section containing an I/O write.
>
> Is there a reason to call this "mmiowb"? We already have wmb that
> orders cacheable stores vs mmio stores don't we?
>
> Yes ia64 "sn2" is broken in that case, but that can be fixed (if
> anyone really cares about the platform any more). Maybe that's
> orthogonal to what you're doing here, I just don't like seeing
> "mmiowb" spread.
>
> This series works for spin locks, but you would want a driver to
> be able to use wmb() to order locks vs mmio when using a bit lock
> or a mutex or whatever else.

Without wading into the rest of the discussion, this does raise an
interesting point, ie. what about eg. rwlock's?

They're basically equivalent to spinlocks, and so could reasonably be
expected to have the same behaviour. But we don't check the io_sync
flag in arch_read/write_unlock() etc. and both of those use lwsync.

Seems like we just forgot they existed? Commit f007cacffc88 ("[POWERPC]
Fix MMIO ops to provide expected barrier behaviour") that added the
io_sync stuff doesn't mention them at all.

Am I missing anything? AFAICS read/write locks were never built on top
of spin locks, so seems like we're just hoping drivers using rwlock do
the right barriers?

cheers
On Mon, Mar 4, 2019 at 2:24 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Without wading into the rest of the discussion, this does raise an
> interesting point, ie. what about eg. rwlock's?
>
> They're basically equivalent to spinlocks, and so could reasonably be
> expected to have the same behaviour.
>
> But we don't check the io_sync flag in arch_read/write_unlock() etc. and
> both of those use lwsync.

I think technically rwlocks should do the same thing, at least when
they are used for exclusion.

Because of the exclusion argument, we can presumably limit it to just
write_unlock(), although at least in theory I guess you could have
some "one reader does IO, then a writer comes in" situation..

Perhaps more importantly, what about sleeping locks? When they
actually *block*, they get the barrier thanks to the scheduler, but
you can have a nice non-contended sequence that never does that.

I guess the fact that these cases have never even shown up as an issue
means that we could just continue to ignore it.

We could even give that approach some fancy name, and claim it as a
revolutionary new programming paradigm ("ostrich programming" to go
with "agile" and "pair programming").

                Linus
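A minimal sketch of what extending the series to rwlocks might look like, assuming the same generic hooks were reused; this is not part of the posted series, and the wrappers below are simplified (lockdep, preemption and the locking-API layering are all ignored) purely to show where the hooks would sit:

	/* Illustrative only: give the write side of rwlocks the same
	 * treatment spinlocks get in Will's series. */
	static inline void write_lock_sketch(rwlock_t *lock)
	{
		arch_write_lock(&lock->raw_lock);
		mmiowb_spin_lock();		/* bump nesting_count */
	}

	static inline void write_unlock_sketch(rwlock_t *lock)
	{
		mmiowb_spin_unlock();		/* mmiowb() if an I/O write is pending */
		arch_write_unlock(&lock->raw_lock);
	}

Whether read_unlock() would also need the hook is exactly the "one reader does IO" question raised above.
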
Linus Torvalds's on March 4, 2019 4:48 am:
> On Sun, Mar 3, 2019 at 2:05 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>>
>> Why even bother with it at all, "internal" or not? Just get rid of
>> mmiowb, the concept is obsolete.
>
> It *is* gone, for chrissake! Only the name remains as an internal
> detail of "this is what we need to do".
>
>> Pretend ia64 doesn't exist for a minute. Now the regular mb/wmb
>> barriers order IO across CPUs with respect to their cacheable accesses.
>
> Stop with the total red herring already.
>
> THIS HAS NOTHING TO DO WITH mb()/wmb().
>
> As long as you keep bringing those up, you're only showing that you're
> talking about the wrong thing.

Why? I'm talking about them because they are not taken care of by this
part of mmiowb removal. Talking about spin locks is the wrong thing
because we're already past that and everybody agrees it's the right
approach.

>> Regardless of whether that cacheable access is a spin lock, a bit lock,
>> an atomic, a mutex... This is how it was before mmiowb came along.
>
> No.
>
> Before mmiowb() came along, there was one rule: do what x86 does.
>
> And x86 orders mmio inside spinlocks.
>
> Seriously.
>
> Notice how there's not a single "barrier" mentioned here anywhere in
> the above. No "mb()", no "wmb()", no nothing. Only "spinlocks order
> IO".
>
> That's the fundamental rule (that we broke for ia64), and all that
> matters for this patch series.
>
> Stop talking about wmb(). It's irrelevant. A spinlock does not
> *contain* a wmb().

Well you don't have to talk about it but why do you want me to stop?
I don't understand. It's an open topic still after this series. I can
post a new thread about it if that would upset you less, I just thought
it would kind of fit here because we're talking about mmiowb, I'm not
trying to derail this series.

> Nobody even _cares_ about wmb(). They are entirely irrelevant wrt IO,
> because IO is ordered on any particular CPU anyway (which is what
> wmb() enforces).
>
> Only when you do special things like __raw_writel() etc does wmb()
> matter, but at that point this whole series is entirely irrelevant,
> and once again, that's still about just ordering on a single CPU.
>
> So as long as you talk about wmb(), all you show is that you're
> talking about something entirely different FROM THIS WHOLE SERIES.
>
> And like it or not, ia64 still exists. We support it. It doesn't
> _matter_ and we don't much care any more, but it still exists. Which
> is why we have that concept of mmiowb().
>
> On other platforms, mmiowb() might be a wmb(). Or it might not. It
> might be some other barrier, or it might be a no-op entirely without a
> barrier at all. It doesn't matter. But mmiowb() exists, and is now
> happily entirely hidden inside the rule of "spinlocks order MMIO
> across CPU's".

The driver writer still has to know exactly as much about mmiowb (the
concept, if not the name) before this series as afterward. That is,
sequences of mmio stores to a device from different CPUs can only be
atomic if you (put mmiowb before spin unlock | protect them with spin
locks).

I just don't understand the reason to expose the driver writer to that
additional detail. Intuitively, mb() should order stores to all kinds
of memory the same as smp_mb() orders stores to cacheable (without the
detail of stores being reordered at the interconnect or controller --
the driver writer doesn't care about store queues in the CPU or
whatever details, they want the device to see IOs in some order).

Thanks,
Nick
Michael Ellerman's on March 4, 2019 11:01 am:
> Nicholas Piggin <npiggin@gmail.com> writes:
>> Michael Ellerman's on March 3, 2019 7:26 pm:
>>> Nicholas Piggin <npiggin@gmail.com> writes:
> ...
>>>> what was broken about the powerpc one, which is basically:
>>>>
>>>> static inline void mmiowb_set_pending(void)
>>>> {
>>>> 	struct mmiowb_state *ms = __mmiowb_state();
>>>> 	ms->mmiowb_pending = 1;
>>>> }
>>>>
>>>> static inline void mmiowb_spin_lock(void)
>>>> {
>>>> }
>>>
>>> The current powerpc code clears io_sync in spin_lock().
>>>
>>> ie, it would be equivalent to:
>>>
>>> static inline void mmiowb_spin_lock(void)
>>> {
>>> 	ms->mmiowb_pending = 0;
>>> }
>>
>> Ah okay that's what I missed. How about we just not do that?
>
> Yeah I thought of that too but it's not great. We'd start semi-randomly
> executing the sync in unlock depending on whether someone had done IO
> on that CPU prior to the spinlock.
>
> eg.
>
> 	writel(x, y);		// sets paca->io_sync
> 	...
>
> 	<schedule>
>
> 	spin_lock(a);
> 	...
> 	// No IO in here
> 	...
> 	spin_unlock(a);		// sync() here because other task did writel().
>
> Which wouldn't be *incorrect*, but would be kind of weird.

schedule is probably okay, we could clear pending there. But you
possibly could get interrupts, or some lock free mmios that set the
flag. Does it matter that much? A random cache miss could have the
same effect.

It may matter slightly less for powerpc because we don't inline spin
locks, although I have been hoping to for a while, this might put the
nail in that. We can always tinker with it later though so I won't
insist.

Thanks,
Nick
On Mon, Mar 4, 2019 at 4:21 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>
> Well you don't have to talk about it but why do you want me to stop?
> I don't understand. It's an open topic still after this series. I can
> post a new thread about it if that would upset you less, I just thought
> it would kind of fit here because we're talking about mmiowb, I'm not
> trying to derail this series.

Because if anybody is doing lockless programming with IO, they deserve
whatever they get.

In other words, the whole "wmb()" issue is basically not an issue. We
already have rules like:

 - mmio is ordered wrt itself

 - mmio is ordered wrt previous memory ops (because of dma)

and while it turned out that at least alpha had broken those rules at
some point, and we had a discussion about it, that was just a bug.

So there's basically no real reason to ever use "wmb()" with any of
the normal mmio stuff.

Now, we do have __raw_writel() etc, which are explicitly not ordered,
but they also haven't been really standardized. And in fact, people
who use them often seem to want to use them together with various weak
memory remappings.

And yes, "wmb()" has been the traditional way to order those, to the
point where "wmb()" on x86 is actually a "sfence" because once you do
IO on those kinds of unordered mappings, the usual SMP rules go out
the window (a normal "smp_wmb()" is just a compiler barrier on x86).

But notice how this is *entirely* independent of the spinlock issue,
and has absolutely *nothing* to do with the whole mmiowb() thing that
was always about "normal IO vs normal RAM" (due to the ia64 breakage).

                Linus
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, Mar 4, 2019 at 2:24 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Without wading into the rest of the discussion, this does raise an
>> interesting point, ie. what about eg. rwlock's?
>>
>> They're basically equivalent to spinlocks, and so could reasonably be
>> expected to have the same behaviour.
>>
>> But we don't check the io_sync flag in arch_read/write_unlock() etc. and
>> both of those use lwsync.
>
> I think technically rwlocks should do the same thing, at least when
> they are used for exclusion.

OK.

> Because of the exclusion argument, we can presumably limit it to just
> write_unlock(), although at least in theory I guess you could have
> some "one reader does IO, then a writer comes in" situation..

It's a bit hard to grep for, but I did find one case:

static void netxen_nic_io_write_128M(struct netxen_adapter *adapter,
		void __iomem *addr, u32 data)
{
	read_lock(&adapter->ahw.crb_lock);
	writel(data, addr);
	read_unlock(&adapter->ahw.crb_lock);
}

It looks like that driver is using the rwlock to exclude cases that can
just do a readl()/writel() (readers) vs another case that has to
reconfigure a window or something, before doing readl()/writel() and
then configuring the window back. So that seems like a valid use for a
rwlock.

Whether we want to penalise all read_unlock() usages with a mmiowb()
check just to support that one driver is another question.

> Perhaps more importantly, what about sleeping locks? When they
> actually *block*, they get the barrier thanks to the scheduler, but
> you can have a nice non-contended sequence that never does that.

Yeah.

The mutex unlock fast path is just:

	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
		return true;

And because it's the "release" variant we just use lwsync, which
doesn't order MMIO. If it was just atomic_long_cmpxchg() that would
work because we use sync for those.

__up_write() uses atomic_long_sub_return_release(), so same story.

> I guess the fact that these cases have never even shown up as an issue
> means that we could just continue to ignore it.
>
> We could even give that approach some fancy name, and claim it as a
> revolutionary new programming paradigm ("ostrich programming" to go
> with "agile" and "pair programming").

Maybe.

On power we have the double whammy of weaker ordering than other arches
and infinitesimal market share, which makes me worry that there are
bugs lurking that we just haven't found, it's happened before.

cheers
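To make the sleeping-lock concern concrete, here is a hypothetical driver pattern (the device, mutex and register names are invented for illustration, not taken from any real driver) that the series does not cover, because the fast-path unlock described above is only a release operation:

	/* Illustrative only: MMIO serialized by a mutex rather than a spinlock.
	 * On the uncontended fast path, mutex_unlock() is just a release-ordered
	 * cmpxchg (lwsync on powerpc), which does not order the writel() against
	 * MMIO issued by whichever CPU takes the mutex next. */
	mutex_lock(&dev->cfg_lock);
	writel(val, dev->regs + CFG_REG);	/* MMIO write under the mutex */
	mutex_unlock(&dev->cfg_lock);		/* release only: no sync on the fast path */
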
On Wed, Mar 6, 2019 at 4:48 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> It's a bit hard to grep for, but I did find one case:
>
> static void netxen_nic_io_write_128M(struct netxen_adapter *adapter,
> 		void __iomem *addr, u32 data)
> {
> 	read_lock(&adapter->ahw.crb_lock);
> 	writel(data, addr);
> 	read_unlock(&adapter->ahw.crb_lock);
> }
>
> It looks like that driver is using the rwlock to exclude cases that can
> just do a readl()/writel() (readers) vs another case that has to
> reconfigure a window or something, before doing readl()/writel() and
> then configuring the window back. So that seems like a valid use for a
> rwlock.

Oh, it's actually fairly sane: the IO itself is apparently windowed on
that hardware, and the *common* case is that you'd access "window 1".

So if everybody accesses window 1, they can all work in parallel - the
read case.

But if somebody needs to access any of the other special IO windows,
they need to take the write lock, then change the window pointer to
the window they want to access, do the access, and then set it back to
the default "window 1".

So yes. That driver very much relies on exclusion of the IO through an
rwlock.

I'm guessing nobody uses that hardware on Power? Or maybe the "window 1
is common" is *so* common that the other cases basically never happen
and don't really end up ever causing problems?

[ Time passes, I look at it ]

Actually, the driver probably works on Power, because *setting* the
window isn't just a write to the window register, it's always
serialized by a read _from_ the window register to verify that the
write "took". Apparently the hardware itself really needs that "don't
do accesses to the window before I've settled".

And that would basically serialize the only operation that really
needs serialization, so in the case of _that_ driver, it all looks
safe. Even if it's partly by luck.

> > Perhaps more importantly, what about sleeping locks? When they
> > actually *block*, they get the barrier thanks to the scheduler, but
> > you can have a nice non-contended sequence that never does that.
>
> Yeah.
>
> The mutex unlock fast path is just:

Yup. Both lock/unlock have fast paths that should be just trivial
atomic sequences.

But the good news is that *usually* device IO is protected by a
spinlock, since you almost always have interrupt serialization needs
too whenever you have any sequence of MMIO that isn't just some "write
single word to start the engine".

So the "use mutex to serialize IO" may be fairly unusual.

                Linus
On Thu, Mar 07, 2019 at 11:47:53AM +1100, Michael Ellerman wrote:
> The mutex unlock fast path is just:
>
> 	if (atomic_long_cmpxchg_release(&lock->owner, curr, 0UL) == curr)
> 		return true;
>
> And because it's the "release" variant we just use lwsync, which doesn't
> order MMIO. If it was just atomic_long_cmpxchg() that would work because
> we use sync for those.
>
> __up_write() uses atomic_long_sub_return_release(), so same story.

As does spin_unlock() of course, which is a great segue into... my
RCsc desires :-)

If all your unlocks were to have SYNC, your locks would, aside from
ordering MMIO, also be RCsc. Win-Win :-)

There is, of course, that pesky little performance detail that keeps
getting in the way.
diff --git a/include/asm-generic/mmiowb.h b/include/asm-generic/mmiowb.h
new file mode 100644
index 000000000000..9439ff037b2d
--- /dev/null
+++ b/include/asm-generic/mmiowb.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_H
+#define __ASM_GENERIC_MMIOWB_H
+
+/*
+ * Generic implementation of mmiowb() tracking for spinlocks.
+ *
+ * If your architecture doesn't ensure that writes to an I/O peripheral
+ * within two spinlocked sections on two different CPUs are seen by the
+ * peripheral in the order corresponding to the lock handover, then you
+ * need to follow these FIVE easy steps:
+ *
+ * 	1. Implement mmiowb() (and arch_mmiowb_state() if you're fancy)
+ *	   in asm/mmiowb.h, then #include this file
+ * 	2. Ensure your I/O write accessors call mmiowb_set_pending()
+ * 	3. Select ARCH_HAS_MMIOWB
+ * 	4. Untangle the resulting mess of header files
+ * 	5. Complain to your architects
+ */
+#ifdef CONFIG_MMIOWB
+
+#include <linux/compiler.h>
+#include <asm-generic/mmiowb_types.h>
+
+#ifndef arch_mmiowb_state
+#include <asm/percpu.h>
+#include <asm/smp.h>
+
+DECLARE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+#define __mmiowb_state()	this_cpu_ptr(&__mmiowb_state)
+#else
+#define __mmiowb_state()	arch_mmiowb_state()
+#endif	/* arch_mmiowb_state */
+
+static inline void mmiowb_set_pending(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+	ms->mmiowb_pending = ms->nesting_count;
+}
+
+static inline void mmiowb_spin_lock(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+	ms->nesting_count++;
+}
+
+static inline void mmiowb_spin_unlock(void)
+{
+	struct mmiowb_state *ms = __mmiowb_state();
+
+	if (unlikely(ms->mmiowb_pending)) {
+		ms->mmiowb_pending = 0;
+		mmiowb();
+	}
+
+	ms->nesting_count--;
+}
+#else
+#define mmiowb_set_pending()		do { } while (0)
+#define mmiowb_spin_lock()		do { } while (0)
+#define mmiowb_spin_unlock()		do { } while (0)
+#endif	/* CONFIG_MMIOWB */
+#endif	/* __ASM_GENERIC_MMIOWB_H */
diff --git a/include/asm-generic/mmiowb_types.h b/include/asm-generic/mmiowb_types.h
new file mode 100644
index 000000000000..8eb0095655e7
--- /dev/null
+++ b/include/asm-generic/mmiowb_types.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_MMIOWB_TYPES_H
+#define __ASM_GENERIC_MMIOWB_TYPES_H
+
+#include <linux/types.h>
+
+struct mmiowb_state {
+	u16	nesting_count;
+	u16	mmiowb_pending;
+};
+
+#endif	/* __ASM_GENERIC_MMIOWB_TYPES_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..82fa481ecb78 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -248,3 +248,10 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config ARCH_HAS_MMIOWB
+	bool
+
+config MMIOWB
+	def_bool y if ARCH_HAS_MMIOWB
+	depends on SMP
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index 936f3d14dd6b..0ff08380f531 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -22,6 +22,13 @@
 #include <linux/debug_locks.h>
 #include <linux/export.h>
 
+#ifdef CONFIG_MMIOWB
+#ifndef arch_mmiowb_state
+DEFINE_PER_CPU(struct mmiowb_state, __mmiowb_state);
+EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
+#endif
+#endif
+
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
In preparation for removing all explicit mmiowb() calls from driver
code, implement a tracking system in asm-generic based loosely on the
PowerPC implementation. This allows architectures with a non-empty
mmiowb() definition to have the barrier automatically inserted in
spin_unlock() following a critical section containing an I/O write.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 include/asm-generic/mmiowb.h       | 63 ++++++++++++++++++++++++++++++++++++++
 include/asm-generic/mmiowb_types.h | 12 ++++++++
 kernel/Kconfig.locks               |  7 +++++
 kernel/locking/spinlock.c          |  7 +++++
 4 files changed, 89 insertions(+)
 create mode 100644 include/asm-generic/mmiowb.h
 create mode 100644 include/asm-generic/mmiowb_types.h

-- 
2.11.0
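For readers wanting to see what the "FIVE easy steps" in the header comment amount to in practice, here is a heavily simplified sketch for a made-up architecture; every name below other than the generic mmiowb helpers (the "foo" arch, its header paths, and the use of wmb() as the barrier) is an assumption chosen for illustration, not something mandated by the series:

	/* Step 1 (arch/foo/include/asm/mmiowb.h, hypothetical): provide the
	 * arch barrier that pushes MMIO out before a lock handover, then pull
	 * in the generic tracking code.  wmb() is a stand-in here; a real port
	 * supplies whatever instruction its I/O fabric actually needs. */
	#ifndef _ASM_FOO_MMIOWB_H
	#define _ASM_FOO_MMIOWB_H

	#include <asm/barrier.h>

	#define mmiowb()	wmb()

	#include <asm-generic/mmiowb.h>

	#endif /* _ASM_FOO_MMIOWB_H */

	/* Step 2 (in the arch's asm/io.h, sketch): the write accessor records
	 * that an I/O write is pending, so a later spin_unlock() knows it must
	 * issue mmiowb(). */
	#define writel(v, c)	({ __raw_writel(v, c); mmiowb_set_pending(); })

	/* Step 3: select ARCH_HAS_MMIOWB from the architecture's Kconfig so
	 * CONFIG_MMIOWB is enabled on SMP builds. */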