diff mbox series

random: use correct memory barriers for crng_node_pool

Message ID 20200916233042.51634-1-ebiggers@kernel.org
State Superseded
Headers show
Series random: use correct memory barriers for crng_node_pool | expand

Commit Message

Eric Biggers Sept. 16, 2020, 11:30 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

When a CPU selects which CRNG to use, it accesses crng_node_pool without
a memory barrier.  That's wrong, because crng_node_pool can be set by
another CPU concurrently.  Without a memory barrier, the crng_state that
is used might not appear to be fully initialized.

There's an explicit mb() on the write side, but it's redundant with
cmpxchg() (or cmpxchg_release()) and does nothing to fix the read side.

Implement this correctly by using a cmpxchg_release() +
smp_load_acquire() pair.

Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 drivers/char/random.c | 42 ++++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 20 deletions(-)

Comments

Eric Biggers Sept. 17, 2020, 4:58 p.m. UTC | #1
On Thu, Sep 17, 2020 at 05:26:44PM +1000, Herbert Xu wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:

> > From: Eric Biggers <ebiggers@google.com>

> > 

> > When a CPU selects which CRNG to use, it accesses crng_node_pool without

> > a memory barrier.  That's wrong, because crng_node_pool can be set by

> > another CPU concurrently.  Without a memory barrier, the crng_state that

> > is used might not appear to be fully initialized.

> 

> The only architecture that requires a barrier for data dependency

> is Alpha.  The correct primitive to ensure that barrier is present

> is smp_barrier_depends, or you could just use READ_ONCE.

> 


smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
that is difficult to tell whether it's correct or not.  For trivial data
structures it's "easy" to tell.  But whenever there is a->b where b is an
internal implementation detail of another kernel subsystem, the use of which
could involve accesses to global or static data (for example, spin_lock()
accessing lockdep stuff), a control dependency can slip in.

The last time I tried to use READ_ONCE(), it started a big controversy
(https://lkml.kernel.org/linux-fsdevel/20200713033330.205104-1-ebiggers@kernel.org/T/#u,
https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u,
https://lwn.net/Articles/827180/).  In the end, people refused to even allow the
READ_ONCE() optimization to be documented, because they felt that
smp_load_acquire() should just be used instead.

So I think we should just go with smp_load_acquire()...

- Eric
Herbert Xu Sept. 21, 2020, 8:19 a.m. UTC | #2
On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:
>

> smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization

> that is difficult to tell whether it's correct or not.  For trivial data

> structures it's "easy" to tell.  But whenever there is a->b where b is an

> internal implementation detail of another kernel subsystem, the use of which

> could involve accesses to global or static data (for example, spin_lock()

> accessing lockdep stuff), a control dependency can slip in.


If we're going to follow this line of reasoning, surely you should
be converting the RCU derference first and foremost, no?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu Sept. 21, 2020, 10:11 p.m. UTC | #3
On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:

> > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:

> > >

> > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization

> > > that is difficult to tell whether it's correct or not.  For trivial data

> > > structures it's "easy" to tell.  But whenever there is a->b where b is an

> > > internal implementation detail of another kernel subsystem, the use of which

> > > could involve accesses to global or static data (for example, spin_lock()

> > > accessing lockdep stuff), a control dependency can slip in.

> > 

> > If we're going to follow this line of reasoning, surely you should

> > be converting the RCU derference first and foremost, no?


...

> And to Eric's point, it is also true that when you have pointers to

> static data, and when the compiler can guess this, you do need something

> like smp_load_acquire().  But this is a problem only when you are (1)

> using feedback-driven compiler optimization or (2) when you compare the

> pointer to the address of the static data.


Let me restate what I think Eric is saying.  He is concerned about
the case where a->b and b is some opaque object that may in turn
dereference a global data structure unconnected to a.  The case
in question here is crng_node_pool in drivers/char/random.c which
in turn contains a spin lock.

But this reasoning could apply to any data structure that contains
a spin lock, in particular ones that are dereferenced through RCU.

So my question if this reasoning is valid, then why aren't we first
converting rcu_dereference to use smp_load_acquire?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Paul E. McKenney Sept. 21, 2020, 11:26 p.m. UTC | #4
On Tue, Sep 22, 2020 at 08:11:04AM +1000, Herbert Xu wrote:
> On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:

> > On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:

> > > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:

> > > >

> > > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization

> > > > that is difficult to tell whether it's correct or not.  For trivial data

> > > > structures it's "easy" to tell.  But whenever there is a->b where b is an

> > > > internal implementation detail of another kernel subsystem, the use of which

> > > > could involve accesses to global or static data (for example, spin_lock()

> > > > accessing lockdep stuff), a control dependency can slip in.

> > > 

> > > If we're going to follow this line of reasoning, surely you should

> > > be converting the RCU derference first and foremost, no?

> 

> ...

> 

> > And to Eric's point, it is also true that when you have pointers to

> > static data, and when the compiler can guess this, you do need something

> > like smp_load_acquire().  But this is a problem only when you are (1)

> > using feedback-driven compiler optimization or (2) when you compare the

> > pointer to the address of the static data.

> 

> Let me restate what I think Eric is saying.  He is concerned about

> the case where a->b and b is some opaque object that may in turn

> dereference a global data structure unconnected to a.  The case

> in question here is crng_node_pool in drivers/char/random.c which

> in turn contains a spin lock.


As long as the compiler generates code that reaches that global via
pointer a, everything will work fine.  Which it will, unless the guy
writing the code makes the mistake of introducing a comparison between the
pointer to be dereferenced and the address of the global data structure.

So this is OK:

	p = rcu_dereference(a);
	do_something(p->b);

This is not OK:

	p = rcu_dereference(a);
	if (p == &some_global_variable)
		we_really_should_not_have_done_that_comparison();
	do_something(p->b);

The reason this last is not OK is because the compiler can transform
it as follows:

	p = rcu_dereference(a);
	if (p == &some_global_variable) {
		we_really_should_not_have_done_that_comparison();
		do_something(some_global_variable.b);
	} else {
		do_something(p->b);
	}

The compiler is not allowed to make up that sort of comparison, based
on my February 2020 discussion with the standards committee.

> But this reasoning could apply to any data structure that contains

> a spin lock, in particular ones that are dereferenced through RCU.


I lost you on this one.  What is special about a spin lock?

Here is what I think you mean:

	struct foo {
		spinlock_t lock;
		int a;
		char b;
		long c;
	};

	struct foo *a;

	...

	p = rcu_dereference(a);
	BUG_ON(!p);
	if (is_this_the_one(p)) {
		spin_lock(p->lock);
		do_something_else(p);
		spin_unlock(p->lock);
	}

This should be fine.  Or were you thinking of some other example?

> So my question if this reasoning is valid, then why aren't we first

> converting rcu_dereference to use smp_load_acquire?


For LTO in ARM, rumor has it that Will is doing so.  Which was what
motivated the BoF on this topic at Linux Plumbers Conference.

							Thanx, Paul
Paul E. McKenney Sept. 22, 2020, 6:31 p.m. UTC | #5
On Mon, Sep 21, 2020 at 04:52:43PM -0700, Eric Biggers wrote:
> On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:

> > On Tue, Sep 22, 2020 at 08:11:04AM +1000, Herbert Xu wrote:

> > > On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:

> > > > On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:

> > > > > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:

> > > > > >

> > > > > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization

> > > > > > that is difficult to tell whether it's correct or not.  For trivial data

> > > > > > structures it's "easy" to tell.  But whenever there is a->b where b is an

> > > > > > internal implementation detail of another kernel subsystem, the use of which

> > > > > > could involve accesses to global or static data (for example, spin_lock()

> > > > > > accessing lockdep stuff), a control dependency can slip in.

> > > > > 

> > > > > If we're going to follow this line of reasoning, surely you should

> > > > > be converting the RCU derference first and foremost, no?

> > > 

> > > ...

> > > 

> > > > And to Eric's point, it is also true that when you have pointers to

> > > > static data, and when the compiler can guess this, you do need something

> > > > like smp_load_acquire().  But this is a problem only when you are (1)

> > > > using feedback-driven compiler optimization or (2) when you compare the

> > > > pointer to the address of the static data.

> > > 

> > > Let me restate what I think Eric is saying.  He is concerned about

> > > the case where a->b and b is some opaque object that may in turn

> > > dereference a global data structure unconnected to a.  The case

> > > in question here is crng_node_pool in drivers/char/random.c which

> > > in turn contains a spin lock.

> > 

> > As long as the compiler generates code that reaches that global via

> > pointer a, everything will work fine.  Which it will, unless the guy

> > writing the code makes the mistake of introducing a comparison between the

> > pointer to be dereferenced and the address of the global data structure.

> > 

> > So this is OK:

> > 

> > 	p = rcu_dereference(a);

> > 	do_something(p->b);

> > 

> > This is not OK:

> > 

> > 	p = rcu_dereference(a);

> > 	if (p == &some_global_variable)

> > 		we_really_should_not_have_done_that_comparison();

> > 	do_something(p->b);

> 

> If you call some function that's an internal implementation detail of some other

> kernel subsystem, how do you know it doesn't do that?


If the only globals I insert into my linked data structure are local to my
compilation unit, then the internal implementation details of some other
kernel subsystem in some other translation unit cannot do that comparison.

> Also, it's not just the p == &global_variable case.  Consider:

> 

> struct a { struct b *b; };

> struct b { ... };

> 

> Thread 1:

> 

> 	/* one-time initialized data shared by all instances of b */

> 	static struct c *c;

> 

> 	void init_b(struct a *a)

> 	{

> 		if (!c)

> 			c = alloc_c();

> 

> 		smp_store_release(&a->b, kzalloc(sizeof(struct b)));

> 	}

> 

> Thread 2:

> 

> 	void use_b_if_present(struct a *a)

> 	{

> 		struct b *b = READ_ONCE(a->b);

> 

> 		if (b) {

> 			c->... # crashes because c still appears to be NULL

> 		}

> 	}

> 

> 

> So when the *first* "b" is allocated, the global data "c" is initialized.  Then

> when using a "b", we expect to be able to access "c".  But there's no

> data dependency from "b" to "c"; it's a control dependency only.

> So smp_load_acquire() is needed, not READ_ONCE().

> 

> And it can be an internal implementation detail of "b"'s subsystem whether it

> happens to use global data "c".


Given that "c" is static, these two subsystems must be in the same
translation unit.  So I don't see how this qualifies as being internal to
"b"'s subsystem.

Besides which, control dependencies should be used only by LKMM experts
at this point.  Yes, we are trying to get the compiler people to give us
a way to tell the compiler about dependencies that we need to preserve,
but in the meantime, you beed to be really careful how you use them,
and you need to make sure that your external API can be used without
creating traps like the one you are driving at.

> This sort of thing is why people objected to the READ_ONCE() optimization during

> the discussion at

> https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u.

> Most kernel developers aren't experts in the LKMM, and they want something

> that's guaranteed to be correct without having to to think really hard about it

> and make assumptions about the internal implementation details of other

> subsystems, how compilers have implemented the C standard, and so on.


And smp_load_acquire()is provided for that reason.  Its name was
even based on the nomenclature used in the C standard and elsewhere.
And again, control dependencies are for LKMM experts, as they are very
tricky to get right.

But in the LKMM documentation, you are likely to find LKMM experts who
want to optimize all the way, particularly in cases like the one-time
init pattern where all the data is often local.  And the best basis for
READ_ONCE() in one-time init is not a control dependency, but rather
ordering of accesses to a single variable from a single task combined
with locking, both of which are quite robust and much easier to use,
especially in comparison to control dependencies.

My goal for LKMM is not that each and every developer have a full
understanding of every nook and cranny of that model, but instead that
people can find the primitives supporting the desired point in the
performance/simplicity tradoff space.  And yes, I have more writing
to do to make more progress towards that goal.

							Thanx, Paul
Paul E. McKenney Sept. 22, 2020, 8:31 p.m. UTC | #6
On Tue, Sep 22, 2020 at 11:59:31AM -0700, Eric Biggers wrote:
> On Tue, Sep 22, 2020 at 11:42:43AM -0700, Paul E. McKenney wrote:

> > On Tue, Sep 22, 2020 at 09:51:36AM +1000, Herbert Xu wrote:

> > > On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:

> > > >

> > > > > But this reasoning could apply to any data structure that contains

> > > > > a spin lock, in particular ones that are dereferenced through RCU.

> > > > 

> > > > I lost you on this one.  What is special about a spin lock?

> > > 

> > > I don't know, that was Eric's concern.  He is inferring that

> > > spin locks through lockdep debugging may trigger dependencies

> > > that require smp_load_acquire.

> > > 

> > > Anyway, my point is if it applies to crng_node_pool then it

> > > would equally apply to RCU in general.

> > 

> > Referring to the patch you call out below...

> > 

> > Huh.  The old cmpxchg() primitive is fully ordered, so the old mb()

> > preceding it must have been for correctly interacting with hardware on

> > !SMP systems.  If that is the case, then the use of cmpxchg_release()

> > is incorrect.  This is not the purview of the memory model, but rather

> > of device-driver semantics.  Or does crng not (or no longer, as the case

> > might be) interact with hardware RNGs?

> 

> No hardware involved here.  The mb() is just unnecessary, as I noted in my patch

> https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/.

> 

> > What prevents either the old or the new code from kfree()ing the old

> > state out from under another CPU that just now picked up a pointer to the

> > old state?  The combination of cmpxchg_release() and smp_load_acquire()

> > won't do anything to prevent this from happening.  This is after all not

> > a memory-ordering issue, but instead an object-lifetime issue.  But maybe

> > you have a lock or something that provides the needed protection.  I don't

> > see how this can be the case and still require the cmpxchg_release()

> > and smp_load_acquire(), but perhaps this is a failure of imagination on

> > my part.

> 

> crng_node_pool is initialized only once, and never freed.


Thank you on both counts!

							Thanx, Paul
Paul E. McKenney Sept. 25, 2020, 12:59 a.m. UTC | #7
On Tue, Sep 22, 2020 at 02:55:58PM -0700, Eric Biggers wrote:
> On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:

> > > You're missing the point here.  b and c could easily be allocated by a function

> > > alloc_b() that's in another file.

> > 

> > I am still missing something.

> > 

> > If by "allocated" you mean something like kmalloc(), the compiler doesn't

> > know the address.  If you instead mean that there is a function that

> > returns the address of another translation unit's static variable, then

> > any needed ordering should preferably be built into that function's API.

> > Either way, one would hope for some documentation of anything the caller

> > needed to be careful of.

> > 

> > > > Besides which, control dependencies should be used only by LKMM experts

> > > > at this point.  

> > > 

> > > What does that even mean?  Control dependencies are everywhere.

> > 

> > Does the following work better for you?

> > 

> > "... the non-local ordering properties of control dependencies should be

> > relied on only by LKMM experts ...".

> 

> No.  I don't know what that means.  And I think very few people would know.

> 

> I just want to know if I use the one-time init pattern with a pointer to a data

> structure foo, are the readers using foo_use() supposed to use READ_ONCE() or

> are they supposed to use smp_load_acquire().

> 

> It seems the answer is that smp_load_acquire() is the only safe choice, since

> foo_use() *might* involve a control dependency, or might in the future since

> it's part of another kernel subsystem and its implementation could change.


First, the specific issue of one-time init.

If you are the one writing the code implementing one-time init, it is your
choice.  It seems like you prefer smp_load_acquire().  If someone sees
performance problems due to the resulting memory-barrier instructions,
they have the option of submitting a patch and either forking the
implementation or taking your implementation over from you, depending
on how that conversation goes.

Me, I have thus far managed to avoid relying on non-local ordering
from control dependencies.  Then again, I have not been working on ring
buffers or on certain scheduler fastpaths.

Either way, the API should document the requirements on the user.

I leave analysis of DO_ONCE() on the various architectures to people
who know static branches better than I do.

First, the fully-locked version is fine with normal C-language loads,
as written in your July patch to recipes.txt.  

Second, the flag-and-global variant of init_foo_if_needed() needs to use
smp_load_acquire(&foo_inited) if its callers are going to read from the
initialized data.  The reason for this is that control dependencies order
later stores, not later loads.  In this case READ_ONCE() absolutely does
not suffice.  Oddly enough, except on DEC Alpha.  ;-)

So this variant is also fine as written.

Third, your first pointer-based variant works with smp_load_acquire(),
but could instead use READ_ONCE() as long as it is reworked to something
like this:

	static struct foo *foo;
	static DEFINE_MUTEX(foo_init_mutex);

	// The returned pointer must be handled carefully in the same
	// way as would a pointer returned from rcu_derefeference().
	// See Documentation/RCU/rcu_dereference.rst.
	struct foo *init_foo_if_needed(void)
	{
		int err = 0;
		struct foo *retp;

		/* pairs with smp_store_release() below */
		retp = READ_ONCE(foo);
		if (retp)
			return retp;

		mutex_lock(&foo_init_mutex);
		if (!foo) {
			// The compiler doesn't know the address:
			struct foo *p = alloc_foo();

			if (p) /* pairs with smp_load_acquire() above */
				smp_store_release(&foo, p);
			else
				err = -ENOMEM;
		}
		mutex_unlock(&foo_init_mutex);
		return err;
	}

There are more than 2,000 instances of the rcu_dereference() family of
primitives in the v5.8 kernel, so it should not be hard to find people
who are familiar with it.  TL;DR:  Just dereference the pointer and you
will be fine.

So again, your first pointer-based variant is fine as is, but can be
safely optimized if its users are comfortable using RCU.  No special LKMM
expertise is required.  And there is no reliance on non-local ordering
from any control dependency.

The second pointer-based variant that uses cmpxchg_release() can have
its read path sped up similarly:

	// The returned pointer must be handled carefully in the same
	// way as would a pointer returned from rcu_derefeference().
	// See Documentation/RCU/rcu_dereference.rst.
	struct foo *init_foo_if_needed(void)
	{
		struct foo *p;
		struct foo *retp;

		/* pairs with successful cmpxchg_release() below */
		retp = READ_ONCE(foo);
		if (retp)
			return retp;

		// The compiler doesn't know the address:
		p = alloc_foo();
		if (!p)
			return -ENOMEM;

		/* on success, pairs with smp_load_acquire() above and below */
		if (cmpxchg_release(&foo, NULL, p) != NULL) {
			free_foo(p);
			/* pairs with successful cmpxchg_release() above */
			smp_load_acquire(&foo);
		}
		return 0;
	}

Again, this works fine if its users are comfortable using RCU, no special
LKMM knowledge is required.  And again, there is no reliance on non-local
ordering from any control dependency.

In both cases, the only reason to avoid smp_load_acquire() is performance.
And init_foo_if_needed() would need to be on a pretty hot code path for
anyone to be able to see the performance difference.

In short, this is all about engineering tradeoffs, which include not only
performance and scalability, but also ease of use, maintainability, and
so on and so forth.  I cannot possibly give you a "one size fits all"
answer, because one size does not fit all.

> > If this control dependency's non-local ordering places any requirements on

> > the users of that code, those requirements need to be clearly documented.

> > It is of course better if the control dependency's non-local ordering

> > properties are local to the code containing those control dependencies

> > so that the callers don't need to worry about the resulting non-local

> > ordering.

> > 

> > > > But in the LKMM documentation, you are likely to find LKMM experts who

> > > > want to optimize all the way, particularly in cases like the one-time

> > > > init pattern where all the data is often local.  And the best basis for

> > > > READ_ONCE() in one-time init is not a control dependency, but rather

> > > > ordering of accesses to a single variable from a single task combined

> > > > with locking, both of which are quite robust and much easier to use,

> > > > especially in comparison to control dependencies.

> > > > 

> > > > My goal for LKMM is not that each and every developer have a full

> > > > understanding of every nook and cranny of that model, but instead that

> > > > people can find the primitives supporting the desired point in the

> > > > performance/simplicity tradoff space.  And yes, I have more writing

> > > > to do to make more progress towards that goal.

> > > 

> > > So are you saying people should use smp_load_acquire(), or are you saying people

> > > should use READ_ONCE()?

> > 

> > C'mon, you know the answer to that!  ;-)

> > 

> > The answer is that it depends on both the people and the situation.

> > 

> > In the specific case of crng, where you need address dependency

> > ordering but the pointed-to data is dynamically allocated and never

> > deallocated, READ_ONCE() now suffices [1].  Of course, smp_load_acquire()

> > also suffices, at the cost of extra/expensive instructions on some

> > architectures.  The cmpxchg() needs at least release semantics, but

> > presumably no one cares if this operation is a bit more expensive than

> > it needs to be.

> > 

> > So, is select_crng() used on a fastpath?  If so, READ_ONCE()

> > might be necessary.  If not, why bother with anything stronger than

> > smp_load_acquire()?  The usual approach is to run this both ways on ARM

> > or PowerPC and see if it makes a significant difference.  If there is

> > no significant difference, keep it simple and just use smp_load_acquire().

> > 

> > If the code was sufficiently performance-insensitive, even better would

> > be to just use locking.  My hope is that no one bothered with the atomics

> > without a good reason, but you never know.

> > 

> > I confess some uncertainty as to how the transition from the global

> > primary_crng and the per-NUMA-node locks is handled.  I hope that the

> > global primary_crng guards global state that is disjoint from the state

> > being allocated by do_numa_crng_init()!

> 

> crng_node_pool just uses the one-time init pattern.  It's nothing unusual; lots

> of other places in the kernel want to do one-time initialization too.  It seems

> to be one of the more common cases where people run into the LKMM at all.

> I tried to document it in

> https://lkml.kernel.org/lkml/20200717044427.68747-1-ebiggers@kernel.org/T/#u,

> but people complained it was still too complicated.


Well, submitting that patch certainly did get a discussion going!

I am not sold on using one-time init as an example in recipes.txt.
But I don't see any of your implementations as being too complex for
the kernel, at least assuming that your more complex examples really
are justified by real performance data.  Lacking such data, you should
of course keep it simple.

> I hope that people can at least reach some general recommendation about

> READ_ONCE() vs. smp_load_acquire(), so that every kernel developer doesn't have

> to understand the detailed difference, and so that we don't need to have a long

> discussion (potentially requiring LWN coverage) about every patch.


I am sorry, but I don't expect that kind of general recommendation
any more than I expect the kernel to standardize on one particular
type of binary tree.

The best that I can do is to say that the more users a given artifact has,
the more worthwhile it is to invest in small increments of performance.
And scalability.  And usability.  And ...

> > Use the simplest thing that gets the job done.  Which in the Linux kernel

> > often won't be all that simple, but life is like that sometimes.

> > 

> > 							Thanx, Paul

> > 

> > [1]	It used to be that READ_ONCE() did -not- suffice on DEC Alpha,

> > 	but this has thankfully changed, so that lockless_dereference()

> > 	is no more.

> 

> Let me give an example using spinlock_t, since that's used in crng_node_pool.

> However, it could be any other data structure too; this is *just an example*.

> And it doesn't matter if the implementation is currently different; the point is

> that it's an *implementation*.

> 

> The allocation side uses spin_lock_init(), while the read side uses spin_lock().

> Let's say that some debugging feature is enabled where spin locks use some

> global debugging information (say, a list of all locks) that gets allocated the

> first time a spin lock is initialized:

> 

> 	static struct spin_lock_debug_info *debug_info;

> 	static DEFINE_MUTEX(debug_info_alloc_mutex);

> 

> 	void spin_lock_init(spinlock_t *lock)

> 	{

> 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> 		mutex_lock(&debug_info_alloc_mutex);

> 		if (!debug_info)

> 			debug_info = alloc_debug_info();

> 		add_lock(debug_info, lock);

> 		mutex_unlock(&debug_info_alloc_mutex);

> 	#endif

> 		real_spin_lock_init(lock);

> 	}

> 

> 	void spin_lock(spinlock_t *lock)

> 	{

> 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> 		debug_info->...; # use the debug info

> 	#endif

> 		real_spin_lock(lock);

> 	}

> 

> In that case, readers would have a control dependency between the condition of

> the data struct containing the spinlock_t being non-NULL, and the dereference of

> debug_info by spin_lock().  So anyone "receiving" a data structure containing a

> spinlock_t would need to use smp_load_acquire(), not READ_ONCE().


Sorry, no.

The user had jolly well better make -very- sure that the call to
spin_lock_init() is ordered before any call to spin_lock().  Running
spin_lock() concurrently with spin_lock_init() will bring you nothing
but sorrow, even without that debug_info control-dependency issue.

In the various one-time init examples, the required ordering would be
correctly supplied if spin_lock_init() was invoked by init_foo() or
alloc_foo(), depending on the example, and used only after a successful
return from init_foo_if_needed().  None of these examples rely on control
dependencies.

> Point is, whether it's safe to use READ_ONCE() with a data structure or not is

> an implementation detail, not an API guarantee.


Again, no.  It is perfectly possible to design APIs that hide
the ordering.  In that case, the use of READ_ONCE() is an internal
design decision for the guy writing the code that implements the API.
For example, RCU calls schedule() and doesn't need to know or care that
some portions of the scheduler rely on control dependencies (or at least
they did last I looked).

Of course, it is also possible to hack together an API that exposes strange
ordering details, and sometimes it is even necessary.  But even then the
ordering needs to be documented.

							Thanx, Paul
Eric Biggers Oct. 2, 2020, 3:07 a.m. UTC | #8
On Thu, Sep 24, 2020 at 08:31:02PM -0700, Paul E. McKenney wrote:
> On Thu, Sep 24, 2020 at 07:09:08PM -0700, Eric Biggers wrote:

> > On Thu, Sep 24, 2020 at 05:59:34PM -0700, Paul E. McKenney wrote:

> > > On Tue, Sep 22, 2020 at 02:55:58PM -0700, Eric Biggers wrote:

> > > > On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:

> > > > > > You're missing the point here.  b and c could easily be allocated by a function

> > > > > > alloc_b() that's in another file.

> > > > > 

> > > > > I am still missing something.

> > > > > 

> > > > > If by "allocated" you mean something like kmalloc(), the compiler doesn't

> > > > > know the address.  If you instead mean that there is a function that

> > > > > returns the address of another translation unit's static variable, then

> > > > > any needed ordering should preferably be built into that function's API.

> > > > > Either way, one would hope for some documentation of anything the caller

> > > > > needed to be careful of.

> > > > > 

> > > > > > > Besides which, control dependencies should be used only by LKMM experts

> > > > > > > at this point.  

> > > > > > 

> > > > > > What does that even mean?  Control dependencies are everywhere.

> > > > > 

> > > > > Does the following work better for you?

> > > > > 

> > > > > "... the non-local ordering properties of control dependencies should be

> > > > > relied on only by LKMM experts ...".

> > > > 

> > > > No.  I don't know what that means.  And I think very few people would know.

> > > > 

> > > > I just want to know if I use the one-time init pattern with a pointer to a data

> > > > structure foo, are the readers using foo_use() supposed to use READ_ONCE() or

> > > > are they supposed to use smp_load_acquire().

> > > > 

> > > > It seems the answer is that smp_load_acquire() is the only safe choice, since

> > > > foo_use() *might* involve a control dependency, or might in the future since

> > > > it's part of another kernel subsystem and its implementation could change.

> > > 

> > > First, the specific issue of one-time init.

> > > 

> > > If you are the one writing the code implementing one-time init, it is your

> > > choice.  It seems like you prefer smp_load_acquire().  If someone sees

> > > performance problems due to the resulting memory-barrier instructions,

> > > they have the option of submitting a patch and either forking the

> > > implementation or taking your implementation over from you, depending

> > > on how that conversation goes.

> > 

> > It doesn't matter what I "prefer".  The question is, how to write code that is

> > actually guaranteed to be correct on all supported Linux architectures, without

> > assuming internal implementation details of other kernel subsystems.

> 

> And that question allows ample room for personal preferences.

> 

> There are after all tradeoffs.  Do you want to live within the current

> knowledge of your users, or are you willing to invest time and energy

> into teaching them something new?  If someone wants a level of performance

> that is accommodated only by a difficult-to-use pattern, will you choose

> to accommodate them, or will you tell them to build write their own?

> 

> There are often a number of ways to make something work, and they all

> have advantages and disadvantages.  There are tradeoffs, and preferences

> have a role to play as well.


Having options doesn't matter if no one can agree on which one to use.  This is
the second bug fix that I can't get accepted due to bikeshedding over how to
implement "one-time init":

First patch:
v1: https://lkml.kernel.org/linux-fsdevel/20200713033330.205104-1-ebiggers@kernel.org
v2: https://lkml.kernel.org/linux-fsdevel/20200717050510.95832-1-ebiggers@kernel.org
Related thread: https://lkml.kernel.org/lkml/20200717044427.68747-1-ebiggers@kernel.org

Second patch (this one):
https://lkml.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org

The problem is identical in both cases.  In both cases, the code currently
implements "one-time init" using a plain load on the reader side, which is
undefined behavior and isn't sufficient on all supported Linux architectures
(even *if* there is no control dependency, which is something that usually is
hard to determine, as I've explained several times).

However in both cases, no one can agree on what to replace the broken code with.
And the opinions were conflicting.  In the first patch, people were advocating
for smp_load_acquire() over READ_ONCE() because it's too hard to determine when
READ_ONCE() is safe.  And even after I switched to smp_load_acquire(), the patch
was still rejected, with conflicting reasons.

Now in the second patch, people are instead advocating for READ_ONCE() over
smp_load_acquire().  And you're claiming that all kernel developers are expected
to read Documentation/RCU/rcu_dereference.rst and design all allocation
functions to be usable in combination with rcu_dereference() or READ_ONCE().
(Tough luck if anyone didn't do that, I guess!)

So what actually happens is that no one agrees on a fix, so the obviously broken
code just gets left as-is.

Now, I don't have unlimited patience, so in the end I'm just going to let that
happen, and let it be Someone Else's problem to fix these bugs months/years down
the line when they happen to cause a "real" problem, or are detected by KCSAN,
etc.  It certainly seems like a bad outcome, though.

> If alloc_foo() also initializes static data, then it really should have

> some name other than alloc_foo().


Not necessarily.  It could do one-time-init of static data, e.g. a workqueue or
a mempool.  See fscrypt_initialize() (called by fscrypt_get_encryption_info())
for a real-world example of this.  The fscrypt_bounce_page_pool uses a
significant amount of memory, so we only allocate it if someone actually is
going to need it.

Are you disputing that that's a reasonable thing to do?  There's a lot of kernel
code that does something like this.

- Eric
Paul E. McKenney Oct. 8, 2020, 6:31 p.m. UTC | #9
On Thu, Oct 01, 2020 at 08:07:31PM -0700, Eric Biggers wrote:
> On Thu, Sep 24, 2020 at 08:31:02PM -0700, Paul E. McKenney wrote:

> > On Thu, Sep 24, 2020 at 07:09:08PM -0700, Eric Biggers wrote:

> > > On Thu, Sep 24, 2020 at 05:59:34PM -0700, Paul E. McKenney wrote:

> > > > On Tue, Sep 22, 2020 at 02:55:58PM -0700, Eric Biggers wrote:

> > > > > On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:

> > > > > > > You're missing the point here.  b and c could easily be allocated by a function

> > > > > > > alloc_b() that's in another file.

> > > > > > 

> > > > > > I am still missing something.

> > > > > > 

> > > > > > If by "allocated" you mean something like kmalloc(), the compiler doesn't

> > > > > > know the address.  If you instead mean that there is a function that

> > > > > > returns the address of another translation unit's static variable, then

> > > > > > any needed ordering should preferably be built into that function's API.

> > > > > > Either way, one would hope for some documentation of anything the caller

> > > > > > needed to be careful of.

> > > > > > 

> > > > > > > > Besides which, control dependencies should be used only by LKMM experts

> > > > > > > > at this point.  

> > > > > > > 

> > > > > > > What does that even mean?  Control dependencies are everywhere.

> > > > > > 

> > > > > > Does the following work better for you?

> > > > > > 

> > > > > > "... the non-local ordering properties of control dependencies should be

> > > > > > relied on only by LKMM experts ...".

> > > > > 

> > > > > No.  I don't know what that means.  And I think very few people would know.

> > > > > 

> > > > > I just want to know if I use the one-time init pattern with a pointer to a data

> > > > > structure foo, are the readers using foo_use() supposed to use READ_ONCE() or

> > > > > are they supposed to use smp_load_acquire().

> > > > > 

> > > > > It seems the answer is that smp_load_acquire() is the only safe choice, since

> > > > > foo_use() *might* involve a control dependency, or might in the future since

> > > > > it's part of another kernel subsystem and its implementation could change.

> > > > 

> > > > First, the specific issue of one-time init.

> > > > 

> > > > If you are the one writing the code implementing one-time init, it is your

> > > > choice.  It seems like you prefer smp_load_acquire().  If someone sees

> > > > performance problems due to the resulting memory-barrier instructions,

> > > > they have the option of submitting a patch and either forking the

> > > > implementation or taking your implementation over from you, depending

> > > > on how that conversation goes.

> > > 

> > > It doesn't matter what I "prefer".  The question is, how to write code that is

> > > actually guaranteed to be correct on all supported Linux architectures, without

> > > assuming internal implementation details of other kernel subsystems.

> > 

> > And that question allows ample room for personal preferences.

> > 

> > There are after all tradeoffs.  Do you want to live within the current

> > knowledge of your users, or are you willing to invest time and energy

> > into teaching them something new?  If someone wants a level of performance

> > that is accommodated only by a difficult-to-use pattern, will you choose

> > to accommodate them, or will you tell them to build write their own?

> > 

> > There are often a number of ways to make something work, and they all

> > have advantages and disadvantages.  There are tradeoffs, and preferences

> > have a role to play as well.

> 

> Having options doesn't matter if no one can agree on which one to use.  This is

> the second bug fix that I can't get accepted due to bikeshedding over how to

> implement "one-time init":


I understand that this situation could be quite frustrating, but we
can only expect a memory model to model memory.  Its job is to help us
understand what can work and what will not work from a memory-ordering
perspective, which at best will provide you with the options that you
seem to be so dissatisfied with.  The memory model is quite incapable of
browbeating intransigent human beings into agreeing on which option should
be used in a given situation.  This last never was a requirement of the
LKMM project.  Please rest assured that it will remain a non-requirement.

> First patch:

> v1: https://lkml.kernel.org/linux-fsdevel/20200713033330.205104-1-ebiggers@kernel.org

> v2: https://lkml.kernel.org/linux-fsdevel/20200717050510.95832-1-ebiggers@kernel.org

> Related thread: https://lkml.kernel.org/lkml/20200717044427.68747-1-ebiggers@kernel.org

> 

> Second patch (this one):

> https://lkml.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org

> 

> The problem is identical in both cases.  In both cases, the code currently

> implements "one-time init" using a plain load on the reader side, which is

> undefined behavior and isn't sufficient on all supported Linux architectures

> (even *if* there is no control dependency, which is something that usually is

> hard to determine, as I've explained several times).


For this particular problem, please forget about control dependencies.
In theory, yes, you can construct situations in which control dependencies
would be useful for one-time init, but in practice all of the situations
I can think of are extremely strange, even my my standards.

> However in both cases, no one can agree on what to replace the broken code with.

> And the opinions were conflicting.  In the first patch, people were advocating

> for smp_load_acquire() over READ_ONCE() because it's too hard to determine when

> READ_ONCE() is safe.  And even after I switched to smp_load_acquire(), the patch

> was still rejected, with conflicting reasons.


Been there, done that.  Still there, still doing that, actually.

So welcome to my world!

> Now in the second patch, people are instead advocating for READ_ONCE() over

> smp_load_acquire().  And you're claiming that all kernel developers are expected

> to read Documentation/RCU/rcu_dereference.rst and design all allocation

> functions to be usable in combination with rcu_dereference() or READ_ONCE().

> (Tough luck if anyone didn't do that, I guess!)


Not quite.

I will suspend disbelief for the moment, and for the remainder of this
email I will act as if I am absolutely certain that you are not making
an especially clumsy attempt to troll me.

What I am claiming is that -if- a particular kernel developer wishes to
-fully- explore -all- of the possible implementations of one-time init,
-then-, and only then -that- -particular- kernel developer will need to
know a great deal about the Linux-kernel memory model.  That particular
kernel developer might (as you suggest above) carefully read a bunch
of documentation in the kernel source tree.  I would advise that kernel
developer to also learn to use the herd7 tool, but different developers
will have different preferences.  Which is OK.

But most kernel developers will not need any understanding at all of the
Linux-kernel memory model.  Most will be able to continue to use APIs that
hide this memory-ordering complexity.  In fact, hiding this complexity
is an advantage of a well-designed and agreed-upon one-time init API.

> So what actually happens is that no one agrees on a fix, so the obviously broken

> code just gets left as-is.

> 

> Now, I don't have unlimited patience, so in the end I'm just going to let that

> happen, and let it be Someone Else's problem to fix these bugs months/years down

> the line when they happen to cause a "real" problem, or are detected by KCSAN,

> etc.  It certainly seems like a bad outcome, though.


My plate is quite full at the moment, so I will not be doing this work.

But yes, it does appear that you are suffering the consequences
of C. Northcote Parkinson's Law of Triviality, from which the term
"bikeshedding" arose.  I am sorry, but I do not have a general solution
to the bikeshedding problem, other than the usual recommendations of
patience, persistence, and taking care to fully understand the mindsets
of the people making all the conflicting recommendations and objections.

Besides, the ability to deal with this sort of thing will be quite
valuable to you, and you cannot learn it any younger!

> > If alloc_foo() also initializes static data, then it really should have

> > some name other than alloc_foo().

> 

> Not necessarily.  It could do one-time-init of static data, e.g. a workqueue or

> a mempool.  See fscrypt_initialize() (called by fscrypt_get_encryption_info())

> for a real-world example of this.  The fscrypt_bounce_page_pool uses a

> significant amount of memory, so we only allocate it if someone actually is

> going to need it.


There is no reason you cannot place the corresponding pointers, handles,
or whatever into the dynamically allocated block of memory, and do so
before doing the store-release of the pointer.  Then the users access
the workqueue, mempool, or whatever via this same pointer and everything
works out.

And if "someone actually is going to need it" means that different
portions of the initialization happen at different times, then there is
no reason why you cannot just use multiple instances of the one-time
init pattern, with the always-initialized data being handled by the
first instance and the only-somtimes-initialized data being handled by
the other instances.

> Are you disputing that that's a reasonable thing to do?  There's a lot of kernel

> code that does something like this.


If you wish to actually solve this problem, you are going to have to face
up to the fact that you are going to have to make some choices, and no
matter what set of choices you make, there will be people who are unhappy.
You will then either need to convince a critical mass of people of the
wisdom of your choices or adjust your choices to get a critical mass
on board.  If you do not wish to take on this particular challenge, your
best strategy is of course to stop complaining about it and take up other
challenges that are more technical and less people-oriented in nature.

On the off-chance that you wish to continue with one-time init, I am
reiterating your choices:

1.	Fully locked.  Acquire the lock every time the one-time init
	primitive is invoked.  The biggest advantage of this approach
	is that almost all kernel developers will easily understand
	how this works, in happy contrast with the developers back in
	the 1980s and 1990s.  Too bad about the lack of scalability,
	but if used infrequently enough, so what?

2.	Release and acquire, as discussed earlier in this thread.  Great
	scalability once initialization has completed, OK performance.
	The set of kernel developers who understand this is somewhat
	smaller than that for locking, but there is still no shortage
	of kernel developers who could successfully develop and maintain
	a one-time init facility based on release and acquire.

	Both #1 and #2 have the advantage of handling the case where
	the initialized data is scattered and completely disorganized.

3.	Like #2, but use either rcu_dereference() or READ_ONCE() in place
	of the smp_load_acquire(), replace the flag with a pointer to the
	to-be-initialized data, dynamically allocate memory for this data,
	and make very sure that -all- of the data resides in this memory.

	The payoff for #3's requirement that the developers abide by
	all of these restrictions is improved performance compared to
	#2 and especially to #1.  On all architectures, the compiler
	will have greater freedom to optimize.	On some architectures,
	the explicit memory-barrier instructions required for #1 and #2
	may be omitted for #3.

4.	Strange approaches (even by my standards) that I will ignore
	unless they are proven to be necessary.

All the memory model can do is to tell you that these approaches all work,
at least assuming everyone follows the corresponding rules.

Figuring out which of these options to actually use required reviewing
all of the use cases, determining the performance and scalability
requirements, and figuring out whether the affected developers are
willing and able to abide by the corresponding usage restrictions.
And, perhaps hardest, convincing a critical mass of developers to
actually agree on something.

It is quite possible that more than one of the options will be required.
For example, it might turn out that both #2 and #3 are needed.

This is actually the common case.  Just look at all the locking primitives
in the kernel if you doubt this.  Which is of course yet another way
to make people unhappy.  But again, no matter what you do in this area,
there will be unhappy people.  ;-)

							Thanx, Paul
diff mbox series

Patch

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 09b1551d4092f..9f1e7a4a0fbbb 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -843,8 +843,8 @@  static void do_numa_crng_init(struct work_struct *work)
 		crng_initialize_secondary(crng);
 		pool[i] = crng;
 	}
-	mb();
-	if (cmpxchg(&crng_node_pool, NULL, pool)) {
+	/* pairs with smp_load_acquire() in select_crng() */
+	if (cmpxchg_release(&crng_node_pool, NULL, pool) != NULL) {
 		for_each_node(i)
 			kfree(pool[i]);
 		kfree(pool);
@@ -857,8 +857,26 @@  static void numa_crng_init(void)
 {
 	schedule_work(&numa_crng_init_work);
 }
+
+static inline struct crng_state *select_crng(void)
+{
+	struct crng_state **pool;
+	int nid = numa_node_id();
+
+	/* pairs with cmpxchg_release() in do_numa_crng_init() */
+	pool = smp_load_acquire(&crng_node_pool);
+	if (pool && pool[nid])
+		return pool[nid];
+
+	return &primary_crng;
+}
 #else
 static void numa_crng_init(void) {}
+
+static inline struct crng_state *select_crng(void)
+{
+	return &primary_crng;
+}
 #endif
 
 /*
@@ -1005,15 +1023,7 @@  static void _extract_crng(struct crng_state *crng,
 
 static void extract_crng(__u8 out[CHACHA_BLOCK_SIZE])
 {
-	struct crng_state *crng = NULL;
-
-#ifdef CONFIG_NUMA
-	if (crng_node_pool)
-		crng = crng_node_pool[numa_node_id()];
-	if (crng == NULL)
-#endif
-		crng = &primary_crng;
-	_extract_crng(crng, out);
+	_extract_crng(select_crng(), out);
 }
 
 /*
@@ -1042,15 +1052,7 @@  static void _crng_backtrack_protect(struct crng_state *crng,
 
 static void crng_backtrack_protect(__u8 tmp[CHACHA_BLOCK_SIZE], int used)
 {
-	struct crng_state *crng = NULL;
-
-#ifdef CONFIG_NUMA
-	if (crng_node_pool)
-		crng = crng_node_pool[numa_node_id()];
-	if (crng == NULL)
-#endif
-		crng = &primary_crng;
-	_crng_backtrack_protect(crng, tmp, used);
+	_crng_backtrack_protect(select_crng(), tmp, used);
 }
 
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)