diff mbox series

random: use correct memory barriers for crng_node_pool

Message ID 20200916233042.51634-1-ebiggers@kernel.org
State Superseded
Headers show
Series random: use correct memory barriers for crng_node_pool | expand

Commit Message

Eric Biggers Sept. 16, 2020, 11:30 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

When a CPU selects which CRNG to use, it accesses crng_node_pool without
a memory barrier.  That's wrong, because crng_node_pool can be set by
another CPU concurrently.  Without a memory barrier, the crng_state that
is used might not appear to be fully initialized.

There's an explicit mb() on the write side, but it's redundant with
cmpxchg() (or cmpxchg_release()) and does nothing to fix the read side.

Implement this correctly by using a cmpxchg_release() +
smp_load_acquire() pair.

Fixes: 1e7f583af67b ("random: make /dev/urandom scalable for silly userspace programs")
Cc: <stable@vger.kernel.org> # v4.8+
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 drivers/char/random.c | 42 ++++++++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 20 deletions(-)

Comments

Herbert Xu Sept. 17, 2020, 7:26 a.m. UTC | #1
Eric Biggers <ebiggers@kernel.org> wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> When a CPU selects which CRNG to use, it accesses crng_node_pool without
> a memory barrier.  That's wrong, because crng_node_pool can be set by
> another CPU concurrently.  Without a memory barrier, the crng_state that
> is used might not appear to be fully initialized.

The only architecture that requires a barrier for data dependency
is Alpha.  The correct primitive to ensure that barrier is present
is smp_barrier_depends, or you could just use READ_ONCE.

Cheers,
Eric Biggers Sept. 17, 2020, 4:58 p.m. UTC | #2
On Thu, Sep 17, 2020 at 05:26:44PM +1000, Herbert Xu wrote:
> Eric Biggers <ebiggers@kernel.org> wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > When a CPU selects which CRNG to use, it accesses crng_node_pool without
> > a memory barrier.  That's wrong, because crng_node_pool can be set by
> > another CPU concurrently.  Without a memory barrier, the crng_state that
> > is used might not appear to be fully initialized.
> 
> The only architecture that requires a barrier for data dependency
> is Alpha.  The correct primitive to ensure that barrier is present
> is smp_barrier_depends, or you could just use READ_ONCE.
> 

smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
that is difficult to tell whether it's correct or not.  For trivial data
structures it's "easy" to tell.  But whenever there is a->b where b is an
internal implementation detail of another kernel subsystem, the use of which
could involve accesses to global or static data (for example, spin_lock()
accessing lockdep stuff), a control dependency can slip in.

The last time I tried to use READ_ONCE(), it started a big controversy
(https://lkml.kernel.org/linux-fsdevel/20200713033330.205104-1-ebiggers@kernel.org/T/#u,
https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u,
https://lwn.net/Articles/827180/).  In the end, people refused to even allow the
READ_ONCE() optimization to be documented, because they felt that
smp_load_acquire() should just be used instead.

So I think we should just go with smp_load_acquire()...

- Eric
Herbert Xu Sept. 21, 2020, 8:19 a.m. UTC | #3
On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:
>
> smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
> that is difficult to tell whether it's correct or not.  For trivial data
> structures it's "easy" to tell.  But whenever there is a->b where b is an
> internal implementation detail of another kernel subsystem, the use of which
> could involve accesses to global or static data (for example, spin_lock()
> accessing lockdep stuff), a control dependency can slip in.

If we're going to follow this line of reasoning, surely you should
be converting the RCU derference first and foremost, no?

Cheers,
Paul E. McKenney Sept. 21, 2020, 3:27 p.m. UTC | #4
On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:
> On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:

> >

> > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization

> > that is difficult to tell whether it's correct or not.  For trivial data

> > structures it's "easy" to tell.  But whenever there is a->b where b is an

> > internal implementation detail of another kernel subsystem, the use of which

> > could involve accesses to global or static data (for example, spin_lock()

> > accessing lockdep stuff), a control dependency can slip in.

> 

> If we're going to follow this line of reasoning, surely you should

> be converting the RCU derference first and foremost, no?


Agreed, rcu_dereference() is preferred over READ_ONCE() when reading
RCU-protected pointers.  Much better debugging support, if nothing else.

However, as part of making the kernel safe from DEC Alpha, READ_ONCE()
does protect against reading and dereferencing pointers to objects
concurrently being inserted into a linked data structure.  If they are
never removed (or are removed only when there are known to be no readers),
RCU is not required.

And to Eric's point, it is also true that when you have pointers to
static data, and when the compiler can guess this, you do need something
like smp_load_acquire().  But this is a problem only when you are (1)
using feedback-driven compiler optimization or (2) when you compare the
pointer to the address of the static data.

And yes, we are still working to be able to tell the compiler when
a pointer carries a dependency, but this continues to be slow going.  :-/

							Thanx, Paul
Herbert Xu Sept. 21, 2020, 10:11 p.m. UTC | #5
On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:
> > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:
> > >
> > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
> > > that is difficult to tell whether it's correct or not.  For trivial data
> > > structures it's "easy" to tell.  But whenever there is a->b where b is an
> > > internal implementation detail of another kernel subsystem, the use of which
> > > could involve accesses to global or static data (for example, spin_lock()
> > > accessing lockdep stuff), a control dependency can slip in.
> > 
> > If we're going to follow this line of reasoning, surely you should
> > be converting the RCU derference first and foremost, no?

...

> And to Eric's point, it is also true that when you have pointers to
> static data, and when the compiler can guess this, you do need something
> like smp_load_acquire().  But this is a problem only when you are (1)
> using feedback-driven compiler optimization or (2) when you compare the
> pointer to the address of the static data.

Let me restate what I think Eric is saying.  He is concerned about
the case where a->b and b is some opaque object that may in turn
dereference a global data structure unconnected to a.  The case
in question here is crng_node_pool in drivers/char/random.c which
in turn contains a spin lock.

But this reasoning could apply to any data structure that contains
a spin lock, in particular ones that are dereferenced through RCU.

So my question if this reasoning is valid, then why aren't we first
converting rcu_dereference to use smp_load_acquire?

Cheers,
Paul E. McKenney Sept. 21, 2020, 11:26 p.m. UTC | #6
On Tue, Sep 22, 2020 at 08:11:04AM +1000, Herbert Xu wrote:
> On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:
> > > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:
> > > >
> > > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
> > > > that is difficult to tell whether it's correct or not.  For trivial data
> > > > structures it's "easy" to tell.  But whenever there is a->b where b is an
> > > > internal implementation detail of another kernel subsystem, the use of which
> > > > could involve accesses to global or static data (for example, spin_lock()
> > > > accessing lockdep stuff), a control dependency can slip in.
> > > 
> > > If we're going to follow this line of reasoning, surely you should
> > > be converting the RCU derference first and foremost, no?
> 
> ...
> 
> > And to Eric's point, it is also true that when you have pointers to
> > static data, and when the compiler can guess this, you do need something
> > like smp_load_acquire().  But this is a problem only when you are (1)
> > using feedback-driven compiler optimization or (2) when you compare the
> > pointer to the address of the static data.
> 
> Let me restate what I think Eric is saying.  He is concerned about
> the case where a->b and b is some opaque object that may in turn
> dereference a global data structure unconnected to a.  The case
> in question here is crng_node_pool in drivers/char/random.c which
> in turn contains a spin lock.

As long as the compiler generates code that reaches that global via
pointer a, everything will work fine.  Which it will, unless the guy
writing the code makes the mistake of introducing a comparison between the
pointer to be dereferenced and the address of the global data structure.

So this is OK:

	p = rcu_dereference(a);
	do_something(p->b);

This is not OK:

	p = rcu_dereference(a);
	if (p == &some_global_variable)
		we_really_should_not_have_done_that_comparison();
	do_something(p->b);

The reason this last is not OK is because the compiler can transform
it as follows:

	p = rcu_dereference(a);
	if (p == &some_global_variable) {
		we_really_should_not_have_done_that_comparison();
		do_something(some_global_variable.b);
	} else {
		do_something(p->b);
	}

The compiler is not allowed to make up that sort of comparison, based
on my February 2020 discussion with the standards committee.

> But this reasoning could apply to any data structure that contains
> a spin lock, in particular ones that are dereferenced through RCU.

I lost you on this one.  What is special about a spin lock?

Here is what I think you mean:

	struct foo {
		spinlock_t lock;
		int a;
		char b;
		long c;
	};

	struct foo *a;

	...

	p = rcu_dereference(a);
	BUG_ON(!p);
	if (is_this_the_one(p)) {
		spin_lock(p->lock);
		do_something_else(p);
		spin_unlock(p->lock);
	}

This should be fine.  Or were you thinking of some other example?

> So my question if this reasoning is valid, then why aren't we first
> converting rcu_dereference to use smp_load_acquire?

For LTO in ARM, rumor has it that Will is doing so.  Which was what
motivated the BoF on this topic at Linux Plumbers Conference.

							Thanx, Paul
Herbert Xu Sept. 21, 2020, 11:51 p.m. UTC | #7
On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:
>

> > But this reasoning could apply to any data structure that contains

> > a spin lock, in particular ones that are dereferenced through RCU.

> 

> I lost you on this one.  What is special about a spin lock?


I don't know, that was Eric's concern.  He is inferring that
spin locks through lockdep debugging may trigger dependencies
that require smp_load_acquire.

Anyway, my point is if it applies to crng_node_pool then it
would equally apply to RCU in general.

> > So my question if this reasoning is valid, then why aren't we first

> > converting rcu_dereference to use smp_load_acquire?

> 

> For LTO in ARM, rumor has it that Will is doing so.  Which was what

> motivated the BoF on this topic at Linux Plumbers Conference.


Sure, if RCU switches over to smp_load_acquire then I would have
no problems with everybody else following in its footsteps.

Here is the original patch in question:

https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Eric Biggers Sept. 21, 2020, 11:52 p.m. UTC | #8
On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 22, 2020 at 08:11:04AM +1000, Herbert Xu wrote:
> > On Mon, Sep 21, 2020 at 08:27:14AM -0700, Paul E. McKenney wrote:
> > > On Mon, Sep 21, 2020 at 06:19:39PM +1000, Herbert Xu wrote:
> > > > On Thu, Sep 17, 2020 at 09:58:02AM -0700, Eric Biggers wrote:
> > > > >
> > > > > smp_load_acquire() is obviously correct, whereas READ_ONCE() is an optimization
> > > > > that is difficult to tell whether it's correct or not.  For trivial data
> > > > > structures it's "easy" to tell.  But whenever there is a->b where b is an
> > > > > internal implementation detail of another kernel subsystem, the use of which
> > > > > could involve accesses to global or static data (for example, spin_lock()
> > > > > accessing lockdep stuff), a control dependency can slip in.
> > > > 
> > > > If we're going to follow this line of reasoning, surely you should
> > > > be converting the RCU derference first and foremost, no?
> > 
> > ...
> > 
> > > And to Eric's point, it is also true that when you have pointers to
> > > static data, and when the compiler can guess this, you do need something
> > > like smp_load_acquire().  But this is a problem only when you are (1)
> > > using feedback-driven compiler optimization or (2) when you compare the
> > > pointer to the address of the static data.
> > 
> > Let me restate what I think Eric is saying.  He is concerned about
> > the case where a->b and b is some opaque object that may in turn
> > dereference a global data structure unconnected to a.  The case
> > in question here is crng_node_pool in drivers/char/random.c which
> > in turn contains a spin lock.
> 
> As long as the compiler generates code that reaches that global via
> pointer a, everything will work fine.  Which it will, unless the guy
> writing the code makes the mistake of introducing a comparison between the
> pointer to be dereferenced and the address of the global data structure.
> 
> So this is OK:
> 
> 	p = rcu_dereference(a);
> 	do_something(p->b);
> 
> This is not OK:
> 
> 	p = rcu_dereference(a);
> 	if (p == &some_global_variable)
> 		we_really_should_not_have_done_that_comparison();
> 	do_something(p->b);

If you call some function that's an internal implementation detail of some other
kernel subsystem, how do you know it doesn't do that?

Also, it's not just the p == &global_variable case.  Consider:

struct a { struct b *b; };
struct b { ... };

Thread 1:

	/* one-time initialized data shared by all instances of b */
	static struct c *c;

	void init_b(struct a *a)
	{
		if (!c)
			c = alloc_c();

		smp_store_release(&a->b, kzalloc(sizeof(struct b)));
	}

Thread 2:

	void use_b_if_present(struct a *a)
	{
		struct b *b = READ_ONCE(a->b);

		if (b) {
			c->... # crashes because c still appears to be NULL
		}
	}


So when the *first* "b" is allocated, the global data "c" is initialized.  Then
when using a "b", we expect to be able to access "c".  But there's no
data dependency from "b" to "c"; it's a control dependency only.
So smp_load_acquire() is needed, not READ_ONCE().

And it can be an internal implementation detail of "b"'s subsystem whether it
happens to use global data "c".

This sort of thing is why people objected to the READ_ONCE() optimization during
the discussion at
https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u.
Most kernel developers aren't experts in the LKMM, and they want something
that's guaranteed to be correct without having to to think really hard about it
and make assumptions about the internal implementation details of other
subsystems, how compilers have implemented the C standard, and so on.

- Eric
Paul E. McKenney Sept. 22, 2020, 6:42 p.m. UTC | #9
On Tue, Sep 22, 2020 at 09:51:36AM +1000, Herbert Xu wrote:
> On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:
> >
> > > But this reasoning could apply to any data structure that contains
> > > a spin lock, in particular ones that are dereferenced through RCU.
> > 
> > I lost you on this one.  What is special about a spin lock?
> 
> I don't know, that was Eric's concern.  He is inferring that
> spin locks through lockdep debugging may trigger dependencies
> that require smp_load_acquire.
> 
> Anyway, my point is if it applies to crng_node_pool then it
> would equally apply to RCU in general.

Referring to the patch you call out below...

Huh.  The old cmpxchg() primitive is fully ordered, so the old mb()
preceding it must have been for correctly interacting with hardware on
!SMP systems.  If that is the case, then the use of cmpxchg_release()
is incorrect.  This is not the purview of the memory model, but rather
of device-driver semantics.  Or does crng not (or no longer, as the case
might be) interact with hardware RNGs?

What prevents either the old or the new code from kfree()ing the old
state out from under another CPU that just now picked up a pointer to the
old state?  The combination of cmpxchg_release() and smp_load_acquire()
won't do anything to prevent this from happening.  This is after all not
a memory-ordering issue, but instead an object-lifetime issue.  But maybe
you have a lock or something that provides the needed protection.  I don't
see how this can be the case and still require the cmpxchg_release()
and smp_load_acquire(), but perhaps this is a failure of imagination on
my part.

I am guessing that this lifetime issue prompted RCU to be introduced
into this discussion.  This would be one way of handling the lifetime
of the pool[] array and the objects that its elements point to, but at
some cost in either latency or memory footprint for synchronize_rcu()
and call_rcu(), respectively.

Or am I missing something subtle here?

> > > So my question if this reasoning is valid, then why aren't we first
> > > converting rcu_dereference to use smp_load_acquire?
> > 
> > For LTO in ARM, rumor has it that Will is doing so.  Which was what
> > motivated the BoF on this topic at Linux Plumbers Conference.
> 
> Sure, if RCU switches over to smp_load_acquire then I would have
> no problems with everybody else following in its footsteps.

The x86 guys might well be OK with this change, but I would guess that
the ARM and PowerPC guys might have some heartburn.  ;-)

> Here is the original patch in question:
> 
> https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/

Thank you for the pointer!  I freely confess that I was wondering what
this was all about.

							Thanx, Paul
Eric Biggers Sept. 22, 2020, 6:59 p.m. UTC | #10
On Tue, Sep 22, 2020 at 11:42:43AM -0700, Paul E. McKenney wrote:
> On Tue, Sep 22, 2020 at 09:51:36AM +1000, Herbert Xu wrote:

> > On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:

> > >

> > > > But this reasoning could apply to any data structure that contains

> > > > a spin lock, in particular ones that are dereferenced through RCU.

> > > 

> > > I lost you on this one.  What is special about a spin lock?

> > 

> > I don't know, that was Eric's concern.  He is inferring that

> > spin locks through lockdep debugging may trigger dependencies

> > that require smp_load_acquire.

> > 

> > Anyway, my point is if it applies to crng_node_pool then it

> > would equally apply to RCU in general.

> 

> Referring to the patch you call out below...

> 

> Huh.  The old cmpxchg() primitive is fully ordered, so the old mb()

> preceding it must have been for correctly interacting with hardware on

> !SMP systems.  If that is the case, then the use of cmpxchg_release()

> is incorrect.  This is not the purview of the memory model, but rather

> of device-driver semantics.  Or does crng not (or no longer, as the case

> might be) interact with hardware RNGs?


No hardware involved here.  The mb() is just unnecessary, as I noted in my patch
https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/.

> What prevents either the old or the new code from kfree()ing the old

> state out from under another CPU that just now picked up a pointer to the

> old state?  The combination of cmpxchg_release() and smp_load_acquire()

> won't do anything to prevent this from happening.  This is after all not

> a memory-ordering issue, but instead an object-lifetime issue.  But maybe

> you have a lock or something that provides the needed protection.  I don't

> see how this can be the case and still require the cmpxchg_release()

> and smp_load_acquire(), but perhaps this is a failure of imagination on

> my part.


crng_node_pool is initialized only once, and never freed.

- Eric
Eric Biggers Sept. 22, 2020, 7:09 p.m. UTC | #11
On Tue, Sep 22, 2020 at 11:31:00AM -0700, Paul E. McKenney wrote:
> > Also, it's not just the p == &global_variable case.  Consider:

> > 

> > struct a { struct b *b; };

> > struct b { ... };

> > 

> > Thread 1:

> > 

> > 	/* one-time initialized data shared by all instances of b */

> > 	static struct c *c;

> > 

> > 	void init_b(struct a *a)

> > 	{

> > 		if (!c)

> > 			c = alloc_c();

> > 

> > 		smp_store_release(&a->b, kzalloc(sizeof(struct b)));

> > 	}

> > 

> > Thread 2:

> > 

> > 	void use_b_if_present(struct a *a)

> > 	{

> > 		struct b *b = READ_ONCE(a->b);

> > 

> > 		if (b) {

> > 			c->... # crashes because c still appears to be NULL

> > 		}

> > 	}

> > 

> > 

> > So when the *first* "b" is allocated, the global data "c" is initialized.  Then

> > when using a "b", we expect to be able to access "c".  But there's no

> > data dependency from "b" to "c"; it's a control dependency only.

> > So smp_load_acquire() is needed, not READ_ONCE().

> > 

> > And it can be an internal implementation detail of "b"'s subsystem whether it

> > happens to use global data "c".

> 

> Given that "c" is static, these two subsystems must be in the same

> translation unit.  So I don't see how this qualifies as being internal to

> "b"'s subsystem.


You're missing the point here.  b and c could easily be allocated by a function
alloc_b() that's in another file.

> Besides which, control dependencies should be used only by LKMM experts

> at this point.  


What does that even mean?  Control dependencies are everywhere.

> > This sort of thing is why people objected to the READ_ONCE() optimization during

> > the discussion at

> > https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u.

> > Most kernel developers aren't experts in the LKMM, and they want something

> > that's guaranteed to be correct without having to to think really hard about it

> > and make assumptions about the internal implementation details of other

> > subsystems, how compilers have implemented the C standard, and so on.

> 

> And smp_load_acquire()is provided for that reason.  Its name was

> even based on the nomenclature used in the C standard and elsewhere.

> And again, control dependencies are for LKMM experts, as they are very

> tricky to get right.


How does a developer know that the code they're calling in another subsystem
wasn't written by one of these "experts" and therefore has a control dependency?

> 

> But in the LKMM documentation, you are likely to find LKMM experts who

> want to optimize all the way, particularly in cases like the one-time

> init pattern where all the data is often local.  And the best basis for

> READ_ONCE() in one-time init is not a control dependency, but rather

> ordering of accesses to a single variable from a single task combined

> with locking, both of which are quite robust and much easier to use,

> especially in comparison to control dependencies.

> 

> My goal for LKMM is not that each and every developer have a full

> understanding of every nook and cranny of that model, but instead that

> people can find the primitives supporting the desired point in the

> performance/simplicity tradoff space.  And yes, I have more writing

> to do to make more progress towards that goal.


So are you saying people should use smp_load_acquire(), or are you saying people
should use READ_ONCE()?

- Eric
Paul E. McKenney Sept. 22, 2020, 8:31 p.m. UTC | #12
On Tue, Sep 22, 2020 at 11:59:31AM -0700, Eric Biggers wrote:
> On Tue, Sep 22, 2020 at 11:42:43AM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 22, 2020 at 09:51:36AM +1000, Herbert Xu wrote:
> > > On Mon, Sep 21, 2020 at 04:26:39PM -0700, Paul E. McKenney wrote:
> > > >
> > > > > But this reasoning could apply to any data structure that contains
> > > > > a spin lock, in particular ones that are dereferenced through RCU.
> > > > 
> > > > I lost you on this one.  What is special about a spin lock?
> > > 
> > > I don't know, that was Eric's concern.  He is inferring that
> > > spin locks through lockdep debugging may trigger dependencies
> > > that require smp_load_acquire.
> > > 
> > > Anyway, my point is if it applies to crng_node_pool then it
> > > would equally apply to RCU in general.
> > 
> > Referring to the patch you call out below...
> > 
> > Huh.  The old cmpxchg() primitive is fully ordered, so the old mb()
> > preceding it must have been for correctly interacting with hardware on
> > !SMP systems.  If that is the case, then the use of cmpxchg_release()
> > is incorrect.  This is not the purview of the memory model, but rather
> > of device-driver semantics.  Or does crng not (or no longer, as the case
> > might be) interact with hardware RNGs?
> 
> No hardware involved here.  The mb() is just unnecessary, as I noted in my patch
> https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/.
> 
> > What prevents either the old or the new code from kfree()ing the old
> > state out from under another CPU that just now picked up a pointer to the
> > old state?  The combination of cmpxchg_release() and smp_load_acquire()
> > won't do anything to prevent this from happening.  This is after all not
> > a memory-ordering issue, but instead an object-lifetime issue.  But maybe
> > you have a lock or something that provides the needed protection.  I don't
> > see how this can be the case and still require the cmpxchg_release()
> > and smp_load_acquire(), but perhaps this is a failure of imagination on
> > my part.
> 
> crng_node_pool is initialized only once, and never freed.

Thank you on both counts!

							Thanx, Paul
Paul E. McKenney Sept. 22, 2020, 8:56 p.m. UTC | #13
On Tue, Sep 22, 2020 at 12:09:36PM -0700, Eric Biggers wrote:
> On Tue, Sep 22, 2020 at 11:31:00AM -0700, Paul E. McKenney wrote:
> > > Also, it's not just the p == &global_variable case.  Consider:
> > > 
> > > struct a { struct b *b; };
> > > struct b { ... };
> > > 
> > > Thread 1:
> > > 
> > > 	/* one-time initialized data shared by all instances of b */
> > > 	static struct c *c;
> > > 
> > > 	void init_b(struct a *a)
> > > 	{
> > > 		if (!c)
> > > 			c = alloc_c();
> > > 
> > > 		smp_store_release(&a->b, kzalloc(sizeof(struct b)));
> > > 	}
> > > 
> > > Thread 2:
> > > 
> > > 	void use_b_if_present(struct a *a)
> > > 	{
> > > 		struct b *b = READ_ONCE(a->b);
> > > 
> > > 		if (b) {
> > > 			c->... # crashes because c still appears to be NULL
> > > 		}
> > > 	}
> > > 
> > > 
> > > So when the *first* "b" is allocated, the global data "c" is initialized.  Then
> > > when using a "b", we expect to be able to access "c".  But there's no
> > > data dependency from "b" to "c"; it's a control dependency only.
> > > So smp_load_acquire() is needed, not READ_ONCE().
> > > 
> > > And it can be an internal implementation detail of "b"'s subsystem whether it
> > > happens to use global data "c".
> > 
> > Given that "c" is static, these two subsystems must be in the same
> > translation unit.  So I don't see how this qualifies as being internal to
> > "b"'s subsystem.
> 
> You're missing the point here.  b and c could easily be allocated by a function
> alloc_b() that's in another file.

I am still missing something.

If by "allocated" you mean something like kmalloc(), the compiler doesn't
know the address.  If you instead mean that there is a function that
returns the address of another translation unit's static variable, then
any needed ordering should preferably be built into that function's API.
Either way, one would hope for some documentation of anything the caller
needed to be careful of.

> > Besides which, control dependencies should be used only by LKMM experts
> > at this point.  
> 
> What does that even mean?  Control dependencies are everywhere.

Does the following work better for you?

"... the non-local ordering properties of control dependencies should be
relied on only by LKMM experts ...".

> > > This sort of thing is why people objected to the READ_ONCE() optimization during
> > > the discussion at
> > > https://lkml.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u.
> > > Most kernel developers aren't experts in the LKMM, and they want something
> > > that's guaranteed to be correct without having to to think really hard about it
> > > and make assumptions about the internal implementation details of other
> > > subsystems, how compilers have implemented the C standard, and so on.
> > 
> > And smp_load_acquire()is provided for that reason.  Its name was
> > even based on the nomenclature used in the C standard and elsewhere.
> > And again, control dependencies are for LKMM experts, as they are very
> > tricky to get right.
> 
> How does a developer know that the code they're calling in another subsystem
> wasn't written by one of these "experts" and therefore has a control dependency?

If this control dependency's non-local ordering places any requirements on
the users of that code, those requirements need to be clearly documented.
It is of course better if the control dependency's non-local ordering
properties are local to the code containing those control dependencies
so that the callers don't need to worry about the resulting non-local
ordering.

> > But in the LKMM documentation, you are likely to find LKMM experts who
> > want to optimize all the way, particularly in cases like the one-time
> > init pattern where all the data is often local.  And the best basis for
> > READ_ONCE() in one-time init is not a control dependency, but rather
> > ordering of accesses to a single variable from a single task combined
> > with locking, both of which are quite robust and much easier to use,
> > especially in comparison to control dependencies.
> > 
> > My goal for LKMM is not that each and every developer have a full
> > understanding of every nook and cranny of that model, but instead that
> > people can find the primitives supporting the desired point in the
> > performance/simplicity tradoff space.  And yes, I have more writing
> > to do to make more progress towards that goal.
> 
> So are you saying people should use smp_load_acquire(), or are you saying people
> should use READ_ONCE()?

C'mon, you know the answer to that!  ;-)

The answer is that it depends on both the people and the situation.

In the specific case of crng, where you need address dependency
ordering but the pointed-to data is dynamically allocated and never
deallocated, READ_ONCE() now suffices [1].  Of course, smp_load_acquire()
also suffices, at the cost of extra/expensive instructions on some
architectures.  The cmpxchg() needs at least release semantics, but
presumably no one cares if this operation is a bit more expensive than
it needs to be.

So, is select_crng() used on a fastpath?  If so, READ_ONCE()
might be necessary.  If not, why bother with anything stronger than
smp_load_acquire()?  The usual approach is to run this both ways on ARM
or PowerPC and see if it makes a significant difference.  If there is
no significant difference, keep it simple and just use smp_load_acquire().

If the code was sufficiently performance-insensitive, even better would
be to just use locking.  My hope is that no one bothered with the atomics
without a good reason, but you never know.

I confess some uncertainty as to how the transition from the global
primary_crng and the per-NUMA-node locks is handled.  I hope that the
global primary_crng guards global state that is disjoint from the state
being allocated by do_numa_crng_init()!

Use the simplest thing that gets the job done.  Which in the Linux kernel
often won't be all that simple, but life is like that sometimes.

							Thanx, Paul

[1]	It used to be that READ_ONCE() did -not- suffice on DEC Alpha,
	but this has thankfully changed, so that lockless_dereference()
	is no more.
Eric Biggers Sept. 22, 2020, 9:55 p.m. UTC | #14
On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:
> > You're missing the point here.  b and c could easily be allocated by a function

> > alloc_b() that's in another file.

> 

> I am still missing something.

> 

> If by "allocated" you mean something like kmalloc(), the compiler doesn't

> know the address.  If you instead mean that there is a function that

> returns the address of another translation unit's static variable, then

> any needed ordering should preferably be built into that function's API.

> Either way, one would hope for some documentation of anything the caller

> needed to be careful of.

> 

> > > Besides which, control dependencies should be used only by LKMM experts

> > > at this point.  

> > 

> > What does that even mean?  Control dependencies are everywhere.

> 

> Does the following work better for you?

> 

> "... the non-local ordering properties of control dependencies should be

> relied on only by LKMM experts ...".


No.  I don't know what that means.  And I think very few people would know.

I just want to know if I use the one-time init pattern with a pointer to a data
structure foo, are the readers using foo_use() supposed to use READ_ONCE() or
are they supposed to use smp_load_acquire().

It seems the answer is that smp_load_acquire() is the only safe choice, since
foo_use() *might* involve a control dependency, or might in the future since
it's part of another kernel subsystem and its implementation could change.

> If this control dependency's non-local ordering places any requirements on

> the users of that code, those requirements need to be clearly documented.

> It is of course better if the control dependency's non-local ordering

> properties are local to the code containing those control dependencies

> so that the callers don't need to worry about the resulting non-local

> ordering.

> 

> > > But in the LKMM documentation, you are likely to find LKMM experts who

> > > want to optimize all the way, particularly in cases like the one-time

> > > init pattern where all the data is often local.  And the best basis for

> > > READ_ONCE() in one-time init is not a control dependency, but rather

> > > ordering of accesses to a single variable from a single task combined

> > > with locking, both of which are quite robust and much easier to use,

> > > especially in comparison to control dependencies.

> > > 

> > > My goal for LKMM is not that each and every developer have a full

> > > understanding of every nook and cranny of that model, but instead that

> > > people can find the primitives supporting the desired point in the

> > > performance/simplicity tradoff space.  And yes, I have more writing

> > > to do to make more progress towards that goal.

> > 

> > So are you saying people should use smp_load_acquire(), or are you saying people

> > should use READ_ONCE()?

> 

> C'mon, you know the answer to that!  ;-)

> 

> The answer is that it depends on both the people and the situation.

> 

> In the specific case of crng, where you need address dependency

> ordering but the pointed-to data is dynamically allocated and never

> deallocated, READ_ONCE() now suffices [1].  Of course, smp_load_acquire()

> also suffices, at the cost of extra/expensive instructions on some

> architectures.  The cmpxchg() needs at least release semantics, but

> presumably no one cares if this operation is a bit more expensive than

> it needs to be.

> 

> So, is select_crng() used on a fastpath?  If so, READ_ONCE()

> might be necessary.  If not, why bother with anything stronger than

> smp_load_acquire()?  The usual approach is to run this both ways on ARM

> or PowerPC and see if it makes a significant difference.  If there is

> no significant difference, keep it simple and just use smp_load_acquire().

> 

> If the code was sufficiently performance-insensitive, even better would

> be to just use locking.  My hope is that no one bothered with the atomics

> without a good reason, but you never know.

> 

> I confess some uncertainty as to how the transition from the global

> primary_crng and the per-NUMA-node locks is handled.  I hope that the

> global primary_crng guards global state that is disjoint from the state

> being allocated by do_numa_crng_init()!


crng_node_pool just uses the one-time init pattern.  It's nothing unusual; lots
of other places in the kernel want to do one-time initialization too.  It seems
to be one of the more common cases where people run into the LKMM at all.
I tried to document it in
https://lkml.kernel.org/lkml/20200717044427.68747-1-ebiggers@kernel.org/T/#u,
but people complained it was still too complicated.

I hope that people can at least reach some general recommendation about
READ_ONCE() vs. smp_load_acquire(), so that every kernel developer doesn't have
to understand the detailed difference, and so that we don't need to have a long
discussion (potentially requiring LWN coverage) about every patch.

> 

> Use the simplest thing that gets the job done.  Which in the Linux kernel

> often won't be all that simple, but life is like that sometimes.

> 

> 							Thanx, Paul

> 

> [1]	It used to be that READ_ONCE() did -not- suffice on DEC Alpha,

> 	but this has thankfully changed, so that lockless_dereference()

> 	is no more.


Let me give an example using spinlock_t, since that's used in crng_node_pool.
However, it could be any other data structure too; this is *just an example*.
And it doesn't matter if the implementation is currently different; the point is
that it's an *implementation*.

The allocation side uses spin_lock_init(), while the read side uses spin_lock().
Let's say that some debugging feature is enabled where spin locks use some
global debugging information (say, a list of all locks) that gets allocated the
first time a spin lock is initialized:

	static struct spin_lock_debug_info *debug_info;
	static DEFINE_MUTEX(debug_info_alloc_mutex);

	void spin_lock_init(spinlock_t *lock)
	{
	#ifdef CONFIG_DEBUG_SPIN_LOCKS
		mutex_lock(&debug_info_alloc_mutex);
		if (!debug_info)
			debug_info = alloc_debug_info();
		add_lock(debug_info, lock);
		mutex_unlock(&debug_info_alloc_mutex);
	#endif
		real_spin_lock_init(lock);
	}

	void spin_lock(spinlock_t *lock)
	{
	#ifdef CONFIG_DEBUG_SPIN_LOCKS
		debug_info->...; # use the debug info
	#endif
		real_spin_lock(lock);
	}

In that case, readers would have a control dependency between the condition of
the data struct containing the spinlock_t being non-NULL, and the dereference of
debug_info by spin_lock().  So anyone "receiving" a data structure containing a
spinlock_t would need to use smp_load_acquire(), not READ_ONCE().

Point is, whether it's safe to use READ_ONCE() with a data structure or not is
an implementation detail, not an API guarantee.

- Eric
Eric Biggers Sept. 25, 2020, 2:09 a.m. UTC | #15
On Thu, Sep 24, 2020 at 05:59:34PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 22, 2020 at 02:55:58PM -0700, Eric Biggers wrote:

> > On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:

> > > > You're missing the point here.  b and c could easily be allocated by a function

> > > > alloc_b() that's in another file.

> > > 

> > > I am still missing something.

> > > 

> > > If by "allocated" you mean something like kmalloc(), the compiler doesn't

> > > know the address.  If you instead mean that there is a function that

> > > returns the address of another translation unit's static variable, then

> > > any needed ordering should preferably be built into that function's API.

> > > Either way, one would hope for some documentation of anything the caller

> > > needed to be careful of.

> > > 

> > > > > Besides which, control dependencies should be used only by LKMM experts

> > > > > at this point.  

> > > > 

> > > > What does that even mean?  Control dependencies are everywhere.

> > > 

> > > Does the following work better for you?

> > > 

> > > "... the non-local ordering properties of control dependencies should be

> > > relied on only by LKMM experts ...".

> > 

> > No.  I don't know what that means.  And I think very few people would know.

> > 

> > I just want to know if I use the one-time init pattern with a pointer to a data

> > structure foo, are the readers using foo_use() supposed to use READ_ONCE() or

> > are they supposed to use smp_load_acquire().

> > 

> > It seems the answer is that smp_load_acquire() is the only safe choice, since

> > foo_use() *might* involve a control dependency, or might in the future since

> > it's part of another kernel subsystem and its implementation could change.

> 

> First, the specific issue of one-time init.

> 

> If you are the one writing the code implementing one-time init, it is your

> choice.  It seems like you prefer smp_load_acquire().  If someone sees

> performance problems due to the resulting memory-barrier instructions,

> they have the option of submitting a patch and either forking the

> implementation or taking your implementation over from you, depending

> on how that conversation goes.


It doesn't matter what I "prefer".  The question is, how to write code that is
actually guaranteed to be correct on all supported Linux architectures, without
assuming internal implementation details of other kernel subsystems.

> Third, your first pointer-based variant works with smp_load_acquire(),

> but could instead use READ_ONCE() as long as it is reworked to something

> like this:

> 

> 	static struct foo *foo;

> 	static DEFINE_MUTEX(foo_init_mutex);

> 

> 	// The returned pointer must be handled carefully in the same

> 	// way as would a pointer returned from rcu_derefeference().

> 	// See Documentation/RCU/rcu_dereference.rst.

> 	struct foo *init_foo_if_needed(void)

> 	{

> 		int err = 0;

> 		struct foo *retp;

> 

> 		/* pairs with smp_store_release() below */

> 		retp = READ_ONCE(foo);

> 		if (retp)

> 			return retp;

> 

> 		mutex_lock(&foo_init_mutex);

> 		if (!foo) {

> 			// The compiler doesn't know the address:

> 			struct foo *p = alloc_foo();

> 

> 			if (p) /* pairs with smp_load_acquire() above */

> 				smp_store_release(&foo, p);

> 			else

> 				err = -ENOMEM;

> 		}

> 		mutex_unlock(&foo_init_mutex);

> 		return err;

> 	}

> 

> There are more than 2,000 instances of the rcu_dereference() family of

> primitives in the v5.8 kernel, so it should not be hard to find people

> who are familiar with it.  TL;DR:  Just dereference the pointer and you

> will be fine.


I don't understand why you are saying READ_ONCE() is safe here.  alloc_foo()
could very well initialize some static data as well as foo itself, and after
'if (retp)', use_foo() can be called that happens to use the static data as well
as foo itself.  That's a control dependency that would require
smp_load_acquire(); is it not?  And if foo is some object managed by another
kernel subsystem, that can easily happen without this code being specifically
aware of the implementation detail that actually causes the control dependency.

Also just because there are 2000 instances of rcu_dereference() doesn't mean
kernel developers understand the pitfalls of using it.  Especially since
real-world problems would only be seen very rarely on specific CPU architectures
and/or if some seemingly unrelated kernel code changes.

> > Let me give an example using spinlock_t, since that's used in crng_node_pool.

> > However, it could be any other data structure too; this is *just an example*.

> > And it doesn't matter if the implementation is currently different; the point is

> > that it's an *implementation*.

> > 

> > The allocation side uses spin_lock_init(), while the read side uses spin_lock().

> > Let's say that some debugging feature is enabled where spin locks use some

> > global debugging information (say, a list of all locks) that gets allocated the

> > first time a spin lock is initialized:

> > 

> > 	static struct spin_lock_debug_info *debug_info;

> > 	static DEFINE_MUTEX(debug_info_alloc_mutex);

> > 

> > 	void spin_lock_init(spinlock_t *lock)

> > 	{

> > 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> > 		mutex_lock(&debug_info_alloc_mutex);

> > 		if (!debug_info)

> > 			debug_info = alloc_debug_info();

> > 		add_lock(debug_info, lock);

> > 		mutex_unlock(&debug_info_alloc_mutex);

> > 	#endif

> > 		real_spin_lock_init(lock);

> > 	}

> > 

> > 	void spin_lock(spinlock_t *lock)

> > 	{

> > 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> > 		debug_info->...; # use the debug info

> > 	#endif

> > 		real_spin_lock(lock);

> > 	}

> > 

> > In that case, readers would have a control dependency between the condition of

> > the data struct containing the spinlock_t being non-NULL, and the dereference of

> > debug_info by spin_lock().  So anyone "receiving" a data structure containing a

> > spinlock_t would need to use smp_load_acquire(), not READ_ONCE().

> 

> Sorry, no.

> 

> The user had jolly well better make -very- sure that the call to

> spin_lock_init() is ordered before any call to spin_lock().  Running

> spin_lock() concurrently with spin_lock_init() will bring you nothing

> but sorrow, even without that debug_info control-dependency issue.


The example was one-time init of a data structure containing a spin lock.
Like the patch this thread is about:
https://lkml.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org

So you're saying that smp_load_acquire() is needed, since otherwise
spin_lock_init() won't be fully ordered before spin_lock()?

> In the various one-time init examples, the required ordering would be

> correctly supplied if spin_lock_init() was invoked by init_foo() or

> alloc_foo(), depending on the example, and used only after a successful

> return from init_foo_if_needed().  None of these examples rely on control

> dependencies.


... or are you saying that READ_ONCE() provides the required full ordering
between spin_lock_init() and spin_lock()?  If so, why, and is it guaranteed
independent of the implementation of these functions?

Please explain in English.

Thanks.

- Eric
Paul E. McKenney Sept. 25, 2020, 3:31 a.m. UTC | #16
On Thu, Sep 24, 2020 at 07:09:08PM -0700, Eric Biggers wrote:
> On Thu, Sep 24, 2020 at 05:59:34PM -0700, Paul E. McKenney wrote:

> > On Tue, Sep 22, 2020 at 02:55:58PM -0700, Eric Biggers wrote:

> > > On Tue, Sep 22, 2020 at 01:56:28PM -0700, Paul E. McKenney wrote:

> > > > > You're missing the point here.  b and c could easily be allocated by a function

> > > > > alloc_b() that's in another file.

> > > > 

> > > > I am still missing something.

> > > > 

> > > > If by "allocated" you mean something like kmalloc(), the compiler doesn't

> > > > know the address.  If you instead mean that there is a function that

> > > > returns the address of another translation unit's static variable, then

> > > > any needed ordering should preferably be built into that function's API.

> > > > Either way, one would hope for some documentation of anything the caller

> > > > needed to be careful of.

> > > > 

> > > > > > Besides which, control dependencies should be used only by LKMM experts

> > > > > > at this point.  

> > > > > 

> > > > > What does that even mean?  Control dependencies are everywhere.

> > > > 

> > > > Does the following work better for you?

> > > > 

> > > > "... the non-local ordering properties of control dependencies should be

> > > > relied on only by LKMM experts ...".

> > > 

> > > No.  I don't know what that means.  And I think very few people would know.

> > > 

> > > I just want to know if I use the one-time init pattern with a pointer to a data

> > > structure foo, are the readers using foo_use() supposed to use READ_ONCE() or

> > > are they supposed to use smp_load_acquire().

> > > 

> > > It seems the answer is that smp_load_acquire() is the only safe choice, since

> > > foo_use() *might* involve a control dependency, or might in the future since

> > > it's part of another kernel subsystem and its implementation could change.

> > 

> > First, the specific issue of one-time init.

> > 

> > If you are the one writing the code implementing one-time init, it is your

> > choice.  It seems like you prefer smp_load_acquire().  If someone sees

> > performance problems due to the resulting memory-barrier instructions,

> > they have the option of submitting a patch and either forking the

> > implementation or taking your implementation over from you, depending

> > on how that conversation goes.

> 

> It doesn't matter what I "prefer".  The question is, how to write code that is

> actually guaranteed to be correct on all supported Linux architectures, without

> assuming internal implementation details of other kernel subsystems.


And that question allows ample room for personal preferences.

There are after all tradeoffs.  Do you want to live within the current
knowledge of your users, or are you willing to invest time and energy
into teaching them something new?  If someone wants a level of performance
that is accommodated only by a difficult-to-use pattern, will you choose
to accommodate them, or will you tell them to build write their own?

There are often a number of ways to make something work, and they all
have advantages and disadvantages.  There are tradeoffs, and preferences
have a role to play as well.

> > Third, your first pointer-based variant works with smp_load_acquire(),

> > but could instead use READ_ONCE() as long as it is reworked to something

> > like this:

> > 

> > 	static struct foo *foo;

> > 	static DEFINE_MUTEX(foo_init_mutex);

> > 

> > 	// The returned pointer must be handled carefully in the same

> > 	// way as would a pointer returned from rcu_derefeference().

> > 	// See Documentation/RCU/rcu_dereference.rst.

> > 	struct foo *init_foo_if_needed(void)

> > 	{

> > 		int err = 0;

> > 		struct foo *retp;

> > 

> > 		/* pairs with smp_store_release() below */

> > 		retp = READ_ONCE(foo);

> > 		if (retp)

> > 			return retp;

> > 

> > 		mutex_lock(&foo_init_mutex);

> > 		if (!foo) {

> > 			// The compiler doesn't know the address:

> > 			struct foo *p = alloc_foo();

> > 

> > 			if (p) /* pairs with smp_load_acquire() above */

> > 				smp_store_release(&foo, p);

> > 			else

> > 				err = -ENOMEM;

> > 		}

> > 		mutex_unlock(&foo_init_mutex);

> > 		return err;

> > 	}

> > 

> > There are more than 2,000 instances of the rcu_dereference() family of

> > primitives in the v5.8 kernel, so it should not be hard to find people

> > who are familiar with it.  TL;DR:  Just dereference the pointer and you

> > will be fine.

> 

> I don't understand why you are saying READ_ONCE() is safe here.  alloc_foo()

> could very well initialize some static data as well as foo itself, and after

> 'if (retp)', use_foo() can be called that happens to use the static data as well

> as foo itself.  That's a control dependency that would require

> smp_load_acquire(); is it not?  And if foo is some object managed by another

> kernel subsystem, that can easily happen without this code being specifically

> aware of the implementation detail that actually causes the control dependency.


If alloc_foo() also initializes static data, then it really should have
some name other than alloc_foo().

But yes, if it does initialize static data, you are back at the earlier
examples.  For this variant to work, either the initialization must
be confined to the foo pointer and the pointed-to data on the one hand,
or some additional synchronization is required for users wishing to
access the static data on the other.

So to reiterate, in order to correctly use the above pattern, you
must confine the initialization to the structure referenced by the
foo pointer.  Otherwise, you need to use some other pattern.

Apologies if that was not clear.

> Also just because there are 2000 instances of rcu_dereference() doesn't mean

> kernel developers understand the pitfalls of using it.  Especially since

> real-world problems would only be seen very rarely on specific CPU architectures

> and/or if some seemingly unrelated kernel code changes.


I would be more worried about the compiler than the CPU.  And yet these
instances do seem to work.

> > > Let me give an example using spinlock_t, since that's used in crng_node_pool.

> > > However, it could be any other data structure too; this is *just an example*.

> > > And it doesn't matter if the implementation is currently different; the point is

> > > that it's an *implementation*.

> > > 

> > > The allocation side uses spin_lock_init(), while the read side uses spin_lock().

> > > Let's say that some debugging feature is enabled where spin locks use some

> > > global debugging information (say, a list of all locks) that gets allocated the

> > > first time a spin lock is initialized:

> > > 

> > > 	static struct spin_lock_debug_info *debug_info;

> > > 	static DEFINE_MUTEX(debug_info_alloc_mutex);

> > > 

> > > 	void spin_lock_init(spinlock_t *lock)

> > > 	{

> > > 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> > > 		mutex_lock(&debug_info_alloc_mutex);

> > > 		if (!debug_info)

> > > 			debug_info = alloc_debug_info();

> > > 		add_lock(debug_info, lock);

> > > 		mutex_unlock(&debug_info_alloc_mutex);

> > > 	#endif

> > > 		real_spin_lock_init(lock);

> > > 	}

> > > 

> > > 	void spin_lock(spinlock_t *lock)

> > > 	{

> > > 	#ifdef CONFIG_DEBUG_SPIN_LOCKS

> > > 		debug_info->...; # use the debug info

> > > 	#endif

> > > 		real_spin_lock(lock);

> > > 	}

> > > 

> > > In that case, readers would have a control dependency between the condition of

> > > the data struct containing the spinlock_t being non-NULL, and the dereference of

> > > debug_info by spin_lock().  So anyone "receiving" a data structure containing a

> > > spinlock_t would need to use smp_load_acquire(), not READ_ONCE().

> > 

> > Sorry, no.

> > 

> > The user had jolly well better make -very- sure that the call to

> > spin_lock_init() is ordered before any call to spin_lock().  Running

> > spin_lock() concurrently with spin_lock_init() will bring you nothing

> > but sorrow, even without that debug_info control-dependency issue.

> 

> The example was one-time init of a data structure containing a spin lock.

> Like the patch this thread is about:

> https://lkml.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org

> 

> So you're saying that smp_load_acquire() is needed, since otherwise

> spin_lock_init() won't be fully ordered before spin_lock()?


I am absolutely not, repeat, -not-, saying that.

Let's assume that my variation on your alloc_foo() example is used.
For simplicity, let's assume the first one using the lock.  A very
similar line of reasoning works for the cmpxchg().

And I -am- assuming that the spinlock is in the newly allocated structure
(and I cannot tell whether or not this is the case from your patch).
I am also assuming that the pointer is stored with smp_store_release()
or stronger, and that all of the initialization (including that of the
spinlock) is carried out by code preceding the smp_store_release().
I am also assuming that users get the foo pointer only by calling 
init_foo_if_needed(), and have no sort of "side-channel" access
to the spinlock.

Then the pointer is not in memory until after the foo structure is
initialized (including the spinlock) and the user picks up the pointer
before using the spinlock.  As noted in in my paragraph immediately
below, this provides the needed ordering so that spin_lock() and
spin_lock_init() never run concurrently.

But a key point is that all of these usage patterns come with restrictions.
If you want to make use of a given pattern, you will need to abide by
its restrictions.

Which should be no surprise.  There are also restrictions when using
things like locking, it is just that a lot of people are very accustomed
to abiding by those restrictions.  But fail to acquire the lock where
you should have or acquire a pair of locks in the wrong order and life
will get very bad.

> > In the various one-time init examples, the required ordering would be

> > correctly supplied if spin_lock_init() was invoked by init_foo() or

> > alloc_foo(), depending on the example, and used only after a successful

> > return from init_foo_if_needed().  None of these examples rely on control

> > dependencies.

> 

> ... or are you saying that READ_ONCE() provides the required full ordering

> between spin_lock_init() and spin_lock()?  If so, why, and is it guaranteed

> independent of the implementation of these functions?


Single operations do not provide ordering, instead pairs of operations
provide ordering.  Hence the phrase "memory-barrier pairing".

If the pointer returned by the READ_ONCE() is handled properly
(rcu_dereference.rst again), then the combination of that READ_ONCE()
and the proper handling on the one hand and the ordering of foo_alloc()
and the smp_store_release() on the other provides the needed ordering.

Which makes sense when you think about it.  If either side fails to do
its part to provide the needed ordering, then there won't be ordering.

So you do need to think about both sides to reason about ordering.

In theory, you also need to do this with locking, but most people use
shortcuts of the form "I hold this lock, so nothing can change unless I
change it."  Which is an excellent way to think about locks, at least
until such time as you are the guy who is implementing the locking
primitive itself.

> Please explain in English.


Oh.  My apologies.  What language have I been writing in?

							Thanx, Paul
diff mbox series

Patch

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 09b1551d4092f..9f1e7a4a0fbbb 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -843,8 +843,8 @@  static void do_numa_crng_init(struct work_struct *work)
 		crng_initialize_secondary(crng);
 		pool[i] = crng;
 	}
-	mb();
-	if (cmpxchg(&crng_node_pool, NULL, pool)) {
+	/* pairs with smp_load_acquire() in select_crng() */
+	if (cmpxchg_release(&crng_node_pool, NULL, pool) != NULL) {
 		for_each_node(i)
 			kfree(pool[i]);
 		kfree(pool);
@@ -857,8 +857,26 @@  static void numa_crng_init(void)
 {
 	schedule_work(&numa_crng_init_work);
 }
+
+static inline struct crng_state *select_crng(void)
+{
+	struct crng_state **pool;
+	int nid = numa_node_id();
+
+	/* pairs with cmpxchg_release() in do_numa_crng_init() */
+	pool = smp_load_acquire(&crng_node_pool);
+	if (pool && pool[nid])
+		return pool[nid];
+
+	return &primary_crng;
+}
 #else
 static void numa_crng_init(void) {}
+
+static inline struct crng_state *select_crng(void)
+{
+	return &primary_crng;
+}
 #endif
 
 /*
@@ -1005,15 +1023,7 @@  static void _extract_crng(struct crng_state *crng,
 
 static void extract_crng(__u8 out[CHACHA_BLOCK_SIZE])
 {
-	struct crng_state *crng = NULL;
-
-#ifdef CONFIG_NUMA
-	if (crng_node_pool)
-		crng = crng_node_pool[numa_node_id()];
-	if (crng == NULL)
-#endif
-		crng = &primary_crng;
-	_extract_crng(crng, out);
+	_extract_crng(select_crng(), out);
 }
 
 /*
@@ -1042,15 +1052,7 @@  static void _crng_backtrack_protect(struct crng_state *crng,
 
 static void crng_backtrack_protect(__u8 tmp[CHACHA_BLOCK_SIZE], int used)
 {
-	struct crng_state *crng = NULL;
-
-#ifdef CONFIG_NUMA
-	if (crng_node_pool)
-		crng = crng_node_pool[numa_node_id()];
-	if (crng == NULL)
-#endif
-		crng = &primary_crng;
-	_crng_backtrack_protect(crng, tmp, used);
+	_crng_backtrack_protect(select_crng(), tmp, used);
 }
 
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)