Message ID | 20220111071212.1210124-1-surenb@google.com |
---|---|
State | Superseded |
Headers | show |
Series | [v2,1/1] psi: Fix uaf issue when psi trigger is destroyed while being polled | expand |
On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <ebiggers@kernel.org> wrote: > > The write here needs to use smp_store_release(), since it is paired with the > concurrent READ_ONCE() in psi_trigger_poll(). A smp_store_release() doesn't make sense pairing with a READ_ONCE(). Any memory ordering that the smp_store_release() does on the writing side is entirely irrelevant, since the READ_ONCE() doesn't imply any ordering on the reading side. Ordering one but not the other is nonsensical. So the proper pattern is to use a WRITE_ONCE() to pair with a READ_ONCE() (when you don't care about memory ordering, or you handle it explicitly), or a smp_load_acquire() with a smp_store_release() (in which case writes before the smp_store_release() on the writing side will be ordered wrt accesses after smp_load_acquire() on the reading side). Of course, in practice, for pointers, the whole "dereference off a pointer" on the read side *does* imply a barrier in all relevant situations. So yes, a smp_store_release() -> READ_ONCE() does work in practice, although it's technically wrong (in particular, it's wrong on alpha, because of the completely broken memory ordering that alpha has that doesn't even honor data dependencies as read-side orderings) But in this case, I do think that since there's some setup involved with the trigger pointer, the proper serialization is to use smp_store_release() to set the pointer, and then smp_load_acquire() on the reading side. Or just use the RCU primitives - they are even better optimized, and handle exactly that case, and can be more efficient on some architectures if release->acquire isn't already cheap. That said, we've pretty much always accepted that normal word writes are not going to tear, so we *have* also accepted just - do any normal store of a value on the write side - do a READ_ONCE() on the reading side where the reading side doesn't actually care *what* value it gets, it only cares that the value it gets is *stable* (ie no compiler reloads that might show up as two different values on the reading side). Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need to worry about memory ordering separately. > > + seq->private = new; > > Likewise here. Yeah, same deal, except here you can't even use the RCU ones, because 'seq->private' isn't annotated for RCU. Or you'd do the casting, of course. Linus
On Tue, Jan 11, 2022 at 11:11 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <ebiggers@kernel.org> wrote: > > > > The write here needs to use smp_store_release(), since it is paired with the > > concurrent READ_ONCE() in psi_trigger_poll(). > > A smp_store_release() doesn't make sense pairing with a READ_ONCE(). > > Any memory ordering that the smp_store_release() does on the writing > side is entirely irrelevant, since the READ_ONCE() doesn't imply any > ordering on the reading side. Ordering one but not the other is > nonsensical. > > So the proper pattern is to use a WRITE_ONCE() to pair with a > READ_ONCE() (when you don't care about memory ordering, or you handle > it explicitly), or a smp_load_acquire() with a smp_store_release() (in > which case writes before the smp_store_release() on the writing side > will be ordered wrt accesses after smp_load_acquire() on the reading > side). > > Of course, in practice, for pointers, the whole "dereference off a > pointer" on the read side *does* imply a barrier in all relevant > situations. So yes, a smp_store_release() -> READ_ONCE() does work in > practice, although it's technically wrong (in particular, it's wrong > on alpha, because of the completely broken memory ordering that alpha > has that doesn't even honor data dependencies as read-side orderings) > > But in this case, I do think that since there's some setup involved > with the trigger pointer, the proper serialization is to use > smp_store_release() to set the pointer, and then smp_load_acquire() on > the reading side. > > Or just use the RCU primitives - they are even better optimized, and > handle exactly that case, and can be more efficient on some > architectures if release->acquire isn't already cheap. > > That said, we've pretty much always accepted that normal word writes > are not going to tear, so we *have* also accepted just > > - do any normal store of a value on the write side > > - do a READ_ONCE() on the reading side > > where the reading side doesn't actually care *what* value it gets, it > only cares that the value it gets is *stable* (ie no compiler reloads > that might show up as two different values on the reading side). > > Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need > to worry about memory ordering separately. > > > > + seq->private = new; > > > > Likewise here. > > Yeah, same deal, except here you can't even use the RCU ones, because > 'seq->private' isn't annotated for RCU. > > Or you'd do the casting, of course. Thanks for the explanation! So, it sounds like the best (semantically correct) option I have here is smp_store_release() to set the pointer, and then smp_load_acquire() to read it. Is my understanding correct? > > Linus
On Tue, Jan 11, 2022 at 11:27 AM Suren Baghdasaryan <surenb@google.com> wrote: > > Thanks for the explanation! > So, it sounds like the best (semantically correct) option I have here > is smp_store_release() to set the pointer, and then smp_load_acquire() > to read it. Is my understanding correct? Yeah, that's the clearest one from a memory ordering standpoint, and is generally also cheap. Linus
On Tue, Jan 11, 2022 at 11:41 AM Eric Biggers <ebiggers@kernel.org> wrote: > > On Tue, Jan 11, 2022 at 11:11:32AM -0800, Linus Torvalds wrote: > > On Tue, Jan 11, 2022 at 10:48 AM Eric Biggers <ebiggers@kernel.org> wrote: > > > > > > The write here needs to use smp_store_release(), since it is paired with the > > > concurrent READ_ONCE() in psi_trigger_poll(). > > > > A smp_store_release() doesn't make sense pairing with a READ_ONCE(). > > > > Any memory ordering that the smp_store_release() does on the writing > > side is entirely irrelevant, since the READ_ONCE() doesn't imply any > > ordering on the reading side. Ordering one but not the other is > > nonsensical. > > > > So the proper pattern is to use a WRITE_ONCE() to pair with a > > READ_ONCE() (when you don't care about memory ordering, or you handle > > it explicitly), or a smp_load_acquire() with a smp_store_release() (in > > which case writes before the smp_store_release() on the writing side > > will be ordered wrt accesses after smp_load_acquire() on the reading > > side). > > > > Of course, in practice, for pointers, the whole "dereference off a > > pointer" on the read side *does* imply a barrier in all relevant > > situations. So yes, a smp_store_release() -> READ_ONCE() does work in > > practice, although it's technically wrong (in particular, it's wrong > > on alpha, because of the completely broken memory ordering that alpha > > has that doesn't even honor data dependencies as read-side orderings) > > > > But in this case, I do think that since there's some setup involved > > with the trigger pointer, the proper serialization is to use > > smp_store_release() to set the pointer, and then smp_load_acquire() on > > the reading side. > > > > Or just use the RCU primitives - they are even better optimized, and > > handle exactly that case, and can be more efficient on some > > architectures if release->acquire isn't already cheap. > > > > That said, we've pretty much always accepted that normal word writes > > are not going to tear, so we *have* also accepted just > > > > - do any normal store of a value on the write side > > > > - do a READ_ONCE() on the reading side > > > > where the reading side doesn't actually care *what* value it gets, it > > only cares that the value it gets is *stable* (ie no compiler reloads > > that might show up as two different values on the reading side). > > > > Of course, that has the same issue as WRITE_ONCE/READ_ONCE - you need > > to worry about memory ordering separately. > > > > > > + seq->private = new; > > > > > > Likewise here. > > > > Yeah, same deal, except here you can't even use the RCU ones, because > > 'seq->private' isn't annotated for RCU. > > > > Or you'd do the casting, of course. > > > > This is yet another case of "one time init". There have been long discussions > on this topic before: > * https://lore.kernel.org/linux-fsdevel/20200713033330.205104-1-ebiggers@kernel.org/T/#u > * https://lore.kernel.org/lkml/20200916233042.51634-1-ebiggers@kernel.org/T/#u > * https://lwn.net/Articles/827180/ > > I even attempted to document the best practices: > * https://lore.kernel.org/linux-fsdevel/20200717044427.68747-1-ebiggers@kernel.org/T/#u > > However, no one could agree on whether READ_ONCE() or smp_load_acquire() should > be used. smp_load_acquire() is always correct, so it remains my preference. > However, READ_ONCE() is correct in some cases, and some people (including the > primary LKMM maintainer) insist that it be used in all such cases, as well as in > rcu_dereference() even though this places difficult-to-understand constraints on > how rcu_dereference() can be used. > > My preference is that smp_load_acquire() be used. But be aware that this risks > the READ_ONCE() people coming out of the woodwork and arguing for READ_ONCE(). I like my chances here (I believe we do need memory ordering in this case). I'll post a fix with smp_load_acquire/smp_store_release shortly after I run my tests. Thanks for the guidance! > > - Eric
On Tue, Jan 11, 2022 at 12:15 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Jan 11, 2022 at 11:41 AM Eric Biggers <ebiggers@kernel.org> wrote: > > > > This is yet another case of "one time init". > > Ehh. It's somewhat debatable. > > For a flag that sets a value once, the rules are somewhat different. > In that case, people may simply not care about memory ordering at all, > because all they care about is the actual flag value, and - thanks to > the one-time behavior - basically whether some transition had happened > or not. That's not all that unusual. > > But when you fetch a pointer, things are at least conceptually > slightly different. > > Of course, you may use the existence of the pointer itself as a flag > (ie just a "NULL or not"), in which case it's the same as any other > one-time flag thing. > > But if you use it to dereference something, then _by_definition_ > you're not just fetching a one-time flag - even if the pointer is only > set once. At that point, at a minimum, you require that that thing has > been initialized. > > Now, it's then absolutely true that the stuff behind the pointer may > then have other reasons not to care about memory ordering again, and > you may be able to avoid memory ordering even then. If you're just > switching the pointer around between different objects that has been > statically allocated and initialized, then there is no memory ordering > required, for example. You might be back to the "I just want one or > the other of these two pointers". > > But if you have something that was initialized before the pointer was > assigned, you really do hit the problem we had on alpha, where even if > you order the pointer write side accesses, the dereferencing of the > pointer may not be ordered on the read side. > > Now, alpha is basically dead, and we probably don't really care. Even > on alpha, the whole "data dependency isn't a memory ordering" is > almost impossible to trigger. > > And in fact, to avoid too much pain we ended up saying "screw alpha" > and added a memory barrier to READ_ONCE(), so it turns out that > smp_store_release -> READ_ONCE() does work because we just couldn't be > bothered to try something more proper. > > So yeah, READ_ONCE() ends up making the "access through a pointer" > thing safe, but that's less of a "it should be safe" and more of a "we > can't waste time dealing with braindamage on platforms that don't > matter". > > In general, I think the rule should be that READ_ONCE() is for things > that simply don't care about memory ordering at all (or do whatever > ordering they want explicitly). And yes, one such very common case is > the "one-way flag" where once a certain state has been reached, it's > idempotent. > > Of course, then we have the fact that READ_ONCE() can be more > efficient than "smp_load_acquire()" on some platforms, so if something > is *hugely* performance-critical, you might use READ_ONCE() even if > it's not really technically the right thing. > > So it's complicated. > > A lot of READ_ONCE() users exist just for historical reasons because > they predated smp_store_release/smp_load_acquire. They may well have > been using ACCESS_ONCE() long ago. > > And some are there because it's a very critical piece of code, and > it's very intentional. > > But if you don't have some huge reasons, I really would prefer people > use "smp_store_release -> smp_load_acquire" as a very clear "handoff" > event. Posted v3 with smp_store_release/smp_load_acquire: https://lore.kernel.org/all/20220111232309.1786347-1-surenb@google.com Thanks! > > Linus
On Tue, Jan 11, 2022 at 11:11:32AM -0800, Linus Torvalds wrote: > Of course, in practice, for pointers, the whole "dereference off a > pointer" on the read side *does* imply a barrier in all relevant > situations. So yes, a smp_store_release() -> READ_ONCE() does work in > practice, although it's technically wrong (in particular, it's wrong > on alpha, because of the completely broken memory ordering that alpha > has that doesn't even honor data dependencies as read-side orderings) On a tangent, that actually works, even on Alpha, see commit d646285885154 ("alpha: Override READ_ONCE() with barriered implementation").
On Mon, Jan 10, 2022 at 11:12:12PM -0800, Suren Baghdasaryan wrote: > With write operation on psi files replacing old trigger with a new one, > the lifetime of its waitqueue is totally arbitrary. Overwriting an > existing trigger causes its waitqueue to be freed and pending poll() > will stumble on trigger->event_wait which was destroyed. > Fix this by disallowing to redefine an existing psi trigger. If a write > operation is used on a file descriptor with an already existing psi > trigger, the operation will fail with EBUSY error. > Also bypass a check for psi_disabled in the psi_trigger_destroy as the > flag can be flipped after the trigger is created, leading to a memory > leak. > > Fixes: 0e94682b73bf ("psi: introduce psi monitor") > Cc: stable@vger.kernel.org > Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com > Analyzed-by: Eric Biggers <ebiggers@kernel.org> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Suren Baghdasaryan <surenb@google.com> That looks good to me, thanks Suren. Acked-by: Johannes Weiner <hannes@cmpxchg.org> Peter, would you mind picking this up and routing this to Linus through the sched tree, please?
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Of course, in practice, for pointers, the whole "dereference off a > pointer" on the read side *does* imply a barrier in all relevant > situations. So yes, a smp_store_release() -> READ_ONCE() does work in > practice, although it's technically wrong (in particular, it's wrong > on alpha, because of the completely broken memory ordering that alpha > has that doesn't even honor data dependencies as read-side orderings) READ_ONCE has contained the alpha barrier since 2017: commit 76ebbe78f7390aee075a7f3768af197ded1bdfbb Author: Will Deacon <will@kernel.org> Date: Tue Oct 24 11:22:47 2017 +0100 locking/barriers: Add implicit smp_read_barrier_depends() to READ_ONCE() Cheers,
diff --git a/Documentation/accounting/psi.rst b/Documentation/accounting/psi.rst index f2b3439edcc2..860fe651d645 100644 --- a/Documentation/accounting/psi.rst +++ b/Documentation/accounting/psi.rst @@ -92,7 +92,8 @@ Triggers can be set on more than one psi metric and more than one trigger for the same psi metric can be specified. However for each trigger a separate file descriptor is required to be able to poll it separately from others, therefore for each trigger a separate open() syscall should be made even -when opening the same psi interface file. +when opening the same psi interface file. Write operations to a file descriptor +with an already existing psi trigger will fail with EBUSY. Monitors activate only when system enters stall state for the monitored psi metric and deactivates upon exit from the stall state. While system is diff --git a/include/linux/psi.h b/include/linux/psi.h index 65eb1476ac70..74f7148dfb9f 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -32,7 +32,7 @@ void cgroup_move_task(struct task_struct *p, struct css_set *to); struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, size_t nbytes, enum psi_res res); -void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *t); +void psi_trigger_destroy(struct psi_trigger *t); __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, poll_table *wait); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index 0a23300d49af..6537d0c92825 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -129,9 +129,6 @@ struct psi_trigger { * events to one per window */ u64 last_event_time; - - /* Refcounting to prevent premature destruction */ - struct kref refcount; }; struct psi_group { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index cafb8c114a21..93b51a2104f7 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3642,6 +3642,12 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, cgroup_get(cgrp); cgroup_kn_unlock(of->kn); + /* Allow only one trigger per file descriptor */ + if (ctx->psi.trigger) { + cgroup_put(cgrp); + return -EBUSY; + } + psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi; new = psi_trigger_create(psi, buf, nbytes, res); if (IS_ERR(new)) { @@ -3649,8 +3655,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf, return PTR_ERR(new); } - psi_trigger_replace(&ctx->psi.trigger, new); - + ctx->psi.trigger = new; cgroup_put(cgrp); return nbytes; @@ -3689,7 +3694,7 @@ static void cgroup_pressure_release(struct kernfs_open_file *of) { struct cgroup_file_ctx *ctx = of->priv; - psi_trigger_replace(&ctx->psi.trigger, NULL); + psi_trigger_destroy(ctx->psi.trigger); } bool cgroup_psi_enabled(void) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 1652f2bb54b7..1fe4b8d3ae98 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1151,7 +1151,6 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, t->event = 0; t->last_event_time = 0; init_waitqueue_head(&t->event_wait); - kref_init(&t->refcount); mutex_lock(&group->trigger_lock); @@ -1180,15 +1179,19 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group, return t; } -static void psi_trigger_destroy(struct kref *ref) +void psi_trigger_destroy(struct psi_trigger *t) { - struct psi_trigger *t = container_of(ref, struct psi_trigger, refcount); - struct psi_group *group = t->group; + struct psi_group *group; struct task_struct *task_to_destroy = NULL; - if (static_branch_likely(&psi_disabled)) + /* + * We do not check psi_disabled since it might have been disabled after + * the trigger got created. + */ + if (!t) return; + group = t->group; /* * Wakeup waiters to stop polling. Can happen if cgroup is deleted * from under a polling process. @@ -1224,9 +1227,9 @@ static void psi_trigger_destroy(struct kref *ref) mutex_unlock(&group->trigger_lock); /* - * Wait for both *trigger_ptr from psi_trigger_replace and - * poll_task RCUs to complete their read-side critical sections - * before destroying the trigger and optionally the poll_task + * Wait for psi_schedule_poll_work RCU to complete its read-side + * critical section before destroying the trigger and optionally the + * poll_task. */ synchronize_rcu(); /* @@ -1243,18 +1246,6 @@ static void psi_trigger_destroy(struct kref *ref) kfree(t); } -void psi_trigger_replace(void **trigger_ptr, struct psi_trigger *new) -{ - struct psi_trigger *old = *trigger_ptr; - - if (static_branch_likely(&psi_disabled)) - return; - - rcu_assign_pointer(*trigger_ptr, new); - if (old) - kref_put(&old->refcount, psi_trigger_destroy); -} - __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file, poll_table *wait) { @@ -1264,24 +1255,15 @@ __poll_t psi_trigger_poll(void **trigger_ptr, if (static_branch_likely(&psi_disabled)) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; - rcu_read_lock(); - - t = rcu_dereference(*(void __rcu __force **)trigger_ptr); - if (!t) { - rcu_read_unlock(); + t = READ_ONCE(*trigger_ptr); + if (!t) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; - } - kref_get(&t->refcount); - - rcu_read_unlock(); poll_wait(file, &t->event_wait, wait); if (cmpxchg(&t->event, 1, 0) == 1) ret |= EPOLLPRI; - kref_put(&t->refcount, psi_trigger_destroy); - return ret; } @@ -1305,14 +1287,24 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf, buf[buf_size - 1] = '\0'; - new = psi_trigger_create(&psi_system, buf, nbytes, res); - if (IS_ERR(new)) - return PTR_ERR(new); - seq = file->private_data; + /* Take seq->lock to protect seq->private from concurrent writes */ mutex_lock(&seq->lock); - psi_trigger_replace(&seq->private, new); + + /* Allow only one trigger per file descriptor */ + if (seq->private) { + mutex_unlock(&seq->lock); + return -EBUSY; + } + + new = psi_trigger_create(&psi_system, buf, nbytes, res); + if (IS_ERR(new)) { + mutex_unlock(&seq->lock); + return PTR_ERR(new); + } + + seq->private = new; mutex_unlock(&seq->lock); return nbytes; @@ -1347,7 +1339,7 @@ static int psi_fop_release(struct inode *inode, struct file *file) { struct seq_file *seq = file->private_data; - psi_trigger_replace(&seq->private, NULL); + psi_trigger_destroy(seq->private); return single_release(inode, file); }
With write operation on psi files replacing old trigger with a new one, the lifetime of its waitqueue is totally arbitrary. Overwriting an existing trigger causes its waitqueue to be freed and pending poll() will stumble on trigger->event_wait which was destroyed. Fix this by disallowing to redefine an existing psi trigger. If a write operation is used on a file descriptor with an already existing psi trigger, the operation will fail with EBUSY error. Also bypass a check for psi_disabled in the psi_trigger_destroy as the flag can be flipped after the trigger is created, leading to a memory leak. Fixes: 0e94682b73bf ("psi: introduce psi monitor") Cc: stable@vger.kernel.org Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com Analyzed-by: Eric Biggers <ebiggers@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- Changes in v2: - Added Fixes tag, per Eric Biggers - CC'ed stable@vger.kernel.org, per Eric Biggers - Removed unnecessary READ_ONCE/WRITE_ONCE, per Eric Biggers - Changed psi_trigger_destroy to accept psi_trigger pointer, per Eric Biggers Documentation/accounting/psi.rst | 3 +- include/linux/psi.h | 2 +- include/linux/psi_types.h | 3 -- kernel/cgroup/cgroup.c | 11 ++++-- kernel/sched/psi.c | 66 ++++++++++++++------------------ 5 files changed, 40 insertions(+), 45 deletions(-)