Message ID | 20200629171529.558003-1-daniel.thompson@linaro.org |
---|---|
State | New |
Headers | show |
Series | kgdb: Resolve races during kgdb_io_register/unregister_module | expand |
Hi, On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson <daniel.thompson@linaro.org> wrote: > > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks() > are called outside the scope of the kgdb_registration_lock. This > allows them to race with each other. This could do all sorts of crazy > things up to and including dbg_io_ops becoming NULL partway through the > execution of the kgdb trap handler (which isn't allowed and would be > fatal). > > Fix this by bringing the trap handler setup and teardown into the scope > of the registration lock. > > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> > --- > kernel/debug/debug_core.c | 8 +++++--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c > index 9e5934780f41..9799f2c6dc94 100644 > --- a/kernel/debug/debug_core.c > +++ b/kernel/debug/debug_core.c > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > > dbg_io_ops = new_dbg_io_ops; > > - spin_unlock(&kgdb_registration_lock); > - > if (old_dbg_io_ops) { > + spin_unlock(&kgdb_registration_lock); > old_dbg_io_ops->deinit(); > return 0; > } > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > /* Arm KGDB now. */ > kgdb_register_callbacks(); > > + spin_unlock(&kgdb_registration_lock); From looking at code paths, I think this is illegal, isn't it? You're now calling kgdb_register_callbacks() while holding a spinlock, but: kgdb_register_callbacks() -> register_console() --> console_lock() ---> might_sleep() ----> <boom!> I'm a little curious about the exact race we're trying to solve. Calling unregister on an IO module before register even finished seems like an error on the caller, so I guess it would be calling register from a 2nd thread for a different IO module while the first thread was partway through unregistering? Even that seems awfully sketchy since you're risking registering a 2nd IO ops while the first is still there and that's illegal enough that we do a pr_err() for it (though we don't crash), but let's say we're trying to solve that one. Looking at it closely, I _think_ the only race in this case is if the one we're trying to unregister had a deinit() function and we going to replace it? If it didn't have a deinit function: cpu1 (unregister) cpu2 (register): ----------------- ---------------------- kgdb_unregister_callbacks() spin_lock() <got> spin_lock() <blocked> if (old_dbg_io_ops) <true> if (has dinit) <false> print error spin_unlock() return -EBUSY <finish unregister> The above is fine and is the same thing that would happen if the whole register function ran before the unregister even started, right? Also: if the unregister won the race that should also be fine. So really the problem is this: cpu1 (unregister) cpu2 (register): ----------------- ---------------------- kgdb_unregister_callbacks() spin_lock() <got> spin_lock() <blocked> if (old_dbg_io_ops) <true> if (has dinit) <true> print Replacing init new IO ops spin_unlock() if (old_dbg_io_ops) <true> finish deinit of old return true WARN_ON() <hits and shouts!> dbg_io_ops = NULL spin_unlock() if (deinit) <true> double-call to deinit of old So in this case we'll hit a WARN_ON(), incorrectly unregister the new IO ops, and call deinit twice. -Doug
On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote: > Hi, > > On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson > <daniel.thompson@linaro.org> wrote: > > > > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks() > > are called outside the scope of the kgdb_registration_lock. This > > allows them to race with each other. This could do all sorts of crazy > > things up to and including dbg_io_ops becoming NULL partway through the > > execution of the kgdb trap handler (which isn't allowed and would be > > fatal). > > > > Fix this by bringing the trap handler setup and teardown into the scope > > of the registration lock. > > > > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> > > --- > > kernel/debug/debug_core.c | 8 +++++--- > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c > > index 9e5934780f41..9799f2c6dc94 100644 > > --- a/kernel/debug/debug_core.c > > +++ b/kernel/debug/debug_core.c > > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > > > > dbg_io_ops = new_dbg_io_ops; > > > > - spin_unlock(&kgdb_registration_lock); > > - > > if (old_dbg_io_ops) { > > + spin_unlock(&kgdb_registration_lock); > > old_dbg_io_ops->deinit(); > > return 0; > > } > > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > > /* Arm KGDB now. */ > > kgdb_register_callbacks(); > > > > + spin_unlock(&kgdb_registration_lock); > > From looking at code paths, I think this is illegal, isn't it? You're > now calling kgdb_register_callbacks() while holding a spinlock, but: > > kgdb_register_callbacks() > -> register_console() > --> console_lock() > ---> might_sleep() > ----> <boom!> Thanks. I very nearly didn't press "Send" yesterday because I was worried I was rushing it too much (in order to avoid forgetting it ;-) ). Should have listened to myself! > I'm a little curious about the exact race we're trying to solve. > Calling unregister on an IO module before register even finished seems > like an error on the caller, so I guess it would be calling register > from a 2nd thread for a different IO module while the first thread was > partway through unregistering? Even that seems awfully sketchy since > you're risking registering a 2nd IO ops while the first is still there > and that's illegal enough that we do a pr_err() for it (though we > don't crash), but let's say we're trying to solve that one. I didn't follow all the possible paths. Utlimately the (un)register_callbacks() functions use a flag variable without a lock and that can interact in lots of different ways. To be honest none are especially likely because the normal case is to register once during boot and never unregister. However we can trigger register/unregister from userspace so I think they can happen in parallel. Double unregister can lead to some especially nasty schedules... although they still remain pretty unlikely since we need the double unregister to coincide with a breakpoint: kgdb_unregister_callbacks() kgdb_unregister_callbacks() . . test flag . set flag to 0 . . test flag . spin_lock() *** kgdb trap *** . . paranoid dbg_io_ops check . . dbg_io_ops = NULL . stop other CPUs . try to use NULL dbg_io_ops I have drawn the kgdb trap in the first column because otherwise things get too wide but the trap could trigger on any CPU in the system and provoke the problem. > > Looking at it closely, I _think_ the only race in this case is if the > one we're trying to unregister had a deinit() function and we going to > replace it? If it didn't have a deinit function: > > cpu1 (unregister) cpu2 (register): > ----------------- ---------------------- > kgdb_unregister_callbacks() > spin_lock() <got> > spin_lock() <blocked> > if (old_dbg_io_ops) <true> > if (has dinit) <false> > print error > spin_unlock() > return -EBUSY > <finish unregister> > > The above is fine and is the same thing that would happen if the > whole register function ran before the unregister even started, right? > > Also: if the unregister won the race that should also be fine. > > So really the problem is this: > > cpu1 (unregister) cpu2 (register): > ----------------- ---------------------- > kgdb_unregister_callbacks() > spin_lock() <got> > spin_lock() <blocked> > if (old_dbg_io_ops) <true> > if (has dinit) <true> > print Replacing > init new IO ops > spin_unlock() > if (old_dbg_io_ops) <true> > finish deinit of old > return true > WARN_ON() <hits and shouts!> > dbg_io_ops = NULL > spin_unlock() > if (deinit) <true> > double-call to deinit of old > > So in this case we'll hit a WARN_ON(), incorrectly unregister the new > IO ops, and call deinit twice. To be honest I was simply working on "it is racy" and "there's not a good reason to allow that", especially as we start to develop tools to bring races to the surfaces someone will yell at us about it sooner or later ;-). Of course, implementing it correctly would have been better... Daniel.
Hi, On Tue, Jun 30, 2020 at 8:05 AM Daniel Thompson <daniel.thompson@linaro.org> wrote: > > On Mon, Jun 29, 2020 at 02:03:52PM -0700, Doug Anderson wrote: > > Hi, > > > > On Mon, Jun 29, 2020 at 10:15 AM Daniel Thompson > > <daniel.thompson@linaro.org> wrote: > > > > > > Currently kgdb_register_callbacks() and kgdb_unregister_callbacks() > > > are called outside the scope of the kgdb_registration_lock. This > > > allows them to race with each other. This could do all sorts of crazy > > > things up to and including dbg_io_ops becoming NULL partway through the > > > execution of the kgdb trap handler (which isn't allowed and would be > > > fatal). > > > > > > Fix this by bringing the trap handler setup and teardown into the scope > > > of the registration lock. > > > > > > Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> > > > --- > > > kernel/debug/debug_core.c | 8 +++++--- > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c > > > index 9e5934780f41..9799f2c6dc94 100644 > > > --- a/kernel/debug/debug_core.c > > > +++ b/kernel/debug/debug_core.c > > > @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > > > > > > dbg_io_ops = new_dbg_io_ops; > > > > > > - spin_unlock(&kgdb_registration_lock); > > > - > > > if (old_dbg_io_ops) { > > > + spin_unlock(&kgdb_registration_lock); > > > old_dbg_io_ops->deinit(); > > > return 0; > > > } > > > @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) > > > /* Arm KGDB now. */ > > > kgdb_register_callbacks(); > > > > > > + spin_unlock(&kgdb_registration_lock); > > > > From looking at code paths, I think this is illegal, isn't it? You're > > now calling kgdb_register_callbacks() while holding a spinlock, but: > > > > kgdb_register_callbacks() > > -> register_console() > > --> console_lock() > > ---> might_sleep() > > ----> <boom!> > > Thanks. > > I very nearly didn't press "Send" yesterday because I was worried I was > rushing it too much (in order to avoid forgetting it ;-) ). Should have > listened to myself! > > > > I'm a little curious about the exact race we're trying to solve. > > Calling unregister on an IO module before register even finished seems > > like an error on the caller, so I guess it would be calling register > > from a 2nd thread for a different IO module while the first thread was > > partway through unregistering? Even that seems awfully sketchy since > > you're risking registering a 2nd IO ops while the first is still there > > and that's illegal enough that we do a pr_err() for it (though we > > don't crash), but let's say we're trying to solve that one. > > I didn't follow all the possible paths. Utlimately the > (un)register_callbacks() functions use a flag variable without a lock > and that can interact in lots of different ways. > > To be honest none are especially likely because the normal case is to > register once during boot and never unregister. However we can trigger > register/unregister from userspace so I think they can happen > in parallel. This is for kgdboc or one of the other IO modules? I do know that, at least for kgdboc, we have the "config_mutex". I won't promise that there are no bugs there but in the very least it should mostly prevent a host of these types of issues. ...so I guess you'd have to in parallel be spamming a register of a non kgdboc IO module together with an unregister of kgdboc? > Double unregister can lead to some especially nasty schedules... > although they still remain pretty unlikely since we need the double > unregister to coincide with a breakpoint: > > > kgdb_unregister_callbacks() kgdb_unregister_callbacks() > . . > test flag . > set flag to 0 . > . test flag > . spin_lock() > *** kgdb trap *** . > . paranoid dbg_io_ops check . > . dbg_io_ops = NULL > . stop other CPUs > . try to use NULL dbg_io_ops > > > I have drawn the kgdb trap in the first column because otherwise things > get too wide but the trap could trigger on any CPU in the system and > provoke the problem. > > > > > > Looking at it closely, I _think_ the only race in this case is if the > > one we're trying to unregister had a deinit() function and we going to > > replace it? If it didn't have a deinit function: > > > > cpu1 (unregister) cpu2 (register): > > ----------------- ---------------------- > > kgdb_unregister_callbacks() > > spin_lock() <got> > > spin_lock() <blocked> > > if (old_dbg_io_ops) <true> > > if (has dinit) <false> > > print error > > spin_unlock() > > return -EBUSY > > <finish unregister> > > > > The above is fine and is the same thing that would happen if the > > whole register function ran before the unregister even started, right? > > > > Also: if the unregister won the race that should also be fine. > > > > So really the problem is this: > > > > cpu1 (unregister) cpu2 (register): > > ----------------- ---------------------- > > kgdb_unregister_callbacks() > > spin_lock() <got> > > spin_lock() <blocked> > > if (old_dbg_io_ops) <true> > > if (has dinit) <true> > > print Replacing > > init new IO ops > > spin_unlock() > > if (old_dbg_io_ops) <true> > > finish deinit of old > > return true > > WARN_ON() <hits and shouts!> > > dbg_io_ops = NULL > > spin_unlock() > > if (deinit) <true> > > double-call to deinit of old > > > > So in this case we'll hit a WARN_ON(), incorrectly unregister the new > > IO ops, and call deinit twice. > > To be honest I was simply working on "it is racy" and "there's not a > good reason to allow that", especially as we start to develop tools to > bring races to the surfaces someone will yell at us about it sooner or > later ;-). > > Of course, implementing it correctly would have been better... Yeah, still wouldn't hurt to try to figure out how to make it cleaner. :-) -Doug
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c index 9e5934780f41..9799f2c6dc94 100644 --- a/kernel/debug/debug_core.c +++ b/kernel/debug/debug_core.c @@ -1117,9 +1117,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) dbg_io_ops = new_dbg_io_ops; - spin_unlock(&kgdb_registration_lock); - if (old_dbg_io_ops) { + spin_unlock(&kgdb_registration_lock); old_dbg_io_ops->deinit(); return 0; } @@ -1129,6 +1128,8 @@ int kgdb_register_io_module(struct kgdb_io *new_dbg_io_ops) /* Arm KGDB now. */ kgdb_register_callbacks(); + spin_unlock(&kgdb_registration_lock); + if (kgdb_break_asap && (!dbg_is_early || IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))) kgdb_initial_breakpoint(); @@ -1147,13 +1148,14 @@ void kgdb_unregister_io_module(struct kgdb_io *old_dbg_io_ops) { BUG_ON(kgdb_connected); + spin_lock(&kgdb_registration_lock); + /* * KGDB is no longer able to communicate out, so * unregister our callbacks and reset state. */ kgdb_unregister_callbacks(); - spin_lock(&kgdb_registration_lock); WARN_ON_ONCE(dbg_io_ops != old_dbg_io_ops); dbg_io_ops = NULL;
Currently kgdb_register_callbacks() and kgdb_unregister_callbacks() are called outside the scope of the kgdb_registration_lock. This allows them to race with each other. This could do all sorts of crazy things up to and including dbg_io_ops becoming NULL partway through the execution of the kgdb trap handler (which isn't allowed and would be fatal). Fix this by bringing the trap handler setup and teardown into the scope of the registration lock. Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> --- kernel/debug/debug_core.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) base-commit: 9ebcfadb0610322ac537dd7aa5d9cbc2b2894c68 -- 2.25.4