Message ID | 48d66166-4d39-4fe2-3392-7e0c84b9bdb3@i-love.sakura.ne.jp |
---|---|
State | New |
Headers | show |
Series | [v3] Bluetooth: call lock_sock() outside of spinlock section | expand |
Hi Tetsuo, On Tue, Jul 13, 2021 at 4:28 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > syzbot is hitting might_sleep() warning at hci_sock_dev_event() due to > calling lock_sock() with rw spinlock held [1]. Among three possible > approaches [2], this patch chose holding a refcount via sock_hold() and > revalidating the element via sk_hashed(). > > Link: https://syzkaller.appspot.com/bug?extid=a5df189917e79d5e59c9 [1] > Link: https://lkml.kernel.org/r/05535d35-30d6-28b6-067e-272d01679d24@i-love.sakura.ne.jp [2] > Reported-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > Tested-by: syzbot <syzbot+a5df189917e79d5e59c9@syzkaller.appspotmail.com> > Fixes: e305509e678b3a4a ("Bluetooth: use correct lock to prevent UAF of hdev object") > --- > Changes in v3: > Don't use unlocked hci_pi(sk)->hdev != hdev test, for it is racy. > No need to defer hci_dev_put(hdev), for it can't be the last reference. > > Changes in v2: > Take hci_sk_list.lock for write in case bt_sock_unlink() is called after > sk_hashed(sk) test, and defer hci_dev_put(hdev) till schedulable context. How about we revert back to use bh_lock_sock_nested but use local_bh_disable like the following patch: https://patchwork.kernel.org/project/bluetooth/patch/20210713162838.693266-1-desmondcheongzx@gmail.com/ > net/bluetooth/hci_sock.c | 30 +++++++++++++++++++++++++++++- > 1 file changed, 29 insertions(+), 1 deletion(-) > > diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c > index b04a5a02ecf3..786a06a232fd 100644 > --- a/net/bluetooth/hci_sock.c > +++ b/net/bluetooth/hci_sock.c > @@ -760,10 +760,18 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > struct sock *sk; > > /* Detach sockets from device */ > +restart: > read_lock(&hci_sk_list.lock); > sk_for_each(sk, &hci_sk_list.head) { > + /* This sock_hold(sk) is safe, for bt_sock_unlink(sk) > + * is not called yet. > + */ > + sock_hold(sk); > + read_unlock(&hci_sk_list.lock); > lock_sock(sk); > - if (hci_pi(sk)->hdev == hdev) { > + write_lock(&hci_sk_list.lock); > + /* Check that bt_sock_unlink(sk) is not called yet. */ > + if (sk_hashed(sk) && hci_pi(sk)->hdev == hdev) { > hci_pi(sk)->hdev = NULL; > sk->sk_err = EPIPE; > sk->sk_state = BT_OPEN; > @@ -771,7 +779,27 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > > hci_dev_put(hdev); > } > + write_unlock(&hci_sk_list.lock); > release_sock(sk); > + read_lock(&hci_sk_list.lock); > + /* If bt_sock_unlink(sk) is not called yet, we can > + * continue iteration. We can use __sock_put(sk) here > + * because hci_sock_release() will call sock_put(sk) > + * after bt_sock_unlink(sk). > + */ > + if (sk_hashed(sk)) { > + __sock_put(sk); > + continue; > + } > + /* Otherwise, we need to restart iteration, for the > + * next socket pointed by sk->next might be already > + * gone. We can't use __sock_put(sk) here because > + * hci_sock_release() might have already called > + * sock_put(sk) after bt_sock_unlink(sk). > + */ > + read_unlock(&hci_sk_list.lock); > + sock_put(sk); > + goto restart; > } > read_unlock(&hci_sk_list.lock); > } > -- > 2.18.4 >
Hi there, I'm just exhilarated to see there have been some new ideas to fix this. > > How about we revert back to use bh_lock_sock_nested but use > local_bh_disable like the following patch: > > https://patchwork.kernel.org/project/bluetooth/patch/20210713162838.693266-1-desmondcheongzx@gmail.com/ > I have checked that patch and learn about some `local_bh_disable/enable` usage. To the best of my knowledge, the local_bh_disable() function can be used to disable the processing of bottom halves (softirqs). Or in another word, if process context function, hci_sock_sendmsg() for example, can mask the BH (hci_dev_do_close()?). It doesn't need to worry about the UAF. However, after doing some experiments, I failed :( For instance, I try to do following patch: --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -1720,6 +1720,7 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, return -EINVAL; lock_sock(sk); + local_bh_disable(); switch (hci_pi(sk)->channel) { case HCI_CHANNEL_RAW: @@ -1832,7 +1833,9 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, err = len; done: + local_bh_enable(); release_sock(sk); + return err; But the POC code shows error message like below: [ 18.169155] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:197 [ 18.170181] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 120, name: exp [ 18.170987] 1 lock held by exp/120: [ 18.171384] #0: ffff888011dd5120 (sk_lock-AF_BLUETOOTH-BTPROTO_HCI){+.+.}-{0:0}, at: hci_sock_sendmsg+0x11e/0x26c0 [ 18.172300] CPU: 0 PID: 120 Comm: exp Not tainted 5.11.11+ #44 [ 18.172921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 ... The patch provided by Desmond adds the local_bh_disable() before the bh_lock_sock() so I also try that in --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) /* Detach sockets from device */ read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { + local_bh_disable(); bh_lock_sock_nested(sk); if (hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) hci_dev_put(hdev); } bh_unlock_sock(sk); + local_bh_enable(); } read_unlock(&hci_sk_list.lock); } But this is not useful, the UAF still occurs [ 13.862117] ================================================================== [ 13.863064] BUG: KASAN: use-after-free in __lock_acquire+0xe5/0x2ca0 [ 13.863852] Read of size 8 at addr ffff888011d9aeb0 by task exp/119 [ 13.864620] [ 13.864818] CPU: 0 PID: 119 Comm: exp Not tainted 5.11.11+ #45 [ 13.865543] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 13.866634] Call Trace: [ 13.866947] dump_stack+0x183/0x22e [ 13.867389] ? show_regs_print_info+0x12/0x12 [ 13.867927] ? log_buf_vmcoreinfo_setup+0x45d/0x45d [ 13.868503] ? _raw_spin_lock_irqsave+0xbd/0x100 [ 13.869244] print_address_description+0x7b/0x3a0 [ 13.869828] __kasan_report+0x14e/0x200 [ 13.870288] ? __lock_acquire+0xe5/0x2ca0 [ 13.870768] kasan_report+0x47/0x60 [ 13.871189] __lock_acquire+0xe5/0x2ca0 [ 13.871647] ? lock_acquire+0x168/0x6a0 [ 13.872107] ? trace_lock_release+0x5c/0x120 [ 13.872615] ? do_user_addr_fault+0x9c2/0xdb0 [ 13.873135] ? trace_lock_acquire+0x150/0x150 [ 13.873661] ? rcu_read_lock_sched_held+0x87/0x110 [ 13.874232] ? perf_trace_rcu_barrier+0x360/0x360 [ 13.874790] ? avc_has_perm_noaudit+0x442/0x4c0 [ 13.875332] lock_acquire+0x168/0x6a0 [ 13.875772] ? skb_queue_tail+0x32/0x120 [ 13.876240] ? do_kern_addr_fault+0x230/0x230 [ 13.876756] ? read_lock_is_recursive+0x10/0x10 [ 13.877300] ? exc_page_fault+0xf3/0x1b0 [ 13.877770] ? cred_has_capability+0x191/0x3f0 [ 13.878290] ? cred_has_capability+0x2a1/0x3f0 [ 13.878816] ? rcu_lock_release+0x20/0x20 [ 13.879295] _raw_spin_lock_irqsave+0xb1/0x100 [ 13.879821] ? skb_queue_tail+0x32/0x120 [ 13.880287] ? _raw_spin_lock+0x40/0x40 [ 13.880745] skb_queue_tail+0x32/0x120 [ 13.881194] hci_sock_sendmsg+0x1545/0x26b0 From my point of view, adding the local_bh_disable() cannot prevent current hci_sock_dev_event() to set and decrease the ref-count. It's not quite similar with the cases that Desmond discussed. (Or maybe just I don't know how to use this). I recently tried to find some similar cases (and I did, reported to security already but get no reply) and figure out how others are fixed. Some guideline tells me that (http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch07lev1sec6.html) "If process context code and a bottom half share data, you need to disable bottom-half processing and obtain a lock before accessing the data. Doing both ensures local and SMP protection and prevents a deadlock." Assuming hci_sock_sendmsg()/hci_sock_bound_ioctl() are the process contexts while the hci_sock_dev_event(), not sure, is the BH context. The fact is that the hci_sock_dev_event() should wait for the process contexts. Hence, I think Tetsuo is on the right way. Regards Lock-Noob LinMa
On 15/7/21 11:03 am, LinMa wrote: > Hi there, > > I'm just exhilarated to see there have been some new ideas to fix this. > >> >> How about we revert back to use bh_lock_sock_nested but use >> local_bh_disable like the following patch: >> >> https://patchwork.kernel.org/project/bluetooth/patch/20210713162838.693266-1-desmondcheongzx@gmail.com/ >> > > I have checked that patch and learn about some `local_bh_disable/enable` usage. > To the best of my knowledge, the local_bh_disable() function can be used to disable the processing of bottom halves (softirqs). > Or in another word, if process context function, hci_sock_sendmsg() for example, can mask the BH (hci_dev_do_close()?). It doesn't need to worry about the UAF. > > However, after doing some experiments, I failed :( > For instance, I try to do following patch: > > --- a/net/bluetooth/hci_sock.c > +++ b/net/bluetooth/hci_sock.c > @@ -1720,6 +1720,7 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, > return -EINVAL; > > lock_sock(sk); > + local_bh_disable(); > > switch (hci_pi(sk)->channel) { > case HCI_CHANNEL_RAW: > @@ -1832,7 +1833,9 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, > err = len; > > done: > + local_bh_enable(); > release_sock(sk); > + > return err; > > But the POC code shows error message like below: > > [ 18.169155] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:197 > [ 18.170181] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 120, name: exp > [ 18.170987] 1 lock held by exp/120: > [ 18.171384] #0: ffff888011dd5120 (sk_lock-AF_BLUETOOTH-BTPROTO_HCI){+.+.}-{0:0}, at: hci_sock_sendmsg+0x11e/0x26c0 > [ 18.172300] CPU: 0 PID: 120 Comm: exp Not tainted 5.11.11+ #44 > [ 18.172921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 > ... Hi, Saw this and thought I'd offer my two cents. This is the original problem that Tetsuo's patch was trying to fix. Under the hood of lock_sock, we call lock_sock_nested which might sleep because of the mutex_acquire. But we shouldn't sleep while holding the rw spinlock. So we either have to acquire a spinlock instead of a mutex as was done before, or we need to move lock_sock out of the rw spinlock critical section as Tetsuo proposes. > > The patch provided by Desmond adds the local_bh_disable() before the bh_lock_sock() so I also try that in > > --- a/net/bluetooth/hci_sock.c > +++ b/net/bluetooth/hci_sock.c > @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > /* Detach sockets from device */ > read_lock(&hci_sk_list.lock); > sk_for_each(sk, &hci_sk_list.head) { > + local_bh_disable(); > bh_lock_sock_nested(sk); > if (hci_pi(sk)->hdev == hdev) { > hci_pi(sk)->hdev = NULL; > @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > hci_dev_put(hdev); > } > bh_unlock_sock(sk); > + local_bh_enable(); > } > read_unlock(&hci_sk_list.lock); > } > > But this is not useful, the UAF still occurs > I might be very mistaken on this, but I believe the UAF still happens because you can't really mix bh_lock_sock* and lock_sock* to protect the same things. The former holds the spinlock &sk->sk_lock.slock and synchronizes between user contexts and bottom halves, while the latter holds a mutex on &sk->sk_lock.dep_map to synchronize between multiple users. One option I can think of would be to switch instances of lock_sock to bh_lock_sock_nested for users that might race (such as hci_sock_sendmsg, hci_sock_bound_ioctl, and others as needed). But I'm not sure if that's quite what we want, plus we would need to ensure that sleeping functions aren't called between the bh_lock/unlock. Best wishes, Desmond > [ 13.862117] ================================================================== > [ 13.863064] BUG: KASAN: use-after-free in __lock_acquire+0xe5/0x2ca0 > [ 13.863852] Read of size 8 at addr ffff888011d9aeb0 by task exp/119 > [ 13.864620] > [ 13.864818] CPU: 0 PID: 119 Comm: exp Not tainted 5.11.11+ #45 > [ 13.865543] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 > [ 13.866634] Call Trace: > [ 13.866947] dump_stack+0x183/0x22e > [ 13.867389] ? show_regs_print_info+0x12/0x12 > [ 13.867927] ? log_buf_vmcoreinfo_setup+0x45d/0x45d > [ 13.868503] ? _raw_spin_lock_irqsave+0xbd/0x100 > [ 13.869244] print_address_description+0x7b/0x3a0 > [ 13.869828] __kasan_report+0x14e/0x200 > [ 13.870288] ? __lock_acquire+0xe5/0x2ca0 > [ 13.870768] kasan_report+0x47/0x60 > [ 13.871189] __lock_acquire+0xe5/0x2ca0 > [ 13.871647] ? lock_acquire+0x168/0x6a0 > [ 13.872107] ? trace_lock_release+0x5c/0x120 > [ 13.872615] ? do_user_addr_fault+0x9c2/0xdb0 > [ 13.873135] ? trace_lock_acquire+0x150/0x150 > [ 13.873661] ? rcu_read_lock_sched_held+0x87/0x110 > [ 13.874232] ? perf_trace_rcu_barrier+0x360/0x360 > [ 13.874790] ? avc_has_perm_noaudit+0x442/0x4c0 > [ 13.875332] lock_acquire+0x168/0x6a0 > [ 13.875772] ? skb_queue_tail+0x32/0x120 > [ 13.876240] ? do_kern_addr_fault+0x230/0x230 > [ 13.876756] ? read_lock_is_recursive+0x10/0x10 > [ 13.877300] ? exc_page_fault+0xf3/0x1b0 > [ 13.877770] ? cred_has_capability+0x191/0x3f0 > [ 13.878290] ? cred_has_capability+0x2a1/0x3f0 > [ 13.878816] ? rcu_lock_release+0x20/0x20 > [ 13.879295] _raw_spin_lock_irqsave+0xb1/0x100 > [ 13.879821] ? skb_queue_tail+0x32/0x120 > [ 13.880287] ? _raw_spin_lock+0x40/0x40 > [ 13.880745] skb_queue_tail+0x32/0x120 > [ 13.881194] hci_sock_sendmsg+0x1545/0x26b0 > > From my point of view, adding the local_bh_disable() cannot prevent current hci_sock_dev_event() to set and decrease the ref-count. It's not quite similar with the cases that Desmond discussed. > (Or maybe just I don't know how to use this). > > I recently tried to find some similar cases (and I did, reported to security already but get no reply) and figure out how others are fixed. > Some guideline tells me that (http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch07lev1sec6.html) > > "If process context code and a bottom half share data, you need to disable bottom-half processing and obtain a lock before accessing the data. Doing both ensures local and SMP protection and prevents a deadlock." > > Assuming hci_sock_sendmsg()/hci_sock_bound_ioctl() are the process contexts while the hci_sock_dev_event(), not sure, is the BH context. The fact is that the hci_sock_dev_event() should wait for the process contexts. Hence, I think Tetsuo is on the right way. > > Regards > Lock-Noob LinMa > > >
On 16/7/21 11:47 am, Desmond Cheong Zhi Xi wrote: > On 15/7/21 11:03 am, LinMa wrote: >> Hi there, >> >> I'm just exhilarated to see there have been some new ideas to fix this. >> >>> >>> How about we revert back to use bh_lock_sock_nested but use >>> local_bh_disable like the following patch: >>> >>> https://patchwork.kernel.org/project/bluetooth/patch/20210713162838.693266-1-desmondcheongzx@gmail.com/ >>> >>> >> >> I have checked that patch and learn about some >> `local_bh_disable/enable` usage. >> To the best of my knowledge, the local_bh_disable() function can be >> used to disable the processing of bottom halves (softirqs). >> Or in another word, if process context function, hci_sock_sendmsg() >> for example, can mask the BH (hci_dev_do_close()?). It doesn't need to >> worry about the UAF. >> >> However, after doing some experiments, I failed :( >> For instance, I try to do following patch: >> >> --- a/net/bluetooth/hci_sock.c >> +++ b/net/bluetooth/hci_sock.c >> @@ -1720,6 +1720,7 @@ static int hci_sock_sendmsg(struct socket *sock, >> struct msghdr *msg, >> return -EINVAL; >> >> lock_sock(sk); >> + local_bh_disable(); >> >> switch (hci_pi(sk)->channel) { >> case HCI_CHANNEL_RAW: >> @@ -1832,7 +1833,9 @@ static int hci_sock_sendmsg(struct socket *sock, >> struct msghdr *msg, >> err = len; >> >> done: >> + local_bh_enable(); >> release_sock(sk); >> + >> return err; >> >> But the POC code shows error message like below: >> >> [ 18.169155] BUG: sleeping function called from invalid context at >> include/linux/sched/mm.h:197 >> [ 18.170181] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: >> 120, name: exp >> [ 18.170987] 1 lock held by exp/120: >> [ 18.171384] #0: ffff888011dd5120 >> (sk_lock-AF_BLUETOOTH-BTPROTO_HCI){+.+.}-{0:0}, at: >> hci_sock_sendmsg+0x11e/0x26c0 >> [ 18.172300] CPU: 0 PID: 120 Comm: exp Not tainted 5.11.11+ #44 >> [ 18.172921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> BIOS 1.10.2-1ubuntu1 04/01/2014 >> ... > > Hi, > > Saw this and thought I'd offer my two cents. > BUG: sleeping function called from invalid context > This is the original problem that Tetsuo's patch was trying to fix. > Under the hood of lock_sock, we call lock_sock_nested which might sleep > because of the mutex_acquire. But we shouldn't sleep while holding the > rw spinlock. So we either have to acquire a spinlock instead of a mutex > as was done before, or we need to move lock_sock out of the rw spinlock > critical section as Tetsuo proposes. > My bad, was thinking more about the problem and noticed your poc was for hci_sock_sendmsg, not hci_sock_dev_event. In this case, it's not clear to me why the atomic context is being violated. Sorry for the noise. >> >> The patch provided by Desmond adds the local_bh_disable() before the >> bh_lock_sock() so I also try that in >> >> --- a/net/bluetooth/hci_sock.c >> +++ b/net/bluetooth/hci_sock.c >> @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int >> event) >> /* Detach sockets from device */ >> read_lock(&hci_sk_list.lock); >> sk_for_each(sk, &hci_sk_list.head) { >> + local_bh_disable(); >> bh_lock_sock_nested(sk); >> if (hci_pi(sk)->hdev == hdev) { >> hci_pi(sk)->hdev = NULL; >> @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int >> event) >> hci_dev_put(hdev); >> } >> bh_unlock_sock(sk); >> + local_bh_enable(); >> } >> read_unlock(&hci_sk_list.lock); >> } >> >> But this is not useful, the UAF still occurs >> > > I might be very mistaken on this, but I believe the UAF still happens > because you can't really mix bh_lock_sock* and lock_sock* to protect the > same things. The former holds the spinlock &sk->sk_lock.slock and > synchronizes between user contexts and bottom halves, while the latter > holds a mutex on &sk->sk_lock.dep_map to synchronize between multiple > users. > > One option I can think of would be to switch instances of lock_sock to > bh_lock_sock_nested for users that might race (such as hci_sock_sendmsg, > hci_sock_bound_ioctl, and others as needed). But I'm not sure if that's > quite what we want, plus we would need to ensure that sleeping functions > aren't called between the bh_lock/unlock. > > Best wishes, > Desmond > >> [ 13.862117] >> ================================================================== >> [ 13.863064] BUG: KASAN: use-after-free in __lock_acquire+0xe5/0x2ca0 >> [ 13.863852] Read of size 8 at addr ffff888011d9aeb0 by task exp/119 >> [ 13.864620] >> [ 13.864818] CPU: 0 PID: 119 Comm: exp Not tainted 5.11.11+ #45 >> [ 13.865543] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), >> BIOS 1.10.2-1ubuntu1 04/01/2014 >> [ 13.866634] Call Trace: >> [ 13.866947] dump_stack+0x183/0x22e >> [ 13.867389] ? show_regs_print_info+0x12/0x12 >> [ 13.867927] ? log_buf_vmcoreinfo_setup+0x45d/0x45d >> [ 13.868503] ? _raw_spin_lock_irqsave+0xbd/0x100 >> [ 13.869244] print_address_description+0x7b/0x3a0 >> [ 13.869828] __kasan_report+0x14e/0x200 >> [ 13.870288] ? __lock_acquire+0xe5/0x2ca0 >> [ 13.870768] kasan_report+0x47/0x60 >> [ 13.871189] __lock_acquire+0xe5/0x2ca0 >> [ 13.871647] ? lock_acquire+0x168/0x6a0 >> [ 13.872107] ? trace_lock_release+0x5c/0x120 >> [ 13.872615] ? do_user_addr_fault+0x9c2/0xdb0 >> [ 13.873135] ? trace_lock_acquire+0x150/0x150 >> [ 13.873661] ? rcu_read_lock_sched_held+0x87/0x110 >> [ 13.874232] ? perf_trace_rcu_barrier+0x360/0x360 >> [ 13.874790] ? avc_has_perm_noaudit+0x442/0x4c0 >> [ 13.875332] lock_acquire+0x168/0x6a0 >> [ 13.875772] ? skb_queue_tail+0x32/0x120 >> [ 13.876240] ? do_kern_addr_fault+0x230/0x230 >> [ 13.876756] ? read_lock_is_recursive+0x10/0x10 >> [ 13.877300] ? exc_page_fault+0xf3/0x1b0 >> [ 13.877770] ? cred_has_capability+0x191/0x3f0 >> [ 13.878290] ? cred_has_capability+0x2a1/0x3f0 >> [ 13.878816] ? rcu_lock_release+0x20/0x20 >> [ 13.879295] _raw_spin_lock_irqsave+0xb1/0x100 >> [ 13.879821] ? skb_queue_tail+0x32/0x120 >> [ 13.880287] ? _raw_spin_lock+0x40/0x40 >> [ 13.880745] skb_queue_tail+0x32/0x120 >> [ 13.881194] hci_sock_sendmsg+0x1545/0x26b0 >> >> From my point of view, adding the local_bh_disable() cannot prevent >> current hci_sock_dev_event() to set and decrease the ref-count. It's >> not quite similar with the cases that Desmond discussed. >> (Or maybe just I don't know how to use this). >> > I recently tried to find some similar cases (and I did, reported to > security already but get no reply) and figure out how others are fixed. >> Some guideline tells me that >> (http://books.gigatux.nl/mirror/kerneldevelopment/0672327201/ch07lev1sec6.html) >> >> >> "If process context code and a bottom half share data, you need to >> disable bottom-half processing and obtain a lock before accessing the >> data. Doing both ensures local and SMP protection and prevents a >> deadlock." >> >> Assuming hci_sock_sendmsg()/hci_sock_bound_ioctl() are the process >> contexts while the hci_sock_dev_event(), not sure, is the BH context. >> The fact is that the hci_sock_dev_event() should wait for the process >> contexts. Hence, I think Tetsuo is on the right way. >> >> Regards >> Lock-Noob LinMa >> >> >> >
On 2021/07/16 13:11, Desmond Cheong Zhi Xi wrote: > On 16/7/21 11:47 am, Desmond Cheong Zhi Xi wrote: >> Saw this and thought I'd offer my two cents. >> BUG: sleeping function called from invalid context >> This is the original problem that Tetsuo's patch was trying to fix. Yes. >> Under the hood of lock_sock, we call lock_sock_nested which might sleep >> because of the mutex_acquire. Both lock_sock() and lock_sock_nested() might sleep. >> But we shouldn't sleep while holding the rw spinlock. Right. In atomic context (e.g. inside interrupt handler, schedulable context with interrupts or preemption disabled, schedulable context inside RCU read critical section, schedulable context inside a spinlock section), we must not call functions (e.g. waiting for a mutex, waiting for a semaphore, waiting for a page fault) which are not atomic. >> So we either have to acquire a spinlock instead of a mutex as was done before, Regarding hci_sock_dev_event(HCI_DEV_UNREG) case, we can't use a spinlock. Like LinMa explained, lock_sock() has to be used in order to serialize functions (e.g. hci_sock_sendmsg()) which access hci_pi(sk)->hdev between lock_sock(sk) and release_sock(sk). And like I explained, we can't defer resetting hci_pi(sk)->hdev to NULL, for hci_sock_dev_event(HCI_DEV_UNREG) is responsible for resetting hci_pi(sk)->hdev to NULL because the caller of hci_sock_dev_event(HCI_DEV_UNREG) immediately destroys resources associated with this hdev. >> or we need to move lock_sock out of the rw spinlock critical section as Tetsuo proposes. Exactly. Since this is a regression introduced when fixing CVE-2021-3573, Linux distributors are waiting for this patch so that they can apply the fix for CVE-2021-3573. This patch should be sent to linux.git and stables as soon as possible. But due to little attention on this patch, I'm already testing this patch in linux-next.git via my tree. I'll drop when Bluetooth maintainers pick this patch up for linux-5.14-rcX. (Or should I directly send to Linus?) >> > > My bad, was thinking more about the problem and noticed your poc was for hci_sock_sendmsg, > not hci_sock_dev_event. I didn't catch this part. Are you talking about a different poc? As far as I'm aware, exp.c in POC.zip was for hci_sock_bound_ioctl(HCIUNBLOCKADDR). hci_sock_bound_ioctl(HCIUNBLOCKADDR) (which is called between lock_sock() and release_sock()) calls copy_from_user() which might cause page fault, and userfaultfd mechanism allows an attacker to slowdown page fault handling enough to hci_sock_dev_event(HCI_DEV_UNREG) to return without waiting for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). This race window results in UAF (doesn't it, LinMa?). > In this case, it's not clear to me why the atomic context is being violated. In atomic context (in hci_sock_dev_event(HCI_DEV_UNREG) case, between read_lock(&hci_sk_list.lock) and read_unlock(&hci_sk_list.lock)), we must not call lock_sock(sk) which might wait for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). > > Sorry for the noise. > >>> >>> The patch provided by Desmond adds the local_bh_disable() before the bh_lock_sock() so I also try that in >>> >>> --- a/net/bluetooth/hci_sock.c >>> +++ b/net/bluetooth/hci_sock.c >>> @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) >>> /* Detach sockets from device */ >>> read_lock(&hci_sk_list.lock); >>> sk_for_each(sk, &hci_sk_list.head) { >>> + local_bh_disable(); >>> bh_lock_sock_nested(sk); >>> if (hci_pi(sk)->hdev == hdev) { >>> hci_pi(sk)->hdev = NULL; >>> @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) >>> hci_dev_put(hdev); >>> } >>> bh_unlock_sock(sk); >>> + local_bh_enable(); >>> } >>> read_unlock(&hci_sk_list.lock); >>> } >>> >>> But this is not useful, the UAF still occurs >>> >> >> I might be very mistaken on this, but I believe the UAF still happens because >> you can't really mix bh_lock_sock* and lock_sock* to protect the same things. Right. https://www.kernel.org/doc/html/v5.13/kernel-hacking/locking.html >> The former holds the spinlock &sk->sk_lock.slock and synchronizes between >> user contexts and bottom halves, serializes access to resources which might be accessed from atomic (i.e. non-schedulable) contexts >> while the latter holds a mutex on &sk->sk_lock.dep_map to synchronize between >> multiple users. serializes access to resources which are accessed from only schedulable (i.e. non-atomic) contexts >> >> One option I can think of would be to switch instances of lock_sock to bh_lock_sock_nested >> for users that might race (such as hci_sock_sendmsg, hci_sock_bound_ioctl, and others as >> needed). But I'm not sure if that's quite what we want, plus we would need to ensure that >> sleeping functions aren't called between the bh_lock/unlock. We can't do it for hci_sock_dev_event(HCI_DEV_UNREG).
Hello everyone, Sorry, it's my fault to cause the misunderstanding. As I keep mentioning "hci_sock_sendmsg()" instead of "hci_sock_bound_ioctl()". In fact, both these two functions are able to cause the race. > >> > > > > My bad, was thinking more about the problem and noticed your poc was for hci_sock_sendmsg, > > not hci_sock_dev_event. > > I didn't catch this part. Are you talking about a different poc? > As far as I'm aware, exp.c in POC.zip was for hci_sock_bound_ioctl(HCIUNBLOCKADDR). > > hci_sock_bound_ioctl(HCIUNBLOCKADDR) (which is called between lock_sock() and release_sock()) > calls copy_from_user() which might cause page fault, and userfaultfd mechanism allows an attacker > to slowdown page fault handling enough to hci_sock_dev_event(HCI_DEV_UNREG) to return without > waiting for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). This race window > results in UAF (doesn't it, LinMa?). > Your understanding above is quite right. In addition, because the hci_sock_sendmsg() calls the memcpy_from_msg(...), which also in fact fetch data from userspace memory, the userfaultfd can take effect as well. (When writing the exploit for this CVE, the hci_sock_sendmsg() is much useful... so I recently keep mentioning this function) Regards Lin Ma
Hi Tetsuo, Just find out another interesting function: sock_owned_by_user(). (I am just a noob of kernel locks) Hence I think the following patch has the same 'effect' as the old patch e305509e678b3 ("Bluetooth: use correct lock to prevent UAF of hdev object") diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index b04a5a02ecf3..0cc4b88daa96 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -762,7 +762,11 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) /* Detach sockets from device */ read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { - lock_sock(sk); + bh_lock_sock_nested(sk); +busywait: + if (sock_owned_by_user(sk)) + goto busywait; + if (hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; sk->sk_err = EPIPE; @@ -771,7 +775,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) hci_dev_put(hdev); } - release_sock(sk); + bh_unlock_sock(sk); } read_unlock(&hci_sk_list.lock); } The sad thing is that it seems will cost CPU resource to do meaningless wait... What do you think? Can this sock_owned_by_user() function do any help? Thanks Lin Ma > From: "Tetsuo Handa" <penguin-kernel@i-love.sakura.ne.jp> > Sent Time: 2021-07-16 22:48:13 (Friday) > To: "Desmond Cheong Zhi Xi" <desmondcheongzx@gmail.com>, LinMa <linma@zju.edu.cn>, "Luiz Augusto von Dentz" <luiz.dentz@gmail.com>, "Johan Hedberg" <johan.hedberg@gmail.com>, "Marcel Holtmann" <marcel@holtmann.org> > Cc: "linux-bluetooth@vger.kernel.org" <linux-bluetooth@vger.kernel.org>, "David S. Miller" <davem@davemloft.net>, "Jakub Kicinski" <kuba@kernel.org>, "open list:NETWORKING [GENERAL]" <netdev@vger.kernel.org> > Subject: Re: [PATCH v3] Bluetooth: call lock_sock() outside of spinlock section > > On 2021/07/16 13:11, Desmond Cheong Zhi Xi wrote: > > On 16/7/21 11:47 am, Desmond Cheong Zhi Xi wrote: > >> Saw this and thought I'd offer my two cents. > >> BUG: sleeping function called from invalid context > >> This is the original problem that Tetsuo's patch was trying to fix. > > Yes. > > >> Under the hood of lock_sock, we call lock_sock_nested which might sleep > >> because of the mutex_acquire. > > Both lock_sock() and lock_sock_nested() might sleep. > > >> But we shouldn't sleep while holding the rw spinlock. > > Right. In atomic context (e.g. inside interrupt handler, schedulable context > with interrupts or preemption disabled, schedulable context inside RCU read > critical section, schedulable context inside a spinlock section), we must not > call functions (e.g. waiting for a mutex, waiting for a semaphore, waiting for > a page fault) which are not atomic. > > >> So we either have to acquire a spinlock instead of a mutex as was done before, > > Regarding hci_sock_dev_event(HCI_DEV_UNREG) case, we can't use a spinlock. > > Like LinMa explained, lock_sock() has to be used in order to serialize functions > (e.g. hci_sock_sendmsg()) which access hci_pi(sk)->hdev between lock_sock(sk) and > release_sock(sk). And like I explained, we can't defer resetting hci_pi(sk)->hdev > to NULL, for hci_sock_dev_event(HCI_DEV_UNREG) is responsible for resetting > hci_pi(sk)->hdev to NULL because the caller of hci_sock_dev_event(HCI_DEV_UNREG) > immediately destroys resources associated with this hdev. > > >> or we need to move lock_sock out of the rw spinlock critical section as Tetsuo proposes. > > Exactly. Since this is a regression introduced when fixing CVE-2021-3573, Linux > distributors are waiting for this patch so that they can apply the fix for CVE-2021-3573. > This patch should be sent to linux.git and stables as soon as possible. But due to little > attention on this patch, I'm already testing this patch in linux-next.git via my tree. > I'll drop when Bluetooth maintainers pick this patch up for linux-5.14-rcX. (Or should I > directly send to Linus?) > > >> > > > > My bad, was thinking more about the problem and noticed your poc was for hci_sock_sendmsg, > > not hci_sock_dev_event. > > I didn't catch this part. Are you talking about a different poc? > As far as I'm aware, exp.c in POC.zip was for hci_sock_bound_ioctl(HCIUNBLOCKADDR). > > hci_sock_bound_ioctl(HCIUNBLOCKADDR) (which is called between lock_sock() and release_sock()) > calls copy_from_user() which might cause page fault, and userfaultfd mechanism allows an attacker > to slowdown page fault handling enough to hci_sock_dev_event(HCI_DEV_UNREG) to return without > waiting for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). This race window > results in UAF (doesn't it, LinMa?). > > > In this case, it's not clear to me why the atomic context is being violated. > > In atomic context (in hci_sock_dev_event(HCI_DEV_UNREG) case, between > read_lock(&hci_sk_list.lock) and read_unlock(&hci_sk_list.lock)), we must not call > lock_sock(sk) which might wait for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). > > > > > Sorry for the noise. > > > >>> > >>> The patch provided by Desmond adds the local_bh_disable() before the bh_lock_sock() so I also try that in > >>> > >>> --- a/net/bluetooth/hci_sock.c > >>> +++ b/net/bluetooth/hci_sock.c > >>> @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > >>> /* Detach sockets from device */ > >>> read_lock(&hci_sk_list.lock); > >>> sk_for_each(sk, &hci_sk_list.head) { > >>> + local_bh_disable(); > >>> bh_lock_sock_nested(sk); > >>> if (hci_pi(sk)->hdev == hdev) { > >>> hci_pi(sk)->hdev = NULL; > >>> @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > >>> hci_dev_put(hdev); > >>> } > >>> bh_unlock_sock(sk); > >>> + local_bh_enable(); > >>> } > >>> read_unlock(&hci_sk_list.lock); > >>> } > >>> > >>> But this is not useful, the UAF still occurs > >>> > >> > >> I might be very mistaken on this, but I believe the UAF still happens because > >> you can't really mix bh_lock_sock* and lock_sock* to protect the same things. > > Right. https://www.kernel.org/doc/html/v5.13/kernel-hacking/locking.html > > >> The former holds the spinlock &sk->sk_lock.slock and synchronizes between > >> user contexts and bottom halves, > > serializes access to resources which might be accessed from atomic (i.e. non-schedulable) contexts > > >> while the latter holds a mutex on &sk->sk_lock.dep_map to synchronize between > >> multiple users. > > serializes access to resources which are accessed from only schedulable (i.e. non-atomic) contexts > > >> > >> One option I can think of would be to switch instances of lock_sock to bh_lock_sock_nested > >> for users that might race (such as hci_sock_sendmsg, hci_sock_bound_ioctl, and others as > >> needed). But I'm not sure if that's quite what we want, plus we would need to ensure that > >> sleeping functions aren't called between the bh_lock/unlock. > > We can't do it for hci_sock_dev_event(HCI_DEV_UNREG).
On 2021/07/22 13:47, LinMa wrote: > Hi Tetsuo, > > Just find out another interesting function: sock_owned_by_user(). (I am just a noob of kernel locks) > > Hence I think the following patch has the same 'effect' as the old patch e305509e678b3 ("Bluetooth: use correct lock to prevent UAF of hdev object") > > diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c > index b04a5a02ecf3..0cc4b88daa96 100644 > --- a/net/bluetooth/hci_sock.c > +++ b/net/bluetooth/hci_sock.c > @@ -762,7 +762,11 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > /* Detach sockets from device */ > read_lock(&hci_sk_list.lock); > sk_for_each(sk, &hci_sk_list.head) { > - lock_sock(sk); > + bh_lock_sock_nested(sk); > +busywait: > + if (sock_owned_by_user(sk)) > + goto busywait; > + > if (hci_pi(sk)->hdev == hdev) { > hci_pi(sk)->hdev = NULL; > sk->sk_err = EPIPE; > @@ -771,7 +775,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) > > hci_dev_put(hdev); > } > - release_sock(sk); > + bh_unlock_sock(sk); > } > read_unlock(&hci_sk_list.lock); > } > > The sad thing is that it seems will cost CPU resource to do meaningless wait... > > What do you think? Can this sock_owned_by_user() function do any help? I don't think it helps. One of problems we are seeing (and my patch will fix) is a race window that this sk_for_each() loop needs to wait using lock_sock() because "hci_pi(sk)->hdev = hdev;" is called by hci_sock_bind() under lock_sock(). Doing hci_pi(sk)->hdev == hdev comparison without lock_sock() will lead to failing to "/* Detach sockets from device */".
On 2021/07/17 0:26, LinMa wrote: > Hello everyone, > > Sorry, it's my fault to cause the misunderstanding. > > As I keep mentioning "hci_sock_sendmsg()" instead of "hci_sock_bound_ioctl()". In fact, > both these two functions are able to cause the race. I sent two patches for avoiding page fault with kernel lock held. [PATCH v2] Bluetooth: reorganize ioctls from hci_sock_bound_ioctl() https://lkml.kernel.org/r/39b677ce-dcbf-6393-6279-88ed3a9e570e@i-love.sakura.ne.jp [PATCH] Bluetooth: reorganize functions from hci_sock_sendmsg() https://lkml.kernel.org/r/20210722074208.8040-1-penguin-kernel@I-love.SAKURA.ne.jp These two patches will eliminate user-controlled delay at lock_sock() which [PATCH v3] Bluetooth: call lock_sock() outside of spinlock section https://lkml.kernel.org/r/48d66166-4d39-4fe2-3392-7e0c84b9bdb3@i-love.sakura.ne.jp would suffer. Are you aware of more locations which trigger page fault with sock lock held? If none, we can send these three patches together. If we are absolutely sure that there is no more locations, we could try choice (1) or (2) at https://lkml.kernel.org/r/05535d35-30d6-28b6-067e-272d01679d24@i-love.sakura.ne.jp .
diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index b04a5a02ecf3..786a06a232fd 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -760,10 +760,18 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) struct sock *sk; /* Detach sockets from device */ +restart: read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { + /* This sock_hold(sk) is safe, for bt_sock_unlink(sk) + * is not called yet. + */ + sock_hold(sk); + read_unlock(&hci_sk_list.lock); lock_sock(sk); - if (hci_pi(sk)->hdev == hdev) { + write_lock(&hci_sk_list.lock); + /* Check that bt_sock_unlink(sk) is not called yet. */ + if (sk_hashed(sk) && hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; sk->sk_err = EPIPE; sk->sk_state = BT_OPEN; @@ -771,7 +779,27 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) hci_dev_put(hdev); } + write_unlock(&hci_sk_list.lock); release_sock(sk); + read_lock(&hci_sk_list.lock); + /* If bt_sock_unlink(sk) is not called yet, we can + * continue iteration. We can use __sock_put(sk) here + * because hci_sock_release() will call sock_put(sk) + * after bt_sock_unlink(sk). + */ + if (sk_hashed(sk)) { + __sock_put(sk); + continue; + } + /* Otherwise, we need to restart iteration, for the + * next socket pointed by sk->next might be already + * gone. We can't use __sock_put(sk) here because + * hci_sock_release() might have already called + * sock_put(sk) after bt_sock_unlink(sk). + */ + read_unlock(&hci_sk_list.lock); + sock_put(sk); + goto restart; } read_unlock(&hci_sk_list.lock); }