Message ID | 20210517092006.803332-1-omosnace@redhat.com |
---|---|
State | New |
Headers | show |
Series | [v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks | expand |
Ondrej Mosnacek <omosnace@redhat.com> writes: > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > lockdown") added an implementation of the locked_down LSM hook to > SELinux, with the aim to restrict which domains are allowed to perform > operations that would breach lockdown. > > However, in several places the security_locked_down() hook is called in > situations where the current task isn't doing any action that would > directly breach lockdown, leading to SELinux checks that are basically > bogus. > > Since in most of these situations converting the callers such that > security_locked_down() is called in a context where the current task > would be meaningful for SELinux is impossible or very non-trivial (and > could lead to TOCTOU issues for the classic Lockdown LSM > implementation), fix this by modifying the hook to accept a struct cred > pointer as argument, where NULL will be interpreted as a request for a > "global", task-independent lockdown decision only. Then modify SELinux > to ignore calls with cred == NULL. > > Since most callers will just want to pass current_cred() as the cred > parameter, rename the hook to security_cred_locked_down() and provide > the original security_locked_down() function as a simple wrapper around > the new hook. > > The callers migrated to the new hook, passing NULL as cred: > 1. arch/powerpc/xmon/xmon.c > Here the hook seems to be called from non-task context and is only > used for redacting some sensitive values from output sent to > userspace. It's hard to follow but it actually disables interactive use of xmon entirely if lockdown is in confidentiality mode, and disables modifications of the kernel in integrity mode. But that's not really that important, the patch looks fine. Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) cheers
On Mon, May 17, 2021 at 1:00 PM Michael Ellerman <mpe@ellerman.id.au> wrote: > Ondrej Mosnacek <omosnace@redhat.com> writes: > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > > lockdown") added an implementation of the locked_down LSM hook to > > SELinux, with the aim to restrict which domains are allowed to perform > > operations that would breach lockdown. > > > > However, in several places the security_locked_down() hook is called in > > situations where the current task isn't doing any action that would > > directly breach lockdown, leading to SELinux checks that are basically > > bogus. > > > > Since in most of these situations converting the callers such that > > security_locked_down() is called in a context where the current task > > would be meaningful for SELinux is impossible or very non-trivial (and > > could lead to TOCTOU issues for the classic Lockdown LSM > > implementation), fix this by modifying the hook to accept a struct cred > > pointer as argument, where NULL will be interpreted as a request for a > > "global", task-independent lockdown decision only. Then modify SELinux > > to ignore calls with cred == NULL. > > > > Since most callers will just want to pass current_cred() as the cred > > parameter, rename the hook to security_cred_locked_down() and provide > > the original security_locked_down() function as a simple wrapper around > > the new hook. > > > > The callers migrated to the new hook, passing NULL as cred: > > 1. arch/powerpc/xmon/xmon.c > > Here the hook seems to be called from non-task context and is only > > used for redacting some sensitive values from output sent to > > userspace. > > It's hard to follow but it actually disables interactive use of xmon > entirely if lockdown is in confidentiality mode, and disables > modifications of the kernel in integrity mode. > > But that's not really that important, the patch looks fine. > > Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Thanks, Michael! James/Paul, is there anything blocking this patch from being merged? Especially the BPF case is causing real trouble for people and the only workaround is to broadly allow lockdown::confidentiality in the policy. -- Ondrej Mosnacek Software Engineer, Linux Security - SELinux kernel Red Hat, Inc.
On Wed, 26 May 2021, Ondrej Mosnacek wrote: > Thanks, Michael! > > James/Paul, is there anything blocking this patch from being merged? > Especially the BPF case is causing real trouble for people and the > only workaround is to broadly allow lockdown::confidentiality in the > policy. It would be good to see more signoffs/reviews, especially from Paul, but he is busy with the io_uring stuff. Let's see if anyone else can look at this in the next couple of days. -- James Morris <jmorris@namei.org>
On Thu, May 27, 2021 at 12:33 AM James Morris <jmorris@namei.org> wrote: > On Wed, 26 May 2021, Ondrej Mosnacek wrote: > > > Thanks, Michael! > > > > James/Paul, is there anything blocking this patch from being merged? > > Especially the BPF case is causing real trouble for people and the > > only workaround is to broadly allow lockdown::confidentiality in the > > policy. > > It would be good to see more signoffs/reviews, especially from Paul, but > he is busy with the io_uring stuff. Yes, it's been a busy week with various things going on around here. I looked at the v1 posting but haven't had a chance yet to look at v2; I promise to get to it today, but it might not happen until later tonight. > Let's see if anyone else can look at this in the next couple of days. -- paul moore www.paul-moore.com
On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > lockdown") added an implementation of the locked_down LSM hook to > SELinux, with the aim to restrict which domains are allowed to perform > operations that would breach lockdown. > > However, in several places the security_locked_down() hook is called in > situations where the current task isn't doing any action that would > directly breach lockdown, leading to SELinux checks that are basically > bogus. > > Since in most of these situations converting the callers such that > security_locked_down() is called in a context where the current task > would be meaningful for SELinux is impossible or very non-trivial (and > could lead to TOCTOU issues for the classic Lockdown LSM > implementation), fix this by modifying the hook to accept a struct cred > pointer as argument, where NULL will be interpreted as a request for a > "global", task-independent lockdown decision only. Then modify SELinux > to ignore calls with cred == NULL. I'm not overly excited about skipping the access check when cred is NULL. Based on the description and the little bit that I've dug into thus far it looks like using SECINITSID_KERNEL as the subject would be much more appropriate. *Something* (the kernel in most of the relevant cases it looks like) is requesting that a potentially sensitive disclosure be made, and ignoring it seems like the wrong thing to do. Leaving the access control intact also provides a nice avenue to audit these requests should users want to do that. Those users that generally don't care can grant kernel_t all the necessary permissions without much policy. > Since most callers will just want to pass current_cred() as the cred > parameter, rename the hook to security_cred_locked_down() and provide > the original security_locked_down() function as a simple wrapper around > the new hook. I know you and Casey went back and forth on this in v1, but I agree with Casey that having two LSM hooks here is a mistake. I know it makes backports hard, but spoiler alert: maintaining complex software over any non-trivial period of time is hard, reeeeally hard sometimes ;) > The callers migrated to the new hook, passing NULL as cred: > 1. arch/powerpc/xmon/xmon.c > Here the hook seems to be called from non-task context and is only > used for redacting some sensitive values from output sent to > userspace. This definitely sounds like kernel_t based on the description above. > 2. fs/tracefs/inode.c:tracefs_create_file() > Here the call is used to prevent creating new tracefs entries when > the kernel is locked down. Assumes that locking down is one-way - > i.e. if the hook returns non-zero once, it will never return zero > again, thus no point in creating these files. More kernel_t. > 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() > Called when a BPF program calls a helper that could leak kernel > memory. The task context is not relevant here, since the program > may very well be run in the context of a different task than the > consumer of the data. > See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 The access control check isn't so much who is consuming the data, but who is requesting a potential violation of a "lockdown", yes? For example, the SELinux policy rule for the current lockdown check looks something like this: allow <who> <who> : lockdown { <reason> }; It seems to me that the task context is relevant here and performing the access control check based on the task's domain is correct. If we are also concerned about who has access to this sensitive information once it has been determined that the task can cause it to be sent, we should have another check point for that, assuming the access isn't already covered by another check/hook. > 4. net/xfrm/xfrm_user.c:copy_to_user_*() > Here a cryptographic secret is redacted based on the value returned > from the hook. There are two possible actions that may lead here: > a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the > task context is relevant, since the dumped data is sent back to > the current task. If the task context is relevant we should use it. > b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are > broadcasted to tasks subscribed to XFRM events - here the > SELinux check is not meningful as the current task's creds do > not represent the tasks that could potentially see the secret. This looks very similar to the BPF hook discussed above, I believe my comments above apply here as well. > It really doesn't seem worth it to try to preserve the check in the > a) case ... After you've read all of the above I hope you can understand why I disagree with this. > ... since the eventual leak can be circumvented anyway via b) I don't follow the statement above ... ? However I'm not sure it matters much considering my other concerns. > plus there is no way for the task to indicate that it doesn't care > about the actual key value, so the check could generate a lot of > noise. > > Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com> > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> > --- > > v2: > - change to a single hook based on suggestions by Casey Schaufler > > v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/ > > arch/powerpc/xmon/xmon.c | 4 ++-- > fs/tracefs/inode.c | 2 +- > include/linux/lsm_hook_defs.h | 3 ++- > include/linux/lsm_hooks.h | 3 ++- > include/linux/security.h | 11 ++++++++--- > kernel/trace/bpf_trace.c | 4 ++-- > net/xfrm/xfrm_user.c | 2 +- > security/lockdown/lockdown.c | 5 +++-- > security/security.c | 6 +++--- > security/selinux/hooks.c | 12 +++++++++--- > 10 files changed, 33 insertions(+), 19 deletions(-) -- paul moore www.paul-moore.com
On 5/28/21 3:37 AM, Paul Moore wrote: > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: >> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux >> lockdown") added an implementation of the locked_down LSM hook to >> SELinux, with the aim to restrict which domains are allowed to perform >> operations that would breach lockdown. >> >> However, in several places the security_locked_down() hook is called in >> situations where the current task isn't doing any action that would >> directly breach lockdown, leading to SELinux checks that are basically >> bogus. >> >> Since in most of these situations converting the callers such that >> security_locked_down() is called in a context where the current task >> would be meaningful for SELinux is impossible or very non-trivial (and >> could lead to TOCTOU issues for the classic Lockdown LSM >> implementation), fix this by modifying the hook to accept a struct cred >> pointer as argument, where NULL will be interpreted as a request for a >> "global", task-independent lockdown decision only. Then modify SELinux >> to ignore calls with cred == NULL. > > I'm not overly excited about skipping the access check when cred is > NULL. Based on the description and the little bit that I've dug into > thus far it looks like using SECINITSID_KERNEL as the subject would be > much more appropriate. *Something* (the kernel in most of the > relevant cases it looks like) is requesting that a potentially > sensitive disclosure be made, and ignoring it seems like the wrong > thing to do. Leaving the access control intact also provides a nice > avenue to audit these requests should users want to do that. I think the rationale/workaround for ignoring calls with cred == NULL (or the previous patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his seen tracing cases: i) The audit events that are triggered due to calls to security_locked_down() can OOM kill a machine, see below details [0]. ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() when presumingly trying to wake up kauditd [1]. How would your suggestion above solve both i) and ii)? [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585 : I starting seeing this with F-34. When I run a container that is traced with eBPF to record the syscalls it is doing, auditd is flooded with messages like: type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0 This seems to be leading to auditd running out of space in the backlog buffer and eventually OOMs the machine. auditd running at 99% CPU presumably processing all the messages, eventually I get: Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/ : Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on ppc64le. Example stack trace from [1]: [ 730.868702] stack backtrace: [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [ 730.873278] Call Trace: [ 730.873770] dump_stack+0x7f/0xa1 [ 730.874433] check_noncircular+0xdf/0x100 [ 730.875232] __lock_acquire+0x1202/0x1e10 [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 [ 730.876844] lock_acquire+0xc2/0x3a0 [ 730.877551] ? __wake_up_common_lock+0x52/0x90 [ 730.878434] ? lock_acquire+0xc2/0x3a0 [ 730.879186] ? lock_is_held_type+0xa7/0x120 [ 730.880044] ? skb_queue_tail+0x1b/0x50 [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 [ 730.881656] ? __wake_up_common_lock+0x52/0x90 [ 730.882532] __wake_up_common_lock+0x52/0x90 [ 730.883375] audit_log_end+0x5b/0x100 [ 730.884104] slow_avc_audit+0x69/0x90 [ 730.884836] avc_has_perm+0x8b/0xb0 [ 730.885532] selinux_lockdown+0xa5/0xd0 [ 730.886297] security_locked_down+0x20/0x40 [ 730.887133] bpf_probe_read_compat+0x66/0xd0 [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 [ 730.888917] trace_call_bpf+0xe9/0x240 [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 [ 730.890579] perf_trace_sched_switch+0x142/0x180 [ 730.891485] ? __schedule+0x6d8/0xb20 [ 730.892209] __schedule+0x6d8/0xb20 [ 730.892899] schedule+0x5b/0xc0 [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae >> Since most callers will just want to pass current_cred() as the cred >> parameter, rename the hook to security_cred_locked_down() and provide >> the original security_locked_down() function as a simple wrapper around >> the new hook. [...] > >> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() >> Called when a BPF program calls a helper that could leak kernel >> memory. The task context is not relevant here, since the program >> may very well be run in the context of a different task than the >> consumer of the data. >> See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 > > The access control check isn't so much who is consuming the data, but > who is requesting a potential violation of a "lockdown", yes? For > example, the SELinux policy rule for the current lockdown check looks > something like this: > > allow <who> <who> : lockdown { <reason> }; > > It seems to me that the task context is relevant here and performing > the access control check based on the task's domain is correct. This doesn't make much sense to me, it's /not/ the task 'requesting a potential violation of a "lockdown"', but rather the running tracing program which is e.g. inspecting kernel data structures around the triggered event. If I understood you correctly, having an 'allow' check on, say, httpd would be rather odd since things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead. Meaning, if we would /not/ trace such events (like in the prior mentioned syscall example), then there is also no call to the security_locked_down() from that same/ unmodified application. Thanks, Daniel
On Fri, May 28, 2021 at 09:09:57AM +0200, Daniel Borkmann wrote: > On 5/28/21 3:37 AM, Paul Moore wrote: > > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > > > > > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > > > lockdown") added an implementation of the locked_down LSM hook to > > > SELinux, with the aim to restrict which domains are allowed to perform > > > operations that would breach lockdown. > > > > > > However, in several places the security_locked_down() hook is called in > > > situations where the current task isn't doing any action that would > > > directly breach lockdown, leading to SELinux checks that are basically > > > bogus. > > > > > > Since in most of these situations converting the callers such that > > > security_locked_down() is called in a context where the current task > > > would be meaningful for SELinux is impossible or very non-trivial (and > > > could lead to TOCTOU issues for the classic Lockdown LSM > > > implementation), fix this by modifying the hook to accept a struct cred > > > pointer as argument, where NULL will be interpreted as a request for a > > > "global", task-independent lockdown decision only. Then modify SELinux > > > to ignore calls with cred == NULL. > > > > I'm not overly excited about skipping the access check when cred is > > NULL. Based on the description and the little bit that I've dug into > > thus far it looks like using SECINITSID_KERNEL as the subject would be > > much more appropriate. *Something* (the kernel in most of the > > relevant cases it looks like) is requesting that a potentially > > sensitive disclosure be made, and ignoring it seems like the wrong > > thing to do. Leaving the access control intact also provides a nice > > avenue to audit these requests should users want to do that. > > I think the rationale/workaround for ignoring calls with cred == NULL (or the previous > patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his > seen tracing cases: > > i) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > when presumingly trying to wake up kauditd [1]. hi, I saw the same deadlock, ended up with this sequence: rq_lock(rq) -> trace_sched_switch -> bpf_prog -> selinux_lockdown -> audit_log_end -> wake_up_interruptible -> try_to_wake_up -> rq_lock(rq) problem is that trace_sched_switch already holds rq_lock I had powerpc server where I could reproduce this easily, but now for some reason I can't hit the issue anymore jirka > > How would your suggestion above solve both i) and ii)? > > [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585 : > > I starting seeing this with F-34. When I run a container that is traced with eBPF > to record the syscalls it is doing, auditd is flooded with messages like: > > type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for > pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" > scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0 > > This seems to be leading to auditd running out of space in the backlog buffer and > eventually OOMs the machine. > > auditd running at 99% CPU presumably processing all the messages, eventually I get: > Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded > Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 > Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 > Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 > Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 > > [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/ : > > Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() > call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 > testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat,runqslower > tools from bcc on ppc64le. Example stack trace from [1]: > > [ 730.868702] stack backtrace: > [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 > [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 > [ 730.873278] Call Trace: > [ 730.873770] dump_stack+0x7f/0xa1 > [ 730.874433] check_noncircular+0xdf/0x100 > [ 730.875232] __lock_acquire+0x1202/0x1e10 > [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 > [ 730.876844] lock_acquire+0xc2/0x3a0 > [ 730.877551] ? __wake_up_common_lock+0x52/0x90 > [ 730.878434] ? lock_acquire+0xc2/0x3a0 > [ 730.879186] ? lock_is_held_type+0xa7/0x120 > [ 730.880044] ? skb_queue_tail+0x1b/0x50 > [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 > [ 730.881656] ? __wake_up_common_lock+0x52/0x90 > [ 730.882532] __wake_up_common_lock+0x52/0x90 > [ 730.883375] audit_log_end+0x5b/0x100 > [ 730.884104] slow_avc_audit+0x69/0x90 > [ 730.884836] avc_has_perm+0x8b/0xb0 > [ 730.885532] selinux_lockdown+0xa5/0xd0 > [ 730.886297] security_locked_down+0x20/0x40 > [ 730.887133] bpf_probe_read_compat+0x66/0xd0 > [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 > [ 730.888917] trace_call_bpf+0xe9/0x240 > [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 > [ 730.890579] perf_trace_sched_switch+0x142/0x180 > [ 730.891485] ? __schedule+0x6d8/0xb20 > [ 730.892209] __schedule+0x6d8/0xb20 > [ 730.892899] schedule+0x5b/0xc0 > [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 > [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 > [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > Since most callers will just want to pass current_cred() as the cred > > > parameter, rename the hook to security_cred_locked_down() and provide > > > the original security_locked_down() function as a simple wrapper around > > > the new hook. > [...] > > > > > 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() > > > Called when a BPF program calls a helper that could leak kernel > > > memory. The task context is not relevant here, since the program > > > may very well be run in the context of a different task than the > > > consumer of the data. > > > See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 > > > > The access control check isn't so much who is consuming the data, but > > who is requesting a potential violation of a "lockdown", yes? For > > example, the SELinux policy rule for the current lockdown check looks > > something like this: > > > > allow <who> <who> : lockdown { <reason> }; > > > > It seems to me that the task context is relevant here and performing > > the access control check based on the task's domain is correct. > This doesn't make much sense to me, it's /not/ the task 'requesting a potential > violation of a "lockdown"', but rather the running tracing program which is e.g. > inspecting kernel data structures around the triggered event. If I understood > you correctly, having an 'allow' check on, say, httpd would be rather odd since > things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead. > > Meaning, if we would /not/ trace such events (like in the prior mentioned syscall > example), then there is also no call to the security_locked_down() from that same/ > unmodified application. > > Thanks, > Daniel >
On 5/28/21 9:09 AM, Daniel Borkmann wrote: > On 5/28/21 3:37 AM, Paul Moore wrote: >> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: >>> >>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux >>> lockdown") added an implementation of the locked_down LSM hook to >>> SELinux, with the aim to restrict which domains are allowed to perform >>> operations that would breach lockdown. >>> >>> However, in several places the security_locked_down() hook is called in >>> situations where the current task isn't doing any action that would >>> directly breach lockdown, leading to SELinux checks that are basically >>> bogus. >>> >>> Since in most of these situations converting the callers such that >>> security_locked_down() is called in a context where the current task >>> would be meaningful for SELinux is impossible or very non-trivial (and >>> could lead to TOCTOU issues for the classic Lockdown LSM >>> implementation), fix this by modifying the hook to accept a struct cred >>> pointer as argument, where NULL will be interpreted as a request for a >>> "global", task-independent lockdown decision only. Then modify SELinux >>> to ignore calls with cred == NULL. >> >> I'm not overly excited about skipping the access check when cred is >> NULL. Based on the description and the little bit that I've dug into >> thus far it looks like using SECINITSID_KERNEL as the subject would be >> much more appropriate. *Something* (the kernel in most of the >> relevant cases it looks like) is requesting that a potentially >> sensitive disclosure be made, and ignoring it seems like the wrong >> thing to do. Leaving the access control intact also provides a nice >> avenue to audit these requests should users want to do that. > > I think the rationale/workaround for ignoring calls with cred == NULL (or the previous > patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his > seen tracing cases: > > i) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > when presumingly trying to wake up kauditd [1]. Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked at the rest but it's also kind of independent), the attached fix should address both reported issues, please take a look & test. Thanks a lot, Daniel From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001 From: Daniel Borkmann <daniel@iogearbox.net> Date: Fri, 28 May 2021 09:16:31 +0000 Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") added an implementation of the locked_down LSM hook to SELinux, with the aim to restrict which domains are allowed to perform operations that would breach lockdown. This is indirectly also getting audit subsystem involved to report events. The latter is problematic, as reported by Ondrej and Serhei, since it can bring down the whole system via audit: i) The audit events that are triggered due to calls to security_locked_down() can OOM kill a machine, see below details [0]. ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() when presumingly trying to wake up kauditd [1]. Fix both at the same time by taking a completely different approach, that is, move the check into the program verification phase where we actually retrieve the func proto. This also reliably gets the task (current) that is trying to install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since we're moving this out of the BPF helpers which can be called millions of times per second. [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says: I starting seeing this with F-34. When I run a container that is traced with BPF to record the syscalls it is doing, auditd is flooded with messages like: type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0 This seems to be leading to auditd running out of space in the backlog buffer and eventually OOMs the machine. [...] auditd running at 99% CPU presumably processing all the messages, eventually I get: Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [...] [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/, Serhei Makarov says: Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on ppc64le. Example stack trace: [...] [ 730.868702] stack backtrace: [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [ 730.873278] Call Trace: [ 730.873770] dump_stack+0x7f/0xa1 [ 730.874433] check_noncircular+0xdf/0x100 [ 730.875232] __lock_acquire+0x1202/0x1e10 [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 [ 730.876844] lock_acquire+0xc2/0x3a0 [ 730.877551] ? __wake_up_common_lock+0x52/0x90 [ 730.878434] ? lock_acquire+0xc2/0x3a0 [ 730.879186] ? lock_is_held_type+0xa7/0x120 [ 730.880044] ? skb_queue_tail+0x1b/0x50 [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 [ 730.881656] ? __wake_up_common_lock+0x52/0x90 [ 730.882532] __wake_up_common_lock+0x52/0x90 [ 730.883375] audit_log_end+0x5b/0x100 [ 730.884104] slow_avc_audit+0x69/0x90 [ 730.884836] avc_has_perm+0x8b/0xb0 [ 730.885532] selinux_lockdown+0xa5/0xd0 [ 730.886297] security_locked_down+0x20/0x40 [ 730.887133] bpf_probe_read_compat+0x66/0xd0 [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 [ 730.888917] trace_call_bpf+0xe9/0x240 [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 [ 730.890579] perf_trace_sched_switch+0x142/0x180 [ 730.891485] ? __schedule+0x6d8/0xb20 [ 730.892209] __schedule+0x6d8/0xb20 [ 730.892899] schedule+0x5b/0xc0 [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") Reported-by: Ondrej Mosnacek <omosnace@redhat.com> Reported-by: Jakub Hrozek <jhrozek@redhat.com> Reported-by: Serhei Makarov <smakarov@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Frank Eigler <fche@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul Moore <paul@paul-moore.com> --- kernel/bpf/helpers.c | 6 ++++-- kernel/trace/bpf_trace.c | 36 +++++++++++++----------------------- 2 files changed, 17 insertions(+), 25 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 73443498d88f..6f6e090c5310 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1069,11 +1069,13 @@ bpf_base_func_proto(enum bpf_func_id func_id) case BPF_FUNC_probe_read_user: return &bpf_probe_read_user_proto; case BPF_FUNC_probe_read_kernel: - return &bpf_probe_read_kernel_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_proto; case BPF_FUNC_probe_read_user_str: return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: - return &bpf_probe_read_kernel_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_str_proto; case BPF_FUNC_snprintf_btf: return &bpf_snprintf_btf_proto; case BPF_FUNC_snprintf: diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d2d7cf6cfe83..3df43d89d642 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -215,16 +215,10 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = { static __always_inline int bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); + int ret = copy_from_kernel_nofault(dst, unsafe_ptr, size); if (unlikely(ret < 0)) - goto fail; - ret = copy_from_kernel_nofault(dst, unsafe_ptr, size); - if (unlikely(ret < 0)) - goto fail; - return ret; -fail: - memset(dst, 0, size); + memset(dst, 0, size); return ret; } @@ -246,11 +240,6 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = { static __always_inline int bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); - - if (unlikely(ret < 0)) - goto fail; - /* * The strncpy_from_kernel_nofault() call will likely not fill the * entire buffer, but that's okay in this circumstance as we're probing @@ -260,13 +249,10 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) * code altogether don't copy garbage; otherwise length of string * is returned that can be used for bpf_perf_event_output() et al. */ - ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size); - if (unlikely(ret < 0)) - goto fail; + int ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size); - return ret; -fail: - memset(dst, 0, size); + if (unlikely(ret < 0)) + memset(dst, 0, size); return ret; } @@ -1011,16 +997,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_probe_read_user: return &bpf_probe_read_user_proto; case BPF_FUNC_probe_read_kernel: - return &bpf_probe_read_kernel_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_proto; case BPF_FUNC_probe_read_user_str: return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: - return &bpf_probe_read_kernel_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_str_proto; #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE case BPF_FUNC_probe_read: - return &bpf_probe_read_compat_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_compat_proto; case BPF_FUNC_probe_read_str: - return &bpf_probe_read_compat_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_compat_str_proto; #endif #ifdef CONFIG_CGROUPS case BPF_FUNC_get_current_cgroup_id: -- 2.27.0
On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote: SNIP > > Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked > at the rest but it's also kind of independent), the attached fix should address both > reported issues, please take a look & test. > > Thanks a lot, > Daniel > From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001 > From: Daniel Borkmann <daniel@iogearbox.net> > Date: Fri, 28 May 2021 09:16:31 +0000 > Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > added an implementation of the locked_down LSM hook to SELinux, with the aim > to restrict which domains are allowed to perform operations that would breach > lockdown. This is indirectly also getting audit subsystem involved to report > events. The latter is problematic, as reported by Ondrej and Serhei, since it > can bring down the whole system via audit: > > i) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > when presumingly trying to wake up kauditd [1]. > > Fix both at the same time by taking a completely different approach, that is, > move the check into the program verification phase where we actually retrieve > the func proto. This also reliably gets the task (current) that is trying to > install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also > fixes the OOM since we're moving this out of the BPF helpers which can be called > millions of times per second. nice idea.. I'll try to reproduce and test jirka
On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote: SNIP > Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked > at the rest but it's also kind of independent), the attached fix should address both > reported issues, please take a look & test. > > Thanks a lot, > Daniel > From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001 > From: Daniel Borkmann <daniel@iogearbox.net> > Date: Fri, 28 May 2021 09:16:31 +0000 > Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > added an implementation of the locked_down LSM hook to SELinux, with the aim > to restrict which domains are allowed to perform operations that would breach > lockdown. This is indirectly also getting audit subsystem involved to report > events. The latter is problematic, as reported by Ondrej and Serhei, since it > can bring down the whole system via audit: > > i) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > when presumingly trying to wake up kauditd [1]. > > Fix both at the same time by taking a completely different approach, that is, > move the check into the program verification phase where we actually retrieve > the func proto. This also reliably gets the task (current) that is trying to > install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also > fixes the OOM since we're moving this out of the BPF helpers which can be called > millions of times per second. > > [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says: > > I starting seeing this with F-34. When I run a container that is traced with > BPF to record the syscalls it is doing, auditd is flooded with messages like: > > type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } > for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" > scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 > tclass=lockdown permissive=0 > > This seems to be leading to auditd running out of space in the backlog buffer > and eventually OOMs the machine. > > [...] > auditd running at 99% CPU presumably processing all the messages, eventually I get: > Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded > Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 > Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 > Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 > Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 > Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 > [...] > > [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/, > Serhei Makarov says: > > Upstream kernel 5.11.0-rc7 and later was found to deadlock during a > bpf_probe_read_compat() call within a sched_switch tracepoint. The problem > is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend > testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on > ppc64le. Example stack trace: > > [...] > [ 730.868702] stack backtrace: > [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 > [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 > [ 730.873278] Call Trace: > [ 730.873770] dump_stack+0x7f/0xa1 > [ 730.874433] check_noncircular+0xdf/0x100 > [ 730.875232] __lock_acquire+0x1202/0x1e10 > [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 > [ 730.876844] lock_acquire+0xc2/0x3a0 > [ 730.877551] ? __wake_up_common_lock+0x52/0x90 > [ 730.878434] ? lock_acquire+0xc2/0x3a0 > [ 730.879186] ? lock_is_held_type+0xa7/0x120 > [ 730.880044] ? skb_queue_tail+0x1b/0x50 > [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 > [ 730.881656] ? __wake_up_common_lock+0x52/0x90 > [ 730.882532] __wake_up_common_lock+0x52/0x90 > [ 730.883375] audit_log_end+0x5b/0x100 > [ 730.884104] slow_avc_audit+0x69/0x90 > [ 730.884836] avc_has_perm+0x8b/0xb0 > [ 730.885532] selinux_lockdown+0xa5/0xd0 > [ 730.886297] security_locked_down+0x20/0x40 > [ 730.887133] bpf_probe_read_compat+0x66/0xd0 > [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 > [ 730.888917] trace_call_bpf+0xe9/0x240 > [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 > [ 730.890579] perf_trace_sched_switch+0x142/0x180 > [ 730.891485] ? __schedule+0x6d8/0xb20 > [ 730.892209] __schedule+0x6d8/0xb20 > [ 730.892899] schedule+0x5b/0xc0 > [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 > [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 > [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae > [...] > > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > Reported-by: Ondrej Mosnacek <omosnace@redhat.com> > Reported-by: Jakub Hrozek <jhrozek@redhat.com> > Reported-by: Serhei Makarov <smakarov@redhat.com> > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> > Cc: Stephen Smalley <sds@tycho.nsa.gov> > Cc: Jerome Marchand <jmarchan@redhat.com> > Cc: Frank Eigler <fche@redhat.com> > Cc: Jiri Olsa <jolsa@redhat.com> > Cc: Paul Moore <paul@paul-moore.com> found the original server and reproduced.. this patch fixes it for me Tested-by: Jiri Olsa <jolsa@redhat.com> thanks, jirka > --- > kernel/bpf/helpers.c | 6 ++++-- > kernel/trace/bpf_trace.c | 36 +++++++++++++----------------------- > 2 files changed, 17 insertions(+), 25 deletions(-) > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 73443498d88f..6f6e090c5310 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -1069,11 +1069,13 @@ bpf_base_func_proto(enum bpf_func_id func_id) > case BPF_FUNC_probe_read_user: > return &bpf_probe_read_user_proto; > case BPF_FUNC_probe_read_kernel: > - return &bpf_probe_read_kernel_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_kernel_proto; > case BPF_FUNC_probe_read_user_str: > return &bpf_probe_read_user_str_proto; > case BPF_FUNC_probe_read_kernel_str: > - return &bpf_probe_read_kernel_str_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_kernel_str_proto; > case BPF_FUNC_snprintf_btf: > return &bpf_snprintf_btf_proto; > case BPF_FUNC_snprintf: > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index d2d7cf6cfe83..3df43d89d642 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -215,16 +215,10 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = { > static __always_inline int > bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr) > { > - int ret = security_locked_down(LOCKDOWN_BPF_READ); > + int ret = copy_from_kernel_nofault(dst, unsafe_ptr, size); > > if (unlikely(ret < 0)) > - goto fail; > - ret = copy_from_kernel_nofault(dst, unsafe_ptr, size); > - if (unlikely(ret < 0)) > - goto fail; > - return ret; > -fail: > - memset(dst, 0, size); > + memset(dst, 0, size); > return ret; > } > > @@ -246,11 +240,6 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = { > static __always_inline int > bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) > { > - int ret = security_locked_down(LOCKDOWN_BPF_READ); > - > - if (unlikely(ret < 0)) > - goto fail; > - > /* > * The strncpy_from_kernel_nofault() call will likely not fill the > * entire buffer, but that's okay in this circumstance as we're probing > @@ -260,13 +249,10 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) > * code altogether don't copy garbage; otherwise length of string > * is returned that can be used for bpf_perf_event_output() et al. > */ > - ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size); > - if (unlikely(ret < 0)) > - goto fail; > + int ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size); > > - return ret; > -fail: > - memset(dst, 0, size); > + if (unlikely(ret < 0)) > + memset(dst, 0, size); > return ret; > } > > @@ -1011,16 +997,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > case BPF_FUNC_probe_read_user: > return &bpf_probe_read_user_proto; > case BPF_FUNC_probe_read_kernel: > - return &bpf_probe_read_kernel_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_kernel_proto; > case BPF_FUNC_probe_read_user_str: > return &bpf_probe_read_user_str_proto; > case BPF_FUNC_probe_read_kernel_str: > - return &bpf_probe_read_kernel_str_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_kernel_str_proto; > #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE > case BPF_FUNC_probe_read: > - return &bpf_probe_read_compat_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_compat_proto; > case BPF_FUNC_probe_read_str: > - return &bpf_probe_read_compat_str_proto; > + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? > + NULL : &bpf_probe_read_compat_str_proto; > #endif > #ifdef CONFIG_CGROUPS > case BPF_FUNC_get_current_cgroup_id: > -- > 2.27.0 >
On 5/28/21 1:47 PM, Jiri Olsa wrote: > On Fri, May 28, 2021 at 11:56:02AM +0200, Daniel Borkmann wrote: > >> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked >> at the rest but it's also kind of independent), the attached fix should address both >> reported issues, please take a look & test. >> >> Thanks a lot, >> Daniel > >> From 5893ad528dc0a0a68933b8f2a81b18d3f539660d Mon Sep 17 00:00:00 2001 >> From: Daniel Borkmann <daniel@iogearbox.net> >> Date: Fri, 28 May 2021 09:16:31 +0000 >> Subject: [PATCH bpf] bpf, audit, lockdown: Fix bogus SELinux lockdown permission checks >> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") >> added an implementation of the locked_down LSM hook to SELinux, with the aim >> to restrict which domains are allowed to perform operations that would breach >> lockdown. This is indirectly also getting audit subsystem involved to report >> events. The latter is problematic, as reported by Ondrej and Serhei, since it >> can bring down the whole system via audit: >> >> i) The audit events that are triggered due to calls to security_locked_down() >> can OOM kill a machine, see below details [0]. >> >> ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() >> when presumingly trying to wake up kauditd [1]. >> >> Fix both at the same time by taking a completely different approach, that is, >> move the check into the program verification phase where we actually retrieve >> the func proto. This also reliably gets the task (current) that is trying to >> install the tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also >> fixes the OOM since we're moving this out of the BPF helpers which can be called >> millions of times per second. >> >> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says: >> >> I starting seeing this with F-34. When I run a container that is traced with >> BPF to record the syscalls it is doing, auditd is flooded with messages like: >> >> type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } >> for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" >> scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 >> tclass=lockdown permissive=0 >> >> This seems to be leading to auditd running out of space in the backlog buffer >> and eventually OOMs the machine. >> >> [...] >> auditd running at 99% CPU presumably processing all the messages, eventually I get: >> Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded >> Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded >> Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 >> Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 >> Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 >> Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 >> Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 >> Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 >> Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 >> [...] >> >> [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/, >> Serhei Makarov says: >> >> Upstream kernel 5.11.0-rc7 and later was found to deadlock during a >> bpf_probe_read_compat() call within a sched_switch tracepoint. The problem >> is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend >> testsuite on x86_64 as well as the runqlat,runqslower tools from bcc on >> ppc64le. Example stack trace: >> >> [...] >> [ 730.868702] stack backtrace: >> [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 >> [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 >> [ 730.873278] Call Trace: >> [ 730.873770] dump_stack+0x7f/0xa1 >> [ 730.874433] check_noncircular+0xdf/0x100 >> [ 730.875232] __lock_acquire+0x1202/0x1e10 >> [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 >> [ 730.876844] lock_acquire+0xc2/0x3a0 >> [ 730.877551] ? __wake_up_common_lock+0x52/0x90 >> [ 730.878434] ? lock_acquire+0xc2/0x3a0 >> [ 730.879186] ? lock_is_held_type+0xa7/0x120 >> [ 730.880044] ? skb_queue_tail+0x1b/0x50 >> [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 >> [ 730.881656] ? __wake_up_common_lock+0x52/0x90 >> [ 730.882532] __wake_up_common_lock+0x52/0x90 >> [ 730.883375] audit_log_end+0x5b/0x100 >> [ 730.884104] slow_avc_audit+0x69/0x90 >> [ 730.884836] avc_has_perm+0x8b/0xb0 >> [ 730.885532] selinux_lockdown+0xa5/0xd0 >> [ 730.886297] security_locked_down+0x20/0x40 >> [ 730.887133] bpf_probe_read_compat+0x66/0xd0 >> [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 >> [ 730.888917] trace_call_bpf+0xe9/0x240 >> [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 >> [ 730.890579] perf_trace_sched_switch+0x142/0x180 >> [ 730.891485] ? __schedule+0x6d8/0xb20 >> [ 730.892209] __schedule+0x6d8/0xb20 >> [ 730.892899] schedule+0x5b/0xc0 >> [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 >> [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 >> [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae >> [...] >> >> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") >> Reported-by: Ondrej Mosnacek <omosnace@redhat.com> >> Reported-by: Jakub Hrozek <jhrozek@redhat.com> >> Reported-by: Serhei Makarov <smakarov@redhat.com> >> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> >> Cc: Stephen Smalley <sds@tycho.nsa.gov> >> Cc: Jerome Marchand <jmarchan@redhat.com> >> Cc: Frank Eigler <fche@redhat.com> >> Cc: Jiri Olsa <jolsa@redhat.com> >> Cc: Paul Moore <paul@paul-moore.com> > > found the original server and reproduced.. this patch fixes it for me > > Tested-by: Jiri Olsa <jolsa@redhat.com> Thanks Jiri! If Paul is fine with this as well, I can later route this fix via bpf tree. Best, Daniel
(I'm off work today and plan to reply also to Paul's comments next week, but for now let me at least share a couple quick thoughts on Daniel's patch.) On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 5/28/21 9:09 AM, Daniel Borkmann wrote: > > On 5/28/21 3:37 AM, Paul Moore wrote: > >> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > >>> > >>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > >>> lockdown") added an implementation of the locked_down LSM hook to > >>> SELinux, with the aim to restrict which domains are allowed to perform > >>> operations that would breach lockdown. > >>> > >>> However, in several places the security_locked_down() hook is called in > >>> situations where the current task isn't doing any action that would > >>> directly breach lockdown, leading to SELinux checks that are basically > >>> bogus. > >>> > >>> Since in most of these situations converting the callers such that > >>> security_locked_down() is called in a context where the current task > >>> would be meaningful for SELinux is impossible or very non-trivial (and > >>> could lead to TOCTOU issues for the classic Lockdown LSM > >>> implementation), fix this by modifying the hook to accept a struct cred > >>> pointer as argument, where NULL will be interpreted as a request for a > >>> "global", task-independent lockdown decision only. Then modify SELinux > >>> to ignore calls with cred == NULL. > >> > >> I'm not overly excited about skipping the access check when cred is > >> NULL. Based on the description and the little bit that I've dug into > >> thus far it looks like using SECINITSID_KERNEL as the subject would be > >> much more appropriate. *Something* (the kernel in most of the > >> relevant cases it looks like) is requesting that a potentially > >> sensitive disclosure be made, and ignoring it seems like the wrong > >> thing to do. Leaving the access control intact also provides a nice > >> avenue to audit these requests should users want to do that. > > > > I think the rationale/workaround for ignoring calls with cred == NULL (or the previous > > patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his > > seen tracing cases: > > > > i) The audit events that are triggered due to calls to security_locked_down() > > can OOM kill a machine, see below details [0]. > > > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > > when presumingly trying to wake up kauditd [1]. Actually, I wasn't aware of the deadlock... But calling an LSM hook [that is backed by a SELinux access check] from within a BPF helper is calling for all kinds of trouble, so I'm not surprised :) > Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked > at the rest but it's also kind of independent), the attached fix should address both > reported issues, please take a look & test. Thanks, I like this solution, although there are a few gotchas: 1. This patch creates a slight "regression" in that if someone flips the Lockdown LSM into lockdown mode on runtime, existing (already loaded) BPF programs will still be able to call the confidentiality-breaching helpers, while before the lockdown would apply also to them. Personally, I don't think it's a big deal (and I bet there are other existing cases where some handle kept from before lockdown could leak data), but I wanted to mention it in case someone thinks the opposite. 2. IIUC. when a BPF program is rejected due to lockdown/SELinux, the kernel will return -EINVAL to userspace (looking at check_helper_call() in kernel/bpf/verifier.c; didn't have time to look at other callers...). It would be nicer if the error code from the security_locked_down() call would be passed through the call chain and eventually returned to the caller. It should be relatively straightforward to convert bpf_base_func_proto() to return a PTR_ERR() instead of NULL on error, but it looks like this would result in quite a big patch updating all the callers (and callers of callers, etc.) with a not-so-small chance of missing some NULL check and introducing a bug... I guess we could live with EINVAL-on-denied in stable kernels and only have the error path refactoring in -next; I'm not sure... 3. This is a bit of a shot-in-the-dark, but I suppose there might be some BPF programs that would be able to do something useful also when the read_kernel helpers return an error, yet the kernel will now outright refuse to load them (when the lockdown hook returns nonzero). I have no idea if such BPF programs realistically exist in practice, but perhaps it would be worth returning some dummy always-error-returning helper function instead of NULL from bpf_base_func_proto() when security_locked_down() returns an error. That would also resolve (2.), basically. (Then there is the question of what error code to use (because Lockdown LSM uses -EPERM, while SELinux -EACCESS), but I think always returning -EPERM from such stub helpers would be a viable choice.) -- Ondrej Mosnacek Software Engineer, Linux Security - SELinux kernel Red Hat, Inc.
On Mon, 17 May 2021 11:20:06 +0200 Ondrej Mosnacek <omosnace@redhat.com> wrote: > diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c > index 1261e8b41edb..7edde3fc22f5 100644 > --- a/fs/tracefs/inode.c > +++ b/fs/tracefs/inode.c > @@ -396,7 +396,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, > struct dentry *dentry; > struct inode *inode; > > - if (security_locked_down(LOCKDOWN_TRACEFS)) > + if (security_cred_locked_down(NULL, LOCKDOWN_TRACEFS)) > return NULL; > > if (!(mode & S_IFMT)) Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> -- Steve
On 5/28/21 3:42 PM, Ondrej Mosnacek wrote: > (I'm off work today and plan to reply also to Paul's comments next > week, but for now let me at least share a couple quick thoughts on > Daniel's patch.) > > On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 5/28/21 9:09 AM, Daniel Borkmann wrote: >>> On 5/28/21 3:37 AM, Paul Moore wrote: >>>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: >>>>> >>>>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux >>>>> lockdown") added an implementation of the locked_down LSM hook to >>>>> SELinux, with the aim to restrict which domains are allowed to perform >>>>> operations that would breach lockdown. >>>>> >>>>> However, in several places the security_locked_down() hook is called in >>>>> situations where the current task isn't doing any action that would >>>>> directly breach lockdown, leading to SELinux checks that are basically >>>>> bogus. >>>>> >>>>> Since in most of these situations converting the callers such that >>>>> security_locked_down() is called in a context where the current task >>>>> would be meaningful for SELinux is impossible or very non-trivial (and >>>>> could lead to TOCTOU issues for the classic Lockdown LSM >>>>> implementation), fix this by modifying the hook to accept a struct cred >>>>> pointer as argument, where NULL will be interpreted as a request for a >>>>> "global", task-independent lockdown decision only. Then modify SELinux >>>>> to ignore calls with cred == NULL. >>>> >>>> I'm not overly excited about skipping the access check when cred is >>>> NULL. Based on the description and the little bit that I've dug into >>>> thus far it looks like using SECINITSID_KERNEL as the subject would be >>>> much more appropriate. *Something* (the kernel in most of the >>>> relevant cases it looks like) is requesting that a potentially >>>> sensitive disclosure be made, and ignoring it seems like the wrong >>>> thing to do. Leaving the access control intact also provides a nice >>>> avenue to audit these requests should users want to do that. >>> >>> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous >>> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his >>> seen tracing cases: >>> >>> i) The audit events that are triggered due to calls to security_locked_down() >>> can OOM kill a machine, see below details [0]. >>> >>> ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() >>> when presumingly trying to wake up kauditd [1]. > > Actually, I wasn't aware of the deadlock... But calling an LSM hook > [that is backed by a SELinux access check] from within a BPF helper is > calling for all kinds of trouble, so I'm not surprised :) Fully agree, it's just waiting to blow up in unpredictable ways.. :/ >> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked >> at the rest but it's also kind of independent), the attached fix should address both >> reported issues, please take a look & test. > > Thanks, I like this solution, although there are a few gotchas: > > 1. This patch creates a slight "regression" in that if someone flips > the Lockdown LSM into lockdown mode on runtime, existing (already > loaded) BPF programs will still be able to call the > confidentiality-breaching helpers, while before the lockdown would > apply also to them. Personally, I don't think it's a big deal (and I > bet there are other existing cases where some handle kept from before > lockdown could leak data), but I wanted to mention it in case someone > thinks the opposite. Yes, right, though this is nothing new either in the sense that there are plenty of other cases with security_locked_down() that operate this way e.g. take the open_kcore() for /proc/kcore access or the module_sig_check() for mod signatures just to pick some random ones, same approach where the enforcement is happen at open/load time. > 2. IIUC. when a BPF program is rejected due to lockdown/SELinux, the > kernel will return -EINVAL to userspace (looking at > check_helper_call() in kernel/bpf/verifier.c; didn't have time to look > at other callers...). It would be nicer if the error code from the > security_locked_down() call would be passed through the call chain and > eventually returned to the caller. It should be relatively > straightforward to convert bpf_base_func_proto() to return a PTR_ERR() > instead of NULL on error, but it looks like this would result in quite > a big patch updating all the callers (and callers of callers, etc.) > with a not-so-small chance of missing some NULL check and introducing > a bug... I guess we could live with EINVAL-on-denied in stable kernels > and only have the error path refactoring in -next; I'm not sure... Right, it would return a verifier log entry with reporting to the user that the prog is attempting to use an unavailable/unknown helper function. We do have similar return NULL with bpf_capable() and perfmon_capable() checks. Potentially, we could do PTR_ERR() in future where we tell if it failed due to missing CAPs, due to lockdown or just due to helper not compiled in.. > 3. This is a bit of a shot-in-the-dark, but I suppose there might be > some BPF programs that would be able to do something useful also when > the read_kernel helpers return an error, yet the kernel will now > outright refuse to load them (when the lockdown hook returns nonzero). > I have no idea if such BPF programs realistically exist in practice, > but perhaps it would be worth returning some dummy > always-error-returning helper function instead of NULL from > bpf_base_func_proto() when security_locked_down() returns an error. > That would also resolve (2.), basically. (Then there is the question > of what error code to use (because Lockdown LSM uses -EPERM, while > SELinux -EACCESS), but I think always returning -EPERM from such stub > helpers would be a viable choice.) It would actually be harder to debug. Returning NULL at verification time, libbpf, for example, would have a chance to probe for this. See the feature_probes[] in libbpf's kernel_supports(), so it could provide a meaningful warning to the user that the tracing functionality is unavailable on the system. With returning an error from the helper, libbpf cannot check it.. theoretically, it could but significantly more cumbersome given it needs to attach the probe somewhere, trigger it, read out the helper result and pass it back to libbpf user space.. not really feasible. Overall, moving into func_proto and returning NULL is the much better approach (and in line with the CAP check enforcement). Thanks, Daniel
On Fri, May 28, 2021 at 3:10 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 5/28/21 3:37 AM, Paul Moore wrote: > > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > >> > >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > >> lockdown") added an implementation of the locked_down LSM hook to > >> SELinux, with the aim to restrict which domains are allowed to perform > >> operations that would breach lockdown. > >> > >> However, in several places the security_locked_down() hook is called in > >> situations where the current task isn't doing any action that would > >> directly breach lockdown, leading to SELinux checks that are basically > >> bogus. > >> > >> Since in most of these situations converting the callers such that > >> security_locked_down() is called in a context where the current task > >> would be meaningful for SELinux is impossible or very non-trivial (and > >> could lead to TOCTOU issues for the classic Lockdown LSM > >> implementation), fix this by modifying the hook to accept a struct cred > >> pointer as argument, where NULL will be interpreted as a request for a > >> "global", task-independent lockdown decision only. Then modify SELinux > >> to ignore calls with cred == NULL. > > > > I'm not overly excited about skipping the access check when cred is > > NULL. Based on the description and the little bit that I've dug into > > thus far it looks like using SECINITSID_KERNEL as the subject would be > > much more appropriate. *Something* (the kernel in most of the > > relevant cases it looks like) is requesting that a potentially > > sensitive disclosure be made, and ignoring it seems like the wrong > > thing to do. Leaving the access control intact also provides a nice > > avenue to audit these requests should users want to do that. > > I think the rationale/workaround for ignoring calls with cred == NULL (or the previous > patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his > seen tracing cases: > > i) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() > when presumingly trying to wake up kauditd [1]. > > How would your suggestion above solve both i) and ii)? First off, a bit of general commentary - I'm not sure if Ondrej was aware of this, but info like that is good to have in the commit description. Perhaps it was in the linked RHBZ but I try not to look at those when reviewing patches; the commit descriptions must be self-sufficient since we can't rely on the accessibility or the lifetime of external references. It's fine if people want to include external links in their commits, I would actually even encourage it in some cases, but the links shouldn't replace a proper description of the problem and why the proposed solution is The Best Solution. With that out of the way, it sounds like your issue isn't so much the access check, but rather the frequency of the access denials and the resulting audit records in your particular use case. My initial reaction is that you might want to understand why you are getting so many SELinux access denials, your loaded security policy clearly does not match with your intended use :) Beyond that, if you want to basically leave things as-is but quiet the high frequency audit records that result from these SELinux denials you might want to look into the SELinux "dontaudit" policy rule, it was created for things like this. Some info can be found in The SELinux Notebook, relevant link below: * https://github.com/SELinuxProject/selinux-notebook/blob/main/src/avc_rules.md#dontaudit The deadlock issue that was previously reported remains an open case as far as I'm concerned; I'm presently occupied trying to sort out a rather serious issue with respect to io_uring and LSM/audit (plus general stuff at $DAYJOB) so I haven't had time to investigate this any further. Of course anyone else is welcome to dive into it (I always want to encourage this, especially from "performance people" who just want to shut it all off), however if the answer is basically "disable LSM and/or audit checks" you have to know that it is going to result in a high degree of skepticism from me, so heavy documentation on why it is The Best Solution would be a very good thing :) Beyond that, I think the suggestions above of "why do you have so many policy denials?" and "have you looked into dontaudit?" are solid places to look for a solution in your particular case. > >> Since most callers will just want to pass current_cred() as the cred > >> parameter, rename the hook to security_cred_locked_down() and provide > >> the original security_locked_down() function as a simple wrapper around > >> the new hook. > > [...] > > > >> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() > >> Called when a BPF program calls a helper that could leak kernel > >> memory. The task context is not relevant here, since the program > >> may very well be run in the context of a different task than the > >> consumer of the data. > >> See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 > > > > The access control check isn't so much who is consuming the data, but > > who is requesting a potential violation of a "lockdown", yes? For > > example, the SELinux policy rule for the current lockdown check looks > > something like this: > > > > allow <who> <who> : lockdown { <reason> }; > > > > It seems to me that the task context is relevant here and performing > > the access control check based on the task's domain is correct. > > This doesn't make much sense to me, it's /not/ the task 'requesting a potential > violation of a "lockdown"', but rather the running tracing program which is e.g. > inspecting kernel data structures around the triggered event. If I understood > you correctly, having an 'allow' check on, say, httpd would be rather odd since > things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead. > > Meaning, if we would /not/ trace such events (like in the prior mentioned syscall > example), then there is also no call to the security_locked_down() from that same/ > unmodified application. My turn to say that you don't make much sense to me :) Let's reset. What task_struct is running the BPF tracing program which is calling into security_locked_down()? My current feeling is that it is this context/domain/cred that should be used for the access control check; in the cases where it is a kernel thread, I think passing NULL is reasonable, but I think the proper thing for SELinux is to interpret NULL as kernel_t. -- paul moore www.paul-moore.com
On Fri, May 28, 2021 at 10:43 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 5/28/21 3:42 PM, Ondrej Mosnacek wrote: > > (I'm off work today and plan to reply also to Paul's comments next > > week, but for now let me at least share a couple quick thoughts on > > Daniel's patch.) Oooh, I sense some disagreement brewing :) > > On Fri, May 28, 2021 at 11:56 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 5/28/21 9:09 AM, Daniel Borkmann wrote: > >>> On 5/28/21 3:37 AM, Paul Moore wrote: > >>>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: ... > >> Ondrej / Paul / Jiri: at least for the BPF tracing case specifically (I haven't looked > >> at the rest but it's also kind of independent), the attached fix should address both > >> reported issues, please take a look & test. > > > > Thanks, I like this solution, although there are a few gotchas: > > > > 1. This patch creates a slight "regression" in that if someone flips > > the Lockdown LSM into lockdown mode on runtime, existing (already > > loaded) BPF programs will still be able to call the > > confidentiality-breaching helpers, while before the lockdown would > > apply also to them. Personally, I don't think it's a big deal (and I > > bet there are other existing cases where some handle kept from before > > lockdown could leak data), but I wanted to mention it in case someone > > thinks the opposite. > > Yes, right, though this is nothing new either in the sense that there are > plenty of other cases with security_locked_down() that operate this way > e.g. take the open_kcore() for /proc/kcore access or the module_sig_check() > for mod signatures just to pick some random ones, same approach where the > enforcement is happen at open/load time. Another, yes, this is not really a good thing to do. Also, just because there are other places that don't really do The Right Thing doesn't mean that it is okay to also not do The Right Thing here. It's basically the two-wrongs-don't-make-a-right issue applied to kernel code. -- paul moore www.paul-moore.com
On 5/28/21 5:47 PM, Paul Moore wrote: > On Fri, May 28, 2021 at 3:10 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 5/28/21 3:37 AM, Paul Moore wrote: >>> On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: >>>> >>>> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux >>>> lockdown") added an implementation of the locked_down LSM hook to >>>> SELinux, with the aim to restrict which domains are allowed to perform >>>> operations that would breach lockdown. >>>> >>>> However, in several places the security_locked_down() hook is called in >>>> situations where the current task isn't doing any action that would >>>> directly breach lockdown, leading to SELinux checks that are basically >>>> bogus. >>>> >>>> Since in most of these situations converting the callers such that >>>> security_locked_down() is called in a context where the current task >>>> would be meaningful for SELinux is impossible or very non-trivial (and >>>> could lead to TOCTOU issues for the classic Lockdown LSM >>>> implementation), fix this by modifying the hook to accept a struct cred >>>> pointer as argument, where NULL will be interpreted as a request for a >>>> "global", task-independent lockdown decision only. Then modify SELinux >>>> to ignore calls with cred == NULL. >>> >>> I'm not overly excited about skipping the access check when cred is >>> NULL. Based on the description and the little bit that I've dug into >>> thus far it looks like using SECINITSID_KERNEL as the subject would be >>> much more appropriate. *Something* (the kernel in most of the >>> relevant cases it looks like) is requesting that a potentially >>> sensitive disclosure be made, and ignoring it seems like the wrong >>> thing to do. Leaving the access control intact also provides a nice >>> avenue to audit these requests should users want to do that. >> >> I think the rationale/workaround for ignoring calls with cred == NULL (or the previous >> patch with the unimplemented hook) from Ondrej was two-fold, at least speaking for his >> seen tracing cases: >> >> i) The audit events that are triggered due to calls to security_locked_down() >> can OOM kill a machine, see below details [0]. >> >> ii) It seems to be causing a deadlock via slow_avc_audit() -> audit_log_end() >> when presumingly trying to wake up kauditd [1]. >> >> How would your suggestion above solve both i) and ii)? > > First off, a bit of general commentary - I'm not sure if Ondrej was > aware of this, but info like that is good to have in the commit > description. Perhaps it was in the linked RHBZ but I try not to look > at those when reviewing patches; the commit descriptions must be > self-sufficient since we can't rely on the accessibility or the > lifetime of external references. It's fine if people want to include > external links in their commits, I would actually even encourage it in > some cases, but the links shouldn't replace a proper description of > the problem and why the proposed solution is The Best Solution. > > With that out of the way, it sounds like your issue isn't so much the > access check, but rather the frequency of the access denials and the > resulting audit records in your particular use case. My initial > reaction is that you might want to understand why you are getting so > many SELinux access denials, your loaded security policy clearly does > not match with your intended use :) Beyond that, if you want to > basically leave things as-is but quiet the high frequency audit > records that result from these SELinux denials you might want to look > into the SELinux "dontaudit" policy rule, it was created for things > like this. Some info can be found in The SELinux Notebook, relevant > link below: > > * https://github.com/SELinuxProject/selinux-notebook/blob/main/src/avc_rules.md#dontaudit > > The deadlock issue that was previously reported remains an open case > as far as I'm concerned; I'm presently occupied trying to sort out a > rather serious issue with respect to io_uring and LSM/audit (plus > general stuff at $DAYJOB) so I haven't had time to investigate this > any further. Of course anyone else is welcome to dive into it (I > always want to encourage this, especially from "performance people" > who just want to shut it all off), however if the answer is basically > "disable LSM and/or audit checks" you have to know that it is going to > result in a high degree of skepticism from me, so heavy documentation > on why it is The Best Solution would be a very good thing :) Beyond > that, I think the suggestions above of "why do you have so many policy > denials?" and "have you looked into dontaudit?" are solid places to > look for a solution in your particular case. > >>>> Since most callers will just want to pass current_cred() as the cred >>>> parameter, rename the hook to security_cred_locked_down() and provide >>>> the original security_locked_down() function as a simple wrapper around >>>> the new hook. >> >> [...] >>> >>>> 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() >>>> Called when a BPF program calls a helper that could leak kernel >>>> memory. The task context is not relevant here, since the program >>>> may very well be run in the context of a different task than the >>>> consumer of the data. >>>> See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 >>> >>> The access control check isn't so much who is consuming the data, but >>> who is requesting a potential violation of a "lockdown", yes? For >>> example, the SELinux policy rule for the current lockdown check looks >>> something like this: >>> >>> allow <who> <who> : lockdown { <reason> }; >>> >>> It seems to me that the task context is relevant here and performing >>> the access control check based on the task's domain is correct. >> >> This doesn't make much sense to me, it's /not/ the task 'requesting a potential >> violation of a "lockdown"', but rather the running tracing program which is e.g. >> inspecting kernel data structures around the triggered event. If I understood >> you correctly, having an 'allow' check on, say, httpd would be rather odd since >> things like perf/bcc/bpftrace/systemtap/etc is installing the tracing probe instead. >> >> Meaning, if we would /not/ trace such events (like in the prior mentioned syscall >> example), then there is also no call to the security_locked_down() from that same/ >> unmodified application. > > My turn to say that you don't make much sense to me :) > > Let's reset. Sure, yep, lets shortly take one step back. :) > What task_struct is running the BPF tracing program which is calling > into security_locked_down()? My current feeling is that it is this > context/domain/cred that should be used for the access control check; > in the cases where it is a kernel thread, I think passing NULL is > reasonable, but I think the proper thing for SELinux is to interpret > NULL as kernel_t. If this was a typical LSM hook and, say, your app calls into bind(2) where we then invoke security_socket_bind() and check 'current' task, then I'm all with you, because this was _explicitly initiated_ by the httpd app, so that allow/deny policy belongs in the context of httpd. In the case of tracing, it's different. You install small programs that are triggered when certain events fire. Random example from bpftrace's README [0], you want to generate a histogram of syscall counts by program. One-liner is: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' bpftrace then goes and generates a BPF prog from this internally. One way of doing it could be to call bpf_get_current_task() helper and then access current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the program itself has nothing to do with httpd or any other random app doing a syscall here. The BPF prog _explicitly initiated_ the lockdown check. The allow/deny policy belongs in the context of bpftrace: meaning, you want to grant bpftrace access to use these helpers, but other tracers on the systems like my_random_tracer not. While this works for prior mentioned cases of security_locked_down() with open_kcore() for /proc/kcore access or the module_sig_check(), it is broken for tracing as-is, and the patch I sent earlier fixes this. Thanks, Daniel [0] https://github.com/iovisor/bpftrace
On Fri, May 28, 2021 at 2:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 5/28/21 5:47 PM, Paul Moore wrote: > > Let's reset. > > Sure, yep, lets shortly take one step back. :) > > > What task_struct is running the BPF tracing program which is calling > > into security_locked_down()? My current feeling is that it is this > > context/domain/cred that should be used for the access control check; > > in the cases where it is a kernel thread, I think passing NULL is > > reasonable, but I think the proper thing for SELinux is to interpret > > NULL as kernel_t. > > If this was a typical LSM hook and, say, your app calls into bind(2) where > we then invoke security_socket_bind() and check 'current' task, then I'm all > with you, because this was _explicitly initiated_ by the httpd app, so that > allow/deny policy belongs in the context of httpd. > > In the case of tracing, it's different. You install small programs that are > triggered when certain events fire. Random example from bpftrace's README [0], > you want to generate a histogram of syscall counts by program. One-liner is: > > bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' > > bpftrace then goes and generates a BPF prog from this internally. One way of > doing it could be to call bpf_get_current_task() helper and then access > current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the > program itself has nothing to do with httpd or any other random app doing > a syscall here. The BPF prog _explicitly initiated_ the lockdown check. > The allow/deny policy belongs in the context of bpftrace: meaning, you want > to grant bpftrace access to use these helpers, but other tracers on the > systems like my_random_tracer not. While this works for prior mentioned > cases of security_locked_down() with open_kcore() for /proc/kcore access > or the module_sig_check(), it is broken for tracing as-is, and the patch > I sent earlier fixes this. Sigh. Generally it's helpful when someone asks a question if you answer it directly before going off and answering your own questions. Listen, I get it, you wrote a patch and it fixes your problem (you've mentioned that already) and it's wonderful and all that, but the rest of us (maybe just me) need to sort this out too and talking past questions isn't a great way to help us get there (once again, maybe just me). I think I can infer an answer from you, but you've made me grumpy now so I'm not ACK'ing or NACK'ing anything right now; I clearly need to go spend some time reading through BPF code. Woo. > [0] https://github.com/iovisor/bpftrace -- paul moore www.paul-moore.com
On Fri, May 28, 2021 at 2:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > In the case of tracing, it's different. You install small programs that are > triggered when certain events fire. Random example from bpftrace's README [0], > you want to generate a histogram of syscall counts by program. One-liner is: > > bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' > > bpftrace then goes and generates a BPF prog from this internally. One way of > doing it could be to call bpf_get_current_task() helper and then access > current->comm via one of bpf_probe_read_kernel{,_str}() helpers ... I think we can all agree that the BPF tracing is a bit chaotic in the sense that the tracing programs can be executed in various places/contexts and that presents some challenges with respect to access control and auditing. If you are following the io_uring stuff that is going on now you can see a little of what is required to make audit work properly in the various io_uring contexts and that is relatively small compared to what is possible with BPF tracing. Of course this assumes I've managed to understand bpf tracing properly this morning, and I very well may still be missing points and/or confused about some of the important details. Corrections are welcome. Daniel's patch side steps that worry by just doing the lockdown permission check when the BPF program is loaded, but that isn't a great solution if the policy changes afterward. I was hoping there might be some way to perform the permission check as needed, but the more I look the more that appears to be difficult, if not impossible (once again, corrections are welcome). I'm now wondering if the right solution here is to make use of the LSM notifier mechanism. I'm not yet entirely sure if this would work from a BPF perspective, but I could envision the BPF subsystem registering a LSM notification callback via register_blocking_lsm_notifier(), see if Infiniband code as an example, and then when the LSM(s) policy changes the BPF subsystem would get a notification and it could revalidate the existing BPF programs and take block/remove/whatever the offending BPF programs. This obviously requires a few things which I'm not sure are easily done, or even possible: 1. Somehow the BPF programs would need to be "marked" at load/verification time with respect to their lockdown requirements so that decisions can be made later. Perhaps a flag in bpf_prog_aux? 2. While it looks like it should be possible to iterate over all of the loaded BPF programs in the LSM notifier callback via idr_for_each(prog_idr, ...), it is not clear to me if it is possible to safely remove, or somehow disable, BPF programs once they have been loaded. Hopefully the BPF folks can help answer that question. 3. Disabling of BPF programs might be preferable to removing them entirely on LSM policy changes as it would be possible to make the lockdown state less restrictive at a future point in time, allowing for the BPF program to be executed again. Once again, not sure if this is even possible. Related, the lockdown LSM should probably also grow LSM notifier support similar to selinux_lsm_notifier_avc_callback(), for example either lock_kernel_down() or lockdown_write() might want to do a call_blocking_lsm_notifier(LSM_POLICY_CHANGE, NULL) call. -- paul moore www.paul-moore.com
On 5/29/21 8:48 PM, Paul Moore wrote: [...] > Daniel's patch side steps that worry by just doing the lockdown > permission check when the BPF program is loaded, but that isn't a > great solution if the policy changes afterward. I was hoping there > might be some way to perform the permission check as needed, but the > more I look the more that appears to be difficult, if not impossible > (once again, corrections are welcome). Your observation is correct, will try to clarify below a bit. > I'm now wondering if the right solution here is to make use of the LSM > notifier mechanism. I'm not yet entirely sure if this would work from > a BPF perspective, but I could envision the BPF subsystem registering > a LSM notification callback via register_blocking_lsm_notifier(), see > if Infiniband code as an example, and then when the LSM(s) policy > changes the BPF subsystem would get a notification and it could > revalidate the existing BPF programs and take block/remove/whatever > the offending BPF programs. This obviously requires a few things > which I'm not sure are easily done, or even possible: > > 1. Somehow the BPF programs would need to be "marked" at > load/verification time with respect to their lockdown requirements so > that decisions can be made later. Perhaps a flag in bpf_prog_aux? > > 2. While it looks like it should be possible to iterate over all of > the loaded BPF programs in the LSM notifier callback via > idr_for_each(prog_idr, ...), it is not clear to me if it is possible > to safely remove, or somehow disable, BPF programs once they have been > loaded. Hopefully the BPF folks can help answer that question. > > 3. Disabling of BPF programs might be preferable to removing them > entirely on LSM policy changes as it would be possible to make the > lockdown state less restrictive at a future point in time, allowing > for the BPF program to be executed again. Once again, not sure if > this is even possible. Part of why this gets really complex/impossible is that BPF programs in the kernel are reference counted from various sides, be it that there are references from user space to them (fd from application, BPF fs, or BPF links), hooks where they are attached to as well as tail call maps where one BPF prog calls into another. There is currently also no global infra of some sort where you could piggy back to atomically keep track of all the references in a list or such. And the other thing is that BPF progs have no ownership that is tied to a specific task after they have been loaded. Meaning, once they are loaded into the kernel by an application and attached to a specific hook, they can remain there potentially until reboot of the node, so lifecycle of the user space application != lifecycle of the BPF program. It's maybe best to compare this aspect to kernel modules in the sense that you have an application that loads it into the kernel (insmod, etc, where you could also enforce lockdown signature check), but after that, they can be managed by other entities as well (implicitly refcounted from kernel, removed by other applications, etc). My understanding of the lockdown settings are that users have options to select/enforce a lockdown level of CONFIG_LOCK_DOWN_KERNEL_FORCE_{INTEGRITY, CONFIDENTIALITY} at compilation time, they have a lockdown={integrity| confidentiality} boot-time parameter, /sys/kernel/security/lockdown, and then more fine-grained policy via 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown"). Once you have set a global policy level, you cannot revert back to a less strict mode. So the SELinux policy is specifically tied around tasks to further restrict applications in respect to the global policy. I presume that would mean for those users that majority of tasks have the confidentiality option set via SELinux with just a few necessary using the integrity global policy. So overall the enforcing option when BPF program is loaded is the only really sensible option to me given only there we have the valid current task where such policy can be enforced. Best, Daniel
On Mon, May 31, 2021 at 4:24 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 5/29/21 8:48 PM, Paul Moore wrote: > [...] > > Daniel's patch side steps that worry by just doing the lockdown > > permission check when the BPF program is loaded, but that isn't a > > great solution if the policy changes afterward. I was hoping there > > might be some way to perform the permission check as needed, but the > > more I look the more that appears to be difficult, if not impossible > > (once again, corrections are welcome). > > Your observation is correct, will try to clarify below a bit. > > > I'm now wondering if the right solution here is to make use of the LSM > > notifier mechanism. I'm not yet entirely sure if this would work from > > a BPF perspective, but I could envision the BPF subsystem registering > > a LSM notification callback via register_blocking_lsm_notifier(), see > > if Infiniband code as an example, and then when the LSM(s) policy > > changes the BPF subsystem would get a notification and it could > > revalidate the existing BPF programs and take block/remove/whatever > > the offending BPF programs. This obviously requires a few things > > which I'm not sure are easily done, or even possible: > > > > 1. Somehow the BPF programs would need to be "marked" at > > load/verification time with respect to their lockdown requirements so > > that decisions can be made later. Perhaps a flag in bpf_prog_aux? > > > > 2. While it looks like it should be possible to iterate over all of > > the loaded BPF programs in the LSM notifier callback via > > idr_for_each(prog_idr, ...), it is not clear to me if it is possible > > to safely remove, or somehow disable, BPF programs once they have been > > loaded. Hopefully the BPF folks can help answer that question. > > > > 3. Disabling of BPF programs might be preferable to removing them > > entirely on LSM policy changes as it would be possible to make the > > lockdown state less restrictive at a future point in time, allowing > > for the BPF program to be executed again. Once again, not sure if > > this is even possible. > > Part of why this gets really complex/impossible is that BPF programs in > the kernel are reference counted from various sides, be it that there > are references from user space to them (fd from application, BPF fs, or > BPF links), hooks where they are attached to as well as tail call maps > where one BPF prog calls into another. There is currently also no global > infra of some sort where you could piggy back to atomically keep track of > all the references in a list or such. And the other thing is that BPF progs > have no ownership that is tied to a specific task after they have been > loaded. Meaning, once they are loaded into the kernel by an application > and attached to a specific hook, they can remain there potentially until > reboot of the node, so lifecycle of the user space application != lifecycle > of the BPF program. I don't think the disjoint lifecycle or lack of task ownership is a deal breaker from a LSM perspective as the LSMs can stash whatever info they need in the security pointer during the program allocation hook, e.g. selinux_bpf_prog_alloc() saves the security domain which allocates/loads the BPF program. The thing I'm worried about would be the case where a LSM policy change requires that an existing BPF program be removed or disabled. I'm guessing based on the refcounting that there is not presently a clean way to remove a BPF program from the system, but is this something we could resolve? If we can't safely remove a BPF program from the system, can we replace/swap it with an empty/NULL BPF program? > It's maybe best to compare this aspect to kernel modules in the sense that > you have an application that loads it into the kernel (insmod, etc, where > you could also enforce lockdown signature check), but after that, they can > be managed by other entities as well (implicitly refcounted from kernel, > removed by other applications, etc). Well, I guess we could consider BPF programs as out-of-tree kernel modules that potentially do very odd and dangerous things, e.g. performing access control checks *inside* access control checks ... but yeah, I get your point at a basic level, I just think that comparing BPF programs to kernel modules is a not-so-great comparison in general. > My understanding of the lockdown settings are that users have options > to select/enforce a lockdown level of CONFIG_LOCK_DOWN_KERNEL_FORCE_{INTEGRITY, > CONFIDENTIALITY} at compilation time, they have a lockdown={integrity| > confidentiality} boot-time parameter, /sys/kernel/security/lockdown, > and then more fine-grained policy via 59438b46471a ("security,lockdown,selinux: > implement SELinux lockdown"). Once you have set a global policy level, > you cannot revert back to a less strict mode. I don't recall there being anything in the SELinux lockdown support that prevents a newly loaded policy from allowing a change in the lockdown level, either stricter or more permissive, for a given domain. Looking quickly at the code, that still seems to be the case. The SELinux lockdown access controls function independently of the global build and runtime lockdown configuration. > So the SELinux policy is > specifically tied around tasks to further restrict applications in respect > to the global policy. As a reminder, there is no guarantee that both the SELinux and lockdown LSM are both loaded and active at runtime, it is possible that only SELinux is active. If SELinux is the only LSM enforcing lockdown access controls, there is no global lockdown setting, it is determined per-domain. > I presume that would mean for those users that majority > of tasks have the confidentiality option set via SELinux with just a few > necessary using the integrity global policy. So overall the enforcing > option when BPF program is loaded is the only really sensible option to > me given only there we have the valid current task where such policy can > be enforced. -- paul moore www.paul-moore.com
On 6/1/21 10:47 PM, Paul Moore wrote: > On Mon, May 31, 2021 at 4:24 AM Daniel Borkmann <daniel@iogearbox.net> wrote: >> On 5/29/21 8:48 PM, Paul Moore wrote: >> [...] >>> Daniel's patch side steps that worry by just doing the lockdown >>> permission check when the BPF program is loaded, but that isn't a >>> great solution if the policy changes afterward. I was hoping there >>> might be some way to perform the permission check as needed, but the >>> more I look the more that appears to be difficult, if not impossible >>> (once again, corrections are welcome). >> >> Your observation is correct, will try to clarify below a bit. >> >>> I'm now wondering if the right solution here is to make use of the LSM >>> notifier mechanism. I'm not yet entirely sure if this would work from >>> a BPF perspective, but I could envision the BPF subsystem registering >>> a LSM notification callback via register_blocking_lsm_notifier(), see >>> if Infiniband code as an example, and then when the LSM(s) policy >>> changes the BPF subsystem would get a notification and it could >>> revalidate the existing BPF programs and take block/remove/whatever >>> the offending BPF programs. This obviously requires a few things >>> which I'm not sure are easily done, or even possible: >>> >>> 1. Somehow the BPF programs would need to be "marked" at >>> load/verification time with respect to their lockdown requirements so >>> that decisions can be made later. Perhaps a flag in bpf_prog_aux? >>> >>> 2. While it looks like it should be possible to iterate over all of >>> the loaded BPF programs in the LSM notifier callback via >>> idr_for_each(prog_idr, ...), it is not clear to me if it is possible >>> to safely remove, or somehow disable, BPF programs once they have been >>> loaded. Hopefully the BPF folks can help answer that question. >>> >>> 3. Disabling of BPF programs might be preferable to removing them >>> entirely on LSM policy changes as it would be possible to make the >>> lockdown state less restrictive at a future point in time, allowing >>> for the BPF program to be executed again. Once again, not sure if >>> this is even possible. >> >> Part of why this gets really complex/impossible is that BPF programs in >> the kernel are reference counted from various sides, be it that there >> are references from user space to them (fd from application, BPF fs, or >> BPF links), hooks where they are attached to as well as tail call maps >> where one BPF prog calls into another. There is currently also no global >> infra of some sort where you could piggy back to atomically keep track of >> all the references in a list or such. And the other thing is that BPF progs >> have no ownership that is tied to a specific task after they have been >> loaded. Meaning, once they are loaded into the kernel by an application >> and attached to a specific hook, they can remain there potentially until >> reboot of the node, so lifecycle of the user space application != lifecycle >> of the BPF program. > > I don't think the disjoint lifecycle or lack of task ownership is a > deal breaker from a LSM perspective as the LSMs can stash whatever > info they need in the security pointer during the program allocation > hook, e.g. selinux_bpf_prog_alloc() saves the security domain which > allocates/loads the BPF program. > > The thing I'm worried about would be the case where a LSM policy > change requires that an existing BPF program be removed or disabled. > I'm guessing based on the refcounting that there is not presently a > clean way to remove a BPF program from the system, but is this > something we could resolve? If we can't safely remove a BPF program > from the system, can we replace/swap it with an empty/NULL BPF > program? Removing progs would somehow mean destroying those references from an async event and then /safely/ guaranteeing that nothing is accessing them anymore. But then if policy changes once more where they would be allowed again we would need to revert back to the original state, which brings us to your replace/swap question with an empty/null prog. It's not feasible either, because there are different BPF program types and they can have different return code semantics that lead to subsequent actions. If we were to replace them with an empty/NULL program, then essentially this will get us into an undefined system state given it's unclear what should be a default policy for each program type, etc. Just to pick one simple example, outside of tracing, that comes to mind: say, you attached a program with tc to a given device ingress hook. That program implements firewalling functionality, and potentially deep down in that program there is functionality to record/sample packets along with some meta data. Part of what is exported to the ring buffer to the user space reader may be a struct net_device field that is otherwise not available (or at least not yet), hence it's probe-read with mentioned helpers. If you were now to change the SELinux policy for that tc loader application, and therefore replace/swap the progs in the kernel that were loaded with it (given tc's lockdown policy was recorded in their sec blob) with an empty/NULL program, then either you say allow-all or drop-all, but either way, you break the firewalling functionality completely by locking yourself out of the machine or letting everything through. There is no sane way where we could reason about the context/internals of a given program where it would be safe to replace with a simple empty/NULL prog. Best, Daniel
On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote: > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > > > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > > lockdown") added an implementation of the locked_down LSM hook to > > SELinux, with the aim to restrict which domains are allowed to perform > > operations that would breach lockdown. > > > > However, in several places the security_locked_down() hook is called in > > situations where the current task isn't doing any action that would > > directly breach lockdown, leading to SELinux checks that are basically > > bogus. > > > > Since in most of these situations converting the callers such that > > security_locked_down() is called in a context where the current task > > would be meaningful for SELinux is impossible or very non-trivial (and > > could lead to TOCTOU issues for the classic Lockdown LSM > > implementation), fix this by modifying the hook to accept a struct cred > > pointer as argument, where NULL will be interpreted as a request for a > > "global", task-independent lockdown decision only. Then modify SELinux > > to ignore calls with cred == NULL. > > I'm not overly excited about skipping the access check when cred is > NULL. Based on the description and the little bit that I've dug into > thus far it looks like using SECINITSID_KERNEL as the subject would be > much more appropriate. *Something* (the kernel in most of the > relevant cases it looks like) is requesting that a potentially > sensitive disclosure be made, and ignoring it seems like the wrong > thing to do. Leaving the access control intact also provides a nice > avenue to audit these requests should users want to do that. > > Those users that generally don't care can grant kernel_t all the > necessary permissions without much policy. Seems kind of pointless to me, but it's a relatively simple change to do a check against SECINITSID_KERNEL, so I don't mind doing it like that. > > Since most callers will just want to pass current_cred() as the cred > > parameter, rename the hook to security_cred_locked_down() and provide > > the original security_locked_down() function as a simple wrapper around > > the new hook. > > I know you and Casey went back and forth on this in v1, but I agree > with Casey that having two LSM hooks here is a mistake. I know it > makes backports hard, but spoiler alert: maintaining complex software > over any non-trivial period of time is hard, reeeeally hard sometimes > ;) Do you mean having two slots in lsm_hook_defs.h or also having two security_*() functions? (It's not clear to me if you're just reiterating disagreement with v1 or if you dislike the simplified v2 as well.) > > The callers migrated to the new hook, passing NULL as cred: > > 1. arch/powerpc/xmon/xmon.c > > Here the hook seems to be called from non-task context and is only > > used for redacting some sensitive values from output sent to > > userspace. > > This definitely sounds like kernel_t based on the description above. Here I'm a little concerned that the hook might be called from some unusual interrupt, which is not masked by spin_lock_irqsave()... We ran into this with PMI (Platform Management Interrupt) before, see commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level checks in perf interrupt context"). While I can't see anything that would suggest something like this happening here, the whole thing is so foreign to me that I'm wary of making assumptions :) @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is only called from normal syscall/interrupt context (as opposed to something tricky like PMI)? > > 2. fs/tracefs/inode.c:tracefs_create_file() > > Here the call is used to prevent creating new tracefs entries when > > the kernel is locked down. Assumes that locking down is one-way - > > i.e. if the hook returns non-zero once, it will never return zero > > again, thus no point in creating these files. > > More kernel_t. This should be OK. > > 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() > > Called when a BPF program calls a helper that could leak kernel > > memory. The task context is not relevant here, since the program > > may very well be run in the context of a different task than the > > consumer of the data. > > See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 > > The access control check isn't so much who is consuming the data, but > who is requesting a potential violation of a "lockdown", yes? For > example, the SELinux policy rule for the current lockdown check looks > something like this: > > allow <who> <who> : lockdown { <reason> }; > > It seems to me that the task context is relevant here and performing > the access control check based on the task's domain is correct. If we > are also concerned about who has access to this sensitive information > once it has been determined that the task can cause it to be sent, we > should have another check point for that, assuming the access isn't > already covered by another check/hook. This case is being discussed further in this thread, so I'm going to skip it in this reply. > > 4. net/xfrm/xfrm_user.c:copy_to_user_*() > > Here a cryptographic secret is redacted based on the value returned > > from the hook. There are two possible actions that may lead here: > > a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the > > task context is relevant, since the dumped data is sent back to > > the current task. > > If the task context is relevant we should use it. Yes, but as I said it would create an asymmetry with case b), which I'll expand on below... > > b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are > > broadcasted to tasks subscribed to XFRM events - here the > > SELinux check is not meningful as the current task's creds do > > not represent the tasks that could potentially see the secret. > > This looks very similar to the BPF hook discussed above, I believe my > comments above apply here as well. Using the current task is just logically wrong in this case. The current task here is just simply deleting an SA that happens to have some secret value in it. When deleting an SA, a notification is sent to a group of subscribers (some group of other tasks), which includes a dump of the secret value. The current task isn't doing any attempt to breach lockdown, it's just deleting an SA. It also makes it really awkward to make policy decisions around this. Suppose that domains A, B, and C need to be able to add/delete SAs and domains D, E, and F need to receive notifications about changes in SAs. Then if, say, domain E actually needs to see the secret values in the notifications, you must grant the confidentiality permission to all of A, B, C to keep things working. And now you have opened up the door for A, B, C to do other lockdown-confidentiality stuff, even though these domains themselves actually don't request/need any confidential data from the kernel. That's just not logical and you may actually end up (slightly) worse security-wise than if you just skipped checking for XFRM secrets altogether, because you need to allow confidentiality to domains for which it may be excessive. This is why I talk about the task that gets to see the sensitive values as the relevant one - because otherwise the semantics of a given domain having the confidentiality permission granted becomes very hard to reason about. > > It really doesn't seem worth it to try to preserve the check in the > > a) case ... > > After you've read all of the above I hope you can understand why I > disagree with this. > > > ... since the eventual leak can be circumvented anyway via b) > > I don't follow the statement above ... ? However I'm not sure it > matters much considering my other concerns. What I meant was that if we skip/kernel_t-ize the check in case b) (for which I don't see a good alternative), then denying confidentiality perm to a given domain wouldn't prevent it from seeing the key value, as it could potentially see them by subscribing to SA modification events. IMO, in that case it's better to just give up on controlling the SA secrets with SELinux lockdown altogether than to create some false assumptions of this being covered. You may disagree and would be willing to implement the partial checking as well if you insist, but we need to first come to a consensus about case b) before such discussion becomes relevant, anyway... Given the yet unresolved discussions around the XFRM and BPF cases, I plan to respin the patch with just the tracefs and xmon changes and we can then incrementally address the rest as the individual discussions come to a consensus. -- Ondrej Mosnacek Software Engineer, Linux Security - SELinux kernel Red Hat, Inc.
On Wed, Jun 2, 2021 at 8:40 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 6/1/21 10:47 PM, Paul Moore wrote: > > The thing I'm worried about would be the case where a LSM policy > > change requires that an existing BPF program be removed or disabled. > > I'm guessing based on the refcounting that there is not presently a > > clean way to remove a BPF program from the system, but is this > > something we could resolve? If we can't safely remove a BPF program > > from the system, can we replace/swap it with an empty/NULL BPF > > program? > > Removing progs would somehow mean destroying those references from an > async event and then /safely/ guaranteeing that nothing is accessing > them anymore. But then if policy changes once more where they would > be allowed again we would need to revert back to the original state, > which brings us to your replace/swap question with an empty/null prog. > It's not feasible either, because there are different BPF program types > and they can have different return code semantics that lead to subsequent > actions. If we were to replace them with an empty/NULL program, then > essentially this will get us into an undefined system state given it's > unclear what should be a default policy for each program type, etc. > Just to pick one simple example, outside of tracing, that comes to mind: > say, you attached a program with tc to a given device ingress hook. That > program implements firewalling functionality, and potentially deep down > in that program there is functionality to record/sample packets along > with some meta data. Part of what is exported to the ring buffer to the > user space reader may be a struct net_device field that is otherwise not > available (or at least not yet), hence it's probe-read with mentioned > helpers. If you were now to change the SELinux policy for that tc loader > application, and therefore replace/swap the progs in the kernel that were > loaded with it (given tc's lockdown policy was recorded in their sec blob) > with an empty/NULL program, then either you say allow-all or drop-all, > but either way, you break the firewalling functionality completely by > locking yourself out of the machine or letting everything through. There > is no sane way where we could reason about the context/internals of a > given program where it would be safe to replace with a simple empty/NULL > prog. Help me out here, is your answer that the access check can only be done at BPF program load time? That isn't really a solution from a SELinux perspective as far as I'm concerned. I understand the ideas I've tossed out aren't practical from a BPF perspective, but it would be nice if we could find something that does work. Surely you BPF folks can think of some way to provide a runtime, not load time, check? -- paul moore www.paul-moore.com
On Wed, Jun 2, 2021 at 9:40 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote: > > On Mon, May 17, 2021 at 5:22 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > > > > > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux > > > lockdown") added an implementation of the locked_down LSM hook to > > > SELinux, with the aim to restrict which domains are allowed to perform > > > operations that would breach lockdown. > > > > > > However, in several places the security_locked_down() hook is called in > > > situations where the current task isn't doing any action that would > > > directly breach lockdown, leading to SELinux checks that are basically > > > bogus. > > > > > > Since in most of these situations converting the callers such that > > > security_locked_down() is called in a context where the current task > > > would be meaningful for SELinux is impossible or very non-trivial (and > > > could lead to TOCTOU issues for the classic Lockdown LSM > > > implementation), fix this by modifying the hook to accept a struct cred > > > pointer as argument, where NULL will be interpreted as a request for a > > > "global", task-independent lockdown decision only. Then modify SELinux > > > to ignore calls with cred == NULL. > > > > I'm not overly excited about skipping the access check when cred is > > NULL. Based on the description and the little bit that I've dug into > > thus far it looks like using SECINITSID_KERNEL as the subject would be > > much more appropriate. *Something* (the kernel in most of the > > relevant cases it looks like) is requesting that a potentially > > sensitive disclosure be made, and ignoring it seems like the wrong > > thing to do. Leaving the access control intact also provides a nice > > avenue to audit these requests should users want to do that. > > > > Those users that generally don't care can grant kernel_t all the > > necessary permissions without much policy. > > Seems kind of pointless to me, but it's a relatively simple change to > do a check against SECINITSID_KERNEL, so I don't mind doing it like > that. It's not pointless, the granularity isn't as great as one would like, but it doesn't mean it is pointless. *Someone* is acting, in this case it just happens to be the kernel. It is likely the most admins and policy developers will not care, but some might, and we should enable that. > > > Since most callers will just want to pass current_cred() as the cred > > > parameter, rename the hook to security_cred_locked_down() and provide > > > the original security_locked_down() function as a simple wrapper around > > > the new hook. > > > > I know you and Casey went back and forth on this in v1, but I agree > > with Casey that having two LSM hooks here is a mistake. I know it > > makes backports hard, but spoiler alert: maintaining complex software > > over any non-trivial period of time is hard, reeeeally hard sometimes > > ;) > > Do you mean having two slots in lsm_hook_defs.h or also having two > security_*() functions? (It's not clear to me if you're just > reiterating disagreement with v1 or if you dislike the simplified v2 > as well.) To be clear I don't think there should be two functions for this, just make whatever changes are necessary to the existing security_locked_down() LSM hook. Yes, the backport is hard. Yes, it will touch a lot of code. Yes, those are lame excuses to not do the right thing. > > > The callers migrated to the new hook, passing NULL as cred: > > > 1. arch/powerpc/xmon/xmon.c > > > Here the hook seems to be called from non-task context and is only > > > used for redacting some sensitive values from output sent to > > > userspace. > > > > This definitely sounds like kernel_t based on the description above. > > Here I'm a little concerned that the hook might be called from some > unusual interrupt, which is not masked by spin_lock_irqsave()... We > ran into this with PMI (Platform Management Interrupt) before, see > commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level > checks in perf interrupt context"). While I can't see anything that > would suggest something like this happening here, the whole thing is > so foreign to me that I'm wary of making assumptions :) > > @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is > only called from normal syscall/interrupt context (as opposed to > something tricky like PMI)? You did submit the code change so I assumed you weren't concerned about it :) If it is a bad hook placement that is something else entirely. Hopefully we'll get some guidance from the PPC folks. > > > 4. net/xfrm/xfrm_user.c:copy_to_user_*() > > > Here a cryptographic secret is redacted based on the value returned > > > from the hook. There are two possible actions that may lead here: > > > a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the > > > task context is relevant, since the dumped data is sent back to > > > the current task. > > > > If the task context is relevant we should use it. > > Yes, but as I said it would create an asymmetry with case b), which > I'll expand on below... > > > > b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are > > > broadcasted to tasks subscribed to XFRM events - here the > > > SELinux check is not meningful as the current task's creds do > > > not represent the tasks that could potentially see the secret. > > > > This looks very similar to the BPF hook discussed above, I believe my > > comments above apply here as well. > > Using the current task is just logically wrong in this case. The > current task here is just simply deleting an SA that happens to have > some secret value in it. When deleting an SA, a notification is sent > to a group of subscribers (some group of other tasks), which includes > a dump of the secret value. The current task isn't doing any attempt > to breach lockdown, it's just deleting an SA. > > It also makes it really awkward to make policy decisions around this. > Suppose that domains A, B, and C need to be able to add/delete SAs and > domains D, E, and F need to receive notifications about changes in > SAs. Then if, say, domain E actually needs to see the secret values in > the notifications, you must grant the confidentiality permission to > all of A, B, C to keep things working. And now you have opened up the > door for A, B, C to do other lockdown-confidentiality stuff, even > though these domains themselves actually don't request/need any > confidential data from the kernel. That's just not logical and you may > actually end up (slightly) worse security-wise than if you just > skipped checking for XFRM secrets altogether, because you need to > allow confidentiality to domains for which it may be excessive. It sounds an awful lot like the lockdown hook is in the wrong spot. It sounds like it would be a lot better to relocate the hook than remove it. -- paul moore www.paul-moore.com
On 6/2/21 5:13 PM, Paul Moore wrote: [...] > Help me out here, is your answer that the access check can only be > done at BPF program load time? That isn't really a solution from a > SELinux perspective as far as I'm concerned. That is the current answer. The unfortunate irony is that 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") broke this in the first place. W/o the SELinux hook implementation it would have been working just fine at runtime, but given it's UAPI since quite a while now, that ship has sailed. > I understand the ideas I've tossed out aren't practical from a BPF > perspective, but it would be nice if we could find something that does > work. Surely you BPF folks can think of some way to provide a > runtime, not load time, check? I did run this entire discussion by both of the other BPF co-maintainers (Alexei, Andrii, CC'ed) and together we did further brainstorming on the matter on how we could solve this, but couldn't find a sensible & clean solution so far. You could potentially track the programs in the sec blob and iff they have been JITed fix up the jump targets via text_poke to a dummy handler for those requiring it and such, but that's just entirely fragile, horrid and broken. Given users are actively hitting issues with already released kernels in the wild, we concluded to fix the majority of the damage caused by commit 59438b46471a [concerning BPF at least, the rest done by Ondrej as I understand] with the below fix that is shipping to Linus. This is a step in the right direction for moving things forward regardless. With the hook at load it's also not doing anything that is off with respect to the remainder of lockdown hooks, so solving a policy change can be looked at from a more broader/general scope given same applies to other users, too, iff it's indeed the case that it turns out to be feasible. Anyway, I've reflected an overall summary of the discussions also in the commit msg. Thanks, Daniel --- [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") added an implementation of the locked_down LSM hook to SELinux, with the aim to restrict which domains are allowed to perform operations that would breach lockdown. This is indirectly also getting audit subsystem involved to report events. The latter is problematic, as reported by Ondrej and Serhei, since it can bring down the whole system via audit: 1) The audit events that are triggered due to calls to security_locked_down() can OOM kill a machine, see below details [0]. 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() when trying to wake up kauditd, for example, when using trace_sched_switch() tracepoint, see details in [1]. Triggering this was not via some hypothetical corner case, but with existing tools like runqlat & runqslower from bcc, for example, which make use of this tracepoint. Rough call sequence goes like: rq_lock(rq) -> -------------------------+ trace_sched_switch() -> | bpf_prog_xyz() -> +-> deadlock selinux_lockdown() -> | audit_log_end() -> | wake_up_interruptible() -> | try_to_wake_up() -> | rq_lock(rq) --------------+ What's worse is that the intention of 59438b46471a to further restrict lockdown settings for specific applications in respect to the global lockdown policy is completely broken for BPF. The SELinux policy rule for the current lockdown check looks something like this: allow <who> <who> : lockdown { <reason> }; However, this doesn't match with the 'current' task where the security_locked_down() is executed, example: httpd does a syscall. There is a tracing program attached to the syscall which triggers a BPF program to run, which ends up doing a bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does the permission check against 'current', that is, httpd in this example. httpd has literally zero relation to this tracing program, and it would be nonsensical having to write an SELinux policy rule against httpd to let the tracing helper pass. The policy in this case needs to be against the entity that is installing the BPF program. For example, if bpftrace would generate a histogram of syscall counts by user space application: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' bpftrace would then go and generate a BPF program from this internally. One way of doing it [for the sake of the example] could be to call bpf_get_current_task() helper and then access current->comm via one of bpf_probe_read_kernel{,_str}() helpers. So the program itself has nothing to do with httpd or any other random app doing a syscall here. The BPF program _explicitly initiated_ the lockdown check. The allow/deny policy belongs in the context of bpftrace: meaning, you want to grant bpftrace access to use these helpers, but other tracers on the system like my_random_tracer _not_. Therefore fix all three issues at the same time by taking a completely different approach for the security_locked_down() hook, that is, move the check into the program verification phase where we actually retrieve the BPF func proto. This also reliably gets the task (current) that is trying to install the BPF tracing program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since we're moving this out of the BPF helper's fast-path which can be called several millions of times per second. The check is then also in line with other security_locked_down() hooks in the system where the enforcement is performed at open/load time, for example, open_kcore() for /proc/kcore access or module_sig_check() for module signatures just to pick few random ones. What's out of scope in the fix as well as in other security_locked_down() hook locations /outside/ of BPF subsystem is that if the lockdown policy changes on the fly there is no retrospective action. This requires a different discussion, potentially complex infrastructure, and it's also not clear whether this can be solved generically. Either way, it is out of scope for a suitable stable fix which this one is targeting. Note that the breakage is specifically on 59438b46471a where it started to rely on 'current' as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality mode"). [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says: I starting seeing this with F-34. When I run a container that is traced with BPF to record the syscalls it is doing, auditd is flooded with messages like: type=AVC msg=audit(1619784520.593:282387): avc: denied { confidentiality } for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM" scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0 tclass=lockdown permissive=0 This seems to be leading to auditd running out of space in the backlog buffer and eventually OOMs the machine. [...] auditd running at 99% CPU presumably processing all the messages, eventually I get: Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64 Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64 Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000 Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1 Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [...] [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/, Serhei Makarov says: Upstream kernel 5.11.0-rc7 and later was found to deadlock during a bpf_probe_read_compat() call within a sched_switch tracepoint. The problem is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on ppc64le. Example stack trace: [...] [ 730.868702] stack backtrace: [ 730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1 [ 730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 [ 730.873278] Call Trace: [ 730.873770] dump_stack+0x7f/0xa1 [ 730.874433] check_noncircular+0xdf/0x100 [ 730.875232] __lock_acquire+0x1202/0x1e10 [ 730.876031] ? __lock_acquire+0xfc0/0x1e10 [ 730.876844] lock_acquire+0xc2/0x3a0 [ 730.877551] ? __wake_up_common_lock+0x52/0x90 [ 730.878434] ? lock_acquire+0xc2/0x3a0 [ 730.879186] ? lock_is_held_type+0xa7/0x120 [ 730.880044] ? skb_queue_tail+0x1b/0x50 [ 730.880800] _raw_spin_lock_irqsave+0x4d/0x90 [ 730.881656] ? __wake_up_common_lock+0x52/0x90 [ 730.882532] __wake_up_common_lock+0x52/0x90 [ 730.883375] audit_log_end+0x5b/0x100 [ 730.884104] slow_avc_audit+0x69/0x90 [ 730.884836] avc_has_perm+0x8b/0xb0 [ 730.885532] selinux_lockdown+0xa5/0xd0 [ 730.886297] security_locked_down+0x20/0x40 [ 730.887133] bpf_probe_read_compat+0x66/0xd0 [ 730.887983] bpf_prog_250599c5469ac7b5+0x10f/0x820 [ 730.888917] trace_call_bpf+0xe9/0x240 [ 730.889672] perf_trace_run_bpf_submit+0x4d/0xc0 [ 730.890579] perf_trace_sched_switch+0x142/0x180 [ 730.891485] ? __schedule+0x6d8/0xb20 [ 730.892209] __schedule+0x6d8/0xb20 [ 730.892899] schedule+0x5b/0xc0 [ 730.893522] exit_to_user_mode_prepare+0x11d/0x240 [ 730.894457] syscall_exit_to_user_mode+0x27/0x70 [ 730.895361] entry_SYSCALL_64_after_hwframe+0x44/0xae [...] Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") Reported-by: Ondrej Mosnacek <omosnace@redhat.com> Reported-by: Jakub Hrozek <jhrozek@redhat.com> Reported-by: Serhei Makarov <smakarov@redhat.com> Reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Jiri Olsa <jolsa@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Frank Eigler <fche@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net --- kernel/bpf/helpers.c | 7 +++++-- kernel/trace/bpf_trace.c | 32 ++++++++++++-------------------- 2 files changed, 17 insertions(+), 22 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 73443498d88f..a2f1f15ce432 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -14,6 +14,7 @@ #include <linux/jiffies.h> #include <linux/pid_namespace.h> #include <linux/proc_ns.h> +#include <linux/security.h> #include "../../lib/kstrtox.h" @@ -1069,11 +1070,13 @@ bpf_base_func_proto(enum bpf_func_id func_id) case BPF_FUNC_probe_read_user: return &bpf_probe_read_user_proto; case BPF_FUNC_probe_read_kernel: - return &bpf_probe_read_kernel_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_proto; case BPF_FUNC_probe_read_user_str: return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: - return &bpf_probe_read_kernel_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_str_proto; case BPF_FUNC_snprintf_btf: return &bpf_snprintf_btf_proto; case BPF_FUNC_snprintf: diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d2d7cf6cfe83..7a52bc172841 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -215,16 +215,11 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = { static __always_inline int bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); + int ret; - if (unlikely(ret < 0)) - goto fail; ret = copy_from_kernel_nofault(dst, unsafe_ptr, size); if (unlikely(ret < 0)) - goto fail; - return ret; -fail: - memset(dst, 0, size); + memset(dst, 0, size); return ret; } @@ -246,10 +241,7 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = { static __always_inline int bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); - - if (unlikely(ret < 0)) - goto fail; + int ret; /* * The strncpy_from_kernel_nofault() call will likely not fill the @@ -262,11 +254,7 @@ bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) */ ret = strncpy_from_kernel_nofault(dst, unsafe_ptr, size); if (unlikely(ret < 0)) - goto fail; - - return ret; -fail: - memset(dst, 0, size); + memset(dst, 0, size); return ret; } @@ -1011,16 +999,20 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_probe_read_user: return &bpf_probe_read_user_proto; case BPF_FUNC_probe_read_kernel: - return &bpf_probe_read_kernel_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_proto; case BPF_FUNC_probe_read_user_str: return &bpf_probe_read_user_str_proto; case BPF_FUNC_probe_read_kernel_str: - return &bpf_probe_read_kernel_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_kernel_str_proto; #ifdef CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE case BPF_FUNC_probe_read: - return &bpf_probe_read_compat_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_compat_proto; case BPF_FUNC_probe_read_str: - return &bpf_probe_read_compat_str_proto; + return security_locked_down(LOCKDOWN_BPF_READ) < 0 ? + NULL : &bpf_probe_read_compat_str_proto; #endif #ifdef CONFIG_CGROUPS case BPF_FUNC_get_current_cgroup_id: -- 2.21.0
On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > On 6/2/21 5:13 PM, Paul Moore wrote: > [...] > > Help me out here, is your answer that the access check can only be > > done at BPF program load time? That isn't really a solution from a > > SELinux perspective as far as I'm concerned. > > That is the current answer. The unfortunate irony is that 59438b46471a > ("security,lockdown,selinux: implement SELinux lockdown") broke this in > the first place. W/o the SELinux hook implementation it would have been > working just fine at runtime, but given it's UAPI since quite a while > now, that ship has sailed. Explaining the other side of the "unfortunate irony ..." comment is going to take us in a direction that isn't very constructive so I'm going to skip past that now and simply say that if there was better cooperation across subsystems, especially with the LSM folks, a lot of this pain could be mitigated. ... and yes I said "mitigated", I'm not foolish to think the pain could be avoided entirely ;) > > I understand the ideas I've tossed out aren't practical from a BPF > > perspective, but it would be nice if we could find something that does > > work. Surely you BPF folks can think of some way to provide a > > runtime, not load time, check? > > I did run this entire discussion by both of the other BPF co-maintainers > (Alexei, Andrii, CC'ed) and together we did further brainstorming on the > matter on how we could solve this, but couldn't find a sensible & clean > solution so far. Before I jump into the patch below I just want to say that I appreciate you looking into solutions on the BPF side of things. However, I voted "no" on this patch previously and since you haven't really changed it, my "no"/NACK vote remains, at least until we exhaust a few more options. > [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks > > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > added an implementation of the locked_down LSM hook to SELinux, with the aim > to restrict which domains are allowed to perform operations that would breach > lockdown. This is indirectly also getting audit subsystem involved to report > events. The latter is problematic, as reported by Ondrej and Serhei, since it > can bring down the whole system via audit: > > 1) The audit events that are triggered due to calls to security_locked_down() > can OOM kill a machine, see below details [0]. > > 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() > when trying to wake up kauditd, for example, when using trace_sched_switch() > tracepoint, see details in [1]. Triggering this was not via some hypothetical > corner case, but with existing tools like runqlat & runqslower from bcc, for > example, which make use of this tracepoint. Rough call sequence goes like: > > rq_lock(rq) -> -------------------------+ > trace_sched_switch() -> | > bpf_prog_xyz() -> +-> deadlock > selinux_lockdown() -> | > audit_log_end() -> | > wake_up_interruptible() -> | > try_to_wake_up() -> | > rq_lock(rq) --------------+ Since BPF is a bit of chaotic nightmare in the sense that it basically out-of-tree kernel code that can be called from anywhere and do pretty much anything; it presents quite the challenge for those of us worried about LSM access controls. You and the other BPF folks have investigated ways in which BPF might be able to disable helper functions allowing us to do proper runtime access checks but haven't been able to make it work, which brings this patch up yet again. I'm not a fan of this patch as it basically allows BPF programs to side-step any changes to the security policy once the BPF programs have been loaded; this is Not Good. So let's look at this from a different angle. Let's look at the two problems you mention above. If we start with the runqueue deadlock we see the main problem is that audit_log_end() pokes the kauditd_wait waitqueue to ensure the kauditd_thread thread wakes up and processes the audit queue. The audit_log_start() function does something similar, but it is conditional on a number of factors and isn't as likely to be hit. If we relocate these kauditd wakeup calls we can remove the deadlock in trace_sched_switch(). In the case of CONFIG_AUDITSYSCALL=y we can probably just move the wakeup to __audit_syscall_exit() and in the case of CONFIG_AUDITSYSCALL=n we can likely just change the wait_event_freezable() call in kauditd_thread to a wait_event_freezable_timeout() call with a HZ timeout (the audit stream will be much less on these systems anyway so a queue overflow is much less likely). I'm building a kernel with these changes now, I should have something to test when I wake up tomorrow morning. It might even provide a bit of a performance boost as we would only be calling a wakeup function once for each syscall. The other issue is related to security_locked_down() and using the right subject for the access control check. As has been pointed out several times in this thread, the current code uses the current() task as the subject, which is arguably incorrect for many of the BPF helper functions. In the case of BPF, we have talked about using the credentials of the task which loaded the BPF program instead of current(), and that does make a certain amount of sense. Such an approach should make the security policy easier to develop and rationalize, leading to a significant decrease in audit records coming from LSM access denials. The question is how to implement such a change. The current SELinux security_bpf_prog_alloc() hook causes the newly loaded BPF program to inherit the subject context from the task which loads the BPF program; if it is possible to reference the bpf_prog struct, or really just the associated bpf_prog_aux->security blob, from inside a security_bpf_locked_down() function we use that subject information to perform the access check. BPF folks, is there a way to get that information from within a BPF kernel helper function? If it isn't currently possible, could it be made possible (or something similar)? If it turns out we can do both of these things (relocated audit wakeup, bpf_prog reference inside kernel helpers) I think we can arrive at a fix which both groups can accept. -- paul moore www.paul-moore.com
On 6/4/21 6:50 AM, Paul Moore wrote: > On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote: [...] >> I did run this entire discussion by both of the other BPF co-maintainers >> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the >> matter on how we could solve this, but couldn't find a sensible & clean >> solution so far. > > Before I jump into the patch below I just want to say that I > appreciate you looking into solutions on the BPF side of things. > However, I voted "no" on this patch previously and since you haven't > really changed it, my "no"/NACK vote remains, at least until we > exhaust a few more options. Just to set the record straight, you previously did neither ACK nor NACK it. And again, as summarized earlier, this patch is _fixing_ the majority of the damage caused by 59438b46471a for at least the BPF side of things where users run into this, Ondrej the rest. Everything else can be discussed on top, and so far it seems there is no clean solution in front of us either, not even speaking of one that is small and suitable for _stable_ trees. Let me reiterate where we are: it's not that the original implementation in 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality mode") was broken, it's that the later added _SELinux_ locked_down hook implementation that is broken, and not other LSMs. Now you're trying to retrofittingly ask us for hacks at all costs just because of /a/ broken LSM implementation. Maybe another avenue is to just swallow the pill and revert 59438b46471a since it made assumptions that don't work /generally/. And the use case for an application runtime policy change in particular in case of lockdown where users only have 3 choices is extremely tiny/rare, if it's not then something is very wrong in your deployment. Let me ask you this ... are you also planning to address *all* the other cases inside the kernel? If your answer is no, then this entire discussion is pointless. >> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks >> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") >> added an implementation of the locked_down LSM hook to SELinux, with the aim >> to restrict which domains are allowed to perform operations that would breach >> lockdown. This is indirectly also getting audit subsystem involved to report >> events. The latter is problematic, as reported by Ondrej and Serhei, since it >> can bring down the whole system via audit: >> >> 1) The audit events that are triggered due to calls to security_locked_down() >> can OOM kill a machine, see below details [0]. >> >> 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() >> when trying to wake up kauditd, for example, when using trace_sched_switch() >> tracepoint, see details in [1]. Triggering this was not via some hypothetical >> corner case, but with existing tools like runqlat & runqslower from bcc, for >> example, which make use of this tracepoint. Rough call sequence goes like: >> >> rq_lock(rq) -> -------------------------+ >> trace_sched_switch() -> | >> bpf_prog_xyz() -> +-> deadlock >> selinux_lockdown() -> | >> audit_log_end() -> | >> wake_up_interruptible() -> | >> try_to_wake_up() -> | >> rq_lock(rq) --------------+ > > Since BPF is a bit of chaotic nightmare in the sense that it basically > out-of-tree kernel code that can be called from anywhere and do pretty > much anything; it presents quite the challenge for those of us worried > about LSM access controls. There is no need to generalize ... for those worried, BPF subsystem has LSM access controls for the syscall since 2017 via afdb09c720b6 ("security: bpf: Add LSM hooks for bpf object related syscall"). [...] > So let's look at this from a different angle. Let's look at the two > problems you mention above. > > If we start with the runqueue deadlock we see the main problem is that > audit_log_end() pokes the kauditd_wait waitqueue to ensure the > kauditd_thread thread wakes up and processes the audit queue. The > audit_log_start() function does something similar, but it is > conditional on a number of factors and isn't as likely to be hit. If > we relocate these kauditd wakeup calls we can remove the deadlock in > trace_sched_switch(). In the case of CONFIG_AUDITSYSCALL=y we can > probably just move the wakeup to __audit_syscall_exit() and in the > case of CONFIG_AUDITSYSCALL=n we can likely just change the > wait_event_freezable() call in kauditd_thread to a > wait_event_freezable_timeout() call with a HZ timeout (the audit > stream will be much less on these systems anyway so a queue overflow > is much less likely). I'm building a kernel with these changes now, I > should have something to test when I wake up tomorrow morning. It > might even provide a bit of a performance boost as we would only be > calling a wakeup function once for each syscall. As other SELinux developers like Ondrej already pointed out to you in this thread: Actually, I wasn't aware of the deadlock... But calling an LSM hook [that is backed by a SELinux access check] from within a BPF helper is calling for all kinds of trouble, so I'm not surprised This is _generally_ a bad idea since it will potentially blow up in random ways. A _simple_ on/off switch like lockdown_is_locked_down() did was okayish (maybe modulo the pr_notice() which should rather have been a pr_notice_ratelimited() when this is potentially called Mio of times per sec worst case), but everything else is, again, just asking for trouble now and/or in future when folks extend the SELinux backend implementation or add a locked_down hook to other LSMs. That 59438b46471a was missing it was the first proof of exactly this, and other LSMs will run into the same. Similarly for relying on 'current' given it works on /some/ of the security_locked_down() call-sites /but not others/. No matter from which angle you look at it, calling a LSM hook from a helper is just plain broken. > The other issue is related to security_locked_down() and using the > right subject for the access control check. As has been pointed out > several times in this thread, the current code uses the current() task > as the subject, which is arguably incorrect for many of the BPF helper > functions. In the case of BPF, we have talked about using the > credentials of the task which loaded the BPF program instead of > current(), and that does make a certain amount of sense. Such an > approach should make the security policy easier to develop and > rationalize, leading to a significant decrease in audit records coming > from LSM access denials. The question is how to implement such a > change. The current SELinux security_bpf_prog_alloc() hook causes the > newly loaded BPF program to inherit the subject context from the task > which loads the BPF program; if it is possible to reference the > bpf_prog struct, or really just the associated bpf_prog_aux->security > blob, from inside a security_bpf_locked_down() function we use that > subject information to perform the access check. BPF folks, is there > a way to get that information from within a BPF kernel helper > function? If it isn't currently possible, could it be made possible > (or something similar)? While this could be a potential avenue, the problem here is that BPF helpers have neither access to the prog struct nor to bpf_prog_aux. As I mentioned earlier, potentially you could go and fix up JITed images for those progs where the credentials of the loading task require this when policy suddenly changes to a more stricter level. I'm not a fan of this though because of the fragility involved here. Again, the problem is not limited to BPF at all. kprobes is doing register- time hooks which are equivalent to the one of BPF. Anything in run-time trying to prevent probe_read_kernel by kprobes or BPF is broken by design. Thanks, Daniel
On Fri, Jun 4, 2021 at 2:02 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > > On 6/4/21 6:50 AM, Paul Moore wrote: > > On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@iogearbox.net> wrote: > [...] > >> I did run this entire discussion by both of the other BPF co-maintainers > >> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the > >> matter on how we could solve this, but couldn't find a sensible & clean > >> solution so far. > > > > Before I jump into the patch below I just want to say that I > > appreciate you looking into solutions on the BPF side of things. > > However, I voted "no" on this patch previously and since you haven't > > really changed it, my "no"/NACK vote remains, at least until we > > exhaust a few more options. > > Just to set the record straight, you previously did neither ACK nor NACK it. I think I said it wasn't a great solution. Perhaps I should have been more clear, but I don't like NACK'ing things while we are still hashing out possible solutions on the lists. From my perspective NACK'ing is pretty harsh and I try to leave that as a last resort. > And > again, as summarized earlier, this patch is _fixing_ the majority of the damage > caused by 59438b46471a for at least the BPF side of things where users run into this, > Ondrej the rest. Everything else can be discussed on top, and so far it seems there > is no clean solution in front of us either, not even speaking of one that is small > and suitable for _stable_ trees. Let me reiterate where we are: it's not that the > original implementation in 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is > in confidentiality mode") was broken, it's that the later added _SELinux_ locked_down > hook implementation that is broken, and not other LSMs. I think there are still options for how to solve this moving forward, more on that at the end of the email, and I would like to see if we can chase down some of those ideas first. Maybe the ideas aren't practical, or maybe they are practical but not from a -stable perspective; we/I don't know until we talk about it. Based on previous experience I'm afraid to ACK a quick-fix without some discussion about the proper long-term fix. If the long-term fix isn't suitable for -stable, then we can take a smaller fix just for the stable trees. > Now you're trying to retrofittingly > ask us for hacks at all costs just because of /a/ broken LSM implementation. This is starting to get a little off the rails now isn't it? Daniel you've always come across as a reasonable person to me, but phrasing things like that is inflammatory at best. Could the SELinux lockdown hooks have been done better, of course, I don't think we are debating that anymore. New functionality, checks, etc. are added to the kernel all the time, and because we're all human we screw that up on occasion. The important part is that we come together to fix it, which is what I think we're trying to do here (it's what I'm trying to do here). > Maybe > another avenue is to just swallow the pill and revert 59438b46471a since it made > assumptions that don't work /generally/. And the use case for an application runtime > policy change in particular in case of lockdown where users only have 3 choices is > extremely tiny/rare, if it's not then something is very wrong in your deployment. > Let me ask you this ... are you also planning to address *all* the other cases inside > the kernel? If your answer is no, then this entire discussion is pointless. Um, yes? We were talking about that earlier in the thread before the BPF portions of the fix started to dominate the discussion. Just because the BPF portion of the fix has dominated the discussion recently doesn't mean the other issues aren't going to be addressed. When stuff is busted, I work to fix it. I think everyone here does. > >> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks > >> > >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") > >> added an implementation of the locked_down LSM hook to SELinux, with the aim > >> to restrict which domains are allowed to perform operations that would breach > >> lockdown. This is indirectly also getting audit subsystem involved to report > >> events. The latter is problematic, as reported by Ondrej and Serhei, since it > >> can bring down the whole system via audit: > >> > >> 1) The audit events that are triggered due to calls to security_locked_down() > >> can OOM kill a machine, see below details [0]. > >> > >> 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit() > >> when trying to wake up kauditd, for example, when using trace_sched_switch() > >> tracepoint, see details in [1]. Triggering this was not via some hypothetical > >> corner case, but with existing tools like runqlat & runqslower from bcc, for > >> example, which make use of this tracepoint. Rough call sequence goes like: > >> > >> rq_lock(rq) -> -------------------------+ > >> trace_sched_switch() -> | > >> bpf_prog_xyz() -> +-> deadlock > >> selinux_lockdown() -> | > >> audit_log_end() -> | > >> wake_up_interruptible() -> | > >> try_to_wake_up() -> | > >> rq_lock(rq) --------------+ > > > > Since BPF is a bit of chaotic nightmare in the sense that it basically > > out-of-tree kernel code that can be called from anywhere and do pretty > > much anything; it presents quite the challenge for those of us worried > > about LSM access controls. > > There is no need to generalize ... for those worried, BPF subsystem has LSM access > controls for the syscall since 2017 via afdb09c720b6 ("security: bpf: Add LSM hooks > for bpf object related syscall"). We weren't really talking about those hooks in this thread, we're talking about the security_locked_down() hooks in the BPF helper functions. The BPF/LSM hooks you mention above have nothing to do with that at present, but yes we do have lots of cool LSM hooks that allow admins to do lots of cool things with access controls. Anyway, that was a distraction I started in my last email when I shouldn't have. If I'm going to call you out for inflammatory language I should do the same to myself - you have my apology. > > So let's look at this from a different angle. Let's look at the two > > problems you mention above. > > > > If we start with the runqueue deadlock we see the main problem is that > > audit_log_end() pokes the kauditd_wait waitqueue to ensure the > > kauditd_thread thread wakes up and processes the audit queue. The > > audit_log_start() function does something similar, but it is > > conditional on a number of factors and isn't as likely to be hit. If > > we relocate these kauditd wakeup calls we can remove the deadlock in > > trace_sched_switch(). In the case of CONFIG_AUDITSYSCALL=y we can > > probably just move the wakeup to __audit_syscall_exit() and in the > > case of CONFIG_AUDITSYSCALL=n we can likely just change the > > wait_event_freezable() call in kauditd_thread to a > > wait_event_freezable_timeout() call with a HZ timeout (the audit > > stream will be much less on these systems anyway so a queue overflow > > is much less likely). I'm building a kernel with these changes now, I > > should have something to test when I wake up tomorrow morning. It > > might even provide a bit of a performance boost as we would only be > > calling a wakeup function once for each syscall. > > As other SELinux developers like Ondrej already pointed out to you in > this thread: > > Actually, I wasn't aware of the deadlock... But calling an LSM hook > [that is backed by a SELinux access check] from within a BPF helper > is calling for all kinds of trouble, so I'm not surprised Skipping past the phrasing of your comment, like any two people Ondrej and I disagree on some things, and this is one of those things. Surely the BPF folks never disagree on anything :) I agree that the current security_locked_down() SELinux hook implementation has problems, mostly due to the audit code path, but I think it is something we can fix. Simply throwing our hands up in the air in defeat and ripping out the LSM checks in the BPF helpers is something I see as a last resort and I think we still have one more potential solution to discuss. The SELinux hook implementation was only a problem because of the wakeup call in audit_log_end(). The fact that the hook progressed all the way to audit_log_end() for the AVC record shows that the bulk of the hook is fine, we just need to remove the wakeup from audit_log_end(). To that end, I've been playing with a patch that does just that, it removes all of the wakeup calls from the audit_log_start() and audit_log_end() functions, I made some small tweaks to the patch before I started responding to your email (the kernel is compiling as I type this) but I expect to post it to the audit list as a RFC tonight or this weekend for review. > > The other issue is related to security_locked_down() and using the > > right subject for the access control check. As has been pointed out > > several times in this thread, the current code uses the current() task > > as the subject, which is arguably incorrect for many of the BPF helper > > functions. In the case of BPF, we have talked about using the > > credentials of the task which loaded the BPF program instead of > > current(), and that does make a certain amount of sense. Such an > > approach should make the security policy easier to develop and > > rationalize, leading to a significant decrease in audit records coming > > from LSM access denials. The question is how to implement such a > > change. The current SELinux security_bpf_prog_alloc() hook causes the > > newly loaded BPF program to inherit the subject context from the task > > which loads the BPF program; if it is possible to reference the > > bpf_prog struct, or really just the associated bpf_prog_aux->security > > blob, from inside a security_bpf_locked_down() function we use that > > subject information to perform the access check. BPF folks, is there > > a way to get that information from within a BPF kernel helper > > function? If it isn't currently possible, could it be made possible > > (or something similar)? > > While this could be a potential avenue, the problem here is that BPF > helpers have neither access to the prog struct nor to bpf_prog_aux. As > I mentioned earlier, potentially you could go and fix up JITed images > for those progs where the credentials of the loading task require this > when policy suddenly changes to a more stricter level. I'm not a fan > of this though because of the fragility involved here. FWIW, the JIT hacks sounded awful to me too, I'm not advocating for that as a solution. What I'm thinking of is a sort of "bpf_current" that functions similar to current() but could be used from within the BPF helpers, and/or the LSMs hooks they call, to reference the bpf_prog struct of the currently executing BPF program. Once again, I'm not a kernel BPF expert, but it seems like we could create a per-CPU pointer for the bpf_prog struct and assign it as part of the __BPF_PROG_RUN macro. The macro already has the bpf_prog pointer so I wouldn't expect that doing a per-CPU pointer assignment wouldn't be too costly considering the other things that take place. Or am I missing something important? > Again, the problem is not limited to BPF at all. kprobes is doing register- > time hooks which are equivalent to the one of BPF. Anything in run-time > trying to prevent probe_read_kernel by kprobes or BPF is broken by design. Not being an expert on kprobes I can't really comment on that, but right now I'm focused on trying to make things work for the BPF helpers. I suspect that if we can get the SELinux lockdown implementation working properly for BPF the solution for kprobes won't be far off. -- paul moore www.paul-moore.com
On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote: > > > Again, the problem is not limited to BPF at all. kprobes is doing register- > > time hooks which are equivalent to the one of BPF. Anything in run-time > > trying to prevent probe_read_kernel by kprobes or BPF is broken by design. > > Not being an expert on kprobes I can't really comment on that, but > right now I'm focused on trying to make things work for the BPF > helpers. I suspect that if we can get the SELinux lockdown > implementation working properly for BPF the solution for kprobes won't > be far off. Paul, Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault from all contexts. Including NMI. Most of audit_log_* is not acceptable. Just removing a wakeup is not solving anything. Audit hooks don't belong in NMI. Audit design needs memory allocation. Hence it's not suitable for NMI and hardirq. But kprobes and bpf progs do run just fine there. BPF, for example, only uses pre-allocated memory.
On 6/4/2021 5:08 PM, Alexei Starovoitov wrote: > On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote: >>> Again, the problem is not limited to BPF at all. kprobes is doing register- >>> the hooks which are equivalent to the one of BPF. Anything in run-time >>> trying to prevent probe_read_kernel by kprobes or BPF is broken by design. >> Not being an expert on kprobes I can't really comment on that, but >> right now I'm focused on trying to make things work for the BPF >> helpers. I suspect that if we can get the SELinux lockdown >> implementation working properly for BPF the solution for kprobes won't >> be far off. > Paul, > > Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault > from all contexts. > Including NMI. Most of audit_log_* is not acceptable. > Just removing a wakeup is not solving anything. > Audit hooks don't belong in NMI. > Audit design needs memory allocation. Hence it's not suitable > for NMI and hardirq. But kprobes and bpf progs do run just fine there. > BPF, for example, only uses pre-allocated memory. You have fallen into a common fallacy. The fact that the "code runs" does not assure that the "system works right". In the security world we face this all the time, often with performance expectations. In this case the BPF design has failed to accommodate the long standing needs of audit and SELinux. Shifting the responsibility for these design flaws to SELinux is inappropriate. Integration of sub-systems is usually the burden of the newcomer, which in this case is BPF. Paul is doing the bulk of your work for you. Maybe you could step up to your responsibility and work with him, not against him.
On Sat, Jun 5, 2021 at 11:11 AM Casey Schaufler <casey@schaufler-ca.com> wrote: > > You have fallen into a common fallacy. The fact that the "code runs" > does not assure that the "system works right". In the security world > we face this all the time, often with performance expectations. In this > case the BPF design has failed [..] I think it's the lockdown patches that have failed. They did the wrong thing, they didn't work, The report in question is for a regression. THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS. Honestly, security people need to understand that "not working" is not a success case of security. It's a failure case. Yes, "not working" may be secure. But security in that case is *pointless*. Linus
On Fri, Jun 4, 2021 at 8:08 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Fri, Jun 4, 2021 at 4:34 PM Paul Moore <paul@paul-moore.com> wrote: > > > > > Again, the problem is not limited to BPF at all. kprobes is doing register- > > > time hooks which are equivalent to the one of BPF. Anything in run-time > > > trying to prevent probe_read_kernel by kprobes or BPF is broken by design. > > > > Not being an expert on kprobes I can't really comment on that, but > > right now I'm focused on trying to make things work for the BPF > > helpers. I suspect that if we can get the SELinux lockdown > > implementation working properly for BPF the solution for kprobes won't > > be far off. > > Paul, Hi Alexei, > Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault > from all contexts. > Including NMI. Thanks, that is helpful. In hindsight it should have been obvious that kprobe/BPF would offer to insert code into the NMI handlers, but I don't recall it earlier in the discussion, it's possible I simply missed the mention. > Most of audit_log_* is not acceptable. > Just removing a wakeup is not solving anything. That's not really fair now is it? Removing the wakeups in audit_log_start() and audit_log_end() does solve some problems, although not all of them (i.e. the NMI problem being the 800lb gorilla). Because of the NMI case we're not going to solve the LSM/audit case anytime soon so it looks like we are going to have to fall back to the patch Daniel proposed. Acked-by: Paul Moore <paul@paul-moore.com> -- paul moore www.paul-moore.com
On Sat, Jun 5, 2021 at 2:17 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sat, Jun 5, 2021 at 11:11 AM Casey Schaufler <casey@schaufler-ca.com> wrote: > > > > You have fallen into a common fallacy. The fact that the "code runs" > > does not assure that the "system works right". In the security world > > we face this all the time, often with performance expectations. In this > > case the BPF design has failed [..] > > I think it's the lockdown patches that have failed. They did the wrong > thing, they didn't work, > > The report in question is for a regression. > > THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS. To think I was worried we might end this thread without a bit of CAPS LOCK, whew! :) I don't think anyone in this discussion, even Casey's last comment, was denying that there was a problem. The discussion and the disagreements were about what a "proper" fix would be, and how one might implement that fix; of course there were different ideas of "proper" and implementations vary even when people agree, so things were a bit of a mess. If you want to get upset and shouty, I think there are a few things spread across the subsystems involved that would be worthy targets, but to say that Casey, myself, or anyone else who plays under security/ denied the problem in this thread is not fair, or correct, in my opinion. > Honestly, security people need to understand that "not working" is not > a success case of security. It's a failure case. I can't pretend to know what all of the "security people" are thinking, but I can say with a good degree of certainty that my goal is not to crash, panic, kill, or otherwise disable a user's system. When it comes to things like the LSM hooks, my goal is to try and make sure we have the right hooks in the right places so that admins and users have the tools they need to control access to their data and systems in the way that they choose. Sometimes this puts us at odds with other subsystems in the kernel, we saw that in this thread, but that's to be expected anytime you have competing priorities. The important part is that eventually we figure out some way to move forward, and the fact that we are still all making progress and putting out new kernel releases is proof that we are finding a way. That's what matters to me, and if I was forced to guess, I would imagine that matters quite a lot to most of us here. -- paul moore www.paul-moore.com
On Thu, Jun 3, 2021 at 7:46 PM Paul Moore <paul@paul-moore.com> wrote: > On Wed, Jun 2, 2021 at 9:40 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > > On Fri, May 28, 2021 at 3:37 AM Paul Moore <paul@paul-moore.com> wrote: [...] > > > I know you and Casey went back and forth on this in v1, but I agree > > > with Casey that having two LSM hooks here is a mistake. I know it > > > makes backports hard, but spoiler alert: maintaining complex software > > > over any non-trivial period of time is hard, reeeeally hard sometimes > > > ;) > > > > Do you mean having two slots in lsm_hook_defs.h or also having two > > security_*() functions? (It's not clear to me if you're just > > reiterating disagreement with v1 or if you dislike the simplified v2 > > as well.) > > To be clear I don't think there should be two functions for this, just > make whatever changes are necessary to the existing > security_locked_down() LSM hook. Yes, the backport is hard. Yes, it > will touch a lot of code. Yes, those are lame excuses to not do the > right thing. OK, I guess I'll just go with the bigger patch. > > > > The callers migrated to the new hook, passing NULL as cred: > > > > 1. arch/powerpc/xmon/xmon.c > > > > Here the hook seems to be called from non-task context and is only > > > > used for redacting some sensitive values from output sent to > > > > userspace. > > > > > > This definitely sounds like kernel_t based on the description above. > > > > Here I'm a little concerned that the hook might be called from some > > unusual interrupt, which is not masked by spin_lock_irqsave()... We > > ran into this with PMI (Platform Management Interrupt) before, see > > commit 5ae5fbd21079 ("powerpc/perf: Fix handling of privilege level > > checks in perf interrupt context"). While I can't see anything that > > would suggest something like this happening here, the whole thing is > > so foreign to me that I'm wary of making assumptions :) > > > > @Michael/PPC devs, can you confirm to us that xmon_is_locked_down() is > > only called from normal syscall/interrupt context (as opposed to > > something tricky like PMI)? > > You did submit the code change so I assumed you weren't concerned > about it :) If it is a bad hook placement that is something else > entirely. What I submitted effectively avoided the SELinux hook to be called, so I didn't bother checking deeper in what context those hook calls would occur. > Hopefully we'll get some guidance from the PPC folks. > > > > > 4. net/xfrm/xfrm_user.c:copy_to_user_*() > > > > Here a cryptographic secret is redacted based on the value returned > > > > from the hook. There are two possible actions that may lead here: > > > > a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the > > > > task context is relevant, since the dumped data is sent back to > > > > the current task. > > > > > > If the task context is relevant we should use it. > > > > Yes, but as I said it would create an asymmetry with case b), which > > I'll expand on below... > > > > > > b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are > > > > broadcasted to tasks subscribed to XFRM events - here the > > > > SELinux check is not meningful as the current task's creds do > > > > not represent the tasks that could potentially see the secret. > > > > > > This looks very similar to the BPF hook discussed above, I believe my > > > comments above apply here as well. > > > > Using the current task is just logically wrong in this case. The > > current task here is just simply deleting an SA that happens to have > > some secret value in it. When deleting an SA, a notification is sent > > to a group of subscribers (some group of other tasks), which includes > > a dump of the secret value. The current task isn't doing any attempt > > to breach lockdown, it's just deleting an SA. > > > > It also makes it really awkward to make policy decisions around this. > > Suppose that domains A, B, and C need to be able to add/delete SAs and > > domains D, E, and F need to receive notifications about changes in > > SAs. Then if, say, domain E actually needs to see the secret values in > > the notifications, you must grant the confidentiality permission to > > all of A, B, C to keep things working. And now you have opened up the > > door for A, B, C to do other lockdown-confidentiality stuff, even > > though these domains themselves actually don't request/need any > > confidential data from the kernel. That's just not logical and you may > > actually end up (slightly) worse security-wise than if you just > > skipped checking for XFRM secrets altogether, because you need to > > allow confidentiality to domains for which it may be excessive. > > It sounds an awful lot like the lockdown hook is in the wrong spot. > It sounds like it would be a lot better to relocate the hook than > remove it. I don't see how you would solve this by moving the hook. Where do you want to relocate it? The main obstacle is that the message containing the SA dump is sent to consumers via a simple netlink broadcast, which doesn't provide a facility to redact the SA secret on a per-consumer basis. I can't see any way to make the checks meaningful for SELinux without a major overhaul of the broadcast logic. IMO, switching the subject to kernel_t either in both cases or at least in case b) is the best compromise between usability, security, and developers' time / code complexity. I don't think controlling which of the CAP_NET_ADMIN domains can/can't see/leak SA secrets is worth obsessing about. -- Ondrej Mosnacek Software Engineer, Linux Security - SELinux kernel Red Hat, Inc.
On Tue, Jun 8, 2021 at 7:02 AM Ondrej Mosnacek <omosnace@redhat.com> wrote: > On Thu, Jun 3, 2021 at 7:46 PM Paul Moore <paul@paul-moore.com> wrote: ... > > It sounds an awful lot like the lockdown hook is in the wrong spot. > > It sounds like it would be a lot better to relocate the hook than > > remove it. > > I don't see how you would solve this by moving the hook. Where do you > want to relocate it? Wherever it makes sense. Based on your comments it really sounded like the hook was in a bad spot and since your approach in a lot of this had been to remove or disable hooks I wanted to make sure that relocating the hook was something you had considered. Thankfully it sounds like you have considered moving the hook - that's good. > The main obstacle is that the message containing > the SA dump is sent to consumers via a simple netlink broadcast, which > doesn't provide a facility to redact the SA secret on a per-consumer > basis. I can't see any way to make the checks meaningful for SELinux > without a major overhaul of the broadcast logic. Fair enough. -- paul moore www.paul-moore.com
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index c8173e92f19d..90992793b4c5 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -299,7 +299,7 @@ static bool xmon_is_locked_down(void) static bool lockdown; if (!lockdown) { - lockdown = !!security_locked_down(LOCKDOWN_XMON_RW); + lockdown = !!security_cred_locked_down(NULL, LOCKDOWN_XMON_RW); if (lockdown) { printf("xmon: Disabled due to kernel lockdown\n"); xmon_is_ro = true; @@ -307,7 +307,7 @@ static bool xmon_is_locked_down(void) } if (!xmon_is_ro) { - xmon_is_ro = !!security_locked_down(LOCKDOWN_XMON_WR); + xmon_is_ro = !!security_cred_locked_down(NULL, LOCKDOWN_XMON_WR); if (xmon_is_ro) printf("xmon: Read-only due to kernel lockdown\n"); } diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 1261e8b41edb..7edde3fc22f5 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -396,7 +396,7 @@ struct dentry *tracefs_create_file(const char *name, umode_t mode, struct dentry *dentry; struct inode *inode; - if (security_locked_down(LOCKDOWN_TRACEFS)) + if (security_cred_locked_down(NULL, LOCKDOWN_TRACEFS)) return NULL; if (!(mode & S_IFMT)) diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h index 2adeea44c0d5..0115d7e3db55 100644 --- a/include/linux/lsm_hook_defs.h +++ b/include/linux/lsm_hook_defs.h @@ -393,7 +393,8 @@ LSM_HOOK(int, 0, bpf_prog_alloc_security, struct bpf_prog_aux *aux) LSM_HOOK(void, LSM_RET_VOID, bpf_prog_free_security, struct bpf_prog_aux *aux) #endif /* CONFIG_BPF_SYSCALL */ -LSM_HOOK(int, 0, locked_down, enum lockdown_reason what) +LSM_HOOK(int, 0, cred_locked_down, const struct cred *cred, + enum lockdown_reason what) #ifdef CONFIG_PERF_EVENTS LSM_HOOK(int, 0, perf_event_open, struct perf_event_attr *attr, int type) diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index 5c4c5c0602cb..2d2d82ffea34 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -1539,10 +1539,11 @@ * @bpf_prog_free_security: * Clean up the security information stored inside bpf prog. * - * @locked_down: + * @cred_locked_down: * Determine whether a kernel feature that potentially enables arbitrary * code execution in kernel space should be permitted. * + * @cred: credential asociated with the operation, or NULL if not applicable * @what: kernel feature being accessed * * Security hooks for perf events diff --git a/include/linux/security.h b/include/linux/security.h index 24eda04221e9..6a609787a03a 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -26,6 +26,7 @@ #include <linux/kernel_read_file.h> #include <linux/key.h> #include <linux/capability.h> +#include <linux/cred.h> #include <linux/fs.h> #include <linux/slab.h> #include <linux/err.h> @@ -33,7 +34,6 @@ #include <linux/mm.h> struct linux_binprm; -struct cred; struct rlimit; struct kernel_siginfo; struct sembuf; @@ -470,7 +470,7 @@ void security_inode_invalidate_secctx(struct inode *inode); int security_inode_notifysecctx(struct inode *inode, void *ctx, u32 ctxlen); int security_inode_setsecctx(struct dentry *dentry, void *ctx, u32 ctxlen); int security_inode_getsecctx(struct inode *inode, void **ctx, u32 *ctxlen); -int security_locked_down(enum lockdown_reason what); +int security_cred_locked_down(const struct cred *cred, enum lockdown_reason what); #else /* CONFIG_SECURITY */ static inline int call_blocking_lsm_notifier(enum lsm_event event, void *data) @@ -1343,12 +1343,17 @@ static inline int security_inode_getsecctx(struct inode *inode, void **ctx, u32 { return -EOPNOTSUPP; } -static inline int security_locked_down(enum lockdown_reason what) +static inline int security_cred_locked_down(struct cred *cred, enum lockdown_reason what) { return 0; } #endif /* CONFIG_SECURITY */ +static inline int security_locked_down(enum lockdown_reason what) +{ + return security_cred_locked_down(current_cred(), what); +} + #if defined(CONFIG_SECURITY) && defined(CONFIG_WATCH_QUEUE) int security_post_notification(const struct cred *w_cred, const struct cred *cred, diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d2d7cf6cfe83..d8a242837c1e 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -215,7 +215,7 @@ const struct bpf_func_proto bpf_probe_read_user_str_proto = { static __always_inline int bpf_probe_read_kernel_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); + int ret = security_cred_locked_down(NULL, LOCKDOWN_BPF_READ); if (unlikely(ret < 0)) goto fail; @@ -246,7 +246,7 @@ const struct bpf_func_proto bpf_probe_read_kernel_proto = { static __always_inline int bpf_probe_read_kernel_str_common(void *dst, u32 size, const void *unsafe_ptr) { - int ret = security_locked_down(LOCKDOWN_BPF_READ); + int ret = security_cred_locked_down(NULL, LOCKDOWN_BPF_READ); if (unlikely(ret < 0)) goto fail; diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index f0aecee4d539..5f45848c4ff3 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -851,7 +851,7 @@ static int copy_user_offload(struct xfrm_state_offload *xso, struct sk_buff *skb static bool xfrm_redact(void) { return IS_ENABLED(CONFIG_SECURITY) && - security_locked_down(LOCKDOWN_XFRM_SECRET); + security_cred_locked_down(NULL, LOCKDOWN_XFRM_SECRET); } static int copy_to_user_auth(struct xfrm_algo_auth *auth, struct sk_buff *skb) diff --git a/security/lockdown/lockdown.c b/security/lockdown/lockdown.c index 87cbdc64d272..2a13c866c22a 100644 --- a/security/lockdown/lockdown.c +++ b/security/lockdown/lockdown.c @@ -55,7 +55,8 @@ early_param("lockdown", lockdown_param); * lockdown_is_locked_down - Find out if the kernel is locked down * @what: Tag to use in notice generated if lockdown is in effect */ -static int lockdown_is_locked_down(enum lockdown_reason what) +static int lockdown_is_locked_down(const struct cred *cred, + enum lockdown_reason what) { if (WARN(what >= LOCKDOWN_CONFIDENTIALITY_MAX, "Invalid lockdown reason")) @@ -72,7 +73,7 @@ static int lockdown_is_locked_down(enum lockdown_reason what) } static struct security_hook_list lockdown_hooks[] __lsm_ro_after_init = { - LSM_HOOK_INIT(locked_down, lockdown_is_locked_down), + LSM_HOOK_INIT(cred_locked_down, lockdown_is_locked_down), }; static int __init lockdown_lsm_init(void) diff --git a/security/security.c b/security/security.c index 0c1c9796e3e4..c987b70bc16c 100644 --- a/security/security.c +++ b/security/security.c @@ -2592,11 +2592,11 @@ void security_bpf_prog_free(struct bpf_prog_aux *aux) } #endif /* CONFIG_BPF_SYSCALL */ -int security_locked_down(enum lockdown_reason what) +int security_cred_locked_down(const struct cred *cred, enum lockdown_reason what) { - return call_int_hook(locked_down, 0, what); + return call_int_hook(cred_locked_down, 0, cred, what); } -EXPORT_SYMBOL(security_locked_down); +EXPORT_SYMBOL(security_cred_locked_down); #ifdef CONFIG_PERF_EVENTS int security_perf_event_open(struct perf_event_attr *attr, int type) diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index eaea837d89d1..c415e3ca4f24 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -7017,10 +7017,10 @@ static void selinux_bpf_prog_free(struct bpf_prog_aux *aux) } #endif -static int selinux_lockdown(enum lockdown_reason what) +static int selinux_lockdown(const struct cred *cred, enum lockdown_reason what) { struct common_audit_data ad; - u32 sid = current_sid(); + u32 sid; int invalid_reason = (what <= LOCKDOWN_NONE) || (what == LOCKDOWN_INTEGRITY_MAX) || (what >= LOCKDOWN_CONFIDENTIALITY_MAX); @@ -7032,6 +7032,12 @@ static int selinux_lockdown(enum lockdown_reason what) return -EINVAL; } + /* Ignore if there is no relevant cred to check against */ + if (!cred) + return 0; + + sid = cred_sid(cred); + ad.type = LSM_AUDIT_DATA_LOCKDOWN; ad.u.reason = what; @@ -7353,7 +7359,7 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(perf_event_write, selinux_perf_event_write), #endif - LSM_HOOK_INIT(locked_down, selinux_lockdown), + LSM_HOOK_INIT(cred_locked_down, selinux_lockdown), /* * PUT "CLONING" (ACCESSING + ALLOCATING) HOOKS HERE
Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") added an implementation of the locked_down LSM hook to SELinux, with the aim to restrict which domains are allowed to perform operations that would breach lockdown. However, in several places the security_locked_down() hook is called in situations where the current task isn't doing any action that would directly breach lockdown, leading to SELinux checks that are basically bogus. Since in most of these situations converting the callers such that security_locked_down() is called in a context where the current task would be meaningful for SELinux is impossible or very non-trivial (and could lead to TOCTOU issues for the classic Lockdown LSM implementation), fix this by modifying the hook to accept a struct cred pointer as argument, where NULL will be interpreted as a request for a "global", task-independent lockdown decision only. Then modify SELinux to ignore calls with cred == NULL. Since most callers will just want to pass current_cred() as the cred parameter, rename the hook to security_cred_locked_down() and provide the original security_locked_down() function as a simple wrapper around the new hook. The callers migrated to the new hook, passing NULL as cred: 1. arch/powerpc/xmon/xmon.c Here the hook seems to be called from non-task context and is only used for redacting some sensitive values from output sent to userspace. 2. fs/tracefs/inode.c:tracefs_create_file() Here the call is used to prevent creating new tracefs entries when the kernel is locked down. Assumes that locking down is one-way - i.e. if the hook returns non-zero once, it will never return zero again, thus no point in creating these files. 3. kernel/trace/bpf_trace.c:bpf_probe_read_kernel{,_str}_common() Called when a BPF program calls a helper that could leak kernel memory. The task context is not relevant here, since the program may very well be run in the context of a different task than the consumer of the data. See: https://bugzilla.redhat.com/show_bug.cgi?id=1955585 4. net/xfrm/xfrm_user.c:copy_to_user_*() Here a cryptographic secret is redacted based on the value returned from the hook. There are two possible actions that may lead here: a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the task context is relevant, since the dumped data is sent back to the current task. b) When deleting an SA via XFRM_MSG_DELSA, the dumped SAs are broadcasted to tasks subscribed to XFRM events - here the SELinux check is not meningful as the current task's creds do not represent the tasks that could potentially see the secret. It really doesn't seem worth it to try to preserve the check in the a) case, since the eventual leak can be circumvented anyway via b), plus there is no way for the task to indicate that it doesn't care about the actual key value, so the check could generate a lot of noise. Improvements-suggested-by: Casey Schaufler <casey@schaufler-ca.com> Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown") Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com> --- v2: - change to a single hook based on suggestions by Casey Schaufler v1: https://lore.kernel.org/lkml/20210507114048.138933-1-omosnace@redhat.com/ arch/powerpc/xmon/xmon.c | 4 ++-- fs/tracefs/inode.c | 2 +- include/linux/lsm_hook_defs.h | 3 ++- include/linux/lsm_hooks.h | 3 ++- include/linux/security.h | 11 ++++++++--- kernel/trace/bpf_trace.c | 4 ++-- net/xfrm/xfrm_user.c | 2 +- security/lockdown/lockdown.c | 5 +++-- security/security.c | 6 +++--- security/selinux/hooks.c | 12 +++++++++--- 10 files changed, 33 insertions(+), 19 deletions(-)