Message ID | 20230814-memfd-vm-noexec-uapi-fixes-v2-0-7ff9e3e10ba6@cyphar.com |
---|---|
Series | memfd: cleanups for vm.memfd_noexec |
On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet <asmadeus@codewreck.org> wrote:
>
> Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > > binaries without memfd_create(2) and without touching the host
> > > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > > would be able to do that would be equivalent or worse), it seems strange
> > > to cause a fair amount of headache to admins when there doesn't appear
> > > to be an actual security benefit to blocking this. There appear to be
> > > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > > can write to arbitrary sysctls is a bigger security issue than
> > > executable memfds.
> > >
> > Something to point out: The demo code might be enough to prove your
> > case in other distributions, however, in ChromeOS, you can't run this
> > code. The executable in ChromeOS are all from known sources and
> > verified at boot.
> > If an attacker could run this code in ChromeOS, that means the
> > attacker already acquired arbitrary code execution through other ways,
> > at that point, the attacker no longer needs to create/find an
> > executable memfd, they already have the vehicle. You can't use an
> > example of an attacker already running arbitrary code to prove that
> > disable downgrading is useless.
> > I agree it is a big problem that an attacker already can modify a
> > sysctl. Assuming this can happen by controlling arguments passed into
> > sysctl, at the time, the attacker might not have full arbitrary code
> > execution yet, that is the reason the original design is so
> > restrictive.
>
> I don't understand how you can say an attacker cannot run arbitrary code
> within a process here, yet assert that they'd somehow run memfd_create +
> execveat on it if this sysctl is lowered -- the two look equivalent to
> me?
>
It might require multiple steps for this attack. One possible scenario:

1> Control a write primitive in a CAP_SYS_ADMIN process's memory, change
   the arguments of a sysctl call, and downgrade the setting for memfd,
   e.g. change it to 0 to revert to the old behavior (creating executable
   memfds by default).
2> Control a non-privileged process that creates and writes to a memfd,
   and fill it with the binary that the attacker wants. This process
   only needs a non-executable memfd, but hasn't been updated yet.
3> Confuse a non-privileged process into executing the memfd the attacker
   wrote in step 2.

In ChromeOS, because all the executables are from verified sources,
attackers typically can't easily use step 3 alone (without step 2), and
memfd was such a hole that enables an unverified executable.

In the original design, downgrading is not allowed, so the attack chain of
steps 2 and 3 is completely blocked. With this new approach, attackers
will try to find an additional step (step 1) to make the old attack
(steps 2 and 3) work again. It is difficult but I can't say it is
impossible.

> CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
> soon as you can run arbitrary code (just have a look at the various
> container escape example when the capability is given); I see little
> point in trying to harden just this here.

I'm not an expert in containers. If the industry is giving up on
privileged containers, then the reasoning makes sense.

On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > The most critical issue with vm.memfd_noexec=2 (the fact that passing
> > MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
> > tree[2], but there are still some outstanding issues that need to be
> > addressed:
> >
> > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> >   because it will make it far to difficult to ever migrate. Instead it
> >   should imply MFD_EXEC.
> >
> > * The dmesg warnings are pr_warn_once(), which on most systems means
> >   that they will be used up by systemd or some other boot process and
> >   userspace developers will never see it.
> >
> >   - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
> >     rate-limited message to the kernel log is necessary to tell
> >     userspace that they should add the new flags.
> >
> >     Arguably the most ideal way to deal with the spam concern[3,4]
> >     while still prompting userspace to switch to the new flags would be
> >     to only log the warning once per task or something similar.
> >     However, adding something to task_struct for tracking this would be
> >     needless bloat for a single pr_warn_ratelimited().
> >
> >     So just switch to pr_info_ratelimited() to avoid spamming the log
> >     with something that isn't a real warning. There's lots of
> >     info-level stuff in dmesg, it seems really unlikely that this
> >     should be an actual problem. Most programs are already switching to
> >     the new flags anyway.
> >
> >   - For the vm.memfd_noexec=2 case, we need to log a warning for every
> >     failure because otherwise userspace will have no idea why their
> >     previously working program started returning -EACCES (previously
> >     -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.
> >
> > * The racheting mechanism for vm.memfd_noexec makes it incredibly
> >   unappealing for most users to enable the sysctl because enabling it
> >   on &init_pid_ns means you need a system reboot to unset it. Given the
> >   actual security threat being protected against, CAP_SYS_ADMIN users
> >   being restricted in this way makes little sense.
> >
> >   The argument for this ratcheting by the original author was that it
> >   allows you to have a hierarchical setting that cannot be unset by
> >   child pidnses, but this is not accurate -- changing the parent
> >   pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> >   children.
> >
> That is not exactly what I said though.

Sorry, I probably should've phrased this as "one of the main arguments".
In the last discussion thread we had in the v1 of this patch, it was my
impression that this was the primary sticking point.

> From ChromeOS's position, allowing downgrade is less secure, and this
> setting was designed to be set at startup/reboot time from the very
> beginning, such that the kernel command line or as part of the
> container runtime environment (get passed to sandboxed container)

If this had been implemented as a cmdline flag, it would be completely
reasonable that you need to reboot to change it. However, it was
implemented as a sysctl and the behaviour of sysctls is that admins can
(generally) change them after they've been set -- even for
security-related sysctls such as the fs.protected_* sysctls. The only
counter-example I know of is the YAMA one, and if I'm being honest I think
that behaviour is also weird.

> I understand your viewpoint, from another distribution point of view,
> the original design might be too restricted, so if the kernel wants
> to weigh more on ease of admin, I'm OK with your approach.
> Though it is less secure for ChromeOS - i.e. we do try to prevent
> arbitrary code execution as much as possible, even for CAP_SYSADMIN.
> And with this change, it is less secure and one more possibility for
> us to consider.

FWIW I still think that a threat model in which blocking
memfd_create(MFD_EXEC) is the protection against a
&init_user_ns-privileged CAP_SYS_ADMIN process being tricked into writing
a sysctl doesn't really make sense for the vast majority of systems (if
any).

If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
enforced, this can be done with a very simple seccomp filter. If applied
to pid1, this would also not be possible to unset without a reboot.

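For illustration, the seccomp approach suggested above could look roughly like the classic-BPF program below. This is only a sketch under assumptions that are not taken from the thread: it reads "the old vm.memfd_noexec=2 behaviour" as "fail memfd_create(2) with EACCES unless MFD_NOEXEC_SEAL is set", it inspects only the low 32 bits of the flags argument (so it assumes a little-endian machine), and it leaves out the architecture check a production filter would need. The names memfd_noexec_filter and memfd_noexec_filter_len are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Older userspace headers may not define the flag (added in Linux 6.3). */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

/* Hypothetical filter: allow memfd_create(2) only with MFD_NOEXEC_SEAL. */
static const struct sock_filter memfd_noexec_filter[] = {
	/* [0] A = syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [1] not memfd_create(2): jump to [5] (allow) */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_memfd_create, 0, 3),
	/* [2] A = low 32 bits of flags (second argument, little-endian) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
	/* [3] MFD_NOEXEC_SEAL set: jump to [5] (allow), else fall through */
	BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MFD_NOEXEC_SEAL, 1, 0),
	/* [4] fail the call with EACCES */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
	/* [5] let everything else through */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Number of instructions, as needed for struct sock_fprog.len. */
static unsigned short memfd_noexec_filter_len(void)
{
	return sizeof(memfd_noexec_filter) / sizeof(memfd_noexec_filter[0]);
}
```

Such a program would be loaded through a struct sock_fprog with seccomp(SECCOMP_SET_MODE_FILTER, ...). A caller without CAP_SYS_ADMIN has to set no_new_privs first, and an installed filter is inherited by children and cannot be removed, which is what would make a pid1-wide filter sticky until reboot.
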
On Fri, Aug 18, 2023 at 7:50 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> > On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > The most critical issue with vm.memfd_noexec=2 (the fact that passing
> > > MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
> > > tree[2], but there are still some outstanding issues that need to be
> > > addressed:
> > >
> > > * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
> > >   because it will make it far to difficult to ever migrate. Instead it
> > >   should imply MFD_EXEC.
> > >
> > > * The dmesg warnings are pr_warn_once(), which on most systems means
> > >   that they will be used up by systemd or some other boot process and
> > >   userspace developers will never see it.
> > >
> > >   - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
> > >     rate-limited message to the kernel log is necessary to tell
> > >     userspace that they should add the new flags.
> > >
> > >     Arguably the most ideal way to deal with the spam concern[3,4]
> > >     while still prompting userspace to switch to the new flags would be
> > >     to only log the warning once per task or something similar.
> > >     However, adding something to task_struct for tracking this would be
> > >     needless bloat for a single pr_warn_ratelimited().
> > >
> > >     So just switch to pr_info_ratelimited() to avoid spamming the log
> > >     with something that isn't a real warning. There's lots of
> > >     info-level stuff in dmesg, it seems really unlikely that this
> > >     should be an actual problem. Most programs are already switching to
> > >     the new flags anyway.
> > >
> > >   - For the vm.memfd_noexec=2 case, we need to log a warning for every
> > >     failure because otherwise userspace will have no idea why their
> > >     previously working program started returning -EACCES (previously
> > >     -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.
> > >
> > > * The racheting mechanism for vm.memfd_noexec makes it incredibly
> > >   unappealing for most users to enable the sysctl because enabling it
> > >   on &init_pid_ns means you need a system reboot to unset it. Given the
> > >   actual security threat being protected against, CAP_SYS_ADMIN users
> > >   being restricted in this way makes little sense.
> > >
> > >   The argument for this ratcheting by the original author was that it
> > >   allows you to have a hierarchical setting that cannot be unset by
> > >   child pidnses, but this is not accurate -- changing the parent
> > >   pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> > >   children.
> > >
> > That is not exactly what I said though.
>
> Sorry, I probably should've phrased this as "one of the main arguments".
> In the last discussion thread we had in the v1 of this patch, it was my
> impression that this was the primary sticking point.
>
> > From ChromeOS's position, allowing downgrade is less secure, and this
> > setting was designed to be set at startup/reboot time from the very
> > beginning, such that the kernel command line or as part of the
> > container runtime environment (get passed to sandboxed container)
>
> If this had been implemented as a cmdline flag, it would be completely
> reasonable that you need to reboot to change it. However, it was

You might already know that sysctls can be set on the kernel command
line, thanks to Vlastimil Babka from SUSE. [1]

[1] https://lore.kernel.org/lkml/20200325120345.12946-1-vbabka@suse.cz/

> implemented as a sysctl and the behaviour of sysctls is that admins can
> (generally) change them after they've been set -- even for
> security-related sysctls such as the fs.protected_* sysctls. The only
> counter-example I know if the YAMA one, and if I'm being honest I think
> that behaviour is also weird.
>
> > I understand your viewpoint, from another distribution point of view,
> > the original design might be too restricted, so if the kernel wants
> > to weigh more on ease of admin, I'm OK with your approach.
> > Though it is less secure for ChromeOS - i.e. we do try to prevent
> > arbitrary code execution as much as possible, even for CAP_SYSADMIN.
> > And with this change, it is less secure and one more possibility for
> > us to consider.
>
> FWIW I still think the threat model where a &init_user_ns-privileged
> CAP_SYS_ADMIN process can be tricked into writing a sysctl should be
> protected against by memfd_create(MFD_EXEC) doesn't really make sense
> for the vast majority of systems (if any).
>
I agree that other distributions might not care much about a
CAP_SYS_ADMIN process running arbitrary code on the host, similar to
traditional Unix in this respect. ChromeOS has some unique security
features.

> If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
> enforced, this can be done with a very simple seccomp filter. If applied
> to pid1, this would also not be possible to unset without a reboot.
>
In practice, the host and individual processes can have different values
for vm.memfd_noexec, so it can't easily be implemented through seccomp.
Seccomp also requires no_new_privs to be set, and there are implications
if we set it on pid 1 and apply it to all its children.

> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

Thanks
Best regards,
-Jeff

On Mon, Aug 14, 2023 at 06:40:59PM +1000, Aleksa Sarai wrote:
> In order to incentivise userspace to switch to passing MFD_EXEC and
> MFD_NOEXEC_SEAL, we need to provide a warning on each attempt to call
> memfd_create() without the new flags. pr_warn_once() is not useful
> because on most systems the one warning is burned up during the boot
> process (on my system, systemd does this within the first second of
> boot) and thus userspace will in practice never see the warnings to push
> them to switch to the new flags.
>
> The original patchset[1] used pr_warn_ratelimited(), however there were
> concerns about the degree of spam in the kernel log[2,3]. The resulting
> inability to detect every case was flagged as an issue at the time[4].
>
> While we could come up with an alternative rate-limiting scheme such as
> only outputting the message if vm.memfd_noexec has been modified, or
> only outputting the message once for a given task, these alternatives
> have downsides that don't make sense given how low-stakes a single
> kernel warning message is. Switching to pr_info_ratelimited() instead
> should be fine -- it's possible some monitoring tool will be unhappy
> with a stream of warning-level messages but there's already plenty of
> info-level message spam in dmesg.
>
> [1]: https://lore.kernel.org/20221215001205.51969-4-jeffxu@google.com/
> [2]: https://lore.kernel.org/202212161233.85C9783FB@keescook/
> [3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
> [4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/
>
> Cc: stable@vger.kernel.org # v6.3+
> Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---

Reviewed-by: Christian Brauner <brauner@kernel.org>

On Fri, 1 Sep 2023 07:13:45 +0200 Damian Tometzki <dtometzki@fedoraproject.org> wrote:

> >  	if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > -		pr_warn_once(
> > +		pr_info_ratelimited(
> >  			"%s[%d]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set\n",
> >  			current->comm, task_pid_nr(current));
> >  	}
> >
> > --
> > 2.41.0
> >
> Hello Sarai,
>
> i got a lot of messages in dmesg with this. DMESG is unuseable with
> this.
> [ 1390.349462] __do_sys_memfd_create: 5 callbacks suppressed
> [ 1390.349468] pipewire-pulse[2930]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
> [ 1390.350106] pipewire[2712]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set

OK, thanks, I'll revert this. Spamming everyone even harder isn't a
good way to get developers to fix their stuff.

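To make the "5 callbacks suppressed" line in the report above more concrete, here is a small userspace model of interval/burst rate limiting of the kind pr_info_ratelimited() relies on. It is a sketch, not the kernel's code: struct rl_state, rl_allow() and rl_count_allowed() are invented names, time is passed in as plain seconds, and the 5-second window with a burst of 10 used in the note below reflects my understanding of the kernel defaults rather than anything stated in the thread.

```c
#include <assert.h>

/* Hypothetical userspace model; not the kernel's struct ratelimit_state. */
struct rl_state {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages let through in this window */
	int missed;	/* messages dropped in this window */
};

/*
 * Returns 1 if a message may be logged at time `now`, 0 if it must be
 * dropped. When a new window starts, the number of messages dropped in
 * the previous one is stored in *suppressed (the kernel reports this as
 * "N callbacks suppressed").
 */
static int rl_allow(struct rl_state *rs, long now, int *suppressed)
{
	*suppressed = 0;
	if (now - rs->begin >= rs->interval) {
		*suppressed = rs->missed;
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}

/* Counts how many of `n` back-to-back messages at time `now` get through. */
static int rl_count_allowed(struct rl_state *rs, long now, int n)
{
	int allowed = 0, suppressed;

	for (int i = 0; i < n; i++)
		allowed += rl_allow(rs, now, &suppressed);
	return allowed;
}
```

With interval = 5 and burst = 10, fifteen calls inside one window let ten messages through and drop five, and the first call of the next window reports those five as suppressed. A service that calls memfd_create() continuously therefore still produces a steady stream of log lines, which matches the complaint in the report.
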
* Andrew Morton:

> OK, thanks, I'll revert this. Spamming everyone even harder isn't a
> good way to get developers to fix their stuff.

Is this really buggy userspace? Are future kernels going to require
some of these flags?

That's going to break lots of applications which use memfd_create to
enable run-time code generation on locked-down systems because it looked
like a stable interface (“don't break userspace” and all that).

Thanks,
Florian

On 2023-09-05, Florian Weimer <fweimer@redhat.com> wrote:
> * Andrew Morton:
>
> > OK, thanks, I'll revert this. Spamming everyone even harder isn't a
> > good way to get developers to fix their stuff.
>
> Is this really buggy userspace? Are future kernels going to require
> some of these flags?
>
> That's going to break lots of applications which use memfd_create to
> enable run-time code generation on locked-down systems because it looked
> like a stable interface (“don't break userspace” and all that).

There is no userspace breakage with the current behaviour and obviously
actually requiring these flags to be passed by default would be a pretty
clear userspace breakage and would never be merged.

The original intention (as far as I can tell -- the logging behaviour
came from the original patchset) was to try to incentivise userspace to
start passing the flags so that if distributions decide to set
vm.memfd_noexec=1 as a default setting you won't end up with programs
that _need_ executable memfds (such as container runtimes) crashing
unexpectedly. I also suspect there was an aspect of "well, userspace
*should* be passing these flags after we've introduced them".

I'm sending a patch to just remove this part of the logging because I
don't think it makes sense if you can't rate-limit it sanely, and
there's probably an argument to be made that it doesn't make sense at
all (at least for the default vm.memfd_noexec=0 setting).

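As a rough idea of what "passing the flags" means for a userspace program, the sketch below requests an explicitly non-executable memfd and falls back to the legacy call when the kernel rejects the new flag as unknown. The helper names memfd_flags_are_old_style() and memfd_create_noexec() are hypothetical, and the EINVAL fallback assumes that kernels predating the flags reject unknown flag bits that way; treat it as one possible pattern, not as what any particular project does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>

/* Older userspace headers may lack the flags added in Linux 6.3. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/* Mirrors the condition quoted in the patch: neither new flag was passed. */
static int memfd_flags_are_old_style(unsigned int flags)
{
	return !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL));
}

/*
 * Hypothetical helper: create a memfd that is explicitly non-executable,
 * retrying without the new flag if the kernel rejects it as unknown.
 * Returns a file descriptor, or -1 with errno set.
 */
static int memfd_create_noexec(const char *name, unsigned int flags)
{
	int fd = memfd_create(name, flags | MFD_NOEXEC_SEAL);

	if (fd < 0 && errno == EINVAL)
		fd = memfd_create(name, flags & ~MFD_NOEXEC_SEAL);
	return fd;
}
```

A program that genuinely needs an executable memfd (the thread mentions container runtimes) would follow the same pattern with MFD_EXEC, so that it keeps working if a distribution switches the default to vm.memfd_noexec=1.
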
The most critical issue with vm.memfd_noexec=2 (the fact that passing
MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
tree[2], but there are still some outstanding issues that need to be
addressed:

* vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
  because it will make it far too difficult to ever migrate. Instead it
  should imply MFD_EXEC.

* The dmesg warnings are pr_warn_once(), which on most systems means
  that they will be used up by systemd or some other boot process and
  userspace developers will never see it.

  - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
    rate-limited message to the kernel log is necessary to tell
    userspace that they should add the new flags.

    Arguably the most ideal way to deal with the spam concern[3,4]
    while still prompting userspace to switch to the new flags would be
    to only log the warning once per task or something similar.
    However, adding something to task_struct for tracking this would be
    needless bloat for a single pr_warn_ratelimited().

    So just switch to pr_info_ratelimited() to avoid spamming the log
    with something that isn't a real warning. There's lots of
    info-level stuff in dmesg, it seems really unlikely that this
    should be an actual problem. Most programs are already switching to
    the new flags anyway.

  - For the vm.memfd_noexec=2 case, we need to log a warning for every
    failure because otherwise userspace will have no idea why their
    previously working program started returning -EACCES (previously
    -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.

* The ratcheting mechanism for vm.memfd_noexec makes it incredibly
  unappealing for most users to enable the sysctl because enabling it
  on &init_pid_ns means you need a system reboot to unset it. Given the
  actual security threat being protected against, CAP_SYS_ADMIN users
  being restricted in this way makes little sense.

  The argument for this ratcheting by the original author was that it
  allows you to have a hierarchical setting that cannot be unset by
  child pidnses, but this is not accurate -- changing the parent
  pidns's vm.memfd_noexec setting to be more restrictive didn't affect
  children.

  Instead, switch the vm.memfd_noexec sysctl to be properly
  hierarchical and allow CAP_SYS_ADMIN users (in the pidns's owning
  userns) to lower the setting as long as it is not lower than the
  parent's effective setting. This change also makes it so that
  changing a parent pidns's vm.memfd_noexec will affect all
  descendants, providing a properly hierarchical setting. The
  performance impact of this is incredibly minimal since the maximum
  depth of pidns is 32 and it is only checked during memfd_create(2)
  and unshare(CLONE_NEWPID).

* The memfd selftests would not exit with a non-zero error code when
  certain tests that ran in a forked process (specifically the ones
  related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.

[1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
[2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v2:
- Make vm.memfd_noexec restrictions properly hierarchical.
- Allow vm.memfd_noexec setting to be lowered by CAP_SYS_ADMIN as long
  as it is not lower than the parent's effective setting.
- Fix the logging behaviour related to the new flags and
  vm.memfd_noexec=2.
- Add more thorough tests for vm.memfd_noexec in selftests.
- v1: <https://lore.kernel.org/r/20230713143406.14342-1-cyphar@cyphar.com>

---
Aleksa Sarai (5):
      selftests: memfd: error out test process when child test fails
      memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
      memfd: improve userspace warnings for missing exec-related flags
      memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
      selftests: improve vm.memfd_noexec sysctl tests

 include/linux/pid_namespace.h              |  39 ++--
 kernel/pid.c                               |   3 +
 kernel/pid_namespace.c                     |   6 +-
 kernel/pid_sysctl.h                        |  28 ++-
 mm/memfd.c                                 |  33 ++-
 tools/testing/selftests/memfd/memfd_test.c | 332 +++++++++++++++++++++++------
 6 files changed, 322 insertions(+), 119 deletions(-)
---
base-commit: 3ff995246e801ea4de0a30860a1d8da4aeb538e7
change-id: 20230803-memfd-vm-noexec-uapi-fixes-ace725c67b0f

Best regards,
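The hierarchical rule described in the cover letter can be modelled in a few lines of userspace C. This is a toy under stated assumptions, not the code from the series: struct toy_pidns and both functions are invented for illustration, the CAP_SYS_ADMIN check is left out, and returning -EINVAL for a rejected value is my assumption about how an out-of-range sysctl write would be reported.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a pid namespace; not the kernel's struct pid_namespace. */
struct toy_pidns {
	struct toy_pidns *parent;	/* NULL for the initial namespace */
	int memfd_noexec;		/* this namespace's own setting: 0, 1 or 2 */
};

/* Effective setting: the strictest value found on the path to the root. */
static int toy_effective_noexec(const struct toy_pidns *ns)
{
	int scope = 0;

	for (; ns; ns = ns->parent)
		if (ns->memfd_noexec > scope)
			scope = ns->memfd_noexec;
	return scope;
}

/*
 * Model of a sysctl write by a suitably privileged task: any value from
 * the parent's effective setting up to 2 is accepted, so lowering is
 * allowed but never below what the ancestors enforce.
 */
static int toy_set_noexec(struct toy_pidns *ns, int value)
{
	int floor = toy_effective_noexec(ns->parent);

	if (value < floor || value > 2)
		return -EINVAL;	/* error code is an assumption */
	ns->memfd_noexec = value;
	return 0;
}
```

In this model, raising the root namespace to 2 immediately makes every descendant's effective value 2, and lowering the root back to 1 relaxes the descendants again. Those are the two properties the cover letter contrasts with the earlier ratcheting behaviour. The walk is bounded by the namespace nesting depth, which the cover letter gives as at most 32.
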