[RFC,v1,0/8] Introduce mseal() syscall

Message ID	20231016143828.647848-1-jeffxu@chromium.org
Headers	show Return-Path: <linux-kselftest-owner@vger.kernel.org> From: jeffxu@chromium.org To: akpm@linux-foundation.org, keescook@chromium.org, sroettger@google.com Cc: jeffxu@google.com, jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, jannh@google.com, surenb@google.com, alex.sierra@amd.com, apopple@nvidia.com, aneesh.kumar@linux.ibm.com, axelrasmussen@google.com, ben@decadent.org.uk, catalin.marinas@arm.com, david@redhat.com, dwmw@amazon.co.uk, ying.huang@intel.com, hughd@google.com, joey.gouly@arm.com, corbet@lwn.net, wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com, torvalds@linux-foundation.org, lstoakes@gmail.com, willy@infradead.org, mawupeng1@huawei.com, linmiaohe@huawei.com, namit@vmware.com, peterx@redhat.com, peterz@infradead.org, ryan.roberts@arm.com, shr@devkernel.io, vbabka@suse.cz, xiujianfeng@huawei.com, yu.ma@intel.com, zhangpeng362@huawei.com, dave.hansen@intel.com, luto@kernel.org, linux-hardening@vger.kernel.org Subject: [RFC PATCH v1 0/8] Introduce mseal() syscall Date: Mon, 16 Oct 2023 14:38:19 +0000 Message-ID: <20231016143828.647848-1-jeffxu@chromium.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce mseal() syscall \| expand [RFC,v1,0/8] Introduce mseal() syscall [RFC,v1,1/8] Add mseal syscall [RFC,v1,2/8] Wire up mseal syscall [RFC,v1,3/8] mseal: add can_modify_mm and can_modify_vma [RFC,v1,4/8] mseal: seal mprotect [RFC,v1,5/8] mseal munmap [RFC,v1,6/8] mseal mremap [RFC,v1,7/8] mseal mmap [RFC,v1,8/8] selftest mm/mseal mprotect/munmap/mremap/mmap

Jeff Xu Oct. 16, 2023, 2:38 p.m. UTC

From: Jeff Xu <jeffxu@google.com>

This patchset proposes a new mseal() syscall for the Linux kernel. 

Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen.

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned int types, unsigned int flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.

types: bit mask to specify which syscall to seal, currently they are:
MM_SEAL_MSEAL 0x1
MM_SEAL_MPROTECT 0x2
MM_SEAL_MUNMAP 0x4
MM_SEAL_MMAP 0x8
MM_SEAL_MREMAP 0x10

Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.

The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal().  Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.

MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.

Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks. 

For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated. It might be worth noting that this is different from the
rest of mprotect(), where some updates can happen even when mprotect
returns fail. Consider can_modify_mm only checks vm_seals in
vm_area_struct, and it is not going deeper in the page table or updating
any HW, success or none behavior might fit better here. I would like to
listen to the community's feedback on this.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.

In addition, Stephen is working on glibc change to add sealing support
into the dynamic linker to seal all non-writable segments at startup. When
that work is completed, all applications can automatically benefit from
these new protections.

[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc

Jeff Xu (8):
  Add mseal syscall
  Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: seal mprotect
  mseal munmap
  mseal mremap
  mseal mmap
  selftest mm/mseal mprotect/munmap/mremap/mmap

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/aio.c                                    |    5 +-
 include/linux/mm.h                          |   55 +-
 include/linux/mm_types.h                    |    7 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    6 +
 ipc/shm.c                                   |    3 +-
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    8 +
 mm/Makefile                                 |    1 +
 mm/internal.h                               |    4 +-
 mm/mmap.c                                   |   49 +-
 mm/mprotect.c                               |    6 +
 mm/mremap.c                                 |   19 +-
 mm/mseal.c                                  |  328 +++++
 mm/nommu.c                                  |    6 +-
 mm/util.c                                   |    8 +-
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1934 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

Greg Kroah-Hartman Oct. 16, 2023, 3:05 p.m. UTC | #1

On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
> +#ifdef CONFIG_MSEAL
> +	/*
> +	 * bit masks for seal.
> +	 * need this since vm_flags is full.
> +	 */
> +	unsigned long vm_seals;		/* seal flags, see mm.h. */

"unsigned long" and yet:

> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index c0cb22cd607d..f574c7dbee76 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
> +			  unsigned int flags);

"unsigned int"?

Why the mis-match?

> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -55,4 +55,10 @@ struct cachestat {
>  	__u64 nr_recently_evicted;
>  };
>  
> +#define MM_SEAL_MSEAL		0x1
> +#define MM_SEAL_MPROTECT	0x2
> +#define MM_SEAL_MUNMAP		0x4
> +#define MM_SEAL_MMAP		0x8
> +#define MM_SEAL_MREMAP		0x10

I think we can use the BIT() macro in uapi .h files now, it is _BITUL().
Might want to use it here too to make it obvious what is happening.

thanks,

greg k-h

Linus Torvalds Oct. 16, 2023, 5:23 p.m. UTC | #2

On Mon, 16 Oct 2023 at 07:38, <jeffxu@chromium.org> wrote:
>
> This patchset proposes a new mseal() syscall for the Linux kernel.

So I have no objections to adding some kind of "lock down memory
mappings" model, but this isn't it.

First off, the simple stuff: the commit messages are worthless. Having

   check seal for mmap(2)

as the commit message is not even remotely acceptable, to pick one
random example from the series (7/8).

But that doesn't matter much, since I think the more fundamental
problems are much worse:

 - the whole "ON_BEHALF_OF_KERNEL" and "ON_BEHALF_OF_USERSPACE" is
just complete noise and totally illogical. The whole concept needs to
be redone.

Example of complete nonsense (again, picking 7/8, but that's again
just a random example):

> @@ -3017,8 +3022,8 @@ SYSCALL_DEFINE5(remap_file_pages,
>                 flags |= MAP_LOCKED;
>
>         file = get_file(vma->vm_file);
> -       ret = do_mmap(vma->vm_file, start, size,
> -                       prot, flags, pgoff, &populate, NULL);
> +       ret = do_mmap(vma->vm_file, start, size, prot, flags, pgoff,
> +                       &populate, NULL, ON_BEHALF_OF_KERNEL);
>         fput(file);
>  out:
>         mmap_write_unlock(mm);

Christ. That's *literally* the remap_file_pages() system call
definition. No way in hell does "ON_BEHALF_OF_KERNEL" make any sense
in this context.

It's not the only situation. "mremap() as the same thing. vm_munmap()
has the same thing. vm_brk_flags() has the same thing. None of them
make any sense.

But even if those obvious kinds of complete mis-designs were to be
individually fixed, the whole naming and concept is bogus. The
"ON_BEHALF_OF_KERNEL" thing seems to actually just mean "don't check
sealing". Naming should describe what the thing *means*, not some
random policy thing that may or may not be at all relevant.

 - the whole MM_SEAL_xx vs VM_SEAL_xx artificial distinction needs to go away.

Not only is it unacceptable to pointlessly create two different name
spaces for no obvious reason, code like this (from 1/8) should not
exist:

> +       if (types & MM_SEAL_MSEAL)
> +               newtypes |= VM_SEAL_MSEAL;
> +
> +       if (types & MM_SEAL_MPROTECT)
> +               newtypes |= VM_SEAL_MPROTECT;
> +
> +       if (types & MM_SEAL_MUNMAP)
> +               newtypes |= VM_SEAL_MUNMAP;
> +
> +       if (types & MM_SEAL_MMAP)
> +               newtypes |= VM_SEAL_MMAP;
> +
> +       if (types & MM_SEAL_MREMAP)
> +               newtypes |= VM_SEAL_MREMAP;

Sure, we have that in some cases when there was a *reason* for
namespacing imposed on us from an API standpoint (ie the "open()"
flags that get turned into FMODE_xyz flags, or the MS_xyz mount flags
being turned into MNT_xyz flags), but those tend to be signs of
problems in the system call API where some mixup made it impossible to
use the flags directly (most commonly because there is some extended
internal representation of said things).

For example, the MS_xyz namespace is a combination of "these are flags
for the new mount" (like MS_RDONLY) and "this is how you should mount
it" (like MS_REMOUNT), and so the user interface has that odd mixing
of different things, and the MNT_xyz namespace is a distillation of
the former.

But we certainly should not strive to introduce *new* interfaces that
start out with this kind of mixup and pointless "translate from one
bit to another" code.

 - And finally (for now), I hate that MM_ACTION_xyz thing too!

Why do we have MM_ACTION_MREMAP, and pointless like this (from 3/8):

> +       switch (action) {
> +       case MM_ACTION_MPROTECT:
> +               if (vma->vm_seals & VM_SEAL_MPROTECT)
> +                       return false;
> +               break;

when the sane thing to do is to use the *same* MM_SEAL_xyz bitmask and
sealing bitmask and just say

        if (vma->vm_seal & vm_action)
                return -EPERM;

IOW, you have pointlessly introduced not *two* different namespaces,
but *three*. All doing the exact same thing, and all just causing
pointless and ugly code to "translate" the actions of one into the
model of another.

So get rid of the "MM_ACTION_xyz" thing. Get rid of ther "VM_SEAL_xyz"
thing. Use *one* single "these are the sealing bits".

And get rid of "enum caller_origin" entirely. I don't know what the
right model for that thing is, but that isn't it.

*Maybe* the right model is some MM_SEAL_OVERRIDE bit that user space
cannot set, but that the kernel can use internally - and if that is
the right model, then dammit, the *uses* should be very very obvious
why the override is fine, because that remap_file_pages() use sure as
hell was *not* obvious.

In fact, it's not at all obvious why anything should override the
sealing bits - EVER. So I'm not convinced that the right model is
"replace ON_BEHALF_OF_KERNEL with MM_SEAL_OVERRIDE". Why would we
*ever* want to override sealing? It sounds like complete garbage. Most
of the users seem to be things like "execve()", which is nonsensical,
since the VM wouldn't have been sealed at that point _anyway_, so
having a "don't bother checking sealing bits" flag seems entirely
useless.

Anyway, this is all a resounding NAK on this series in this form. My
complaints are not some kind of small "fix this up". These are
fundamental issues.

            Linus

Jann Horn Oct. 16, 2023, 5:34 p.m. UTC | #3

On Mon, Oct 16, 2023 at 4:38 PM <jeffxu@chromium.org> wrote:
>
> From: Jeff Xu <jeffxu@google.com>
>
> This patchset proposes a new mseal() syscall for the Linux kernel.
>
> Modern CPUs support memory permissions such as RW and NX bits. Linux has
> supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> The memory permission feature improves security stance on memory
> corruption bugs, i.e. the attacker can’t just write to arbitrary memory
> and point the code to it, the memory has to be marked with X bit, or
> else an exception will happen.
>
> Memory sealing additionally protects the mapping itself against
> modifications. This is useful to mitigate memory corruption issues where
> a corrupted pointer is passed to a memory management syscall. For example,
> such an attacker primitive can break control-flow integrity guarantees
> since read-only memory that is supposed to be trusted can become writable
> or .text pages can get remapped. Memory sealing can automatically be
> applied by the runtime loader to seal .text and .rodata pages and
> applications can additionally seal security critical data at runtime.
> A similar feature already exists in the XNU kernel with the
> VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.
>
> The new mseal() is an architecture independent syscall, and with
> following signature:
>
> mseal(void addr, size_t len, unsigned int types, unsigned int flags)

Is the plan that the VMAs you need to protect would be created and
mseal()'ed while you expect that attacker code can not (yet) be
running concurrently?

Do you expect to be using sealed memory for shadow stacks (in x86 CET
/ arm64 GCS) to prevent an attacker from mixing those up by moving
pages inside a shadow stack or between different shadow stacks or
such? (If that's even possible, I think it is but I haven't tried.)

> addr/len: memory range.  Must be continuous/allocated memory, or else
> mseal() will fail and no VMA is updated. For details on acceptable
> arguments, please refer to comments in mseal.c. Those are also fully
> covered by the selftest.
> types: bit mask to specify which syscall to seal, currently they are:
> MM_SEAL_MSEAL 0x1
> MM_SEAL_MPROTECT 0x2
> MM_SEAL_MUNMAP 0x4
> MM_SEAL_MMAP 0x8
> MM_SEAL_MREMAP 0x10

You'd probably also want to block destructive madvise() operations
that can effectively alter region contents by discarding pages and
such, in particular MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED;
probably also MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK. Maybe you'd
want to just block all madvise() for sealed VMAs? Or rename
process_madvise_behavior_valid() to something like
"madvise_is_nondestructive()" and use that.

The following comments probably mostly don't matter in practice if
this feature is used in a context that is heavily seccomp-sandboxed
(like Desktop Linux Chrome), but should maybe be addressed to make
this feature more usable for other users. (Including Android Chrome,
which has a weaker sandbox...)

FWIW, it is also possible to write to read-only memory through the
/proc/self/mem interface (or through ptrace commands like
PTRACE_POKETEXT) because of FOLL_FORCE, depending on kernel
configuration, seccomp policy, and what the LSMs do. (I think Android
Chrome would allow /proc/self/mem writes, but would block
PTRACE_POKETEXT with RestrictPtrace() in the sandbox code?)

I had a related ancient patch series in 2016 with an attempt to allow
SELinux to prevent bypassing W^X protections with this, but I never
followed through with getting that landed...
(https://lore.kernel.org/linux-mm/1475103281-7989-1-git-send-email-jann@thejh.net/)

I guess the question there is what the right semantics for this kind
of protected memory are when a debugger is active. The simple solution
that might break some debugging would be "just deny all FOLL_FORCE
write access for this memory" (which would prevent debuggers from
being able to place breakpoints, which would maybe not be great). But
maybe it makes more sense to consider this to be an independent
concern and solve it with a new SELinux feature or something like that
instead, and then document that mseal() requires some complement to
prevent forced writes to read-only private memory? (For which the
simplest solution would be "don't grant filesystem access or ptrace()
access to the sandboxed code".)

What is the intended interaction with userfaultfd, which I believe by
design permits arbitrary data into unpopulated areas of anonymous
VMAs? If the intention is that the process should otherwise be
sandboxed to not have access to userfaultfd, that should maybe be
documented. (Alternatively I guess you could theoretically remove the
VM_MAYWRITE bit from marked VMAs, but that might be more strict than
we want, since it'd also block all FOLL_FORCE writes.)

There are also some interfaces like AIO or the X86 Shadow Stacks
interface that indirectly unmap memory through the kernel and look
like they could perhaps be tricked into racily unmapping a
just-created sealed VMA. You'd probably have to make sure that they
can't do that and essentially treat their unmap operations as if they
came from userspace, I guess? What Linus just wrote.

I think either way this feature needs some documentation on what kind
of context it's supposed to run in.

> Each bit represents sealing for one specific syscall type, e.g.
> MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> is that the API is extendable, i.e. when needed, the sealing can be
> extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> The kernel will remember which seal types are applied, and the application
> doesn’t need to repeat all existing seal types in the next mseal().  Once
> a seal type is applied, it can’t be unsealed. Call mseal() on an existing
> seal type is a no-action, not a failure.
>
> MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.
>
> Internally, vm_area_struct adds a new field vm_seals, to store the bit
> masks.
>
> For the affected syscalls, such as mprotect, a check(can_modify_mm) for
> sealing is added, this usually happens at the early point of the syscall,
> before any update is made to VMAs. The effect of that is: if any of the
> VMAs in the given address range fails the sealing check, none of the VMA
> will be updated. It might be worth noting that this is different from the
> rest of mprotect(), where some updates can happen even when mprotect
> returns fail. Consider can_modify_mm only checks vm_seals in
> vm_area_struct, and it is not going deeper in the page table or updating
> any HW, success or none behavior might fit better here. I would like to
> listen to the community's feedback on this.
>
> The idea that inspired this patch comes from Stephen Röttger’s work in
> V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.
>
> In addition, Stephen is working on glibc change to add sealing support
> into the dynamic linker to seal all non-writable segments at startup. When
> that work is completed, all applications can automatically benefit from
> these new protections.
>
> [1] https://kernelnewbies.org/Linux_2_6_8
> [2] https://v8.dev/blog/control-flow-integrity
> [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
> [4] https://man.openbsd.org/mimmutable.2
> [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
>
> Jeff Xu (8):
>   Add mseal syscall
>   Wire up mseal syscall
>   mseal: add can_modify_mm and can_modify_vma
>   mseal: seal mprotect
>   mseal munmap
>   mseal mremap
>   mseal mmap
>   selftest mm/mseal mprotect/munmap/mremap/mmap
>
>  arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
>  arch/arm/tools/syscall.tbl                  |    1 +
>  arch/arm64/include/asm/unistd.h             |    2 +-
>  arch/arm64/include/asm/unistd32.h           |    2 +
>  arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
>  arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
>  arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
>  arch/s390/kernel/syscalls/syscall.tbl       |    1 +
>  arch/sh/kernel/syscalls/syscall.tbl         |    1 +
>  arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
>  arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
>  fs/aio.c                                    |    5 +-
>  include/linux/mm.h                          |   55 +-
>  include/linux/mm_types.h                    |    7 +
>  include/linux/syscalls.h                    |    2 +
>  include/uapi/asm-generic/unistd.h           |    5 +-
>  include/uapi/linux/mman.h                   |    6 +
>  ipc/shm.c                                   |    3 +-
>  kernel/sys_ni.c                             |    1 +
>  mm/Kconfig                                  |    8 +
>  mm/Makefile                                 |    1 +
>  mm/internal.h                               |    4 +-
>  mm/mmap.c                                   |   49 +-
>  mm/mprotect.c                               |    6 +
>  mm/mremap.c                                 |   19 +-
>  mm/mseal.c                                  |  328 +++++
>  mm/nommu.c                                  |    6 +-
>  mm/util.c                                   |    8 +-
>  tools/testing/selftests/mm/Makefile         |    1 +
>  tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
>  37 files changed, 1934 insertions(+), 28 deletions(-)
>  create mode 100644 mm/mseal.c
>  create mode 100644 tools/testing/selftests/mm/mseal_test.c
>
> --
> 2.42.0.609.gbb76f46606-goog
>

Jeff Xu Oct. 17, 2023, 6:50 a.m. UTC | #4

On Mon, Oct 16, 2023 at 8:07 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:20PM +0000, jeffxu@chromium.org wrote:
> > +#ifdef CONFIG_MSEAL
> > +     /*
> > +      * bit masks for seal.
> > +      * need this since vm_flags is full.
> > +      */
> > +     unsigned long vm_seals;         /* seal flags, see mm.h. */
>
> "unsigned long" and yet:
>
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index c0cb22cd607d..f574c7dbee76 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -802,6 +802,8 @@ asmlinkage long sys_process_mrelease(int pidfd, unsigned int flags);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >                       unsigned long prot, unsigned long pgoff,
> >                       unsigned long flags);
> > +asmlinkage long sys_mseal(unsigned long start, size_t len, unsigned int types,
> > +                       unsigned int flags);
>
> "unsigned int"?
>
> Why the mis-match?
>
Thanks. Fixed in V2.

> > --- a/include/uapi/linux/mman.h
> > +++ b/include/uapi/linux/mman.h
> > @@ -55,4 +55,10 @@ struct cachestat {
> >       __u64 nr_recently_evicted;
> >  };
> >
> > +#define MM_SEAL_MSEAL                0x1
> > +#define MM_SEAL_MPROTECT     0x2
> > +#define MM_SEAL_MUNMAP               0x4
> > +#define MM_SEAL_MMAP         0x8
> > +#define MM_SEAL_MREMAP               0x10
>
> I think we can use the BIT() macro in uapi .h files now, it is _BITUL().
> Might want to use it here too to make it obvious what is happening.
>
Sure. Will update in V2.

> thanks,
>
> greg k-h

Jeff Xu Oct. 17, 2023, 8:34 a.m. UTC | #5

Hi Matthew.

Thanks for your comments and time to review the patchset.

On Mon, Oct 16, 2023 at 8:18 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
>
> This seems like a confusing way to introduce the subject.  Here, you're
> talking about page permissions, whereas (as far as I can tell), mseal() is
> about making _virtual_ addresses immutable, for some value of immutable.
>
> > Memory sealing additionally protects the mapping itself against
> > modifications. This is useful to mitigate memory corruption issues where
> > a corrupted pointer is passed to a memory management syscall. For example,
> > such an attacker primitive can break control-flow integrity guarantees
> > since read-only memory that is supposed to be trusted can become writable
> > or .text pages can get remapped. Memory sealing can automatically be
> > applied by the runtime loader to seal .text and .rodata pages and
> > applications can additionally seal security critical data at runtime.
> > A similar feature already exists in the XNU kernel with the
> > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> useful to understand what you're trying to do.
>
> > The new mseal() is an architecture independent syscall, and with
> > following signature:
> >
> > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> >
> > addr/len: memory range.  Must be continuous/allocated memory, or else
> > mseal() will fail and no VMA is updated. For details on acceptable
> > arguments, please refer to comments in mseal.c. Those are also fully
> > covered by the selftest.
>
> Mmm.  So when you say "continuous/allocated" what you really mean is
> "Must have contiguous VMAs" rather than "All pages in this range must
> be populated", yes?
>
There can't be a gap (unallocated memory) in the given range.
Those are covered in selftest:
test_seal_unmapped_start()
test_seal_unmapped_middle()
test_seal_unmapped_end()
The comments in check_mm_seal() also mentioned that.

> > types: bit mask to specify which syscall to seal, currently they are:
> > MM_SEAL_MSEAL 0x1
> > MM_SEAL_MPROTECT 0x2
> > MM_SEAL_MUNMAP 0x4
> > MM_SEAL_MMAP 0x8
> > MM_SEAL_MREMAP 0x10
>
> I don't understand why we want this level of granularity.  The OpenBSD
> and XNU examples just say "This must be immutable*".  For values of
> immutable that allow downgrading access (eg RW to RO or RX to RO),
> but not upgrading access (RW->RX, RO->*, RX->RW).
>
> > Each bit represents sealing for one specific syscall type, e.g.
> > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > is that the API is extendable, i.e. when needed, the sealing can be
> > extended to madvise, mlock, etc. Backward compatibility is also easy.
>
> Honestly, it feels too flexible.  Why not just two flags to mprotect()
> -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> maybe for some things we want to be able to downgrade and for other
> things, we don't.
>
Having a seal type per syscall type helps to add the feature incrementally.
Applications also know exactly what is sealed.

I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
can define what it seals precisely. As Jann pointed out, there have
other scenarios that potentially affect IMMUTABLE. Implementing all thoses
will take time. And if we missed a case, we could introduce backward
compatibility issues to the application. Bitmask will solve this nicely, i.e.
application will need to apply the newly added sealing type explicitly.

> I'd like to see some discussion of how this interacts with mprotect().
> As far as I can tell, the intent is to lock the protections/existance
> of the mapping, and not to force memory to stay in core.  So it's fine
> for the kernel to swap out the page and set up a PTE as a swap entry.
> It's also fine for the kernel to mark PTEs as RO to catch page faults;
> we're concerned with the LOGICAL permissions, and not the page tables.

Yes. That is correct.

-Jeff Xu

Matthew Wilcox Oct. 17, 2023, 12:56 p.m. UTC | #6

On Tue, Oct 17, 2023 at 01:34:35AM -0700, Jeff Xu wrote:
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
> >
> Having a seal type per syscall type helps to add the feature incrementally.
> Applications also know exactly what is sealed.

You're trying to portray a disadvantage as an advantage,  This is the
seccomp disease, only worse because you're trying to deny individual
syscalls instead of building up a list of permitted syscalls.  If we
introduce a new syscall tomorrow which can affect VMAs, we'll then
make it the application's fault for not disabling that new syscall.
That's terrible design!

> I'm not against types such as IMMUTABLE and DOWNGRADEABLE, if we
> can define what it seals precisely. As Jann pointed out, there have
> other scenarios that potentially affect IMMUTABLE. Implementing all thoses
> will take time. And if we missed a case, we could introduce backward
> compatibility issues to the application. Bitmask will solve this nicely, i.e.
> application will need to apply the newly added sealing type explicitly.

It is your job to do this.  You seem to have taken the attitude that you
will give the Chrome team exactly what they asked for instead of trying
to understand their requirements and give them what they need.

And don't send a v2 before discussion of v1 has come to an end.  It
uselessly fragments the discussion.

Theo de Raadt Oct. 17, 2023, 6:20 p.m. UTC | #7

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> >
> > It is probably worth noting that I choose to check one and only
> > one sealing type per syscall. i.e. munmap(2) checks
> > MM_SEAL_MUNMAP only.
> 
> Yeah, this is wrong.
> 
> It's wrong exactly because other system calls will unmap things too.
> 
> Using mmap() to over-map something will unmap the old one.
> 
> Same goes for mremap() to move over an existing mapping.
> 
> So the whole "do things by the name of the system call" is not workable.
> 
> All that matters is what the system calls *do*, not what their name is.

I agree completely...

mseal() is a clone of mimmutable(2), but with an extremely
over-complicated API based upon dubious arguments.

I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
the components working correctly.  There were many intermediate API
during development, but in the end the API is simply:

     int mimmutable(void *addr, size_t len);

The kernel code for mimmutable() traverses the specified VA range.  In
that range, it will find unmapped sub-regions (which are are ignored)
and mapped sub-regions. For these mapped regions, it does not care what
the permissions are, it just marks each sub-region as immutable.

Later on, when any VM operation request upon a VA range attempts to
      (1) change the permissions
      (2) to re-map on top
      (3) or dispose of the mapping,
that operation is refused with errno EPERM.  We don't care where the
request comes from (ie. what system call).  It is a behaviour of the
VM system, when asked to act upon a VA sub-range mapping.

Very simple semantics.

The only case where the immutable marker is ignored is during address space
teardown as a result of process termination.

In his submission of this API, Jeff Xu makes three claims I find dubious;

> Also, Chrome wants to adopt this feature for their CFI work [2] and this
> patchset has been designed to be compatible with the Chrome use case.

I specifically designed mimmutable(2) with chrome in mind, and the
chrome binary running on OpenBSD is full of immutable mappings.  All the
library regions automatically become immutable because ld.so can infer
it and do the mimmutable calls for the right subregions.

So this chrome work has already been done by OpenBSD, and it is dead
simple.  During early development I thought mimmutable(2) would be
called by applications or libraries, but I was dead wrong: 99.9% of
calls are from ld.so, and no applications need to call it, these are the
two exceptions:

In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
structures at initialization time, so they canoot be attacked to create
an invariant for use in ROP return-to-libc style methods.

In Chrome, there is a v8_flags variable rounded out to a full page, and
placed in .data.  Chrome initialized this variable, and wants to mprotect
PROT_READ, but .data has been made immutable by ld.so.  So we force this
page into a new ELF section called "openbsd.mutable" which also behaves RW
like .data.  Where chrome does the mprotect  PROT_READ, it now also performs
mimmutable() on that page.

> Having a seal type per syscall type helps to add the feature incrementally.

Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
entire base operating system and 10,000+ applications automatically receive
the benefits.  In one year's effort.  The only application which cared about
it was chrome, described in the previous paragraph.

I think Jeff's idea here is super dangerous.  What will actually happen
is people will add a few mseal() sub-operations and think the job is done.
It isn't done.  They need all the mseal() requests, or the mapping are
not safe.

It is very counterproductive to provide developers a complex API that has
insecure suboperations.

> Applications also know exactly what is sealed.

Actually applicatins won't know because there is no tooling to inspect this --
but I will argue further that applications don't need to know.  Immutable
marking is a system decision, not a program decision.

I'll close by asking for a new look at the mimmutable(2) API we settled
on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
help guide glibc / ld.so / musl teams through the problems they may find
along the way, I know where the skeletons are buried.  Two in
particular: -znow RELRO already today, and xonly-text in the future.

[1] https://man.openbsd.org/mimmutable.2

Theo de Raadt Oct. 17, 2023, 6:55 p.m. UTC | #8

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Oct 2023 at 11:20, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > The only case where the immutable marker is ignored is during address space
> > teardown as a result of process termination.
> 
> .. and presumably also execve()?

Ah yes of course that also.

> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.

Well, our chrome fork knows how to use it.  Robert Nagy in our group maintains
1280 patches to make chrome work on OpenBSD.  Google ignores them and will not
upstream them.  Some of these changes are security related, and they ignore
them.  Other changes are to cope with security work we've done on our own,
for example: JIT changes from Stephen@google for mandatory IBT which google
hasn't upstreamed yet, impacts due to PROT_EXEC-only mappings, etc.

But the only chrome diff required for mimmutable is for that v8_flags thing
I described.   And the same issue would need handling for mseal().  Introducing
the new "mutable" ELF section is probably going to be a bigger fuss than the
system call after mprotect(PROT_READ)....

> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

There's only one feature I can think of, and we already do it right now,
documented in our manual page:

CAVEATS
     At present, mprotect(2) may reduce permissions on immutable pages marked
     PROT_READ | PROT_WRITE to the less permissive PROT_READ.  This one-way
     operation is permitted for an introductory period to observe how software
     uses this mechanism.  It may change to require explicit mutable region
     annotation with __attribute__((section(".openbsd.mutable"))) and explicit
     calls to mimmutable().

We had something which needed this behaviour during the development
transition.  It is exlusively mprotect RW -> R, no other transitions
allowed.

But once the transition was done, we don't need it anymore.  I want to
delete it, because it is a bit of a trap.  It still fails closed from an
attack perspective, but...

What worries me is a piece of code reached by mistake can do a
mprotect(lowering), not receive -1 EPERM, and carry on running..  I'd
prefer the first time you touch a mapping in the wrong way, you receive
indication of error.  This only applies applies to mprotect() acting up
on a region so the argument is a bit weak, due to mprotect() return
value checking being about as rare as unicorns.

Also, it would be a pain for OpenBSD to transition to adding a 0 flag.
I would need to see real cause not theory.

Jeff Xu Oct. 17, 2023, 9:33 p.m. UTC | #9

On Tue, Oct 17, 2023 at 8:30 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Mon, Oct 16, 2023 at 4:18 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 16, 2023 at 02:38:19PM +0000, jeffxu@chromium.org wrote:
> > > Modern CPUs support memory permissions such as RW and NX bits. Linux has
> > > supported NX since the release of kernel version 2.6.8 in August 2004 [1].
> >
> > This seems like a confusing way to introduce the subject.  Here, you're
> > talking about page permissions, whereas (as far as I can tell), mseal() is
> > about making _virtual_ addresses immutable, for some value of immutable.
> >
> > > Memory sealing additionally protects the mapping itself against
> > > modifications. This is useful to mitigate memory corruption issues where
> > > a corrupted pointer is passed to a memory management syscall. For example,
> > > such an attacker primitive can break control-flow integrity guarantees
> > > since read-only memory that is supposed to be trusted can become writable
> > > or .text pages can get remapped. Memory sealing can automatically be
> > > applied by the runtime loader to seal .text and .rodata pages and
> > > applications can additionally seal security critical data at runtime.
> > > A similar feature already exists in the XNU kernel with the
> > > VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
> > > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > > patchset has been designed to be compatible with the Chrome use case.
> >
> > This [2] seems very generic and wide-ranging, not helpful.  [5] was more
> > useful to understand what you're trying to do.
> >
> > > The new mseal() is an architecture independent syscall, and with
> > > following signature:
> > >
> > > mseal(void addr, size_t len, unsigned int types, unsigned int flags)
> > >
> > > addr/len: memory range.  Must be continuous/allocated memory, or else
> > > mseal() will fail and no VMA is updated. For details on acceptable
> > > arguments, please refer to comments in mseal.c. Those are also fully
> > > covered by the selftest.
> >
> > Mmm.  So when you say "continuous/allocated" what you really mean is
> > "Must have contiguous VMAs" rather than "All pages in this range must
> > be populated", yes?
> >
> > > types: bit mask to specify which syscall to seal, currently they are:
> > > MM_SEAL_MSEAL 0x1
> > > MM_SEAL_MPROTECT 0x2
> > > MM_SEAL_MUNMAP 0x4
> > > MM_SEAL_MMAP 0x8
> > > MM_SEAL_MREMAP 0x10
> >
> > I don't understand why we want this level of granularity.  The OpenBSD
> > and XNU examples just say "This must be immutable*".  For values of
> > immutable that allow downgrading access (eg RW to RO or RX to RO),
> > but not upgrading access (RW->RX, RO->*, RX->RW).
> >
> > > Each bit represents sealing for one specific syscall type, e.g.
> > > MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
> > > is that the API is extendable, i.e. when needed, the sealing can be
> > > extended to madvise, mlock, etc. Backward compatibility is also easy.
> >
> > Honestly, it feels too flexible.  Why not just two flags to mprotect()
> > -- PROT_IMMUTABLE and PROT_DOWNGRADABLE.  I can see a use for that --
> > maybe for some things we want to be able to downgrade and for other
> > things, we don't.
>
> I think it's worth pointing out that this suggestion (with PROT_*)
> could easily integrate with mmap() and as such allow for one-shot
> mmap() + mseal().
> If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> definitely sounds like a performance win as we halve the number of
> syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> like common patterns.
>
Yes. mmap() can support sealing as well, and memory is allocated as
immutable from begining.
This is orthogonal to mseal() though.
In case of ld.so, iiuc, memory can be first allocated as W, then later
changed to RO, for example, during symbol resolution.
The important point is that the application can decide what type of
sealing it wants, and when to apply it.  There needs to be an api(),
that can be mseal() or mprotect2() or mimmutable(), the naming is not
important to me.

mprotect() in linux have the following signature:
int mprotect(void addr[.len], size_t len, int prot);
the prot bitmasks are all taken here.
I have not checked the prot field in mmap(), there might be bits left,
even not, we could have mmap2(), so that is not an issue.

Thanks
-Jeff

> --
> Pedro

Jeff Xu Oct. 17, 2023, 11:01 p.m. UTC | #10

On Tue, Oct 17, 2023 at 11:20 AM Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> > On Tue, 17 Oct 2023 at 02:08, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > It is probably worth noting that I choose to check one and only
> > > one sealing type per syscall. i.e. munmap(2) checks
> > > MM_SEAL_MUNMAP only.
> >
> > Yeah, this is wrong.
> >
> > It's wrong exactly because other system calls will unmap things too.
> >
> > Using mmap() to over-map something will unmap the old one.
> >
> > Same goes for mremap() to move over an existing mapping.
> >
> > So the whole "do things by the name of the system call" is not workable.
> >
> > All that matters is what the system calls *do*, not what their name is.
>
> I agree completely...
>
> mseal() is a clone of mimmutable(2), but with an extremely
> over-complicated API based upon dubious arguments.
>
> I designed mimmutable(2) [1] in OpenBSD, it took about a year to get all
> the components working correctly.  There were many intermediate API
> during development, but in the end the API is simply:
>
>      int mimmutable(void *addr, size_t len);
>
> The kernel code for mimmutable() traverses the specified VA range.  In
> that range, it will find unmapped sub-regions (which are are ignored)
> and mapped sub-regions. For these mapped regions, it does not care what
> the permissions are, it just marks each sub-region as immutable.
>
> Later on, when any VM operation request upon a VA range attempts to
>       (1) change the permissions
>       (2) to re-map on top
>       (3) or dispose of the mapping,
> that operation is refused with errno EPERM.  We don't care where the
> request comes from (ie. what system call).  It is a behaviour of the
> VM system, when asked to act upon a VA sub-range mapping.
>
> Very simple semantics.
>
> The only case where the immutable marker is ignored is during address space
> teardown as a result of process termination.
>
May I ask, for BSD's implementation of immutable(), do you cover
things such as mlock(),
madvice() ? or just the protection bit (WRX) + remap() + unmap().

In other words:
Is BSD's definition of immutable equivalent to
MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP, of this patch set ?

I hesitate to introduce the concept of immutable into linux because I don't know
all the scenarios present in linux where VMAs's metadata can be
modified. As Jann's email pointed out,
There could be quite a few things we still need to deal with, to
completely block the possibility,
e.g. malicious code attempting to write to a RO memory or change RW
memory to RWX.

If, as part of immutable, I also block madvice(), mlock(), which also updates
VMA's metadata, so by definition, I could.  What if the user wants the
features in
madvice() and at the same time, also wants their .text protected ?

Also, if linux introduces a new syscall that depends on a new metadata of VMA,
say msecret(), (for discussion purpose), should immutable
automatically support that ?

Without those questions answered, I couldn't choose the route of
immutable() yet.

-Jeff



>
> In his submission of this API, Jeff Xu makes three claims I find dubious;
>
> > Also, Chrome wants to adopt this feature for their CFI work [2] and this
> > patchset has been designed to be compatible with the Chrome use case.
>
> I specifically designed mimmutable(2) with chrome in mind, and the
> chrome binary running on OpenBSD is full of immutable mappings.  All the
> library regions automatically become immutable because ld.so can infer
> it and do the mimmutable calls for the right subregions.
>
> So this chrome work has already been done by OpenBSD, and it is dead
> simple.  During early development I thought mimmutable(2) would be
> called by applications or libraries, but I was dead wrong: 99.9% of
> calls are from ld.so, and no applications need to call it, these are the
> two exceptions:
>
> In OpenBSD, mimmutable() is used in libc malloc() to lock-down some data
> structures at initialization time, so they canoot be attacked to create
> an invariant for use in ROP return-to-libc style methods.
>
> In Chrome, there is a v8_flags variable rounded out to a full page, and
> placed in .data.  Chrome initialized this variable, and wants to mprotect
> PROT_READ, but .data has been made immutable by ld.so.  So we force this
> page into a new ELF section called "openbsd.mutable" which also behaves RW
> like .data.  Where chrome does the mprotect  PROT_READ, it now also performs
> mimmutable() on that page.
>
> > Having a seal type per syscall type helps to add the feature incrementally.
>
> Yet, somehow OpenBSD didn't do it per syscall, and we managed to make our
> entire base operating system and 10,000+ applications automatically receive
> the benefits.  In one year's effort.  The only application which cared about
> it was chrome, described in the previous paragraph.
>
> I think Jeff's idea here is super dangerous.  What will actually happen
> is people will add a few mseal() sub-operations and think the job is done.
> It isn't done.  They need all the mseal() requests, or the mapping are
> not safe.
>
> It is very counterproductive to provide developers a complex API that has
> insecure suboperations.
>
> > Applications also know exactly what is sealed.
>
> Actually applicatins won't know because there is no tooling to inspect this --
> but I will argue further that applications don't need to know.  Immutable
> marking is a system decision, not a program decision.
>
>
> I'll close by asking for a new look at the mimmutable(2) API we settled
> on for OpenBSD.  I think there is nothing wrong with it.  I'm willing to
> help guide glibc / ld.so / musl teams through the problems they may find
> along the way, I know where the skeletons are buried.  Two in
> particular: -znow RELRO already today, and xonly-text in the future.
>
>
> [1] https://man.openbsd.org/mimmutable.2
>

Theo de Raadt Oct. 18, 2023, 3:37 a.m. UTC | #11

Jeff Xu <jeffxu@google.com> wrote:

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().

I don't believe that, considering noone needed this behaviour from the VM
system in the last 4 decades.

> VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.

> For example, in madvice(2) case:
> As Jann point out in [1] and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."

Then prohibit madvise(MADV_FREE) on all non-writeable mappings that are
immutable.  Just include this in the set of behaviours. Or make it the
default.

Don't make it an option that a program needs to set on pages!  Noone
is going to call it.  Most programs don't know the addresses of the
*REGIONS* they would want to do this for.

Does your program know where libc's text segment starts and ends?
No your program does not know these addresses, so the parts of the
'system' which do know this needs to do it (which would be ld.so or
the libc init constructors).

If madvise(MADV_FREE) is so dangerous..  say you have a program that
would call through abort(), but you know a zero there can make the
abort not abort but return, then is it bad to let the attacker do:

   madvise(&abort, pagesize, MADV_FREE)

If that is bad, then block it in a smart way!  Don't make a programmer
of an application figure out how to do this.  That results in a defense
methodology where a handful of programs self-protect, but everything
else is unprotected or unprotectable.  That is shortsighted.

> Another example: if an application wants to keep a memory always
> present in RAM, for whatever the reason, it can call seal the mlock().

Please explain the attack surface.

> I think I  explained the logic of using bitmasks in the mseal()
> interface clearly with the example of madvice() and mlock().

It is clear as mud.

Matthew Wilcox Oct. 18, 2023, 3:17 p.m. UTC | #12

On Tue, Oct 17, 2023 at 08:18:47PM -0700, Jeff Xu wrote:
> In practice: libc could do below:
> #define MM_IMMUTABLE
> (MM_SEAL_MPROTECT|MM_SEAL_MUNMAP|MM_SEAL_MREMAP|MM_SEAL_MMAP)
> mseal(add,len, MM_IMMUTABLE)
> it will be equivalent to BSD's immutable().

No, it wouldn't, because you've carefully listed the syscalls you're
blocking instead of understanding the _concept_ of what you need to
block.

> In linux cases, I think, eventually, mseal() will have a bigger scope than
> BSD's mimmutable().  VMA's metadata(vm_area_struct) contains a lot
> of control info, depending on application's needs, mseal() can be
> expanded to seal individual control info.
> 
> For example, in madvice(2) case:
> As Jann point out in [1] and I quote:
> "you'd probably also want to block destructive madvise() operations
> that can effectively alter region contents by discarding pages and
> such, ..."
> 
> Another example: if an application wants to keep a memory always
> present in RAM, for whatever the reason, it can call seal the mlock().
> 
> To handle those two new cases. mseal() could add two more bits:
> MM_SEAL_MADVICE, MM_SEAL_MLOCK.

Yes, thank you for demonstrating that you have no idea what you need to
block.

> It is practical to keep syscall extentable, when the business logic is the same.

I concur with Theo & Linus.  You don't know what you're doing.  I think
the underlying idea of mimmutable() is good, but how you've split it up
and how you've implemented it is terrible.

Let's start with the purpose.  The point of mimmutable/mseal/whatever is
to fix the mapping of an address range to its underlying object, be it
a particular file mapping or anonymous memory.  After the call succeeds,
it must not be possible to make any address in that virtual range point
into any other object.

The secondary purpose is to lock down permissions on that range.
Possibly to fix them where they are, possibly to allow RW->RO transitions.

With those purposes in mind, you should be able to deduce for any syscall
or any madvise(), ... whether it should be allowed.

Look, I appreciate this is only your second set of patches to Linux and
you've taken on a big job.  But that's all the more reason you should
listen to people who are trying to help you.

Jeff Xu Oct. 18, 2023, 6:20 p.m. UTC | #13

On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> > >
> > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > could easily integrate with mmap() and as such allow for one-shot
> > > mmap() + mseal().
> > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > definitely sounds like a performance win as we halve the number of
> > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > like common patterns.
> > >
> > Yes. mmap() can support sealing as well, and memory is allocated as
> > immutable from begining.
> > This is orthogonal to mseal() though.
>
> I don't see how this can be orthogonal to mseal().
> In the case we opt for adding PROT_ bits, we should more or less only
> need to adapt calc_vm_prot_bits(), and the rest should work without
> issues.
> vma merging won't merge vmas with different prots. The current
> interfaces (mmap and mprotect) would work just fine.
> In this case, mseal() or mimmutable() would only be needed if you need
> to set immutability over a range of VMAs with different permissions.
>
Agreed. By orthogonal, I meant we can have two APIs:
mmap() and mseal()/mprotect()
i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
Sealing can be applied after initial memory creation.

> Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> The only annoying wrench in my plans here is that we have effectively
> run out of vm_flags bits in 32-bit architectures, so this approach as
> I described is not compatible with 32-bit.
>
> > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > changed to RO, for example, during symbol resolution.
> > The important point is that the application can decide what type of
> > sealing it wants, and when to apply it.  There needs to be an api(),
> > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > important to me.
> >
> > mprotect() in linux have the following signature:
> > int mprotect(void addr[.len], size_t len, int prot);
> > the prot bitmasks are all taken here.
> > I have not checked the prot field in mmap(), there might be bits left,
> > even not, we could have mmap2(), so that is not an issue.
>
> I don't see what you mean. We have plenty of prot bits left (32-bits,
> and we seem to have around 8 different bits used).
> And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
>
> The only issue seems to be that 32-bit ran out of vm_flags, but that
> can probably be worked around if need be.
>
Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
Apology that I was wrong previously and caused confusion.

There is a slight difference in the syntax of mprotect and mseal.
Each time when mprotect() is called, the kernel takes all of RWX bits
and updates vm_flags,
In other words, the application sets/unset each RWX, and kernel takes it.

In the mseal() case, the kernel will remember which seal types were
applied previously, and the application doesn’t need to repeat all
existing seal types in the next mseal().  Once a seal type is applied,
it can’t be unsealed.

So if we want to use mprotect() for sealing, developers need to think
of sealing bits differently than the rest of prot bits. It is a
different programming model, might or might not be an obvious concept
to developers.

There is a difference in input check and error handling as well.
for mseal(), if a given address range has a gap (unallocated memory),
or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
updated.
For mprotect(), some VMAs can be updated, till an error happens to a VMA.

IMO: I think it makes sense for mseal() and mprotect() to be different
in this. mseal() only checks vma struct so it is fast, and mprotect()
goes deeper into HW.

Because of those two differences, personally I feel a new syscall
might be worth the effort.

That said, mmap() + mprotect() is workable to me if that is what the
community wants.

Thanks
-Jeff

-Jeff

> --
> Pedro

Jeff Xu Oct. 18, 2023, 6:54 p.m. UTC | #14

On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> to fix the mapping of an address range to its underlying object, be it
> a particular file mapping or anonymous memory.  After the call succeeds,
> it must not be possible to make any address in that virtual range point
> into any other object.
>
> The secondary purpose is to lock down permissions on that range.
> Possibly to fix them where they are, possibly to allow RW->RO transitions.
>
> With those purposes in mind, you should be able to deduce for any syscall
> or any madvise(), ... whether it should be allowed.
>
I got it.

IMO: The approaches mimmutable() and mseal() took are different, but
we all want to seal the memory from attackers and make the linux
application safer.

mimmutable() started  with "none of updates to the sealed address is
allowed once marked as immutable", this includes from within kernel space
including driver, or any new syscalls. It is reasonable to me.

mseal() starts with 4 syscalls from userspace, which is just a way (among many
other ways) to demo what memory operation can be sealed, which happens
to meet what Chome wants.  This is an RFC, I appreciate your input.

Best regards,
-Jeff

Theo de Raadt Oct. 18, 2023, 8:36 p.m. UTC | #15

Jeff Xu <jeffxu@google.com> wrote:

> On Wed, Oct 18, 2023 at 8:17 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Let's start with the purpose.  The point of mimmutable/mseal/whatever is
> > to fix the mapping of an address range to its underlying object, be it
> > a particular file mapping or anonymous memory.  After the call succeeds,
> > it must not be possible to make any address in that virtual range point
> > into any other object.
> >
> > The secondary purpose is to lock down permissions on that range.
> > Possibly to fix them where they are, possibly to allow RW->RO transitions.
> >
> > With those purposes in mind, you should be able to deduce for any syscall
> > or any madvise(), ... whether it should be allowed.
> >
> I got it.
> 
> IMO: The approaches mimmutable() and mseal() took are different, but
> we all want to seal the memory from attackers and make the linux
> application safer.

I think you are building mseal for chrome, and chrome alone.

I do not think this will work out for the rest of the application space
because

1) it is too complicated
2) experience with mimmutable() says that applications don't do any of it
   themselves, it is all in execve(), libc initialization, and ld.so.
   You don't strike me as an execve, libc, or ld.so developer.

Stephen Röttger Oct. 19, 2023, 8 a.m. UTC | #16

> I do like us starting with just "mimmutable()", since it already
> exists. Particularly if chrome already knows how to use it.
>
> Maybe add a flag field (require it to be zero initially) just to allow
> any future expansion. Maybe the chrome team has *wanted* to have some
> finer granularity thing and currently doesn't use mimmutable() in some
> case?

Yes, we do have a use case in Chrome to split the sealing into unmap and
mprotect which will allow us to seal additional pages that we can't seal with
pure mimmutable().
For example, we have pkey-tagged RWX memory that we want to seal. Since
the memory is already RWX and the pkey controls write access, we don't care
about permission changes but sometimes we do need to mprotect data only
pages.
But the munmap sealing will provide protection against implicit changes of the
pkey in this case which would happen if a page gets unmapped and another
mapped in its place.

Stephen Röttger Oct. 19, 2023, 8:28 a.m. UTC | #17

> > IMO: The approaches mimmutable() and mseal() took are different, but
> > we all want to seal the memory from attackers and make the linux
> > application safer.
>
> I think you are building mseal for chrome, and chrome alone.
>
> I do not think this will work out for the rest of the application space
> because
>
> 1) it is too complicated
> 2) experience with mimmutable() says that applications don't do any of it
>    themselves, it is all in execve(), libc initialization, and ld.so.
>    You don't strike me as an execve, libc, or ld.so developer.

We do want to build this in a way that it can be applied automatically by ld.so
and we appreciate all your feedback on this. The intention of
splitting the sealing
by syscall was to provide flexibility while still allowing ld.so to
seal all operations.
But it's clear from the feedback that both the fine grained split and
the per-syscall
approach are not the right way to go.
Does Linus' proposal to just split munmap / mprotect sealing address your
complexity concerns? ld.so would always use both flags which should then behave
similar to mimmutable().

Jeff Xu Oct. 19, 2023, 5:30 p.m. UTC | #18

Hi Pedro

Some followup on mmap() + mprotect():

On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > > >
> > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > could easily integrate with mmap() and as such allow for one-shot
> > > > mmap() + mseal().
> > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > definitely sounds like a performance win as we halve the number of
> > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > like common patterns.
> > > >
> > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > immutable from begining.
> > > This is orthogonal to mseal() though.
> >
> > I don't see how this can be orthogonal to mseal().
> > In the case we opt for adding PROT_ bits, we should more or less only
> > need to adapt calc_vm_prot_bits(), and the rest should work without
> > issues.
> > vma merging won't merge vmas with different prots. The current
> > interfaces (mmap and mprotect) would work just fine.
> > In this case, mseal() or mimmutable() would only be needed if you need
> > to set immutability over a range of VMAs with different permissions.
> >
> Agreed. By orthogonal, I meant we can have two APIs:
> mmap() and mseal()/mprotect()
> i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> Sealing can be applied after initial memory creation.
>
> > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > The only annoying wrench in my plans here is that we have effectively
> > run out of vm_flags bits in 32-bit architectures, so this approach as
> > I described is not compatible with 32-bit.
> >
> > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > changed to RO, for example, during symbol resolution.
> > > The important point is that the application can decide what type of
> > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > important to me.
> > >
> > > mprotect() in linux have the following signature:
> > > int mprotect(void addr[.len], size_t len, int prot);
> > > the prot bitmasks are all taken here.
> > > I have not checked the prot field in mmap(), there might be bits left,
> > > even not, we could have mmap2(), so that is not an issue.
> >
> > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > and we seem to have around 8 different bits used).
> > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> >
> > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > can probably be worked around if need be.
> >
> Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> Apology that I was wrong previously and caused confusion.
>
> There is a slight difference in the syntax of mprotect and mseal.
> Each time when mprotect() is called, the kernel takes all of RWX bits
> and updates vm_flags,
> In other words, the application sets/unset each RWX, and kernel takes it.
>
> In the mseal() case, the kernel will remember which seal types were
> applied previously, and the application doesn’t need to repeat all
> existing seal types in the next mseal().  Once a seal type is applied,
> it can’t be unsealed.
>
> So if we want to use mprotect() for sealing, developers need to think
> of sealing bits differently than the rest of prot bits. It is a
> different programming model, might or might not be an obvious concept
> to developers.
>
This probably doesn't matter much to developers.
We can enforce the sealing bit to be the same as the rest of PROT bits.
If mprotect() tries to unset sealing, it will fail.

> There is a difference in input check and error handling as well.
> for mseal(), if a given address range has a gap (unallocated memory),
> or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> updated.
> For mprotect(), some VMAs can be updated, till an error happens to a VMA.
>
This difference doesn't matter much.

For mprotect()/mmap(), is Linux implementation limited by POSIX ?
This can be made backward compatible.
If there is no objection to adding linux specific values in mmap() and
mprotect(),
This works for me.

Thanks for your input.
-Jeff

Pedro Falcato Oct. 19, 2023, 10:47 p.m. UTC | #19

On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
>
> Hi Pedro
>
> Some followup on mmap() + mprotect():
>
> On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> >
> > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > > >
> > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > mmap() + mseal().
> > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > definitely sounds like a performance win as we halve the number of
> > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > like common patterns.
> > > > >
> > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > immutable from begining.
> > > > This is orthogonal to mseal() though.
> > >
> > > I don't see how this can be orthogonal to mseal().
> > > In the case we opt for adding PROT_ bits, we should more or less only
> > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > issues.
> > > vma merging won't merge vmas with different prots. The current
> > > interfaces (mmap and mprotect) would work just fine.
> > > In this case, mseal() or mimmutable() would only be needed if you need
> > > to set immutability over a range of VMAs with different permissions.
> > >
> > Agreed. By orthogonal, I meant we can have two APIs:
> > mmap() and mseal()/mprotect()
> > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > Sealing can be applied after initial memory creation.
> >
> > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > The only annoying wrench in my plans here is that we have effectively
> > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > I described is not compatible with 32-bit.
> > >
> > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > changed to RO, for example, during symbol resolution.
> > > > The important point is that the application can decide what type of
> > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > important to me.
> > > >
> > > > mprotect() in linux have the following signature:
> > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > the prot bitmasks are all taken here.
> > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > even not, we could have mmap2(), so that is not an issue.
> > >
> > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > and we seem to have around 8 different bits used).
> > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > >
> > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > can probably be worked around if need be.
> > >
> > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > Apology that I was wrong previously and caused confusion.
> >
> > There is a slight difference in the syntax of mprotect and mseal.
> > Each time when mprotect() is called, the kernel takes all of RWX bits
> > and updates vm_flags,
> > In other words, the application sets/unset each RWX, and kernel takes it.
> >
> > In the mseal() case, the kernel will remember which seal types were
> > applied previously, and the application doesn’t need to repeat all
> > existing seal types in the next mseal().  Once a seal type is applied,
> > it can’t be unsealed.
> >
> > So if we want to use mprotect() for sealing, developers need to think
> > of sealing bits differently than the rest of prot bits. It is a
> > different programming model, might or might not be an obvious concept
> > to developers.
> >
> This probably doesn't matter much to developers.
> We can enforce the sealing bit to be the same as the rest of PROT bits.
> If mprotect() tries to unset sealing, it will fail.

Yep. Erroneous or malicious mprotects would all be caught. However, if
we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect()
to less permissions or even downright munmap()) you'd want some care
to preserve that bit when setting permissions.

>
> > There is a difference in input check and error handling as well.
> > for mseal(), if a given address range has a gap (unallocated memory),
> > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > updated.
> > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> >
> This difference doesn't matter much.
>
> For mprotect()/mmap(), is Linux implementation limited by POSIX ?

No. POSIX works merely as a baseline that UNIX systems aim towards.
You can (and very frequently do) extend POSIX interfaces (in fact,
it's how most of POSIX was written, through sheer
"design-by-committee" on a bunch of UNIX systems' extensions).

> This can be made backward compatible.
> If there is no objection to adding linux specific values in mmap() and
> mprotect(),
> This works for me.

Linux already has system-specific values for PROT_ (PROT_BTI,
PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
Whether this is the right interface is another question. I do like it
a lot, but there's of course value in being compatible with existing
solutions (like mimmutable()).

Linus Torvalds Oct. 19, 2023, 11:06 p.m. UTC | #20

On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> >
> > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).

We can in extreme circumstances actually go further than that, and not
only extend on POSIX requirements, but actively even violate them.

It does need a very good reason, though, but it has happened when
POSIX requirements were simply actively wrong.

For example, at one point POSIX required

     int accept(int s, struct sockaddr *addr, size_t *addrlen);

which was simply completely wrong. It's utter shite, and didn't
actually match any reality.

The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
it "size_t" was completely unacceptable.

So we ignored it, told POSIX people that they were full of sh*t, and
they eventually fixed it in the next version (by introducing a
"socklen_t" that had better be the same as "int").

So POSIX can simply be wrong.

Also, if it turns out that we had some ABI that wasn't
POSIX-compatible, the whole "don't break user space" will take
precedence over any POSIX concerns, and we will not "fix" our system
call if people already use our old semantics.

So in that case, we generally end up with a new system call (or new
flags) instead.

Or sometimes it just is such a small detail that nobody cares - POSIX
also has a notion of documenting areas of non-conformance, and people
who really care end up having notions like "conformance vs _strict_
conformance".

                 Linus

Theo de Raadt Oct. 20, 2023, 4:27 p.m. UTC | #21

Stephen Röttger <sroettger@google.com> wrote:

> > I do like us starting with just "mimmutable()", since it already
> > exists. Particularly if chrome already knows how to use it.
> >
> > Maybe add a flag field (require it to be zero initially) just to allow
> > any future expansion. Maybe the chrome team has *wanted* to have some
> > finer granularity thing and currently doesn't use mimmutable() in some
> > case?
> 
> Yes, we do have a use case in Chrome to split the sealing into unmap and
> mprotect which will allow us to seal additional pages that we can't seal with
> pure mimmutable().

> For example, we have pkey-tagged RWX memory that we want to seal. Since
> the memory is already RWX and the pkey controls write access, we don't care
> about permission changes but sometimes we do need to mprotect data only
> pages.

Let me try to decompose this statement.

This is clearly for the JIT. You can pivot between the a JIT generated
code mapping being RW and RX (or X-only), the object will pivot between
W or X to satisfy W^X policy for safety.

I think you are talking about a RWX MAP_ANON object.

Then you use pkey_alloc() to get a PKEY.  pkey_mprotect() sets the PKEY on
the region.  I argue you can then make it entirely immutable / sealed.

Let's say it is fully immutable / sealed.

After which, you can change the in-processor PKU register (using pkey_set)
to toggle the Write-Inhibit and eXecute-Inhibit bits on that key.

The immutable object has a dangerous RWX permission.  But the specific
PKEY making it either RX (or X-only) or RW depending upon your context.
The mapping is never exposed as RWX. The PKU model reduces the
permission access of the object below the immutable permission level.

The security depends on the PKEY WI/XI bits being difficult to control.

SADLY on x86, this is managed with a PKRU userland register which is changeble
without any supervisor control -- yes, it seems quite dangerous.
Changing it requires a complicated MSR dance.  It is unfortunate that
the pkey_set() library function is easily reachedable in the PLT via ROP
methods.  On non-x86 cpus that have similar functionality, the register
is privileged, but operating supporting it generally change it and
return immediately.

The problem you seem to have with fully locked mseal() in chrome seems
to be here:

> about permission changes but sometimes we do need to mprotect data only
> pages.

Does that data have to be in the same region?  Can your allocator not
put the non-code pieces of the JIT elsewhere, with a different
permission, fully immutable / msealed -- and perhaps even managed with a
different PKEY if neccessary?

May that requires a huge rearchitecture.  But isn't the root problem here
that data and code are being handled in the same object with a shared
permission model?

> But the munmap sealing will provide protection against implicit changes of the
> pkey in this case which would happen if a page gets unmapped and another
> mapped in its place.

That primitive feels so weird, I have a difficult time believing it will
remain unattackable in the long term.

But what if you could replace mprotect() with pkey_mprotect() upon a
different key.. ?

--

A few more notes comparing what OpenBSD has done compared to Linux:

In OpenBSD, we do not have the pkey library.  We have stolen one of the
PKEY and use it for kernel support of xonly for kernel code and userland
code.  On x86 we recognize that userland can flip the permission by whacking
the RPKU register -- which would make xonly code readable.  (The chrome data
you are trying to guard faces the same problem).

To prevent that, a majority of traps in the kernel (page faults,
interrupts, etc) check if the PKRU register has been modified, and kill
the process.  It is statistically strong.

We are not making pkey available as a userland feature, but if we later
do so we would still have 15 pkeys to play with.  We would probably make
the pkey_set() operation a system call, so the trap handler can also
observe RPKU register modifications by the instruction.

Above, I mentioned pivoting between "RW or RX (or X-only)".  On OpenBSD, chrome would be
able to pivot between RW and X-only.

When it comes to Pkey utilization, we've ended up in a very different
place than Linux.

Jeff Xu Oct. 23, 2023, 5:42 p.m. UTC | #22

On Thu, Oct 19, 2023 at 3:47 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Thu, Oct 19, 2023 at 6:30 PM Jeff Xu <jeffxu@google.com> wrote:
> >
> > Hi Pedro
> >
> > Some followup on mmap() + mprotect():
> >
> > On Wed, Oct 18, 2023 at 11:20 AM Jeff Xu <jeffxu@google.com> wrote:
> > >
> > > On Tue, Oct 17, 2023 at 3:35 PM Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > > >
> > > > > >
> > > > > > I think it's worth pointing out that this suggestion (with PROT_*)
> > > > > > could easily integrate with mmap() and as such allow for one-shot
> > > > > > mmap() + mseal().
> > > > > > If we consider the common case as 'addr = mmap(...); mseal(addr);', it
> > > > > > definitely sounds like a performance win as we halve the number of
> > > > > > syscalls for a sealed mapping. And if we trivially look at e.g OpenBSD
> > > > > > ld.so code, mmap() + mimmutable() and mprotect() + mimmutable() seem
> > > > > > like common patterns.
> > > > > >
> > > > > Yes. mmap() can support sealing as well, and memory is allocated as
> > > > > immutable from begining.
> > > > > This is orthogonal to mseal() though.
> > > >
> > > > I don't see how this can be orthogonal to mseal().
> > > > In the case we opt for adding PROT_ bits, we should more or less only
> > > > need to adapt calc_vm_prot_bits(), and the rest should work without
> > > > issues.
> > > > vma merging won't merge vmas with different prots. The current
> > > > interfaces (mmap and mprotect) would work just fine.
> > > > In this case, mseal() or mimmutable() would only be needed if you need
> > > > to set immutability over a range of VMAs with different permissions.
> > > >
> > > Agreed. By orthogonal, I meant we can have two APIs:
> > > mmap() and mseal()/mprotect()
> > > i.e. we can't just rely on mmap() only without mseal()/mprotect()/mimmutable().
> > > Sealing can be applied after initial memory creation.
> > >
> > > > Note: modifications should look kinda like this: https://godbolt.org/z/Tbjjd14Pe
> > > > The only annoying wrench in my plans here is that we have effectively
> > > > run out of vm_flags bits in 32-bit architectures, so this approach as
> > > > I described is not compatible with 32-bit.
> > > >
> > > > > In case of ld.so, iiuc, memory can be first allocated as W, then later
> > > > > changed to RO, for example, during symbol resolution.
> > > > > The important point is that the application can decide what type of
> > > > > sealing it wants, and when to apply it.  There needs to be an api(),
> > > > > that can be mseal() or mprotect2() or mimmutable(), the naming is not
> > > > > important to me.
> > > > >
> > > > > mprotect() in linux have the following signature:
> > > > > int mprotect(void addr[.len], size_t len, int prot);
> > > > > the prot bitmasks are all taken here.
> > > > > I have not checked the prot field in mmap(), there might be bits left,
> > > > > even not, we could have mmap2(), so that is not an issue.
> > > >
> > > > I don't see what you mean. We have plenty of prot bits left (32-bits,
> > > > and we seem to have around 8 different bits used).
> > > > And even if we didn't, prot is the same in mprotect and mmap and mmap2 :)
> > > >
> > > > The only issue seems to be that 32-bit ran out of vm_flags, but that
> > > > can probably be worked around if need be.
> > > >
> > > Ah, you are right about this. vm_flags is full, and prot in mprotect() is not.
> > > Apology that I was wrong previously and caused confusion.
> > >
> > > There is a slight difference in the syntax of mprotect and mseal.
> > > Each time when mprotect() is called, the kernel takes all of RWX bits
> > > and updates vm_flags,
> > > In other words, the application sets/unset each RWX, and kernel takes it.
> > >
> > > In the mseal() case, the kernel will remember which seal types were
> > > applied previously, and the application doesn’t need to repeat all
> > > existing seal types in the next mseal().  Once a seal type is applied,
> > > it can’t be unsealed.
> > >
> > > So if we want to use mprotect() for sealing, developers need to think
> > > of sealing bits differently than the rest of prot bits. It is a
> > > different programming model, might or might not be an obvious concept
> > > to developers.
> > >
> > This probably doesn't matter much to developers.
> > We can enforce the sealing bit to be the same as the rest of PROT bits.
> > If mprotect() tries to unset sealing, it will fail.
>
> Yep. Erroneous or malicious mprotects would all be caught. However, if
> we add a PROT_DOWNGRADEABLE (that could let you, lets say, mprotect()
> to less permissions or even downright munmap()) you'd want some care
> to preserve that bit when setting permissions.
>
> >
> > > There is a difference in input check and error handling as well.
> > > for mseal(), if a given address range has a gap (unallocated memory),
> > > or if one of VMA is sealed with MM_SEAL_SEAL flag, none of VMAs is
> > > updated.
> > > For mprotect(), some VMAs can be updated, till an error happens to a VMA.
> > >
> > This difference doesn't matter much.
> >
> > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
>
> No. POSIX works merely as a baseline that UNIX systems aim towards.
> You can (and very frequently do) extend POSIX interfaces (in fact,
> it's how most of POSIX was written, through sheer
> "design-by-committee" on a bunch of UNIX systems' extensions).
>
> > This can be made backward compatible.
> > If there is no objection to adding linux specific values in mmap() and
> > mprotect(),
> > This works for me.
>
> Linux already has system-specific values for PROT_ (PROT_BTI,
> PROT_MTE, PROT_GROWSUP, PROT_GROWSDOWN, etc).
> Whether this is the right interface is another question. I do like it
> a lot, but there's of course value in being compatible with existing
> solutions (like mimmutable()).
>

Thanks Pedro for providing examples on mm extension to POSIX. This
opens more design options on solving the sealing problem. I will take
a few days to research  design options.

-Jeff


> --
> Pedro

Jeff Xu Oct. 23, 2023, 5:44 p.m. UTC | #23

On Thu, Oct 19, 2023 at 4:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 19 Oct 2023 at 15:47, Pedro Falcato <pedro.falcato@gmail.com> wrote:
> > >
> > > For mprotect()/mmap(), is Linux implementation limited by POSIX ?
> >
> > No. POSIX works merely as a baseline that UNIX systems aim towards.
> > You can (and very frequently do) extend POSIX interfaces (in fact,
> > it's how most of POSIX was written, through sheer
> > "design-by-committee" on a bunch of UNIX systems' extensions).
>
> We can in extreme circumstances actually go further than that, and not
> only extend on POSIX requirements, but actively even violate them.
>
> It does need a very good reason, though, but it has happened when
> POSIX requirements were simply actively wrong.
>
> For example, at one point POSIX required
>
>      int accept(int s, struct sockaddr *addr, size_t *addrlen);
>
> which was simply completely wrong. It's utter shite, and didn't
> actually match any reality.
>
> The 'addrlen' parameter is 'int *', and POSIX suddenly trying to make
> it "size_t" was completely unacceptable.
>
> So we ignored it, told POSIX people that they were full of sh*t, and
> they eventually fixed it in the next version (by introducing a
> "socklen_t" that had better be the same as "int").
>
> So POSIX can simply be wrong.
>
> Also, if it turns out that we had some ABI that wasn't
> POSIX-compatible, the whole "don't break user space" will take
> precedence over any POSIX concerns, and we will not "fix" our system
> call if people already use our old semantics.
>
> So in that case, we generally end up with a new system call (or new
> flags) instead.
>
> Or sometimes it just is such a small detail that nobody cares - POSIX
> also has a notion of documenting areas of non-conformance, and people
> who really care end up having notions like "conformance vs _strict_
> conformance".
>
>                  Linus
>

Thanks Linus for clarifying the guidelines on POSIX in Linux.

-Jeff

[RFC,v1,0/8] Introduce mseal() syscall

Message

Comments