[RFC,v2,0/8] Introduce mseal() syscall

Message ID	20231017090815.1067790-1-jeffxu@chromium.org
Headers	show Return-Path: <linux-kselftest-owner@vger.kernel.org> From: jeffxu@chromium.org To: akpm@linux-foundation.org, keescook@chromium.org, jannh@google.com, sroettger@google.com, willy@infradead.org, gregkh@linuxfoundation.org, torvalds@linux-foundation.org Cc: jeffxu@google.com, jorgelo@chromium.org, groeck@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, surenb@google.com, alex.sierra@amd.com, apopple@nvidia.com, aneesh.kumar@linux.ibm.com, axelrasmussen@google.com, ben@decadent.org.uk, catalin.marinas@arm.com, david@redhat.com, dwmw@amazon.co.uk, ying.huang@intel.com, hughd@google.com, joey.gouly@arm.com, corbet@lwn.net, wangkefeng.wang@huawei.com, Liam.Howlett@oracle.com, lstoakes@gmail.com, mawupeng1@huawei.com, linmiaohe@huawei.com, namit@vmware.com, peterx@redhat.com, peterz@infradead.org, ryan.roberts@arm.com, shr@devkernel.io, vbabka@suse.cz, xiujianfeng@huawei.com, yu.ma@intel.com, zhangpeng362@huawei.com, dave.hansen@intel.com, luto@kernel.org, linux-hardening@vger.kernel.org Subject: [RFC PATCH v2 0/8] Introduce mseal() syscall Date: Tue, 17 Oct 2023 09:08:07 +0000 Message-ID: <20231017090815.1067790-1-jeffxu@chromium.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce mseal() syscall \| expand [RFC,v2,0/8] Introduce mseal() syscall [RFC,v2,3/8] mseal: add can_modify_mm and can_modify_vma [RFC,v2,4/8] mseal: Check seal flag for mprotect(2) [RFC,v2,6/8] mseal: Check seal flag for mremap(2) [RFC,v2,8/8] selftest mm/mseal mprotect/munmap/mremap/mmap

Jeff Xu Oct. 17, 2023, 9:08 a.m. UTC

From: Jeff Xu <jeffxu@google.com>

This patchset proposes a new mseal() syscall for the Linux kernel.

Modern CPUs support memory permissions such as RW and NX bits. Linux has
supported NX since the release of kernel version 2.6.8 in August 2004 [1].
The memory permission feature improves security stance on memory
corruption bugs, i.e. the attacker can’t just write to arbitrary memory
and point the code to it, the memory has to be marked with X bit, or
else an exception will happen. The protection is set by mmap(2),
mprotect(2), mremap(2).

Memory sealing additionally protects the mapping itself against
modifications. This is useful to mitigate memory corruption issues where
a corrupted pointer is passed to a memory management syscall. For example,
such an attacker primitive can break control-flow integrity guarantees
since read-only memory that is supposed to be trusted can become writable
or .text pages can get remapped. Memory sealing can automatically be
applied by the runtime loader to seal .text and .rodata pages and
applications can additionally seal security critical data at runtime.
A similar feature already exists in the XNU kernel with the
VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4].
Also, Chrome wants to adopt this feature for their CFI work [2] and this
patchset has been designed to be compatible with the Chrome use case.

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned long types, unsigned long flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to comments in mseal.c. Those are also fully
covered by the selftest.

types: bit mask to specify which syscall to seal.

Five syscalls can be sealed, as specified by bitmasks:
MM_SEAL_MPROTECT: Deny mprotect(2)/pkey_mprotect(2).
MM_SEAL_MUNMAP: Deny munmap(2).
MM_SEAL_MMAP: Deny mmap(2).
MM_SEAL_MREMAP: Deny mremap(2).
MM_SEAL_MSEAL: Deny adding a new seal type.

Each bit represents sealing for one specific syscall type, e.g.
MM_SEAL_MPROTECT will deny mprotect syscall. The consideration of bitmask
is that the API is extendable, i.e. when needed, the sealing can be
extended to madvise, mlock, etc. Backward compatibility is also easy.

The kernel will remember which seal types are applied, and the application
doesn’t need to repeat all existing seal types in the next mseal().  Once
a seal type is applied, it can’t be unsealed. Call mseal() on an existing
seal type is a no-action, not a failure.

MM_SEAL_MSEAL will deny mseal() calls that try to add a new seal type.

Internally, vm_area_struct adds a new field vm_seals, to store the bit
masks.

For the affected syscalls, such as mprotect, a check(can_modify_mm) for
sealing is added, this usually happens at the early point of the syscall,
before any update is made to VMAs. The effect of that is: if any of the
VMAs in the given address range fails the sealing check, none of the VMA
will be updated.

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5],  Chrome browser in ChromeOS will be the first user of this API.

[1] https://kernelnewbies.org/Linux_2_6_8 
[2] https://v8.dev/blog/control-flow-integrity 
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 
[4] https://man.openbsd.org/mimmutable.2 
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc 
 
PATCH history:

v1:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add detailed commit message.

v0:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

Jeff Xu (8):
  mseal: Add mseal(2) syscall.
  mseal: Wire up mseal syscall
  mseal: add can_modify_mm and can_modify_vma
  mseal: Check seal flag for mprotect(2)
  mseal: Check seal flag for munmap(2)
  mseal: Check seal flag for mremap(2)
  mseal:Check seal flag for mmap(2)
  selftest mm/mseal mprotect/munmap/mremap/mmap

 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |    1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 fs/aio.c                                    |    5 +-
 include/linux/mm.h                          |   44 +-
 include/linux/mm_types.h                    |    7 +
 include/linux/syscalls.h                    |    2 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 include/uapi/linux/mman.h                   |    6 +
 ipc/shm.c                                   |    3 +-
 kernel/sys_ni.c                             |    1 +
 mm/Kconfig                                  |    8 +
 mm/Makefile                                 |    1 +
 mm/internal.h                               |    4 +-
 mm/mmap.c                                   |   57 +-
 mm/mprotect.c                               |   15 +
 mm/mremap.c                                 |   30 +-
 mm/mseal.c                                  |  268 ++++
 mm/nommu.c                                  |    6 +-
 mm/util.c                                   |    8 +-
 tools/testing/selftests/mm/Makefile         |    1 +
 tools/testing/selftests/mm/mseal_test.c     | 1428 +++++++++++++++++++
 37 files changed, 1891 insertions(+), 28 deletions(-)
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c

Jeff Xu Oct. 18, 2023, 7:01 a.m. UTC | #1

Hi Linus,

On Tue, Oct 17, 2023 at 10:43 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 17 Oct 2023 at 10:04, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Honestly, there is only two kinds of sealing that makes sense:
> >
> >  - you cannot change the permissions of some area
> >
> >  - you cannot unmap an area
>
> Actually, I guess at least theoretically, there could be three different things:
>
>  - you cannot move an area
>
Yes.

Actually, the newly added selftest covers some of the above:
1. can't change the permission of some areas.
test_seal_mprotect
test_seal_mmap_overwrite_prot

2. can't unmap an area (thus mmap() to the same address later)
test_seal_munmap

3. can't move to an area:
test_seal_mremap_move         //can't move from a sealed area:
test_seal_mremap_move_fixed_zero //can't move from a sealed area to a
fixed address
test_seal_mremap_move_fixed   //can't move to a sealed area.

4 can't expand or shrink the area:
test_seal_mremap_shrink
test_seal_mremap_expand

> although I do think that maybe just saying "you cannot unmap" might
> also include "you cannot move".
>
> But I guess it depends on whether you feel it's the virtual _address_
> you are protecting, or whether it's the concept of mapping something.
>
> I personally think that from a security perspective, what you want to
> protect is a particular address. That implies that "seal from
> unmapping" would thus also include "you can't move this area
> elsewhere".
>
> But at least conceptually, splitting "unmap" and "move" apart might
> make some sense. I would like to hear a practical reason for it,
> though.
>
> Without that practical reason, I think the only two sane sealing operations are:
>
>  - SEAL_MUNMAP: "don't allow this mapping address to go away"
>
>    IOW no unmap, no shrinking, no moving mremap
>
>  - SEAL_MPROTECT: "don't allow any mapping permission changes"
>
I agree with the concept in general. The separation of two seal types
is easy to understand.

For mmap(MAP_FIXED), I know for a fact that it can modify permission of
an existing mapping, (as in selftest:test_seal_mmap_overwrite_prot).
I think it can also expand an existing VMA. This is not a problem, code-wise,
I mention it here, because it needs extra care when coding mmap() change.

> Again, that permission case might end up being "don't allow
> _additional_ permissions" and "don't allow taking permissions away".
> Or it could be split by operation (ie "don't allow permission changes
> to writability / readability / executability respectively").
>
Yes. If the application desires this, it can also be done.
i.e. seal of X bit, or seal of W bit, this will be similar to file sealing.
I discussed this with Stephan before, at this point of time,  Chrome
doesn't have a use case.

> I suspect there isn't a real-life example of splitting the
> SEAL_MPROTECT (the same way I doubt there's a real-life example for
> splitting the UNMAP into "unmap vs move"), so unless there is some
> real reason, I'd keep the sealing minimal and to just those two flags.
>
I think two seal-type (permission and unmap/move/expand/shrink)
will work for the Chrome case. Stephen Röttger is an expert in Chrome,
on vacation/ be back soon. I will wait for Stephen to confirm.

> We could always add more flags later, if there is a real use case
> (IOW, if we start with "don't allow any permission changes", we could
> add a flag later that just says "don't allow writability changes").
>
Agreed 100%, thanks for understanding.

-Jeff


>                Linus

Jeff Xu Oct. 18, 2023, 3:08 p.m. UTC | #2

On Tue, Oct 17, 2023 at 9:54 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 17 Oct 2023 at 02:08, <jeffxu@chromium.org> wrote:
> >
> > Of all the call paths that call into do_vmi_munmap(),
> > this is the only place where checkSeals = MM_SEAL_MUNMAP.
> > The rest has checkSeals = 0.
>
> Why?
>
> None of this makes sense.
>
> So you say "we can't munmap in this *one* place, but all others ignore
> the sealing".
>
I apologize that previously, I described what this code does, and not reasoning.

In our threat model, as Stephen Röttger point out in [1], and I quote:

V8 exploits typically follow a similar pattern: an initial bug leads
to memory corruption but often the initial corruption is limited and
the attacker has to find a way to arbitrarily read/write in the whole
address space.

The memory correction is in the user space process, e.g. Chrome.
Attackers will try to modify permission of the memory, by calling
mprotect,  or munmap then mmap to the same address but with different
permission, etc.

Sealing blocks mprotect/munmap/mremap/mmap call from the user space
process, e.g. Chrome.

At time of handling those 4 syscalls, we need to check the seal (
can_modify_mm), this requires locking the VMA (
mmap_write_lock_killable), and ideally, after validating the syscall
input. The reasonable place for can_modify_mm() is from utility
functions, such as do_mmap(), do_vmi_munmap(), etc.

However, there is no guarantee that do_mmap() and do_vmi_munmap() are
only reachable from mprotect/munmap/mremap/mmap syscall entry point
(SYSCALL_DEFINE_XX). In theory,  the kernel can call those in other
scenarios, and some of them can be perfectly legit. Those other
scenarios are not covered by our threat model at this time. Therefore,
we need a flag, passed from the SYSCALL_DEFINE_XX entry , down to
can_modify_mm(), to differentiate those other scenarios.

Now, back to code, it did some optimization, i.e. doesn't pass the
flag from SYSCALL_DEFINE_XX  in all cases. If SYSCALL_DEFINE_XX calls
do_a, and do_a has only one caller, I will set the flag in do_a,
instead of SYSCALL_DEFINE_XX. Doing this reduces the size of the
patchset, but it also makes the code less readable indeed. I could
remove this optimization in V3. I welcome suggestions to improve
readability on this.

When handing the mmap/munmap/mremap/mmap, once the code passed
can_modify_mm(), it means the memory area is not sealed, if the code
continues to call the other utility functions, we don't need to check
the seal again. This is the case for mremap(), the seal of src address
and dest address (when applicable) are checked first, later when the
code calls  do_vmi_munmap(), it no longer needs to check the seal
again.

[1] https://v8.dev/blog/control-flow-integrity

-Jeff

Jeff Xu Oct. 18, 2023, 5:14 p.m. UTC | #3

On Wed, Oct 18, 2023 at 8:08 AM Jeff Xu <jeffxu@google.com> wrote:
>
> On Tue, Oct 17, 2023 at 9:54 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 17 Oct 2023 at 02:08, <jeffxu@chromium.org> wrote:
> > >
> > > Of all the call paths that call into do_vmi_munmap(),
> > > this is the only place where checkSeals = MM_SEAL_MUNMAP.
> > > The rest has checkSeals = 0.
> >
> > Why?
> >
> > None of this makes sense.
> >
> > So you say "we can't munmap in this *one* place, but all others ignore
> > the sealing".
> >
> I apologize that previously, I described what this code does, and not reasoning.
>
> In our threat model, as Stephen Röttger point out in [1], and I quote:
>
> V8 exploits typically follow a similar pattern: an initial bug leads
> to memory corruption but often the initial corruption is limited and
> the attacker has to find a way to arbitrarily read/write in the whole
> address space.
>
> The memory correction is in the user space process, e.g. Chrome.
> Attackers will try to modify permission of the memory, by calling
> mprotect,  or munmap then mmap to the same address but with different
> permission, etc.
>
> Sealing blocks mprotect/munmap/mremap/mmap call from the user space
> process, e.g. Chrome.
>
> At time of handling those 4 syscalls, we need to check the seal (
> can_modify_mm), this requires locking the VMA (
> mmap_write_lock_killable), and ideally, after validating the syscall
> input. The reasonable place for can_modify_mm() is from utility
> functions, such as do_mmap(), do_vmi_munmap(), etc.
>
> However, there is no guarantee that do_mmap() and do_vmi_munmap() are
> only reachable from mprotect/munmap/mremap/mmap syscall entry point
> (SYSCALL_DEFINE_XX). In theory,  the kernel can call those in other
> scenarios, and some of them can be perfectly legit. Those other
> scenarios are not covered by our threat model at this time. Therefore,
> we need a flag, passed from the SYSCALL_DEFINE_XX entry , down to
> can_modify_mm(), to differentiate those other scenarios.
>
> Now, back to code, it did some optimization, i.e. doesn't pass the
> flag from SYSCALL_DEFINE_XX  in all cases. If SYSCALL_DEFINE_XX calls
> do_a, and do_a has only one caller, I will set the flag in do_a,
> instead of SYSCALL_DEFINE_XX. Doing this reduces the size of the
> patchset, but it also makes the code less readable indeed. I could
> remove this optimization in V3. I welcome suggestions to improve
> readability on this.
>
> When handing the mmap/munmap/mremap/mmap, once the code passed
> can_modify_mm(), it means the memory area is not sealed, if the code
> continues to call the other utility functions, we don't need to check
> the seal again. This is the case for mremap(), the seal of src address
> and dest address (when applicable) are checked first, later when the
> code calls  do_vmi_munmap(), it no longer needs to check the seal
> again.
>
> [1] https://v8.dev/blog/control-flow-integrity
>
> -Jeff

There is also alternative approach:

For all the places that call do_vmi_munmap(), find out which
case should ignore the sealing flag legitimately, set an ignore_seal
flag and pass it down into do_vmi_munmap(). For the rest case,
use default behavior.

All future API will automatically be covered for sealing, by using default.

The risky side, if I missed a case that requires setting ignore_seal,
there will be a bug.

Also if a driver calls the utility functions to unmap a memory, the
seal will be checked as well. (Driver is not in our threat model,
but Chrome probably doesn't mind it.)

Which of those two approaches are better ? I appreciate the direction on this.

Thanks!
-Jeff


-Jeff

Linus Torvalds Oct. 18, 2023, 6:27 p.m. UTC | #4

On Wed, 18 Oct 2023 at 10:14, Jeff Xu <jeffxu@google.com> wrote:
>
> There is also alternative approach:
>
> For all the places that call do_vmi_munmap(), find out which
> case should ignore the sealing flag legitimately,

NO.

Christ.

THERE ARE NO LEGITIMATE CASES OF IGNORING SEALING FLAGS.

If you ignore a sealing flag, it's not a sealing flag. It's random
crap, and claiming that it has *anything* to do with security is just
a cruel joke.

Really.

Stop this. I do not want to hear your excuses for garbage any more.
We're done. If I hear any more arguments for this sh*t, I will
literally put you in my ignore file, and will auto-NAK any future
patches.

This is simply not up for discussion. Any flag for "ignore sealing" is wrong.

We do have one special "unmap" case, namely "unmap_vmas()' called at
last mmput() -> __mmput() -> exit_mmap().

And yes, that is called at munmap() time too, but that's after the
point of no return after we've already removed the vma's from the VM
lists. So it's long after any error cases have been checked.

             Linus

Jeff Xu Oct. 18, 2023, 7:07 p.m. UTC | #5

On Wed, Oct 18, 2023 at 11:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 18 Oct 2023 at 10:14, Jeff Xu <jeffxu@google.com> wrote:
> This is simply not up for discussion. Any flag for "ignore sealing" is wrong.
>
> We do have one special "unmap" case, namely "unmap_vmas()' called at
> last mmput() -> __mmput() -> exit_mmap().
>
> And yes, that is called at munmap() time too, but that's after the
> point of no return after we've already removed the vma's from the VM
> lists. So it's long after any error cases have been checked.
>
Ah. I see.
I didn't know there was no legit case, which is what I worried about before.
this flag can be removed.

Stephen Röttger Oct. 19, 2023, 7:27 a.m. UTC | #6

> Without that practical reason, I think the only two sane sealing operations are:
>
>  - SEAL_MUNMAP: "don't allow this mapping address to go away"
>
>    IOW no unmap, no shrinking, no moving mremap
>
>  - SEAL_MPROTECT: "don't allow any mapping permission changes"
>
> Again, that permission case might end up being "don't allow
> _additional_ permissions" and "don't allow taking permissions away".
> Or it could be split by operation (ie "don't allow permission changes
> to writability / readability / executability respectively").
>
> I suspect there isn't a real-life example of splitting the
> SEAL_MPROTECT (the same way I doubt there's a real-life example for
> splitting the UNMAP into "unmap vs move"), so unless there is some
> real reason, I'd keep the sealing minimal and to just those two flags.

These two flags are exactly what we would use in Chrome. I can't think of a
use case for a more fine grained split either.

David Laight Oct. 19, 2023, 9:19 a.m. UTC | #7

From: jeffxu@chromium.org
> Sent: 17 October 2023 10:08
> 
> This patchset proposes a new mseal() syscall for the Linux kernel.

I'm sure you can give it a better name, there isn't a 6 character
limit on identifiers!

FWIW you could also use mprotect(addr, len, IMMUTABLE);

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[RFC,v2,0/8] Introduce mseal() syscall

Message

Comments