mbox series

[RFC,v3,00/11] Introduce mseal()

Message ID 20231212231706.2680890-1-jeffxu@chromium.org
Headers show
Series Introduce mseal() | expand

Message

Jeff Xu Dec. 12, 2023, 11:16 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and point the code to it. The
memory must be marked with the X bit, or else an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

The new mseal() is an architecture independent syscall, and with
following signature:

mseal(void addr, size_t len, unsigned long types, unsigned long flags)

addr/len: memory range.  Must be continuous/allocated memory, or else
mseal() will fail and no VMA is updated. For details on acceptable
arguments, please refer to documentation patch (mseal.rst) of this
patch set. Those are also fully covered by the selftest.

types: bit mask to specify the sealing types.
MM_SEAL_BASE
MM_SEAL_PROT_PKEY
MM_SEAL_DISCARD_RO_ANON
MM_SEAL_SEAL

The MM_SEAL_BASE:
The base package includes the features common to all VMA sealing
types. It prevents sealed VMAs from:
1> Unmapping, moving to another location, and shrinking the size, via
munmap() and mremap(), can leave an empty space, therefore can be
replaced with a VMA with a new set of attributes.
2> Move or expand a different vma into the current location, via mremap().
3> Modifying sealed VMA via mmap(MAP_FIXED).
4> Size expansion, via mremap(), does not appear to pose any specific
risks to sealed VMAs. It is included anyway because the use case is
unclear. In any case, users can rely on merging to expand a sealed
VMA.

We consider the MM_SEAL_BASE feature, on which other sealing features
will depend. For instance, it probably does not make sense to seal
PROT_PKEY without sealing the BASE, and the kernel will implicitly add
SEAL_BASE for SEAL_PROT_PKEY.

The MM_SEAL_PROT_PKEY:
Seal PROT and PKEY of the address range, i.e. mprotect() and
pkey_mprotect() will be denied if the memory is sealed with
MM_SEAL_PROT_PKEY.

The MM_SEAL_DISCARD_RO_ANON
Certain types of madvise() operations are destructive [6], such as
MADV_DONTNEED, which can effectively alter region contents by
discarding pages, especially when memory is anonymous. This blocks
such operations for anonymous memory which is not writable to the
user.

The MM_SEAL_SEAL
MM_SEAL_SEAL denies adding a new seal for an VMA.
This is similar to F_SEAL_SEAL in fcntl.


The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this
API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable, the lifetime of those mappings
are tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings are not tied to the lifetime of the process, therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory. For example, with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

--------------------------------------------------------------------
Change history:
===============
V3:
- Abandon per-syscall approach, (Suggested by Linus Torvalds).
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap() (Detail see new discussions)
- Add documentation - mseal.rst

Work in progress:
=================
- update man page for mseal() and mmap()

Open discussions:
=================
Several open discussions from V1/V2 were not incorporated into V3. I
believe these are worth more discussion for future versions of this
patch series.

1> mseal() vs mimmutable()
mseal(bitmasks for multiple seal types)
	BASE + PROT_PKEY+ MM_SEAL_DISCARD_RO_ANON
        Apply PROT_PKEY implies BASE, same for DISCARD_RO_ANON

mimmutable() (openBSD)
        This is equal to SEAL_BASE + SEAL_PROT_PKEY in mseal()
        Plus allowing downgrade from W=>NW (OpenBSD)
        Doesn’t have MM_SEAL_DISCARD_RO_ANON

mimmutable() is designed for memory sealing in libc, and mseal()
is designed for both Chrome browser and libc.

For the two memory ranges that Chrome browser wants to seal, as
discussed previously, the allocator still needs to free or discard
memory for the sealed memory. For performance reasons, we have
explored two solutions in the past: first, using PKEY-based munmap()
[7], and second, separating SEAL_MPROTECT (v1 of this patch set).
Recently, we have experimented with an alternative approach that
allows us to remove the separation of SEAL_MPROTECT. For those two
memory ranges, Chrome browser will use BASE + PROT_PKEY +
DISCARD_RO_ANON for all its sealing needs.

In the case of libc, the .text segment can be sealed with the BASE and
PROT_PKEY, and the RO data segments can be sealed with the BASE +
PROT_PKEY + DISCARD_RO_ANON.

Comments

Linus Torvalds Dec. 13, 2023, 12:39 a.m. UTC | #1
On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> +
> +**types**: bit mask to specify the sealing types, they are:

I really want a real-life use-case for more than one bit of "don't modify".

IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?

Or when would you *ever* say "seal this area, but mprotect()" is ok.

IOW, I want to know why we don't just do the BSD immutable thing, and
why we need this multi-level sealing thing.

               Linus
Greg KH Dec. 13, 2023, 7:24 a.m. UTC | #2
On Tue, Dec 12, 2023 at 11:16:55PM +0000, jeffxu@chromium.org wrote:
> +config MSEAL
> +	default n

Minor nit, "n" is always the default, no need to call it out here.

> +	bool "Enable mseal() system call"
> +	depends on MMU
> +	help
> +	  Enable the virtual memory sealing.
> +	  This feature allows sealing each virtual memory area separately with
> +	  multiple sealing types.

You might want to include more documentation as to what this is for,
otherwise distros / users will not know if they need to enable this
or not.

thanks,

greg k-h
Jeff Xu Dec. 14, 2023, 12:35 a.m. UTC | #3
On Tue, Dec 12, 2023 at 4:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 12 Dec 2023 at 15:17, <jeffxu@chromium.org> wrote:
> > +
> > +**types**: bit mask to specify the sealing types, they are:
>
> I really want a real-life use-case for more than one bit of "don't modify".
>
For the real-life use case question, Stephen Röttger and I put
description in the cover letter as well as the open discussion section
(mseal() vs immutable()) of patch 0/11.  Perhaps you are looking for more
details in chrome usage of the API, e.g. code-wise ?

> IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
>
The MADV_DONTNEED is OK for file-backed mapping.
As state in man page of madvise: [1]

"subsequent accesses of pages in the range will succeed,  but will
result in either repopulating the memory contents from the up-to-date
contents of the underlying mapped file"

> Or when would you *ever* say "seal this area, but mprotect()" is ok.
>
The fact  that openBSD allows RW=>RO transaction, as in its man page [2]

 "  At present, mprotect(2) may reduce permissions on immutable pages
  marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

suggests application might desire multiple ways to seal the "PROT" bits.

E.g.
Applications that wants a full lockdown of PROT and PKEY might use
SEAL_PROT_PKEY (Chrome case and implemented in this patch)

Application that desires RW=>RO transaction, might implement
SEAL_PROT_DOWNGRADEABLE, or specifically allow RW=>RO.
(not implemented but can be added in future as extension if  needed.)

> IOW, I want to know why we don't just do the BSD immutable thing, and
> why we need this multi-level sealing thing.
>
The details are discussed in mseal() vs immutable()) of the cover letter
(patch 0/11)

In short, BSD's immutable is designed specific for libc case, and Chrome
case is just different (e.g. the lifetime of those mappings and requirement of
free/discard unused memory).

Single bit vs multi-bits are still up for discussion.
If there are strong opinions on the multiple-bits approach, (and
no objection on applying MM_SEAL_DISCARD_RO_ANON to the .text part
during libc dynamic loading, which has no effect anyway because it is
file backed.), we could combine all three bits into one. A side note is that we
could not add something such as SEAL_PROT_DOWNGRADEABLE later,
since pkey_mprotect is sealed.

I'm open to one bit approach. If we took that approach,
We might consider the following:

mseal() or
mseal(flags), flags are reserved for future use.

I appreciate a direction on this.

 [1] https://man7.org/linux/man-pages/man2/madvise.2.html
 [2] https://man.openbsd.org/mimmutable.2

-Jeff



>                Linus
Theo de Raadt Dec. 14, 2023, 1:09 a.m. UTC | #4
Jeff Xu <jeffxu@google.com> wrote:

> > Or when would you *ever* say "seal this area, but mprotect()" is ok.
> >
> The fact  that openBSD allows RW=>RO transaction, as in its man page [2]
> 
>  "  At present, mprotect(2) may reduce permissions on immutable pages
>   marked PROT_READ | PROT_WRITE to the less permissive PROT_READ."

Let me explain this.

We encountered two places that needed this less-permission-transition.

Both of these problems were found in either .data or bss, which the
kernel makes immutable by default.  The OpenBSD kernel makes those
regions immutable BY DEFAULT, and there is no way to turn that off.

One was in our libc malloc, which after initialization, wants to protect
a control data structure from being written in the future.

The other was in chrome v8, for the v8_flags variable, this is
similarily mprotected to less permission after initialization to avoid
tampering (because it's an amazing relative-address located control
gadget).

We introduced a different mechanism to solve these problem.

So we added a new ELF section which annotates objects you need to be
MUTABLE.  If these are .data or .bss, they are placed in the MUTABLE
region annotated with the following Program Header:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  OPENBSD_MUTABLE 0x0e9000 0x00000000000ec000 0x00000000000ec000 0x001000 0x001000 RW  0x1000

associated with this Section Header

  [20] .openbsd.mutable  PROGBITS        00000000000ec000 0e9000 001000 00  WA  0   0 4096

(It is vaguely similar to RELRO).

You can place objects there using the a compiler __attribute__((section
declaration, like this example from our libc/malloc.c code

static union {
        struct malloc_readonly mopts;
        u_char _pad[MALLOC_PAGESIZE];
} malloc_readonly __attribute__((aligned(MALLOC_PAGESIZE)))
                __attribute__((section(".openbsd.mutable")));

During startup the code can set the protection and then the immutability
of the object correctly.

Since we have no purpose left for this permission reduction semantic
upon immutable mappings, we may be deleting that behaviour in the
future.  I wrote that code, because I needed it to make progress with some
difficult pieces of code.  But we found a better way.
Linus Torvalds Dec. 14, 2023, 1:31 a.m. UTC | #5
On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
>
>
> > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> >
> The MADV_DONTNEED is OK for file-backed mapping.

Right. It makes no semantic difference. So there's no point to it.

My point was that you added this magic flag for "not ok for RO anon mapping".

It's such a *completely* random flag, that I go "that's just crazy
random - make sealing _always_ disallow that case".

So what I object to in this series is basically random small details
that should just eb part of the basic act of sealing.

I think sealing should just mean "you can't do any operations that
have semantic meaning for the mapping, because it is SEALED".

So I think sealing should automatically mean "can't do MADV_DONTNEED
on anon memory", because that's basically equivalent to a munmap/remap
operation.

I also think that sealing should just automatically mean "can't do
mprotect any more".

And yes, the OpenBSD semantics of "immutable" apparently allowed
reducing permissions, but even the openbsd man-page seems to think
that was a bug, so we should just not allow it. And the openbsd case
seems to be because of how they made certain things immutable by
default, which is different from what this mseal() thing is.

End result: I'd really like to make the thing conceptually simpler,
rather than add all those random (*very* random in case of
MADV_DONTNEED) special cases.

Is there any actual practical example of why you'd want a half-sealed thing?

And no, I didn't read the pdf that was attached. If it can't just be
explained in plain language, it's not an explanation.

I'd love for "sealed" to be just a single bit in the vm_flags things
that we already have. Not a config option. Not some complicated thing
that is hard to explain. A simple "I have set up this mapping, you
can't change it any more".

And if it cannot be that kind of thing, I want to have clear and
obvious examples of why it can't be that simple thing.

Not a pdf file that describes some google-chrome design. Something
down-to-earth and practical (and not a "we might want this in the
future" thing either).

IOW, what is wrong with "THIS VMA SETUP CANNOT BE CHANGED ANY MORE"?

Nothing less, but also nothing more. No random odd bits that need explaining.

              Linus
Theo de Raadt Dec. 14, 2023, 3:04 p.m. UTC | #6
Jeff Xu <jeffxu@google.com> wrote:

> In short, BSD's immutable is designed specific for libc case, and Chrome
> case is just different (e.g. the lifetime of those mappings and requirement of
> free/discard unused memory).

That is not true.  During the mimmutable design I took the entire
software ecosystem into consideration.  Not just libc.  That is either
uncharitable or uninformed.

In OpenBSD, pretty much the only thing which calls mimmutable() is the
shared library linker, which does so on all possible regions of all DSO
objects, not just libc.

For example, chrome loads 96 libraries, and all their text/data/bss/etc
are immutable. All the static address space is immutable.  It's the same
for all other programs running in OpenBSD -- only transient heap and
mmap spaces remain permission mutable.

It is not just libc.

What you are trying to do here with chrome is bring some sort of
soft-immutable management to regions of memory, so that trusted parts of
chrome can still change the permissions, but untrusted / gadgetry parts
of chrome cannot change the permissions.  That's a very different thing
than what I set out to do with mimmutable().  I'm not aware of any other
piece of software that needs this.  I still can't wrap my head around
the assurance model of the design. 

Maybe it is time to stop comparing mseal() to mimmutable().

Also, maybe this proposal should be using the name chromesyscall()
instead -- then it could be extended indefinitely in the future...
Stephen Röttger Dec. 14, 2023, 6:06 p.m. UTC | #7
On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> >
> >
> > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > >
> > The MADV_DONTNEED is OK for file-backed mapping.
>
> Right. It makes no semantic difference. So there's no point to it.
>
> My point was that you added this magic flag for "not ok for RO anon mapping".
>
> It's such a *completely* random flag, that I go "that's just crazy
> random - make sealing _always_ disallow that case".
>
> So what I object to in this series is basically random small details
> that should just eb part of the basic act of sealing.
>
> I think sealing should just mean "you can't do any operations that
> have semantic meaning for the mapping, because it is SEALED".
>
> So I think sealing should automatically mean "can't do MADV_DONTNEED
> on anon memory", because that's basically equivalent to a munmap/remap
> operation.

In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.
We have a pkey-tagged heap and code region for JIT code. The regions are
writable by page permissions, but we use the pkey to control write access.
These regions are mmapped at process startup and we want to seal them to ensure
that the pkey and page permissions can't change.
Since these regions are used for dynamic allocations, we still need a way to
release unneeded resources, i.e. madvise(DONTNEED) unused pages on free().

AIUI, the madvise(DONTNEED) should effectively only change the content of
anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
added this special case: if you want to madvise(DONTNEED) an anonymous page,
you should have write permissions to the page.

In our allocator, on free we can then release resources via:
* allow pkey writes
* madvise(DONTNEED)
* disallow pkey writes
Pedro Falcato Dec. 14, 2023, 8:11 p.m. UTC | #8
On Thu, Dec 14, 2023 at 6:07 PM Stephen Röttger <sroettger@google.com> wrote:
>
> On Thu, Dec 14, 2023 at 2:31 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 13 Dec 2023 at 16:36, Jeff Xu <jeffxu@google.com> wrote:
> > >
> > >
> > > > IOW, when would you *ever* say "seal this area, but MADV_DONTNEED is ok"?
> > > >
> > > The MADV_DONTNEED is OK for file-backed mapping.
> >
> > Right. It makes no semantic difference. So there's no point to it.
> >
> > My point was that you added this magic flag for "not ok for RO anon mapping".
> >
> > It's such a *completely* random flag, that I go "that's just crazy
> > random - make sealing _always_ disallow that case".
> >
> > So what I object to in this series is basically random small details
> > that should just eb part of the basic act of sealing.
> >
> > I think sealing should just mean "you can't do any operations that
> > have semantic meaning for the mapping, because it is SEALED".
> >
> > So I think sealing should automatically mean "can't do MADV_DONTNEED
> > on anon memory", because that's basically equivalent to a munmap/remap
> > operation.
>
> In Chrome, we have a use case to allow MADV_DONTNEED on sealed memory.

I don't want to be that guy (*believe me*), but what if there was a
way to attach BPF programs to mm's? Such that you could handle 'seal
failures' in BPF, and thus allow for this sort of weird semantics?
e.g: madvise(MADV_DONTNEED) on a sealed region fails, kernel invokes
the BPF program (that chrome loaded), BPF program sees it was a
MADV_DONTNEED and allows it to proceed.

It requires BPF but sounds like a good compromise in order to not get
an ugly API?
Linus Torvalds Dec. 14, 2023, 8:14 p.m. UTC | #9
On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
>
> AIUI, the madvise(DONTNEED) should effectively only change the content of
> anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> added this special case: if you want to madvise(DONTNEED) an anonymous page,
> you should have write permissions to the page.

Hmm. I actually would be happier if we just made that change in
general. Maybe even without sealing, but I agree that it *definitely*
makes sense in general as a sealing thing.

IOW, just saying

 "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"

makes 100% sense to me. Having a separate _flag_ to give sensible
semantics is just odd.

IOW, what I really want is exactly that "sensible semantics, not random flags".

Particularly for new system calls with fairly specialized use, I think
it's very important that the semantics are sensible on a conceptual
level, and that we do not add system calls that are based on "random
implementation issue of the day".

Yes, yes, then as we have to maintain things long-term, and we hit
some compatibility issue, at *THAT* point we'll end up facing nasty
"we had an implementation that had these semantics in practice, so now
we're stuck with it", but when introducing a new system call, let's
try really hard to start off from those kinds of random things.

Wouldn't it be lovely if we can just come up with a sane set of "this
is what it means to seal a vma", and enumerate those, and make those
sane conceptual rules be the initial definition. By all means have a
"flags" argument for future cases when we figure out there was
something wrong or the notion needed to be extended, but if we already
*start* with random extensions, I feel there's something wrong with
the whole concept.

So I would really wish for the first version of

     mseal(start, len, flags);

to have "flags=0" be the one and only case we actually handle
initially, and only add a single PROT_SEAL flag to mmap() that says
"create this mapping already pre-sealed".

Strive very hard to make sealing be a single VM_SEALED flag in the
vma->vm_flags that we already have, just admit that none of this
matters on 32-bit architectures, so that VM_SEALED can just use one of
the high flags that we have several free of (and that pkeys already
depends on), and make this a standard feature with no #ifdef's.

Can chrome live with that? And what would the required semantics be?
I'll start the list:

 - you can't unmap or remap in any way (including over-mapping)

 - you can't change protections (but with architecture support like
pkey, you can obviously change the protections indirectly with PKRU
etc)

 - you can't do VM operations that change data without the area being
writable (so the DONTNEED case - maybe there are others)

 - anything else?

Wouldn't it be lovely to have just a single notion of sealing that is
well-documented and makes sense, and doesn't require people to worry
about odd special cases?

And yes, we'd have the 'flags' argument for future special cases, and
hope really hard that it's never needed.

           Linus
Jeff Xu Dec. 14, 2023, 10:52 p.m. UTC | #10
On Thu, Dec 14, 2023 at 12:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 14 Dec 2023 at 10:07, Stephen Röttger <sroettger@google.com> wrote:
> >
> > AIUI, the madvise(DONTNEED) should effectively only change the content of
> > anonymous pages, i.e. it's similar to a memset(0) in that case. That's why we
> > added this special case: if you want to madvise(DONTNEED) an anonymous page,
> > you should have write permissions to the page.
>
> Hmm. I actually would be happier if we just made that change in
> general. Maybe even without sealing, but I agree that it *definitely*
> makes sense in general as a sealing thing.
>
> IOW, just saying
>
>  "madvise(DONTNEED) needs write permissions to an anonymous mapping when sealed"
>
> makes 100% sense to me. Having a separate _flag_ to give sensible
> semantics is just odd.
>
> IOW, what I really want is exactly that "sensible semantics, not random flags".
>
> Particularly for new system calls with fairly specialized use, I think
> it's very important that the semantics are sensible on a conceptual
> level, and that we do not add system calls that are based on "random
> implementation issue of the day".
>
> Yes, yes, then as we have to maintain things long-term, and we hit
> some compatibility issue, at *THAT* point we'll end up facing nasty
> "we had an implementation that had these semantics in practice, so now
> we're stuck with it", but when introducing a new system call, let's
> try really hard to start off from those kinds of random things.
>
> Wouldn't it be lovely if we can just come up with a sane set of "this
> is what it means to seal a vma", and enumerate those, and make those
> sane conceptual rules be the initial definition. By all means have a
> "flags" argument for future cases when we figure out there was
> something wrong or the notion needed to be extended, but if we already
> *start* with random extensions, I feel there's something wrong with
> the whole concept.
>
> So I would really wish for the first version of
>
>      mseal(start, len, flags);
>
> to have "flags=0" be the one and only case we actually handle
> initially, and only add a single PROT_SEAL flag to mmap() that says
> "create this mapping already pre-sealed".
>
> Strive very hard to make sealing be a single VM_SEALED flag in the
> vma->vm_flags that we already have, just admit that none of this
> matters on 32-bit architectures, so that VM_SEALED can just use one of
> the high flags that we have several free of (and that pkeys already
> depends on), and make this a standard feature with no #ifdef's.
>
> Can chrome live with that? And what would the required semantics be?
> I'll start the list:
>
>  - you can't unmap or remap in any way (including over-mapping)
>
>  - you can't change protections (but with architecture support like
> pkey, you can obviously change the protections indirectly with PKRU
> etc)
>
>  - you can't do VM operations that change data without the area being
> writable (so the DONTNEED case - maybe there are others)
>
>  - anything else?
>
> Wouldn't it be lovely to have just a single notion of sealing that is
> well-documented and makes sense, and doesn't require people to worry
> about odd special cases?
>
> And yes, we'd have the 'flags' argument for future special cases, and
> hope really hard that it's never needed.
>

Yes, those inputs make a lot of sense !
I will start the next version. In the meantime, if there are more
comments, please continue to post, I appreciate those to make the API
better and simple to use.

-Jeff

>            Linus
>
Theo de Raadt Jan. 20, 2024, 3:23 p.m. UTC | #11
Some notes about compatibility between mimmutable(2) and mseal().

This morning, the "RW -> R demotion" code in mimmutable(2) was removed.
As described previously, that was a development crutch to solved a problem
but we found a better way with a new ELF section which is available at
compile time with __attribute__((section(".openbsd.mutable"))).   Which
works great.

I am syncronizing the madvise / msync behaviour further, we will be compatible.
I have worried about madvise / msync for a long time, and audited vast amounts
of the software ecosystem to come to a conclusion we can be more strict, but
I never acted upon it.

BTW, on OpenBSD and probably other related BSD operating systems,
MADV_DONTNEED is non-destructive.  However we have a destructive
operation called MADV_FREE.  msync() MS_INVALIDATE is also destructive.
But all of these operations will now be prohibited, to syncronize the
error return value situation.

There is an one large difference remainig between mimmutable() and mseal(),
which is how other system calls behave.

We return EPERM for failures in all the system calls that fail upon
immutable memory (since Oct 2022).

You are returning EACESS.

Before it is too late, do you want to reconsider that return value, or
do you have a justification for the choice?

I think this remains the blocker which would prevent software from doing

#define mimmutable(addr, len)  mseal(addr, len, 0)
Linus Torvalds Jan. 20, 2024, 4:40 p.m. UTC | #12
On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> There is an one large difference remainig between mimmutable() and mseal(),
> which is how other system calls behave.
>
> We return EPERM for failures in all the system calls that fail upon
> immutable memory (since Oct 2022).
>
> You are returning EACESS.
>
> Before it is too late, do you want to reconsider that return value, or
> do you have a justification for the choice?

I don't think there's any real reason for the difference.

Jeff - mind changing the EACESS to EPERM, and we'll have something
that is more-or-less compatible between Linux and OpenBSD?

             Linus
Theo de Raadt Jan. 20, 2024, 4:59 p.m. UTC | #13
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
> 
> I don't think there's any real reason for the difference.
> 
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?

(I tried to remember why I chose EPERM, replaying the view from the
German castle during kernel compiles...)

In mmap, EACCESS already means something.

     [EACCES]           The flag PROT_READ was specified as part of the prot
                        parameter and fd was not open for reading.  The flags
                        MAP_SHARED and PROT_WRITE were specified as part of
                        the flags and prot parameters and fd was not open for
                        writing.

In mprotect, the situation is similar

     [EACCES]           The process does not have sufficient access to the
                        underlying memory object to provide the requested
                        protection.

immutable isn't an aspect of the underlying object, but an aspect of the
mapping.

Anyways, it is common for one errno value to have multiple causes.

But this error-aliasing can make it harder to figure things out when
studying a "system call trace" a program, and I strongly believe in
keeping systems are simple as possible.

For all the memory mapping control operations, EPERM was available and
unambiguous.
Jeff Xu Jan. 21, 2024, 12:16 a.m. UTC | #14
On Sat, Jan 20, 2024 at 8:40 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sat, 20 Jan 2024 at 07:23, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > There is an one large difference remainig between mimmutable() and mseal(),
> > which is how other system calls behave.
> >
> > We return EPERM for failures in all the system calls that fail upon
> > immutable memory (since Oct 2022).
> >
> > You are returning EACESS.
> >
> > Before it is too late, do you want to reconsider that return value, or
> > do you have a justification for the choice?
>
> I don't think there's any real reason for the difference.
>
> Jeff - mind changing the EACESS to EPERM, and we'll have something
> that is more-or-less compatible between Linux and OpenBSD?
>
Sounds Good. I will make the necessary changes in the next version.

-Jeff

>              Linus