Message ID | 20240131175027.3287009-1-jeffxu@chromium.org |
---|---|
Headers | show |
Series | Introduce mseal | expand |
Jeff Xu <jeffxu@chromium.org> wrote: > I considered Theo's inputs from OpenBSD's perspective regarding the > difference, and I wasn't convinced that Linux should remove these. In > my view, those are two different kernels code, and the difference in > Linux is not added without reasons (for MAP_SEALABLE, there is a note > in the documentation section with details). That note is describing a fiction. > I would love to hear more from Linux developers on this. I'm not sure you are capable of listening. But I'll repeat for others to stop this train wreck: 1. When execve() maps a programs's .data section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE? Does the kernel seal the .data section? It cannot, because of RELRO and IFUNCS. Do you know what those are? (like in OpenBSD) the kernel cannot and will *not* seal the .data section, it lets later code do that. 2. When execve() maps a programs's .bss section, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE? Does the kernel seal the .bss section? It cannot, because of RELRO and IFUNCS. Do you know what those are? (like in OpenBSD) the kernel cannot and will *not* seal the .bss section, it lets later code do that. In the proposed diff, the kernel does not set MAP_SEALABLE on those regions. How does a userland program seal the .data and .bss regions? It cannot. It is too late to set the MAP_SEALABLE, because the kernel already decided not do to it. So those regions cannot be sealed. 3. When execve() maps a programs's stack, does the kernel set MAP_SEALABLE on that region? Or does it not set MAP_SEALABLE? In the proposed diff, the kernel does not set MAP_SEALABLE. You think you can seal the stack in the kernel?? Sorry to be the bearer of bad news, but glibc has code which on occasion will mprotects the stack executable. But if userland decides that mprotect case won't occur -- how does a userland program seal its stack? It is now too late to set MAP_SEALABLE. So the stack must remain unsealed. 4. What about the text segment? 5. Do you know what a text-relocation is? They are now rare, but there are still compile/linker stages which will produce them, and there is software which requires that to work. It means userland fixes it's own .text, then calls mprotect. The kernel does not know if this will happen. 6. When execve() maps the .text segment, will it set MAP_SEALABLE? If it doesn't set it, userland cannot seal it's text after it makes the decision to do. You can continue to extrapolate those same points for all other segments of a static binary, all segments of a dynamic binary, all segments of the shared library linker. And then you can go further, and recognize the logic that will be needed in the shared library linker to *make the same decisions*. In each case, the *decision* to make a mapping happens in one piece of code, and the decision to use and NOW SEAL THAT MAPPING, happens in a different piece of code. The only answer to these problems will be to always set MAP_SEALABLE. To go through the entire Linux ecosystem, and change every call to mmap() to use this new MAP_SEALABLE flag, and it will look something like this: +#ifndef MAP_SEALABLE +#define MAP_SEALABLE 0 +#endif - ptr = mmap(...., MAP... - ptr = mmap(...., MAP_SEALABLE | MAP... Every single one of them, and you'll need to do it in the kernel. If you had spent a second trying to make this work in a second piece of software, you would have realized that the ONLY way this could work is by adding a flag with the opposite meaning: MAP_NOTSEALABLE But nothing will use that. I promise you > I would love to hear more from Linux developers on this. I'm not sure you are capable of listening.
> -----Original Message----- > From: Theo de Raadt <deraadt@openbsd.org> > > I would love to hear more from Linux developers on this. > > I'm not sure you are capable of listening. > Theo, It is possible to make your technical points, and even to express frustration that it has been difficult to get them across, without resorting to personal attacks. -- Tim
> To me, this means Linus is OK with the general signatures of the APIs.
Linus, you are in for a shock when the proposal doesn't work for glibc
and all the applications!
On Thu, 1 Feb 2024 at 14:54, Theo de Raadt <deraadt@openbsd.org> wrote: > > Linus, you are in for a shock when the proposal doesn't work for glibc > and all the applications! Heh. I've enjoyed seeing your argumentative style that made you so famous back in the days. Maybe it's always been there, but I haven't seen the BSD people in so long that I'd forgotten all about it. That said, famously argumentative or not, I think Theo is right, and I do think the MAP_SEALABLE bit is nonsensical. If somebody wants to mseal() a memory region, why would they need to express that ahead of time? So the part I think is sane is the mseal() system call itself, in that it allows *potential* future expansion of the semantics. But hopefully said future expansion isn't even needed, and all users want the base experience, which is why I think PROT_SEAL (both to mmap and to mprotect) makes sense as an alternative form. So yes, to my mind mprotect(addr, len, PROT_READ); mseal(addr, len, 0); should basically give identical results to mprotect(addr, len, PROT_READ | PROT_SEAL); and using PROT_SEAL at mmap() time is similarly the same obvious notion of "map this, and then seal that mapping". The reason for having "mseal()" as a separate call at all from the PROT_SEAL bit is that it does allow possible future expansion (while PROT_SEAL is just a single bit, and it won't change semantics) but also so that you can do whatever prep-work in stages if you want to, and then just go "now we seal it all". Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > So yes, to my mind > > mprotect(addr, len, PROT_READ); > mseal(addr, len, 0); > > should basically give identical results to > > mprotect(addr, len, PROT_READ | PROT_SEAL); > > and using PROT_SEAL at mmap() time is similarly the same obvious > notion of "map this, and then seal that mapping". I think that isn't easy to do. Let's expand it to show error checking. if (mprotect(addr, len, PROT_READ) == -1) react to the errno value if (mseal(addr, len, 0) == -1) react to the errno value and if (mprotect(addr, len, PROT_READ | PROT_SEAL) == -1) react to the errno value For current mprotect(), the errno values are mostly related to range issues with the parameters. After sealing a region, mprotect() also has the new errno EPERM. But what is the return value supposed to be from "PROT_READ | PROT_SEAL" over various sub-region types? Say I have a region 3 pages long. One page is unmapped, one page is regular, and one page is sealed. Re-arrange those 3 pages in all 6 permutations. Try them all. Does the returned errno change, based upon the order? Does it do part of the operation, or all of the operation? If the sealed page is first, the regular page is second, and the unmapped page is 3rd, does it return an error or return 0? Does it change the permission on the 3rd page? If it returns an error, has it changed any permissions? I don't think the diff follows the principle of if an error is returned --> we know nothing was changed. if success is returned --> we know all the requests were satisfied > The reason for having "mseal()" as a separate call at all from the > PROT_SEAL bit is that it does allow possible future expansion (while > PROT_SEAL is just a single bit, and it won't change semantics) but > also so that you can do whatever prep-work in stages if you want to, > and then just go "now we seal it all". How about you add basic mseal() that is maximum compatible with mimmutable(), and then we can all talk about whether PROT_SEAL makes sense once there are applications that demand it, and can prove they need it?
* Jeff Xu <jeffxu@google.com> [240201 22:15]: > On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > * Jeff Xu <jeffxu@chromium.org> [240131 20:27]: > > > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett > > > <Liam.Howlett@oracle.com> wrote: > > > > > > > > Having to opt-in to allowing mseal will probably not work well. > I'm leaving the opt-in discussion in Linus's thread. > > > Initial library mappings happen in one huge chunk then it's cut up into > > smaller VMAs, at least that's what I see with my maple tree tracing. If > > you opt-in, then the entire library will have to opt-in and so the > > 'discourage inadvertent sealing' argument is not very strong. > > > Regarding "The initial library mappings happen in one huge chunk then > it is cut up into smaller VMAS", this is not a problem. > > As example of elf loading (fs/binfmt_elf.c), there is just a few > places to pass in what type of memory to be allocated, e.g. > MAP_PRIVATE, MAP_FIXED_NOREPLACE, we can add MAP_SEALABLE at those > places. > If glic does additional splitting on the memory range, by using > mprotect(), then the MAP_SEALABLE is automatically applied after > splitting. > If glic uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE). You are adding a flag that requires a new glibc. When I try to point out how this is unnecessary and excessive, you tell me it's fine and probably not a whole lot of work. This isn't working with developers, you are dismissing the developers who are trying to help you. Can you please: Provide code that uses this feature. Provide benchmark results where you apply mseal to 1, 2, 4, 8, 16, and 32 VMAs. Provide code that tests and checks the failure paths. Failures at the start, middle, and end of the modifications. Document what happens in those failure paths. And, most importantly: keep an open mind and allow your opinion to change when presented with new information. All of these things are to help you. We need to know what needs fixing so you can be successful. Thanks, Liam
On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote: > > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Jeff Xu <jeffxu@chromium.org> wrote: > > > > > Even without free. > > > I personally do not like the heap getting sealed like that. > > > > > > Component A. > > > p=malloc(4096); > > > writing something to p. > > > > > > Compohave nent B: > > > mprotect(p,4096, RO) > > > mseal(p,4096) > > > > > > This will split the heap VMA, and prevent the heap from shrinking, if > > > this is in a frequent code path, then it might hurt the process's > > > memory usage. > > > > > > The existing code is more likely to use malloc than mmap(), so it is > > > easier for dev to seal a piece of data belonging to another component. > > > I hope this pattern is not wide-spreading. > > > > > > The ideal way will be just changing the library A to use mmap. > > > > I think you are lacking some test programs to see how it actually > > behaves; the effect is worse than you think, and the impact is immediately > > visible to the programmer, and the lesson is clear: > > > > you can only seal objects which you gaurantee never get recycled. > > > > Pushing a sealed object back into reuse is a disasterous bug. > > > > Noone should call this interface, unless they understand that. > > > > I'll say again, you don't have a test program for various allocators to > > understand how it behaves. The failure modes described in your docuemnts > > are not correct. > > > I understand what you mean: I will add that part to the document: > Try to recycle a sealed memory is disastrous, e.g. > p=malloc(4096); > mprotect(p,4096,RO) > mseal(p,4096) > free(p); > > My point is: > I think sealing an object from the heap is a bad pattern in general, > even dev doesn't free it. That was one of the reasons for the sealable > flag, I hope saying this doesn't be perceived as looking for excuses. The point you're missing is that adding MAP_SEALABLE reduces composability. With MAP_SEALABLE, everything that mmaps some part of the address space that may ever be sealed will need to be modified to know about MAP_SEALABLE. Say you did the same thing for mprotect. MAP_PROTECT would control the mprotectability of the map. You'd stop: p = malloc(4096); mprotect(p, 4096, PROT_READ); free(p); ! But you'd need to change every spot that mmap()'s something to know about and use MAP_PROTECT: all "producers" of mmap memory would need to know about the consumers doing mprotect(). So now either all mmap() callers mindlessly add MAP_PROTECT out of fear the consumers do mprotect (and you gain nothing from MAP_PROTECT), or the mmap() callers need to know the consumers call mprotect(), and thus you introduce a huge layering violation (and you actually lose from having MAP_PROTECT). Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() operation. For example, if chrome runs on an older glibc that does not know about MAP_SEALABLE, it will not be able to mseal() its own shared libraries' .text (even if, yes, that should ideally be left to ld.so). IMO, UNIX API design has historically mostly been "play stupid games, win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) work. If you close stdout (and don't dup/reopen something to stdout) and printf(), things will break, and you get to keep both pieces. There's no O_CLOSEABLE, just as there's no O_DUPABLE.
> What I'm more concerned about is what happens if you call mseal() on a > range and it can mseal a portion. Like, what happens to the first vma > in your test_seal_unmapped_middle case? I see it returns an error, but > is the first VMA mseal()'ed? (no it's not, but test that) That is correct, Liam. Unix system calls must be atomic. They either return an error, and that is a promise they made no changes. Or they do the work required, and then return success. In OpenBSD, all mimmutable() aspects were carefully studied to gaurantee this behaviour. I am not an expert in the Linux kernel to make the assessment; someone who is qualified must make that assessment. Fuzzing with tests is a good way to judge it simpler.
On Fri, Feb 2, 2024 at 12:37 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Fri, 2 Feb 2024 at 11:32, Theo de Raadt <deraadt@openbsd.org> wrote: > > > > Unix system calls must be atomic. > > > > They either return an error, and that is a promise they made no changes. > > That's actually not true, and never has been. > > It's a good thing to aim for, but several errors means "some or all > may have been done". > > EFAULT (for various system calls), ENOMEM and other errors are all > things that can happen after some of the system call has already been > done, and the rest failed. > > There are lots of examples, but to pick one obvious VM example, > something like mlock() may well return an error after the area has > been successfully locked, but then the population of said pages failed > for some reason. > > Of course, implementations can differ, and POSIX sometimes has insane > language that is actively incorrect. > > Furthermore, the definition of "atomic" is unclear. For example, POSIX > claims that a "write()" system call is one atomic thing for regular > files, and some people think that means that you see all or nothing. > That's simply not true, and you'll see the write progress in various > indirect ways (look at intermediate file size with 'stat', look at > intermediate contents with 'mmap' etc etc). > > So I agree that atomicity is something that people should always > *strive* for, but it's not some kind of final truth or absolute > requirement. > > In the specific case of mseal(), I suspect there are very few reasons > ever *not* to be atomic, so in this particular context atomicity is > likely always something that should be guaranteed. But I just wanted > to point out that it's most definitely not a black-and-white issue in > the general case. > Thanks. At least I got this part done right for mseal() :-) -Jeff > Linus >
On Fri, Feb 2, 2024 at 10:52 AM Pedro Falcato <pedro.falcato@gmail.com> wrote: > > On Fri, Feb 2, 2024 at 5:59 PM Jeff Xu <jeffxu@chromium.org> wrote: > > > > On Thu, Feb 1, 2024 at 9:00 PM Theo de Raadt <deraadt@openbsd.org> wrote: > > > > > > Jeff Xu <jeffxu@chromium.org> wrote: > > > > > > > Even without free. > > > > I personally do not like the heap getting sealed like that. > > > > > > > > Component A. > > > > p=malloc(4096); > > > > writing something to p. > > > > > > > > Compohave nent B: > > > > mprotect(p,4096, RO) > > > > mseal(p,4096) > > > > > > > > This will split the heap VMA, and prevent the heap from shrinking, if > > > > this is in a frequent code path, then it might hurt the process's > > > > memory usage. > > > > > > > > The existing code is more likely to use malloc than mmap(), so it is > > > > easier for dev to seal a piece of data belonging to another component. > > > > I hope this pattern is not wide-spreading. > > > > > > > > The ideal way will be just changing the library A to use mmap. > > > > > > I think you are lacking some test programs to see how it actually > > > behaves; the effect is worse than you think, and the impact is immediately > > > visible to the programmer, and the lesson is clear: > > > > > > you can only seal objects which you gaurantee never get recycled. > > > > > > Pushing a sealed object back into reuse is a disasterous bug. > > > > > > Noone should call this interface, unless they understand that. > > > > > > I'll say again, you don't have a test program for various allocators to > > > understand how it behaves. The failure modes described in your docuemnts > > > are not correct. > > > > > I understand what you mean: I will add that part to the document: > > Try to recycle a sealed memory is disastrous, e.g. > > p=malloc(4096); > > mprotect(p,4096,RO) > > mseal(p,4096) > > free(p); > > > > My point is: > > I think sealing an object from the heap is a bad pattern in general, > > even dev doesn't free it. That was one of the reasons for the sealable > > flag, I hope saying this doesn't be perceived as looking for excuses. > > The point you're missing is that adding MAP_SEALABLE reduces > composability. With MAP_SEALABLE, everything that mmaps some part of > the address space that may ever be sealed will need to be modified to > know about MAP_SEALABLE. > > Say you did the same thing for mprotect. MAP_PROTECT would control the > mprotectability of the map. You'd stop: > > p = malloc(4096); > mprotect(p, 4096, PROT_READ); > free(p); > > ! But you'd need to change every spot that mmap()'s something to know > about and use MAP_PROTECT: all "producers" of mmap memory would need > to know about the consumers doing mprotect(). So now either all mmap() > callers mindlessly add MAP_PROTECT out of fear the consumers do > mprotect (and you gain nothing from MAP_PROTECT), or the mmap() > callers need to know the consumers call mprotect(), and thus you > introduce a huge layering violation (and you actually lose from having > MAP_PROTECT). > > Hopefully you can map the above to MAP_SEALABLE. Or to any other m*() > operation. For example, if chrome runs on an older glibc that does not > know about MAP_SEALABLE, it will not be able to mseal() its own shared > libraries' .text (even if, yes, that should ideally be left to ld.so). > I think I have heard enough complaints about MAP_SEALABLE from Linux developers and Linus in the last two days to convince myself that it is a bad idea :) For the last time, I was trying to limit the scope of mseal() limited to two known cases. And MAP_SEALABLE is a reversible decision, a system ctrl can turn it off, or we can obsolete it in future. (this was mentioned in the document of V8). I will rest my case. Obviously from the feedback, it is loud and clear that we want to be able to seal all the memory. > IMO, UNIX API design has historically mostly been "play stupid games, > win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) > work. If you close stdout (and don't dup/reopen something to stdout) > and printf(), things will break, and you get to keep both pieces. > There's no O_CLOSEABLE, just as there's no O_DUPABLE. > > -- > Pedro
... > IMO, UNIX API design has historically mostly been "play stupid games, > win stupid prizes", which is e.g: why things like close(STDOUT_FILENO) > work. If you close stdout (and don't dup/reopen something to stdout) > and printf(), things will break, and you get to keep both pieces. That is pretty much why libraries must never use printf(). (Try telling that to people at work!) In the days when processes could only have 20 files open it was a much bigger problem. You couldn't afford to not use 0, 1 and 2. A certain daemon ended up using fd 1 as a pipe to another daemon. Someone accidentally used printf() instead of fprintf() for a trace. When the 10k stdio buffer filled the text got written to the pipe. The expected fixed size message had a 32bit 'trailer' size. Although no defined messages supported trailers the second daemon synchronously discarded the trailer - with the expected side effect. Wasn't my bug, and someone else found it, but I'd read the broken code a few times without seeing the fubar. Trouble is it all worked for quite a long time... David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Feb 2, 2024 at 8:46 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Linus Torvalds <torvalds@linux-foundation.org> [240202 18:36]: > > On Fri, 2 Feb 2024 at 13:18, Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > There will be a larger performance cost to checking up front without > > > allowing the partial completion. > > > > I suspect that for mseal(), the only half-way common case will be > > sealing an area that is entirely contained within one vma. > > Agreed. > > > > > So the cost will be the vma splitting (if it's not the whole vma), and > > very unlikely to be any kind of "walk the vma's to check that they can > > all be sealed" loop up-front. > > That's the cost of calling mseal(), and I think that will be totally > reasonable. > > I'm more concerned with the other calls that do affect more than one vma > that will now have to ensure there is not an mseal'ed vma among the > affected area. > > As you pointed out, we don't do atomic updates and so we have to add a > loop at the beginning to check this new special case, which is what this > patch set does today. That means we're going to be looping through > twice for any call that could fail if one is mseal'ed. This includes > munmap() and mprotect(). > > The impact will vary based on how many vma's are handled. I'd like some > numbers on this so we can see if it is a concern, which Jeff has agreed > to provide in the future - Thank you, Jeff. Yes please. The additional walk Liam points to seems to be happening even if we don't use mseal at all. Android apps often create thousands of VMAs, so a small regression to a syscall like mprotect might cause a very visible regression to app launch times (one of the key metrics for Android). Having performance impact numbers here would be very helpful. > > It also means we're modifying the behaviour of those calls so they could > fail before anything changes (regardless of where the failure would > occur), and we could still fail later when another aspect of a vma would > cause a failure as we do today. We are paying the price for a more > atomic update, but we aren't trying very hard to be atomic with our > updates - we don't have many (virtually no) vma checks before > modifications start. > > For instance, we could move the mprotect check for map_deny_write_exec() > to the pre-update loop to make it more atomic in nature. This one seems > somewhat related to mseal, so it would be better if they were both > checked atomic(ish) together. Although, I wonder if the user visible > changes would be acceptable and worth the risk. > > We will have two classes of updates to vma's: the more atomic view and > the legacy view. The question of what happens when the two mix, or > where a specific check should go will get (more) confusing. > > Thanks, > Liam >
From: Jeff Xu <jeffxu@chromium.org> This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. In addition: mmap() has two related changes. The PROT_SEAL bit in prot field of mmap(). When present, it marks the map sealed since creation. The MAP_SEALABLE bit in the flags field of mmap(). When present, it marks the map as sealable. A map created without MAP_SEALABLE will not support sealing, i.e. mseal() will fail. Applications that don't care about sealing will expect their behavior unchanged. For those that need sealing support, opt-in by adding MAP_SEALABLE in mmap(). The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Pedro Falcato: suggesting sealing in the mmap(). Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Change history: =============== V8: - perf optimization in mmap. (Liam R. Howlett) - add one testcase (test_seal_zero_address) - Update mseal.rst to add note for MAP_SEALABLE. V7: - fix index.rst (Randy Dunlap) - fix arm build (Randy Dunlap) - return EPERM for blocked operations (Theo de Raadt) https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/ V6: - Drop RFC from subject, Given Linus's general approval. - Adjust syscall number for mseal (main Jan.11/2024) - Code style fix (Matthew Wilcox) - selftest: use ksft macros (Muhammad Usama Anjum) - Document fix. (Randy Dunlap) https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/ V5: - fix build issue in mseal-Wire-up-mseal-syscall (Suggested by Linus Torvalds, and Greg KH) - updates on selftest. https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r V4: (Suggested by Linus Torvalds) - new signature: mseal(start,len,flags) - 32 bit is not supported. vm_seal is removed, use vm_flags instead. - single bit in vm_flags for sealed state. - CONFIG_MSEAL kernel config is removed. - single bit of PROT_SEAL in the "Prot" field of mmap(). Other changes: - update selftest (Suggested by Muhammad Usama Anjum) - update documentation. https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/ V3: - Abandon per-syscall approach, (Suggested by Linus Torvalds). - Organize sealing types around their functionality, such as MM_SEAL_BASE, MM_SEAL_PROT_PKEY. - Extend the scope of sealing from calls originated in userspace to both kernel and userspace. (Suggested by Linus Torvalds) - Add seal type support in mmap(). (Suggested by Pedro Falcato) - Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent destructive operations of madvise. (Suggested by Jann Horn and Stephen Röttger) - Make sealed VMAs mergeable. (Suggested by Jann Horn) - Add MAP_SEALABLE to mmap() - Add documentation - mseal.rst https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/ v2: Use _BITUL to define MM_SEAL_XX type. Use unsigned long for seal type in sys_mseal() and other functions. Remove internal VM_SEAL_XX type and convert_user_seal_type(). Remove MM_ACTION_XX type. Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask. Add more comments in code. Add a detailed commit message. https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/ v1: https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/ ---------------------------------------------------------------- [1] https://kernelnewbies.org/Linux_2_6_8 [2] https://v8.dev/blog/control-flow-integrity [3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274 [4] https://man.openbsd.org/mimmutable.2 [5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc [6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/ [7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/ Jeff Xu (4): mseal: Wire up mseal syscall mseal: add mseal syscall selftest mm/mseal memory sealing mseal:add documentation Documentation/userspace-api/index.rst | 1 + Documentation/userspace-api/mseal.rst | 215 ++ arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/mman-common.h | 8 + include/uapi/asm-generic/unistd.h | 5 +- kernel/sys_ni.c | 1 + mm/Makefile | 4 + mm/internal.h | 48 + mm/madvise.c | 12 + mm/mmap.c | 35 +- mm/mprotect.c | 10 + mm/mremap.c | 31 + mm/mseal.c | 343 ++++ tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 1 + tools/testing/selftests/mm/mseal_test.c | 2024 +++++++++++++++++++ 33 files changed, 2756 insertions(+), 3 deletions(-) create mode 100644 Documentation/userspace-api/mseal.rst create mode 100644 mm/mseal.c create mode 100644 tools/testing/selftests/mm/mseal_test.c