Message ID | 20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org |
---|---|
Headers | show |
Series | fork: Support shadow stacks in clone3() | expand |
On Mon, Nov 20, 2023 at 11:54:28PM +0000, Mark Brown wrote: > The kernel has recently added support for shadow stacks, currently > x86 only using their CET feature but both arm64 and RISC-V have > equivalent features (GCS and Zicfiss respectively), I am actively > working on GCS[1]. With shadow stacks the hardware maintains an > additional stack containing only the return addresses for branch > instructions which is not generally writeable by userspace and ensures > that any returns are to the recorded addresses. This provides some > protection against ROP attacks and making it easier to collect call > stacks. These shadow stacks are allocated in the address space of the > userspace process. > > Our API for shadow stacks does not currently offer userspace any > flexiblity for managing the allocation of shadow stacks for newly > created threads, instead the kernel allocates a new shadow stack with > the same size as the normal stack whenever a thread is created with the > feature enabled. The stacks allocated in this way are freed by the > kernel when the thread exits or shadow stacks are disabled for the > thread. This lack of flexibility and control isn't ideal, in the vast > majority of cases the shadow stack will be over allocated and the > implicit allocation and deallocation is not consistent with other > interfaces. As far as I can tell the interface is done in this manner > mainly because the shadow stack patches were in development since before > clone3() was implemented. > > Since clone3() is readily extensible let's add support for specifying a > shadow stack when creating a new thread or process in a similar manner So while I made clone3() readily extensible I don't want it to ever devolve into a fancier version of a prctl(). I would really like to see a strong reason for allowing userspace to configure the shadow stack size at this point in time. I have a few questions that are probably me just not knowing much about shadow stacks so hopefully I'm not asking you write a thesis by accident: (1) What does it mean for a shadow stack to be over allocated and is over-allocation really that much of a problem out in the wild that we need to give I userspace a knob to control a kernel security feature? (2) With what other interfaces is implicit allocation and deallocation not consistent? I don't understand this argument. The kernel creates a shadow stack as a security measure to store return addresses. It seems to me exactly that the kernel should implicitly allocate and deallocate the shadow stack and not have userspace muck around with its size? (3) Why is it safe for userspace to request the shadow stack size? What if they request a tiny shadow stack size? Should this interface require any privilege? (4) Why isn't the @stack_size argument I added for clone3() enough? If it is specified can't the size of the shadow stack derived from it? And my current main objection is that shadow stacks were just released to userspace. There can't be a massive amount of users yet - outside of maybe early adopters. The fact that there are other architectures that bring in a similar feature makes me even more hesitant. If they have all agreed _and_ implemented shadow stacks and have unified semantics then we can consider exposing control knobs to userspace that aren't implicitly architecture specific currently. So I don't have anything against the patches per obviously but with the wider context. Thanks!
The 11/21/2023 11:17, Christian Brauner wrote: > On Mon, Nov 20, 2023 at 11:54:28PM +0000, Mark Brown wrote: > > The kernel has recently added support for shadow stacks, currently > > x86 only using their CET feature but both arm64 and RISC-V have > > equivalent features (GCS and Zicfiss respectively), I am actively > > working on GCS[1]. With shadow stacks the hardware maintains an > > additional stack containing only the return addresses for branch > > instructions which is not generally writeable by userspace and ensures > > that any returns are to the recorded addresses. This provides some > > protection against ROP attacks and making it easier to collect call > > stacks. These shadow stacks are allocated in the address space of the > > userspace process. > > > > Our API for shadow stacks does not currently offer userspace any > > flexiblity for managing the allocation of shadow stacks for newly > > created threads, instead the kernel allocates a new shadow stack with > > the same size as the normal stack whenever a thread is created with the > > feature enabled. The stacks allocated in this way are freed by the > > kernel when the thread exits or shadow stacks are disabled for the > > thread. This lack of flexibility and control isn't ideal, in the vast > > majority of cases the shadow stack will be over allocated and the > > implicit allocation and deallocation is not consistent with other > > interfaces. As far as I can tell the interface is done in this manner > > mainly because the shadow stack patches were in development since before > > clone3() was implemented. > > > > Since clone3() is readily extensible let's add support for specifying a > > shadow stack when creating a new thread or process in a similar manner > > So while I made clone3() readily extensible I don't want it to ever > devolve into a fancier version of a prctl(). > > I would really like to see a strong reason for allowing userspace to > configure the shadow stack size at this point in time. > > I have a few questions that are probably me just not knowing much about > shadow stacks so hopefully I'm not asking you write a thesis by > accident: > > (1) What does it mean for a shadow stack to be over allocated and is > over-allocation really that much of a problem out in the wild that > we need to give I userspace a knob to control a kernel security > feature? over-allocation: allocating 100M shadow stack (RLIMIT_DATA, RLIMIT_AS) for a thread that keeps arrays of data on the stack, instead of say 8k, and thus running into resource limits. under-allocation: small thread stack with runaway recursion, but large sigaltstack: the stack overflow handler can run out of space because it uses the same shadow stack as the thread. > (2) With what other interfaces is implicit allocation and deallocation > not consistent? I don't understand this argument. The kernel creates > a shadow stack as a security measure to store return addresses. It > seems to me exactly that the kernel should implicitly allocate and > deallocate the shadow stack and not have userspace muck around with > its size? the kernel is not supposed to impose stack size policy or a particular programming model that limits the stack management options nor prevent the handling of stack overflows. > (3) Why is it safe for userspace to request the shadow stack size? What > if they request a tiny shadow stack size? Should this interface > require any privilege? user can allocate huge or tiny stacks already. and i think userspace can take control over shadow stack management: it can disable signals, start a clone child with stack_size == 1 page, map_shadow_stack and switch to it, enable signals. however this is complicated, leaks 1 page of kernel allocated shadow stack (+reserved guard page, i guess userspace could unmap, not sure if that works currently) and requires additional syscalls. > (4) Why isn't the @stack_size argument I added for clone3() enough? > If it is specified can't the size of the shadow stack derived from it? shadow stack only contains return addresses so it is proportional to the number of stack frames, not the stack size and it must account for sigaltstack too, not just the thread stack. if you make minimal assumptions about stack usage and ignore the sigaltstack issue then the worst case shadow stack requirement is indeed proportional to the stack_size, but this upper bound can be pessimistic and userspace knows the tradeoffs better. > > And my current main objection is that shadow stacks were just released > to userspace. There can't be a massive amount of users yet - outside of > maybe early adopters. no upstream libc has code to enable shadow stacks at this point so there are exactly 0 users in the open. (this feature requires runtime support) the change is expected to allow wider deployability. (e.g. not just in glibc) > > The fact that there are other architectures that bring in a similar > feature makes me even more hesitant. If they have all agreed _and_ > implemented shadow stacks and have unified semantics then we can > consider exposing control knobs to userspace that aren't implicitly > architecture specific currently. > > So I don't have anything against the patches per obviously but with the > wider context. > > Thanks!
On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote: > The 11/21/2023 11:17, Christian Brauner wrote: > > I have a few questions that are probably me just not knowing much about > > shadow stacks so hopefully I'm not asking you write a thesis by > > accident: One thing it feels like it's worth saying up front here is that shadow stacks are userspace memory with special permissions and instructions for access - they are mapped into the userspace address range and userspace can directly interact with them in restricted ways. For example there's some thought to using shadow stacks in unwinders since all the return addresses are stored in a single convenient block of memory which it's much harder to corrupt. Overflowing a shadow stack results in userspace getting a memory access fault just as with other memory access issues. > > (2) With what other interfaces is implicit allocation and deallocation > > not consistent? I don't understand this argument. The kernel creates > > a shadow stack as a security measure to store return addresses. It > > seems to me exactly that the kernel should implicitly allocate and > > deallocate the shadow stack and not have userspace muck around with > > its size? > the kernel is not supposed to impose stack size policy or a particular > programming model that limits the stack management options nor prevent > the handling of stack overflows. The inconsistency here is with the management of the standard stack - with the standard stack userspace passes an already allocated address range to the kernel. A constant tension during review of the shadow stack interfaces has been that shadow stack memory is userspace memory but the security constraints mean that we've come down on the side of having a custom allocation syscall for it instead of using flags on mmap() and friends like people often expect, and now having it allocated as part of clone3(). The aim is to highlight that this difference is deliberately chosen for specific reasons rather than just carelessness. > > (3) Why is it safe for userspace to request the shadow stack size? What > > if they request a tiny shadow stack size? Should this interface > > require any privilege? > user can allocate huge or tiny stacks already. > and i think userspace can take control over shadow stack management: > it can disable signals, start a clone child with stack_size == 1 page, > map_shadow_stack and switch to it, enable signals. however this is > complicated, leaks 1 page of kernel allocated shadow stack (+reserved > guard page, i guess userspace could unmap, not sure if that works > currently) and requires additional syscalls. The other thing here is that if userspace gets this wrong it'll result in the userspace process hitting the top of the stack and getting fatal signals in a similar manner to what happens if it gets the size of the standard stack wrong (the kernel allocation does mean that there should always be guard pages and it's harder to overrun the stack and corrupt adjacent memory). There doesn't seem to be any meaningful risk here over what userspace can already do to itself anyway as part of thread allocation. > > (4) Why isn't the @stack_size argument I added for clone3() enough? > > If it is specified can't the size of the shadow stack derived from it? > shadow stack only contains return addresses so it is proportional > to the number of stack frames, not the stack size and it must > account for sigaltstack too, not just the thread stack. > if you make minimal assumptions about stack usage and ignore the > sigaltstack issue then the worst case shadow stack requirement > is indeed proportional to the stack_size, but this upper bound > can be pessimistic and userspace knows the tradeoffs better. It's also worth pointing out here that the existing shadow stack support for x86 and in review code for arm64 make exactly these assumptions and guesses at a shadow stack size based on the stack_size for the thread. There's just been a general lack of enthusiasm for the fact that due to the need to store variables on the normal stack the resulting shadow stack is very likely to be substantially overallocated but we can't safely reduce the size without information from userspace. > > And my current main objection is that shadow stacks were just released > > to userspace. There can't be a massive amount of users yet - outside of > > maybe early adopters. > no upstream libc has code to enable shadow stacks at this point > so there are exactly 0 users in the open. (this feature requires > runtime support) > the change is expected to allow wider deployability. (e.g. not > just in glibc) Right, and the lack of any userspace control of the shadow stack size has been a review concern with the arm64 GCS series which I'm trying to address here. The main concern is that userspaces that start a lot of threads are going to start using a lot more address space than they need to when shadow stacks are enabled. Given the fairly long deployment pipeline from extending a syscall to end users who might be using the feature in conjuction with imposing resource limits it does seem like a reasonable problem to anticipate. > > The fact that there are other architectures that bring in a similar > > feature makes me even more hesitant. If they have all agreed _and_ > > implemented shadow stacks and have unified semantics then we can > > consider exposing control knobs to userspace that aren't implicitly > > architecture specific currently. To be clear the reason I'm working on this is that I've implemented the arm64 support, I don't even have any access to x86 systems that have the feature (hence the RFT in the subject line) - Rick Edgecombe did the x86 work. The arm64 code is still in review, the userspace interface is very similar to that for x86 and there doesn't seem to be any controversy there which makes me expect that a change is likely. Unlike x86 we only have a spec and virtual implementations at present, there's no immintent hardware, so we're taking our time with making sure that everything is working well. Deepak Gupta (in CC) has been reviewing the series from the point of view of RISC-V. I think we're all fairly well aligned on the requirements here.
On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote: > On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote: > > The 11/21/2023 11:17, Christian Brauner wrote: > > > > I have a few questions that are probably me just not knowing much about > > > shadow stacks so hopefully I'm not asking you write a thesis by > > > accident: > > One thing it feels like it's worth saying up front here is that shadow > stacks are userspace memory with special permissions and instructions > for access - they are mapped into the userspace address range and > userspace can directly interact with them in restricted ways. For > example there's some thought to using shadow stacks in unwinders since > all the return addresses are stored in a single convenient block of > memory which it's much harder to corrupt. Overflowing a shadow stack > results in userspace getting a memory access fault just as with other > memory access issues. Thanks for that summary. > > > > (2) With what other interfaces is implicit allocation and deallocation > > > not consistent? I don't understand this argument. The kernel creates > > > a shadow stack as a security measure to store return addresses. It > > > seems to me exactly that the kernel should implicitly allocate and > > > deallocate the shadow stack and not have userspace muck around with > > > its size? > > > the kernel is not supposed to impose stack size policy or a particular > > programming model that limits the stack management options nor prevent > > the handling of stack overflows. > > The inconsistency here is with the management of the standard stack - > with the standard stack userspace passes an already allocated address > range to the kernel. A constant tension during review of the shadow > stack interfaces has been that shadow stack memory is userspace memory > but the security constraints mean that we've come down on the side of > having a custom allocation syscall for it instead of using flags on > mmap() and friends like people often expect, and now having it allocated > as part of clone3(). The aim is to highlight that this difference is So you have two interfaces for allocating a shadow stack. The first one is to explicitly alloc a shadow stack via the map_shadow_stack(). The second one is an implicit allocation during clone3() and you want to allow explicitly influencing that. > deliberately chosen for specific reasons rather than just carelessness. > > > > (3) Why is it safe for userspace to request the shadow stack size? What > > > if they request a tiny shadow stack size? Should this interface > > > require any privilege? > > > user can allocate huge or tiny stacks already. > > > and i think userspace can take control over shadow stack management: > > it can disable signals, start a clone child with stack_size == 1 page, > > map_shadow_stack and switch to it, enable signals. however this is > > complicated, leaks 1 page of kernel allocated shadow stack (+reserved > > guard page, i guess userspace could unmap, not sure if that works > > currently) and requires additional syscalls. > > The other thing here is that if userspace gets this wrong it'll result > in the userspace process hitting the top of the stack and getting fatal > signals in a similar manner to what happens if it gets the size of > the standard stack wrong (the kernel allocation does mean that there > should always be guard pages and it's harder to overrun the stack and > corrupt adjacent memory). There doesn't seem to be any meaningful risk > here over what userspace can already do to itself anyway as part of > thread allocation. clone3() _aimed_ to cleanup the stack handling a bit but we had concerns that deviating too much from legacy clone() would mean userspace couldn't fully replace it. So we would have liked to clean up stack handling a lot more but there's limits to that. We do however perform basic sanity checks now. > > > > (4) Why isn't the @stack_size argument I added for clone3() enough? > > > If it is specified can't the size of the shadow stack derived from it? > > > shadow stack only contains return addresses so it is proportional > > to the number of stack frames, not the stack size and it must > > account for sigaltstack too, not just the thread stack. > > > if you make minimal assumptions about stack usage and ignore the > > sigaltstack issue then the worst case shadow stack requirement > > is indeed proportional to the stack_size, but this upper bound > > can be pessimistic and userspace knows the tradeoffs better. > > It's also worth pointing out here that the existing shadow stack support > for x86 and in review code for arm64 make exactly these assumptions and > guesses at a shadow stack size based on the stack_size for the thread. Ok. > There's just been a general lack of enthusiasm for the fact that due to > the need to store variables on the normal stack the resulting shadow > stack is very likely to be substantially overallocated but we can't > safely reduce the size without information from userspace. Ok. > > > > And my current main objection is that shadow stacks were just released > > > to userspace. There can't be a massive amount of users yet - outside of > > > maybe early adopters. > > > no upstream libc has code to enable shadow stacks at this point > > so there are exactly 0 users in the open. (this feature requires > > runtime support) > > > the change is expected to allow wider deployability. (e.g. not > > just in glibc) > > Right, and the lack of any userspace control of the shadow stack size > has been a review concern with the arm64 GCS series which I'm trying to > address here. The main concern is that userspaces that start a lot of > threads are going to start using a lot more address space than they need > to when shadow stacks are enabled. Given the fairly long deployment > pipeline from extending a syscall to end users who might be using the > feature in conjuction with imposing resource limits it does seem like a > reasonable problem to anticipate. Ok, I can see that argument. > > > > The fact that there are other architectures that bring in a similar > > > feature makes me even more hesitant. If they have all agreed _and_ > > > implemented shadow stacks and have unified semantics then we can > > > consider exposing control knobs to userspace that aren't implicitly > > > architecture specific currently. > > To be clear the reason I'm working on this is that I've implemented the > arm64 support, I don't even have any access to x86 systems that have the Yes, I'm aware. > feature (hence the RFT in the subject line) - Rick Edgecombe did the x86 > work. The arm64 code is still in review, the userspace interface is > very similar to that for x86 and there doesn't seem to be any > controversy there which makes me expect that a change is likely. Unlike > x86 we only have a spec and virtual implementations at present, there's > no immintent hardware, so we're taking our time with making sure that > everything is working well. Deepak Gupta (in CC) has been reviewing the > series from the point of view of RISC-V. I think we're all fairly well > aligned on the requirements here. I'm still not enthusiastic that we only have one implementation for this in the kernel. What's the harm in waiting until the arm patches are merged? This shouldn't result in chicken and egg: if the implementations are sufficiently similar then we can do an appropriate clone3() extension.
On Thu, Nov 23, 2023 at 11:10:24AM +0100, Christian Brauner wrote: > On Tue, Nov 21, 2023 at 04:09:40PM +0000, Mark Brown wrote: > > On Tue, Nov 21, 2023 at 12:21:37PM +0000, Szabolcs Nagy wrote: > > > The 11/21/2023 11:17, Christian Brauner wrote: > > > > (2) With what other interfaces is implicit allocation and deallocation > > > > not consistent? I don't understand this argument. The kernel creates > > > > a shadow stack as a security measure to store return addresses. It > > > > seems to me exactly that the kernel should implicitly allocate and > > > > deallocate the shadow stack and not have userspace muck around with > > > > its size? ... > > The inconsistency here is with the management of the standard stack - > > with the standard stack userspace passes an already allocated address > > range to the kernel. A constant tension during review of the shadow > > stack interfaces has been that shadow stack memory is userspace memory > > but the security constraints mean that we've come down on the side of > > having a custom allocation syscall for it instead of using flags on > > mmap() and friends like people often expect, and now having it allocated > > as part of clone3(). The aim is to highlight that this difference is > So you have two interfaces for allocating a shadow stack. The first one > is to explicitly alloc a shadow stack via the map_shadow_stack(). The > second one is an implicit allocation during clone3() and you want to > allow explicitly influencing that. Yes. Shadow stacks are also allocated when the inital call to enable shadow stacks is done, the clone()/clone3() behaviour is to implicitly allocate when a thread is created by a thread that itself has shadow stacks enabled (so we avoid gaps in coverage). > > feature (hence the RFT in the subject line) - Rick Edgecombe did the x86 > > work. The arm64 code is still in review, the userspace interface is > > very similar to that for x86 and there doesn't seem to be any > > controversy there which makes me expect that a change is likely. Unlike > > x86 we only have a spec and virtual implementations at present, there's > > no immintent hardware, so we're taking our time with making sure that > > everything is working well. Deepak Gupta (in CC) has been reviewing the > > series from the point of view of RISC-V. I think we're all fairly well > > aligned on the requirements here. > I'm still not enthusiastic that we only have one implementation for this > in the kernel. What's the harm in waiting until the arm patches are > merged? This shouldn't result in chicken and egg: if the implementations > are sufficiently similar then we can do an appropriate clone3() > extension. The main thing would be that it would mean that people doing userspace enablement based on the merged x86 support can't use the stack size control. It's not the end of the world if that has to wait a bit, it's a bit of a detail thing, but it would make life easier, I guess the userspace people can let us know if it's starting to be a real hassle and we can reevaulate if that happens. It's also currently a dependency for the arm64 code so it'd be good to at least get ageement that assuming nothing comes up in testing the patches can go in along with the arm64 series, removing the dependency and then adding it as an incremental thing would be a hassle. It's likely that the arm64 series will be held out of tree for a while to as more complete userspace support is built up and validated so things might be sitting for a while - we don't have hardware right now so we can be cautious with the testing.
The kernel has recently added support for shadow stacks, currently x86 only using their CET feature but both arm64 and RISC-V have equivalent features (GCS and Zicfiss respectively), I am actively working on GCS[1]. With shadow stacks the hardware maintains an additional stack containing only the return addresses for branch instructions which is not generally writeable by userspace and ensures that any returns are to the recorded addresses. This provides some protection against ROP attacks and making it easier to collect call stacks. These shadow stacks are allocated in the address space of the userspace process. Our API for shadow stacks does not currently offer userspace any flexiblity for managing the allocation of shadow stacks for newly created threads, instead the kernel allocates a new shadow stack with the same size as the normal stack whenever a thread is created with the feature enabled. The stacks allocated in this way are freed by the kernel when the thread exits or shadow stacks are disabled for the thread. This lack of flexibility and control isn't ideal, in the vast majority of cases the shadow stack will be over allocated and the implicit allocation and deallocation is not consistent with other interfaces. As far as I can tell the interface is done in this manner mainly because the shadow stack patches were in development since before clone3() was implemented. Since clone3() is readily extensible let's add support for specifying a shadow stack when creating a new thread or process in a similar manner to how the normal stack is specified, keeping the current implicit allocation behaviour if one is not specified either with clone3() or through the use of clone(). Unlike normal stacks only the shadow stack size is specified, similar issues to those that lead to the creation of map_shadow_stack() apply. Please note that the x86 portions of this code are build tested only, I don't appear to have a system that can run CET avaible to me, I have done testing with an integration into my pending work for GCS. There is some possibility that the arm64 implementation may require the use of clone3() and explicit userspace allocation of shadow stacks, this is still under discussion. A new architecture feature Kconfig option for shadow stacks is added as here, this was suggested as part of the review comments for the arm64 GCS series and since we need to detect if shadow stacks are supported it seemed sensible to roll it in here. [1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/ Signed-off-by: Mark Brown <broonie@kernel.org> --- Changes in v3: - Rebase onto v6.7-rc2. - Remove stale shadow_stack in internal kargs. - If a shadow stack is specified unconditionally use it regardless of CLONE_ parameters. - Force enable shadow stacks in the selftest. - Update changelogs for RISC-V feature rename. - Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org Changes in v2: - Rebase onto v6.7-rc1. - Remove ability to provide preallocated shadow stack, just specify the desired size. - Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org --- Mark Brown (5): mm: Introduce ARCH_HAS_USER_SHADOW_STACK fork: Add shadow stack support to clone3() selftests/clone3: Factor more of main loop into test_clone3() selftests/clone3: Allow tests to flag if -E2BIG is a valid error code kselftest/clone3: Test shadow stack support arch/x86/Kconfig | 1 + arch/x86/include/asm/shstk.h | 11 +- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/shstk.c | 59 +++++-- fs/proc/task_mmu.c | 2 +- include/linux/mm.h | 2 +- include/linux/sched/task.h | 1 + include/uapi/linux/sched.h | 4 + kernel/fork.c | 22 ++- mm/Kconfig | 6 + tools/testing/selftests/clone3/clone3.c | 200 +++++++++++++++++----- tools/testing/selftests/clone3/clone3_selftests.h | 7 + 12 files changed, 250 insertions(+), 67 deletions(-) --- base-commit: 98b1cc82c4affc16f5598d4fa14b1858671b2263 change-id: 20231019-clone3-shadow-stack-15d40d2bf536 Best regards,