Message ID: 20240822-arm64-gcs-v11-0-41b81947ecb5@kernel.org
Series: arm64/gcs: Provide support for GCS in userspace
On Thu, Aug 22, 2024 at 02:15:08AM +0100, Mark Brown wrote: > FEAT_GCS introduces a number of new system registers, we require that > access to these registers is not trapped when we identify that the feature > is present. There is also a HCRX_EL2 control to make GCS operations > functional. > > Since if GCS is enabled any function call instruction will cause a fault > we also require that the feature be specifically disabled, existing > kernels implicitly have this requirement and especially given that the > MMU must be disabled it is difficult to see a situation where leaving > GCS enabled would be reasonable. > > Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> > Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Thu, Aug 22, 2024 at 02:15:20AM +0100, Mark Brown wrote: > Provide a hwcap to enable userspace to detect support for GCS. > > Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> > Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On Thu, Aug 22, 2024 at 05:17:14PM +0100, Catalin Marinas wrote: > > /* > > - * Ensure that GCS changes are observable by/from other PEs in > > - * case of migration. > > + * Ensure that GCS memory effects of the 'prev' thread are > > + * ordered before other memory accesses with release semantics > > + * (or preceded by a DMB) on the current PE. In addition, any > > + * memory accesses with acquire semantics (or succeeded by a > > + * DMB) are ordered before GCS memory effects of the 'next' > > + * thread. This will ensure that the GCS memory effects are > > + * visible to other PEs in case of migration. > > */ > > - gcsb_dsync(); > > + if (task_gcs_el0_enabled(current) || task_gcs_el0_enabled(next)) > > + gcsb_dsync(); > Ah, the comment turned up in this patch. It looks fine. Oh, sorry - I should probably just pull this hunk into the other patch.
On Thu, Aug 22, 2024 at 05:44:19PM +0100, Mark Brown wrote: > On Thu, Aug 22, 2024 at 05:12:30PM +0100, Catalin Marinas wrote: > > On Thu, Aug 22, 2024 at 02:15:22AM +0100, Mark Brown wrote: > > > > +static bool is_invalid_gcs_access(struct vm_area_struct *vma, u64 esr) > > > > + } else if (unlikely(vma->vm_flags & VM_SHADOW_STACK)) { > > > + /* Only GCS operations can write to a GCS page */ > > > + return is_write_abort(esr); > > > + } > > > I don't think that's right. The ESR on this path may not even indicate a > > data abort and ESR.WnR bit check wouldn't make sense. > > > I presume we want to avoid an infinite loop on a (writeable) GCS page > > when the user does a normal STR but the CPU raises a permission fault. I > > think this function needs to just return false if !esr_is_data_abort(). > > Yes, that should check for a data abort. I think I'd formed the > impression that is_write_abort() included that check somehow. As you > say it's to avoid spinning trying to resolve a permission fault for a > write (non-GCS reads to a GCS page are valid), I do think we need the > is_write_abort() since non-GCS reads are valid so something like: > > if (!esr_is_data_abort(esr)) > return false; > > return is_write_abort(esr); We do need the write abort check but not unconditionally, only if to a GCS page (you can have other genuine write aborts).
On Thu, Aug 22, 2024 at 06:19:38PM +0100, Catalin Marinas wrote: > On Thu, Aug 22, 2024 at 05:44:19PM +0100, Mark Brown wrote: > > On Thu, Aug 22, 2024 at 05:12:30PM +0100, Catalin Marinas wrote: > > > On Thu, Aug 22, 2024 at 02:15:22AM +0100, Mark Brown wrote: > > > > > > +static bool is_invalid_gcs_access(struct vm_area_struct *vma, u64 esr) > > > > > > + } else if (unlikely(vma->vm_flags & VM_SHADOW_STACK)) { > > > > + /* Only GCS operations can write to a GCS page */ > > > > + return is_write_abort(esr); > > > > + } > > Yes, that should check for a data abort. I think I'd formed the > > impression that is_write_abort() included that check somehow. As you > > say it's to avoid spinning trying to resolve a permission fault for a > > write (non-GCS reads to a GCS page are valid), I do think we need the > > is_write_abort() since non-GCS reads are valid so something like: > > > > if (!esr_is_data_abort(esr)) > > return false; > > > > return is_write_abort(esr); > > We do need the write abort check but not unconditionally, only if to a > GCS page (you can have other genuine write aborts). That was to replace the checks in the above case, not the function as a whole.
On Thu, Aug 22, 2024 at 02:15:28AM +0100, Mark Brown wrote:

> +static int preserve_gcs_context(struct gcs_context __user *ctx)
> +{
> +	int err = 0;
> +	u64 gcspr;
> +
> +	/*
> +	 * We will add a cap token to the frame, include it in the
> +	 * GCSPR_EL0 we report to support stack switching via
> +	 * sigreturn.
> +	 */
> +	gcs_preserve_current_state();
> +	gcspr = current->thread.gcspr_el0 - 8;
> +
> +	__put_user_error(GCS_MAGIC, &ctx->head.magic, err);
> +	__put_user_error(sizeof(*ctx), &ctx->head.size, err);
> +	__put_user_error(gcspr, &ctx->gcspr, err);
> +	__put_user_error(0, &ctx->reserved, err);
> +	__put_user_error(current->thread.gcs_el0_mode,
> +			 &ctx->features_enabled, err);
> +
> +	return err;
> +}

Do we actually need to store the gcspr value after the cap token has been pushed or just the value of the interrupted context? If we at some point get a sigaltshadowstack() syscall, the saved GCS wouldn't point to the new stack but rather the original one. Unwinders should be able to get the actual GCSPR_EL0 register, no need for the sigcontext to point to the new shadow stack.

Also in gcs_signal_entry() in the previous patch, we seem to subtract 16 rather than 8. I admit I haven't checked the past discussions in this area, so maybe I'm missing something.
> +static int restore_gcs_context(struct user_ctxs *user)
> +{
> +	u64 gcspr, enabled;
> +	int err = 0;
> +
> +	if (user->gcs_size != sizeof(*user->gcs))
> +		return -EINVAL;
> +
> +	__get_user_error(gcspr, &user->gcs->gcspr, err);
> +	__get_user_error(enabled, &user->gcs->features_enabled, err);
> +	if (err)
> +		return err;
> +
> +	/* Don't allow unknown modes */
> +	if (enabled & ~PR_SHADOW_STACK_SUPPORTED_STATUS_MASK)
> +		return -EINVAL;
> +
> +	err = gcs_check_locked(current, enabled);
> +	if (err != 0)
> +		return err;
> +
> +	/* Don't allow enabling */
> +	if (!task_gcs_el0_enabled(current) &&
> +	    (enabled & PR_SHADOW_STACK_ENABLE))
> +		return -EINVAL;
> +
> +	/* If we are disabling disable everything */
> +	if (!(enabled & PR_SHADOW_STACK_ENABLE))
> +		enabled = 0;
> +
> +	current->thread.gcs_el0_mode = enabled;
> +
> +	/*
> +	 * We let userspace set GCSPR_EL0 to anything here, we will
> +	 * validate later in gcs_restore_signal().
> +	 */
> +	current->thread.gcspr_el0 = gcspr;
> +	write_sysreg_s(current->thread.gcspr_el0, SYS_GCSPR_EL0);

So in preserve_gcs_context(), we subtract 8 from the gcspr_el0 value. Where is it added back?

What I find confusing is that both restore_gcs_context() and gcs_restore_signal() seem to touch current->thread.gcspr_el0 and the sysreg. Which one takes priority? I should probably check the branch out to see the end result.

> @@ -977,6 +1079,13 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user,
>  		return err;
>  	}
>  
> +	if (add_all || task_gcs_el0_enabled(current)) {
> +		err = sigframe_alloc(user, &user->gcs_offset,
> +				     sizeof(struct gcs_context));
> +		if (err)
> +			return err;
> +	}

I'm still not entirely convinced of this conditional saving and the interaction with unwinders. In a previous thread you mentioned that we need to keep the GCSPR_EL0 sysreg value up to date even after disabling GCS for a thread as not to confuse the unwinders.
We could get a signal delivered together with a sigreturn without any context switch. Do we lose any state? It might help if you describe the scenario, maybe even adding a comment in the code, otherwise I'm sure we'll forget in a few months time.
On Thu, Aug 22, 2024 at 02:15:30AM +0100, Mark Brown wrote: > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index a2f8ff354ca6..772f9ba99fe8 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -2137,6 +2137,26 @@ config ARM64_EPAN > if the cpu does not implement the feature. > endmenu # "ARMv8.7 architectural features" > > +menu "v9.4 architectural features" > + > +config ARM64_GCS > + bool "Enable support for Guarded Control Stack (GCS)" > + default y > + select ARCH_HAS_USER_SHADOW_STACK > + select ARCH_USES_HIGH_VMA_FLAGS > + help > + Guarded Control Stack (GCS) provides support for a separate > + stack with restricted access which contains only return > + addresses. This can be used to harden against some attacks > + by comparing return address used by the program with what is > + stored in the GCS, and may also be used to efficiently obtain > + the call stack for applications such as profiling. > + > + The feature is detected at runtime, and will remain disabled > + if the system does not implement the feature. > + > +endmenu # "v9.4 architectural features" BTW, as Mark R spotted we'd also need to handle uprobes. Since that's off in defconfig, I think it can be done separately on top of this series. In the meantime, we could make this dependent on !UPROBES.
On Fri, Aug 23, 2024 at 10:37:19AM +0100, Catalin Marinas wrote: > On Thu, Aug 22, 2024 at 02:15:28AM +0100, Mark Brown wrote: > > + gcs_preserve_current_state(); > > + gcspr = current->thread.gcspr_el0 - 8; > > + __put_user_error(gcspr, &ctx->gcspr, err); > Do we actually need to store the gcspr value after the cap token has > been pushed or just the value of the interrupted context? If we at some > point get a sigaltshadowstack() syscall, the saved GCS wouldn't point to > the new stack but rather the original one. Unwinders should be able to > get the actual GCSPR_EL0 register, no need for the sigcontext to point > to the new shadow stack. We could store either the cap token or the interrupted GCSPR_EL0 (the address below the cap token). It felt more joined up to go with the cap token since notionally signal return is consuming the cap token but either way would work, we could just add an offset when looking at the pointer. > Also in gcs_signal_entry() in the previous patch, we seem to subtract 16 > rather than 8. We need to not only place a cap but also a GCS frame for the sigreturn trampoline, the sigreturn trampoline isn't part of the interrupted context so isn't included in the signal frame but it needs to have a record on the GCS so that the signal handler doesn't just generate a GCS fault if it tries to return to the trampoline. This means that the GCSPR_EL0 that is set for the signal handler needs to move two entries, one for the cap token and one for the trampoline. > What I find confusing is that both restore_gcs_context() and > gcs_restore_signal() seem to touch current->thread.gcspr_el0 and the > sysreg. Which one takes priority? I should probably check the branch out > to see the end result. restore_gcs_context() is loading values from the signal frame in memory (which will only happen if a GCS context is present) then gcs_restore_signal() consumes the token at the top of the stack. 
The split is because userspace can skip the restore_X_context() functions for the optional signal frame elements by removing them from the context but we want to ensure that we always consume a token. > > + /* > > + * We let userspace set GCSPR_EL0 to anything here, we will > > + * validate later in gcs_restore_signal(). > > + */ > > + current->thread.gcspr_el0 = gcspr; > > + write_sysreg_s(current->thread.gcspr_el0, SYS_GCSPR_EL0); > So in preserve_gcs_context(), we subtract 8 from the gcspr_el0 value. > Where is it added back? When we consumed the GCS cap token. > > + if (add_all || task_gcs_el0_enabled(current)) { > > + err = sigframe_alloc(user, &user->gcs_offset, > > + sizeof(struct gcs_context)); > > + if (err) > > + return err; > > + } > I'm still not entirely convinced of this conditional saving and the > interaction with unwinders. In a previous thread you mentioned that we > need to keep the GCSPR_EL0 sysreg value up to date even after disabling > GCS for a thread as not to confuse the unwinders. We could get a signal > delivered together with a sigreturn without any context switch. Do we > lose any state? > It might help if you describe the scenario, maybe even adding a comment > in the code, otherwise I'm sure we'll forget in a few months time. We should probably just change that back to saving unconditionally - it looks like the decision on worrying about overflowing the default signal frame is that we just shouldn't.
On Fri, Aug 23, 2024 at 11:25:30AM +0100, Mark Brown wrote: > On Fri, Aug 23, 2024 at 10:37:19AM +0100, Catalin Marinas wrote: > > On Thu, Aug 22, 2024 at 02:15:28AM +0100, Mark Brown wrote: > > > > + gcs_preserve_current_state(); > > > + gcspr = current->thread.gcspr_el0 - 8; > > > > + __put_user_error(gcspr, &ctx->gcspr, err); > > > Do we actually need to store the gcspr value after the cap token has > > been pushed or just the value of the interrupted context? If we at some > > point get a sigaltshadowstack() syscall, the saved GCS wouldn't point to > > the new stack but rather the original one. Unwinders should be able to > > get the actual GCSPR_EL0 register, no need for the sigcontext to point > > to the new shadow stack. > > We could store either the cap token or the interrupted GCSPR_EL0 (the > address below the cap token). It felt more joined up to go with the cap > token since notionally signal return is consuming the cap token but > either way would work, we could just add an offset when looking at the > pointer. In a hypothetical sigaltshadowstack() scenario, would the cap go on the new signal shadow stack or on the old one? I assume on the new one but in sigcontext we'd save the original GCSPR_EL0. In such hypothetical case, the original GCSPR_EL0 would not need 8 subtracted. I need to think some more about this. The gcs_restore_signal() function makes sense, it starts with the current GCSPR_EL0 on the signal stack and consumes the token, adds 8 to the shadow stack pointer. The restore_gcs_context() one is confusing as it happens before consuming the cap token and assumes that the GCSPR_EL0 value actually points to the signal stack. If we ever implement an alternative shadow stack, the original GCSPR_EL0 of the interrupted context would be lost. I know it's not planned for now but the principles should be the same. The sigframe.uc should store the interrupted state. 
To me the order for sigreturn should be first to consume the cap token, validate it etc. and then restore GCSPR_EL0 to whatever was saved in the sigframe.uc prior to the signal being delivered. > > Also in gcs_signal_entry() in the previous patch, we seem to subtract 16 > > rather than 8. > > We need to not only place a cap but also a GCS frame for the sigreturn > trampoline, the sigreturn trampoline isn't part of the interrupted > context so isn't included in the signal frame but it needs to have a > record on the GCS so that the signal handler doesn't just generate a GCS > fault if it tries to return to the trampoline. This means that the > GCSPR_EL0 that is set for the signal handler needs to move two entries, > one for the cap token and one for the trampoline. Yes, this makes sense. > > What I find confusing is that both restore_gcs_context() and > > gcs_restore_signal() seem to touch current->thread.gcspr_el0 and the > > sysreg. Which one takes priority? I should probably check the branch out > > to see the end result. > > restore_gcs_context() is loading values from the signal frame in memory > (which will only happen if a GCS context is present) then > gcs_restore_signal() consumes the token at the top of the stack. The > split is because userspace can skip the restore_X_context() functions > for the optional signal frame elements by removing them from the context > but we want to ensure that we always consume a token. I agree we should always consume a token but this should be done from the actual hardware GCSPR_EL0 value on the sigreturn call rather than the one restored from sigframe.uc. The restoring should be the last step.
On Fri, Aug 23, 2024 at 04:59:11PM +0100, Catalin Marinas wrote: > On Fri, Aug 23, 2024 at 11:25:30AM +0100, Mark Brown wrote: > > We could store either the cap token or the interrupted GCSPR_EL0 (the > > address below the cap token). It felt more joined up to go with the cap > > token since notionally signal return is consuming the cap token but > > either way would work, we could just add an offset when looking at the > > pointer. > In a hypothetical sigaltshadowstack() scenario, would the cap go on the > new signal shadow stack or on the old one? I assume on the new one but > in sigcontext we'd save the original GCSPR_EL0. In such hypothetical > case, the original GCSPR_EL0 would not need 8 subtracted. I would have put the token on the old stack since that's what we'd be returning to. This raises interesting questions about what happens if the reason for the signal is that we just overflowed the normal stack (which are among the issues that have got in the way of working out if or how we do something with sigaltshadowstack). I'm not clear what the purpose of the token would be on the new stack, the token basically says "this is somewhere we can sigreturn to", that's not the case for the alternative stack. > I need to think some more about this. The gcs_restore_signal() function > makes sense, it starts with the current GCSPR_EL0 on the signal stack > and consumes the token, adds 8 to the shadow stack pointer. The > restore_gcs_context() one is confusing as it happens before consuming > the cap token and assumes that the GCSPR_EL0 value actually points to > the signal stack. If we ever implement an alternative shadow stack, the > original GCSPR_EL0 of the interrupted context would be lost. I know it's > not planned for now but the principles should be the same. The > sigframe.uc should store the interrupted state. 
I think the issues you're pointing out here go to the thing with the cap token marking a place we can sigreturn to and therefore being on the original stack. > To me the order for sigreturn should be first to consume the cap token, > validate it etc. and then restore GCSPR_EL0 to whatever was saved in the > sigframe.uc prior to the signal being delivered. To me what we're doing here is that the signal frame says where userspace wants to point GCSPR_EL0 in the returned context, we then go and confirm that this is a valid address by looking at it and checking for a token. The token serves to validate what was saved in sigframe.uc so that it can't just be pointed at some random address. > > restore_gcs_context() is loading values from the signal frame in memory > > (which will only happen if a GCS context is present) then > > gcs_restore_signal() consumes the token at the top of the stack. The > > split is because userspace can skip the restore_X_context() functions > > for the optional signal frame elements by removing them from the context > > but we want to ensure that we always consume a token. > I agree we should always consume a token but this should be done from > the actual hardware GCSPR_EL0 value on the sigreturn call rather than > the one restored from sigframe.uc. The restoring should be the last > step. If we look for a token at the GCSPR_EL0 on sigreturn then it wouldn't be valid to call sigreturn() (well, without doing something first to unwind the stack which seems hostile and would require code to become GCS aware that wouldn't otherwise), if you do a sigreturn from somewhere other than the vDSO then you'd have at least the vDSO signal frame left in the GCS. You'd also be able to point GCSPR_EL0 to anywhere since we'd just load a value from the signal frame but not do the cap verification.
On Fri, Aug 23, 2024 at 11:01:13PM +0100, Mark Brown wrote: > On Fri, Aug 23, 2024 at 04:59:11PM +0100, Catalin Marinas wrote: > > On Fri, Aug 23, 2024 at 11:25:30AM +0100, Mark Brown wrote: > > > > We could store either the cap token or the interrupted GCSPR_EL0 (the > > > address below the cap token). It felt more joined up to go with the cap > > > token since notionally signal return is consuming the cap token but > > > either way would work, we could just add an offset when looking at the > > > pointer. > > > In a hypothetical sigaltshadowstack() scenario, would the cap go on the > > new signal shadow stack or on the old one? I assume on the new one but > > in sigcontext we'd save the original GCSPR_EL0. In such hypothetical > > case, the original GCSPR_EL0 would not need 8 subtracted. > > I would have put the token on the old stack since that's what we'd be > returning to. After some more spec reading, your approach makes sense as it matches the GCSSS[12] instructions where the outgoing, rather than incoming, shadow stack is capped. So all good I think. However, a bit more below on the restore order (it's ok but a bit confusing). > This raises interesting questions about what happens if > the reason for the signal is that we just overflowed the normal stack > (which are among the issues that have got in the way of working out if > or how we do something with sigaltshadowstack). That's not that different from the classic case where we get an error trying to setup the frame. signal_setup_done() handles it by forcing a SIGSEGV. I'd say we do the same here. > I'm not clear what the > purpose of the token would be on the new stack, the token basically says > "this is somewhere we can sigreturn to", that's not the case for the > alternative stack. Yeah, I thought we have to somehow mark the top of the stack with this token. But looking at the architecture stack switching, it caps the outgoing stack (in our case this would be the interrupted one). So that's settled. 
On the patch itself, I think there are some small inconsistencies on how it reads the GCSPR_EL0: preserve_gcs_context() does a gcs_preserve_current_state() and subsequently reads the value from the thread structure. A bit later, gcs_signal_entry() goes for the sysreg directly. I don't think that's a problem even if the thread gets preempted but it would be nice to be consistent.

Maybe leave the gcs_preserve_current_state() only a context switch thing. Would it work if we don't touch the thread structure at all in the signal code? We wouldn't deliver a signal in the middle of the switch_to() code. So any value we write in thread struct would be overridden at the next switch.

If GCS is disabled for a guest, we save the GCSPR_EL0 with the cap size subtracted but there's no cap written. In restore_gcs_context() it doesn't look like we add the cap size back when writing GCSPR_EL0. If GCS is enabled, we do consume the cap and add 8 but otherwise it looks that we keep decreasing GCSPR_EL0. I think we should always subtract the cap size if GCS is enabled. This could do with some refactoring as I find it hard to follow (not sure exactly how, maybe just comments will do).

I'd also keep a single write to GCSPR_EL0 on the return path but I'm ok with two if we need to cope with GCS being disabled but the GCSPR_EL0 still being saved/restored.

Another aspect for gcs_restore_signal(), I think it makes more sense for the cap to be consumed _after_ restoring the sigcontext since this has the actual gcspr_el0 where we stored the cap and represents the original stack. If we'll get an alternative shadow stack, current GCSPR_EL0 on sigreturn points to that alternative shadow stack rather than the original one. That's what confused me when reviewing the patch and I thought the cap goes to the top of the signal stack.
On Mon, Aug 26, 2024 at 01:00:09PM +0300, Catalin Marinas wrote: > On Fri, Aug 23, 2024 at 11:01:13PM +0100, Mark Brown wrote: > > On Fri, Aug 23, 2024 at 04:59:11PM +0100, Catalin Marinas wrote: > gcs_preserve_current_state() only a context switch thing. Would it work > if we don't touch the thread structure at all in the signal code? We > wouldn't deliver a signal in the middle of the switch_to() code. So any > value we write in thread struct would be overridden at the next switch. I think so, yes. > If GCS is disabled for a guest, we save the GCSPR_EL0 with the cap size s/guest/task/ I guess? > subtracted but there's no cap written. In restore_gcs_context() it > doesn't look like we add the cap size back when writing GCSPR_EL0. If > GCS is enabled, we do consume the cap and add 8 but otherwise it looks > that we keep decreasing GCSPR_EL0. I think we should always subtract the > cap size if GCS is enabled. This could could do with some refactoring as > I find it hard to follow (not sure exactly how, maybe just comments will > do). I've changed this so we instead only add the frame for the token if GCS is enabled and updated the comment, that way we don't modify GCSPR_EL0 in cases where GCS is not enabled. > I'd also keep a single write to GCSPR_EL0 on the return path but I'm ok > with two if we need to cope with GCS being disabled but the GCSPR_EL0 > still being saved/restored. I think the handling for the various options in the second case mean that it's clearer and simpler to write once when we restore the frame and once when we consume the token. > Another aspect for gcs_restore_signal(), I think it makes more sense for > the cap to be consumed _after_ restoring the sigcontext since this has > the actual gcspr_el0 where we stored the cap and represents the original > stack. If we'll get an alternative shadow stack, current GCSPR_EL0 on > sigreturn points to that alternative shadow stack rather than the > original one. 
That's what confused me when reviewing the patch and I > thought the cap goes to the top of the signal stack. I've moved gcs_restore_signal() before the altstack restore which I think is what you're looking for here?
The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling. When GCS is active a secondary stack called the Guarded Control Stack is maintained, protected with a memory attribute which means that it can only be written with specific GCS operations. The current GCS pointer can not be directly written to by userspace. When a BL is executed the value stored in LR is also pushed onto the GCS, and when a RET is executed the top of the GCS is popped and compared to LR with a fault being raised if the values do not match. GCS operations may only be performed on GCS pages, a data abort is generated if they are not. The combination of hardware enforcement and lack of extra instructions in the function entry and exit paths should result in something which has less overhead and is more difficult to attack than a purely software implementation like clang's shadow stacks. This series implements support for use of GCS by userspace, along with support for use of GCS within KVM guests. It does not enable use of GCS by either EL1 or EL2, this will be implemented separately. Executables are started without GCS and must use a prctl() to enable it, it is expected that this will be done very early in application execution by the dynamic linker or other startup code. For dynamic linking this will be done by checking that everything in the executable is marked as GCS compatible. x86 has an equivalent feature called shadow stacks, this series depends on the x86 patches for generic memory management support for the new guarded/shadow stack page type and shares APIs as much as possible. 
As there has been extensive discussion with the wider community around the ABI for shadow stacks I have as far as practical kept implementation decisions close to those for x86, anticipating that review would lead to similar conclusions in the absence of strong reasoning for divergence. The main divergence I am conscious of is that x86 allows shadow stack to be enabled and disabled repeatedly, freeing the shadow stack for the thread whenever disabled, while this implementation keeps the GCS allocated after disable but refuses to reenable it. This is to avoid races with things actively walking the GCS during a disable, we do anticipate that some systems will wish to disable GCS at runtime but are not aware of any demand for subsequently reenabling it.

x86 uses an arch_prctl() to manage enable and disable, since only x86 and S/390 use arch_prctl() a generic prctl() was proposed[1] as part of a patch set for the equivalent RISC-V Zicfiss feature which I initially adopted fairly directly but following review feedback has been revised quite a bit.

We currently maintain the x86 pattern of implicitly allocating a shadow stack for threads started with shadow stack enabled, there has been some discussion of removing this support and requiring the use of clone3() with explicit allocation of shadow stacks instead. I have no strong feelings either way, implicit allocation is not really consistent with anything else we do and creates the potential for errors around thread exit but on the other hand it is existing ABI on x86 and minimises the changes needed in userspace code.

glibc and bionic changes using this ABI have been implemented and tested. Headless Android systems have been validated and Ross Burton has used this code to bring up a Yocto system with GCS enabled as standard, a test implementation of V8 support has also been done.

There is an open issue with support for CRIU, on x86 this required the ability to set the GCS mode via ptrace.
This series supports configuring mode bits other than enable/disable via ptrace but it needs to be confirmed if this is sufficient.

It is likely that we could relax some of the barriers added here with some more targeted placements, this is left for further study.

There is an in-progress series adding clone3() support for shadow stacks:

   https://lore.kernel.org/r/20240819-clone3-shadow-stack-v9-0-962d74f99464@kernel.org

Previous versions of this series depended on that, this dependency has been removed in order to make merging easier.

[1] https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@rivosinc.com/

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v11:
- Remove the dependency on the addition of clone3() support for shadow stacks, rebasing onto v6.11-rc3.
- Make ID_AA64PFR1_EL1.GCS writeable in KVM.
- Hide GCS registers when GCS is not enabled for KVM guests.
- Require HCRX_EL2.GCSEn if booting at EL1.
- Require that GCSCR_EL1 and GCSCRE0_EL1 be initialised regardless of if we boot at EL2 or EL1.
- Remove some stray use of bit 63 in signal cap tokens.
- Warn if we see a GCS with VM_SHARED.
- Remove redundant check for VM_WRITE in fault handling.
- Cleanups and clarifications in the ABI document.
- Clean up and improve documentation of some sync placement.
- Only set the EL0 GCS mode if it's actually changed.
- Various minor fixes and tweaks.
- Link to v10: https://lore.kernel.org/r/20240801-arm64-gcs-v10-0-699e2bd2190b@kernel.org

Changes in v10:
- Fix issues with THP.
- Tighten up requirements for initialising GCSCR*.
- Only generate GCS signal frames for threads using GCS.
- Only context switch EL1 GCS registers if S1PIE is enabled.
- Move context switch of GCSCRE0_EL1 to EL0 context switch.
- Make GCS registers unconditionally visible to userspace.
- Use FHU infrastructure.
- Don't change writability of ID_AA64PFR1_EL1 for KVM.
- Remove unused arguments from alloc_gcs().
- Typo fixes.
- Link to v9: https://lore.kernel.org/r/20240625-arm64-gcs-v9-0-0f634469b8f0@kernel.org

Changes in v9:
- Rebase onto v6.10-rc3.
- Restructure and clarify memory management fault handling.
- Fix up basic-gcs for the latest clone3() changes.
- Convert to newly merged KVM ID register based feature configuration.
- Fixes for NV traps.
- Link to v8: https://lore.kernel.org/r/20240203-arm64-gcs-v8-0-c9fec77673ef@kernel.org

Changes in v8:
- Invalidate signal cap token on stack when consuming.
- Typo and other trivial fixes.
- Don't try to use process_vm_write() on GCS, it intentionally does not work.
- Fix leak of thread GCSs.
- Rebase onto latest clone3() series.
- Link to v7: https://lore.kernel.org/r/20231122-arm64-gcs-v7-0-201c483bd775@kernel.org

Changes in v7:
- Rebase onto v6.7-rc2 via the clone3() patch series.
- Change the token used to cap the stack during signal handling to be compatible with GCSPOPM.
- Fix flags for new page types.
- Fold in support for clone3().
- Replace copy_to_user_gcs() with put_user_gcs().
- Link to v6: https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org

Changes in v6:
- Rebase onto v6.6-rc3.
- Add some more gcsb_dsync() barriers following spec clarifications.
- Due to ongoing discussion around clone()/clone3() I've not updated anything there, the behaviour is the same as on previous versions.
- Link to v5: https://lore.kernel.org/r/20230822-arm64-gcs-v5-0-9ef181dd6324@kernel.org

Changes in v5:
- Don't map any permissions for user GCSs, we always use EL0 accessors or use a separate mapping of the page.
- Reduce the standard size of the GCS to RLIMIT_STACK/2.
- Enforce a PAGE_SIZE alignment requirement on map_shadow_stack().
- Clarifications and fixes to documentation.
- More tests.
- Link to v4: https://lore.kernel.org/r/20230807-arm64-gcs-v4-0-68cfa37f9069@kernel.org

Changes in v4:
- Implement flags for map_shadow_stack() allowing the cap and end of stack marker to be enabled independently or not at all.
- Relax size and alignment requirements for map_shadow_stack().
- Add more blurb explaining the advantages of hardware enforcement.
- Link to v3: https://lore.kernel.org/r/20230731-arm64-gcs-v3-0-cddf9f980d98@kernel.org

Changes in v3:
- Rebase onto v6.5-rc4.
- Add a GCS barrier on context switch.
- Add a GCS stress test.
- Link to v2: https://lore.kernel.org/r/20230724-arm64-gcs-v2-0-dc2c1d44c2eb@kernel.org

Changes in v2:
- Rebase onto v6.5-rc3.
- Rework prctl() interface to allow each bit to be locked independently.
- map_shadow_stack() now places the cap token based on the size
  requested by the caller not the actual space allocated.
- Mode changes other than enable via ptrace are now supported.
- Expand test coverage.
- Various smaller fixes and adjustments.
- Link to v1: https://lore.kernel.org/r/20230716-arm64-gcs-v1-0-bf567f93bba6@kernel.org

---
Mark Brown (39):
      mm: Introduce ARCH_HAS_USER_SHADOW_STACK
      arm64/mm: Restructure arch_validate_flags() for extensibility
      prctl: arch-agnostic prctl for shadow stack
      mman: Add map_shadow_stack() flags
      arm64: Document boot requirements for Guarded Control Stacks
      arm64/gcs: Document the ABI for Guarded Control Stacks
      arm64/sysreg: Add definitions for architected GCS caps
      arm64/gcs: Add manual encodings of GCS instructions
      arm64/gcs: Provide put_user_gcs()
      arm64/gcs: Provide basic EL2 setup to allow GCS usage at EL0 and EL1
      arm64/cpufeature: Runtime detection of Guarded Control Stack (GCS)
      arm64/mm: Allocate PIE slots for EL0 guarded control stack
      mm: Define VM_SHADOW_STACK for arm64 when we support GCS
      arm64/mm: Map pages for guarded control stack
      KVM: arm64: Manage GCS access and registers for guests
      arm64/idreg: Add overrride for GCS
      arm64/hwcap: Add hwcap for GCS
      arm64/traps: Handle GCS exceptions
      arm64/mm: Handle GCS data aborts
      arm64/gcs: Context switch GCS state for EL0
      arm64/gcs: Ensure that new threads have a GCS
      arm64/gcs: Implement shadow stack prctl() interface
      arm64/mm: Implement map_shadow_stack()
      arm64/signal: Set up and restore the GCS context for signal handlers
      arm64/signal: Expose GCS state in signal frames
      arm64/ptrace: Expose GCS via ptrace and core files
      arm64: Add Kconfig for Guarded Control Stack (GCS)
      kselftest/arm64: Verify the GCS hwcap
      kselftest/arm64: Add GCS as a detected feature in the signal tests
      kselftest/arm64: Add framework support for GCS to signal handling tests
      kselftest/arm64: Allow signals tests to specify an expected si_code
      kselftest/arm64: Always run signals tests with GCS enabled
      kselftest/arm64: Add very basic GCS test program
      kselftest/arm64: Add a GCS test program built with the system libc
      kselftest/arm64: Add test coverage for GCS mode locking
      kselftest/arm64: Add GCS signal tests
      kselftest/arm64: Add a GCS stress test
      kselftest/arm64: Enable GCS for the FP stress tests
      KVM: selftests: arm64: Add GCS registers to get-reg-list

 Documentation/admin-guide/kernel-parameters.txt | 3 +
 Documentation/arch/arm64/booting.rst | 32 +
 Documentation/arch/arm64/elf_hwcaps.rst | 2 +
 Documentation/arch/arm64/gcs.rst | 230 +++
 Documentation/arch/arm64/index.rst | 1 +
 Documentation/filesystems/proc.rst | 2 +-
 arch/arm64/Kconfig | 20 +
 arch/arm64/include/asm/cpufeature.h | 6 +
 arch/arm64/include/asm/el2_setup.h | 29 +
 arch/arm64/include/asm/esr.h | 28 +-
 arch/arm64/include/asm/exception.h | 2 +
 arch/arm64/include/asm/gcs.h | 107 +++
 arch/arm64/include/asm/hwcap.h | 1 +
 arch/arm64/include/asm/kvm_host.h | 12 +
 arch/arm64/include/asm/mman.h | 23 +-
 arch/arm64/include/asm/pgtable-prot.h | 14 +-
 arch/arm64/include/asm/processor.h | 7 +
 arch/arm64/include/asm/sysreg.h | 20 +
 arch/arm64/include/asm/uaccess.h | 40 ++
 arch/arm64/include/asm/vncr_mapping.h | 2 +
 arch/arm64/include/uapi/asm/hwcap.h | 1 +
 arch/arm64/include/uapi/asm/ptrace.h | 8 +
 arch/arm64/include/uapi/asm/sigcontext.h | 9 +
 arch/arm64/kernel/cpufeature.c | 12 +
 arch/arm64/kernel/cpuinfo.c | 1 +
 arch/arm64/kernel/entry-common.c | 23 +
 arch/arm64/kernel/pi/idreg-override.c | 2 +
 arch/arm64/kernel/process.c | 88 +++
 arch/arm64/kernel/ptrace.c | 54 ++
 arch/arm64/kernel/signal.c | 225 ++++-
 arch/arm64/kernel/traps.c | 11 +
 arch/arm64/kvm/hyp/include/hyp/sysreg-sr.h | 49 +-
 arch/arm64/kvm/sys_regs.c | 27 +-
 arch/arm64/mm/Makefile | 1 +
 arch/arm64/mm/fault.c | 40 ++
 arch/arm64/mm/gcs.c | 252 +++
 arch/arm64/mm/mmap.c | 10 +-
 arch/arm64/tools/cpucaps | 1 +
 arch/x86/Kconfig | 1 +
 arch/x86/include/uapi/asm/mman.h | 3 -
 fs/proc/task_mmu.c | 2 +-
 include/linux/mm.h | 18 +-
 include/uapi/asm-generic/mman.h | 4 +
 include/uapi/linux/elf.h | 1 +
 include/uapi/linux/prctl.h | 22 +
 kernel/sys.c | 30 +
 mm/Kconfig | 6 +
 tools/testing/selftests/arm64/Makefile | 2 +-
 tools/testing/selftests/arm64/abi/hwcap.c | 19 +
 tools/testing/selftests/arm64/fp/assembler.h | 15 +
 tools/testing/selftests/arm64/fp/fpsimd-test.S | 2 +
 tools/testing/selftests/arm64/fp/sve-test.S | 2 +
 tools/testing/selftests/arm64/fp/za-test.S | 2 +
 tools/testing/selftests/arm64/fp/zt-test.S | 2 +
 tools/testing/selftests/arm64/gcs/.gitignore | 5 +
 tools/testing/selftests/arm64/gcs/Makefile | 24 +
 tools/testing/selftests/arm64/gcs/asm-offsets.h | 0
 tools/testing/selftests/arm64/gcs/basic-gcs.c | 357 ++++++++++
 tools/testing/selftests/arm64/gcs/gcs-locking.c | 200 ++++++
 .../selftests/arm64/gcs/gcs-stress-thread.S | 311 +++++++++
 tools/testing/selftests/arm64/gcs/gcs-stress.c | 530 +++++++++++++++
 tools/testing/selftests/arm64/gcs/gcs-util.h | 100 +++
 tools/testing/selftests/arm64/gcs/libc-gcs.c | 728 +++++++++++++++++++++
 tools/testing/selftests/arm64/signal/.gitignore | 1 +
 .../testing/selftests/arm64/signal/test_signals.c | 17 +-
 .../testing/selftests/arm64/signal/test_signals.h | 6 +
 .../selftests/arm64/signal/test_signals_utils.c | 32 +-
 .../selftests/arm64/signal/test_signals_utils.h | 39 ++
 .../arm64/signal/testcases/gcs_exception_fault.c | 62 ++
 .../selftests/arm64/signal/testcases/gcs_frame.c | 88 +++
 .../arm64/signal/testcases/gcs_write_fault.c | 67 ++
 .../selftests/arm64/signal/testcases/testcases.c | 7 +
 .../selftests/arm64/signal/testcases/testcases.h | 1 +
 tools/testing/selftests/kvm/aarch64/get-reg-list.c | 28 +
 74 files changed, 4086 insertions(+), 43 deletions(-)
---
base-commit: 7c626ce4bae1ac14f60076d00eafe71af30450ba
change-id: 20230303-arm64-gcs-e311ab0d8729

Best regards,