diff mbox series

[RFT,v8,4/9] fork: Add shadow stack support to clone3()

Message ID 20240808-clone3-shadow-stack-v8-4-0acf37caf14c@kernel.org
State New
Headers show
Series fork: Support shadow stacks in clone3() | expand

Commit Message

Mark Brown Aug. 8, 2024, 8:15 a.m. UTC
Unlike with the normal stack there is no API for configuring the the shadow
stack for a new thread, instead the kernel will dynamically allocate a new
shadow stack with the same size as the normal stack. This appears to be due
to the shadow stack series having been in development since before the more
extensible clone3() was added rather than anything more deliberate.

Add parameters to clone3() specifying the location and size of a shadow
stack for the newly created process.  If no shadow stack is specified
then the existing implicit allocation behaviour is maintained.

If a stack is specified then it is required to have an architecture
defined token placed on the stack, this will be consumed by the new
task.  If the token is not provided then this will be reported as a
segmentation fault with si_code SEGV_CPERR, as a runtime shadow stack
protection error would be.  This allows architectures to implement the
validation of the token in the child process context.

If the architecture does not support shadow stacks the shadow stack
parameters must be zero, architectures that do support the feature are
expected to enforce the same requirement on individual systems that lack
shadow stack support.

Update the existing x86 implementation to pay attention to the newly added
arguments, in order to maintain compatibility we use the existing behaviour
if no shadow stack is specified. Minimal validation is done of the supplied
parameters, detailed enforcement is left to when the thread is executed.
Since we are now using more fields from the kernel_clone_args we pass that
into the shadow stack code rather than individual fields.

At present this implementation does not consume the shadow stack token
atomically as would be desirable, it uses a separate read and write.

Signed-off-by: Mark Brown <broonie@kernel.org>
---
 arch/x86/include/asm/shstk.h |  11 +++--
 arch/x86/kernel/process.c    |   2 +-
 arch/x86/kernel/shstk.c      | 105 ++++++++++++++++++++++++++++++++++---------
 include/linux/sched/task.h   |  13 ++++++
 include/uapi/linux/sched.h   |  13 ++++--
 kernel/fork.c                |  76 ++++++++++++++++++++++++++-----
 6 files changed, 178 insertions(+), 42 deletions(-)

Comments

Mark Brown Aug. 9, 2024, 11:06 p.m. UTC | #1
On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote:
> On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote:

> > +	/* This should really be an atomic cmpxchg.  It is not. */
> > +	if (access_remote_vm(mm, addr, &val, sizeof(val),
> > +			     FOLL_FORCE) != sizeof(val))
> > +		goto out;

> If we restrict the shadow stack creation only to the CLONE_VM case, we'd
> not need the remote vm access, it's in the current mm context already.
> More on this below.

The discussion in previous iterations was that it seemed better to allow
even surprising use cases since it simplifies the analysis of what we
have covered.  If the user has specified a shadow stack we just do what
they asked for and let them worry about if it's useful.

> > +	if (val != expected)
> > +		goto out;

> I'm confused that we need to consume the token here. I could not find
> the default shadow stack allocation doing this, only setting it via
> create_rstor_token() (or I did not search enough). In the default case,
> is the user consuming it? To me the only difference should been the
> default allocation vs the one passed by the user via clone3(), with the
> latter maybe requiring the user to set the token initially.

As discussed for a couple of previous versions if we don't have the
token and userspace can specify any old shadow stack page as the shadow
stack this allows clone3() to be used to overwrite the shadow stack of
another thread, you can point to a shadow stack page which is currently
in use and then run some code that causes shadow stack writes.  This
could potentially then in turn be used as part of a bigger exploit
chain, probably it's hard to get anything beyond just causing the other
thread to fault but won't be impossible.

With a kernel allocated shadow stack this is not an issue since we are
placing the shadow stack in new memory, userspace can't control where we
place it so it can't overwrite an existing shadow stack.

> > +		/*
> > +		 * For CLONE_VFORK the child will share the parents
> > +		 * shadow stack.  Make sure to clear the internal
> > +		 * tracking of the thread shadow stack so the freeing
> > +		 * logic run for child knows to leave it alone.
> > +		 */
> > +		if (clone_flags & CLONE_VFORK) {
> > +			shstk->base = 0;
> > +			shstk->size = 0;
> > +			return 0;
> > +		}

> I think we should leave the CLONE_VFORK check on its own independent of
> the clone3() arguments. If one passes both CLONE_VFORK and specific
> shadow stack address/size, they should be ignored (or maybe return an
> error if you want to make it stricter).

This is existing logic from the current x86 code that's been reindented
due to the addition of explicitly specified shadow stacks, it's not new
behaviour.  It is needed to stop the child thinking it has the parent's
shadow stack in the CLONE_VFORK case.

> > -	/*
> > -	 * For !CLONE_VM the child will use a copy of the parents shadow
> > -	 * stack.
> > -	 */
> > -	if (!(clone_flags & CLONE_VM))
> > -		return 0;
> > +		/*
> > +		 * For !CLONE_VM the child will use a copy of the
> > +		 * parents shadow stack.
> > +		 */
> > +		if (!(clone_flags & CLONE_VM))
> > +			return 0;

> Is the !CLONE_VM case specific only to the default shadow stack
> allocation? Sorry if this has been discussed already (or I completely
> forgot) but I thought we'd only implement this for the thread creation
> case. The typical fork() for a new process should inherit the parent's
> layout, so applicable to the clone3() with the shadow stack arguments as
> well (which should be ignored or maybe return an error with !CLONE_VM).

This is again all existing behaviour for the case where the user has not
specified a shadow stack reindented, as mentioned above if the user has
specified one explicitly then we just do what we were asked.  The
existing behaviour is to only create a new shadow stack for the child in
the CLONE_VM case and leave the child using the same shadow stack as the
parent in the copied mm for !CLONE_VM.

> > @@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args)
> >  	 */
> >  	trace_sched_process_fork(current, p);
> >  
> > +	shstk_post_fork(p, args);

> Do we need this post fork call? Can we not handle the setup via the
> copy_thread() path in shstk_alloc_thread_stack()?

It looks like we do actually have the new mm in the process before we
call copy_thread() so we could move things into there though we'd loose
a small bit of factoring out of the error handling (at one point I had
more code factored out but right now it's quite small, looking again we
could also factor out the get_task_mm()/mput()).  ISTR having the new
process' mm was the biggest reason for this initially but looking again
I'm not sure why that was.  It does still feel like even the small
amount that's factored out currently is useful though, a bit less
duplication in the architecture code which feels welcome here.
Catalin Marinas Aug. 13, 2024, 4:25 p.m. UTC | #2
On Sat, Aug 10, 2024 at 12:06:12AM +0100, Mark Brown wrote:
> On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote:
> > > +	/* This should really be an atomic cmpxchg.  It is not. */
> > > +	if (access_remote_vm(mm, addr, &val, sizeof(val),
> > > +			     FOLL_FORCE) != sizeof(val))
> > > +		goto out;
> 
> > If we restrict the shadow stack creation only to the CLONE_VM case, we'd
> > not need the remote vm access, it's in the current mm context already.
> > More on this below.
> 
> The discussion in previous iterations was that it seemed better to allow
> even surprising use cases since it simplifies the analysis of what we
> have covered.  If the user has specified a shadow stack we just do what
> they asked for and let them worry about if it's useful.

Thanks for the summary of the past discussions, the patch makes more
sense now. I guess it's easier to follow a clone*() syscall where one
can set a new stack pointer even in the !CLONE_VM case. Just let it set
the shadow stack as well with the new ABI.

However, the x86 would be slightly inconsistent here between clone() and
clone3(). I guess it depends how you look at it. The classic clone()
syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
new shadow stack allocated which I'd expect since it's a new stack.
Well, I doubt anyone cares about this scenario. Are there real cases of
!CLONE_VM but with a new stack?

> > > +	if (val != expected)
> > > +		goto out;
> 
> > I'm confused that we need to consume the token here. I could not find
> > the default shadow stack allocation doing this, only setting it via
> > create_rstor_token() (or I did not search enough). In the default case,
> > is the user consuming it? To me the only difference should been the
> > default allocation vs the one passed by the user via clone3(), with the
> > latter maybe requiring the user to set the token initially.
> 
> As discussed for a couple of previous versions if we don't have the
> token and userspace can specify any old shadow stack page as the shadow
> stack this allows clone3() to be used to overwrite the shadow stack of
> another thread, you can point to a shadow stack page which is currently
> in use and then run some code that causes shadow stack writes.  This
> could potentially then in turn be used as part of a bigger exploit
> chain, probably it's hard to get anything beyond just causing the other
> thread to fault but won't be impossible.
> 
> With a kernel allocated shadow stack this is not an issue since we are
> placing the shadow stack in new memory, userspace can't control where we
> place it so it can't overwrite an existing shadow stack.

IIUC, the kernel-allocated shadow stack will have the token always set
while the user-allocated one will be cleared. I was looking to
understand the inconsistency between these two cases in terms of the
final layout of the new shadow stack: one with the token, the other
without. I can see the need for checking but maybe start with requiring
it to be 0 and setting the token before returning, for consistency with
clone().

In the kernel-allocated shadow stack, is the token used for anything? I
can see it's used for signal delivery and return but I couldn't figure
out what it is used for in a thread's shadow stack.

Also, can one not use the clone3() to point to the clone()-allocated
shadow stack? Maybe that's unlikely as an app tends to stick to one
syscall flavour or the other.

> > > +		/*
> > > +		 * For CLONE_VFORK the child will share the parents
> > > +		 * shadow stack.  Make sure to clear the internal
> > > +		 * tracking of the thread shadow stack so the freeing
> > > +		 * logic run for child knows to leave it alone.
> > > +		 */
> > > +		if (clone_flags & CLONE_VFORK) {
> > > +			shstk->base = 0;
> > > +			shstk->size = 0;
> > > +			return 0;
> > > +		}
> 
> > I think we should leave the CLONE_VFORK check on its own independent of
> > the clone3() arguments. If one passes both CLONE_VFORK and specific
> > shadow stack address/size, they should be ignored (or maybe return an
> > error if you want to make it stricter).
> 
> This is existing logic from the current x86 code that's been reindented
> due to the addition of explicitly specified shadow stacks, it's not new
> behaviour.  It is needed to stop the child thinking it has the parent's
> shadow stack in the CLONE_VFORK case.

I figured that. But similar to the current !CLONE_VM behaviour where no
new shadow stack is allocated even if a new stack is passed to clone(),
I was thinking of something similar here for consistency: don't set up a
shadow stack in the CLONE_VFORK case or at least allow it only if a new
stack is being set up (if we extend this to clone(), it would be a small
ABI change).

> > > -	/*
> > > -	 * For !CLONE_VM the child will use a copy of the parents shadow
> > > -	 * stack.
> > > -	 */
> > > -	if (!(clone_flags & CLONE_VM))
> > > -		return 0;
> > > +		/*
> > > +		 * For !CLONE_VM the child will use a copy of the
> > > +		 * parents shadow stack.
> > > +		 */
> > > +		if (!(clone_flags & CLONE_VM))
> > > +			return 0;
> 
> > Is the !CLONE_VM case specific only to the default shadow stack
> > allocation? Sorry if this has been discussed already (or I completely
> > forgot) but I thought we'd only implement this for the thread creation
> > case. The typical fork() for a new process should inherit the parent's
> > layout, so applicable to the clone3() with the shadow stack arguments as
> > well (which should be ignored or maybe return an error with !CLONE_VM).
> 
> This is again all existing behaviour for the case where the user has not
> specified a shadow stack reindented, as mentioned above if the user has
> specified one explicitly then we just do what we were asked.  The
> existing behaviour is to only create a new shadow stack for the child in
> the CLONE_VM case and leave the child using the same shadow stack as the
> parent in the copied mm for !CLONE_VM.

I guess I was rather questioning the current choices than the new
clone3() ABI. But even for the new clone3() ABI, does it make sense to
set up a shadow stack if the current stack isn't changed? We'll end up
with a lot of possible combinations that will never get tested but
potentially become obscure ABI. Limiting the options to the sane choices
only helps with validation and unsurprising changes later on.

> > > @@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args)
> > >  	 */
> > >  	trace_sched_process_fork(current, p);
> > >  
> > > +	shstk_post_fork(p, args);
> 
> > Do we need this post fork call? Can we not handle the setup via the
> > copy_thread() path in shstk_alloc_thread_stack()?
> 
> It looks like we do actually have the new mm in the process before we
> call copy_thread() so we could move things into there though we'd loose
> a small bit of factoring out of the error handling (at one point I had
> more code factored out but right now it's quite small, looking again we
> could also factor out the get_task_mm()/mput()).  ISTR having the new
> process' mm was the biggest reason for this initially but looking again
> I'm not sure why that was.  It does still feel like even the small
> amount that's factored out currently is useful though, a bit less
> duplication in the architecture code which feels welcome here.

I think you can probably keep this. My comment was based on the
assumption that we only support the CLONE_VM case where we wouldn't need
the access_remote_vm(), just some direct write similar to
write_user_shstk_64().

I still think we should have limited this ABI to the CLONE_VM and
!CLONE_VFORK cases but I don't have a strong view if the consensus was
to allow it for classic fork() and vfork() like uses (I just think they
won't be used).
Mark Brown Aug. 13, 2024, 6:58 p.m. UTC | #3
On Tue, Aug 13, 2024 at 05:25:47PM +0100, Catalin Marinas wrote:

> However, the x86 would be slightly inconsistent here between clone() and
> clone3(). I guess it depends how you look at it. The classic clone()
> syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
> new shadow stack allocated which I'd expect since it's a new stack.
> Well, I doubt anyone cares about this scenario. Are there real cases of
> !CLONE_VM but with a new stack?

ISTR the concerns were around someone being clever with vfork() but I
don't remember anything super concrete.  In terms of the inconsistency
here that was actually another thing that came up - if userspace
specifies a stack for clone3() it'll just get used even with CLONE_VFORK
so it seemed to make sense to do the same thing for the shadow stack.
This was part of the thinking when we were looking at it, if you can
specify a regular stack you should be able to specify a shadow stack.

> > > I'm confused that we need to consume the token here. I could not find
> > > the default shadow stack allocation doing this, only setting it via
> > > create_rstor_token() (or I did not search enough). In the default case,

> > As discussed for a couple of previous versions if we don't have the
> > token and userspace can specify any old shadow stack page as the shadow
> > stack this allows clone3() to be used to overwrite the shadow stack of
> > another thread, you can point to a shadow stack page which is currently

> IIUC, the kernel-allocated shadow stack will have the token always set
> while the user-allocated one will be cleared. I was looking to

No, when the kernel allocates we don't bother with tokens at all.  We
only look for and clear a token with the user specified shadow stack.

> understand the inconsistency between these two cases in terms of the
> final layout of the new shadow stack: one with the token, the other
> without. I can see the need for checking but maybe start with requiring
> it to be 0 and setting the token before returning, for consistency with
> clone().

The layout should be the same, the shadow stack will point to where the
token would be - the only difference is if we checked to see if there
was a token there.  Since we either clear the token on use or allocate a
fresh page in both cases the value there will be 0.

> In the kernel-allocated shadow stack, is the token used for anything? I
> can see it's used for signal delivery and return but I couldn't figure
> out what it is used for in a thread's shadow stack.

For arm64 we place differently formatted tokens there during signal
handling, and a token is placed at the top of the stack as part of the
architected stack pivoting instructions (and a token at the destination
consumed).  I believe x86 has the same pivoting behaviour but ICBW.  A
user specified shadow stack is handled in a very similar way to what
would happen if the newly created thread immediately pivoted to the
specified stack.

> Also, can one not use the clone3() to point to the clone()-allocated
> shadow stack? Maybe that's unlikely as an app tends to stick to one
> syscall flavour or the other.

A valid token will only be present on an inactive stack.  If a thread
pivots away from a kernel allocated stack then another thread could be
started using the original kernel allocated stack, any program doing
this should think carefully about the lifecycle of the kernel allocated
stack but it's possible.  If a thread has not pivoted away from it's
stack then there won't be a token at the top of the stack and it won't
be possible to pivot to it.

> > > > +		/*
> > > > +		 * For CLONE_VFORK the child will share the parents
> > > > +		 * shadow stack.  Make sure to clear the internal
> > > > +		 * tracking of the thread shadow stack so the freeing
> > > > +		 * logic run for child knows to leave it alone.
> > > > +		 */
> > > > +		if (clone_flags & CLONE_VFORK) {
> > > > +			shstk->base = 0;
> > > > +			shstk->size = 0;
> > > > +			return 0;
> > > > +		}

> > > I think we should leave the CLONE_VFORK check on its own independent of
> > > the clone3() arguments. If one passes both CLONE_VFORK and specific
> > > shadow stack address/size, they should be ignored (or maybe return an
> > > error if you want to make it stricter).

> > This is existing logic from the current x86 code that's been reindented
> > due to the addition of explicitly specified shadow stacks, it's not new
> > behaviour.  It is needed to stop the child thinking it has the parent's
> > shadow stack in the CLONE_VFORK case.

> I figured that. But similar to the current !CLONE_VM behaviour where no
> new shadow stack is allocated even if a new stack is passed to clone(),
> I was thinking of something similar here for consistency: don't set up a
> shadow stack in the CLONE_VFORK case or at least allow it only if a new
> stack is being set up (if we extend this to clone(), it would be a small
> ABI change).

We could restrict specifying a shadow stack to only be supported when a
regular stack is also specified, if we're doing that I'd prefer to do it
in all cases rather than only for vfork() since that reduces the number
of special cases and we don't restrict normal stacks like that.

> > This is again all existing behaviour for the case where the user has not
> > specified a shadow stack reindented, as mentioned above if the user has
> > specified one explicitly then we just do what we were asked.  The
> > existing behaviour is to only create a new shadow stack for the child in
> > the CLONE_VM case and leave the child using the same shadow stack as the
> > parent in the copied mm for !CLONE_VM.

> I guess I was rather questioning the current choices than the new
> clone3() ABI. But even for the new clone3() ABI, does it make sense to
> set up a shadow stack if the current stack isn't changed? We'll end up
> with a lot of possible combinations that will never get tested but
> potentially become obscure ABI. Limiting the options to the sane choices
> only helps with validation and unsurprising changes later on.

OTOH if we add the restrictions it's more code (and more test code) to
check, and thinking about if we've missed some important use case.  Not
that it's a *huge* amount of code, like I say I'd not be too unhappy
with adding a restriction on having a regular stack specified in order
to specify a shadow stack.
Catalin Marinas Aug. 14, 2024, 9:38 a.m. UTC | #4
On Tue, Aug 13, 2024 at 07:58:26PM +0100, Mark Brown wrote:
> On Tue, Aug 13, 2024 at 05:25:47PM +0100, Catalin Marinas wrote:
> > However, the x86 would be slightly inconsistent here between clone() and
> > clone3(). I guess it depends how you look at it. The classic clone()
> > syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
> > new shadow stack allocated which I'd expect since it's a new stack.
> > Well, I doubt anyone cares about this scenario. Are there real cases of
> > !CLONE_VM but with a new stack?
> 
> ISTR the concerns were around someone being clever with vfork() but I
> don't remember anything super concrete.  In terms of the inconsistency
> here that was actually another thing that came up - if userspace
> specifies a stack for clone3() it'll just get used even with CLONE_VFORK
> so it seemed to make sense to do the same thing for the shadow stack.
> This was part of the thinking when we were looking at it, if you can
> specify a regular stack you should be able to specify a shadow stack.

Yes, I agree. But by this logic, I was wondering why the current clone()
behaviour does not allocate a shadow stack when a new stack is
requested with CLONE_VFORK. That's rather theoretical though and we may
not want to change the ABI.

> > > > I'm confused that we need to consume the token here. I could not find
> > > > the default shadow stack allocation doing this, only setting it via
> > > > create_rstor_token() (or I did not search enough). In the default case,
> 
> > > As discussed for a couple of previous versions if we don't have the
> > > token and userspace can specify any old shadow stack page as the shadow
> > > stack this allows clone3() to be used to overwrite the shadow stack of
> > > another thread, you can point to a shadow stack page which is currently
> 
> > IIUC, the kernel-allocated shadow stack will have the token always set
> > while the user-allocated one will be cleared. I was looking to
> 
> No, when the kernel allocates we don't bother with tokens at all.  We
> only look for and clear a token with the user specified shadow stack.

Ah, you are right, I misread the alloc_shstk() function. It takes a
set_res_tok parameter which is false for the normal allocation.

> > I guess I was rather questioning the current choices than the new
> > clone3() ABI. But even for the new clone3() ABI, does it make sense to
> > set up a shadow stack if the current stack isn't changed? We'll end up
> > with a lot of possible combinations that will never get tested but
> > potentially become obscure ABI. Limiting the options to the sane choices
> > only helps with validation and unsurprising changes later on.
> 
> OTOH if we add the restrictions it's more code (and more test code) to
> check, and thinking about if we've missed some important use case.  Not
> that it's a *huge* amount of code, like I say I'd not be too unhappy
> with adding a restriction on having a regular stack specified in order
> to specify a shadow stack.

I guess we just follow the normal stack behaviour for clone3(), at least
we'd be consistent with that.

Anyway, I understood this patch now and the ABI decisions. FWIW:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Brown Aug. 14, 2024, 1:20 p.m. UTC | #5
On Wed, Aug 14, 2024 at 10:38:54AM +0100, Catalin Marinas wrote:
> On Tue, Aug 13, 2024 at 07:58:26PM +0100, Mark Brown wrote:

> > ISTR the concerns were around someone being clever with vfork() but I
> > don't remember anything super concrete.  In terms of the inconsistency
> > here that was actually another thing that came up - if userspace
> > specifies a stack for clone3() it'll just get used even with CLONE_VFORK
> > so it seemed to make sense to do the same thing for the shadow stack.
> > This was part of the thinking when we were looking at it, if you can
> > specify a regular stack you should be able to specify a shadow stack.

> Yes, I agree. But by this logic, I was wondering why the current clone()
> behaviour does not allocate a shadow stack when a new stack is
> requested with CLONE_VFORK. That's rather theoretical though and we may
> not want to change the ABI.

The default for vfork() is to reuse both the normal and shadow stacks,
clone3() does make it all much more flexible.  All the shadow stack
ABI predates clone3(), even if it ended up getting merged after.

> Anyway, I understood this patch now and the ABI decisions. FWIW:

> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks!
Edgecombe, Rick P Aug. 15, 2024, 12:18 a.m. UTC | #6
On Thu, 2024-08-08 at 09:15 +0100, Mark Brown wrote:
> +int arch_shstk_post_fork(struct task_struct *t, struct kernel_clone_args
> *args)
> +{
> +       /*
> +        * SSP is aligned, so reserved bits and mode bit are a zero, just mark
> +        * the token 64-bit.
> +        */
> +       struct mm_struct *mm;
> +       unsigned long addr, ssp;
> +       u64 expected;
> +       u64 val;
> +       int ret = -EINVAL;

We should probably?
if (!features_enabled(ARCH_SHSTK_SHSTK))
	return 0;

> +
> +       ssp = args->shadow_stack + args->shadow_stack_size;
> +       addr = ssp - SS_FRAME_SIZE;
> +       expected = ssp | BIT(0);
> +
> +       mm = get_task_mm(t);
> +       if (!mm)
> +               return -EFAULT;

We could check that the VMA is shadow stack here. I'm not sure what could go
wrong though. If you point it to RW memory it could start the thread with that
as a shadow stack and just blow up at the first call. It might be nicer to fail
earlier though.

> +
> +       /* This should really be an atomic cmpxchg.  It is not. */
> +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> +                            FOLL_FORCE) != sizeof(val))
> +               goto out;
> +
> +       if (val != expected)
> +               goto out;
> +       val = 0;

After a token is consumed normally, it doesn't set it to zero. Instead it sets
it to a "previous-ssp token". I don't think we actually want to do that here
though because it involves the old SSP, which doesn't really apply in this case.
I don't see any problem with zero, but was there any special thinking behind it?

> +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> +                            FOLL_FORCE | FOLL_WRITE) != sizeof(val))
> +               goto out;

The GUPs still seem a bit unfortunate for a couple reasons:
 - We could do a CMPXCHG version and are just not (I see ARM has identical code
in gcs_consume_token()). It's not the only race like this though FWIW.
 - I *think* this is the only unprivileged FOLL_FORCE that can write to the
current process in the kernel. As is, it could be used on normal RO mappings, at
least in a limited way. Maybe another point for the VMA check. We'd want to
check that it is normal shadow stack?
 - Lingering doubts about the wisdom of doing GUPs during task creation.

I don't think they are show stoppers, but the VMA check would be nice to have in
the first upstream support.

> +
> +       ret = 0;
> +
> +out:
> +       mmput(mm);
> +       return ret;
> +}
> +
> 
[snip]
> 
>  
> +static void shstk_post_fork(struct task_struct *p,
> +                           struct kernel_clone_args *args)
> +{
> +       if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +               return;
> +
> +       if (!args->shadow_stack)
> +               return;
> +
> +       if (arch_shstk_post_fork(p, args) != 0)
> +               force_sig_fault_to_task(SIGSEGV, SEGV_CPERR, NULL, p);
> +}
> +

Hmm, is this forcing the signal on the new task, which is set up on a user
provided shadow stack that failed the token check? It would handle the signal
with an arbitrary SSP then I think. We should probably fail the clone call in
the parent instead, which can be done by doing the work in copy_process(). Do
you see a problem with doing it at the end of copy_process()? I don't know if
there could be ordering constraints.
Mark Brown Aug. 15, 2024, 2:24 p.m. UTC | #7
On Thu, Aug 15, 2024 at 12:18:23AM +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-08-08 at 09:15 +0100, Mark Brown wrote:

> > +       ssp = args->shadow_stack + args->shadow_stack_size;
> > +       addr = ssp - SS_FRAME_SIZE;
> > +       expected = ssp | BIT(0);

> > +       mm = get_task_mm(t);
> > +       if (!mm)
> > +               return -EFAULT;

> We could check that the VMA is shadow stack here. I'm not sure what could go
> wrong though. If you point it to RW memory it could start the thread with that
> as a shadow stack and just blow up at the first call. It might be nicer to fail
> earlier though.

Sure, I wasn't doing anything since like you say the new thread will
fail anyway but we can do the check.  As you point out below it'll close
down the possibility of writing to memory.

> > +       /* This should really be an atomic cmpxchg.  It is not. */
> > +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> > +                            FOLL_FORCE) != sizeof(val))
> > +               goto out;
> > +
> > +       if (val != expected)
> > +               goto out;
> > +       val = 0;

> After a token is consumed normally, it doesn't set it to zero. Instead it sets
> it to a "previous-ssp token". I don't think we actually want to do that here
> though because it involves the old SSP, which doesn't really apply in this case.
> I don't see any problem with zero, but was there any special thinking behind it?

I wasn't aware of the x86 behaviour for pivots here, 0 was just a
default thing to choose for an invalid value.  arm64 will also leave
a value on the outgoing stack as a product of the two step pivots we
have but it's not really something you'd look for.

> > +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> > +                            FOLL_FORCE | FOLL_WRITE) != sizeof(val))
> > +               goto out;

> The GUPs still seem a bit unfortunate for a couple reasons:
>  - We could do a CMPXCHG version and are just not (I see ARM has identical code
> in gcs_consume_token()). It's not the only race like this though FWIW.
>  - I *think* this is the only unprivileged FOLL_FORCE that can write to the
> current process in the kernel. As is, it could be used on normal RO mappings, at
> least in a limited way. Maybe another point for the VMA check. We'd want to
> check that it is normal shadow stack?
>  - Lingering doubts about the wisdom of doing GUPs during task creation.

> I don't think they are show stoppers, but the VMA check would be nice to have in
> the first upstream support.

The check you suggest for shadow stack memory should avoid abuse of the
FOLL_FORCE at least.  It'd be a bit narrow, you'd only be able to
overwrite a value where we managed to read a valid token, but it's
there.

> > +static void shstk_post_fork(struct task_struct *p,
> > +                           struct kernel_clone_args *args)
> > +{
> > +       if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> > +               return;
> > +
> > +       if (!args->shadow_stack)
> > +               return;
> > +
> > +       if (arch_shstk_post_fork(p, args) != 0)
> > +               force_sig_fault_to_task(SIGSEGV, SEGV_CPERR, NULL, p);
> > +}

> Hmm, is this forcing the signal on the new task, which is set up on a user
> provided shadow stack that failed the token check? It would handle the signal
> with an arbitrary SSP then I think. We should probably fail the clone call in
> the parent instead, which can be done by doing the work in copy_process(). Do

One thing I was thinking when writing this was that I wanted to make it
possible to implement the check in the vDSO if there's any architectures
that could do so, avoiding any need to GUP, but I can't see that that's
actually been possible.

> you see a problem with doing it at the end of copy_process()? I don't know if
> there could be ordering constraints.

I was concerned when I was writing the code about ordring constraints,
but I did revise what the code was doing several times and as I was
saying in reply to Catalin I'm no longer sure those apply.
Edgecombe, Rick P Aug. 16, 2024, 2:52 p.m. UTC | #8
On Fri, 2024-08-16 at 09:44 +0100, Catalin Marinas wrote:
> > After a token is consumed normally, it doesn't set it to zero. Instead it
> > sets
> > it to a "previous-ssp token". I don't think we actually want to do that here
> > though because it involves the old SSP, which doesn't really apply in this
> > case.
> > I don't see any problem with zero, but was there any special thinking behind
> > it?
> 
> BTW, since it's the parent setting up the shadow stack in its own
> address space before forking, I think at least the read can avoid
> access_remote_vm() and we could do it earlier, even before the new
> process is created.

Hmm. Makes sense. It's a bit racy since the parent could consume that token from
another thread, but it would be a race in any case.

> 
> > > +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> > > +                            FOLL_FORCE | FOLL_WRITE) != sizeof(val))
> > > +               goto out;
> > 
> > The GUPs still seem a bit unfortunate for a couple reasons:
> >   - We could do a CMPXCHG version and are just not (I see ARM has identical
> > code
> > in gcs_consume_token()). It's not the only race like this though FWIW.
> >   - I *think* this is the only unprivileged FOLL_FORCE that can write to the
> > current process in the kernel. As is, it could be used on normal RO
> > mappings, at
> > least in a limited way. Maybe another point for the VMA check. We'd want to
> > check that it is normal shadow stack?
> >   - Lingering doubts about the wisdom of doing GUPs during task creation.
> 
> I don't like the access_remote_vm() either. In the common (practically
> only) case with CLONE_VM, the mm is actually current->mm, so no need for
> a GUP.

On the x86 side, we don't have a shadow stack access CMPXCHG. We will have to
GUP and do a normal CMPXCHG off of the direct map to handle it fully properly in
any case (CLONE_VM or not).

> 
> We could, in theory, consume this token in the parent before the child
> mm is created. The downside is that if a parent forks multiple
> processes using the same shadow stack, it will have to set the token
> each time. I'd be fine with this, that's really only for the mostly
> theoretical case where one doesn't use CLONE_VM and still want a
> separate stack and shadow stack.
> 
> > I don't think they are show stoppers, but the VMA check would be nice to
> > have in
> > the first upstream support.
> 
> Good point.
Mark Brown Aug. 16, 2024, 3:30 p.m. UTC | #9
On Fri, Aug 16, 2024 at 02:52:28PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-08-16 at 09:44 +0100, Catalin Marinas wrote:

> > BTW, since it's the parent setting up the shadow stack in its own
> > address space before forking, I think at least the read can avoid
> > access_remote_vm() and we could do it earlier, even before the new
> > process is created.

> Hmm. Makes sense. It's a bit racy since the parent could consume that token from
> another thread, but it would be a race in any case.

So it sounds like we might be coming round to this?  I've got a new
version that verifies the VM_SHADOW_STACK good to go but if we're going
to switch back to consuming the token in the parent context I may as
well do that.  Like I said in the other mail I'd rather not flip flop
on this.
Mark Brown Aug. 16, 2024, 3:46 p.m. UTC | #10
On Fri, Aug 16, 2024 at 04:29:13PM +0100, Catalin Marinas wrote:
> On Fri, Aug 16, 2024 at 11:51:57AM +0100, Mark Brown wrote:

> > I change back to parsing the token in the parent but I don't want to end
> > up in a cycle of bouncing between the two implementations depending on
> > who's reviewed the most recent version.

> You and others spent a lot more time looking at shadow stacks than me.
> I'm not necessarily asking to change stuff but rather understand the
> choices made.

I'm a little ambivalent on this - on the one hand accessing the child's
memory is not a thing of great beauty but on the other hand it does
make the !CLONE_VM case more solid.  My general instinct is that the
ugliness is less of an issue than the "oh, there's a gap there" stuff
with the !CLONE_VM case since it's more "why are we doing that?" than
"we missed this".
Mark Brown Aug. 16, 2024, 5:06 p.m. UTC | #11
On Fri, Aug 16, 2024 at 04:38:48PM +0100, Catalin Marinas wrote:
> On Fri, Aug 16, 2024 at 02:52:28PM +0000, Edgecombe, Rick P wrote:

> > On the x86 side, we don't have a shadow stack access CMPXCHG. We will have to
> > GUP and do a normal CMPXCHG off of the direct map to handle it fully properly in
> > any case (CLONE_VM or not).

> I guess we could do the same here and for the arm64 gcs_consume_token().
> Basically get_user_page_vma_remote() gives us the page together with the
> vma that you mentioned needs checking. We can then do a cmpxchg directly
> on the page_address(). It's probably faster anyway than doing GUP twice.

There was some complication with get_user_page_vma_remote() while I was
working on an earlier version which meant I didn't use it, though with
adding checking of VMAs perhaps whatever it was isn't such an issue any
more.
Jann Horn Aug. 16, 2024, 5:08 p.m. UTC | #12
On Thu, Aug 15, 2024 at 2:18 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Thu, 2024-08-08 at 09:15 +0100, Mark Brown wrote:
> > +       if (access_remote_vm(mm, addr, &val, sizeof(val),
> > +                            FOLL_FORCE | FOLL_WRITE) != sizeof(val))
> > +               goto out;
>
> The GUPs still seem a bit unfortunate for a couple reasons:
>  - We could do a CMPXCHG version and are just not (I see ARM has identical code
> in gcs_consume_token()). It's not the only race like this though FWIW.
>  - I *think* this is the only unprivileged FOLL_FORCE that can write to the
> current process in the kernel. As is, it could be used on normal RO mappings, at
> least in a limited way. Maybe another point for the VMA check. We'd want to
> check that it is normal shadow stack?

Yeah, having a FOLL_FORCE write in clone3 would be a weakness for
userspace CFI and probably make it possible to violate mseal()
restrictions that are supposed to enforce that address space regions
are read-only.

>  - Lingering doubts about the wisdom of doing GUPs during task creation.
>
> I don't think they are show stoppers, but the VMA check would be nice to have in
> the first upstream support.
[...]
> > +static void shstk_post_fork(struct task_struct *p,
> > +                           struct kernel_clone_args *args)
> > +{
> > +       if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> > +               return;
> > +
> > +       if (!args->shadow_stack)
> > +               return;
> > +
> > +       if (arch_shstk_post_fork(p, args) != 0)
> > +               force_sig_fault_to_task(SIGSEGV, SEGV_CPERR, NULL, p);
> > +}
> > +
>
> Hmm, is this forcing the signal on the new task, which is set up on a user
> provided shadow stack that failed the token check? It would handle the signal
> with an arbitrary SSP then I think. We should probably fail the clone call in
> the parent instead, which can be done by doing the work in copy_process(). Do
> you see a problem with doing it at the end of copy_process()? I don't know if
> there could be ordering constraints.

FWIW I think we have things like force_fatal_sig() and
force_exit_sig() to send signals that userspace can't catch with
signal handlers - if you have to do the copying after the new task has
been set up, something along those lines might be the right way to
kill the child.

Though, did anyone in the thread yet suggest that you could do this
before the child process has fully materialized but after the child MM
has been set up? Somewhere in copy_process() between copy_mm() and the
"/* No more failure paths after this point. */" comment?
Mark Brown Aug. 16, 2024, 5:17 p.m. UTC | #13
On Fri, Aug 16, 2024 at 07:08:09PM +0200, Jann Horn wrote:

> Yeah, having a FOLL_FORCE write in clone3 would be a weakness for
> userspace CFI and probably make it possible to violate mseal()
> restrictions that are supposed to enforce that address space regions
> are read-only.

Note that this will only happen for shadow stack pages (with the new
version) and only for a valid token at the specific address.  mseal()ing
a shadow stack to be read only is hopefully not going to go terribly
well for userspace.

> Though, did anyone in the thread yet suggest that you could do this
> before the child process has fully materialized but after the child MM
> has been set up? Somewhere in copy_process() between copy_mm() and the
> "/* No more failure paths after this point. */" comment?

Yes, I'e got a version that does that waiting to go pending some
discussion on if we even do the check for the token in the child mm.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 4cb77e004615..252feeda6999 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -6,6 +6,7 @@ 
 #include <linux/types.h>
 
 struct task_struct;
+struct kernel_clone_args;
 struct ksignal;
 
 #ifdef CONFIG_X86_USER_SHADOW_STACK
@@ -16,8 +17,8 @@  struct thread_shstk {
 
 long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
 void reset_thread_features(void);
-unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clone_flags,
-				       unsigned long stack_size);
+unsigned long shstk_alloc_thread_stack(struct task_struct *p,
+				       const struct kernel_clone_args *args);
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
@@ -28,8 +29,10 @@  static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
 static inline void reset_thread_features(void) {}
 static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
-						     unsigned long clone_flags,
-						     unsigned long stack_size) { return 0; }
+						     const struct kernel_clone_args *args)
+{
+	return 0;
+}
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index f63f8fd00a91..59456ab8d93f 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -207,7 +207,7 @@  int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
 	 * update it.
 	 */
-	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
+	new_ssp = shstk_alloc_thread_stack(p, args);
 	if (IS_ERR_VALUE(new_ssp))
 		return PTR_ERR((void *)new_ssp);
 
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 059685612362..d7005974aff5 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -191,44 +191,105 @@  void reset_thread_features(void)
 	current->thread.features_locked = 0;
 }
 
-unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, unsigned long clone_flags,
-				       unsigned long stack_size)
+int arch_shstk_post_fork(struct task_struct *t, struct kernel_clone_args *args)
+{
+	/*
+	 * SSP is aligned, so reserved bits and mode bit are a zero, just mark
+	 * the token 64-bit.
+	 */
+	struct mm_struct *mm;
+	unsigned long addr, ssp;
+	u64 expected;
+	u64 val;
+	int ret = -EINVAL;
+
+	ssp = args->shadow_stack + args->shadow_stack_size;
+	addr = ssp - SS_FRAME_SIZE;
+	expected = ssp | BIT(0);
+
+	mm = get_task_mm(t);
+	if (!mm)
+		return -EFAULT;
+
+	/* This should really be an atomic cmpxchg.  It is not. */
+	if (access_remote_vm(mm, addr, &val, sizeof(val),
+			     FOLL_FORCE) != sizeof(val))
+		goto out;
+
+	if (val != expected)
+		goto out;
+	val = 0;
+	if (access_remote_vm(mm, addr, &val, sizeof(val),
+			     FOLL_FORCE | FOLL_WRITE) != sizeof(val))
+		goto out;
+
+	ret = 0;
+
+out:
+	mmput(mm);
+	return ret;
+}
+
+unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
+				       const struct kernel_clone_args *args)
 {
 	struct thread_shstk *shstk = &tsk->thread.shstk;
+	unsigned long clone_flags = args->flags;
 	unsigned long addr, size;
 
 	/*
 	 * If shadow stack is not enabled on the new thread, skip any
-	 * switch to a new shadow stack.
+	 * implicit switch to a new shadow stack and reject attempts to
+	 * explciitly specify one.
 	 */
-	if (!features_enabled(ARCH_SHSTK_SHSTK))
+	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
+		if (args->shadow_stack || args->shadow_stack_size)
+			return (unsigned long)ERR_PTR(-EINVAL);
+
 		return 0;
+	}
 
 	/*
-	 * For CLONE_VFORK the child will share the parents shadow stack.
-	 * Make sure to clear the internal tracking of the thread shadow
-	 * stack so the freeing logic run for child knows to leave it alone.
+	 * If the user specified a shadow stack then do some basic
+	 * validation and use it, otherwise fall back to a default
+	 * shadow stack size if the clone_flags don't indicate an
+	 * allocation is unneeded.
 	 */
-	if (clone_flags & CLONE_VFORK) {
+	if (args->shadow_stack) {
+		addr = args->shadow_stack;
+		size = args->shadow_stack_size;
 		shstk->base = 0;
 		shstk->size = 0;
-		return 0;
-	}
+	} else {
+		/*
+		 * For CLONE_VFORK the child will share the parents
+		 * shadow stack.  Make sure to clear the internal
+		 * tracking of the thread shadow stack so the freeing
+		 * logic run for child knows to leave it alone.
+		 */
+		if (clone_flags & CLONE_VFORK) {
+			shstk->base = 0;
+			shstk->size = 0;
+			return 0;
+		}
 
-	/*
-	 * For !CLONE_VM the child will use a copy of the parents shadow
-	 * stack.
-	 */
-	if (!(clone_flags & CLONE_VM))
-		return 0;
+		/*
+		 * For !CLONE_VM the child will use a copy of the
+		 * parents shadow stack.
+		 */
+		if (!(clone_flags & CLONE_VM))
+			return 0;
 
-	size = adjust_shstk_size(stack_size);
-	addr = alloc_shstk(0, size, 0, false);
-	if (IS_ERR_VALUE(addr))
-		return addr;
+		size = args->stack_size;
+		size = adjust_shstk_size(size);
+		addr = alloc_shstk(0, size, 0, false);
+		if (IS_ERR_VALUE(addr))
+			return addr;
 
-	shstk->base = addr;
-	shstk->size = size;
+		/* We allocated the shadow stack, we should deallocate it. */
+		shstk->base = addr;
+		shstk->size = size;
+	}
 
 	return addr + size;
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index d362aacf9f89..56b2013d7fe5 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -43,6 +43,8 @@  struct kernel_clone_args {
 	void *fn_arg;
 	struct cgroup *cgrp;
 	struct css_set *cset;
+	unsigned long shadow_stack;
+	unsigned long shadow_stack_size;
 };
 
 /*
@@ -230,4 +232,15 @@  static inline void task_unlock(struct task_struct *p)
 
 DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
 
+#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
+int arch_shstk_post_fork(struct task_struct *p,
+			 struct kernel_clone_args *args);
+#else
+static inline int arch_shstk_post_fork(struct task_struct *p,
+				       struct kernel_clone_args *args)
+{
+	return 0;
+}
+#endif
+
 #endif /* _LINUX_SCHED_TASK_H */
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..8b7af52548fd 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -84,6 +84,10 @@ 
  *                kernel's limit of nested PID namespaces.
  * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
  *                a file descriptor for the cgroup.
+ * @shadow_stack: Pointer to the memory allocated for the child
+ *                shadow stack.
+ * @shadow_stack_size: Specify the size of the shadow stack for
+ *                     the child process.
  *
  * The structure is versioned by size and thus extensible.
  * New struct members must go at the end of the struct and
@@ -101,12 +105,15 @@  struct clone_args {
 	__aligned_u64 set_tid;
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
+	__aligned_u64 shadow_stack;
+	__aligned_u64 shadow_stack_size;
 };
 #endif
 
-#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
-#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
-#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
+#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
+#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
+#define CLONE_ARGS_SIZE_VER3 104 /* sizeof fourth published struct */
 
 /*
  * Scheduling policies
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..18278c72681c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -128,6 +128,11 @@ 
  */
 #define MAX_THREADS FUTEX_TID_MASK
 
+/*
+ * Require that shadow stacks can store at least one element
+ */
+#define SHADOW_STACK_SIZE_MIN sizeof(void *)
+
 /*
  * Protected counters by write_lock_irq(&tasklist_lock)
  */
@@ -2729,6 +2734,19 @@  struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
 	return copy_process(NULL, 0, node, &args);
 }
 
+static void shstk_post_fork(struct task_struct *p,
+			    struct kernel_clone_args *args)
+{
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
+		return;
+
+	if (!args->shadow_stack)
+		return;
+
+	if (arch_shstk_post_fork(p, args) != 0)
+		force_sig_fault_to_task(SIGSEGV, SEGV_CPERR, NULL, p);
+}
+
 /*
  *  Ok, this is the main fork-routine.
  *
@@ -2790,6 +2808,8 @@  pid_t kernel_clone(struct kernel_clone_args *args)
 	 */
 	trace_sched_process_fork(current, p);
 
+	shstk_post_fork(p, args);
+
 	pid = get_task_pid(p, PIDTYPE_PID);
 	nr = pid_vnr(pid);
 
@@ -2939,7 +2959,9 @@  noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		     CLONE_ARGS_SIZE_VER1);
 	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
 		     CLONE_ARGS_SIZE_VER2);
-	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
+	BUILD_BUG_ON(offsetofend(struct clone_args, shadow_stack_size) !=
+		     CLONE_ARGS_SIZE_VER3);
+	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
 
 	if (unlikely(usize > PAGE_SIZE))
 		return -E2BIG;
@@ -2972,16 +2994,18 @@  noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
 		return -EINVAL;
 
 	*kargs = (struct kernel_clone_args){
-		.flags		= args.flags,
-		.pidfd		= u64_to_user_ptr(args.pidfd),
-		.child_tid	= u64_to_user_ptr(args.child_tid),
-		.parent_tid	= u64_to_user_ptr(args.parent_tid),
-		.exit_signal	= args.exit_signal,
-		.stack		= args.stack,
-		.stack_size	= args.stack_size,
-		.tls		= args.tls,
-		.set_tid_size	= args.set_tid_size,
-		.cgroup		= args.cgroup,
+		.flags			= args.flags,
+		.pidfd			= u64_to_user_ptr(args.pidfd),
+		.child_tid		= u64_to_user_ptr(args.child_tid),
+		.parent_tid		= u64_to_user_ptr(args.parent_tid),
+		.exit_signal		= args.exit_signal,
+		.stack			= args.stack,
+		.stack_size		= args.stack_size,
+		.tls			= args.tls,
+		.set_tid_size		= args.set_tid_size,
+		.cgroup			= args.cgroup,
+		.shadow_stack		= args.shadow_stack,
+		.shadow_stack_size	= args.shadow_stack_size,
 	};
 
 	if (args.set_tid &&
@@ -3022,6 +3046,34 @@  static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
 	return true;
 }
 
+/**
+ * clone3_shadow_stack_valid - check and prepare shadow stack
+ * @kargs: kernel clone args
+ *
+ * Verify that shadow stacks are only enabled if supported.
+ */
+static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
+{
+	if (kargs->shadow_stack) {
+		if (!kargs->shadow_stack_size)
+			return false;
+
+		if (kargs->shadow_stack_size < SHADOW_STACK_SIZE_MIN)
+			return false;
+
+		if (kargs->shadow_stack_size > rlimit(RLIMIT_STACK))
+			return false;
+
+		/*
+		 * The architecture must check support on the specific
+		 * machine.
+		 */
+		return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
+	} else {
+		return !kargs->shadow_stack_size;
+	}
+}
+
 static bool clone3_args_valid(struct kernel_clone_args *kargs)
 {
 	/* Verify that no unknown flags are passed along. */
@@ -3044,7 +3096,7 @@  static bool clone3_args_valid(struct kernel_clone_args *kargs)
 	    kargs->exit_signal)
 		return false;
 
-	if (!clone3_stack_valid(kargs))
+	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
 		return false;
 
 	return true;