mbox series

[RFT,v8,0/9] fork: Support shadow stacks in clone3()

Message ID 20240808-clone3-shadow-stack-v8-0-0acf37caf14c@kernel.org
Headers show
Series fork: Support shadow stacks in clone3() | expand

Message

Mark Brown Aug. 8, 2024, 8:15 a.m. UTC
The kernel has recently added support for shadow stacks, currently
x86 only using their CET feature but both arm64 and RISC-V have
equivalent features (GCS and Zicfiss respectively), I am actively
working on GCS[1].  With shadow stacks the hardware maintains an
additional stack containing only the return addresses for branch
instructions which is not generally writeable by userspace and ensures
that any returns are to the recorded addresses.  This provides some
protection against ROP attacks and making it easier to collect call
stacks.  These shadow stacks are allocated in the address space of the
userspace process.

Our API for shadow stacks does not currently offer userspace any
flexiblity for managing the allocation of shadow stacks for newly
created threads, instead the kernel allocates a new shadow stack with
the same size as the normal stack whenever a thread is created with the
feature enabled.  The stacks allocated in this way are freed by the
kernel when the thread exits or shadow stacks are disabled for the
thread.  This lack of flexibility and control isn't ideal, in the vast
majority of cases the shadow stack will be over allocated and the
implicit allocation and deallocation is not consistent with other
interfaces.  As far as I can tell the interface is done in this manner
mainly because the shadow stack patches were in development since before
clone3() was implemented.

Since clone3() is readily extensible let's add support for specifying a
shadow stack when creating a new thread or process in a similar manner
to how the normal stack is specified, keeping the current implicit
allocation behaviour if one is not specified either with clone3() or
through the use of clone().  The user must provide a shadow stack
address and size, this must point to memory mapped for use as a shadow
stackby map_shadow_stack() with a shadow stack token at the top of the
stack.

Please note that the x86 portions of this code are build tested only, I
don't appear to have a system that can run CET avaible to me, I have
done testing with an integration into my pending work for GCS.  There is
some possibility that the arm64 implementation may require the use of
clone3() and explicit userspace allocation of shadow stacks, this is
still under discussion.

Please further note that the token consumption done by clone3() is not
currently implemented in an atomic fashion, Rick indicated that he would
look into fixing this if people are OK with the implementation.

A new architecture feature Kconfig option for shadow stacks is added as
here, this was suggested as part of the review comments for the arm64
GCS series and since we need to detect if shadow stacks are supported it
seemed sensible to roll it in here.

[1] https://lore.kernel.org/r/20231009-arm64-gcs-v6-0-78e55deaa4dd@kernel.org/

Signed-off-by: Mark Brown <broonie@kernel.org>
---
Changes in v8:
- Fix token verification with user specified shadow stack.
- Don't track user managed shadow stacks for child processes.
- Link to v7: https://lore.kernel.org/r/20240731-clone3-shadow-stack-v7-0-a9532eebfb1d@kernel.org

Changes in v7:
- Rebase onto v6.11-rc1.
- Typo fixes.
- Link to v6: https://lore.kernel.org/r/20240623-clone3-shadow-stack-v6-0-9ee7783b1fb9@kernel.org

Changes in v6:
- Rebase onto v6.10-rc3.
- Ensure we don't try to free the parent shadow stack in error paths of
  x86 arch code.
- Spelling fixes in userspace API document.
- Additional cleanups and improvements to the clone3() tests to support
  the shadow stack tests.
- Link to v5: https://lore.kernel.org/r/20240203-clone3-shadow-stack-v5-0-322c69598e4b@kernel.org

Changes in v5:
- Rebase onto v6.8-rc2.
- Rework ABI to have the user allocate the shadow stack memory with
  map_shadow_stack() and a token.
- Force inlining of the x86 shadow stack enablement.
- Move shadow stack enablement out into a shared header for reuse by
  other tests.
- Link to v4: https://lore.kernel.org/r/20231128-clone3-shadow-stack-v4-0-8b28ffe4f676@kernel.org

Changes in v4:
- Formatting changes.
- Use a define for minimum shadow stack size and move some basic
  validation to fork.c.
- Link to v3: https://lore.kernel.org/r/20231120-clone3-shadow-stack-v3-0-a7b8ed3e2acc@kernel.org

Changes in v3:
- Rebase onto v6.7-rc2.
- Remove stale shadow_stack in internal kargs.
- If a shadow stack is specified unconditionally use it regardless of
  CLONE_ parameters.
- Force enable shadow stacks in the selftest.
- Update changelogs for RISC-V feature rename.
- Link to v2: https://lore.kernel.org/r/20231114-clone3-shadow-stack-v2-0-b613f8681155@kernel.org

Changes in v2:
- Rebase onto v6.7-rc1.
- Remove ability to provide preallocated shadow stack, just specify the
  desired size.
- Link to v1: https://lore.kernel.org/r/20231023-clone3-shadow-stack-v1-0-d867d0b5d4d0@kernel.org

---
Mark Brown (9):
      Documentation: userspace-api: Add shadow stack API documentation
      selftests: Provide helper header for shadow stack testing
      mm: Introduce ARCH_HAS_USER_SHADOW_STACK
      fork: Add shadow stack support to clone3()
      selftests/clone3: Remove redundant flushes of output streams
      selftests/clone3: Factor more of main loop into test_clone3()
      selftests/clone3: Explicitly handle child exits due to signals
      selftests/clone3: Allow tests to flag if -E2BIG is a valid error code
      selftests/clone3: Test shadow stack support

 Documentation/userspace-api/index.rst             |   1 +
 Documentation/userspace-api/shadow_stack.rst      |  41 ++++
 arch/x86/Kconfig                                  |   1 +
 arch/x86/include/asm/shstk.h                      |  11 +-
 arch/x86/kernel/process.c                         |   2 +-
 arch/x86/kernel/shstk.c                           | 105 +++++++---
 fs/proc/task_mmu.c                                |   2 +-
 include/linux/mm.h                                |   2 +-
 include/linux/sched/task.h                        |  13 ++
 include/uapi/linux/sched.h                        |  13 +-
 kernel/fork.c                                     |  76 ++++++--
 mm/Kconfig                                        |   6 +
 tools/testing/selftests/clone3/clone3.c           | 224 ++++++++++++++++++----
 tools/testing/selftests/clone3/clone3_selftests.h |  40 +++-
 tools/testing/selftests/ksft_shstk.h              |  63 ++++++
 15 files changed, 513 insertions(+), 87 deletions(-)
---
base-commit: 8400291e289ee6b2bf9779ff1c83a291501f017b
change-id: 20231019-clone3-shadow-stack-15d40d2bf536

Best regards,

Comments

Jann Horn Aug. 16, 2024, 3:52 p.m. UTC | #1
On Thu, Aug 8, 2024 at 10:16 AM Mark Brown <broonie@kernel.org> wrote:
> Since clone3() is readily extensible let's add support for specifying a
> shadow stack when creating a new thread or process in a similar manner
> to how the normal stack is specified, keeping the current implicit
> allocation behaviour if one is not specified either with clone3() or
> through the use of clone().  The user must provide a shadow stack
> address and size, this must point to memory mapped for use as a shadow
> stackby map_shadow_stack() with a shadow stack token at the top of the
> stack.

As a heads-up so you don't get surprised by this in the future:

Because clone3() does not pass the flags in a register like clone()
does, it is not available in places like docker containers that use
the default Docker seccomp policy
(https://github.com/moby/moby/blob/master/profiles/seccomp/default.json).
Docker uses seccomp to filter clone() arguments (to prevent stuff like
namespace creation), and that's not possible with clone3(), so
clone3() is blocked.

The same thing applies to things like sandboxed renderer processes of
web browsers - they want to block anything other than creating normal
threads, so they use seccomp to block stuff like namespace creation
and creating new processes.

I briefly mentioned this here during clone3 development, though I
probably should have been more explicit about how it would be
beneficial for clone3 to pass flags in a register:
<https://lore.kernel.org/all/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com/>

So if you want your feature to be available in such contexts, you'll
probably have to either add a new syscall clone4() that passes the
flags in a register; or do the plumbing work required to make it
possible to seccomp-filter things other than register contexts (by
invoking seccomp again from the clone3 handler with some kinda
pseudo-syscall?); or change the signature of the existing syscall (but
that would require something like using the high bit of the size to
signal that there's a flags argument in another register, which is
probably more ugly than just adding a new syscall).
Mark Brown Aug. 16, 2024, 4:19 p.m. UTC | #2
On Fri, Aug 16, 2024 at 05:52:20PM +0200, Jann Horn wrote:

> As a heads-up so you don't get surprised by this in the future:

> Because clone3() does not pass the flags in a register like clone()
> does, it is not available in places like docker containers that use
> the default Docker seccomp policy
> (https://github.com/moby/moby/blob/master/profiles/seccomp/default.json).
> Docker uses seccomp to filter clone() arguments (to prevent stuff like
> namespace creation), and that's not possible with clone3(), so
> clone3() is blocked.

This is probably fine, the existing shadow stack ABI provides a sensible
default behaviour for things that just use regular clone().  This series
just adds more control for things using clone3(), the main issue would
be anything that *needs* to specify stack size/placement and can't use
clone3().  That would need a separate userspace API if required, and
we'd still want to extend clone3() anyway.