Message ID | 20241023150511.3923558-1-kevin.brodsky@arm.com |
---|---|
Series | Improve arm64 pkeys handling in signal delivery |
On 10/23/24 08:05, Kevin Brodsky wrote: ...> diff --git a/tools/testing/selftests/mm/pkey-x86.h b/tools/testing/selftests/mm/pkey-x86.h > index 5f28e26a2511..53ed9a336ffe 100644 > --- a/tools/testing/selftests/mm/pkey-x86.h > +++ b/tools/testing/selftests/mm/pkey-x86.h > @@ -34,6 +34,8 @@ > #define PAGE_SIZE 4096 > #define MB (1<<20) > > +#define PKEY_ALLOW_NONE 0x55555555 Hi Kevin, Looking at this in context, I think "PKEY_ALLOW_NONE" is not a great name. On one hand, we have: PKEY_DISABLE_ACCESS PKEY_DISABLE_WRITE which are values for *A* pkey. But PKEY_ALLOW_NONE is a whole register value and spans permissions for many keys. We don't want folks trying to do something like: pkey_alloc(flags, PKEY_ALLOW_NONE); If I were naming it in x86 code, I'd probably call it: PKRU_ALLOW_NONE or something. > static inline void __page_o_noops(void) > { > /* 8-bytes of instruction * 512 bytes = 1 page */ > diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c > index a8088b645ad6..b5e1767ee5d9 100644 > --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c > +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c > @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; > pthread_cond_t cond = PTHREAD_COND_INITIALIZER; > siginfo_t siginfo = {0}; > > +static u64 pkey_reg_no_access; Ideally, this would be a real const or a #define because it really is static. Right? Or is there something dynamic about the ARM implementation's value? ... > * Setup alternate signal stack, which should be pkey_mprotect()ed by > @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr) > syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0); > > /* Disable MPK 0. Only MPK 1 is enabled. 
*/ > - __write_pkey_reg(0x55555551); > + pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0); > + __write_pkey_reg(pkey_reg); The existing magic numbers are not great, but could we do: #define PKEY_ALLOW_ALL 0x0 So that this can be written like this: pkey_reg = PKRU_ALLOW_NONE; pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL); That would get rid of the magic '0'. > /* Segfault */ > *bad = 1; > @@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void) > int pkey; > int parent_pid = 0; > int child_pid = 0; > + u64 pkey_reg; > > sa.sa_flags = SA_SIGINFO | SA_ONSTACK; > > @@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void) > assert(stack != MAP_FAILED); > > /* Allow access to MPK 0 and MPK 1 */ > - __write_pkey_reg(0x55555550); > + pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0); > + pkey_reg = set_pkey_bits(pkey_reg, 1, 0); > + __write_pkey_reg(pkey_reg); ... and using the pattern from above, this is quite a bit more readable: pkey_reg = PKRU_ALLOW_NONE; pkey_reg = set_pkey_bits(pkey_reg, 0, PKEY_ALLOW_ALL); pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL); ... > + /* Only allow X for MPK 0 and nothing for other keys */ > + pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, > + PKEY_DISABLE_ACCESS); If the comment says "only allow X", then I'd expect the code to say: pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X); ... or something similar.
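To make the register arithmetic in the review above concrete, here is a minimal standalone sketch, assuming an x86-style layout of 2 bits per key across 16 keys. The helper is a simplified stand-in for the selftest's set_pkey_bits(), not a copy of it, and PKRU_ALLOW_NONE / PKEY_ALLOW_ALL are the names proposed in this review rather than existing definitions. NR_PKEYS and the two wrapper functions are hypothetical names added for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-style layout: 2 bits per key (access-disable, write-disable). */
#define BITS_PER_PKEY   2
#define PKEY_MASK       0x3ull
#define NR_PKEYS        16              /* hypothetical name */

#define PKEY_ALLOW_ALL  0x0ull          /* name proposed in the review */
#define PKRU_ALLOW_NONE 0x55555555ull   /* access-disable set for every key */

/* Simplified stand-in for the selftest helper: replace one key's bits. */
static uint64_t set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	unsigned int shift;

	assert(pkey >= 0 && pkey < NR_PKEYS);
	shift = (unsigned int)pkey * BITS_PER_PKEY;

	reg &= ~(PKEY_MASK << shift);           /* clear the key's old bits */
	reg |= (flags & PKEY_MASK) << shift;    /* install the new bits */
	return reg;
}

/* "Disable MPK 0. Only MPK 1 is enabled." */
static uint64_t allow_only_pkey1(void)
{
	return set_pkey_bits(PKRU_ALLOW_NONE, 1, PKEY_ALLOW_ALL);
}

/* "Allow access to MPK 0 and MPK 1" */
static uint64_t allow_pkey0_and_pkey1(void)
{
	uint64_t reg = PKRU_ALLOW_NONE;

	reg = set_pkey_bits(reg, 0, PKEY_ALLOW_ALL);
	reg = set_pkey_bits(reg, 1, PKEY_ALLOW_ALL);
	return reg;
}
```

Under these assumptions the two wrappers reproduce the magic numbers the patch removes (0x55555551 and 0x55555550), which is the point of building the value per key instead of hard-coding it.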
On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: > diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c > index f5fb48dabebe..d2e4e50977ae 100644 > --- a/arch/arm64/kernel/signal.c > +++ b/arch/arm64/kernel/signal.c > @@ -66,9 +66,63 @@ struct rt_sigframe_user_layout { > unsigned long end_offset; > }; > > +/* > + * Holds any EL0-controlled state that influences unprivileged memory accesses. > + * This includes both accesses done in userspace and uaccess done in the kernel. > + * > + * This state needs to be carefully managed to ensure that it doesn't cause > + * uaccess to fail when setting up the signal frame, and the signal handler > + * itself also expects a well-defined state when entered. > + */ > +struct user_access_state { > + u64 por_el0; > +}; > + > #define TERMINATOR_SIZE round_up(sizeof(struct _aarch64_ctx), 16) > #define EXTRA_CONTEXT_SIZE round_up(sizeof(struct extra_context), 16) > > +/* > + * Save the unpriv access state into ua_state and reset it to disable any > + * restrictions. > + */ > +static void save_reset_user_access_state(struct user_access_state *ua_state) > +{ > + if (system_supports_poe()) { > + /* > + * Enable all permissions in all 8 keys > + * (inspired by REPEAT_BYTE()) > + */ > + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; I think this should be ~0ul. 
> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn) > { > struct pt_regs *regs = current_pt_regs(); > struct rt_sigframe __user *frame; > + struct user_access_state ua_state; > > /* Always make any pending restarted system calls return -EINTR */ > current->restart_block.fn = do_no_restart_syscall; > @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn) > if (!access_ok(frame, sizeof (*frame))) > goto badframe; > > - if (restore_sigframe(regs, frame)) > + if (restore_sigframe(regs, frame, &ua_state)) > goto badframe; > > if (restore_altstack(&frame->uc.uc_stack)) > goto badframe; > > + restore_user_access_state(&ua_state); > + > return regs->regs[0]; > > badframe: The saving part I'm fine with. For restoring, I was wondering whether we can get a more privileged POR_EL0 if reading the frame somehow failed. This is largely theoretical, there are other ways to attack like writing POR_EL0 directly than unmapping/remapping the signal stack. What I'd change here is always restore_user_access_state() to POR_EL0_INIT. Maybe just initialise ua_state above and add the function call after the badframe label. Either way: Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
On 24/10/2024 12:59, Catalin Marinas wrote: > On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: >> +/* >> + * Save the unpriv access state into ua_state and reset it to disable any >> + * restrictions. >> + */ >> +static void save_reset_user_access_state(struct user_access_state *ua_state) >> +{ >> + if (system_supports_poe()) { >> + /* >> + * Enable all permissions in all 8 keys >> + * (inspired by REPEAT_BYTE()) >> + */ >> + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; > I think this should be ~0ul. It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32 bits are RES0). That said, given that D128 has 4-bit pkeys, we could anticipate and fill the top 32 bits too (should make no difference on D64). >> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn) >> { >> struct pt_regs *regs = current_pt_regs(); >> struct rt_sigframe __user *frame; >> + struct user_access_state ua_state; >> >> /* Always make any pending restarted system calls return -EINTR */ >> current->restart_block.fn = do_no_restart_syscall; >> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn) >> if (!access_ok(frame, sizeof (*frame))) >> goto badframe; >> >> - if (restore_sigframe(regs, frame)) >> + if (restore_sigframe(regs, frame, &ua_state)) >> goto badframe; >> >> if (restore_altstack(&frame->uc.uc_stack)) >> goto badframe; >> >> + restore_user_access_state(&ua_state); >> + >> return regs->regs[0]; >> >> badframe: > The saving part I'm fine with. For restoring, I was wondering whether we > can get a more privileged POR_EL0 if reading the frame somehow failed. > This is largely theoretical, there are other ways to attack like > writing POR_EL0 directly than unmapping/remapping the signal stack. > > What I'd change here is always restore_user_access_state() to > POR_EL0_INIT. Maybe just initialise ua_state above and add the function > call after the badframe label. I'm not sure I understand. 
When we enter this function, POR_EL0 is set to whatever the signal handler set it to (POR_EL0_INIT by default). There are then two cases:

1) Everything succeeds, including reading the saved POR_EL0 from the frame. We then call restore_user_access_state(), setting POR_EL0 to the value we've read, and return to userspace.
2) Any uaccess fails (for instance reading POR_EL0). In that case we leave POR_EL0 unchanged and deliver SIGSEGV.

In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or whatever the signal handler set it to. It's not clear to me that forcing it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV handler will be able to recover, since the new signal frame we will create for it may be a mix of interrupted state and signal handler state (depending on exactly where we fail).

Kevin
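The two cases described above, and the variant suggested in review (reset to the init value on the badframe path), can be modelled in a few lines of userspace C. This is only a toy, not kernel code: model_por_el0 stands in for the register, a NULL pointer stands in for a failed uaccess, and POR_EL0_INIT_PLACEHOLDER is an arbitrary value chosen for the sketch, since the thread does not give the real POR_EL0_INIT. All names other than struct user_access_state are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary placeholder; the real POR_EL0_INIT is not given in the thread. */
#define POR_EL0_INIT_PLACEHOLDER 0x7ull

struct user_access_state {
	uint64_t por_el0;
};

/* Stands in for the POR_EL0 register in this userspace model. */
static uint64_t model_por_el0;

/*
 * Toy model of the rt_sigreturn flow.
 * saved_por == NULL models a uaccess failure while reading the frame.
 * force_init_on_fail selects the variant suggested in review.
 * Returns 0 on success, -1 where the kernel would deliver SIGSEGV.
 */
static int model_sigreturn(const uint64_t *saved_por, int force_init_on_fail)
{
	struct user_access_state ua_state = {
		.por_el0 = POR_EL0_INIT_PLACEHOLDER,
	};

	if (saved_por == NULL)
		goto badframe;

	/* Case 1: the frame was read, so restore the saved value. */
	ua_state.por_el0 = *saved_por;
	model_por_el0 = ua_state.por_el0;
	return 0;

badframe:
	/* Case 2 as posted: the register keeps the handler's value. */
	if (force_init_on_fail)
		model_por_el0 = ua_state.por_el0; /* still the init value */
	return -1;
}

/* Small self-check of the three paths. */
static void model_demo(void)
{
	uint64_t saved = 0x77ull;

	model_por_el0 = 0x5ull;                 /* value set by the handler */
	assert(model_sigreturn(&saved, 0) == 0);
	assert(model_por_el0 == 0x77ull);

	model_por_el0 = 0x5ull;
	assert(model_sigreturn(NULL, 0) == -1);
	assert(model_por_el0 == 0x5ull);        /* unchanged */

	assert(model_sigreturn(NULL, 1) == -1);
	assert(model_por_el0 == POR_EL0_INIT_PLACEHOLDER);
}
```

The model deliberately says nothing about what ends up in the SIGSEGV signal frame, which is the part the discussion is actually about.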
On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote: > On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote: > > On 24/10/2024 12:59, Catalin Marinas wrote: > > > On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: > > >> +/* > > >> + * Save the unpriv access state into ua_state and reset it to disable any > > >> + * restrictions. > > >> + */ > > >> +static void save_reset_user_access_state(struct user_access_state *ua_state) > > >> +{ > > >> + if (system_supports_poe()) { > > >> + /* > > >> + * Enable all permissions in all 8 keys > > >> + * (inspired by REPEAT_BYTE()) > > >> + */ > > >> + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; > > > I think this should be ~0ul. > > > > It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the > > lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32 > > bits are RES0). That said, given that D128 has 4-bit pkeys, we could > > anticipate and fill the top 32 bits too (should make no difference on D64). > > I guess we could leave it as 32-bit for now and remember to update it > when we enable more keys with D128. Setting the top RES0 bits doesn't > hurt either since they are already documented in the Arm ARM. Up to you, > it's fine like above as well. Can we maybe just have a brute-force loop that constructs the value using the appropriate #define macros? The compiler will const-fold it; I'd be prepared to bet that the generated code would be identical... 
> > >> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn) > > >> { > > >> struct pt_regs *regs = current_pt_regs(); > > >> struct rt_sigframe __user *frame; > > >> + struct user_access_state ua_state; > > >> > > >> /* Always make any pending restarted system calls return -EINTR */ > > >> current->restart_block.fn = do_no_restart_syscall; > > >> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn) > > >> if (!access_ok(frame, sizeof (*frame))) > > >> goto badframe; > > >> > > >> - if (restore_sigframe(regs, frame)) > > >> + if (restore_sigframe(regs, frame, &ua_state)) > > >> goto badframe; > > >> > > >> if (restore_altstack(&frame->uc.uc_stack)) > > >> goto badframe; > > >> > > >> + restore_user_access_state(&ua_state); > > >> + > > >> return regs->regs[0]; > > >> > > >> badframe: > > > The saving part I'm fine with. For restoring, I was wondering whether we > > > can get a more privileged POR_EL0 if reading the frame somehow failed. > > > This is largely theoretical, there are other ways to attack like > > > writing POR_EL0 directly than unmapping/remapping the signal stack. > > > > > > What I'd change here is always restore_user_access_state() to > > > POR_EL0_INIT. Maybe just initialise ua_state above and add the function > > > call after the badframe label. > > > > I'm not sure I understand. When we enter this function, POR_EL0 is set > > to whatever the signal handler set it to (POR_EL0_INIT by default). > > There are then two cases: > > 1) Everything succeeds, including reading the saved POR_EL0 from the > > frame. We then call restore_user_access_state(), setting POR_EL0 to the > > value we've read, and return to userspace. > > 2) Any uaccess fails (for instance reading POR_EL0). In that case we > > leave POR_EL0 unchanged and deliver SIGSEGV. > > > > In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or > > whatever the signal handler set it to. It's not clear to me that forcing > > it to POR_EL0_INIT helps much. 
> > Either way it's doubtful that the SIGSEGV
> > handler will be able to recover, since the new signal frame we will
> > create for it may be a mix of interrupted state and signal handler state
> > (depending on exactly where we fail).
>
> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
> set up by the previous signal handler, potentially more privileged. Does
> it matter? Can it return all the way to the original context?

That seems a valid concern.

It looks a bit like we don't back out the temporary change to POR_EL0 if writing the sigframe fails, so the temporary "allow all" perms might get saved out into the SIGSEGV sigframe on the alternate signal stack, and will then be restored as the user thread's POR_EL0 when the SIGSEGV returns.

(This is all assuming that the force_sig(SIGSEGV) logic works properly at all... I'm still trying to puzzle it out!)

Cheers
---Dave
On 24/10/2024 18:19, Dave Martin wrote: > On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote: >> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote: >>> On 24/10/2024 12:59, Catalin Marinas wrote: >>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: >>>>> +/* >>>>> + * Save the unpriv access state into ua_state and reset it to disable any >>>>> + * restrictions. >>>>> + */ >>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state) >>>>> +{ >>>>> + if (system_supports_poe()) { >>>>> + /* >>>>> + * Enable all permissions in all 8 keys >>>>> + * (inspired by REPEAT_BYTE()) >>>>> + */ >>>>> + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; >>>> I think this should be ~0ul. >>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the >>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32 >>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could >>> anticipate and fill the top 32 bits too (should make no difference on D64). >> I guess we could leave it as 32-bit for now and remember to update it >> when we enable more keys with D128. Setting the top RES0 bits doesn't >> hurt either since they are already documented in the Arm ARM. Up to you, >> it's fine like above as well. > Can we maybe just have a brute-force loop that constructs the value > using the appropriate #define macros? > > The compiler will const-fold it; I'd be prepared to bet that the > generated code would be identical... Fine by me, I suppose I was too eager to use the one-liner I had found :) Building that value based on arch_max_pkey() is probably a better idea in the long run. (And indeed the codegen is the same, it boils down to a mov w0, #0x77777777 in both case.) 
>>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn) >>>>> { >>>>> struct pt_regs *regs = current_pt_regs(); >>>>> struct rt_sigframe __user *frame; >>>>> + struct user_access_state ua_state; >>>>> >>>>> /* Always make any pending restarted system calls return -EINTR */ >>>>> current->restart_block.fn = do_no_restart_syscall; >>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn) >>>>> if (!access_ok(frame, sizeof (*frame))) >>>>> goto badframe; >>>>> >>>>> - if (restore_sigframe(regs, frame)) >>>>> + if (restore_sigframe(regs, frame, &ua_state)) >>>>> goto badframe; >>>>> >>>>> if (restore_altstack(&frame->uc.uc_stack)) >>>>> goto badframe; >>>>> >>>>> + restore_user_access_state(&ua_state); >>>>> + >>>>> return regs->regs[0]; >>>>> >>>>> badframe: >>>> The saving part I'm fine with. For restoring, I was wondering whether we >>>> can get a more privileged POR_EL0 if reading the frame somehow failed. >>>> This is largely theoretical, there are other ways to attack like >>>> writing POR_EL0 directly than unmapping/remapping the signal stack. >>>> >>>> What I'd change here is always restore_user_access_state() to >>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function >>>> call after the badframe label. >>> I'm not sure I understand. When we enter this function, POR_EL0 is set >>> to whatever the signal handler set it to (POR_EL0_INIT by default). >>> There are then two cases: >>> 1) Everything succeeds, including reading the saved POR_EL0 from the >>> frame. We then call restore_user_access_state(), setting POR_EL0 to the >>> value we've read, and return to userspace. >>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we >>> leave POR_EL0 unchanged and deliver SIGSEGV. >>> >>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or >>> whatever the signal handler set it to. It's not clear to me that forcing >>> it to POR_EL0_INIT helps much. 
>>> Either way it's doubtful that the SIGSEGV
>>> handler will be able to recover, since the new signal frame we will
>>> create for it may be a mix of interrupted state and signal handler state
>>> (depending on exactly where we fail).
>> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
>> set up by the previous signal handler, potentially more privileged. Does
>> it matter? Can it return all the way to the original context?

What we store into the signal frame when delivering that SIGSEGV is a mixture of the original state (up to the point of failure) and the signal handler's state (what we couldn't restore). It's hard to reason about how that SIGSEGV handler could possibly handle this, but in any case it would have to massage its signal frame so that the next sigreturn does the right thing. Restoring only part of the frame records is bound to cause trouble and that's true for POR_EL0 as well - I doubt there's much value in special-casing it.

> That seems a valid concern.
>
> It looks a bit like we don't back out the temporary change to POR_EL0
> if writing the sigframe fails, so the temporary "allow all" perms might
> get saved out into the SIGSEGV sigframe on the alternate signal
> stack, and will then be restored as the user thread's POR_EL0 when the
> SIGSEGV returns.

It sounds like you're referring to the delivery case, not the return case. In the delivery case (setup_rt_frame()), the "allow all" value will never be saved into the sigframe because we call restore_user_access_state() if anything failed (this is new in v2, exactly to prevent that scenario).

Kevin
On 23/10/2024 18:51, Dave Hansen wrote: > On 10/23/24 08:05, Kevin Brodsky wrote: > ...> diff --git a/tools/testing/selftests/mm/pkey-x86.h > b/tools/testing/selftests/mm/pkey-x86.h >> index 5f28e26a2511..53ed9a336ffe 100644 >> --- a/tools/testing/selftests/mm/pkey-x86.h >> +++ b/tools/testing/selftests/mm/pkey-x86.h >> @@ -34,6 +34,8 @@ >> #define PAGE_SIZE 4096 >> #define MB (1<<20) >> >> +#define PKEY_ALLOW_NONE 0x55555555 > Hi Kevin, > > Looking at this in context, I think "PKEY_ALLOW_NONE" is not a great > name. On one hand, we have: > > PKEY_DISABLE_ACCESS > PKEY_DISABLE_WRITE > > which are values for *A* pkey. > > But PKEY_ALLOW_NONE is a whole register value and spans permissions for > many keys. We don't want folks trying to do something like: > > pkey_alloc(flags, PKEY_ALLOW_NONE); > > If I were naming it in x86 code, I'd probably call it: > > PKRU_ALLOW_NONE > > or something. I agree, the naming is not ideal, I lacked inspiration! Maybe PKEY_REG_ALLOW_NONE to remain generic? > >> static inline void __page_o_noops(void) >> { >> /* 8-bytes of instruction * 512 bytes = 1 page */ >> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c >> index a8088b645ad6..b5e1767ee5d9 100644 >> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c >> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c >> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; >> pthread_cond_t cond = PTHREAD_COND_INITIALIZER; >> siginfo_t siginfo = {0}; >> >> +static u64 pkey_reg_no_access; > Ideally, this would be a real const or a #define because it really is > static. Right? Or is there something dynamic about the ARM > implementation's value? It isn't dynamic no, the issue is that on architectures where pkeys restrict execution we need to allow X for pkey 0. 
Of course it would be possible to define PKEY_REG_ALLOW_ALL in such a way that X is allowed for pkey 0, but I was concerned this might be misleading. No strong opinion either way, happy to make it purely a macro, maybe with a better name? > ... >> * Setup alternate signal stack, which should be pkey_mprotect()ed by >> @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr) >> syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0); >> >> /* Disable MPK 0. Only MPK 1 is enabled. */ >> - __write_pkey_reg(0x55555551); >> + pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0); >> + __write_pkey_reg(pkey_reg); > The existing magic numbers are not great, but could we do: > > #define PKEY_ALLOW_ALL 0x0 > > So that this can be written like this: > > pkey_reg = PKRU_ALLOW_NONE; > pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL); > > That would get rid of the magic '0'. Definitely better yes. But how about using Yury's uapi addition, PKEY_UNRESTRICTED [1]? [1] https://lore.kernel.org/all/20241022120128.359652-1-yury.khrustalev@arm.com/ > >> /* Segfault */ >> *bad = 1; >> @@ -240,6 +244,7 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void) >> int pkey; >> int parent_pid = 0; >> int child_pid = 0; >> + u64 pkey_reg; >> >> sa.sa_flags = SA_SIGINFO | SA_ONSTACK; >> >> @@ -257,7 +262,9 @@ static void test_sigsegv_handler_with_different_pkey_for_stack(void) >> assert(stack != MAP_FAILED); >> >> /* Allow access to MPK 0 and MPK 1 */ >> - __write_pkey_reg(0x55555550); >> + pkey_reg = set_pkey_bits(pkey_reg_no_access, 0, 0); >> + pkey_reg = set_pkey_bits(pkey_reg, 1, 0); >> + __write_pkey_reg(pkey_reg); > ... and using the pattern from above, this is quite a bit more readable: > > pkey_reg = PKRU_ALLOW_NONE; > pkey_reg = set_pkey_bits(pkey_reg, 0, PKEY_ALLOW_ALL); > pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL); > > ... 
>> + /* Only allow X for MPK 0 and nothing for other keys */
>> + pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
>> + PKEY_DISABLE_ACCESS);
> If the comment says "only allow X", then I'd expect the code to say:
>
> pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);
>
> ... or something similar.

I could #define PKEY_X PKEY_DISABLE_ACCESS but is the mixture of negative and positive polarity really helping? We cannot define PKEY_R and PKEY_W so that (for instance) PKEY_R | PKEY_X does what it says. Having to use PKEY_DISABLE_ACCESS to mean "X only" is not ideal, but this is what userspace already has to do. Either way if we define PKEY_REG_ALLOW_NONE or similar to allow X for pkey 0 as suggested then this will go.

Thanks for the review!

Kevin
On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote: > On 24/10/2024 18:19, Dave Martin wrote: > > On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote: > >> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote: > >>> On 24/10/2024 12:59, Catalin Marinas wrote: > >>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: > >>>>> +/* > >>>>> + * Save the unpriv access state into ua_state and reset it to disable any > >>>>> + * restrictions. > >>>>> + */ > >>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state) > >>>>> +{ > >>>>> + if (system_supports_poe()) { > >>>>> + /* > >>>>> + * Enable all permissions in all 8 keys > >>>>> + * (inspired by REPEAT_BYTE()) > >>>>> + */ > >>>>> + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; > >>>> I think this should be ~0ul. > >>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the > >>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32 > >>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could > >>> anticipate and fill the top 32 bits too (should make no difference on D64). > >> I guess we could leave it as 32-bit for now and remember to update it > >> when we enable more keys with D128. Setting the top RES0 bits doesn't > >> hurt either since they are already documented in the Arm ARM. Up to you, > >> it's fine like above as well. > > Can we maybe just have a brute-force loop that constructs the value > > using the appropriate #define macros? > > > > The compiler will const-fold it; I'd be prepared to bet that the > > generated code would be identical... > > Fine by me, I suppose I was too eager to use the one-liner I had found > :) Building that value based on arch_max_pkey() is probably a better > idea in the long run. (And indeed the codegen is the same, it boils down > to a mov w0, #0x77777777 in both case.) 
The one-liner was a neat trick (after the brief WTF moment) :)

I guess my uneasiness comes from baking the number of pkeys in via the type of 0u and an implicit relationship that this happens to have with the number of bits per pkey in the POR.

[...]

Cheers
---Dave
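For reference, the one-liner and the loop being discussed can be compared side by side in a standalone sketch. It assumes 4 bits per key and RXW == 0x7, which is consistent with the 0x77777777 quoted earlier in the thread; the constant definitions, NR_PKEYS and the function names are local to this sketch and hypothetical (the thread suggests arch_max_pkey() for the real loop bound).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, consistent with the 0x77777777 quoted in the thread. */
#define POE_BITS_PER_KEY 4
#define POE_MASK         0xfull
#define POE_RXW          0x7ull
#define NR_PKEYS         8      /* hypothetical; stands in for arch_max_pkey() */

/* The posted one-liner: ~0u limits the repeated pattern to 32 bits. */
static uint64_t por_enable_all_oneliner(void)
{
	return (~0u / POE_MASK) * POE_RXW;
}

/* The suggested alternative: build the value key by key. */
static uint64_t por_enable_all_loop(void)
{
	uint64_t por = 0;

	for (int pkey = 0; pkey < NR_PKEYS; pkey++)
		por |= POE_RXW << (pkey * POE_BITS_PER_KEY);
	return por;
}

/* What a 64-bit all-ones numerator would give (the kernel's ~0ul). */
static uint64_t por_enable_all_64bit(void)
{
	return (~0ull / POE_MASK) * POE_RXW;
}

/* Both constructions agree for 8 keys of 4 bits each. */
static void check_equivalence(void)
{
	assert(por_enable_all_oneliner() == por_enable_all_loop());
}
```

The loop makes the dependency on the key count and field width explicit, whereas the one-liner encodes the key count in the width of the literal; with a 64-bit numerator the upper 32 bits get filled as well.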
On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote: > On 24/10/2024 18:19, Dave Martin wrote: > > On Thu, Oct 24, 2024 at 04:42:10PM +0100, Catalin Marinas wrote: > >> On Thu, Oct 24, 2024 at 04:55:48PM +0200, Kevin Brodsky wrote: > >>> On 24/10/2024 12:59, Catalin Marinas wrote: > >>>> On Wed, Oct 23, 2024 at 04:05:09PM +0100, Kevin Brodsky wrote: > >>>>> +/* > >>>>> + * Save the unpriv access state into ua_state and reset it to disable any > >>>>> + * restrictions. > >>>>> + */ > >>>>> +static void save_reset_user_access_state(struct user_access_state *ua_state) > >>>>> +{ > >>>>> + if (system_supports_poe()) { > >>>>> + /* > >>>>> + * Enable all permissions in all 8 keys > >>>>> + * (inspired by REPEAT_BYTE()) > >>>>> + */ > >>>>> + u64 por_enable_all = (~0u / POE_MASK) * POE_RXW; > >>>> I think this should be ~0ul. > >>> It is ~0u on purpose, because unlike in REPEAT_BYTE(), I only wanted the > >>> lower 32 bits to be filled with POE_RXW (we only have 8 keys, the top 32 > >>> bits are RES0). That said, given that D128 has 4-bit pkeys, we could > >>> anticipate and fill the top 32 bits too (should make no difference on D64). > >> I guess we could leave it as 32-bit for now and remember to update it > >> when we enable more keys with D128. Setting the top RES0 bits doesn't > >> hurt either since they are already documented in the Arm ARM. Up to you, > >> it's fine like above as well. > > Can we maybe just have a brute-force loop that constructs the value > > using the appropriate #define macros? > > > > The compiler will const-fold it; I'd be prepared to bet that the > > generated code would be identical... > > Fine by me, I suppose I was too eager to use the one-liner I had found > :) Building that value based on arch_max_pkey() is probably a better > idea in the long run. (And indeed the codegen is the same, it boils down > to a mov w0, #0x77777777 in both case.) 
> > >>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn) > >>>>> { > >>>>> struct pt_regs *regs = current_pt_regs(); > >>>>> struct rt_sigframe __user *frame; > >>>>> + struct user_access_state ua_state; > >>>>> > >>>>> /* Always make any pending restarted system calls return -EINTR */ > >>>>> current->restart_block.fn = do_no_restart_syscall; > >>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn) > >>>>> if (!access_ok(frame, sizeof (*frame))) > >>>>> goto badframe; > >>>>> > >>>>> - if (restore_sigframe(regs, frame)) > >>>>> + if (restore_sigframe(regs, frame, &ua_state)) > >>>>> goto badframe; > >>>>> > >>>>> if (restore_altstack(&frame->uc.uc_stack)) > >>>>> goto badframe; > >>>>> > >>>>> + restore_user_access_state(&ua_state); > >>>>> + > >>>>> return regs->regs[0]; > >>>>> > >>>>> badframe: > >>>> The saving part I'm fine with. For restoring, I was wondering whether we > >>>> can get a more privileged POR_EL0 if reading the frame somehow failed. > >>>> This is largely theoretical, there are other ways to attack like > >>>> writing POR_EL0 directly than unmapping/remapping the signal stack. > >>>> > >>>> What I'd change here is always restore_user_access_state() to > >>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function > >>>> call after the badframe label. > >>> I'm not sure I understand. When we enter this function, POR_EL0 is set > >>> to whatever the signal handler set it to (POR_EL0_INIT by default). > >>> There are then two cases: > >>> 1) Everything succeeds, including reading the saved POR_EL0 from the > >>> frame. We then call restore_user_access_state(), setting POR_EL0 to the > >>> value we've read, and return to userspace. > >>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we > >>> leave POR_EL0 unchanged and deliver SIGSEGV. > >>> > >>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or > >>> whatever the signal handler set it to. 
> >>> It's not clear to me that forcing
> >>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
> >>> handler will be able to recover, since the new signal frame we will
> >>> create for it may be a mix of interrupted state and signal handler state
> >>> (depending on exactly where we fail).
> >> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
> >> set up by the previous signal handler, potentially more privileged. Does
> >> it matter? Can it return all the way to the original context?
>
> What we store into the signal frame when delivering that SIGSEGV is a
> mixture of the original state (up to the point of failure) and the
> signal handler's state (what we couldn't restore). It's hard to reason
> about how that SIGSEGV handler could possibly handle this, but in any
> case it would have to massage its signal frame so that the next
> sigreturn does the right thing. Restoring only part of the frame records
> is bound to cause trouble and that's true for POR_EL0 as well - I doubt
> there's much value in special-casing it.

This feels like a simplification?

We can leave a mix of restored and unrestored state when generating the SIGSEGV signal frame, providing that those changes will make no difference when the rt_sigreturn is replayed.

POR_EL0 will make a difference, though.

The POR_EL0 image in the SIGSEGV signal frame needs to be the same value that caused the original rt_sigreturn to barf (if this is what caused the barf). It should be up to the SIGSEGV handler to decide what (if anything) to do about that. The kernel can't know what userspace intended.

Note that for this to work, the SIGSEGV stack (whether main or alternate) must be accessible with POR_EL0_INIT permissions, or the SIGSEGV handler must start with a (gross) asm shim to establish a usable POR_EL0. But that's not really our problem here.
(I'm not saying that the kernel necessarily fails to do this -- I haven't checked -- but just trying to understand the problem here.)

The actual problem here is that if the SIGSEGV handler wants to bail out with a siglongjmp(), there is no way to determine the correct value of POR_EL0 to restore. I wonder whether POR_EL0 should be saved in sigjmp_buf (depending on whether sigjmp_buf is horribly inextensible and also full up, or merely horribly inextensible).

Does anyone know whether PKRU is in sigjmp_buf on x86?

> > That seems a valid concern.
> >
> > It looks a bit like we don't back out the temporary change to POR_EL0
> > if writing the sigframe fails, so the temporary "allow all" perms might
> > get saved out into the SIGSEGV sigframe on the alternate signal
> > stack, and will then be restored as the user thread's POR_EL0 when the
> > SIGSEGV returns.
>
> It sounds like you're referring to the delivery case, not the return
> case. In the delivery case (setup_rt_frame()), the "allow all" value
> will never be saved into the sigframe because we call
> restore_user_access_state() if anything failed (this is new in v2,
> exactly to prevent that scenario).

Ah, right -- I missed that detail.

Cheers
---Dave
On 10/25/24 01:31, Kevin Brodsky wrote:
> I agree, the naming is not ideal, I lacked inspiration! Maybe
> PKEY_REG_ALLOW_NONE to remain generic?

Works for me.

>>>  static inline void __page_o_noops(void)
>>>  {
>>>          /* 8-bytes of instruction * 512 bytes = 1 page */
>>> diff --git a/tools/testing/selftests/mm/pkey_sighandler_tests.c b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> index a8088b645ad6..b5e1767ee5d9 100644
>>> --- a/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> +++ b/tools/testing/selftests/mm/pkey_sighandler_tests.c
>>> @@ -37,6 +37,8 @@ pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
>>>  pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>>>  siginfo_t siginfo = {0};
>>>
>>> +static u64 pkey_reg_no_access;
>> Ideally, this would be a real const or a #define because it really is
>> static. Right? Or is there something dynamic about the ARM
>> implementation's value?
>
> It isn't dynamic no, the issue is that on architectures where pkeys
> restrict execution we need to allow X for pkey 0. Of course it would be
> possible to define PKEY_REG_ALLOW_ALL in such a way that X is allowed
> for pkey 0, but I was concerned this might be misleading. No strong
> opinion either way, happy to make it purely a macro, maybe with a better
> name?

I do think we should differentiate the truly "no access" value from the
one that allows X on pkey 0, at least in the selftest. Define a helper
that uses the *real* "no access" value:

/*
 * Returns the most restrictive register value
 * that can be used in the selftest.
 */
static inline u64 pkey_reg_restrictive_default(void)
{
        /*
         * The selftest code runs (mostly) with its code mapped with
         * pkey-0. Allow execution on pkey-0 so that each site doesn't
         * have to do this:
         */
        return set_pkey_bits(PKEY_REG_NO_ACCESS, 0, PKEY_X);
}

and then use it like this:

        pkey_reg = pkey_reg_restrictive_default();
        pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);

>>>           * Setup alternate signal stack, which should be pkey_mprotect()ed by
>>> @@ -142,7 +145,8 @@ static void *thread_segv_maperr_ptr(void *ptr)
>>>          syscall_raw(SYS_sigaltstack, (long)stack, 0, 0, 0, 0, 0);
>>>
>>>          /* Disable MPK 0. Only MPK 1 is enabled. */
>>> -        __write_pkey_reg(0x55555551);
>>> +        pkey_reg = set_pkey_bits(pkey_reg_no_access, 1, 0);
>>> +        __write_pkey_reg(pkey_reg);
>> The existing magic numbers are not great, but could we do:
>>
>>         #define PKEY_ALLOW_ALL 0x0
>>
>> So that this can be written like this:
>>
>>         pkey_reg = PKRU_ALLOW_NONE;
>>         pkey_reg = set_pkey_bits(pkey_reg, 1, PKEY_ALLOW_ALL);
>>
>> That would get rid of the magic '0'.
>
> Definitely better yes. But how about using Yury's uapi addition,
> PKEY_UNRESTRICTED [1]?
>

Works for me.

>> ...
>>> +        /* Only allow X for MPK 0 and nothing for other keys */
>>> +        pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0,
>>> +                                           PKEY_DISABLE_ACCESS);
>> If the comment says "only allow X", then I'd expect the code to say:
>>
>>         pkey_reg_no_access = set_pkey_bits(PKEY_ALLOW_NONE, 0, PKEY_X);
>>
>> ... or something similar.
>
> I could #define PKEY_X PKEY_DISABLE_ACCESS but is the mixture of
> negative and positive polarity really helping? We cannot define PKEY_R
> and PKEY_W so that (for instance) PKEY_R | PKEY_X does what it says.
> Having to use PKEY_DISABLE_ACCESS to mean "X only" is not ideal, but
> this is what userspace already has to do.

There would be some churn, but we could also convert the whole thing
over to just use explicit RWX enable bits, like in the
thread_segv_maperr_ptr() test:

        // Truly turn everything off:
        pkey_reg = PKEY_REG_NO_ACCESS;
        pkey_reg = set_pkey_perm(pkey_reg, 1, PKEY_RW);

I'm not sure that's worth the churn though.
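For readers following the magic numbers, here is a small standalone sketch of how a set_pkey_bits()-style helper composes the register values discussed above. It assumes an x86 PKRU-like layout (two bits per key: access-disable in bit 0, write-disable in bit 1). The `MODEL_*` names and `model_set_pkey_bits()` are hypothetical, and the function is a simplified stand-in for the selftest helper rather than a copy of it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed PKRU-like layout: 2 bits per key, 16 keys. */
#define MODEL_BITS_PER_PKEY        2
#define MODEL_NR_PKEYS             16
#define MODEL_PKEY_MASK            0x3ULL

/* Per-key permission values (apply to *one* key). */
#define MODEL_PKEY_UNRESTRICTED    0x0ULL
#define MODEL_PKEY_DISABLE_ACCESS  0x1ULL
#define MODEL_PKEY_DISABLE_WRITE   0x2ULL

/* Whole-register value: access disabled for every key. */
#define MODEL_PKEY_REG_ALLOW_NONE  0x55555555ULL

/* Replace the permission bits of one key inside a register value. */
static uint64_t model_set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
        assert(pkey >= 0 && pkey < MODEL_NR_PKEYS);

        unsigned int shift = (unsigned int)pkey * MODEL_BITS_PER_PKEY;

        reg &= ~(MODEL_PKEY_MASK << shift);         /* clear the key's old bits */
        reg |= (flags & MODEL_PKEY_MASK) << shift;  /* install the new bits */
        return reg;
}
```

Under these assumptions, starting from MODEL_PKEY_REG_ALLOW_NONE and opening key 1 gives 0x55555551, and opening key 0 as well gives 0x55555550, which are the two constants the patch removes. The sketch also shows why a per-key name and a whole-register name should not share a prefix: the two kinds of value are not interchangeable.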
On 25/10/2024 13:33, Dave Martin wrote:
> On Fri, Oct 25, 2024 at 10:24:41AM +0200, Kevin Brodsky wrote:
>>>>>>> @@ -907,6 +964,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>>>  {
>>>>>>>          struct pt_regs *regs = current_pt_regs();
>>>>>>>          struct rt_sigframe __user *frame;
>>>>>>> +        struct user_access_state ua_state;
>>>>>>>
>>>>>>>          /* Always make any pending restarted system calls return -EINTR */
>>>>>>>          current->restart_block.fn = do_no_restart_syscall;
>>>>>>> @@ -923,12 +981,14 @@ SYSCALL_DEFINE0(rt_sigreturn)
>>>>>>>          if (!access_ok(frame, sizeof (*frame)))
>>>>>>>                  goto badframe;
>>>>>>>
>>>>>>> -        if (restore_sigframe(regs, frame))
>>>>>>> +        if (restore_sigframe(regs, frame, &ua_state))
>>>>>>>                  goto badframe;
>>>>>>>
>>>>>>>          if (restore_altstack(&frame->uc.uc_stack))
>>>>>>>                  goto badframe;
>>>>>>>
>>>>>>> +        restore_user_access_state(&ua_state);
>>>>>>> +
>>>>>>>          return regs->regs[0];
>>>>>>>
>>>>>>>  badframe:
>>>>>> The saving part I'm fine with. For restoring, I was wondering whether we
>>>>>> can get a more privileged POR_EL0 if reading the frame somehow failed.
>>>>>> This is largely theoretical, there are other ways to attack like
>>>>>> writing POR_EL0 directly than unmapping/remapping the signal stack.
>>>>>>
>>>>>> What I'd change here is always restore_user_access_state() to
>>>>>> POR_EL0_INIT. Maybe just initialise ua_state above and add the function
>>>>>> call after the badframe label.
>>>>> I'm not sure I understand. When we enter this function, POR_EL0 is set
>>>>> to whatever the signal handler set it to (POR_EL0_INIT by default).
>>>>> There are then two cases:
>>>>> 1) Everything succeeds, including reading the saved POR_EL0 from the
>>>>> frame. We then call restore_user_access_state(), setting POR_EL0 to the
>>>>> value we've read, and return to userspace.
>>>>> 2) Any uaccess fails (for instance reading POR_EL0). In that case we
>>>>> leave POR_EL0 unchanged and deliver SIGSEGV.
>>>>>
>>>>> In case 2 POR_EL0 is most likely already set to POR_EL0_INIT, or
>>>>> whatever the signal handler set it to. It's not clear to me that forcing
>>>>> it to POR_EL0_INIT helps much. Either way it's doubtful that the SIGSEGV
>>>>> handler will be able to recover, since the new signal frame we will
>>>>> create for it may be a mix of interrupted state and signal handler state
>>>>> (depending on exactly where we fail).
>>>> If the SIGSEGV delivery succeeds, returning would restore the POR_EL0
>>>> set up by the previous signal handler, potentially more privileged. Does
>>>> it matter? Can it return all the way to the original context?
>> What we store into the signal frame when delivering that SIGSEGV is a
>> mixture of the original state (up to the point of failure) and the
>> signal handler's state (what we couldn't restore). It's hard to reason
>> about how that SIGSEGV handler could possibly handle this, but in any
>> case it would have to massage its signal frame so that the next
>> sigreturn does the right thing. Restoring only part of the frame records
>> is bound to cause trouble and that's true for POR_EL0 as well - I doubt
>> there's much value in special-casing it.
> This feels like a simplification?
>
> We can leave a mix of restored and unrestored state when generating the
> SIGSEGV signal frame, providing that those changes will make no
> difference when the rt_sigreturn is replayed.

I'm not sure I understand what this means. If the SIGSEGV handler were
to sigreturn without touching its signal frame, things are likely to
explode: it may be returning to the point where the original handler
called sigreturn, for instance (if the first uaccess failed during that
sigreturn call).

> POR_EL0 will make a difference, though.
>
> The POR_EL0 image in the SIGSEGV signal frame needs to be the same value
> that caused the original rt_sigreturn to barf (if this is what caused
> the barf). It should be up to the SIGSEGV handler to decide what (if
> anything) to do about that. The kernel can't know what userspace
> intended.

Unless I'm missing something this is exactly what happens now: what we
store in the SIGSEGV frame is the POR_EL0 value the original handler was
using.

> Note that for this to work, the SIGSEGV stack (whether main or
> alternate) must be accessible with POR_EL0_INIT permissions, or the
> SIGSEGV handler must start with a (gross) asm shim to establish a
> usable POR_EL0. But that's not really our problem here.

This is indeed orthogonal - the SIGSEGV handler will be run with
POR_EL0_INIT, like any other handler. The value we store in the frame is
unrelated.

> (I'm not saying that the kernel necessarily fails to do this -- I
> haven't checked -- but just trying to understand the problem here.)
>
>
> The actual problem here is that if the SIGSEGV handler wants to bail
> out with a siglongjmp(), there is no way to determine the correct value
> of POR_EL0 to restore.

Correct, but again this is true of any other record - for instance
TPIDR2.

> I wonder whether POR_EL0 should be saved in sigjmp_buf (depending on
> whether sigjmp_buf is horribly inextensible and also full up, or merely
> horribly inextensible).

It very much feels that this is the case - if a handler relies on
longjmp() or setcontext() to restore a known state, then POR_EL0 should
be part of that state.

>
> Does anyone know whether PKRU is in sigjmp_buf on x86?

I can't say for sure but I don't see PKRU being handled in
setjmp/longjmp in glibc at least.

Kevin
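As an illustration of the siglongjmp() point, the sketch below shows one way userspace could carry the pkey register alongside a sigjmp_buf and restore it by hand after the jump. Everything here is hypothetical: the register is simulated by a plain variable (`model_pkey_reg`) rather than POR_EL0 or PKRU, no real signal is raised, and nothing in the thread suggests that glibc or the kernel does this.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the pkey register; real code would access POR_EL0 or PKRU. */
static volatile uint64_t model_pkey_reg;

static uint64_t model_read_pkey_reg(void) { return model_pkey_reg; }
static void model_write_pkey_reg(uint64_t val) { model_pkey_reg = val; }

/* A jump buffer paired with the register value in force when it was set up. */
struct model_pkey_jmp_buf {
        sigjmp_buf env;
        uint64_t pkey_reg;
};

/* Plays the role of a handler that runs with other permissions and bails out. */
static void model_bail_out(struct model_pkey_jmp_buf *buf)
{
        model_write_pkey_reg(0x7);  /* placeholder "handler" value */
        siglongjmp(buf->env, 1);
}

/* Runs fn() under a jump buffer; returns the register value seen afterwards. */
static uint64_t model_run_guarded(void (*fn)(struct model_pkey_jmp_buf *))
{
        struct model_pkey_jmp_buf buf;

        assert(fn != NULL);

        /* Record the register before sigsetjmp(); it is not modified later. */
        buf.pkey_reg = model_read_pkey_reg();

        if (sigsetjmp(buf.env, 1) != 0) {
                /* Arrived via siglongjmp(): put the saved value back by hand. */
                model_write_pkey_reg(buf.pkey_reg);
        } else {
                fn(&buf);
        }

        return model_read_pkey_reg();
}
```

Without the explicit write after the jump, the code following sigsetjmp() would keep running with whatever value the handler had, which is the gap discussed above.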