mbox series

[0/5] KVM: rseq: Fix and a test for a KVM+rseq bug

Message ID 20210818001210.4073390-1-seanjc@google.com
Headers show
Series KVM: rseq: Fix and a test for a KVM+rseq bug | expand

Message

Sean Christopherson Aug. 18, 2021, 12:12 a.m. UTC
Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME,
e.g. for task migration, clears the flag without informing rseq and leads
to stale data in userspace's rseq struct.

Patch 2 is a cleanup to try and make future bugs less likely.  It's also
a baby step towards moving and renaming tracehook_notify_resume() since
it has nothing to do with tracing.  It kills me to not do the move/rename
as part of this series, but having a dedicated series/discussion seems
more appropriate given the sheer number of architectures that call
tracehook_notify_resume() and the lack of an obvious home for the code.

Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when
the include path (intentionally) omits tools' uapi headers.  KVM's
selftests do exactly that so that they can pick up the uapi headers from
the installed kernel headers, and still use various tools/ headers that
mirror kernel code, e.g. linux/types.h.  This allows the new test in
patch 4 to reference __NR_rseq without having to manually define it.

Patch 4 is a regression test for the KVM+rseq bug.

Patch 5 is a cleanup made possible by patch 3.


Sean Christopherson (5):
  KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM
    guest
  entry: rseq: Call rseq_handle_notify_resume() in
    tracehook_notify_resume()
  tools: Move x86 syscall number fallbacks to .../uapi/
  KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration
    bugs
  KVM: selftests: Remove __NR_userfaultfd syscall fallback

 arch/arm/kernel/signal.c                      |   1 -
 arch/arm64/kernel/signal.c                    |   1 -
 arch/csky/kernel/signal.c                     |   4 +-
 arch/mips/kernel/signal.c                     |   4 +-
 arch/powerpc/kernel/signal.c                  |   4 +-
 arch/s390/kernel/signal.c                     |   1 -
 include/linux/tracehook.h                     |   2 +
 kernel/entry/common.c                         |   4 +-
 kernel/rseq.c                                 |   4 +-
 .../x86/include/{ => uapi}/asm/unistd_32.h    |   0
 .../x86/include/{ => uapi}/asm/unistd_64.h    |   3 -
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 tools/testing/selftests/kvm/rseq_test.c       | 131 ++++++++++++++++++
 14 files changed, 143 insertions(+), 20 deletions(-)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)
 rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%)
 create mode 100644 tools/testing/selftests/kvm/rseq_test.c

Comments

Mathieu Desnoyers Aug. 19, 2021, 9:39 p.m. UTC | #1
----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

> Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to

> transferring to a KVM guest, which is roughly equivalent to an exit to

> userspace and processes many of the same pending actions.  While the task

> cannot be in an rseq critical section as the KVM path is reachable only

> via ioctl(KVM_RUN), the side effects that apply to rseq outside of a

> critical section still apply, e.g. the CPU ID needs to be updated if the

> task is migrated.

> 

> Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults

> and other badness in userspace VMMs that use rseq in combination with KVM,

> e.g. due to the CPU ID being stale after task migration.


I agree with the problem assessment, but I would recommend a small change
to this fix.

> 

> Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function")

> Reported-by: Peter Foley <pefoley@google.com>

> Bisected-by: Doug Evans <dje@google.com>

> Cc: Shakeel Butt <shakeelb@google.com>

> Cc: Thomas Gleixner <tglx@linutronix.de>

> Cc: stable@vger.kernel.org

> Signed-off-by: Sean Christopherson <seanjc@google.com>

> ---

> kernel/entry/kvm.c | 4 +++-

> kernel/rseq.c      | 4 ++--

> 2 files changed, 5 insertions(+), 3 deletions(-)

> 

> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c

> index 49972ee99aff..049fd06b4c3d 100644

> --- a/kernel/entry/kvm.c

> +++ b/kernel/entry/kvm.c

> @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu,

> unsigned long ti_work)

> 		if (ti_work & _TIF_NEED_RESCHED)

> 			schedule();

> 

> -		if (ti_work & _TIF_NOTIFY_RESUME)

> +		if (ti_work & _TIF_NOTIFY_RESUME) {

> 			tracehook_notify_resume(NULL);

> +			rseq_handle_notify_resume(NULL, NULL);

> +		}

> 

> 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);

> 		if (ret)

> diff --git a/kernel/rseq.c b/kernel/rseq.c

> index 35f7bd0fced0..58c79a7918cd 100644

> --- a/kernel/rseq.c

> +++ b/kernel/rseq.c

> @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs

> *rseq_cs)

> 

> static int rseq_ip_fixup(struct pt_regs *regs)

> {

> -	unsigned long ip = instruction_pointer(regs);

> +	unsigned long ip = regs ? instruction_pointer(regs) : 0;

> 	struct task_struct *t = current;

> 	struct rseq_cs rseq_cs;

> 	int ret;

> @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)

> 	 * If not nested over a rseq critical section, restart is useless.

> 	 * Clear the rseq_cs pointer and return.

> 	 */

> -	if (!in_rseq_cs(ip, &rseq_cs))

> +	if (!regs || !in_rseq_cs(ip, &rseq_cs))


I think clearing the thread's rseq_cs unconditionally here when regs is NULL
is not the behavior we want when this is called from xfer_to_guest_mode_work.

If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)
from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to
kill this application in the rseq_syscall handler when exiting back to usermode
when the ioctl eventually returns.

However, clearing the thread's rseq_cs will prevent this from happening.

So I would favor an approach where we simply do:

if (!regs)
     return 0;

Immediately at the beginning of rseq_ip_fixup, before getting the instruction
pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it
is not relevant to do any fixup here, because it is nested in a ioctl system
call.

Effectively, this would preserve the SIGSEGV behavior when this ioctl is
erroneously called by user-space from a rseq critical section.

Thanks for looking into this !

Mathieu

> 		return clear_rseq_cs(t);

> 	ret = rseq_need_restart(t, rseq_cs.flags);

> 	if (ret <= 0)

> --

> 2.33.0.rc1.237.g0d66db33f3-goog


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Sean Christopherson Aug. 19, 2021, 11:48 p.m. UTC | #2
On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:
> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)

> > 	 * If not nested over a rseq critical section, restart is useless.

> > 	 * Clear the rseq_cs pointer and return.

> > 	 */

> > -	if (!in_rseq_cs(ip, &rseq_cs))

> > +	if (!regs || !in_rseq_cs(ip, &rseq_cs))

> 

> I think clearing the thread's rseq_cs unconditionally here when regs is NULL

> is not the behavior we want when this is called from xfer_to_guest_mode_work.

> 

> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)

> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to

> kill this application in the rseq_syscall handler when exiting back to usermode

> when the ioctl eventually returns.

> 

> However, clearing the thread's rseq_cs will prevent this from happening.

> 

> So I would favor an approach where we simply do:

> 

> if (!regs)

>      return 0;

> 

> Immediately at the beginning of rseq_ip_fixup, before getting the instruction

> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it

> is not relevant to do any fixup here, because it is nested in a ioctl system

> call.

> 

> Effectively, this would preserve the SIGSEGV behavior when this ioctl is

> erroneously called by user-space from a rseq critical section.


Ha, that's effectively what I implemented first, but I changed it because of the
comment in clear_rseq_cs() that says:

  The rseq_cs field is set to NULL on preemption or signal delivery ... as well
  as well as on top of code outside of the rseq assembly block.

Which makes it sound like something might rely on clearing rseq_cs?

Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an
rseq critical section, and because syscalls in critical sections are illegal, by
definition clearing rseq_cs is a nop unless userspace is misbehaving.

If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it
not worth the extra code to detect an error that will likely be caught anyways?

diff --git a/kernel/rseq.c b/kernel/rseq.c
index 35f7bd0fced0..28b8342290b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)

        if (unlikely(t->flags & PF_EXITING))
                return;
+       if (!regs) {
+#ifdef CONFIG_DEBUG_RSEQ
+               if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))
+                       goto error;
+#endif
+               return;
+       }
        ret = rseq_ip_fixup(regs);
        if (unlikely(ret < 0))
                goto error;

> Thanks for looking into this !

> 

> Mathieu

> 

> > 		return clear_rseq_cs(t);

> > 	ret = rseq_need_restart(t, rseq_cs.flags);

> > 	if (ret <= 0)

> > --

> > 2.33.0.rc1.237.g0d66db33f3-goog

> 

> -- 

> Mathieu Desnoyers

> EfficiOS Inc.

> http://www.efficios.com
Mathieu Desnoyers Aug. 20, 2021, 6:51 p.m. UTC | #3
----- On Aug 19, 2021, at 7:48 PM, Sean Christopherson seanjc@google.com wrote:

> On Thu, Aug 19, 2021, Mathieu Desnoyers wrote:

>> ----- On Aug 17, 2021, at 8:12 PM, Sean Christopherson seanjc@google.com wrote:

>> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs)

>> > 	 * If not nested over a rseq critical section, restart is useless.

>> > 	 * Clear the rseq_cs pointer and return.

>> > 	 */

>> > -	if (!in_rseq_cs(ip, &rseq_cs))

>> > +	if (!regs || !in_rseq_cs(ip, &rseq_cs))

>> 

>> I think clearing the thread's rseq_cs unconditionally here when regs is NULL

>> is not the behavior we want when this is called from xfer_to_guest_mode_work.

>> 

>> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN)

>> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to

>> kill this application in the rseq_syscall handler when exiting back to usermode

>> when the ioctl eventually returns.

>> 

>> However, clearing the thread's rseq_cs will prevent this from happening.

>> 

>> So I would favor an approach where we simply do:

>> 

>> if (!regs)

>>      return 0;

>> 

>> Immediately at the beginning of rseq_ip_fixup, before getting the instruction

>> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it

>> is not relevant to do any fixup here, because it is nested in a ioctl system

>> call.

>> 

>> Effectively, this would preserve the SIGSEGV behavior when this ioctl is

>> erroneously called by user-space from a rseq critical section.

> 

> Ha, that's effectively what I implemented first, but I changed it because of the

> comment in clear_rseq_cs() that says:

> 

>  The rseq_cs field is set to NULL on preemption or signal delivery ... as well

>  as well as on top of code outside of the rseq assembly block.

> 

> Which makes it sound like something might rely on clearing rseq_cs?


This comment is describing succinctly the lazy clear scheme for rseq_cs.

Without the lazy clear scheme, a rseq c.s. would look like:

 *                     init(rseq_cs)
 *                     cpu = TLS->rseq::cpu_id_start
 *   [1]               TLS->rseq::rseq_cs = rseq_cs
 *   [start_ip]        ----------------------------
 *   [2]               if (cpu != TLS->rseq::cpu_id)
 *                             goto abort_ip;
 *   [3]               <last_instruction_in_cs>
 *   [post_commit_ip]  ----------------------------
 *   [4]               TLS->rseq::rseq_cs = NULL

But as a fast-path optimization, [4] is not entirely needed because the rseq_cs
descriptor contains information about the instruction pointer range of the critical
section. Therefore, userspace can omit [4], but if the kernel never clears it, it
means that it will have to re-read the rseq_cs descriptor's content each time it
needs to check it to confirm that it is not nested over a rseq c.s..

So making the kernel lazily clear the rseq_cs pointer is just an optimization which
ensures that the kernel won't do useless work the next time it needs to check
rseq_cs, given that it has already validated that the userspace code is currently
not within the rseq c.s. currently advertised by the rseq_cs field.

> 

> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an

> rseq critical section, and because syscalls in critical sections are illegal, by

> definition clearing rseq_cs is a nop unless userspace is misbehaving.


Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ
code executed when returning from ioctl to userspace will be able to validate that
it is not nested within a rseq critical section.

> 

> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it

> not worth the extra code to detect an error that will likely be caught anyways?


The error will indeed already be caught on return from ioctl to userspace, so I
don't see any added value in duplicating this check.

Thanks,

Mathieu

> 

> diff --git a/kernel/rseq.c b/kernel/rseq.c

> index 35f7bd0fced0..28b8342290b0 100644

> --- a/kernel/rseq.c

> +++ b/kernel/rseq.c

> @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig,

> struct pt_regs *regs)

> 

>        if (unlikely(t->flags & PF_EXITING))

>                return;

> +       if (!regs) {

> +#ifdef CONFIG_DEBUG_RSEQ

> +               if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs))

> +                       goto error;

> +#endif

> +               return;

> +       }

>        ret = rseq_ip_fixup(regs);

>        if (unlikely(ret < 0))

>                goto error;

> 

>> Thanks for looking into this !

>> 

>> Mathieu

>> 

>> > 		return clear_rseq_cs(t);

>> > 	ret = rseq_need_restart(t, rseq_cs.flags);

>> > 	if (ret <= 0)

>> > --

>> > 2.33.0.rc1.237.g0d66db33f3-goog

>> 

>> --

>> Mathieu Desnoyers

>> EfficiOS Inc.

> > http://www.efficios.com


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Sean Christopherson Aug. 20, 2021, 10:26 p.m. UTC | #4
On Fri, Aug 20, 2021, Mathieu Desnoyers wrote:
> Without the lazy clear scheme, a rseq c.s. would look like:

> 

>  *                     init(rseq_cs)

>  *                     cpu = TLS->rseq::cpu_id_start

>  *   [1]               TLS->rseq::rseq_cs = rseq_cs

>  *   [start_ip]        ----------------------------

>  *   [2]               if (cpu != TLS->rseq::cpu_id)

>  *                             goto abort_ip;

>  *   [3]               <last_instruction_in_cs>

>  *   [post_commit_ip]  ----------------------------

>  *   [4]               TLS->rseq::rseq_cs = NULL

> 

> But as a fast-path optimization, [4] is not entirely needed because the rseq_cs

> descriptor contains information about the instruction pointer range of the critical

> section. Therefore, userspace can omit [4], but if the kernel never clears it, it

> means that it will have to re-read the rseq_cs descriptor's content each time it

> needs to check it to confirm that it is not nested over a rseq c.s..

> 

> So making the kernel lazily clear the rseq_cs pointer is just an optimization which

> ensures that the kernel won't do useless work the next time it needs to check

> rseq_cs, given that it has already validated that the userspace code is currently

> not within the rseq c.s. currently advertised by the rseq_cs field.


Thanks for the explanation, much appreciated!
Paolo Bonzini Sept. 6, 2021, 10:28 a.m. UTC | #5
On 20/08/21 20:51, Mathieu Desnoyers wrote:
>> Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an

>> rseq critical section, and because syscalls in critical sections are illegal, by

>> definition clearing rseq_cs is a nop unless userspace is misbehaving.

> Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ

> code executed when returning from ioctl to userspace will be able to validate that

> it is not nested within a rseq critical section.

> 

>> If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it

>> not worth the extra code to detect an error that will likely be caught anyways?

> The error will indeed already be caught on return from ioctl to userspace, so I

> don't see any added value in duplicating this check.


Sean, can you send a v2 (even for this patch only would be okay)?

Thanks,

Paolo
Sean Christopherson Sept. 7, 2021, 2:38 p.m. UTC | #6
On Mon, Sep 06, 2021, Paolo Bonzini wrote:
> On 20/08/21 20:51, Mathieu Desnoyers wrote:

> > > Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in an

> > > rseq critical section, and because syscalls in critical sections are illegal, by

> > > definition clearing rseq_cs is a nop unless userspace is misbehaving.

> > Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ

> > code executed when returning from ioctl to userspace will be able to validate that

> > it is not nested within a rseq critical section.

> > 

> > > If that's true, what about explicitly checking that at NOTIFY_RESUME?  Or is it

> > > not worth the extra code to detect an error that will likely be caught anyways?

> > The error will indeed already be caught on return from ioctl to userspace, so I

> > don't see any added value in duplicating this check.

> 

> Sean, can you send a v2 (even for this patch only would be okay)?


Made it all the way to v3 while you were out :-)

https://lkml.kernel.org/r/20210901203030.1292304-1-seanjc@google.com
Paolo Bonzini Sept. 22, 2021, 2:12 p.m. UTC | #7
On 18/08/21 02:12, Sean Christopherson wrote:
> Patch 1 fixes a KVM+rseq bug where KVM's handling of TIF_NOTIFY_RESUME,

> e.g. for task migration, clears the flag without informing rseq and leads

> to stale data in userspace's rseq struct.

> 

> Patch 2 is a cleanup to try and make future bugs less likely.  It's also

> a baby step towards moving and renaming tracehook_notify_resume() since

> it has nothing to do with tracing.  It kills me to not do the move/rename

> as part of this series, but having a dedicated series/discussion seems

> more appropriate given the sheer number of architectures that call

> tracehook_notify_resume() and the lack of an obvious home for the code.

> 

> Patch 3 is a fix/cleanup to stop overriding x86's unistd_{32,64}.h when

> the include path (intentionally) omits tools' uapi headers.  KVM's

> selftests do exactly that so that they can pick up the uapi headers from

> the installed kernel headers, and still use various tools/ headers that

> mirror kernel code, e.g. linux/types.h.  This allows the new test in

> patch 4 to reference __NR_rseq without having to manually define it.

> 

> Patch 4 is a regression test for the KVM+rseq bug.

> 

> Patch 5 is a cleanup made possible by patch 3.

> 

> 

> Sean Christopherson (5):

>    KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM

>      guest

>    entry: rseq: Call rseq_handle_notify_resume() in

>      tracehook_notify_resume()

>    tools: Move x86 syscall number fallbacks to .../uapi/

>    KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration

>      bugs

>    KVM: selftests: Remove __NR_userfaultfd syscall fallback

> 

>   arch/arm/kernel/signal.c                      |   1 -

>   arch/arm64/kernel/signal.c                    |   1 -

>   arch/csky/kernel/signal.c                     |   4 +-

>   arch/mips/kernel/signal.c                     |   4 +-

>   arch/powerpc/kernel/signal.c                  |   4 +-

>   arch/s390/kernel/signal.c                     |   1 -

>   include/linux/tracehook.h                     |   2 +

>   kernel/entry/common.c                         |   4 +-

>   kernel/rseq.c                                 |   4 +-

>   .../x86/include/{ => uapi}/asm/unistd_32.h    |   0

>   .../x86/include/{ => uapi}/asm/unistd_64.h    |   3 -

>   tools/testing/selftests/kvm/.gitignore        |   1 +

>   tools/testing/selftests/kvm/Makefile          |   3 +

>   tools/testing/selftests/kvm/rseq_test.c       | 131 ++++++++++++++++++

>   14 files changed, 143 insertions(+), 20 deletions(-)

>   rename tools/arch/x86/include/{ => uapi}/asm/unistd_32.h (100%)

>   rename tools/arch/x86/include/{ => uapi}/asm/unistd_64.h (83%)

>   create mode 100644 tools/testing/selftests/kvm/rseq_test.c

> 


Queued v3, thanks.  I'll send it in a separate pull request to Linus 
since it touches stuff outside my usual turf.

Thanks,

Paolo