mbox series

[v3,00/13] Overhauling amx_test

Message ID 20230221163655.920289-1-mizhang@google.com
Headers show
Series Overhauling amx_test | expand

Message

Mingwei Zhang Feb. 21, 2023, 4:36 p.m. UTC
In this version, I have integrated Aaron's changes to the amx_test. In
addition, we also integrated one fix patch for a kernel warning due to
xsave address issue.

Patch 1:
Fix a host FPU kernel warning due to missing XTILEDATA in xinit.

Patch 2-8:
Overhaul amx_test. These patches were basically from v2.

Patch 9-13:
Overhaul amx_test from Aaron. I modified the changelog a little bit.


v2 -> v3:
 - integrate Aaron's 5 commits with minor changes on commit message.
 - Add one fix patch for a kernel warning.

v2:
https://lore.kernel.org/all/20230214184606.510551-1-mizhang@google.com/


Aaron Lewis (5):
  KVM: selftests: x86: Assert that XTILE is XSAVE-enabled
  KVM: selftests: x86: Assert that both XTILE{CFG,DATA} are
    XSAVE-enabled
  KVM: selftests: x86: Remove redundant check that XSAVE is supported
  KVM: selftests: x86: Check that the palette table exists before using
    it
  KVM: selftests: x86: Check that XTILEDATA supports XFD

Mingwei Zhang (8):
  x86/fpu/xstate: Avoid getting xstate address of init_fpstate if
    fpstate contains the component
  KVM: selftests: x86: Add a working xstate data structure
  KVM: selftests: x86: Fix an error in comment of amx_test
  KVM: selftests: x86: Add check of CR0.TS in the #NM handler in
    amx_test
  KVM: selftests: x86: Add the XFD check to IA32_XFD in #NM handler
  KVM: selftests: x86: Fix the checks to XFD_ERR using and operation
  KVM: selftests: x86: Enable checking on xcomp_bv in amx_test
  KVM: selftests: x86: Repeat the checking of xheader when
    IA32_XFD[XTILEDATA] is set in amx_test

 arch/x86/kernel/fpu/xstate.c                  | 10 ++-
 .../selftests/kvm/include/x86_64/processor.h  | 14 ++++
 tools/testing/selftests/kvm/x86_64/amx_test.c | 80 +++++++++----------
 3 files changed, 59 insertions(+), 45 deletions(-)

Comments

Thomas Gleixner Feb. 21, 2023, 8:54 p.m. UTC | #1
Mingwei!

On Tue, Feb 21 2023 at 16:36, Mingwei Zhang wrote:
> Avoid getting xstate address of init_fpstate if fpstate contains the xstate
> component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
> __raw_xsave_addr(xinit, 18), it triggers a warning as follows.

This does not parse. Aside of that it gets the ordering of the changelog
wrong. It explain first what the patch is doing by repeating the way too
long subject line and then tries to give some explanation about the
problem.

KVM does not call __raw_xsave_addr() and the problem is completely
independent of KVM.

> __raw_xsave_addr() is an internal function that assume caller does the
> checking, ie., all function arguments should be checked before calling.
> So, instead of removing the WARNING, add checks in
> __copy_xstate_to_uabi_buf().

I don't see checks added. The patch open codes copy_feature() and makes
the __raw_xsave_addr() invocations conditional.

So something like this:

   Subject: x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_to_uabi_buf()

   __copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer
   or from init_fpstate into the ptrace buffer. Dynamic features, like
   XTILEDATA, have an all zeroes init state and are not saved in
   init_fpstate, which means the corresponding bit is not set in the
   xfeatures bitmap of the init_fpstate header.

   But __copy_xstate_to_uabi_buf() retrieves addresses for both the
   tasks xstate and init_fpstate unconditionally via __raw_xsave_addr().

   So if the tasks XSAVE buffer has a dynamic feature set, then the
   address retrieval for init_fpstate triggers the warning in
   __raw_xsave_addr() which checks the feature bit in the init_fpstate
   header.

   Prevent this by open coding copy_feature() and making the address
   retrieval conditional on the tasks XSAVE header bit.

So the order here is (in paragraphs):

   1) Explain the context
   2) Explain whats wrong
   3) Explain the consequence
   4) Explain the solution briefly

It does not even need a backtrace, because the backtrace does not give
any relevant information which is not in the text above already.

The actual callchain is completely irrelevant for desribing this
issue. If you really want to add a backtrace, then please trim it by
removing the irrelevant information. See
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces
for further information.

So for this case this would be:

WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0
Call Trace:
 <TASK>
 __copy_xstate_to_uabi_buf+0x3cb/0x520
 fpu_copy_guest_fpstate_to_uabi+0x29/0x50

But even fpu_copy_guest_fpstate_to_uabi() is already useless because
__copy_xstate_to_uabi_buf() has multiple callers which all have the very
same problem and they are very simple to find.

Backtraces are useful to illustrate a hard to understand callchain, but
having ~40 lines of backtrace which only contains two relevant lines is
just a distraction.

See?

> Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()")

The problem did not exist at this point in time because dynamic
xfeatures did not exist, neither in hardware nor in kernel support.

Even if dynamic features would have existed then the commit would not be
the one which introduced the problem, All the commit does is to move
back then valid code into a conditional code path.

It fixes:

  471f0aa7fa64 ("x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly")

which attempted to fix exactly the problem you are seeing, but only
addressed half of it. The underlying problem was introduced with
2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")

Random fixes tags are not really helpful.

> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>  			pkru.pkru = pkru_val;
>  			membuf_write(&to, &pkru, sizeof(pkru));
>  		} else {
> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(xsave, i),
> -				     __raw_xsave_addr(xinit, i),
> -				     xstate_sizes[i]);

Please add a comment here to explain why this is open coded and does not
use copy_feature(). Something like:

    			/*
                         * Open code copy_feature() to prevent retrieval
                         * of a dynamic features address from xinit, which
                         * is invalid because xinit does not contain the
                         * all zeros init states of dynamic features and
                         * emits a warning.
                         */

> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
> +				__raw_xsave_addr(xsave, i) :
> +				__raw_xsave_addr(xinit, i);
> +
> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);

Other than that. Nice detective work!

Thanks,

        tglx
Chang S. Bae Feb. 22, 2023, 3:05 a.m. UTC | #2
On 2/21/2023 8:36 AM, Mingwei Zhang wrote:
> Avoid getting xstate address of init_fpstate if fpstate contains the xstate
> component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls
> __raw_xsave_addr(xinit, 18), it triggers a warning as follows.
> 
> __raw_xsave_addr() is an internal function that assume caller does the
> checking, ie., all function arguments should be checked before calling.
> So, instead of removing the WARNING, add checks in
> __copy_xstate_to_uabi_buf().
> 

<snip>

> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>   			pkru.pkru = pkru_val;
>   			membuf_write(&to, &pkru, sizeof(pkru));
>   		} else {
> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(xsave, i),
> -				     __raw_xsave_addr(xinit, i),
> -				     xstate_sizes[i]);
> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
> +				__raw_xsave_addr(xsave, i) :
> +				__raw_xsave_addr(xinit, i);
> +
> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);
>   		}
>   		/*
>   		 * Keep track of the last copied state in the non-compacted

So this hunk is under for_each_extended_xfeature(i, mask) -- it skips 
the copy routine if mask[i] == 0; instead, it fills zeros.

We have this [1]:

	if (fpu_state_size_dynamic())
		mask &= (header.xfeatures | xinit->header.xcomp_bv);

If header.xfeatures[18] = 0 then mask[18] = 0 because 
xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm 
confused about the problem that you described here.

Can you elaborate on your test case a bit? Let me try to reproduce the 
issue on my end.

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1134
Thomas Gleixner Feb. 22, 2023, 8:38 a.m. UTC | #3
On Tue, Feb 21 2023 at 19:05, Chang S. Bae wrote:
> On 2/21/2023 8:36 AM, Mingwei Zhang wrote:
>> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>>   			pkru.pkru = pkru_val;
>>   			membuf_write(&to, &pkru, sizeof(pkru));
>>   		} else {
>> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
>> -				     __raw_xsave_addr(xsave, i),
>> -				     __raw_xsave_addr(xinit, i),
>> -				     xstate_sizes[i]);
>> +			xsave_addr = (header.xfeatures & BIT_ULL(i)) ?
>> +				__raw_xsave_addr(xsave, i) :
>> +				__raw_xsave_addr(xinit, i);
>> +
>> +			membuf_write(&to, xsave_addr, xstate_sizes[i]);
>>   		}
>>   		/*
>>   		 * Keep track of the last copied state in the non-compacted
>
> So this hunk is under for_each_extended_xfeature(i, mask) -- it skips 
> the copy routine if mask[i] == 0; instead, it fills zeros.
>
> We have this [1]:
>
> 	if (fpu_state_size_dynamic())
> 		mask &= (header.xfeatures | xinit->header.xcomp_bv);
>
> If header.xfeatures[18] = 0 then mask[18] = 0 because 
> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm 
> confused about the problem that you described here.

Read the suggested changelog I wrote in my reply to Mingwei.

TLDR:

        xsave.header.xfeatures[18] = 1
        xinit.header.xfeatures[18] = 0
    ->  mask[18] = 1
    ->  __raw_xsave_addr(xsave, 18)     <- Success
    ->  __raw_xsave_addr(xinit, 18)     <- WARN

Thanks,

        tglx
Mingwei Zhang Feb. 22, 2023, 6:40 p.m. UTC | #4
> > We have this [1]:
> >
> >       if (fpu_state_size_dynamic())
> >               mask &= (header.xfeatures | xinit->header.xcomp_bv);
> >
> > If header.xfeatures[18] = 0 then mask[18] = 0 because
> > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
> > confused about the problem that you described here.
>
> Read the suggested changelog I wrote in my reply to Mingwei.
>
> TLDR:
>
>         xsave.header.xfeatures[18] = 1
>         xinit.header.xfeatures[18] = 0
>     ->  mask[18] = 1
>     ->  __raw_xsave_addr(xsave, 18)     <- Success
>     ->  __raw_xsave_addr(xinit, 18)     <- WARN
>
> Thanks,
>
>         tglx

Hi Thomas,

Thanks for the review and I will provide the next version separately
from the series, since this one is independent from the rest.

Chang: to reproduce this issue, you can simply run the amx_test in the
kvm selftest directory.

Thanks.
-Mingwei
Chang S. Bae Feb. 22, 2023, 10:13 p.m. UTC | #5
On 2/22/2023 10:40 AM, Mingwei Zhang wrote:
>>> We have this [1]:
>>>
>>>        if (fpu_state_size_dynamic())
>>>                mask &= (header.xfeatures | xinit->header.xcomp_bv);
>>>
>>> If header.xfeatures[18] = 0 then mask[18] = 0 because
>>> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
>>> confused about the problem that you described here.
>>
>> Read the suggested changelog I wrote in my reply to Mingwei.
>>
>> TLDR:
>>
>>          xsave.header.xfeatures[18] = 1
>>          xinit.header.xfeatures[18] = 0
>>      ->  mask[18] = 1
>>      ->  __raw_xsave_addr(xsave, 18)     <- Success
>>      ->  __raw_xsave_addr(xinit, 18)     <- WARN

Oh, sigh.. This should be caught last time.

Hmm, then since we store init state for legacy ones [1], unless it is 
too aggressive, perhaps the loop can be simplified like this:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 714166cc25f2..2dac6f5f3ade 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, 
struct fpstate *fpstate,
         zerofrom = offsetof(struct xregs_state, extended_state_area);

         /*
-        * The ptrace buffer is in non-compacted XSAVE format.  In
-        * non-compacted format disabled features still occupy state space,
-        * but there is no state to copy from in the compacted
-        * init_fpstate. The gap tracking will zero these states.
+        * Indicate which states to copy from fpstate. When not present in
+        * fpstate, those extended states are either initialized or
+        * disabled. They are also known to have an all zeros init state.
+        * Thus, remove them from 'mask' to zero those features in the user
+        * buffer instead of retrieving them from init_fpstate.
          */
-       mask = fpstate->user_xfeatures;
-
-       /*
-        * Dynamic features are not present in init_fpstate. When they are
-        * in an all zeros init state, remove those from 'mask' to zero
-        * those features in the user buffer instead of retrieving them
-        * from init_fpstate.
-        */
-       if (fpu_state_size_dynamic())
-               mask &= (header.xfeatures | xinit->header.xcomp_bv);
+       mask = header.xfeatures;

         for_each_extended_xfeature(i, mask) {
                 /*
@@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, 
struct fpstate *fpstate,
                         pkru.pkru = pkru_val;
                         membuf_write(&to, &pkru, sizeof(pkru));
                 } else {
-                       copy_feature(header.xfeatures & BIT_ULL(i), &to,
+                       membuf_write(&to,
                                      __raw_xsave_addr(xsave, i),
-                                    __raw_xsave_addr(xinit, i),
                                      xstate_sizes[i]);
                 }
                 /*

> Chang: to reproduce this issue, you can simply run the amx_test in the
> kvm selftest directory.

Yeah, I was able to reproduce it with this ptrace test:

diff --git a/tools/testing/selftests/x86/amx.c 
b/tools/testing/selftests/x86/amx.c
index 625e42901237..ae02bc81846d 100644
--- a/tools/testing/selftests/x86/amx.c
+++ b/tools/testing/selftests/x86/amx.c
@@ -14,8 +14,10 @@
  #include <sys/auxv.h>
  #include <sys/mman.h>
  #include <sys/shm.h>
+#include <sys/ptrace.h>
  #include <sys/syscall.h>
  #include <sys/wait.h>
+#include <sys/uio.h>

  #include "../kselftest.h" /* For __cpuid_count() */

@@ -826,6 +828,76 @@ static void test_context_switch(void)
         free(finfo);
  }

+/* Ptrace test */
+
+static bool inject_tiledata(pid_t target)
+{
+       struct xsave_buffer *xbuf;
+       struct iovec iov;
+
+       xbuf = alloc_xbuf();
+       if (!xbuf)
+               fatal_error("unable to allocate XSAVE buffer");
+
+       load_rand_tiledata(xbuf);
+
+       memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset],
+              &xbuf->bytes[xtiledata.xbuf_offset],
+              xtiledata.size);
+
+       iov.iov_base = xbuf;
+       iov.iov_len = xbuf_size;
+
+       if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+               fatal_error("PTRACE_SETREGSET");
+
+       if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
+               err(1, "PTRACE_GETREGSET");
+
+       if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset],
+                   &xbuf->bytes[xtiledata.xbuf_offset],
+                   xtiledata.size))
+               return true;
+       else
+               return false;
+}
+
+static void test_ptrace(void)
+{
+       pid_t child;
+       int status;
+
+       child = fork();
+       if (child < 0) {
+               err(1, "fork");
+       } else if (!child) {
+               if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+                       err(1, "PTRACE_TRACEME");
+
+               /* Use the state to expand the kernel buffer */
+               load_rand_tiledata(stashed_xsave);
+
+               raise(SIGTRAP);
+               _exit(0);
+       }
+
+       do {
+               wait(&status);
+       } while (WSTOPSIG(status) != SIGTRAP);
+
+       printf("\tInject tile data via ptrace()\n");
+
+       if (inject_tiledata(child))
+               printf("[OK]\tTile data was written on ptracee.\n");
+       else
+               printf("[FAIL]\tTile data was not written on ptracee.\n");
+
+       ptrace(PTRACE_DETACH, child, NULL, NULL);
+       wait(&status);
+       if (!WIFEXITED(status) || WEXITSTATUS(status))
+               err(1, "ptrace test");
+}
+
  int main(void)
  {
         /* Check hardware availability at first */
@@ -846,6 +918,8 @@ int main(void)
         ctxtswtest_config.num_threads = 5;
         test_context_switch();

+       test_ptrace();
+
         clearhandler(SIGILL);
         free_stashed_xsave();

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
Mingwei Zhang Feb. 24, 2023, 11:56 p.m. UTC | #6
On Wed, Feb 22, 2023, Chang S. Bae wrote:
> On 2/22/2023 10:40 AM, Mingwei Zhang wrote:
> > > > We have this [1]:
> > > > 
> > > >        if (fpu_state_size_dynamic())
> > > >                mask &= (header.xfeatures | xinit->header.xcomp_bv);
> > > > 
> > > > If header.xfeatures[18] = 0 then mask[18] = 0 because
> > > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm
> > > > confused about the problem that you described here.
> > > 
> > > Read the suggested changelog I wrote in my reply to Mingwei.
> > > 
> > > TLDR:
> > > 
> > >          xsave.header.xfeatures[18] = 1
> > >          xinit.header.xfeatures[18] = 0
> > >      ->  mask[18] = 1
> > >      ->  __raw_xsave_addr(xsave, 18)     <- Success
> > >      ->  __raw_xsave_addr(xinit, 18)     <- WARN
> 
> Oh, sigh.. This should be caught last time.
> 
> Hmm, then since we store init state for legacy ones [1], unless it is too
> aggressive, perhaps the loop can be simplified like this:
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 714166cc25f2..2dac6f5f3ade 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
> struct fpstate *fpstate,
>         zerofrom = offsetof(struct xregs_state, extended_state_area);
> 
>         /*
> -        * The ptrace buffer is in non-compacted XSAVE format.  In
> -        * non-compacted format disabled features still occupy state space,
> -        * but there is no state to copy from in the compacted
> -        * init_fpstate. The gap tracking will zero these states.
> +        * Indicate which states to copy from fpstate. When not present in
> +        * fpstate, those extended states are either initialized or
> +        * disabled. They are also known to have an all zeros init state.
> +        * Thus, remove them from 'mask' to zero those features in the user
> +        * buffer instead of retrieving them from init_fpstate.
>          */
> -       mask = fpstate->user_xfeatures;

Do we need to change this line and the comments? I don't see any of
these was relevant to this issue. The original code semantic is to
traverse all user_xfeatures, if it is available in fpstate, copy it from
there; otherwise, copy it from init_fpstate. We do not assume the
component in init_fpstate (but not in fpstate) are all zeros, do we? If
it is safe to assume that, then it might be ok. But at least in this
patch, I want to keep the original semantics as is without the
assumption.
> -
> -       /*
> -        * Dynamic features are not present in init_fpstate. When they are
> -        * in an all zeros init state, remove those from 'mask' to zero
> -        * those features in the user buffer instead of retrieving them
> -        * from init_fpstate.
> -        */
> -       if (fpu_state_size_dynamic())
> -               mask &= (header.xfeatures | xinit->header.xcomp_bv);
> +       mask = header.xfeatures;

Same here. Let's not adding this optimization in this patch.

>
>         for_each_extended_xfeature(i, mask) {
>                 /*
> @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to,
> struct fpstate *fpstate,
>                         pkru.pkru = pkru_val;
>                         membuf_write(&to, &pkru, sizeof(pkru));
>                 } else {
> -                       copy_feature(header.xfeatures & BIT_ULL(i), &to,
> +                       membuf_write(&to,
>                                      __raw_xsave_addr(xsave, i),
> -                                    __raw_xsave_addr(xinit, i),
>                                      xstate_sizes[i]);
>                 }
>                 /*
> 
> > Chang: to reproduce this issue, you can simply run the amx_test in the
> > kvm selftest directory.
> 
> Yeah, I was able to reproduce it with this ptrace test:
> 
> diff --git a/tools/testing/selftests/x86/amx.c
> b/tools/testing/selftests/x86/amx.c
> index 625e42901237..ae02bc81846d 100644
> --- a/tools/testing/selftests/x86/amx.c
> +++ b/tools/testing/selftests/x86/amx.c
> @@ -14,8 +14,10 @@
>  #include <sys/auxv.h>
>  #include <sys/mman.h>
>  #include <sys/shm.h>
> +#include <sys/ptrace.h>
>  #include <sys/syscall.h>
>  #include <sys/wait.h>
> +#include <sys/uio.h>
> 
>  #include "../kselftest.h" /* For __cpuid_count() */
> 
> @@ -826,6 +828,76 @@ static void test_context_switch(void)
>         free(finfo);
>  }
> 
> +/* Ptrace test */
> +
> +static bool inject_tiledata(pid_t target)
> +{
> +       struct xsave_buffer *xbuf;
> +       struct iovec iov;
> +
> +       xbuf = alloc_xbuf();
> +       if (!xbuf)
> +               fatal_error("unable to allocate XSAVE buffer");
> +
> +       load_rand_tiledata(xbuf);
> +
> +       memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset],
> +              &xbuf->bytes[xtiledata.xbuf_offset],
> +              xtiledata.size);
> +
> +       iov.iov_base = xbuf;
> +       iov.iov_len = xbuf_size;
> +
> +       if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
> +               fatal_error("PTRACE_SETREGSET");
> +
> +       if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov))
> +               err(1, "PTRACE_GETREGSET");
> +
> +       if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset],
> +                   &xbuf->bytes[xtiledata.xbuf_offset],
> +                   xtiledata.size))
> +               return true;
> +       else
> +               return false;
> +}
> +
> +static void test_ptrace(void)
> +{
> +       pid_t child;
> +       int status;
> +
> +       child = fork();
> +       if (child < 0) {
> +               err(1, "fork");
> +       } else if (!child) {
> +               if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
> +                       err(1, "PTRACE_TRACEME");
> +
> +               /* Use the state to expand the kernel buffer */
> +               load_rand_tiledata(stashed_xsave);
> +
> +               raise(SIGTRAP);
> +               _exit(0);
> +       }
> +
> +       do {
> +               wait(&status);
> +       } while (WSTOPSIG(status) != SIGTRAP);
> +
> +       printf("\tInject tile data via ptrace()\n");
> +
> +       if (inject_tiledata(child))
> +               printf("[OK]\tTile data was written on ptracee.\n");
> +       else
> +               printf("[FAIL]\tTile data was not written on ptracee.\n");
> +
> +       ptrace(PTRACE_DETACH, child, NULL, NULL);
> +       wait(&status);
> +       if (!WIFEXITED(status) || WEXITSTATUS(status))
> +               err(1, "ptrace test");
> +}
> +
>  int main(void)
>  {
>         /* Check hardware availability at first */
> @@ -846,6 +918,8 @@ int main(void)
>         ctxtswtest_config.num_threads = 5;
>         test_context_switch();
> 
> +       test_ptrace();
> +
>         clearhandler(SIGILL);
>         free_stashed_xsave();
> 
> Thanks,
> Chang
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
> 

Nice one. Yeah both ptrace and KVM are calling this function so the above
code would also be enough to trigger the bug.


Thanks.
-Mingwei
Chang S. Bae Feb. 25, 2023, 12:47 a.m. UTC | #7
On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
> On Wed, Feb 22, 2023, Chang S. Bae wrote:
>>
>>          /*
>> -        * The ptrace buffer is in non-compacted XSAVE format.  In
>> -        * non-compacted format disabled features still occupy state space,
>> -        * but there is no state to copy from in the compacted
>> -        * init_fpstate. The gap tracking will zero these states.
>> +        * Indicate which states to copy from fpstate. When not present in
>> +        * fpstate, those extended states are either initialized or
>> +        * disabled. They are also known to have an all zeros init state.
>> +        * Thus, remove them from 'mask' to zero those features in the user
>> +        * buffer instead of retrieving them from init_fpstate.
>>           */
>> -       mask = fpstate->user_xfeatures;
> 
> Do we need to change this line and the comments? I don't see any of
> these was relevant to this issue. The original code semantic is to
> traverse all user_xfeatures, if it is available in fpstate, copy it from
> there; otherwise, copy it from init_fpstate. We do not assume the
> component in init_fpstate (but not in fpstate) are all zeros, do we? If
> it is safe to assume that, then it might be ok. But at least in this
> patch, I want to keep the original semantics as is without the
> assumption.

Here it has [1]:

	 *
	 * XSAVE could be used, but that would require to reshuffle the
	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
	 * compaction. But doing so is a pointless exercise because most
	 * components have an all zeros init state except for the legacy
	 * ones (FP and SSE). Those can be saved with FXSAVE into the
	 * legacy area. Adding new features requires to ensure that init
	 * state is all zeroes or if not to add the necessary handling
	 * here.
	 */
	fxsave(&init_fpstate.regs.fxsave);

Thus, init_fpstate has zeros for those extended states. Then, copying 
from init_fpstate is the same as membuf_zero() by the gap tracking. But, 
we have two ways to do the same thing here.

So I think it works that simply copying the state from fpstate only for 
those present there, then letting the gap tracking zero out for the rest 
of the userspace buffer for features that are either disabled or 
initialized.

Then, we can remove accessing init_fpstate in the copy loop and which is 
the source of the problem. So I think this line change is relevant and 
also makes the code simple.

I guess I'm fine if you don't want to do this. Then, let me follow up 
with something like this at first. Something like yours could be a 
fallback option for other good reasons, otherwise.

Thanks,
Chang

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
Mingwei Zhang Feb. 25, 2023, 1:09 a.m. UTC | #8
On Fri, Feb 24, 2023, Chang S. Bae wrote:
> On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
> > On Wed, Feb 22, 2023, Chang S. Bae wrote:
> > > 
> > >          /*
> > > -        * The ptrace buffer is in non-compacted XSAVE format.  In
> > > -        * non-compacted format disabled features still occupy state space,
> > > -        * but there is no state to copy from in the compacted
> > > -        * init_fpstate. The gap tracking will zero these states.
> > > +        * Indicate which states to copy from fpstate. When not present in
> > > +        * fpstate, those extended states are either initialized or
> > > +        * disabled. They are also known to have an all zeros init state.
> > > +        * Thus, remove them from 'mask' to zero those features in the user
> > > +        * buffer instead of retrieving them from init_fpstate.
> > >           */
> > > -       mask = fpstate->user_xfeatures;
> > 
> > Do we need to change this line and the comments? I don't see any of
> > these was relevant to this issue. The original code semantic is to
> > traverse all user_xfeatures, if it is available in fpstate, copy it from
> > there; otherwise, copy it from init_fpstate. We do not assume the
> > component in init_fpstate (but not in fpstate) are all zeros, do we? If
> > it is safe to assume that, then it might be ok. But at least in this
> > patch, I want to keep the original semantics as is without the
> > assumption.
> 
> Here it has [1]:
> 
> 	 *
> 	 * XSAVE could be used, but that would require to reshuffle the
> 	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
> 	 * compaction. But doing so is a pointless exercise because most
> 	 * components have an all zeros init state except for the legacy
> 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
> 	 * legacy area. Adding new features requires to ensure that init
> 	 * state is all zeroes or if not to add the necessary handling
> 	 * here.
> 	 */
> 	fxsave(&init_fpstate.regs.fxsave);

ah, I see.
> 
> Thus, init_fpstate has zeros for those extended states. Then, copying from
> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have
> two ways to do the same thing here.
> 
> So I think it works that simply copying the state from fpstate only for
> those present there, then letting the gap tracking zero out for the rest of
> the userspace buffer for features that are either disabled or initialized.
> 
> Then, we can remove accessing init_fpstate in the copy loop and which is the
> source of the problem. So I think this line change is relevant and also
> makes the code simple.
> 
> I guess I'm fine if you don't want to do this. Then, let me follow up with
> something like this at first. Something like yours could be a fallback
> option for other good reasons, otherwise.

hmm. I see. But this is still because of the software implementation.
What if there is a new hardware component that requires a non-zero init
state.

For instance, in the past, we had PKRU component, whose init value is
0x555...54. Of course, that is a bad example because now we kick it out
of the XSAVE/XRSTOR and special handling that, but there is no guarantee
that in the future we will never need a non-zero init state.

So, I will send out my fix and let you, Thomas and potentially other
folks to decide what is the best option. Overall, I get your point.

Thanks
-Mingwei
>
> Thanks,
> Chang
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
> 
>
Chang S. Bae Feb. 25, 2023, 1:39 a.m. UTC | #9
On 2/24/2023 5:09 PM, Mingwei Zhang wrote:
> On Fri, Feb 24, 2023, Chang S. Bae wrote:
>> On 2/24/2023 3:56 PM, Mingwei Zhang wrote:
>>> On Wed, Feb 22, 2023, Chang S. Bae wrote:
>>>>
>>>>           /*
>>>> -        * The ptrace buffer is in non-compacted XSAVE format.  In
>>>> -        * non-compacted format disabled features still occupy state space,
>>>> -        * but there is no state to copy from in the compacted
>>>> -        * init_fpstate. The gap tracking will zero these states.
>>>> +        * Indicate which states to copy from fpstate. When not present in
>>>> +        * fpstate, those extended states are either initialized or
>>>> +        * disabled. They are also known to have an all zeros init state.
>>>> +        * Thus, remove them from 'mask' to zero those features in the user
>>>> +        * buffer instead of retrieving them from init_fpstate.
>>>>            */
>>>> -       mask = fpstate->user_xfeatures;
>>>
>>> Do we need to change this line and the comments? I don't see any of
>>> these was relevant to this issue. The original code semantic is to
>>> traverse all user_xfeatures, if it is available in fpstate, copy it from
>>> there; otherwise, copy it from init_fpstate. We do not assume the
>>> component in init_fpstate (but not in fpstate) are all zeros, do we? If
>>> it is safe to assume that, then it might be ok. But at least in this
>>> patch, I want to keep the original semantics as is without the
>>> assumption.
>>
>> Here it has [1]:
>>
>> 	 *
>> 	 * XSAVE could be used, but that would require to reshuffle the
>> 	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
>> 	 * compaction. But doing so is a pointless exercise because most
>> 	 * components have an all zeros init state except for the legacy
>> 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
>> 	 * legacy area. Adding new features requires to ensure that init
>> 	 * state is all zeroes or if not to add the necessary handling
>> 	 * here.
>> 	 */
>> 	fxsave(&init_fpstate.regs.fxsave);
> 
> ah, I see.
>>
>> Thus, init_fpstate has zeros for those extended states. Then, copying from
>> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have
>> two ways to do the same thing here.
>>
>> So I think it works that simply copying the state from fpstate only for
>> those present there, then letting the gap tracking zero out for the rest of
>> the userspace buffer for features that are either disabled or initialized.
>>
>> Then, we can remove accessing init_fpstate in the copy loop and which is the
>> source of the problem. So I think this line change is relevant and also
>> makes the code simple.
>>
>> I guess I'm fine if you don't want to do this. Then, let me follow up with
>> something like this at first. Something like yours could be a fallback
>> option for other good reasons, otherwise.
> 
> hmm. I see. But this is still because of the software implementation.
> What if there is a new hardware component that requires a non-zero init
> state.
> 
> For instance, in the past, we had PKRU component, whose init value is
> 0x555...54. Of course, that is a bad example because now we kick it out
> of the XSAVE/XRSTOR and special handling that, but there is no guarantee
> that in the future we will never need a non-zero init state.

Yeah, then that feature enabling has to update [1] to record the special 
init value there. So one who writes that code has to responsibly check 
what has to be adjusted like for other feature enabling in general.

Now we have established the dynamic states which are also assumed to 
have an all-zeros init state. Anything new state located after that 
should be a dynamic state because of the sigaltstack.

Of course, with that though, I don't know whether we will see anything 
with a non-zero init state in the future or not.

> So, I will send out my fix and let you, Thomas and potentially other
> folks to decide what is the best option. Overall, I get your point.

Let's coordinate this. We shouldn't throw multiple versions at the same 
time. (Let me follow up with you.)

Thanks,
Chang
Sean Christopherson March 24, 2023, 8:58 p.m. UTC | #10
On Tue, 21 Feb 2023 16:36:42 +0000, Mingwei Zhang wrote:
> In this version, I have integrated Aaron's changes to the amx_test. In
> addition, we also integrated one fix patch for a kernel warning due to
> xsave address issue.
> 
> Patch 1:
> Fix a host FPU kernel warning due to missing XTILEDATA in xinit.
> 
> [...]

Applied everything except patch 7 to kvm-x86 selftests.  Please holler if I
missed something subtle about patch 7 (using & vs. ==).  This is at the head
of kvm-x86/selftests, i.e. I can fix it up if necessary.

[01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component
        (no commit info)
[02/13] KVM: selftests: x86: Add a working xstate data structure
        https://github.com/kvm-x86/linux/commit/03e8ddff78ac
[03/13] KVM: selftests: x86: Fix an error in comment of amx_test
        https://github.com/kvm-x86/linux/commit/c23d3b210baf
[04/13] KVM: selftests: x86: Enable checking on xcomp_bv in amx_test
        https://github.com/kvm-x86/linux/commit/1e2de0651567
[05/13] KVM: selftests: x86: Add check of CR0.TS in the #NM handler in amx_test
        https://github.com/kvm-x86/linux/commit/04f5d4539105
[06/13] KVM: selftests: x86: Add the XFD check to IA32_XFD in #NM handler
        https://github.com/kvm-x86/linux/commit/eeae141ddb54

[08/13] KVM: selftests: x86: Repeat the checking of xheader when IA32_XFD[XTILEDATA] is set in amx_test
        https://github.com/kvm-x86/linux/commit/cabed3f958e9
[09/13] KVM: selftests: x86: Assert that XTILE is XSAVE-enabled
        https://github.com/kvm-x86/linux/commit/44217830267d
[10/13] KVM: selftests: x86: Assert that both XTILE{CFG,DATA} are XSAVE-enabled
        https://github.com/kvm-x86/linux/commit/7b328195c462
[11/13] KVM: selftests: x86: Remove redundant check that XSAVE is supported
        https://github.com/kvm-x86/linux/commit/08f63d826980
[12/13] KVM: selftests: x86: Check that the palette table exists before using it
        https://github.com/kvm-x86/linux/commit/7042ef575496
[13/13] KVM: selftests: x86: Check that XTILEDATA supports XFD
        https://github.com/kvm-x86/linux/commit/564435d346ff

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes
Sean Christopherson March 24, 2023, 9:01 p.m. UTC | #11
On Fri, Mar 24, 2023, Sean Christopherson wrote:
> On Tue, 21 Feb 2023 16:36:42 +0000, Mingwei Zhang wrote:
> > In this version, I have integrated Aaron's changes to the amx_test. In
> > addition, we also integrated one fix patch for a kernel warning due to
> > xsave address issue.
> > 
> > Patch 1:
> > Fix a host FPU kernel warning due to missing XTILEDATA in xinit.
> > 
> > [...]
> 
> Applied everything except patch 7 to kvm-x86 selftests.  Please holler if I
> missed something subtle about patch 7 (using & vs. ==).  This is at the head
> of kvm-x86/selftests, i.e. I can fix it up if necessary.
> 
> [01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component
>         (no commit info)

*sigh*  And by "everything" I meant "all of the selftests patches".
Sean Christopherson March 24, 2023, 9:30 p.m. UTC | #12
On Fri, Mar 24, 2023, Sean Christopherson wrote:
> On Fri, Mar 24, 2023, Sean Christopherson wrote:
> > On Tue, 21 Feb 2023 16:36:42 +0000, Mingwei Zhang wrote:
> > > In this version, I have integrated Aaron's changes to the amx_test. In
> > > addition, we also integrated one fix patch for a kernel warning due to
> > > xsave address issue.
> > > 
> > > Patch 1:
> > > Fix a host FPU kernel warning due to missing XTILEDATA in xinit.
> > > 
> > > [...]
> > 
> > Applied everything except patch 7 to kvm-x86 selftests.  Please holler if I
> > missed something subtle about patch 7 (using & vs. ==).  This is at the head
> > of kvm-x86/selftests, i.e. I can fix it up if necessary.
> > 
> > [01/13] x86/fpu/xstate: Avoid getting xstate address of init_fpstate if fpstate contains the component
> >         (no commit info)
> 
> *sigh*  And by "everything" I meant "all of the selftests patches".

Continuing my circus of goofs, I already force pushed selftests due to an unrelated
mixup.  New hashes below (the comment above still stands in case another overwrite
is necessary).

       [1/11] KVM: selftests: Add a fully functional "struct xstate" for x86
             https://github.com/kvm-x86/linux/commit/5de4a3765b7e
       [2/11] KVM: selftests: Fix an error in comment of amx_test
             https://github.com/kvm-x86/linux/commit/bec357a4af55
       [3/11] KVM: selftests: Enable checking on xcomp_bv in amx_test
             https://github.com/kvm-x86/linux/commit/48ad4222c43c
       [4/11] KVM: selftests: Add check of CR0.TS in the #NM handler in amx_test
             https://github.com/kvm-x86/linux/commit/0aeb9729486a
       [5/11] KVM: selftests: Assert that XTILE_DATA is set in IA32_XFD on #NM
             https://github.com/kvm-x86/linux/commit/9cbd9aaa670f
       [6/11] KVM: selftests: Verify XTILE_DATA in XSTATE isn't affected by IA32_XFD
             https://github.com/kvm-x86/linux/commit/bfc5afc37c9d
       [7/11] KVM: selftests: Assert that XTILE is XSAVE-enabled
             https://github.com/kvm-x86/linux/commit/7e1075f05078
       [8/11] KVM: selftests: Assert that both XTILE{CFG,DATA} are XSAVE-enabled
             https://github.com/kvm-x86/linux/commit/2ab3991b0b9b
       [9/11] KVM: selftests: Move XSAVE and OSXSAVE CPUID checks into AMX's init_regs()
             https://github.com/kvm-x86/linux/commit/d01d4a4f7bd2
       [10/11] KVM: selftests: Check that the palette table exists before using it
             https://github.com/kvm-x86/linux/commit/d32fb0714293
       [11/11] KVM: selftests: Check that XTILEDATA supports XFD
             https://github.com/kvm-x86/linux/commit/d563164eaeb1