
[RFC,v2] x86/fpu: make kernel-mode FPU reliably usable in softirqs

Message ID 20250304204954.3901-1-ebiggers@kernel.org
State New
Series [RFC,v2] x86/fpu: make kernel-mode FPU reliably usable in softirqs

Commit Message

Eric Biggers March 4, 2025, 8:49 p.m. UTC
From: Eric Biggers <ebiggers@google.com>

Currently kernel-mode FPU is not always usable in softirq context on
x86, since softirqs can nest inside a kernel-mode FPU section in task
context, and nested use of kernel-mode FPU is not supported.

Therefore, x86 SIMD-optimized code that can be called in softirq context
has to sometimes fall back to non-SIMD code.  There are two options for
the fallback, both of which are pretty terrible:

  (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
      code because it cannot use specialized instructions like AES, SHA,
      or carryless multiplication.

  (b) Execute the request asynchronously using a kworker.  In other
      words, use the "crypto SIMD helper" in crypto/simd.c.

Currently most of the x86 en/decryption code (skcipher and aead
algorithms) uses option (b), since this avoids the slow scalar fallback
and it is easier to wire up.  But option (b) is still really bad for its
own reasons:

  - Punting the request to a kworker is bad for performance too.

  - It forces the algorithm to be marked as asynchronous
    (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
    users who request a synchronous algorithm.  That's another huge
    performance problem, which is especially unfortunate for users who
    don't even do en/decryption in softirq context.

  - It makes all en/decryption operations take a detour through
    crypto/simd.c.  That involves additional checks and an additional
    indirect call, which slow down en/decryption for *everyone*.

Fortunately, the skcipher and aead APIs are only usable in task and
softirq context in the first place.  Thus, if kernel-mode FPU were to be
reliably usable in softirq context, no fallback would be needed.
Indeed, other architectures such as arm, arm64, and riscv have already
done this.

Therefore, this patch updates x86 accordingly to reliably support
kernel-mode FPU in softirqs.

This is done by just disabling softirq processing in kernel-mode FPU
sections (when hardirqs are not already disabled), as that prevents the
nesting that was problematic.

This will delay some softirqs slightly, but only ones that would have
otherwise been nested inside a task context kernel-mode FPU section.
Any such softirqs would have taken the slow fallback path before if they
tried to do any en/decryption.  Now these softirqs will just run at the
end of the task context kernel-mode FPU section (since local_bh_enable()
runs pending softirqs) and will no longer take the slow fallback path.
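
For example (a rough sketch; the function name is hypothetical, but
kernel_fpu_begin()/end() and irq_fpu_usable() are the real APIs), SIMD
code that is reachable from softirq context can then simply do:

	static void example_simd_op(u8 *dst, const u8 *src, unsigned int len)
	{
		kernel_fpu_begin();	/* reliable in task and softirq context */
		/* ... AES-NI / AVX implementation ... */
		kernel_fpu_end();
	}

with no scalar fallback and no detour through crypto/simd.c; only hardirq
and NMI callers still need to check irq_fpu_usable() first.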

Alternatives considered:

- Make kernel-mode FPU sections fully preemptible.  This would require
  growing task_struct by another struct fpstate which is more than 2K.

- Make softirqs save/restore the kernel-mode FPU state to a per-CPU
  struct fpstate when nested use is detected.  Somewhat interesting, but
  seems unnecessary when a simpler solution exists.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---

Changed in v2:
    - Retain support for kernel-mode FPU when hardirqs are disabled
      (needed for system suspend and resume)
    - Send the x86 patch as a standalone patch.

 arch/x86/include/asm/fpu/api.h | 17 +++++++----------
 arch/x86/kernel/fpu/core.c     | 17 +++++++++++++----
 2 files changed, 20 insertions(+), 14 deletions(-)


base-commit: 0ad2507d5d93f39619fc42372c347d6006b64319

Comments

Ingo Molnar March 5, 2025, 9:07 a.m. UTC | #1
* Eric Biggers <ebiggers@kernel.org> wrote:

> From: Eric Biggers <ebiggers@google.com>
> 
> Currently kernel-mode FPU is not always usable in softirq context on
> x86, since softirqs can nest inside a kernel-mode FPU section in task
> context, and nested use of kernel-mode FPU is not supported.
> 
> Therefore, x86 SIMD-optimized code that can be called in softirq context
> has to sometimes fall back to non-SIMD code.  There are two options for
> the fallback, both of which are pretty terrible:
> 
>   (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
>       code because it cannot use specialized instructions like AES, SHA,
>       or carryless multiplication.
> 
>   (b) Execute the request asynchronously using a kworker.  In other
>       words, use the "crypto SIMD helper" in crypto/simd.c.
> 
> Currently most of the x86 en/decryption code (skcipher and aead
> algorithms) uses option (b), since this avoids the slow scalar fallback
> and it is easier to wire up.  But option (b) is still really bad for its
> own reasons:
> 
>   - Punting the request to a kworker is bad for performance too.
>
>   - It forces the algorithm to be marked as asynchronous
>     (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
>     users who request a synchronous algorithm.  That's another huge
>     performance problem, which is especially unfortunate for users who
>     don't even do en/decryption in softirq context.
> 
>   - It makes all en/decryption operations take a detour through
>     crypto/simd.c.  That involves additional checks and an additional
>     indirect call, which slow down en/decryption for *everyone*.
> 
> Fortunately, the skcipher and aead APIs are only usable in task and 
> softirq context in the first place.  Thus, if kernel-mode FPU were to 
> be reliably usable in softirq context, no fallback would be needed. 
> Indeed, other architectures such as arm, arm64, and riscv have 
> already done this.
> 
> Therefore, this patch updates x86 accordingly to reliably support
> kernel-mode FPU in softirqs.
> 
> This is done by just disabling softirq processing in kernel-mode FPU
> sections (when hardirqs are not already disabled), as that prevents the
> nesting that was problematic.
> 
> This will delay some softirqs slightly, but only ones that would have
> otherwise been nested inside a task context kernel-mode FPU section.
> Any such softirqs would have taken the slow fallback path before if they
> tried to do any en/decryption.  Now these softirqs will just run at the
> end of the task context kernel-mode FPU section (since local_bh_enable()
> runs pending softirqs) and will no longer take the slow fallback path.
> 
> Alternatives considered:
> 
> - Make kernel-mode FPU sections fully preemptible.  This would require
>   growing task_struct by another struct fpstate which is more than 2K.

So that's something that will probably happen once the kernel is built 
using APX anyway?

> - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
>   struct fpstate when nested use is detected.  Somewhat interesting, but
>   seems unnecessary when a simpler solution exists.

So:

>  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>  {
> -	preempt_disable();
> +	if (!irqs_disabled())
> +		fpregs_lock();

> +	if (!irqs_disabled())
> +		fpregs_unlock();

So why is the irqs_disabled() check needed here? (On x86 it can be a 
bit expensive at times, because the IRQ flag has to be loaded, 
including all flags, so basically it's a soft synchronization point of 
a sort.)

Ie. why cannot we simply do a local_bh_disable()/enable() pair (on 
!RT), ie. fpregs_lock()/fpregs_unlock()?

local_bh_disable() is very similar in cost to preempt_disable(), both 
are increasing the preempt_count.
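
For reference, fpregs_lock()/fpregs_unlock() in arch/x86/include/asm/fpu/api.h
currently look roughly like this (local_bh_disable()/enable() on !PREEMPT_RT,
preempt_disable()/enable() on PREEMPT_RT):

	static inline void fpregs_lock(void)
	{
		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			local_bh_disable();
		else
			preempt_disable();
	}

	static inline void fpregs_unlock(void)
	{
		if (!IS_ENABLED(CONFIG_PREEMPT_RT))
			local_bh_enable();
		else
			preempt_enable();
	}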

Thanks,

	Ingo
Dave Hansen March 5, 2025, 4:55 p.m. UTC | #2
On 3/5/25 01:07, Ingo Molnar wrote:>> Alternatives considered:
>> - Make kernel-mode FPU sections fully preemptible.  This would require
>>   growing task_struct by another struct fpstate which is more than 2K.
> 
> So that's something that will probably happen once the kernel is built 
> using APX anyway?

I was expecting that building the kernel with APX would be very
different than a kernel_fpu_begin(). We don't just need *one* more save
area for APX registers: we need a stack, just like normal GPRs.

We'd effectively need to enlarge pt_regs and fix up things like
PUSH_AND_CLEAR_REGS to save the APX registers in addition to the good
old GPRs before calling C code.

That's what I was thinking at least. Did folks have more clever ideas?
Ingo Molnar March 5, 2025, 5:37 p.m. UTC | #3
* Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/5/25 01:07, Ingo Molnar wrote:>> Alternatives considered:
> >> - Make kernel-mode FPU sections fully preemptible.  This would require
> >>   growing task_struct by another struct fpstate which is more than 2K.
> > 
> > So that's something that will probably happen once the kernel is built 
> > using APX anyway?
> 
> I was expecting that building the kernel with APX would be very 
> different than a kernel_fpu_begin(). We don't just need *one* more 
> save area for APX registers: we need a stack, just like normal GPRs.

Yes - but my point is: with any APX build we'd probably be saving 
FPU(-ish) registers at entry points, into a separate context area. If 
that includes FPU registers then we'd not have to do 
kernel_fpu_begin()/end().

In other words, we'd be doing something close to 'growing task_struct 
by another struct fpstate', or so - regardless of whether it's in 
task_struct or some sort of extended pt_regs. The kernel would also be 
close to 'FPU-safe', i.e. there likely wouldn't be a need for 
kernel_fpu_begin()/end().

Thanks,

	Ingo
Eric Biggers March 5, 2025, 5:39 p.m. UTC | #4
On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers@kernel.org> wrote:
> 
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > Currently kernel-mode FPU is not always usable in softirq context on
> > x86, since softirqs can nest inside a kernel-mode FPU section in task
> > context, and nested use of kernel-mode FPU is not supported.
> > 
> > Therefore, x86 SIMD-optimized code that can be called in softirq context
> > has to sometimes fall back to non-SIMD code.  There are two options for
> > the fallback, both of which are pretty terrible:
> > 
> >   (a) Use a scalar fallback.  This can be 10-100x slower than vectorized
> >       code because it cannot use specialized instructions like AES, SHA,
> >       or carryless multiplication.
> > 
> >   (b) Execute the request asynchronously using a kworker.  In other
> >       words, use the "crypto SIMD helper" in crypto/simd.c.
> > 
> > Currently most of the x86 en/decryption code (skcipher and aead
> > algorithms) uses option (b), since this avoids the slow scalar fallback
> > and it is easier to wire up.  But option (b) is still really bad for its
> > own reasons:
> > 
> >   - Punting the request to a kworker is bad for performance too.
> >
> >   - It forces the algorithm to be marked as asynchronous
> >     (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
> >     users who request a synchronous algorithm.  That's another huge
> >     performance problem, which is especially unfortunate for users who
> >     don't even do en/decryption in softirq context.
> > 
> >   - It makes all en/decryption operations take a detour through
> >     crypto/simd.c.  That involves additional checks and an additional
> >     indirect call, which slow down en/decryption for *everyone*.
> > 
> > Fortunately, the skcipher and aead APIs are only usable in task and 
> > softirq context in the first place.  Thus, if kernel-mode FPU were to 
> > be reliably usable in softirq context, no fallback would be needed. 
> > Indeed, other architectures such as arm, arm64, and riscv have 
> > already done this.
> > 
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs.
> > 
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections (when hardirqs are not already disabled), as that prevents the
> > nesting that was problematic.
> > 
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if they
> > tried to do any en/decryption.  Now these softirqs will just run at the
> > end of the task context kernel-mode FPU section (since local_bh_enable()
> > runs pending softirqs) and will no longer take the slow fallback path.
> > 
> > Alternatives considered:
> > 
> > - Make kernel-mode FPU sections fully preemptible.  This would require
> >   growing task_struct by another struct fpstate which is more than 2K.
> 
> So that's something that will probably happen once the kernel is built 
> using APX anyway?

The APX state is just 16 GPRs, for 128 bytes total.  That's about 5% of the size
of the fpstate (assuming AVX512 is supported).  As Dave mentioned, for in-kernel
use of APX it probably will make more sense to treat the new GPRs like the
existing ones, instead of using XSTATE and integrating it with kernel-mode FPU.
I.e., they will be saved/restored using plain moves to/from a dedicated buffer.

> 
> > - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
> >   struct fpstate when nested use is detected.  Somewhat interesting, but
> >   seems unnecessary when a simpler solution exists.
> 
> So:
> 
> >  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
> >  {
> > -	preempt_disable();
> > +	if (!irqs_disabled())
> > +		fpregs_lock();
> 
> > +	if (!irqs_disabled())
> > +		fpregs_unlock();
> 
> So why is the irqs_disabled() check needed here? (On x86 it can be a 
> bit expensive at times, because the IRQ flag has to be loaded, 
> including all flags, so basically it's a soft synchronization point of 
> a sort.)
> 
> Ie. why cannot we simply do a local_bh_disable()/enable() pair (on 
> !RT), ie. fpregs_lock()/fpregs_unlock()?
> 
> local_bh_disable() is very similar in cost to preempt_disable(), both 
> are increasing the preempt_count.

It's to keep kernel_fpu_begin()/end() working when hardirqs are disabled, since
local_bh_disable()/enable() require that hardirqs be enabled.  See the changelog
and https://lore.kernel.org/r/20250228035924.GC5588@sol.localdomain/.  There are
other directions we could go, but this seems to be the simplest solution.  If we
forbid kernel_fpu_begin() with hardirqs disabled (as v1 of this patch did), then a call to
irqs_disabled() is still needed in irq_fpu_usable().  To avoid irqs_disabled()
entirely, we'd need to avoid disabling softirqs, which would mean supporting
nested kernel-mode FPU in softirqs.  I can send out a patch that does that using
a per-CPU buffer, if you'd like to see that.  I wasn't super happy with the
extra edge cases and memory usage, but we could go in that direction.
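
A minimal sketch of that constraint (the function name below is hypothetical;
lockdep_assert_irqs_disabled() and kernel_fpu_begin()/end() are the real APIs):

	static void example_resume_step(void)
	{
		lockdep_assert_irqs_disabled();	/* suspend/resume path: hardirqs off */

		kernel_fpu_begin();	/* must still work here, so it cannot
					 * unconditionally call local_bh_disable() */
		/* ... SIMD work, e.g. restoring state ... */
		kernel_fpu_end();	/* likewise cannot call local_bh_enable() here */
	}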

- Eric
Dave Hansen March 5, 2025, 6:04 p.m. UTC | #5
On 3/5/25 09:37, Ingo Molnar wrote:
> 
> * Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 3/5/25 01:07, Ingo Molnar wrote:>> Alternatives considered:
>>>> - Make kernel-mode FPU sections fully preemptible.  This would require
>>>>   growing task_struct by another struct fpstate which is more than 2K.
>>>
>>> So that's something that will probably happen once the kernel is built 
>>> using APX anyway?
>>
>> I was expecting that building the kernel with APX would be very 
>> different than a kernel_fpu_begin(). We don't just need *one* more 
>> save area for APX registers: we need a stack, just like normal GPRs.
> 
> Yes - but my point is: with any APX build we'd probably be saving 
> FPU(-ish) registers at entry points, into a separate context area. If 
> that includes FPU registers then we'd not have to do 
> kernel_fpu_begin()/end().

That's true. But wouldn't it be a bit silly to include _all_ FPU
registers? If the kernel isn't using AVX512, why bother saving and
restoring AVX512?

> In other words, we'd be doing something close to 'growing task_struct 
> by another struct fpstate', or so - regardless of whether it's in 
> task_struct or some sort of extended pt_regs. The kernel would also be 
> close to 'FPU-safe', i.e. there likely wouldn't be a need for 
> kernel_fpu_begin()/end().
The new APX registers are 128 bytes, total. 'struct fpstate' is ~3k on
most CPUs these days and >11k with AMX.

I was thinking that growing things (say pt_regs) by 128b would be
acceptable, given some nice performance gains from APX. Growing by 3k
would cause some real headaches. Growing by 11k would be a non-starter.

I'm pretty sure the torches and pitchforks would come out if our syscall
latency included another 1k of save/restore, much less 3k or 11k.
Ingo Molnar March 5, 2025, 6:09 p.m. UTC | #6
* Eric Biggers <ebiggers@kernel.org> wrote:

> [...] To avoid irqs_disabled() entirely, we'd need to avoid disabling 
> softirqs, which would mean supporting nested kernel-mode FPU in 
> softirqs.  I can send out a patch that does that using a per-CPU 
> buffer, if you'd like to see that.  I wasn't super happy with the 
> extra edge cases and memory usage, but we could go in that direction.

Meh: so I just checked, and local_bh_disable()/enable() are pretty 
heavy these days - it's not just a simple preempt-count twiddle and a 
check anymore. :-/ I don't think my initial argument of irqs_disabled() 
overhead is really valid - and if we really cared we could halve it by 
saving the irqs_disabled() status at kernel_fpu_begin() time and 
reading it at kernel_fpu_end() time.
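
A rough sketch of that (the per-CPU flag name is hypothetical, and the
ordering matters so the flag is only written once preemption is off):

	static DEFINE_PER_CPU(bool, kfpu_irqs_were_disabled);

	void kernel_fpu_begin_mask(unsigned int kfpu_mask)
	{
		bool irqs_off = irqs_disabled();

		if (!irqs_off)
			fpregs_lock();
		/* preemption is now off either way, so this CPU is stable */
		__this_cpu_write(kfpu_irqs_were_disabled, irqs_off);
		/* ... rest unchanged from the patch ... */
	}

	void kernel_fpu_end(void)
	{
		/* ... rest unchanged from the patch ... */
		if (!__this_cpu_read(kfpu_irqs_were_disabled))
			fpregs_unlock();
	}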

And the alternative of having nested FPU usage and extra per-CPU FPU 
save areas for the kernel feels a bit fragile, even without having seen 
the patch.

So I think I'll commit your patch to tip:x86/fpu as-is, unless someone 
objects.


BTW., a side note, I was also reviewing the kernel_fpu_begin()/end() 
codepaths, and we have gems like:

        /* Put sane initial values into the control registers. */
        if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
                ldmxcsr(MXCSR_DEFAULT);

        if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
                asm volatile ("fninit");

has the LDMXCSR instruction, or its effects, ever shown up in profiles?

Because AFAICS these will execute all the time on x86-64, because:

static inline void kernel_fpu_begin(void)
{
#ifdef CONFIG_X86_64
        /*
         * Any 64-bit code that uses 387 instructions must explicitly request
         * KFPU_387.
         */
        kernel_fpu_begin_mask(KFPU_MXCSR);

And X86_FEATURE_XMM is set in pretty much every x86 CPU.

Thanks,

	Ingo
Ingo Molnar March 5, 2025, 6:13 p.m. UTC | #7
* Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/5/25 09:37, Ingo Molnar wrote:
> > 
> > * Dave Hansen <dave.hansen@intel.com> wrote:
> > 
> >> On 3/5/25 01:07, Ingo Molnar wrote:>> Alternatives considered:
> >>>> - Make kernel-mode FPU sections fully preemptible.  This would require
> >>>>   growing task_struct by another struct fpstate which is more than 2K.
> >>>
> >>> So that's something that will probably happen once the kernel is built 
> >>> using APX anyway?
> >>
> >> I was expecting that building the kernel with APX would be very 
> >> different than a kernel_fpu_begin(). We don't just need *one* more 
> >> save area for APX registers: we need a stack, just like normal GPRs.
> > 
> > Yes - but my point is: with any APX build we'd probably be saving 
> > FPU(-ish) registers at entry points, into a separate context area. If 
> > that includes FPU registers then we'd not have to do 
> > kernel_fpu_begin()/end().
> 
> That's true. But wouldn't it be a bit silly to include _all_ FPU
> registers? If the kernel isn't using AVX512, why bother saving and
> restoring AVX512?

Fair enough - although I bet the execution time difference between a 
partial and a full FPU context save isn't as large as the buffer size 
would suggest... There's a lot of setup cost in XSAVE* instructions 
last I checked.

Thanks,

	Ingo
Eric Biggers March 5, 2025, 8:30 p.m. UTC | #8
On Wed, Mar 05, 2025 at 07:09:15PM +0100, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers@kernel.org> wrote:
> 
> > [...] To avoid irqs_disabled() entirely, we'd need to avoid disabling 
> > softirqs, which would mean supporting nested kernel-mode FPU in 
> > softirqs.  I can send out a patch that does that using a per-CPU 
> > buffer, if you'd like to see that.  I wasn't super happy with the 
> > extra edge cases and memory usage, but we could go in that direction.
> 
> Meh: so I just checked, and local_bh_disable()/enable() are pretty 
> heavy these days - it's not just a simple preempt-count twiddle and a 
> check anymore. :-/ I don't think my initial argument of irqs_disabled() 
> overhead is really valid - and if we really cared we could halve it by 
> saving the irqs_disabled() status at kernel_fpu_begin() time and 
> reading it at kernel_fpu_end() time.
> 
> And the alternative of having nested FPU usage and extra per-CPU FPU 
> save areas for the kernel feels a bit fragile, even without having seen 
> the patch.
> 
> So I think I'll commit your patch to tip:x86/fpu as-is, unless someone 
> objects.
> 
> 
> BTW., a side note, I was also reviewing the kernel_fpu_begin()/end() 
> codepaths, and we have gems like:
> 
>         /* Put sane initial values into the control registers. */
>         if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
>                 ldmxcsr(MXCSR_DEFAULT);
> 
>         if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
>                 asm volatile ("fninit");
> 
> has the LDMXCSR instruction, or its effects, ever shown up in profiles?
> 
> Because AFAICS these will execute all the time on x86-64, because:
> 
> static inline void kernel_fpu_begin(void)
> {
> #ifdef CONFIG_X86_64
>         /*
>          * Any 64-bit code that uses 387 instructions must explicitly request
>          * KFPU_387.
>          */
>         kernel_fpu_begin_mask(KFPU_MXCSR);
> 
> And X86_FEATURE_XMM is set in pretty much every x86 CPU.

I did some benchmarks with AES-XTS encryption of 16-byte messages (which is
unrealistically small, but this makes it easier to see the overhead of
kernel-mode FPU...).  The baseline was 384 MB/s.  Removing the use of
crypto/simd.c, which this work makes possible, increases it to 487 MB/s.  v1 of
this patch decreases it to 479 MB/s, and v2 (which added irqs_disabled() checks
to kernel_fpu_begin() and kernel_fpu_end()) decreases it to 461 MB/s.  An
experimental patch that I have that supports nested kernel-mode FPU in softirqs
by using per-CPU areas maintains 480 MB/s.

CPU was AMD Ryzen 9 9950X (Zen 5).  No debugging options were enabled.

Deleting the ldmxcsr(MXCSR_DEFAULT) adds about 14 MB/s.

But given the large improvement from no longer using crypto/simd.c, the other
overheads seem much smaller and maybe are not worth worrying too much about.

In case you're interested, the following is my experimental patch for supporting
nested use in softirqs:

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index f86ad3335529..70729d2bd64f 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -16,11 +16,11 @@
 
 /*
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
  * disables preemption so be careful if you intend to use it for long periods
  * of time.
- * If you intend to use the FPU in irq/softirq you need to check first with
+ * If you intend to use the FPU in hardirq you need to check first with
  * irq_fpu_usable() if it is possible.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
 #define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb21..a524260a0fa6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -42,11 +42,13 @@ struct fpu_state_config fpu_user_cfg __ro_after_init;
  * depending on the FPU hardware format:
  */
 struct fpstate init_fpstate __ro_after_init;
 
 /* Track in-kernel FPU usage */
-static DEFINE_PER_CPU(bool, in_kernel_fpu);
+static DEFINE_PER_CPU(unsigned int, kernel_fpu_depth);
+
+DEFINE_PER_CPU(struct fpstate *, saved_kernel_fpstate);
 
 /*
  * Track which context is using the FPU on the CPU:
  */
 DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
@@ -58,14 +60,10 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 bool irq_fpu_usable(void)
 {
 	if (WARN_ON_ONCE(in_nmi()))
 		return false;
 
-	/* In kernel FPU usage already active? */
-	if (this_cpu_read(in_kernel_fpu))
-		return false;
-
 	/*
 	 * When not in NMI or hard interrupt context, FPU can be used in:
 	 *
 	 * - Task context except from within fpregs_lock()'ed critical
 	 *   regions.
@@ -75,13 +73,14 @@ bool irq_fpu_usable(void)
 	 */
 	if (!in_hardirq())
 		return true;
 
 	/*
-	 * In hard interrupt context it's safe when soft interrupts
-	 * are enabled, which means the interrupt did not hit in
-	 * a fpregs_lock()'ed critical region.
+	 * In hard interrupt context it's safe when soft interrupts are enabled,
+	 * which means the interrupt did not hit in a fpregs_lock()'ed critical
+	 * region, nor did it hit while serving a softirq (which could have
+	 * already been using nested kernel-mode FPU).
 	 */
 	return !softirq_count();
 }
 EXPORT_SYMBOL(irq_fpu_usable);
 
@@ -96,10 +95,34 @@ static void update_avx_timestamp(struct fpu *fpu)
 
 	if (fpu->fpstate->regs.xsave.header.xfeatures & AVX512_TRACKING_MASK)
 		fpu->avx512_timestamp = jiffies;
 }
 
+static __always_inline void
+__save_fpregs_to_fpstate(struct fpstate *fpstate,
+			 struct fpu *fpu, bool have_fpu)
+{
+	if (likely(use_xsave())) {
+		os_xsave(fpstate);
+		if (have_fpu)
+			update_avx_timestamp(fpu);
+		return;
+	}
+
+	if (likely(use_fxsr())) {
+		fxsave(&fpstate->regs.fxsave);
+		return;
+	}
+
+	/*
+	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
+	 * so we have to reload them from the memory state.
+	 */
+	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpstate->regs.fsave));
+	frstor(&fpstate->regs.fsave);
+}
+
 /*
  * Save the FPU register state in fpu->fpstate->regs. The register state is
  * preserved.
  *
  * Must be called with fpregs_lock() held.
@@ -112,27 +135,11 @@ static void update_avx_timestamp(struct fpu *fpu)
  *
  * FXSAVE and all XSAVE variants preserve the FPU register state.
  */
 void save_fpregs_to_fpstate(struct fpu *fpu)
 {
-	if (likely(use_xsave())) {
-		os_xsave(fpu->fpstate);
-		update_avx_timestamp(fpu);
-		return;
-	}
-
-	if (likely(use_fxsr())) {
-		fxsave(&fpu->fpstate->regs.fxsave);
-		return;
-	}
-
-	/*
-	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
-	 * so we have to reload them from the memory state.
-	 */
-	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->fpstate->regs.fsave));
-	frstor(&fpu->fpstate->regs.fsave);
+	__save_fpregs_to_fpstate(fpu->fpstate, fpu, true);
 }
 
 void restore_fpregs_from_fpstate(struct fpstate *fpstate, u64 mask)
 {
 	/*
@@ -418,23 +425,31 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
 #endif /* CONFIG_KVM */
 
 void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
+	unsigned int prev_depth;
+
 	preempt_disable();
 
 	WARN_ON_FPU(!irq_fpu_usable());
-	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
-
-	this_cpu_write(in_kernel_fpu, true);
-
-	if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
-	    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
-		set_thread_flag(TIF_NEED_FPU_LOAD);
-		save_fpregs_to_fpstate(&current->thread.fpu);
+	prev_depth = __this_cpu_read(kernel_fpu_depth);
+	__this_cpu_write(kernel_fpu_depth, prev_depth + 1);
+
+	if (prev_depth != 0) {
+		WARN_ON_FPU(in_task());
+		WARN_ON_FPU(prev_depth != 1);
+		__save_fpregs_to_fpstate(__this_cpu_read(saved_kernel_fpstate),
+					 NULL, false);
+	} else {
+		if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
+		    !test_thread_flag(TIF_NEED_FPU_LOAD)) {
+			set_thread_flag(TIF_NEED_FPU_LOAD);
+			save_fpregs_to_fpstate(&current->thread.fpu);
+		}
+		__cpu_invalidate_fpregs_state();
 	}
-	__cpu_invalidate_fpregs_state();
 
 	/* Put sane initial values into the control registers. */
 	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
 		ldmxcsr(MXCSR_DEFAULT);
 
@@ -443,13 +458,18 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 
 void kernel_fpu_end(void)
 {
-	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
+	unsigned int depth = __this_cpu_read(kernel_fpu_depth);
 
-	this_cpu_write(in_kernel_fpu, false);
+	if (depth > 1)
+		restore_fpregs_from_fpstate(__this_cpu_read(saved_kernel_fpstate),
+					    XFEATURE_MASK_FPSTATE);
+	else if (WARN_ON_ONCE(depth == 0))
+		depth = 1;
+	__this_cpu_write(kernel_fpu_depth, depth - 1);
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
 /*
@@ -468,19 +488,10 @@ void fpu_sync_fpstate(struct fpu *fpu)
 
 	trace_x86_fpu_after_save(fpu);
 	fpregs_unlock();
 }
 
-static inline unsigned int init_fpstate_copy_size(void)
-{
-	if (!use_xsave())
-		return fpu_kernel_cfg.default_size;
-
-	/* XSAVE(S) just needs the legacy and the xstate header part */
-	return sizeof(init_fpstate.regs.xsave);
-}
-
 static inline void fpstate_init_fxstate(struct fpstate *fpstate)
 {
 	fpstate->regs.fxsave.cwd = 0x37f;
 	fpstate->regs.fxsave.mxcsr = MXCSR_DEFAULT;
 }
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 998a08f17e33..549852552f51 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -7,10 +7,11 @@
 #include <asm/setup.h>
 
 #include <linux/sched.h>
 #include <linux/sched/task.h>
 #include <linux/init.h>
+#include <linux/slab.h>
 
 #include "internal.h"
 #include "legacy.h"
 #include "xstate.h"
 
@@ -205,10 +206,35 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 	fpu_user_cfg.max_size = size;
 	fpu_user_cfg.default_size = size;
 	fpstate_reset(&current->thread.fpu);
 }
 
+/*
+ * Allocate per-CPU areas for saving the kernel-mode FPU registers.  This is
+ * needed to reliably support use of kernel-mode FPU in softirqs:
+ */
+static void __init fpu__init_saved_kernel_fpstates(void)
+{
+	size_t size;
+	void *p;
+	struct fpstate *fpstate;
+	int cpu;
+
+	size = arch_task_struct_size;
+	size -= offsetof(struct task_struct, thread.fpu.__fpstate);
+	size += (__alignof__(struct fpstate) - 1) & ~(ARCH_KMALLOC_MINALIGN - 1);
+	for_each_possible_cpu(cpu) {
+		p = kmalloc(size, GFP_KERNEL);
+		if (!p)
+			panic("Out of memory");
+		fpstate = PTR_ALIGN(p, __alignof__(struct fpstate));
+		memcpy(&fpstate->regs, &init_fpstate.regs,
+		       init_fpstate_copy_size());
+		per_cpu(saved_kernel_fpstate, cpu) = fpstate;
+	}
+}
+
 /*
  * Called on the boot CPU once per system bootup, to set up the initial
  * FPU state that is later cloned into all processes:
  */
 void __init fpu__init_system(void)
@@ -224,6 +250,7 @@ void __init fpu__init_system(void)
 
 	fpu__init_system_generic();
 	fpu__init_system_xstate_size_legacy();
 	fpu__init_system_xstate(fpu_kernel_cfg.max_size);
 	fpu__init_task_struct_size();
+	fpu__init_saved_kernel_fpstates();
 }
diff --git a/arch/x86/kernel/fpu/internal.h b/arch/x86/kernel/fpu/internal.h
index dbdb31f55fc7..24f5f6d238b9 100644
--- a/arch/x86/kernel/fpu/internal.h
+++ b/arch/x86/kernel/fpu/internal.h
@@ -23,6 +23,17 @@ static __always_inline __pure bool use_fxsr(void)
 
 /* Used in init.c */
 extern void fpstate_init_user(struct fpstate *fpstate);
 extern void fpstate_reset(struct fpu *fpu);
 
+DECLARE_PER_CPU(struct fpstate *, saved_kernel_fpstate);
+
+static inline unsigned int init_fpstate_copy_size(void)
+{
+	if (!use_xsave())
+		return fpu_kernel_cfg.default_size;
+
+	/* XSAVE(S) just needs the legacy and the xstate header part */
+	return sizeof(init_fpstate.regs.xsave);
+}
+
 #endif
David Laight March 5, 2025, 9:22 p.m. UTC | #9
On Wed, 5 Mar 2025 18:37:25 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> * Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 3/5/25 01:07, Ingo Molnar wrote:>> Alternatives considered:  
> > >> - Make kernel-mode FPU sections fully preemptible.  This would require
> > >>   growing task_struct by another struct fpstate which is more than 2K.  
> > > 
> > > So that's something that will probably happen once the kernel is built 
> > > using APX anyway?  
> > 
> > I was expecting that building the kernel with APX would be very 
> > different than a kernel_fpu_begin(). We don't just need *one* more 
> > save area for APX registers: we need a stack, just like normal GPRs.  
> 
> Yes - but my point is: with any APX build we'd probably be saving 
> FPU(-ish) registers at entry points, into a separate context area. If 
> that includes FPU registers then we'd not have to do 
> kernel_fpu_begin()/end().

Since the registers are caller-saved (like the SSE-onwards ones),
none of them really need to be saved on syscall entry
(just zeroed on return).
They do need saving on interrupt entry.

For some unknown reason the kernel saves the xyzmm ones on syscall entry.
For normal programs they won't be live - because the asm syscall
wrapper is called from C.
So I think they can only be live if a system call is directly inlined
into the C function. Just marking them all 'clobbered' would have done.
But it's now all too late to change.

	David

> 
> In other words, we'd be doing something close to 'growing task_struct 
> by another struct fpstate', or so - regardless of whether it's in 
> task_struct or some sort of extended pt_regs. The kernel would also be 
> close to 'FPU-safe', i.e. there likely wouldn't be a need for 
> kernel_fpu_begin()/end().
> 
> Thanks,
> 
> 	Ingo
>
Ingo Molnar March 6, 2025, 11:42 a.m. UTC | #10
* Eric Biggers <ebiggers@kernel.org> wrote:

> Deleting the ldmxcsr(MXCSR_DEFAULT) adds about 14 MB/s.

BTW., that's still a +3% performance improvement, so it is worth 
considering in addition to the current patch IMO.

Thanks,

	Ingo
Peter Zijlstra March 6, 2025, 12:09 p.m. UTC | #11
On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
> > From: Eric Biggers <ebiggers@google.com>

> > Alternatives considered:
> > 
> > - Make kernel-mode FPU sections fully preemptible.  This would require
> >   growing task_struct by another struct fpstate which is more than 2K.
> 
> So that's something that will probably happen once the kernel is built 
> using APX anyway?

ISTR looking into this at some point for RT. I also have vague memories
of other architectures doing something similar, but its all a long time
ago.

Patch

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index f86ad3335529d..f42de5f05e7eb 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -14,14 +14,13 @@ 
 
 #include <asm/fpu/types.h>
 
 /*
  * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It
- * disables preemption so be careful if you intend to use it for long periods
- * of time.
- * If you intend to use the FPU in irq/softirq you need to check first with
- * irq_fpu_usable() if it is possible.
+ * disables preemption and softirq processing, so be careful if you intend to
+ * use it for long periods of time.  Kernel-mode FPU cannot be used in all
+ * contexts -- see irq_fpu_usable() for details.
  */
 
 /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */
 #define KFPU_387	_BITUL(0)	/* 387 state will be initialized */
 #define KFPU_MXCSR	_BITUL(1)	/* MXCSR will be initialized */
@@ -48,25 +47,23 @@  static inline void kernel_fpu_begin(void)
 	kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR);
 #endif
 }
 
 /*
- * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate.
- * A context switch will (and softirq might) save CPU's FPU registers to
- * fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in
- * a random state.
+ * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate, or while
+ * using the FPU in kernel mode.  A context switch will (and softirq might) save
+ * CPU's FPU registers to fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving
+ * CPU's FPU registers in a random state.
  *
  * local_bh_disable() protects against both preemption and soft interrupts
  * on !RT kernels.
  *
  * On RT kernels local_bh_disable() is not sufficient because it only
  * serializes soft interrupt related sections via a local lock, but stays
  * preemptible. Disabling preemption is the right choice here as bottom
  * half processing is always in thread context on RT kernels so it
  * implicitly prevents bottom half processing as well.
- *
- * Disabling preemption also serializes against kernel_fpu_begin().
  */
 static inline void fpregs_lock(void)
 {
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
 		local_bh_disable();
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 1209c7aebb211..f937329d1aa9c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -58,13 +58,20 @@  DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 bool irq_fpu_usable(void)
 {
 	if (WARN_ON_ONCE(in_nmi()))
 		return false;
 
-	/* In kernel FPU usage already active? */
-	if (this_cpu_read(in_kernel_fpu))
+	/*
+	 * In kernel FPU usage already active?  This detects any explicitly
+	 * nested usage in task or softirq context, which is unsupported.  It
+	 * also detects attempted usage in a hardirq that has interrupted a
+	 * kernel-mode FPU section.
+	 */
+	if (this_cpu_read(in_kernel_fpu)) {
+		WARN_ON_FPU(!in_hardirq());
 		return false;
+	}
 
 	/*
 	 * When not in NMI or hard interrupt context, FPU can be used in:
 	 *
 	 * - Task context except from within fpregs_lock()'ed critical
@@ -418,11 +425,12 @@  int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf,
 EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate);
 #endif /* CONFIG_KVM */
 
 void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 {
-	preempt_disable();
+	if (!irqs_disabled())
+		fpregs_lock();
 
 	WARN_ON_FPU(!irq_fpu_usable());
 	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, true);
@@ -446,11 +454,12 @@  EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask);
 void kernel_fpu_end(void)
 {
 	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
 
 	this_cpu_write(in_kernel_fpu, false);
-	preempt_enable();
+	if (!irqs_disabled())
+		fpregs_unlock();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
 
 /*
  * Sync the FPU register state to current's memory register state when the