Message ID: 20250304204954.3901-1-ebiggers@kernel.org
State: New
Series: [RFC,v2] x86/fpu: make kernel-mode FPU reliably usable in softirqs
* Eric Biggers <ebiggers@kernel.org> wrote:

> From: Eric Biggers <ebiggers@google.com>
>
> Currently kernel-mode FPU is not always usable in softirq context on
> x86, since softirqs can nest inside a kernel-mode FPU section in task
> context, and nested use of kernel-mode FPU is not supported.
>
> Therefore, x86 SIMD-optimized code that can be called in softirq context
> has to sometimes fall back to non-SIMD code. There are two options for
> the fallback, both of which are pretty terrible:
>
> (a) Use a scalar fallback. This can be 10-100x slower than vectorized
>     code because it cannot use specialized instructions like AES, SHA,
>     or carryless multiplication.
>
> (b) Execute the request asynchronously using a kworker. In other
>     words, use the "crypto SIMD helper" in crypto/simd.c.
>
> Currently most of the x86 en/decryption code (skcipher and aead
> algorithms) uses option (b), since this avoids the slow scalar fallback
> and it is easier to wire up. But option (b) is still really bad for its
> own reasons:
>
> - Punting the request to a kworker is bad for performance too.
>
> - It forces the algorithm to be marked as asynchronous
>   (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
>   users who request a synchronous algorithm. That's another huge
>   performance problem, which is especially unfortunate for users who
>   don't even do en/decryption in softirq context.
>
> - It makes all en/decryption operations take a detour through
>   crypto/simd.c. That involves additional checks and an additional
>   indirect call, which slow down en/decryption for *everyone*.
>
> Fortunately, the skcipher and aead APIs are only usable in task and
> softirq context in the first place. Thus, if kernel-mode FPU were to
> be reliably usable in softirq context, no fallback would be needed.
> Indeed, other architectures such as arm, arm64, and riscv have
> already done this.
>
> Therefore, this patch updates x86 accordingly to reliably support
> kernel-mode FPU in softirqs.
>
> This is done by just disabling softirq processing in kernel-mode FPU
> sections (when hardirqs are not already disabled), as that prevents the
> nesting that was problematic.
>
> This will delay some softirqs slightly, but only ones that would have
> otherwise been nested inside a task context kernel-mode FPU section.
> Any such softirqs would have taken the slow fallback path before if they
> tried to do any en/decryption. Now these softirqs will just run at the
> end of the task context kernel-mode FPU section (since local_bh_enable()
> runs pending softirqs) and will no longer take the slow fallback path.
>
> Alternatives considered:
>
> - Make kernel-mode FPU sections fully preemptible. This would require
>   growing task_struct by another struct fpstate which is more than 2K.

So that's something that will probably happen once the kernel is built
using APX anyway?

> - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
>   struct fpstate when nested use is detected. Somewhat interesting, but
>   seems unnecessary when a simpler solution exists.

So:

>  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
>  {
> -	preempt_disable();
> +	if (!irqs_disabled())
> +		fpregs_lock();

> +	if (!irqs_disabled())
> +		fpregs_unlock();

So why is the irqs_disabled() check needed here? (On x86 it can be a
bit expensive at times, because the IRQ flag has to be loaded,
including all flags, so basically it's a soft synchronization point of
a sort.)

Ie. why cannot we simply do a local_bh_disable()/enable() pair (on
!RT), ie. fpregs_lock()/fpregs_unlock()?

local_bh_disable() is very similar in cost to preempt_disable(), both
are increasing the preempt_count.

Thanks,

	Ingo
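To make options (a) and (b) from the changelog concrete: before this
change, x86 SIMD code reachable from softirq context typically has to
carry an irq_fpu_usable() check plus a fallback, roughly as sketched
below. This is only an illustrative sketch; my_update(),
my_simd_update() and my_scalar_update() are placeholder names, not real
kernel symbols.

#include <linux/types.h>
#include <asm/fpu/api.h>	/* irq_fpu_usable(), kernel_fpu_begin/end() */

/* Placeholder helpers, for illustration only. */
void my_simd_update(u8 *dst, const u8 *src, unsigned int len);
void my_scalar_update(u8 *dst, const u8 *src, unsigned int len);

static void my_update(u8 *dst, const u8 *src, unsigned int len)
{
	if (irq_fpu_usable()) {
		kernel_fpu_begin();
		my_simd_update(dst, src, len);	/* vectorized fast path */
		kernel_fpu_end();
	} else {
		/* Option (a): the 10-100x slower scalar fallback. */
		my_scalar_update(dst, src, len);
	}
}

Option (b) instead registers the algorithm through crypto/simd.c so
requests are punted to a kworker when the FPU is not usable. With the
patch at the bottom of this page, code running in task or softirq
context no longer needs either workaround.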
On 3/5/25 01:07, Ingo Molnar wrote:
>> Alternatives considered:
>>
>> - Make kernel-mode FPU sections fully preemptible. This would require
>>   growing task_struct by another struct fpstate which is more than 2K.
>
> So that's something that will probably happen once the kernel is built
> using APX anyway?

I was expecting that building the kernel with APX would be very
different than a kernel_fpu_begin(). We don't just need *one* more save
area for APX registers: we need a stack, just like normal GPRs. We'd
effectively need to enlarge pt_regs and fix up things like
PUSH_AND_CLEAR_REGS to save the APX registers in addition to the good
old GPRs before calling C code.

That's what I was thinking at least. Did folks have more clever ideas?
* Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/5/25 01:07, Ingo Molnar wrote:
>>> Alternatives considered:
>>>
>>> - Make kernel-mode FPU sections fully preemptible. This would require
>>>   growing task_struct by another struct fpstate which is more than 2K.
>>
>> So that's something that will probably happen once the kernel is built
>> using APX anyway?
>
> I was expecting that building the kernel with APX would be very
> different than a kernel_fpu_begin(). We don't just need *one* more
> save area for APX registers: we need a stack, just like normal GPRs.

Yes - but my point is: with any APX build we'd probably be saving
FPU(-ish) registers at entry points, into a separate context area. If
that includes FPU registers then we'd not have to do
kernel_fpu_begin()/end().

In other words, we'd be doing something close to 'growing task_struct
by another struct fpstate', or so - regardless of whether it's in
task_struct or some sort of extended pt_regs. The kernel would also be
close to 'FPU-safe', i.e. there likely wouldn't be a need for
kernel_fpu_begin()/end().

Thanks,

	Ingo
On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
>
> * Eric Biggers <ebiggers@kernel.org> wrote:
>
> > From: Eric Biggers <ebiggers@google.com>
> >
> > Currently kernel-mode FPU is not always usable in softirq context on
> > x86, since softirqs can nest inside a kernel-mode FPU section in task
> > context, and nested use of kernel-mode FPU is not supported.
> >
> > Therefore, x86 SIMD-optimized code that can be called in softirq context
> > has to sometimes fall back to non-SIMD code. There are two options for
> > the fallback, both of which are pretty terrible:
> >
> > (a) Use a scalar fallback. This can be 10-100x slower than vectorized
> >     code because it cannot use specialized instructions like AES, SHA,
> >     or carryless multiplication.
> >
> > (b) Execute the request asynchronously using a kworker. In other
> >     words, use the "crypto SIMD helper" in crypto/simd.c.
> >
> > Currently most of the x86 en/decryption code (skcipher and aead
> > algorithms) uses option (b), since this avoids the slow scalar fallback
> > and it is easier to wire up. But option (b) is still really bad for its
> > own reasons:
> >
> > - Punting the request to a kworker is bad for performance too.
> >
> > - It forces the algorithm to be marked as asynchronous
> >   (CRYPTO_ALG_ASYNC), preventing it from being used by crypto API
> >   users who request a synchronous algorithm. That's another huge
> >   performance problem, which is especially unfortunate for users who
> >   don't even do en/decryption in softirq context.
> >
> > - It makes all en/decryption operations take a detour through
> >   crypto/simd.c. That involves additional checks and an additional
> >   indirect call, which slow down en/decryption for *everyone*.
> >
> > Fortunately, the skcipher and aead APIs are only usable in task and
> > softirq context in the first place. Thus, if kernel-mode FPU were to
> > be reliably usable in softirq context, no fallback would be needed.
> > Indeed, other architectures such as arm, arm64, and riscv have
> > already done this.
> >
> > Therefore, this patch updates x86 accordingly to reliably support
> > kernel-mode FPU in softirqs.
> >
> > This is done by just disabling softirq processing in kernel-mode FPU
> > sections (when hardirqs are not already disabled), as that prevents the
> > nesting that was problematic.
> >
> > This will delay some softirqs slightly, but only ones that would have
> > otherwise been nested inside a task context kernel-mode FPU section.
> > Any such softirqs would have taken the slow fallback path before if they
> > tried to do any en/decryption. Now these softirqs will just run at the
> > end of the task context kernel-mode FPU section (since local_bh_enable()
> > runs pending softirqs) and will no longer take the slow fallback path.
> >
> > Alternatives considered:
> >
> > - Make kernel-mode FPU sections fully preemptible. This would require
> >   growing task_struct by another struct fpstate which is more than 2K.
>
> So that's something that will probably happen once the kernel is built
> using APX anyway?

The APX state is just 16 GPRs, for 128 bytes total. That's about 5% of
the size of the fpstate (assuming AVX512 is supported). As Dave
mentioned, for in-kernel use of APX it probably will make more sense to
treat the new GPRs like the existing ones, instead of using XSTATE and
integrating it with kernel-mode FPU. I.e., they will be saved/restored
using plain moves to/from a dedicated buffer.

> > - Make softirqs save/restore the kernel-mode FPU state to a per-CPU
> >   struct fpstate when nested use is detected. Somewhat interesting, but
> >   seems unnecessary when a simpler solution exists.
>
> So:
>
> >  void kernel_fpu_begin_mask(unsigned int kfpu_mask)
> >  {
> > -	preempt_disable();
> > +	if (!irqs_disabled())
> > +		fpregs_lock();
>
> > +	if (!irqs_disabled())
> > +		fpregs_unlock();
>
> So why is the irqs_disabled() check needed here? (On x86 it can be a
> bit expensive at times, because the IRQ flag has to be loaded,
> including all flags, so basically it's a soft synchronization point of
> a sort.)
>
> Ie. why cannot we simply do a local_bh_disable()/enable() pair (on
> !RT), ie. fpregs_lock()/fpregs_unlock()?
>
> local_bh_disable() is very similar in cost to preempt_disable(), both
> are increasing the preempt_count.

It's to keep kernel_fpu_begin()/end() working when hardirqs are
disabled, since local_bh_disable()/enable() require that hardirqs be
enabled. See the changelog and
https://lore.kernel.org/r/20250228035924.GC5588@sol.localdomain/.

There are other directions we could go, but this seems to be the
simplest solution. If we forbid kernel_fpu_begin() with hardirqs
disabled (as PS1 did), then a call to irqs_disabled() is still needed
in irq_fpu_usable(). To avoid irqs_disabled() entirely, we'd need to
avoid disabling softirqs, which would mean supporting nested
kernel-mode FPU in softirqs. I can send out a patch that does that
using a per-CPU buffer, if you'd like to see that. I wasn't super happy
with the extra edge cases and memory usage, but we could go in that
direction.

- Eric
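Condensed, the shape of the v2 change being discussed looks roughly
like the following (the full diff is at the bottom of this page); note
that fpregs_lock() is local_bh_disable() on !PREEMPT_RT and
preempt_disable() on PREEMPT_RT. This is a simplified sketch of the
posted patch, not a drop-in replacement for it.

/* Rough, condensed sketch of the v2 patch; see the full diff below. */
void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
	/*
	 * local_bh_disable() must not be called with hardirqs disabled, so
	 * only take fpregs_lock() (which disables softirqs on !RT) when
	 * hardirqs are on.  A hardirqs-off region already keeps softirqs
	 * from running on this CPU.
	 */
	if (!irqs_disabled())
		fpregs_lock();

	WARN_ON_FPU(!irq_fpu_usable());
	WARN_ON_FPU(this_cpu_read(in_kernel_fpu));
	this_cpu_write(in_kernel_fpu, true);

	/* ... save current's FPU registers, re-init MXCSR/x87 as requested ... */
}

void kernel_fpu_end(void)
{
	WARN_ON_FPU(!this_cpu_read(in_kernel_fpu));
	this_cpu_write(in_kernel_fpu, false);

	if (!irqs_disabled())
		fpregs_unlock();	/* on !RT this may run pending softirqs */
}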
On 3/5/25 09:37, Ingo Molnar wrote:
>
> * Dave Hansen <dave.hansen@intel.com> wrote:
>
>> On 3/5/25 01:07, Ingo Molnar wrote:
>>>> Alternatives considered:
>>>>
>>>> - Make kernel-mode FPU sections fully preemptible. This would require
>>>>   growing task_struct by another struct fpstate which is more than 2K.
>>>
>>> So that's something that will probably happen once the kernel is built
>>> using APX anyway?
>>
>> I was expecting that building the kernel with APX would be very
>> different than a kernel_fpu_begin(). We don't just need *one* more
>> save area for APX registers: we need a stack, just like normal GPRs.
>
> Yes - but my point is: with any APX build we'd probably be saving
> FPU(-ish) registers at entry points, into a separate context area. If
> that includes FPU registers then we'd not have to do
> kernel_fpu_begin()/end().

That's true. But wouldn't it be a bit silly to include _all_ FPU
registers? If the kernel isn't using AVX512, why bother saving and
restoring AVX512?

> In other words, we'd be doing something close to 'growing task_struct
> by another struct fpstate', or so - regardless of whether it's in
> task_struct or some sort of extended pt_regs. The kernel would also be
> close to 'FPU-safe', i.e. there likely wouldn't be a need for
> kernel_fpu_begin()/end().

The new APX registers are 128 bytes, total. 'struct fpstate' is ~3k on
most CPUs these days and >11k with AMX.

I was thinking that growing things (say pt_regs) by 128b would be
acceptable, given some nice performance gains from AMX. Growing by 3k
would cause some real headaches. Growing by 11k would be a non-starter.

I'm pretty sure the torches and pitchforks would come out if our
syscall latency included another 1k of save/restore much less 3k or
11k.
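For concreteness, the 128 bytes Dave is referring to are the 16 new APX
GPRs (r16-r31) at 8 bytes each. A purely illustrative declaration, not
taken from any posted patch:

/* Illustration only: the APX extended GPRs, r16-r31. */
struct apx_extended_regs {
	unsigned long r16, r17, r18, r19, r20, r21, r22, r23;
	unsigned long r24, r25, r26, r27, r28, r29, r30, r31;
};
/* sizeof(struct apx_extended_regs) == 16 * 8 == 128 bytes on x86-64 */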
* Eric Biggers <ebiggers@kernel.org> wrote:

> [...] To avoid irqs_disabled() entirely, we'd need to avoid disabling
> softirqs, which would mean supporting nested kernel-mode FPU in
> softirqs. I can send out a patch that does that using a per-CPU
> buffer, if you'd like to see that. I wasn't super happy with the
> extra edge cases and memory usage, but we could go in that direction.

Meh: so I just checked, and local_bh_disable()/enable() are pretty
heavy these days - it's not just a simple preempt-count twiddle and a
check anymore. :-/ I don't think my initial argument of irqs_disabled()
overhead is really valid - and if we really cared we could halve it by
saving the irqs_disabled() status at kernel_fpu_begin() time and
reading it at kernel_fpu_end() time.

And the alternative of having nested FPU usage and extra per-CPU FPU
save areas for the kernel feels a bit fragile, even without having seen
the patch.

So I think I'll commit your patch to tip:x86/fpu as-is, unless someone
objects.

BTW., a side note, I was also reviewing the kernel_fpu_begin()/end()
codepaths, and we have gems like:

	/* Put sane initial values into the control registers. */
	if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM))
		ldmxcsr(MXCSR_DEFAULT);

	if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU))
		asm volatile ("fninit");

has the LDMXCSR instruction, or its effects, ever shown up in profiles?
Because AFAICS these will execute all the time on x86-64, because:

	static inline void kernel_fpu_begin(void)
	{
	#ifdef CONFIG_X86_64
		/*
		 * Any 64-bit code that uses 387 instructions must explicitly request
		 * KFPU_387.
		 */
		kernel_fpu_begin_mask(KFPU_MXCSR);

And X86_FEATURE_XMM is set in pretty much every x86 CPU.

Thanks,

	Ingo
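A rough sketch of the "save irqs_disabled() once" idea floated above,
purely illustrative and not a posted patch. It relies on kernel-mode
FPU sections not nesting and not migrating between CPUs; the per-CPU
variable name is hypothetical.

/* Illustrative only: cache the irqs_disabled() result across a section. */
static DEFINE_PER_CPU(bool, kfpu_irqs_were_off);

void kernel_fpu_begin_mask(unsigned int kfpu_mask)
{
	bool irqs_off = irqs_disabled();

	if (!irqs_off)
		fpregs_lock();
	/* Either hardirqs are off or fpregs_lock() pinned us: no migration. */
	__this_cpu_write(kfpu_irqs_were_off, irqs_off);

	/* ... rest as in the v2 patch at the bottom of this page ... */
}

void kernel_fpu_end(void)
{
	/* ... rest as in the v2 patch at the bottom of this page ... */

	if (!__this_cpu_read(kfpu_irqs_were_off))
		fpregs_unlock();
}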
* Dave Hansen <dave.hansen@intel.com> wrote:

> On 3/5/25 09:37, Ingo Molnar wrote:
>>
>> * Dave Hansen <dave.hansen@intel.com> wrote:
>>
>>> On 3/5/25 01:07, Ingo Molnar wrote:
>>>>> Alternatives considered:
>>>>>
>>>>> - Make kernel-mode FPU sections fully preemptible. This would require
>>>>>   growing task_struct by another struct fpstate which is more than 2K.
>>>>
>>>> So that's something that will probably happen once the kernel is built
>>>> using APX anyway?
>>>
>>> I was expecting that building the kernel with APX would be very
>>> different than a kernel_fpu_begin(). We don't just need *one* more
>>> save area for APX registers: we need a stack, just like normal GPRs.
>>
>> Yes - but my point is: with any APX build we'd probably be saving
>> FPU(-ish) registers at entry points, into a separate context area. If
>> that includes FPU registers then we'd not have to do
>> kernel_fpu_begin()/end().
>
> That's true. But wouldn't it be a bit silly to include _all_ FPU
> registers? If the kernel isn't using AVX512, why bother saving and
> restoring AVX512?

Fair enough - although I bet the execution time difference between a
partial and a full FPU context save isn't as large as the buffer size
would suggest... There's a lot of setup cost in XSAVE* instructions
last I checked.

Thanks,

	Ingo
On Wed, Mar 05, 2025 at 07:09:15PM +0100, Ingo Molnar wrote: > > * Eric Biggers <ebiggers@kernel.org> wrote: > > > [...] To avoid irqs_disabled() entirely, we'd need to avoid disabling > > softirqs, which would mean supporting nested kernel-mode FPU in > > softirqs. I can sent out a patch that does that using a per-CPU > > buffer, if you'd like to see that. I wasn't super happy with the > > extra edge cases and memory usage, but we could go in that direction. > > Meh: so I just checked, and local_bh_disable()/enable() are pretty > heavy these days - it's not just a simple preempt-count twiddle and a > check anymore. :-/ I don't think my initial argument of irqs_disabled() > overhead is really valid - and if we really cared we could halve it by > saving the irqs_disabled() status at kernel_fpu_begin() time and > reading it at kernel_fpu_end() time. > > And the alternative of having nested FPU usage and extra per-CPU FPU > save areas for the kernel feels a bit fragile, even without having seen > the patch. > > So I think I'll commit your patch to tip:x86/fpu as-is, unless someone > objects. > > > BTW., a side note, I was also reviewing the kernel_fpu_begin()/end() > codepaths, and we have gems like: > > /* Put sane initial values into the control registers. */ > if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM)) > ldmxcsr(MXCSR_DEFAULT); > > if (unlikely(kfpu_mask & KFPU_387) && boot_cpu_has(X86_FEATURE_FPU)) > asm volatile ("fninit"); > > has the LDMXCSR instruction, or its effects, ever shown up in profiles? > > Because AFAICS these will execute all the time on x86-64, because: > > static inline void kernel_fpu_begin(void) > { > #ifdef CONFIG_X86_64 > /* > * Any 64-bit code that uses 387 instructions must explicitly request > * KFPU_387. > */ > kernel_fpu_begin_mask(KFPU_MXCSR); > > And X86_FEATURE_XMM is set in pretty much every x86 CPU. I did some benchmarks with AES-XTS encryption of 16-byte messages (which is unrealistically small, but this makes it easier to see the overhead of kernel-mode FPU...). The baseline was 384 MB/s. Removing the use of crypto/simd.c, which this work makes possible, increases it to 487 MB/s. v1 of this patch decreases it to 479 MB/s, and v2 (which added irqs_disabled() checks to kernel_fpu_begin() and kernel_fpu_end()) decreases it to 461 MB/s. An experimental patch that I have that supports nested kernel-mode FPU in softirqs by using per-CPU areas maintains 480 MB/s. CPU was AMD Ryzen 9 9950X (Zen 5). No debugging options were enabled. Deleting the ldmxcsr(MXCSR_DEFAULT) adds about 14 MB/s. But given the large improvement from no longer using crypto/simd.c, the other overheads seem much smaller and maybe are not worth worrying too much about. In case you're interested, the following is my experimental patch for supporting nested use in softirqs: diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index f86ad3335529..70729d2bd64f 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -16,11 +16,11 @@ /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It * disables preemption so be careful if you intend to use it for long periods * of time. - * If you intend to use the FPU in irq/softirq you need to check first with + * If you intend to use the FPU in hardirq you need to check first with * irq_fpu_usable() if it is possible. 
*/ /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */ #define KFPU_387 _BITUL(0) /* 387 state will be initialized */ diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1209c7aebb21..a524260a0fa6 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -42,11 +42,13 @@ struct fpu_state_config fpu_user_cfg __ro_after_init; * depending on the FPU hardware format: */ struct fpstate init_fpstate __ro_after_init; /* Track in-kernel FPU usage */ -static DEFINE_PER_CPU(bool, in_kernel_fpu); +static DEFINE_PER_CPU(unsigned int, kernel_fpu_depth); + +DEFINE_PER_CPU(struct fpstate *, saved_kernel_fpstate); /* * Track which context is using the FPU on the CPU: */ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); @@ -58,14 +60,10 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); bool irq_fpu_usable(void) { if (WARN_ON_ONCE(in_nmi())) return false; - /* In kernel FPU usage already active? */ - if (this_cpu_read(in_kernel_fpu)) - return false; - /* * When not in NMI or hard interrupt context, FPU can be used in: * * - Task context except from within fpregs_lock()'ed critical * regions. @@ -75,13 +73,14 @@ bool irq_fpu_usable(void) */ if (!in_hardirq()) return true; /* - * In hard interrupt context it's safe when soft interrupts - * are enabled, which means the interrupt did not hit in - * a fpregs_lock()'ed critical region. + * In hard interrupt context it's safe when soft interrupts are enabled, + * which means the interrupt did not hit in a fpregs_lock()'ed critical + * region, nor did it hit while serving a softirq (which could have + * already been using nested kernel-mode FPU). */ return !softirq_count(); } EXPORT_SYMBOL(irq_fpu_usable); @@ -96,10 +95,34 @@ static void update_avx_timestamp(struct fpu *fpu) if (fpu->fpstate->regs.xsave.header.xfeatures & AVX512_TRACKING_MASK) fpu->avx512_timestamp = jiffies; } +static __always_inline void +__save_fpregs_to_fpstate(struct fpstate *fpstate, + struct fpu *fpu, bool have_fpu) +{ + if (likely(use_xsave())) { + os_xsave(fpstate); + if (have_fpu) + update_avx_timestamp(fpu); + return; + } + + if (likely(use_fxsr())) { + fxsave(&fpstate->regs.fxsave); + return; + } + + /* + * Legacy FPU register saving, FNSAVE always clears FPU registers, + * so we have to reload them from the memory state. + */ + asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpstate->regs.fsave)); + frstor(&fpstate->regs.fsave); +} + /* * Save the FPU register state in fpu->fpstate->regs. The register state is * preserved. * * Must be called with fpregs_lock() held. @@ -112,27 +135,11 @@ static void update_avx_timestamp(struct fpu *fpu) * * FXSAVE and all XSAVE variants preserve the FPU register state. */ void save_fpregs_to_fpstate(struct fpu *fpu) { - if (likely(use_xsave())) { - os_xsave(fpu->fpstate); - update_avx_timestamp(fpu); - return; - } - - if (likely(use_fxsr())) { - fxsave(&fpu->fpstate->regs.fxsave); - return; - } - - /* - * Legacy FPU register saving, FNSAVE always clears FPU registers, - * so we have to reload them from the memory state. 
- */ - asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->fpstate->regs.fsave)); - frstor(&fpu->fpstate->regs.fsave); + __save_fpregs_to_fpstate(fpu->fpstate, fpu, true); } void restore_fpregs_from_fpstate(struct fpstate *fpstate, u64 mask) { /* @@ -418,23 +425,31 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate); #endif /* CONFIG_KVM */ void kernel_fpu_begin_mask(unsigned int kfpu_mask) { + unsigned int prev_depth; + preempt_disable(); WARN_ON_FPU(!irq_fpu_usable()); - WARN_ON_FPU(this_cpu_read(in_kernel_fpu)); - - this_cpu_write(in_kernel_fpu, true); - - if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) && - !test_thread_flag(TIF_NEED_FPU_LOAD)) { - set_thread_flag(TIF_NEED_FPU_LOAD); - save_fpregs_to_fpstate(¤t->thread.fpu); + prev_depth = __this_cpu_read(kernel_fpu_depth); + __this_cpu_write(kernel_fpu_depth, prev_depth + 1); + + if (prev_depth != 0) { + WARN_ON_FPU(in_task()); + WARN_ON_FPU(prev_depth != 1); + __save_fpregs_to_fpstate(__this_cpu_read(saved_kernel_fpstate), + NULL, false); + } else { + if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) && + !test_thread_flag(TIF_NEED_FPU_LOAD)) { + set_thread_flag(TIF_NEED_FPU_LOAD); + save_fpregs_to_fpstate(¤t->thread.fpu); + } + __cpu_invalidate_fpregs_state(); } - __cpu_invalidate_fpregs_state(); /* Put sane initial values into the control registers. */ if (likely(kfpu_mask & KFPU_MXCSR) && boot_cpu_has(X86_FEATURE_XMM)) ldmxcsr(MXCSR_DEFAULT); @@ -443,13 +458,18 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask) } EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask); void kernel_fpu_end(void) { - WARN_ON_FPU(!this_cpu_read(in_kernel_fpu)); + unsigned int depth = __this_cpu_read(kernel_fpu_depth); - this_cpu_write(in_kernel_fpu, false); + if (depth > 1) + restore_fpregs_from_fpstate(__this_cpu_read(saved_kernel_fpstate), + XFEATURE_MASK_FPSTATE); + else if (WARN_ON_ONCE(depth == 0)) + depth = 1; + __this_cpu_write(kernel_fpu_depth, depth - 1); preempt_enable(); } EXPORT_SYMBOL_GPL(kernel_fpu_end); /* @@ -468,19 +488,10 @@ void fpu_sync_fpstate(struct fpu *fpu) trace_x86_fpu_after_save(fpu); fpregs_unlock(); } -static inline unsigned int init_fpstate_copy_size(void) -{ - if (!use_xsave()) - return fpu_kernel_cfg.default_size; - - /* XSAVE(S) just needs the legacy and the xstate header part */ - return sizeof(init_fpstate.regs.xsave); -} - static inline void fpstate_init_fxstate(struct fpstate *fpstate) { fpstate->regs.fxsave.cwd = 0x37f; fpstate->regs.fxsave.mxcsr = MXCSR_DEFAULT; } diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c index 998a08f17e33..549852552f51 100644 --- a/arch/x86/kernel/fpu/init.c +++ b/arch/x86/kernel/fpu/init.c @@ -7,10 +7,11 @@ #include <asm/setup.h> #include <linux/sched.h> #include <linux/sched/task.h> #include <linux/init.h> +#include <linux/slab.h> #include "internal.h" #include "legacy.h" #include "xstate.h" @@ -205,10 +206,35 @@ static void __init fpu__init_system_xstate_size_legacy(void) fpu_user_cfg.max_size = size; fpu_user_cfg.default_size = size; fpstate_reset(¤t->thread.fpu); } +/* + * Allocate per-CPU areas for saving the kernel-mode FPU registers. 
This is + * needed to reliably support use of kernel-mode FPU in softirqs: + */ +static void __init fpu__init_saved_kernel_fpstates(void) +{ + size_t size; + void *p; + struct fpstate *fpstate; + int cpu; + + size = arch_task_struct_size; + size -= offsetof(struct task_struct, thread.fpu.__fpstate); + size += (__alignof__(struct fpstate) - 1) & ~(ARCH_KMALLOC_MINALIGN - 1); + for_each_possible_cpu(cpu) { + p = kmalloc(size, GFP_KERNEL); + if (!p) + panic("Out of memory"); + fpstate = PTR_ALIGN(p, __alignof__(struct fpstate)); + memcpy(&fpstate->regs, &init_fpstate.regs, + init_fpstate_copy_size()); + per_cpu(saved_kernel_fpstate, cpu) = fpstate; + } +} + /* * Called on the boot CPU once per system bootup, to set up the initial * FPU state that is later cloned into all processes: */ void __init fpu__init_system(void) @@ -224,6 +250,7 @@ void __init fpu__init_system(void) fpu__init_system_generic(); fpu__init_system_xstate_size_legacy(); fpu__init_system_xstate(fpu_kernel_cfg.max_size); fpu__init_task_struct_size(); + fpu__init_saved_kernel_fpstates(); } diff --git a/arch/x86/kernel/fpu/internal.h b/arch/x86/kernel/fpu/internal.h index dbdb31f55fc7..24f5f6d238b9 100644 --- a/arch/x86/kernel/fpu/internal.h +++ b/arch/x86/kernel/fpu/internal.h @@ -23,6 +23,17 @@ static __always_inline __pure bool use_fxsr(void) /* Used in init.c */ extern void fpstate_init_user(struct fpstate *fpstate); extern void fpstate_reset(struct fpu *fpu); +DECLARE_PER_CPU(struct fpstate *, saved_kernel_fpstate); + +static inline unsigned int init_fpstate_copy_size(void) +{ + if (!use_xsave()) + return fpu_kernel_cfg.default_size; + + /* XSAVE(S) just needs the legacy and the xstate header part */ + return sizeof(init_fpstate.regs.xsave); +} + #endif
On Wed, 5 Mar 2025 18:37:25 +0100 Ingo Molnar <mingo@kernel.org> wrote:

> * Dave Hansen <dave.hansen@intel.com> wrote:
>
>> On 3/5/25 01:07, Ingo Molnar wrote:
>>>> Alternatives considered:
>>>>
>>>> - Make kernel-mode FPU sections fully preemptible. This would require
>>>>   growing task_struct by another struct fpstate which is more than 2K.
>>>
>>> So that's something that will probably happen once the kernel is built
>>> using APX anyway?
>>
>> I was expecting that building the kernel with APX would be very
>> different than a kernel_fpu_begin(). We don't just need *one* more
>> save area for APX registers: we need a stack, just like normal GPRs.
>
> Yes - but my point is: with any APX build we'd probably be saving
> FPU(-ish) registers at entry points, into a separate context area. If
> that includes FPU registers then we'd not have to do
> kernel_fpu_begin()/end().

Since the registers are caller-saved (like the SSE onwards ones) none
of them really need to be saved on syscall entry (just zeroed on
return). They do need saving on interrupt entry.

For some unknown reason the kernel saves the xyzmm ones on syscall
entry. For normal programs they won't be live - because the asm syscall
wrapper is called from C. So I think they can only be live if a system
call is directly inlined into the C function. Just marking them all
'clobbered' would have done. But it's now all too late to change.

	David

> In other words, we'd be doing something close to 'growing task_struct
> by another struct fpstate', or so - regardless of whether it's in
> task_struct or some sort of extended pt_regs. The kernel would also be
> close to 'FPU-safe', i.e. there likely wouldn't be a need for
> kernel_fpu_begin()/end().
>
> Thanks,
>
> 	Ingo
* Eric Biggers <ebiggers@kernel.org> wrote:
> Deleting the ldmxcsr(MXCSR_DEFAULT) adds about 14 MB/s.
BTW., that's still a +3% performance improvement, so it is worth
considering in addition to the current patch IMO.
Thanks,
Ingo
On Wed, Mar 05, 2025 at 10:07:45AM +0100, Ingo Molnar wrote:
> > From: Eric Biggers <ebiggers@google.com>
> >
> > Alternatives considered:
> >
> > - Make kernel-mode FPU sections fully preemptible. This would require
> >   growing task_struct by another struct fpstate which is more than 2K.
>
> So that's something that will probably happen once the kernel is built
> using APX anyway?

ISTR looking into this at some point for RT. I also have vague memories
of other architectures doing something similar, but it's all a long
time ago.
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index f86ad3335529d..f42de5f05e7eb 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -14,14 +14,13 @@ #include <asm/fpu/types.h> /* * Use kernel_fpu_begin/end() if you intend to use FPU in kernel context. It - * disables preemption so be careful if you intend to use it for long periods - * of time. - * If you intend to use the FPU in irq/softirq you need to check first with - * irq_fpu_usable() if it is possible. + * disables preemption and softirq processing, so be careful if you intend to + * use it for long periods of time. Kernel-mode FPU cannot be used in all + * contexts -- see irq_fpu_usable() for details. */ /* Kernel FPU states to initialize in kernel_fpu_begin_mask() */ #define KFPU_387 _BITUL(0) /* 387 state will be initialized */ #define KFPU_MXCSR _BITUL(1) /* MXCSR will be initialized */ @@ -48,25 +47,23 @@ static inline void kernel_fpu_begin(void) kernel_fpu_begin_mask(KFPU_387 | KFPU_MXCSR); #endif } /* - * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate. - * A context switch will (and softirq might) save CPU's FPU registers to - * fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving CPU's FPU registers in - * a random state. + * Use fpregs_lock() while editing CPU's FPU registers or fpu->fpstate, or while + * using the FPU in kernel mode. A context switch will (and softirq might) save + * CPU's FPU registers to fpu->fpstate.regs and set TIF_NEED_FPU_LOAD leaving + * CPU's FPU registers in a random state. * * local_bh_disable() protects against both preemption and soft interrupts * on !RT kernels. * * On RT kernels local_bh_disable() is not sufficient because it only * serializes soft interrupt related sections via a local lock, but stays * preemptible. Disabling preemption is the right choice here as bottom * half processing is always in thread context on RT kernels so it * implicitly prevents bottom half processing as well. - * - * Disabling preemption also serializes against kernel_fpu_begin(). */ static inline void fpregs_lock(void) { if (!IS_ENABLED(CONFIG_PREEMPT_RT)) local_bh_disable(); diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index 1209c7aebb211..f937329d1aa9c 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -58,13 +58,20 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx); bool irq_fpu_usable(void) { if (WARN_ON_ONCE(in_nmi())) return false; - /* In kernel FPU usage already active? */ - if (this_cpu_read(in_kernel_fpu)) + /* + * In kernel FPU usage already active? This detects any explicitly + * nested usage in task or softirq context, which is unsupported. It + * also detects attempted usage in a hardirq that has interrupted a + * kernel-mode FPU section. 
+ */ + if (this_cpu_read(in_kernel_fpu)) { + WARN_ON_FPU(!in_hardirq()); return false; + } /* * When not in NMI or hard interrupt context, FPU can be used in: * * - Task context except from within fpregs_lock()'ed critical @@ -418,11 +425,12 @@ int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, EXPORT_SYMBOL_GPL(fpu_copy_uabi_to_guest_fpstate); #endif /* CONFIG_KVM */ void kernel_fpu_begin_mask(unsigned int kfpu_mask) { - preempt_disable(); + if (!irqs_disabled()) + fpregs_lock(); WARN_ON_FPU(!irq_fpu_usable()); WARN_ON_FPU(this_cpu_read(in_kernel_fpu)); this_cpu_write(in_kernel_fpu, true); @@ -446,11 +454,12 @@ EXPORT_SYMBOL_GPL(kernel_fpu_begin_mask); void kernel_fpu_end(void) { WARN_ON_FPU(!this_cpu_read(in_kernel_fpu)); this_cpu_write(in_kernel_fpu, false); - preempt_enable(); + if (!irqs_disabled()) + fpregs_unlock(); } EXPORT_SYMBOL_GPL(kernel_fpu_end); /* * Sync the FPU register state to current's memory register state when the
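For callers, the upshot of the change above is that code running in
task or softirq context can use kernel-mode FPU unconditionally; only
hardirq (and NMI) context still needs the irq_fpu_usable() check. A
minimal sketch of the resulting pattern; my_simd_update() is a
placeholder, not a real kernel symbol.

#include <linux/types.h>
#include <asm/fpu/api.h>	/* kernel_fpu_begin(), kernel_fpu_end() */

/* Placeholder helper, for illustration only. */
void my_simd_update(u8 *dst, const u8 *src, unsigned int len);

/* Safe in both task and softirq context once this patch is applied. */
static void my_update(u8 *dst, const u8 *src, unsigned int len)
{
	kernel_fpu_begin();
	my_simd_update(dst, src, len);
	kernel_fpu_end();
}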