Message ID | 20250430110734.392235199@infradead.org
---|---
Series | objtool: Detect and warn about indirect calls in __nocfi functions
On April 30, 2025 4:07:34 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote: >Hi! > >On kCFI (CONFIG_CFI_CLANG=y) builds all indirect calls should have the CFI >check on (with very few exceptions). Not having the CFI checks undermines the >protection provided by CFI and will make these sites candidates for people >wanting to steal your cookies. > >Specifically the ABI changes are so that doing indirect calls without the CFI >magic, to a CFI adorned function is not compatible (although it happens to work >for some setups, it very much does not for FineIBT). > >Rust people tripped over this the other day, since their 'core' happened to >have some no_sanitize(kcfi) bits in, which promptly exploded when run with >FineIBT on. > >Since this is very much not a supported model -- on purpose, have objtool >detect and warn about such constructs. > >This effort [1] found all existing [2] non-cfi indirect calls in the kernel. > >Notably the KVM fastop emulation stuff -- which I've completely rewritten for >this version -- the generated code doesn't look horrific, but is slightly more >verbose. I'm running on the assumption that instruction emulation is not super >performance critical these days of zero VM-exit VMs etc. > >KVM has another; the VMX interrupt injection stuff calls the IDT handler >directly. Is there an alternative? Can we keep a table of Linux functions >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > >HyperV hypercall page stuff, which I've previously suggested use direct calls, >and which I've now converted (after getting properly annoyed with that code). > >Also available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/core > >Changes since v1: > > - complete rewrite of the fastop stuff > - HyperV tweaks (Michael) > - objtool changes (Josh) > > >[1] https://lkml.kernel.org/r/20250410154556.GB9003@noisy.programming.kicks-ass.net >[2] https://lkml.kernel.org/r/20250410194334.GA3248459@google.com > We do have a table of handlers higher up in the stack in the form of the dispatch tables for FRED. They don't in general even need the assembly entry stubs, either.
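For reference, the incompatibility described in the cover letter comes from the call-site instrumentation: under kCFI the compiler verifies a type hash before every indirect call, and FineIBT moves that check into the callee's preamble, where it expects the caller to have supplied the hash. A rough C-level sketch -- the hash value, its placement, and the register usage here are illustrative, not the exact ABI:

    /* Illustrative sketch only -- not the real kCFI/FineIBT ABI. */
    typedef void (*handler_t)(void);

    void dispatch(handler_t fn)
    {
            /*
             * With CONFIG_CFI_CLANG, 'fn()' compiles to roughly:
             *
             *   movl  $0x12345678, %r10d  # expected hash of 'void (*)(void)'
             *   cmpl  %r10d, -4(%rax)     # hash emitted just before the target
             *   jne   .Lcfi_fail          # mismatch: CFI violation
             *   call  *%rax
             *
             * A __nocfi / no_sanitize(kcfi) caller emits a bare 'call *%rax'.
             * That happens to work against plain kCFI callees, but FineIBT
             * patches the check into the callee's preamble, which expects the
             * caller to have loaded the hash -- so the bare indirect call
             * faults, which is what bit the Rust 'core' code mentioned above.
             */
            fn();
    }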
On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > >directly. Is there an alternative? Can we keep a table of Linux functions > >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > We do have a table of handlers higher up in the stack in the form of > the dispatch tables for FRED. They don't in general even need the > assembly entry stubs, either. Oh, right. I'll go have a look at those.
From: Peter Zijlstra <peterz@infradead.org> Sent: Wednesday, April 30, 2025 4:08 AM > > What used to be a simple few instructions has turned into a giant mess > (for x86_64). Not only does it use static_branch wrong, it mixes it > with dynamic branches for no apparent reason. > > Notably it uses static_branch through an out-of-line function call, > which completely defeats the purpose, since instead of a simple > JMP/NOP site, you get a CALL+RET+TEST+Jcc sequence in return, which is > absolutely idiotic. > > Add to that a dynamic test of hyperv_paravisor_present, something > which is set once and never changed. > > Replace all this idiocy with a single direct function call to the > right hypercall variant. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> I've done these tests on Hyper-V VMs with this patch series. My focus is the Hyper-V changes in Patches 11 and 12, not the other changes. * Normal VM boot and basic smoke test * TDX and SEV-SNP VMs boot and basic smoke test. These VMs have a paravisor * Normal VM taking a panic and running the kdump kernel * Normal VM suspending for hibernation, then resuming from hibernation * Verified that IBT is enabled in a normal VM. It's not offered in a TDX VM on Hyper-V when a paravisor is used. I don't know about the case without a paravisor. * Building a 64-bit kernel with and without CONFIG_AMD_MEM_ENCRYPT and CONFIG_INTEL_TDX_GUEST. * Building a 32-bit kernel (but I did not try to run it) TDX and SEV-SNP VMs without a paravisor are not tested, so updating the static call, and the new direct call path, has not been tested for TDX and SNP hypercalls. I don't have a hardware environment where I can test without a paravisor. Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> > --- > arch/x86/hyperv/hv_init.c | 20 +++++ > arch/x86/hyperv/ivm.c | 15 ++++ > arch/x86/include/asm/mshyperv.h | 137 +++++++++++----------------------------- > arch/x86/kernel/cpu/mshyperv.c | 19 +++-- > 4 files changed, 89 insertions(+), 102 deletions(-) > > --- a/arch/x86/hyperv/hv_init.c > +++ b/arch/x86/hyperv/hv_init.c > @@ -35,7 +35,27 @@ > #include <linux/highmem.h> > > void *hv_hypercall_pg; > + > +#ifdef CONFIG_X86_64 > +u64 hv_std_hypercall(u64 control, u64 param1, u64 param2) > +{ > + u64 hv_status; > + > + if (!hv_hypercall_pg) > + return U64_MAX; > + > + register u64 __r8 asm("r8") = param2; > + asm volatile (CALL_NOSPEC > + : "=a" (hv_status), ASM_CALL_CONSTRAINT, > + "+c" (control), "+d" (param1), "+r" (__r8) > + : THUNK_TARGET(hv_hypercall_pg) > + : "cc", "memory", "r9", "r10", "r11"); > + > + return hv_status; > +} > +#else > EXPORT_SYMBOL_GPL(hv_hypercall_pg); > +#endif > > union hv_ghcb * __percpu *hv_ghcb_pg; > > --- a/arch/x86/hyperv/ivm.c > +++ b/arch/x86/hyperv/ivm.c > @@ -376,9 +376,23 @@ int hv_snp_boot_ap(u32 cpu, unsigned lon > return ret; > } > > +u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2) > +{ > + u64 hv_status; > + > + register u64 __r8 asm("r8") = param2; > + asm volatile("vmmcall" > + : "=a" (hv_status), ASM_CALL_CONSTRAINT, > + "+c" (control), "+d" (param1), "+r" (__r8) > + : : "cc", "memory", "r9", "r10", "r11"); > + > + return hv_status; > +} > + > #else > static inline void hv_ghcb_msr_write(u64 msr, u64 value) {} > static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {} > +u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2) { return U64_MAX; } > #endif /* CONFIG_AMD_MEM_ENCRYPT */ > > #ifdef CONFIG_INTEL_TDX_GUEST > @@ -428,6 +442,7 @@ u64 
hv_tdx_hypercall(u64 control, u64 pa > #else > static inline void hv_tdx_msr_write(u64 msr, u64 value) {} > static inline void hv_tdx_msr_read(u64 msr, u64 *value) {} > +u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2) { return U64_MAX; } > #endif /* CONFIG_INTEL_TDX_GUEST */ > > #if defined(CONFIG_AMD_MEM_ENCRYPT) || defined(CONFIG_INTEL_TDX_GUEST) > --- a/arch/x86/include/asm/mshyperv.h > +++ b/arch/x86/include/asm/mshyperv.h > @@ -6,6 +6,7 @@ > #include <linux/nmi.h> > #include <linux/msi.h> > #include <linux/io.h> > +#include <linux/static_call.h> > #include <asm/nospec-branch.h> > #include <asm/paravirt.h> > #include <hyperv/hvhdk.h> > @@ -38,16 +39,21 @@ static inline unsigned char hv_get_nmi_r > return 0; > } > > -#if IS_ENABLED(CONFIG_HYPERV) > -extern bool hyperv_paravisor_present; > +extern u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2); > +extern u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2); > +extern u64 hv_std_hypercall(u64 control, u64 param1, u64 param2); > > +#if IS_ENABLED(CONFIG_HYPERV) > extern void *hv_hypercall_pg; > > extern union hv_ghcb * __percpu *hv_ghcb_pg; > > bool hv_isolation_type_snp(void); > bool hv_isolation_type_tdx(void); > -u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2); > + > +#ifdef CONFIG_X86_64 > +DECLARE_STATIC_CALL(hv_hypercall, hv_std_hypercall); > +#endif > > /* > * DEFAULT INIT GPAT and SEGMENT LIMIT value in struct VMSA > @@ -64,37 +70,15 @@ static inline u64 hv_do_hypercall(u64 co > { > u64 input_address = input ? virt_to_phys(input) : 0; > u64 output_address = output ? virt_to_phys(output) : 0; > - u64 hv_status; > > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input_address, output_address); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - __asm__ __volatile__("mov %[output_address], %%r8\n" > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input_address) > - : [output_address] "r" (output_address) > - : "cc", "memory", "r8", "r9", "r10", "r11"); > - return hv_status; > - } > - > - if (!hv_hypercall_pg) > - return U64_MAX; > - > - __asm__ __volatile__("mov %[output_address], %%r8\n" > - CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input_address) > - : [output_address] "r" (output_address), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "memory", "r8", "r9", "r10", "r11"); > + return static_call_mod(hv_hypercall)(control, input_address, output_address); > #else > u32 input_address_hi = upper_32_bits(input_address); > u32 input_address_lo = lower_32_bits(input_address); > u32 output_address_hi = upper_32_bits(output_address); > u32 output_address_lo = lower_32_bits(output_address); > + u64 hv_status; > > if (!hv_hypercall_pg) > return U64_MAX; > @@ -107,8 +91,8 @@ static inline u64 hv_do_hypercall(u64 co > "D"(output_address_hi), "S"(output_address_lo), > THUNK_TARGET(hv_hypercall_pg) > : "cc", "memory"); > -#endif /* !x86_64 */ > return hv_status; > +#endif /* !x86_64 */ > } > > /* Hypercall to the L0 hypervisor */ > @@ -120,41 +104,23 @@ static inline u64 hv_do_nested_hypercall > /* Fast hypercall with 8 bytes of input and no output */ > static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1) > { > - u64 hv_status; > - > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input1, 0); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - 
__asm__ __volatile__( > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - :: "cc", "r8", "r9", "r10", "r11"); > - } else { > - __asm__ __volatile__(CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : THUNK_TARGET(hv_hypercall_pg) > - : "cc", "r8", "r9", "r10", "r11"); > - } > + return static_call_mod(hv_hypercall)(control, input1, 0); > #else > - { > - u32 input1_hi = upper_32_bits(input1); > - u32 input1_lo = lower_32_bits(input1); > - > - __asm__ __volatile__ (CALL_NOSPEC > - : "=A"(hv_status), > - "+c"(input1_lo), > - ASM_CALL_CONSTRAINT > - : "A" (control), > - "b" (input1_hi), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "edi", "esi"); > - } > -#endif > + u32 input1_hi = upper_32_bits(input1); > + u32 input1_lo = lower_32_bits(input1); > + u64 hv_status; > + > + __asm__ __volatile__ (CALL_NOSPEC > + : "=A"(hv_status), > + "+c"(input1_lo), > + ASM_CALL_CONSTRAINT > + : "A" (control), > + "b" (input1_hi), > + THUNK_TARGET(hv_hypercall_pg) > + : "cc", "edi", "esi"); > return hv_status; > +#endif > } > > static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1) > @@ -174,45 +140,24 @@ static inline u64 hv_do_fast_nested_hype > /* Fast hypercall with 16 bytes of input */ > static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2) > { > - u64 hv_status; > - > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input1, input2); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - __asm__ __volatile__("mov %[input2], %%r8\n" > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : [input2] "r" (input2) > - : "cc", "r8", "r9", "r10", "r11"); > - } else { > - __asm__ __volatile__("mov %[input2], %%r8\n" > - CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : [input2] "r" (input2), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "r8", "r9", "r10", "r11"); > - } > + return static_call_mod(hv_hypercall)(control, input1, input2); > #else > - { > - u32 input1_hi = upper_32_bits(input1); > - u32 input1_lo = lower_32_bits(input1); > - u32 input2_hi = upper_32_bits(input2); > - u32 input2_lo = lower_32_bits(input2); > - > - __asm__ __volatile__ (CALL_NOSPEC > - : "=A"(hv_status), > - "+c"(input1_lo), ASM_CALL_CONSTRAINT > - : "A" (control), "b" (input1_hi), > - "D"(input2_hi), "S"(input2_lo), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc"); > - } > -#endif > + u32 input1_hi = upper_32_bits(input1); > + u32 input1_lo = lower_32_bits(input1); > + u32 input2_hi = upper_32_bits(input2); > + u32 input2_lo = lower_32_bits(input2); > + u64 hv_status; > + > + __asm__ __volatile__ (CALL_NOSPEC > + : "=A"(hv_status), > + "+c"(input1_lo), ASM_CALL_CONSTRAINT > + : "A" (control), "b" (input1_hi), > + "D"(input2_hi), "S"(input2_lo), > + THUNK_TARGET(hv_hypercall_pg) > + : "cc"); > return hv_status; > +#endif > } > > static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2) > --- a/arch/x86/kernel/cpu/mshyperv.c > +++ b/arch/x86/kernel/cpu/mshyperv.c > @@ -37,10 +37,6 @@ > bool hv_nested; > struct ms_hyperv_info ms_hyperv; > > -/* Used in modules via hv_do_hypercall(): see arch/x86/include/asm/mshyperv.h */ > -bool hyperv_paravisor_present __ro_after_init; > -EXPORT_SYMBOL_GPL(hyperv_paravisor_present); > - > #if IS_ENABLED(CONFIG_HYPERV) > static inline unsigned int hv_get_nested_msr(unsigned int reg) > { > @@ -287,8 
+283,18 @@ static void __init x86_setup_ops_for_tsc > old_restore_sched_clock_state = x86_platform.restore_sched_clock_state; > x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state; > } > + > +#ifdef CONFIG_X86_64 > +DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall); > +EXPORT_STATIC_CALL_TRAMP_GPL(hv_hypercall); > +#define hypercall_update(hc) static_call_update(hv_hypercall, hc) > +#endif > #endif /* CONFIG_HYPERV */ > > +#ifndef hypercall_update > +#define hypercall_update(hc) (void)hc > +#endif > + > static uint32_t __init ms_hyperv_platform(void) > { > u32 eax; > @@ -483,14 +489,14 @@ static void __init ms_hyperv_init_platfo > ms_hyperv.shared_gpa_boundary = > BIT_ULL(ms_hyperv.shared_gpa_boundary_bits); > > - hyperv_paravisor_present = !!ms_hyperv.paravisor_present; > - > pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n", > ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); > > > if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) { > static_branch_enable(&isolation_type_snp); > + if (!ms_hyperv.paravisor_present) > + hypercall_update(hv_snp_hypercall); > } else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX) { > static_branch_enable(&isolation_type_tdx); > > @@ -498,6 +504,7 @@ static void __init ms_hyperv_init_platfo > ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED; > > if (!ms_hyperv.paravisor_present) { > + hypercall_update(hv_tdx_hypercall); > /* > * Mark the Hyper-V TSC page feature as disabled > * in a TDX VM without paravisor so that the > >
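A condensed sketch of the static_call pattern the patch lands on may help readers unfamiliar with it. DEFINE_STATIC_CALL(), static_call() and static_call_update() are the real kernel APIs; the function bodies below are placeholders, and the patch itself uses the static_call_mod()/EXPORT_STATIC_CALL_TRAMP_GPL variants so that module callers go through the trampoline:

    #include <linux/init.h>
    #include <linux/types.h>
    #include <linux/static_call.h>

    static u64 hv_std_hypercall(u64 control, u64 p1, u64 p2) { return 0; /* placeholder */ }
    static u64 hv_snp_hypercall(u64 control, u64 p1, u64 p2) { return 0; /* placeholder */ }

    /* Every call site compiles to a direct CALL to the current target. */
    DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall);

    static u64 do_hypercall(u64 control, u64 p1, u64 p2)
    {
            /* No flag test, no indirect branch: just 'call hv_std_hypercall'. */
            return static_call(hv_hypercall)(control, p1, p2);
    }

    static void __init hv_select_hypercall(bool snp_without_paravisor)
    {
            /* Boot-time, one-shot: patch every call site to the chosen variant. */
            if (snp_without_paravisor)
                    static_call_update(hv_hypercall, hv_snp_hypercall);
    }

This is what replaces both the per-hypercall hyperv_paravisor_present test and the out-of-line static_branch call the commit message complains about.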
On Thu, May 01, 2025 at 02:36:26AM +0000, Michael Kelley wrote: > From: Peter Zijlstra <peterz@infradead.org> Sent: Wednesday, April 30, 2025 4:08 AM > > @@ -528,8 +546,8 @@ void __init hyperv_init(void) > > if (hv_isolation_type_tdx() && !ms_hyperv.paravisor_present) > > goto skip_hypercall_pg_init; > > > > - hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, > > - VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX, > > + hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, MODULES_VADDR, > > + MODULES_END, GFP_KERNEL, PAGE_KERNEL_ROX, > > Curiosity question (which I forgot ask about in v1): Is this change so that the > hypercall page kernel address is "close enough" for the direct call to work from > built-in code and from module code? Or is there some other reason? No, you nailed it. Because we want to do a direct CALL, the hypercall page must be in the disp32 range relative to the call site. The module address space ensures this. > > VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, > > __builtin_return_address(0)); > > if (hv_hypercall_pg == NULL) > > @@ -567,27 +585,9 @@ void __init hyperv_init(void) > > wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64); > > } > > > > -skip_hypercall_pg_init: > > - /* > > - * Some versions of Hyper-V that provide IBT in guest VMs have a bug > > - * in that there's no ENDBR64 instruction at the entry to the > > - * hypercall page. Because hypercalls are invoked via an indirect call > > - * to the hypercall page, all hypercall attempts fail when IBT is > > - * enabled, and Linux panics. For such buggy versions, disable IBT. > > - * > > - * Fixed versions of Hyper-V always provide ENDBR64 on the hypercall > > - * page, so if future Linux kernel versions enable IBT for 32-bit > > - * builds, additional hypercall page hackery will be required here > > - * to provide an ENDBR32. > > - */ > > -#ifdef CONFIG_X86_KERNEL_IBT > > - if (cpu_feature_enabled(X86_FEATURE_IBT) && > > - *(u32 *)hv_hypercall_pg != gen_endbr()) { > > - setup_clear_cpu_cap(X86_FEATURE_IBT); > > - pr_warn("Disabling IBT because of Hyper-V bug\n"); > > - } > > -#endif > > Nit: With this IBT code removed, the #include <asm/ibt.h> at the top > of this source code file should be removed. Indeed so. > > > + hv_set_hypercall_pg(hv_hypercall_pg); > > > > +skip_hypercall_pg_init: > > /* > > * hyperv_init() is called before LAPIC is initialized: see > > * apic_intr_mode_init() -> x86_platform.apic_post_init() and > > > > > > The nit notwithstanding, > > Reviewed-by: Michael Kelley <mhklinux@outlook.com> Thanks!
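To make the disp32 point concrete: a direct x86-64 CALL is opcode E8 followed by a signed 32-bit displacement, so the target must lie within +/- 2 GiB of the end of the instruction. A generic vmalloc address carries no such guarantee relative to kernel and module text; the MODULES_VADDR..MODULES_END range does. A hypothetical helper (not kernel code) expressing the constraint:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Can a 5-byte 'CALL rel32' at address 'site' reach 'target'?  The
     * displacement is relative to the end of the instruction and must
     * fit in a signed 32 bits.
     */
    static bool call_rel32_reachable(uint64_t site, uint64_t target)
    {
            int64_t disp = (int64_t)target - (int64_t)(site + 5);

            return disp == (int64_t)(int32_t)disp;
    }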
On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > >directly. Is there an alternative? Can we keep a table of Linux functions > > >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > We do have a table of handlers higher up in the stack in the form of > > the dispatch tables for FRED. They don't in general even need the > > assembly entry stubs, either. > > Oh, right. I'll go have a look at those. Right, so perhaps the easiest way around this is to setup the FRED entry tables unconditionally, have VMX mandate CONFIG_FRED and then have it always use the FRED entry points. Let me see how ugly that gets.
On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > > >directly. Is there an alternative? Can we keep a table of Linux functions > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > > > We do have a table of handlers higher up in the stack in the form of > > > the dispatch tables for FRED. They don't in general even need the > > > assembly entry stubs, either. > > > > Oh, right. I'll go have a look at those. > > Right, so perhaps the easiest way around this is to setup the FRED entry > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > always use the FRED entry points. > > Let me see how ugly that gets. Something like so... except this is broken. Its reporting spurious interrupts on vector 0x00, so something is buggered passing that vector along. I'll stare more at it more another time. --- diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c index f004a4dc74c2..d2322e3723f5 100644 --- a/arch/x86/entry/entry_fred.c +++ b/arch/x86/entry/entry_fred.c @@ -156,9 +156,8 @@ void __init fred_complete_exception_setup(void) fred_setup_done = true; } -static noinstr void fred_extint(struct pt_regs *regs) +noinstr void fred_extint(struct pt_regs *regs, unsigned int vector) { - unsigned int vector = regs->fred_ss.vector; unsigned int index = array_index_nospec(vector - FIRST_SYSTEM_VECTOR, NR_SYSTEM_VECTORS); @@ -230,7 +229,7 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs) switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: if (likely(regs->fred_ss.vector == X86_TRAP_NMI)) return fred_exc_nmi(regs); @@ -262,7 +261,7 @@ __visible noinstr void fred_entry_from_kernel(struct pt_regs *regs) switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: if (likely(regs->fred_ss.vector == X86_TRAP_NMI)) return fred_exc_nmi(regs); @@ -286,7 +285,7 @@ __visible noinstr void __fred_entry_from_kvm(struct pt_regs *regs) { switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: return fred_exc_nmi(regs); default: diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h index 2a29e5216881..e6ec5e1a0f54 100644 --- a/arch/x86/include/asm/fred.h +++ b/arch/x86/include/asm/fred.h @@ -66,6 +66,8 @@ void asm_fred_entrypoint_user(void); void asm_fred_entrypoint_kernel(void); void asm_fred_entry_from_kvm(struct fred_ss); +void fred_extint(struct pt_regs *regs, unsigned int vector); + __visible void fred_entry_from_user(struct pt_regs *regs); __visible void fred_entry_from_kernel(struct pt_regs *regs); __visible void __fred_entry_from_kvm(struct pt_regs *regs); diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a4ec27c67988..1de8c8b6d4c6 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -468,9 +468,9 @@ static inline void fred_install_sysvec(unsigned int vector, const idtentry_t fun #endif #define sysvec_install(vector, function) { \ - if (cpu_feature_enabled(X86_FEATURE_FRED)) \ + if (IS_ENABLED(CONFIG_X86_FRED)) \ 
fred_install_sysvec(vector, function); \ - else \ + if (!cpu_feature_enabled(X86_FEATURE_FRED)) \ idt_install_sysvec(vector, asm_##function); \ } @@ -647,6 +647,9 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC, xenpv_exc_machine_check); * to avoid more ifdeffery. */ DECLARE_IDTENTRY(X86_TRAP_NMI, exc_nmi_kvm_vmx); +#ifdef CONFIG_X86_FRED +DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_NMI, exc_fred_kvm_vmx); +#endif #endif DECLARE_IDTENTRY_NMI(X86_TRAP_NMI, exc_nmi); diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index f79c5edc0b89..654f8e835417 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -97,9 +97,9 @@ void __init native_init_IRQ(void) /* Execute any quirks before the call gates are initialised: */ x86_init.irqs.pre_vector_init(); - if (cpu_feature_enabled(X86_FEATURE_FRED)) + if (IS_ENABLED(CONFIG_X86_FRED)) fred_complete_exception_setup(); - else + if (!cpu_feature_enabled(X86_FEATURE_FRED)) idt_setup_apic_and_irq_gates(); lapic_assign_system_vectors(); diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 9a95d00f1423..4690787bdd13 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -613,8 +613,17 @@ DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx) { exc_nmi(regs); } +#ifdef CONFIG_X86_FRED +DEFINE_IDTENTRY_RAW_ERRORCODE(exc_fred_kvm_vmx) +{ + fred_extint(regs, error_code); +} +#endif #if IS_MODULE(CONFIG_KVM_INTEL) EXPORT_SYMBOL_GPL(asm_exc_nmi_kvm_vmx); +#ifdef CONFIG_X86_FRED +EXPORT_SYMBOL_GPL(asm_exc_fred_kvm_vmx); +#endif #endif #endif diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index fe8ea8c097de..9db38ae3450e 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM config KVM_INTEL tristate "KVM for Intel (and compatible) processors support" depends on KVM && IA32_FEAT_CTL + select X86_FRED if X86_64 help Provides support for KVM on processors equipped with Intel's VT extensions, a.k.a. Virtual Machine Extensions (VMX). diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 090393bf829d..3aaacbbedf5c 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -31,7 +31,7 @@ #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE #endif -.macro VMX_DO_EVENT_IRQOFF call_insn call_target +.macro VMX_DO_EVENT_IRQOFF call_insn call_target has_error=0 /* * Unconditionally create a stack frame, getting the correct RSP on the * stack (for x86-64) would take two instructions anyways, and RBP can @@ -40,6 +40,8 @@ push %_ASM_BP mov %_ASM_SP, %_ASM_BP + UNWIND_HINT_SAVE + #ifdef CONFIG_X86_64 /* * Align RSP to a 16-byte boundary (to emulate CPU behavior) before @@ -51,7 +53,17 @@ #endif pushf push $__KERNEL_CS + + .if \has_error + lea 1f(%rip), %rax + push %rax + UNWIND_HINT_IRET_REGS + push %_ASM_ARG1 + .endif + \call_insn \call_target +1: + UNWIND_HINT_RESTORE /* * "Restore" RSP from RBP, even though IRET has already unwound RSP to @@ -363,10 +375,9 @@ SYM_FUNC_END(vmread_error_trampoline) .section .text, "ax" SYM_FUNC_START(vmx_do_interrupt_irqoff) - /* - * Calling an IDT gate directly; annotate away the CFI concern for now. - * Should be fixed if possible. 
- */ - ANNOTATE_NOCFI_SYM +#ifdef CONFIG_X86_FRED + VMX_DO_EVENT_IRQOFF jmp asm_exc_fred_kvm_vmx has_error=1 +#else VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 +#endif SYM_FUNC_END(vmx_do_interrupt_irqoff) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5c5766467a61..7ccd2f66d55d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7034,6 +7034,7 @@ void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu) } void vmx_do_interrupt_irqoff(unsigned long entry); + void vmx_do_nmi_irqoff(void); static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu) @@ -7080,6 +7081,8 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); if (cpu_feature_enabled(X86_FEATURE_FRED)) fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); + else if (IS_ENABLED(CONFIG_FRED)) + vmx_do_interrupt_irqoff(vector); else vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); kvm_after_interrupt(vcpu);
On Thu, May 01, 2025, Peter Zijlstra wrote: > On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > > > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > > > >directly. Is there an alternative? Can we keep a table of Linux functions > > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > > > > > We do have a table of handlers higher up in the stack in the form of > > > > the dispatch tables for FRED. They don't in general even need the > > > > assembly entry stubs, either. > > > > > > Oh, right. I'll go have a look at those. > > > > Right, so perhaps the easiest way around this is to setup the FRED entry > > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > > always use the FRED entry points. > > > > Let me see how ugly that gets. > > Something like so... except this is broken. Its reporting spurious > interrupts on vector 0x00, so something is buggered passing that vector > along. Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be in place because they'll never be used. The only bits of code KVM needs is the __fred_entry_from_kvm() glue. Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. --
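Condensed, the forwarding path Sean is proposing looks roughly like this (names from the thread; simplified, not the verbatim KVM code):

    static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, u32 vector)
    {
            kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
            /*
             * Direct call into the FRED software entry path:
             *   fred_entry_from_kvm() -> asm_fred_entry_from_kvm (builds a
             *   FRED-style pt_regs frame) -> __fred_entry_from_kvm() ->
             *   fred_extint() -> sysvec_table[] dispatch.
             * No IDT lookup and no indirect CALL, so nothing for the new
             * objtool warning to flag.
             */
            fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
            kvm_after_interrupt(vcpu);
    }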
On Wed, Apr 30, 2025 at 1:26 PM Peter Zijlstra <peterz@infradead.org> wrote: > Notably the KVM fastop emulation stuff -- which I've completely rewritten for > this version -- the generated code doesn't look horrific, but is slightly more > verbose. I'm running on the assumption that instruction emulation is not super > performance critical these days of zero VM-exit VMs etc. It's definitely going to be slower, but I guess it's okay these days. It's really only somewhat hot with really old processors (pre-Westmere) and only when running big real mode code. Paolo > KVM has another; the VMX interrupt injection stuff calls the IDT handler > directly. Is there an alternative? Can we keep a table of Linux functions > slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > HyperV hypercall page stuff, which I've previously suggested use direct calls, > and which I've now converted (after getting properly annoyed with that code). > > Also available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/core > > Changes since v1: > > - complete rewrite of the fastop stuff > - HyperV tweaks (Michael) > - objtool changes (Josh) > > > [1] https://lkml.kernel.org/r/20250410154556.GB9003@noisy.programming.kicks-ass.net > [2] https://lkml.kernel.org/r/20250410194334.GA3248459@google.com >
On May 1, 2025 11:30:18 AM PDT, Sean Christopherson <seanjc@google.com> wrote: >On Thu, May 01, 2025, Peter Zijlstra wrote: >> On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: >> > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: >> > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: >> > > >> > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler >> > > > >directly. Is there an alternative? Can we keep a table of Linux functions >> > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? >> > > >> > > > We do have a table of handlers higher up in the stack in the form of >> > > > the dispatch tables for FRED. They don't in general even need the >> > > > assembly entry stubs, either. >> > > >> > > Oh, right. I'll go have a look at those. >> > >> > Right, so perhaps the easiest way around this is to setup the FRED entry >> > tables unconditionally, have VMX mandate CONFIG_FRED and then have it >> > always use the FRED entry points. >> > >> > Let me see how ugly that gets. >> >> Something like so... except this is broken. Its reporting spurious >> interrupts on vector 0x00, so something is buggered passing that vector >> along. > >Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never >uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be >in place because they'll never be used. The only bits of code KVM needs is the >__fred_entry_from_kvm() glue. > >Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > >-- >From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 >From: Sean Christopherson <seanjc@google.com> >Date: Thu, 1 May 2025 11:20:29 -0700 >Subject: [PATCH 1/2] x86/fred: Play nice with invoking > asm_fred_entry_from_kvm() on non-FRED hardware > >Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even >when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y >on non-FRED hardware. This will allow forcing KVM to always use the FRED >entry points for 64-bit kernels, which in turn will eliminate a rather >gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT >lookups. > >When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from >the stack frame prior to doing a "regular" RET back to KVM (in quotes >because of all the RET mitigation horrors). > >Signed-off-by: Sean Christopherson <seanjc@google.com> >--- > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > >diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S >index 29c5c32c16c3..7aff2f0a285f 100644 >--- a/arch/x86/entry/entry_64_fred.S >+++ b/arch/x86/entry/entry_64_fred.S >@@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS >- ERETS >+ >+ ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that >@@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE >- pop %rbp >+ leave > RET > > SYM_FUNC_END(asm_fred_entry_from_kvm) > >base-commit: 45eb29140e68ffe8e93a5471006858a018480a45 Ok maybe I'm being dense, but what is left other than simply calling __fred_entry_from_kvm() as a normal C function? 
I'm on the go so there might be something in the code I'm missing, but on the surface...?
On Thu, May 01, 2025, H. Peter Anvin wrote: > On May 1, 2025 11:30:18 AM PDT, Sean Christopherson <seanjc@google.com> wrote: > >On Thu, May 01, 2025, Peter Zijlstra wrote: > >> On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > >> > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > >> > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > >> > > > >> > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > >> > > > >directly. Is there an alternative? Can we keep a table of Linux functions > >> > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > >> > > > >> > > > We do have a table of handlers higher up in the stack in the form of > >> > > > the dispatch tables for FRED. They don't in general even need the > >> > > > assembly entry stubs, either. > >> > > > >> > > Oh, right. I'll go have a look at those. > >> > > >> > Right, so perhaps the easiest way around this is to setup the FRED entry > >> > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > >> > always use the FRED entry points. > >> > > >> > Let me see how ugly that gets. > >> > >> Something like so... except this is broken. Its reporting spurious > >> interrupts on vector 0x00, so something is buggered passing that vector > >> along. > > > >Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never > >uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > >in place because they'll never be used. The only bits of code KVM needs is the > >__fred_entry_from_kvm() glue. > > > >Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. Hrm, and now I see that fred_extint() relies on fred_install_sysvec(), which makes me quite curious as to why IRQs didn't go sideways. Oh, because sysvec_table[] is statically defined at compile time except for PV crud. So yeah, I think my patches are correct, they just need a small bit of prep work to support dynamic setup of sysvec_table. > >-- > >From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 > >From: Sean Christopherson <seanjc@google.com> > >Date: Thu, 1 May 2025 11:20:29 -0700 > >Subject: [PATCH 1/2] x86/fred: Play nice with invoking > > asm_fred_entry_from_kvm() on non-FRED hardware > > > >Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even > >when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y > >on non-FRED hardware. This will allow forcing KVM to always use the FRED > >entry points for 64-bit kernels, which in turn will eliminate a rather > >gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT > >lookups. > > > >When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from > >the stack frame prior to doing a "regular" RET back to KVM (in quotes > >because of all the RET mitigation horrors).
> > > >Signed-off-by: Sean Christopherson <seanjc@google.com> > >--- > > arch/x86/entry/entry_64_fred.S | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > >diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > >index 29c5c32c16c3..7aff2f0a285f 100644 > >--- a/arch/x86/entry/entry_64_fred.S > >+++ b/arch/x86/entry/entry_64_fred.S > >@@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > call __fred_entry_from_kvm /* Call the C entry point */ > > POP_REGS > >- ERETS > >+ > >+ ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > 1: > > /* > > * Objtool doesn't understand what ERETS does, this hint tells it that > >@@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > * isn't strictly needed, but it's the simplest form. > > */ > > UNWIND_HINT_RESTORE > >- pop %rbp > >+ leave > > RET > > > > SYM_FUNC_END(asm_fred_entry_from_kvm) > > > >base-commit: 45eb29140e68ffe8e93a5471006858a018480a45 > > Ok maybe I'm being dense, but what is left other than simply calling > __fred_entry_from_kvm() as a normal C function? > > I'm on the go so there might be something in the code I'm missing, but on the > surface...? I'm sure it's doable, though I'd be more than a little nervous about diverging from what FRED=y does, e.g. in case code somewhere expects the stack to look exactly like a real FRED event. And since we'd still need the assembly to support FRED=y, I don't see any point in adding more code when it's trivially easy to have asm_fred_entry_from_kvm() skip ERETS.
>> Something like so... except this is broken. Its reporting spurious >> interrupts on vector 0x00, so something is buggered passing that vector >> along. I'm a bit late to the party :) Peter kind of got what I had in the FRED patch set v8 or earlier: https://lore.kernel.org/lkml/20230410081438.1750-34-xin3.li@intel.com/ > Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > in place because they'll never be used. The only bits of code KVM needs is the > __fred_entry_from_kvm() glue. +1 > > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > > -- > From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 > From: Sean Christopherson <seanjc@google.com> > Date: Thu, 1 May 2025 11:20:29 -0700 > Subject: [PATCH 1/2] x86/fred: Play nice with invoking > asm_fred_entry_from_kvm() on non-FRED hardware > > Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even > when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y > on non-FRED hardware. This will allow forcing KVM to always use the FRED > entry points for 64-bit kernels, which in turn will eliminate a rather > gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT > lookups. > > When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from > the stack frame prior to doing a "regular" RET back to KVM (in quotes > because of all the RET mitigation horrors). > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > index 29c5c32c16c3..7aff2f0a285f 100644 > --- a/arch/x86/entry/entry_64_fred.S > +++ b/arch/x86/entry/entry_64_fred.S > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS > - ERETS > + > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED Neat! I ever had a plan to do this with "sub $0x8,%rsp; iret;" for non-FRED case. But obviously doing nothing here is the best. > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE > - pop %rbp > + leave This is a smart change. When !X86_FEATURE_FRED, the FRED stack frame set up for ERETS is implicitly skipped by leave. Maybe add a comment to explain LEAVE works for both cases?
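Something along these lines might do as the requested comment (hypothetical wording):

    /*
     * With X86_FEATURE_FRED, ERETS has consumed the FRED stack frame and
     * restored RSP to the frame base, so RSP == RBP at the label below
     * and LEAVE degenerates to 'pop %rbp'.  Without FRED, the
     * ALTERNATIVE is NOPs and the stale FRED frame is still on the
     * stack; LEAVE ('mov %rbp, %rsp; pop %rbp') discards it and
     * restores RBP in one go before the RET.
     */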
On 5/1/2025 11:30 AM, Sean Christopherson wrote: > From c50fb5a8a46058bbcfdcac0a100c2aa0f7f68f1c Mon Sep 17 00:00:00 2001 > From: Sean Christopherson<seanjc@google.com> > Date: Thu, 1 May 2025 11:10:39 -0700 > Subject: [PATCH 2/2] x86/fred: KVM: VMX: Always use FRED for IRQ+NMI when > CONFIG_X86_FRED=y > > Now that FRED provides C-code entry points for handling IRQ and NMI exits, > use the FRED infrastructure for forwarding all such events even if FRED > isn't supported in hardware. Avoiding the non-FRED assembly trampolines > into the IDT handlers for IRQs eliminates the associated non-CFI indirect > call (KVM performs a CALL by doing a lookup on the IDT using the IRQ > vector). > > Force FRED for 64-bit kernels if KVM_INTEL is enabled, as the benefits of > eliminating the IRQ trampoline usage far outweigh the code overhead for > FRED. > > Suggested-by: Peter Zijlstra<peterz@infradead.org> > Signed-off-by: Sean Christopherson<seanjc@google.com> > --- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/vmx/vmx.c | 4 ++-- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 2eeffcec5382..712a2ff28ce4 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM > config KVM_INTEL > tristate "KVM for Intel (and compatible) processors support" > depends on KVM && IA32_FEAT_CTL > + select X86_FRED if X86_64 I LOVE this change, but not sure if everyone is happy with it. > select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST > select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST > help > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index ef2d7208dd20..2ea89985107d 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -6995,7 +6995,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, > return; > > kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); > - if (cpu_feature_enabled(X86_FEATURE_FRED)) > + if (IS_ENABLED(CONFIG_X86_FRED)) "if (IS_ENABLED(CONFIG_X86_64))"? > fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); > else > vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); > @@ -7268,7 +7268,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) > return; > > kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); > - if (cpu_feature_enabled(X86_FEATURE_FRED)) > + if (IS_ENABLED(CONFIG_X86_FRED)) Ditto.
On 5/1/2025 11:59 AM, Sean Christopherson wrote: >> Ok maybe I'm being dense, but what is left other than simply calling >> __fred_entry_from_kvm() as a normal C function? >> >> I'm on the go so there might be something in the code I'm missing, but on the >> surface...? > I'm sure it's doable, though I'd be more than a little nervous about diverging > from what FRED=y does, e.g. in case code somewhere expects the stack to look > exactly like a real FRED event. __fred_entry_from_kvm() accepts a pt_regs structure pointer, with the event type and vector in the FRED stack frame. They are set up in the assembly. > And since we'd still need the assembly to support FRED=y, I don't see any point > in adding more code when it's trivially easy to have asm_fred_entry_from_kvm() > skip ERETS. Yeah, your change seems minimal to me.
On Thu, May 01, 2025 at 11:30:18AM -0700, Sean Christopherson wrote: > Uh, aren't you making this way more complex than it needs to be? Possibly :-) > IIUC, KVM never > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > in place because they'll never be used. The only bits of code KVM needs is the > __fred_entry_from_kvm() glue. But __fred_entry_from_kvm() calls into fred_extint(), which then directly uses the fred sysvec_table[] for dispatch. How would we not have to set up that table? > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. So the FRED NMI code is significantly different from the IDT NMI code and I really didn't want to go mixing those. If we get a nested NMI I don't think it'll behave well.
On Thu, May 01, 2025 at 08:33:57PM +0200, Paolo Bonzini wrote: > On Wed, Apr 30, 2025 at 1:26 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Notably the KVM fastop emulation stuff -- which I've completely rewritten for > > this version -- the generated code doesn't look horrific, but is slightly more > > verbose. I'm running on the assumption that instruction emulation is not super > > performance critical these days of zero VM-exit VMs etc. > > It's definitely going to be slower, but I guess it's okay these days. Yeah, it adds a bunch of branches at the very least. But that was indeed the argument, we shouldn't care much these days. > It's really only somewhat hot with really old processors > (pre-Westmere) and only when running big real mode code. Right, really old stuff :-)
On May 1, 2025 10:48:42 PM PDT, Xin Li <xin@zytor.com> wrote: >On 5/1/2025 11:30 AM, Sean Christopherson wrote: >> From c50fb5a8a46058bbcfdcac0a100c2aa0f7f68f1c Mon Sep 17 00:00:00 2001 >> From: Sean Christopherson<seanjc@google.com> >> Date: Thu, 1 May 2025 11:10:39 -0700 >> Subject: [PATCH 2/2] x86/fred: KVM: VMX: Always use FRED for IRQ+NMI when >> CONFIG_X86_FRED=y >> >> Now that FRED provides C-code entry points for handling IRQ and NMI exits, >> use the FRED infrastructure for forwarding all such events even if FRED >> isn't supported in hardware. Avoiding the non-FRED assembly trampolines >> into the IDT handlers for IRQs eliminates the associated non-CFI indirect >> call (KVM performs a CALL by doing a lookup on the IDT using the IRQ >> vector). >> >> Force FRED for 64-bit kernels if KVM_INTEL is enabled, as the benefits of >> eliminating the IRQ trampoline usage far outwieghts the code overhead for >> FRED. >> >> Suggested-by: Peter Zijlstra<peterz@infradead.org> >> Signed-off-by: Sean Christopherson<seanjc@google.com> >> --- >> arch/x86/kvm/Kconfig | 1 + >> arch/x86/kvm/vmx/vmx.c | 4 ++-- >> 2 files changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig >> index 2eeffcec5382..712a2ff28ce4 100644 >> --- a/arch/x86/kvm/Kconfig >> +++ b/arch/x86/kvm/Kconfig >> @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM >> config KVM_INTEL >> tristate "KVM for Intel (and compatible) processors support" >> depends on KVM && IA32_FEAT_CTL >> + select X86_FRED if X86_64 > >I LOVE this change, but not sure if everyone is happy with it. > >> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST >> select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST >> help >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c >> index ef2d7208dd20..2ea89985107d 100644 >> --- a/arch/x86/kvm/vmx/vmx.c >> +++ b/arch/x86/kvm/vmx/vmx.c >> @@ -6995,7 +6995,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, >> return; >> kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); >> - if (cpu_feature_enabled(X86_FEATURE_FRED)) >> + if (IS_ENABLED(CONFIG_X86_FRED)) > >"if (IS_ENABLED(CONFIG_X86_64))"? > >> fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); >> else >> vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); >> @@ -7268,7 +7268,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) >> return; >> kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); >> - if (cpu_feature_enabled(X86_FEATURE_FRED)) >> + if (IS_ENABLED(CONFIG_X86_FRED)) > >Ditto. I don't think anyone will have a problem with compiling it in... it is such a small amount of code.
On Fri, May 02, 2025, Peter Zijlstra wrote: > On Thu, May 01, 2025 at 11:30:18AM -0700, Sean Christopherson wrote: > > > Uh, aren't you making this way more complex than it needs to be? > > Possibly :-) > > > IIUC, KVM never > > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > > in place because they'll never be used. The only bits of code KVM needs is the > > __fred_entry_from_kvm() glue. > > But __fred_entry_from_kvm() calls into fred_extint(), which then > directly uses the fred sysvec_table[] for dispatch. How would we not > have to set up that table? I missed that the first time around. From my self-reply: : Hrm, and now I see that fred_extint() relies on fred_install_sysvec(), which makes : me quite curious as to why IRQs didn't go sideways. Oh, because sysvec_table[] : is statically defined at compile time except for PV crud. : : So yeah, I think my patches are correct, they just need a small bit of : prep work to support dynamic setup of sysvec_table. > > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > > So the FRED NMI code is significantly different from the IDT NMI code > and I really didn't want to go mixing those. > > If we get a nested NMI I don't think it'll behave well. Ah, because FRED hardware doesn't have the crazy NMI unblocking behavior, and the FRED NMI entry code relies on that. But I don't see why we need to care about NMIs from KVM: while they do bounce through assembly to create a stack frame, the actual CALL is direct to asm_exc_nmi_kvm_vmx(). If it's just the unwind hint that's needed, that seems easy enough. The attached patches handle the IRQ case and are passing my testing.
On Fri, May 02, 2025 at 12:53:36PM -0700, Sean Christopherson wrote: > > So the FRED NMI code is significantly different from the IDT NMI code > > and I really didn't want to go mixing those. > > > > If we get a nested NMI I don't think it'll behave well. > > Ah, because FRED hardwware doesn't have the crazy NMI unblocking behavior, and > the FRED NMI entry code relies on that. Just so. > But I don't see why we need to care about NMIs from KVM, while they do bounce > through assembly to create a stack frame, the actual CALL is direct to > asm_exc_nmi_kvm_vmx(). If it's just the unwind hint that's needed, that That part is all fine. > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > index 29c5c32c16c3..7aff2f0a285f 100644 > --- a/arch/x86/entry/entry_64_fred.S > +++ b/arch/x86/entry/entry_64_fred.S > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS > - ERETS > + > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE > - pop %rbp > + leave > RET So this, while clever, might be a problem with ORC unwinding. Because now the stack is different depending on the alternative, and we can't deal with that. Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to be a bank holiday :-).
On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > +++ b/arch/x86/entry/entry_64_fred.S > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > call __fred_entry_from_kvm /* Call the C entry point */ > > POP_REGS > > - ERETS > > + > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > 1: > > /* > > * Objtool doesn't understand what ERETS does, this hint tells it that > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > * isn't strictly needed, but it's the simplest form. > > */ > > UNWIND_HINT_RESTORE > > - pop %rbp > > + leave > > RET > > So this, while clever, might be a problem with ORC unwinding. Because > now the stack is different depending on the alternative, and we can't > deal with that. > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > be a bank holiday :-). Can we just adjust the stack in the alternative? ALTERNATIVE "add $64, %rsp", __stringify(ERETS), X86_FEATURE_FRED 1: UNWIND_HINT_RESTORE pop %rbp RET
On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > > +++ b/arch/x86/entry/entry_64_fred.S > > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > > call __fred_entry_from_kvm /* Call the C entry point */ > > > POP_REGS > > > - ERETS > > > + > > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > > 1: > > > /* > > > * Objtool doesn't understand what ERETS does, this hint tells it that > > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > * isn't strictly needed, but it's the simplest form. > > > */ > > > UNWIND_HINT_RESTORE > > > - pop %rbp > > > + leave > > > RET > > > > So this, while clever, might be a problem with ORC unwinding. Because > > now the stack is different depending on the alternative, and we can't > > deal with that. > > > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > > be a bank holiday :-). > > Can we just adjust the stack in the alternative? > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED Yes, that should work. But I wanted to have a poke at objtool, so it will properly complain about the mistake first.
On Tue, May 06, 2025 at 09:31:00AM +0200, Peter Zijlstra wrote: > On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > > On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > > > +++ b/arch/x86/entry/entry_64_fred.S > > > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > > > call __fred_entry_from_kvm /* Call the C entry point */ > > > > POP_REGS > > > > - ERETS > > > > + > > > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > > > 1: > > > > /* > > > > * Objtool doesn't understand what ERETS does, this hint tells it that > > > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > > * isn't strictly needed, but it's the simplest form. > > > > */ > > > > UNWIND_HINT_RESTORE > > > > - pop %rbp > > > > + leave > > > > RET > > > > > > So this, while clever, might be a problem with ORC unwinding. Because > > > now the stack is different depending on the alternative, and we can't > > > deal with that. > > > > > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > > > be a bank holiday :-). > > > > Can we just adjust the stack in the alternative? > > > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED > > Yes, that should work. Nope, it needs to be "mov %rbp, %rsp". Because that is the actual rsp value after ERETS-to-self. > But I wanted to have a poke at objtool, so it > will properly complain about the mistake first. So a metric ton of fail here :/ The biggest problem is the UNWIND_HINT_RESTORE right after the alternative. This ensures that objtool thinks all paths through the alternative end up with the same stack. And hence won't actually complain. Second being of course, that in order to get IRET and co correct, we'd need far more of an emulator. Also, it actually chokes on this variant, and I've not yet figured out why. Whatever state should be created by that mov, the restore hint should wipe it all. But still the ORC generation bails with unknown base reg -1.
On Tue, May 06, 2025 at 03:32:34PM +0200, Peter Zijlstra wrote: > On Tue, May 06, 2025 at 09:31:00AM +0200, Peter Zijlstra wrote: > > On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > > > Can we just adjust the stack in the alternative? > > > > > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED > > > > Yes, that should work. > > Nope, it needs to be "mov %rbp, %rsp". Because that is the actual rsp > value after ERETS-to-self. > > > But I wanted to have a poke at objtool, so it > > will properly complain about the mistake first. > > So a metric ton of fail here :/ > > The biggest problem is the UNWIND_HINT_RESTORE right after the > alternative. This ensures that objtool thinks all paths through the > alternative end up with the same stack. And hence won't actually > complain. Right, that's what the unwind hints are made for, it's up to the human to get it right. > Second being of course, that in order to get IRET and co correct, we'd > need far more of an emulator. At least finding RSP should be pretty easy, it's at a known location on the stack. We already have an ORC type for doing that, but that would again require an unwind hint, unless we make objtool smart enough to know that. But then the ORC would be inconsistent between the two alternative paths. > Also, it actually chokes on this variant, and I've not yet figured out > why. Whatever state should be created by that mov, the restore hint > should wipe it all. But still the ORC generation bails with unknown base > reg -1. Weird, I'm not seeing that.
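For completeness, folding Peter's correction into Josh's suggestion gives the variant Peter was testing when he hit the "unknown base reg -1" ORC failure above; an untested sketch:

    /* On !FRED, restore RSP from RBP so both paths reach the label with
     * the same stack layout; on FRED, ERETS-to-self has already done so. */
    ALTERNATIVE "mov %rbp, %rsp", __stringify(ERETS), X86_FEATURE_FRED
    1:
    UNWIND_HINT_RESTORE
    pop %rbp
    RET

Whether the hint should instead become an SP-relative ORC entry -- Josh notes RSP sits at a known location on the stack and an ORC type for that already exists -- is where the thread leaves off.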