Message ID | 20250430110734.392235199@infradead.org
---|---
Series | objtool: Detect and warn about indirect calls in __nocfi functions
On April 30, 2025 4:07:34 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote: >Hi! > >On kCFI (CONFIG_CFI_CLANG=y) builds all indirect calls should have the CFI >check on (with very few exceptions). Not having the CFI checks undermines the >protection provided by CFI and will make these sites candidates for people >wanting to steal your cookies. > >Specifically the ABI changes are so that doing indirect calls without the CFI >magic, to a CFI adorned function is not compatible (although it happens to work >for some setups, it very much does not for FineIBT). > >Rust people tripped over this the other day, since their 'core' happened to >have some no_sanitize(kcfi) bits in, which promptly exploded when run with >FineIBT on. > >Since this is very much not a supported model -- on purpose, have objtool >detect and warn about such constructs. > >This effort [1] found all existing [2] non-cfi indirect calls in the kernel. > >Notably the KVM fastop emulation stuff -- which I've completely rewritten for >this version -- the generated code doesn't look horrific, but is slightly more >verbose. I'm running on the assumption that instruction emulation is not super >performance critical these days of zero VM-exit VMs etc. > >KVM has another; the VMX interrupt injection stuff calls the IDT handler >directly. Is there an alternative? Can we keep a table of Linux functions >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > >HyperV hypercall page stuff, which I've previously suggested use direct calls, >and which I've now converted (after getting properly annoyed with that code). > >Also available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/core > >Changes since v1: > > - complete rewrite of the fastop stuff > - HyperV tweaks (Michael) > - objtool changes (Josh) > > >[1] https://lkml.kernel.org/r/20250410154556.GB9003@noisy.programming.kicks-ass.net >[2] https://lkml.kernel.org/r/20250410194334.GA3248459@google.com > We do have a table of handlers higher up in the stack in the form of the dispatch tables for FRED. They don't in general even need the assembly entry stubs, either.
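For reference, the incompatibility described in the cover letter comes from the call-site instrumentation: under kCFI the compiler verifies a type hash before every indirect call, and FineIBT moves that check into the callee's preamble, where it expects the caller to have supplied the hash. A rough C-level sketch -- the hash value, its placement, and the register usage here are illustrative, not the exact ABI:

    /* Illustrative sketch only -- not the real kCFI/FineIBT ABI. */
    typedef void (*handler_t)(void);

    void dispatch(handler_t fn)
    {
            /*
             * With CONFIG_CFI_CLANG, 'fn()' compiles to roughly:
             *
             *   movl  $0x12345678, %r10d  # expected hash of 'void (*)(void)'
             *   cmpl  %r10d, -4(%rax)     # hash emitted just before the target
             *   jne   .Lcfi_fail          # mismatch: CFI violation
             *   call  *%rax
             *
             * A __nocfi / no_sanitize(kcfi) caller emits a bare 'call *%rax'.
             * That happens to work against plain kCFI callees, but FineIBT
             * patches the check into the callee's preamble, which expects the
             * caller to have loaded the hash -- so the bare indirect call
             * faults, which is what bit the Rust 'core' code mentioned above.
             */
            fn();
    }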
On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > >directly. Is there an alternative? Can we keep a table of Linux functions > >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > We do have a table of handlers higher up in the stack in the form of > the dispatch tables for FRED. They don't in general even need the > assembly entry stubs, either. Oh, right. I'll go have a look at those.
From: Peter Zijlstra <peterz@infradead.org> Sent: Wednesday, April 30, 2025 4:08 AM > > What used to be a simple few instructions has turned into a giant mess > (for x86_64). Not only does it use static_branch wrong, it mixes it > with dynamic branches for no apparent reason. > > Notably it uses static_branch through an out-of-line function call, > which completely defeats the purpose, since instead of a simple > JMP/NOP site, you get a CALL+RET+TEST+Jcc sequence in return, which is > absolutely idiotic. > > Add to that a dynamic test of hyperv_paravisor_present, something > which is set once and never changed. > > Replace all this idiocy with a single direct function call to the > right hypercall variant. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> I've done these tests on Hyper-V VMs with this patch series. My focus is the Hyper-V changes in Patches 11 and 12, not the other changes. * Normal VM boot and basic smoke test * TDX and SEV-SNP VMs boot and basic smoke test. These VMs have a paravisor * Normal VM taking a panic and running the kdump kernel * Normal VM suspending for hibernation, then resuming from hibernation * Verified that IBT is enabled in a normal VM. It's not offered in a TDX VM on Hyper-V when a paravisor is used. I don't know about the case without a paravisor. * Building a 64-bit kernel with and without CONFIG_AMD_MEM_ENCRYPT and CONFIG_INTEL_TDX_GUEST. * Building a 32-bit kernel (but I did not try to run it) TDX and SEV-SNP VMs without a paravisor are not tested, so updating the static call, and the new direct call path, has not been tested for TDX and SNP hypercalls. I don't have a hardware environment where I can test without a paravisor. Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> > --- > arch/x86/hyperv/hv_init.c | 20 +++++ > arch/x86/hyperv/ivm.c | 15 ++++ > arch/x86/include/asm/mshyperv.h | 137 +++++++++++----------------------------- > arch/x86/kernel/cpu/mshyperv.c | 19 +++-- > 4 files changed, 89 insertions(+), 102 deletions(-) > > --- a/arch/x86/hyperv/hv_init.c > +++ b/arch/x86/hyperv/hv_init.c > @@ -35,7 +35,27 @@ > #include <linux/highmem.h> > > void *hv_hypercall_pg; > + > +#ifdef CONFIG_X86_64 > +u64 hv_std_hypercall(u64 control, u64 param1, u64 param2) > +{ > + u64 hv_status; > + > + if (!hv_hypercall_pg) > + return U64_MAX; > + > + register u64 __r8 asm("r8") = param2; > + asm volatile (CALL_NOSPEC > + : "=a" (hv_status), ASM_CALL_CONSTRAINT, > + "+c" (control), "+d" (param1), "+r" (__r8) > + : THUNK_TARGET(hv_hypercall_pg) > + : "cc", "memory", "r9", "r10", "r11"); > + > + return hv_status; > +} > +#else > EXPORT_SYMBOL_GPL(hv_hypercall_pg); > +#endif > > union hv_ghcb * __percpu *hv_ghcb_pg; > > --- a/arch/x86/hyperv/ivm.c > +++ b/arch/x86/hyperv/ivm.c > @@ -376,9 +376,23 @@ int hv_snp_boot_ap(u32 cpu, unsigned lon > return ret; > } > > +u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2) > +{ > + u64 hv_status; > + > + register u64 __r8 asm("r8") = param2; > + asm volatile("vmmcall" > + : "=a" (hv_status), ASM_CALL_CONSTRAINT, > + "+c" (control), "+d" (param1), "+r" (__r8) > + : : "cc", "memory", "r9", "r10", "r11"); > + > + return hv_status; > +} > + > #else > static inline void hv_ghcb_msr_write(u64 msr, u64 value) {} > static inline void hv_ghcb_msr_read(u64 msr, u64 *value) {} > +u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2) { return U64_MAX; } > #endif /* CONFIG_AMD_MEM_ENCRYPT */ > > #ifdef CONFIG_INTEL_TDX_GUEST > @@ -428,6 +442,7 @@ u64 
hv_tdx_hypercall(u64 control, u64 pa > #else > static inline void hv_tdx_msr_write(u64 msr, u64 value) {} > static inline void hv_tdx_msr_read(u64 msr, u64 *value) {} > +u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2) { return U64_MAX; } > #endif /* CONFIG_INTEL_TDX_GUEST */ > > #if defined(CONFIG_AMD_MEM_ENCRYPT) || defined(CONFIG_INTEL_TDX_GUEST) > --- a/arch/x86/include/asm/mshyperv.h > +++ b/arch/x86/include/asm/mshyperv.h > @@ -6,6 +6,7 @@ > #include <linux/nmi.h> > #include <linux/msi.h> > #include <linux/io.h> > +#include <linux/static_call.h> > #include <asm/nospec-branch.h> > #include <asm/paravirt.h> > #include <hyperv/hvhdk.h> > @@ -38,16 +39,21 @@ static inline unsigned char hv_get_nmi_r > return 0; > } > > -#if IS_ENABLED(CONFIG_HYPERV) > -extern bool hyperv_paravisor_present; > +extern u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2); > +extern u64 hv_snp_hypercall(u64 control, u64 param1, u64 param2); > +extern u64 hv_std_hypercall(u64 control, u64 param1, u64 param2); > > +#if IS_ENABLED(CONFIG_HYPERV) > extern void *hv_hypercall_pg; > > extern union hv_ghcb * __percpu *hv_ghcb_pg; > > bool hv_isolation_type_snp(void); > bool hv_isolation_type_tdx(void); > -u64 hv_tdx_hypercall(u64 control, u64 param1, u64 param2); > + > +#ifdef CONFIG_X86_64 > +DECLARE_STATIC_CALL(hv_hypercall, hv_std_hypercall); > +#endif > > /* > * DEFAULT INIT GPAT and SEGMENT LIMIT value in struct VMSA > @@ -64,37 +70,15 @@ static inline u64 hv_do_hypercall(u64 co > { > u64 input_address = input ? virt_to_phys(input) : 0; > u64 output_address = output ? virt_to_phys(output) : 0; > - u64 hv_status; > > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input_address, output_address); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - __asm__ __volatile__("mov %[output_address], %%r8\n" > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input_address) > - : [output_address] "r" (output_address) > - : "cc", "memory", "r8", "r9", "r10", "r11"); > - return hv_status; > - } > - > - if (!hv_hypercall_pg) > - return U64_MAX; > - > - __asm__ __volatile__("mov %[output_address], %%r8\n" > - CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input_address) > - : [output_address] "r" (output_address), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "memory", "r8", "r9", "r10", "r11"); > + return static_call_mod(hv_hypercall)(control, input_address, output_address); > #else > u32 input_address_hi = upper_32_bits(input_address); > u32 input_address_lo = lower_32_bits(input_address); > u32 output_address_hi = upper_32_bits(output_address); > u32 output_address_lo = lower_32_bits(output_address); > + u64 hv_status; > > if (!hv_hypercall_pg) > return U64_MAX; > @@ -107,8 +91,8 @@ static inline u64 hv_do_hypercall(u64 co > "D"(output_address_hi), "S"(output_address_lo), > THUNK_TARGET(hv_hypercall_pg) > : "cc", "memory"); > -#endif /* !x86_64 */ > return hv_status; > +#endif /* !x86_64 */ > } > > /* Hypercall to the L0 hypervisor */ > @@ -120,41 +104,23 @@ static inline u64 hv_do_nested_hypercall > /* Fast hypercall with 8 bytes of input and no output */ > static inline u64 _hv_do_fast_hypercall8(u64 control, u64 input1) > { > - u64 hv_status; > - > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input1, 0); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - 
__asm__ __volatile__( > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - :: "cc", "r8", "r9", "r10", "r11"); > - } else { > - __asm__ __volatile__(CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : THUNK_TARGET(hv_hypercall_pg) > - : "cc", "r8", "r9", "r10", "r11"); > - } > + return static_call_mod(hv_hypercall)(control, input1, 0); > #else > - { > - u32 input1_hi = upper_32_bits(input1); > - u32 input1_lo = lower_32_bits(input1); > - > - __asm__ __volatile__ (CALL_NOSPEC > - : "=A"(hv_status), > - "+c"(input1_lo), > - ASM_CALL_CONSTRAINT > - : "A" (control), > - "b" (input1_hi), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "edi", "esi"); > - } > -#endif > + u32 input1_hi = upper_32_bits(input1); > + u32 input1_lo = lower_32_bits(input1); > + u64 hv_status; > + > + __asm__ __volatile__ (CALL_NOSPEC > + : "=A"(hv_status), > + "+c"(input1_lo), > + ASM_CALL_CONSTRAINT > + : "A" (control), > + "b" (input1_hi), > + THUNK_TARGET(hv_hypercall_pg) > + : "cc", "edi", "esi"); > return hv_status; > +#endif > } > > static inline u64 hv_do_fast_hypercall8(u16 code, u64 input1) > @@ -174,45 +140,24 @@ static inline u64 hv_do_fast_nested_hype > /* Fast hypercall with 16 bytes of input */ > static inline u64 _hv_do_fast_hypercall16(u64 control, u64 input1, u64 input2) > { > - u64 hv_status; > - > #ifdef CONFIG_X86_64 > - if (hv_isolation_type_tdx() && !hyperv_paravisor_present) > - return hv_tdx_hypercall(control, input1, input2); > - > - if (hv_isolation_type_snp() && !hyperv_paravisor_present) { > - __asm__ __volatile__("mov %[input2], %%r8\n" > - "vmmcall" > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : [input2] "r" (input2) > - : "cc", "r8", "r9", "r10", "r11"); > - } else { > - __asm__ __volatile__("mov %[input2], %%r8\n" > - CALL_NOSPEC > - : "=a" (hv_status), ASM_CALL_CONSTRAINT, > - "+c" (control), "+d" (input1) > - : [input2] "r" (input2), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc", "r8", "r9", "r10", "r11"); > - } > + return static_call_mod(hv_hypercall)(control, input1, input2); > #else > - { > - u32 input1_hi = upper_32_bits(input1); > - u32 input1_lo = lower_32_bits(input1); > - u32 input2_hi = upper_32_bits(input2); > - u32 input2_lo = lower_32_bits(input2); > - > - __asm__ __volatile__ (CALL_NOSPEC > - : "=A"(hv_status), > - "+c"(input1_lo), ASM_CALL_CONSTRAINT > - : "A" (control), "b" (input1_hi), > - "D"(input2_hi), "S"(input2_lo), > - THUNK_TARGET(hv_hypercall_pg) > - : "cc"); > - } > -#endif > + u32 input1_hi = upper_32_bits(input1); > + u32 input1_lo = lower_32_bits(input1); > + u32 input2_hi = upper_32_bits(input2); > + u32 input2_lo = lower_32_bits(input2); > + u64 hv_status; > + > + __asm__ __volatile__ (CALL_NOSPEC > + : "=A"(hv_status), > + "+c"(input1_lo), ASM_CALL_CONSTRAINT > + : "A" (control), "b" (input1_hi), > + "D"(input2_hi), "S"(input2_lo), > + THUNK_TARGET(hv_hypercall_pg) > + : "cc"); > return hv_status; > +#endif > } > > static inline u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2) > --- a/arch/x86/kernel/cpu/mshyperv.c > +++ b/arch/x86/kernel/cpu/mshyperv.c > @@ -37,10 +37,6 @@ > bool hv_nested; > struct ms_hyperv_info ms_hyperv; > > -/* Used in modules via hv_do_hypercall(): see arch/x86/include/asm/mshyperv.h */ > -bool hyperv_paravisor_present __ro_after_init; > -EXPORT_SYMBOL_GPL(hyperv_paravisor_present); > - > #if IS_ENABLED(CONFIG_HYPERV) > static inline unsigned int hv_get_nested_msr(unsigned int reg) > { > @@ -287,8 
+283,18 @@ static void __init x86_setup_ops_for_tsc > old_restore_sched_clock_state = x86_platform.restore_sched_clock_state; > x86_platform.restore_sched_clock_state = hv_restore_sched_clock_state; > } > + > +#ifdef CONFIG_X86_64 > +DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall); > +EXPORT_STATIC_CALL_TRAMP_GPL(hv_hypercall); > +#define hypercall_update(hc) static_call_update(hv_hypercall, hc) > +#endif > #endif /* CONFIG_HYPERV */ > > +#ifndef hypercall_update > +#define hypercall_update(hc) (void)hc > +#endif > + > static uint32_t __init ms_hyperv_platform(void) > { > u32 eax; > @@ -483,14 +489,14 @@ static void __init ms_hyperv_init_platfo > ms_hyperv.shared_gpa_boundary = > BIT_ULL(ms_hyperv.shared_gpa_boundary_bits); > > - hyperv_paravisor_present = !!ms_hyperv.paravisor_present; > - > pr_info("Hyper-V: Isolation Config: Group A 0x%x, Group B 0x%x\n", > ms_hyperv.isolation_config_a, ms_hyperv.isolation_config_b); > > > if (hv_get_isolation_type() == HV_ISOLATION_TYPE_SNP) { > static_branch_enable(&isolation_type_snp); > + if (!ms_hyperv.paravisor_present) > + hypercall_update(hv_snp_hypercall); > } else if (hv_get_isolation_type() == HV_ISOLATION_TYPE_TDX) { > static_branch_enable(&isolation_type_tdx); > > @@ -498,6 +504,7 @@ static void __init ms_hyperv_init_platfo > ms_hyperv.hints &= ~HV_X64_APIC_ACCESS_RECOMMENDED; > > if (!ms_hyperv.paravisor_present) { > + hypercall_update(hv_tdx_hypercall); > /* > * Mark the Hyper-V TSC page feature as disabled > * in a TDX VM without paravisor so that the > >
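A condensed sketch of the static_call pattern the patch lands on may help readers unfamiliar with it. DEFINE_STATIC_CALL(), static_call() and static_call_update() are the real kernel APIs; the function bodies below are placeholders, and the patch itself uses the static_call_mod()/EXPORT_STATIC_CALL_TRAMP_GPL variants so that module callers go through the trampoline:

    #include <linux/init.h>
    #include <linux/types.h>
    #include <linux/static_call.h>

    static u64 hv_std_hypercall(u64 control, u64 p1, u64 p2) { return 0; /* placeholder */ }
    static u64 hv_snp_hypercall(u64 control, u64 p1, u64 p2) { return 0; /* placeholder */ }

    /* Every call site compiles to a direct CALL to the current target. */
    DEFINE_STATIC_CALL(hv_hypercall, hv_std_hypercall);

    static u64 do_hypercall(u64 control, u64 p1, u64 p2)
    {
            /* No flag test, no indirect branch: just 'call hv_std_hypercall'. */
            return static_call(hv_hypercall)(control, p1, p2);
    }

    static void __init hv_select_hypercall(bool snp_without_paravisor)
    {
            /* Boot-time, one-shot: patch every call site to the chosen variant. */
            if (snp_without_paravisor)
                    static_call_update(hv_hypercall, hv_snp_hypercall);
    }

This is what replaces both the per-hypercall hyperv_paravisor_present test and the out-of-line static_branch call the commit message complains about.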
On Thu, May 01, 2025 at 02:36:26AM +0000, Michael Kelley wrote: > From: Peter Zijlstra <peterz@infradead.org> Sent: Wednesday, April 30, 2025 4:08 AM > > @@ -528,8 +546,8 @@ void __init hyperv_init(void) > > if (hv_isolation_type_tdx() && !ms_hyperv.paravisor_present) > > goto skip_hypercall_pg_init; > > > > - hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, > > - VMALLOC_END, GFP_KERNEL, PAGE_KERNEL_ROX, > > + hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, MODULES_VADDR, > > + MODULES_END, GFP_KERNEL, PAGE_KERNEL_ROX, > > Curiosity question (which I forgot ask about in v1): Is this change so that the > hypercall page kernel address is "close enough" for the direct call to work from > built-in code and from module code? Or is there some other reason? No, you nailed it. Because we want to do a direct CALL, the hypercall page must be in the disp32 range relative to the call site. The module address space ensures this. > > VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, > > __builtin_return_address(0)); > > if (hv_hypercall_pg == NULL) > > @@ -567,27 +585,9 @@ void __init hyperv_init(void) > > wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64); > > } > > > > -skip_hypercall_pg_init: > > - /* > > - * Some versions of Hyper-V that provide IBT in guest VMs have a bug > > - * in that there's no ENDBR64 instruction at the entry to the > > - * hypercall page. Because hypercalls are invoked via an indirect call > > - * to the hypercall page, all hypercall attempts fail when IBT is > > - * enabled, and Linux panics. For such buggy versions, disable IBT. > > - * > > - * Fixed versions of Hyper-V always provide ENDBR64 on the hypercall > > - * page, so if future Linux kernel versions enable IBT for 32-bit > > - * builds, additional hypercall page hackery will be required here > > - * to provide an ENDBR32. > > - */ > > -#ifdef CONFIG_X86_KERNEL_IBT > > - if (cpu_feature_enabled(X86_FEATURE_IBT) && > > - *(u32 *)hv_hypercall_pg != gen_endbr()) { > > - setup_clear_cpu_cap(X86_FEATURE_IBT); > > - pr_warn("Disabling IBT because of Hyper-V bug\n"); > > - } > > -#endif > > Nit: With this IBT code removed, the #include <asm/ibt.h> at the top > of this source code file should be removed. Indeed so. > > > + hv_set_hypercall_pg(hv_hypercall_pg); > > > > +skip_hypercall_pg_init: > > /* > > * hyperv_init() is called before LAPIC is initialized: see > > * apic_intr_mode_init() -> x86_platform.apic_post_init() and > > > > > > The nit notwithstanding, > > Reviewed-by: Michael Kelley <mhklinux@outlook.com> Thanks!
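To make the disp32 point concrete: a direct x86-64 CALL is opcode E8 followed by a signed 32-bit displacement, so the target must lie within +/- 2 GiB of the end of the instruction. A generic vmalloc address carries no such guarantee relative to kernel and module text; the MODULES_VADDR..MODULES_END range does. A hypothetical helper (not kernel code) expressing the constraint:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Can a 5-byte 'CALL rel32' at address 'site' reach 'target'?  The
     * displacement is relative to the end of the instruction and must
     * fit in a signed 32 bits.
     */
    static bool call_rel32_reachable(uint64_t site, uint64_t target)
    {
            int64_t disp = (int64_t)target - (int64_t)(site + 5);

            return disp == (int64_t)(int32_t)disp;
    }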
On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > >directly. Is there an alternative? Can we keep a table of Linux functions > > >slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > We do have a table of handlers higher up in the stack in the form of > > the dispatch tables for FRED. They don't in general even need the > > assembly entry stubs, either. > > Oh, right. I'll go have a look at those. Right, so perhaps the easiest way around this is to setup the FRED entry tables unconditionally, have VMX mandate CONFIG_FRED and then have it always use the FRED entry points. Let me see how ugly that gets.
On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > > >directly. Is there an alternative? Can we keep a table of Linux functions > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > > > We do have a table of handlers higher up in the stack in the form of > > > the dispatch tables for FRED. They don't in general even need the > > > assembly entry stubs, either. > > > > Oh, right. I'll go have a look at those. > > Right, so perhaps the easiest way around this is to setup the FRED entry > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > always use the FRED entry points. > > Let me see how ugly that gets. Something like so... except this is broken. Its reporting spurious interrupts on vector 0x00, so something is buggered passing that vector along. I'll stare more at it more another time. --- diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c index f004a4dc74c2..d2322e3723f5 100644 --- a/arch/x86/entry/entry_fred.c +++ b/arch/x86/entry/entry_fred.c @@ -156,9 +156,8 @@ void __init fred_complete_exception_setup(void) fred_setup_done = true; } -static noinstr void fred_extint(struct pt_regs *regs) +noinstr void fred_extint(struct pt_regs *regs, unsigned int vector) { - unsigned int vector = regs->fred_ss.vector; unsigned int index = array_index_nospec(vector - FIRST_SYSTEM_VECTOR, NR_SYSTEM_VECTORS); @@ -230,7 +229,7 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs) switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: if (likely(regs->fred_ss.vector == X86_TRAP_NMI)) return fred_exc_nmi(regs); @@ -262,7 +261,7 @@ __visible noinstr void fred_entry_from_kernel(struct pt_regs *regs) switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: if (likely(regs->fred_ss.vector == X86_TRAP_NMI)) return fred_exc_nmi(regs); @@ -286,7 +285,7 @@ __visible noinstr void __fred_entry_from_kvm(struct pt_regs *regs) { switch (regs->fred_ss.type) { case EVENT_TYPE_EXTINT: - return fred_extint(regs); + return fred_extint(regs, regs->fred_ss.vector); case EVENT_TYPE_NMI: return fred_exc_nmi(regs); default: diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h index 2a29e5216881..e6ec5e1a0f54 100644 --- a/arch/x86/include/asm/fred.h +++ b/arch/x86/include/asm/fred.h @@ -66,6 +66,8 @@ void asm_fred_entrypoint_user(void); void asm_fred_entrypoint_kernel(void); void asm_fred_entry_from_kvm(struct fred_ss); +void fred_extint(struct pt_regs *regs, unsigned int vector); + __visible void fred_entry_from_user(struct pt_regs *regs); __visible void fred_entry_from_kernel(struct pt_regs *regs); __visible void __fred_entry_from_kvm(struct pt_regs *regs); diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h index a4ec27c67988..1de8c8b6d4c6 100644 --- a/arch/x86/include/asm/idtentry.h +++ b/arch/x86/include/asm/idtentry.h @@ -468,9 +468,9 @@ static inline void fred_install_sysvec(unsigned int vector, const idtentry_t fun #endif #define sysvec_install(vector, function) { \ - if (cpu_feature_enabled(X86_FEATURE_FRED)) \ + if (IS_ENABLED(CONFIG_X86_FRED)) \ 
fred_install_sysvec(vector, function); \ - else \ + if (!cpu_feature_enabled(X86_FEATURE_FRED)) \ idt_install_sysvec(vector, asm_##function); \ } @@ -647,6 +647,9 @@ DECLARE_IDTENTRY_RAW(X86_TRAP_MC, xenpv_exc_machine_check); * to avoid more ifdeffery. */ DECLARE_IDTENTRY(X86_TRAP_NMI, exc_nmi_kvm_vmx); +#ifdef CONFIG_X86_FRED +DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_NMI, exc_fred_kvm_vmx); +#endif #endif DECLARE_IDTENTRY_NMI(X86_TRAP_NMI, exc_nmi); diff --git a/arch/x86/kernel/irqinit.c b/arch/x86/kernel/irqinit.c index f79c5edc0b89..654f8e835417 100644 --- a/arch/x86/kernel/irqinit.c +++ b/arch/x86/kernel/irqinit.c @@ -97,9 +97,9 @@ void __init native_init_IRQ(void) /* Execute any quirks before the call gates are initialised: */ x86_init.irqs.pre_vector_init(); - if (cpu_feature_enabled(X86_FEATURE_FRED)) + if (IS_ENABLED(CONFIG_X86_FRED)) fred_complete_exception_setup(); - else + if (!cpu_feature_enabled(X86_FEATURE_FRED)) idt_setup_apic_and_irq_gates(); lapic_assign_system_vectors(); diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index 9a95d00f1423..4690787bdd13 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -613,8 +613,17 @@ DEFINE_IDTENTRY_RAW(exc_nmi_kvm_vmx) { exc_nmi(regs); } +#ifdef CONFIG_X86_FRED +DEFINE_IDTENTRY_RAW_ERRORCODE(exc_fred_kvm_vmx) +{ + fred_extint(regs, error_code); +} +#endif #if IS_MODULE(CONFIG_KVM_INTEL) EXPORT_SYMBOL_GPL(asm_exc_nmi_kvm_vmx); +#ifdef CONFIG_X86_FRED +EXPORT_SYMBOL_GPL(asm_exc_fred_kvm_vmx); +#endif #endif #endif diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index fe8ea8c097de..9db38ae3450e 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM config KVM_INTEL tristate "KVM for Intel (and compatible) processors support" depends on KVM && IA32_FEAT_CTL + select X86_FRED if X86_64 help Provides support for KVM on processors equipped with Intel's VT extensions, a.k.a. Virtual Machine Extensions (VMX). diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S index 090393bf829d..3aaacbbedf5c 100644 --- a/arch/x86/kvm/vmx/vmenter.S +++ b/arch/x86/kvm/vmx/vmenter.S @@ -31,7 +31,7 @@ #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE #endif -.macro VMX_DO_EVENT_IRQOFF call_insn call_target +.macro VMX_DO_EVENT_IRQOFF call_insn call_target has_error=0 /* * Unconditionally create a stack frame, getting the correct RSP on the * stack (for x86-64) would take two instructions anyways, and RBP can @@ -40,6 +40,8 @@ push %_ASM_BP mov %_ASM_SP, %_ASM_BP + UNWIND_HINT_SAVE + #ifdef CONFIG_X86_64 /* * Align RSP to a 16-byte boundary (to emulate CPU behavior) before @@ -51,7 +53,17 @@ #endif pushf push $__KERNEL_CS + + .if \has_error + lea 1f(%rip), %rax + push %rax + UNWIND_HINT_IRET_REGS + push %_ASM_ARG1 + .endif + \call_insn \call_target +1: + UNWIND_HINT_RESTORE /* * "Restore" RSP from RBP, even though IRET has already unwound RSP to @@ -363,10 +375,9 @@ SYM_FUNC_END(vmread_error_trampoline) .section .text, "ax" SYM_FUNC_START(vmx_do_interrupt_irqoff) - /* - * Calling an IDT gate directly; annotate away the CFI concern for now. - * Should be fixed if possible. 
- */ - ANNOTATE_NOCFI_SYM +#ifdef CONFIG_X86_FRED + VMX_DO_EVENT_IRQOFF jmp asm_exc_fred_kvm_vmx has_error=1 +#else VMX_DO_EVENT_IRQOFF CALL_NOSPEC _ASM_ARG1 +#endif SYM_FUNC_END(vmx_do_interrupt_irqoff) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 5c5766467a61..7ccd2f66d55d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7034,6 +7034,7 @@ void vmx_apicv_pre_state_restore(struct kvm_vcpu *vcpu) } void vmx_do_interrupt_irqoff(unsigned long entry); + void vmx_do_nmi_irqoff(void); static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu) @@ -7080,6 +7081,8 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); if (cpu_feature_enabled(X86_FEATURE_FRED)) fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); + else if (IS_ENABLED(CONFIG_FRED)) + vmx_do_interrupt_irqoff(vector); else vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); kvm_after_interrupt(vcpu);
On Thu, May 01, 2025, Peter Zijlstra wrote: > On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > > > > > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > > > > >directly. Is there an alternative? Can we keep a table of Linux functions > > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > > > > > > We do have a table of handlers higher up in the stack in the form of > > > > the dispatch tables for FRED. They don't in general even need the > > > > assembly entry stubs, either. > > > > > > Oh, right. I'll go have a look at those. > > > > Right, so perhaps the easiest way around this is to setup the FRED entry > > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > > always use the FRED entry points. > > > > Let me see how ugly that gets. > > Something like so... except this is broken. Its reporting spurious > interrupts on vector 0x00, so something is buggered passing that vector > along. Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be in place because they'll never be used. The only bits of code KVM needs is the __fred_entry_from_kvm() glue. Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. --
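Condensed, the forwarding path Sean is proposing looks roughly like this (names from the thread; simplified, not the verbatim KVM code):

    static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, u32 vector)
    {
            kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
            /*
             * Direct call into the FRED software entry path:
             *   fred_entry_from_kvm() -> asm_fred_entry_from_kvm (builds a
             *   FRED-style pt_regs frame) -> __fred_entry_from_kvm() ->
             *   fred_extint() -> sysvec_table[] dispatch.
             * No IDT lookup and no indirect CALL, so nothing for the new
             * objtool warning to flag.
             */
            fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector);
            kvm_after_interrupt(vcpu);
    }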
On Wed, Apr 30, 2025 at 1:26 PM Peter Zijlstra <peterz@infradead.org> wrote: > Notably the KVM fastop emulation stuff -- which I've completely rewritten for > this version -- the generated code doesn't look horrific, but is slightly more > verbose. I'm running on the assumption that instruction emulation is not super > performance critical these days of zero VM-exit VMs etc. It's definitely going to be slower, but I guess it's okay these days. It's really only somewhat hot with really old processors (pre-Westmere) and only when running big real mode code. Paolo > KVM has another; the VMX interrupt injection stuff calls the IDT handler > directly. Is there an alternative? Can we keep a table of Linux functions > slightly higher up the call stack (asm_\cfunc ?) and add CFI to those? > > HyperV hypercall page stuff, which I've previously suggested use direct calls, > and which I've now converted (after getting properly annoyed with that code). > > Also available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/core > > Changes since v1: > > - complete rewrite of the fastop stuff > - HyperV tweaks (Michael) > - objtool changes (Josh) > > > [1] https://lkml.kernel.org/r/20250410154556.GB9003@noisy.programming.kicks-ass.net > [2] https://lkml.kernel.org/r/20250410194334.GA3248459@google.com >
On May 1, 2025 11:30:18 AM PDT, Sean Christopherson <seanjc@google.com> wrote: >On Thu, May 01, 2025, Peter Zijlstra wrote: >> On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: >> > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: >> > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: >> > > >> > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler >> > > > >directly. Is there an alternative? Can we keep a table of Linux functions >> > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? >> > > >> > > > We do have a table of handlers higher up in the stack in the form of >> > > > the dispatch tables for FRED. They don't in general even need the >> > > > assembly entry stubs, either. >> > > >> > > Oh, right. I'll go have a look at those. >> > >> > Right, so perhaps the easiest way around this is to setup the FRED entry >> > tables unconditionally, have VMX mandate CONFIG_FRED and then have it >> > always use the FRED entry points. >> > >> > Let me see how ugly that gets. >> >> Something like so... except this is broken. Its reporting spurious >> interrupts on vector 0x00, so something is buggered passing that vector >> along. > >Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never >uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be >in place because they'll never be used. The only bits of code KVM needs is the >__fred_entry_from_kvm() glue. > >Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > >-- >From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 >From: Sean Christopherson <seanjc@google.com> >Date: Thu, 1 May 2025 11:20:29 -0700 >Subject: [PATCH 1/2] x86/fred: Play nice with invoking > asm_fred_entry_from_kvm() on non-FRED hardware > >Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even >when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y >on non-FRED hardware. This will allow forcing KVM to always use the FRED >entry points for 64-bit kernels, which in turn will eliminate a rather >gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT >lookups. > >When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from >the stack frame prior to doing a "regular" RET back to KVM (in quotes >because of all the RET mitigation horrors). > >Signed-off-by: Sean Christopherson <seanjc@google.com> >--- > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > >diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S >index 29c5c32c16c3..7aff2f0a285f 100644 >--- a/arch/x86/entry/entry_64_fred.S >+++ b/arch/x86/entry/entry_64_fred.S >@@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS >- ERETS >+ >+ ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that >@@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE >- pop %rbp >+ leave > RET > > SYM_FUNC_END(asm_fred_entry_from_kvm) > >base-commit: 45eb29140e68ffe8e93a5471006858a018480a45 Ok maybe I'm being dense, but what is left other than simply calling __fred_entry_from_kvm() as a normal C function? 
I'm on the go so there might be something in the code I'm missing, but on the surface...?
On Thu, May 01, 2025, H. Peter Anvin wrote: > On May 1, 2025 11:30:18 AM PDT, Sean Christopherson <seanjc@google.com> wrote: > >On Thu, May 01, 2025, Peter Zijlstra wrote: > >> On Thu, May 01, 2025 at 12:30:38PM +0200, Peter Zijlstra wrote: > >> > On Wed, Apr 30, 2025 at 09:06:00PM +0200, Peter Zijlstra wrote: > >> > > On Wed, Apr 30, 2025 at 07:24:15AM -0700, H. Peter Anvin wrote: > >> > > > >> > > > >KVM has another; the VMX interrupt injection stuff calls the IDT handler > >> > > > >directly. Is there an alternative? Can we keep a table of Linux functions > >> > > > >slighly higher up the call stack (asm_\cfunc ?) and add CFI to those? > >> > > > >> > > > We do have a table of handlers higher up in the stack in the form of > >> > > > the dispatch tables for FRED. They don't in general even need the > >> > > > assembly entry stubs, either. > >> > > > >> > > Oh, right. I'll go have a look at those. > >> > > >> > Right, so perhaps the easiest way around this is to setup the FRED entry > >> > tables unconditionally, have VMX mandate CONFIG_FRED and then have it > >> > always use the FRED entry points. > >> > > >> > Let me see how ugly that gets. > >> > >> Something like so... except this is broken. Its reporting spurious > >> interrupts on vector 0x00, so something is buggered passing that vector > >> along. > > > >Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never > >uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > >in place because they'll never be used. The only bits of code KVM needs is the > >__fred_entry_from_kvm() glue. > > > >Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. Hrm, and now I see that fred_extint() relies on fred_install_sysvec(), which makes me quite curious as to why IRQs didn't go sideways. Oh, because sysvec_table[] is statically defined at compile time except for PV crud. So yeah, I think my patches are correct, they just need a small bit of prep work to support dynamic setup of sysvec_table. > >-- > >From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 > >From: Sean Christopherson <seanjc@google.com> > >Date: Thu, 1 May 2025 11:20:29 -0700 > >Subject: [PATCH 1/2] x86/fred: Play nice with invoking > > asm_fred_entry_from_kvm() on non-FRED hardware > > > >Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even > >when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y > >on non-FRED hardware. This will allow forcing KVM to always use the FRED > >entry points for 64-bit kernels, which in turn will eliminate a rather > >gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT > >lookups. > > > >When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from > >the stack frame prior to doing a "regular" RET back to KVM (in quotes > >because of all the RET mitigation horrors).
> > > >Signed-off-by: Sean Christopherson <seanjc@google.com> > >--- > > arch/x86/entry/entry_64_fred.S | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > >diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > >index 29c5c32c16c3..7aff2f0a285f 100644 > >--- a/arch/x86/entry/entry_64_fred.S > >+++ b/arch/x86/entry/entry_64_fred.S > >@@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > call __fred_entry_from_kvm /* Call the C entry point */ > > POP_REGS > >- ERETS > >+ > >+ ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > 1: > > /* > > * Objtool doesn't understand what ERETS does, this hint tells it that > >@@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > * isn't strictly needed, but it's the simplest form. > > */ > > UNWIND_HINT_RESTORE > >- pop %rbp > >+ leave > > RET > > > > SYM_FUNC_END(asm_fred_entry_from_kvm) > > > >base-commit: 45eb29140e68ffe8e93a5471006858a018480a45 > > Ok maybe I'm being dense, but what is left other than simply calling > __fred_entry_from_kvm() as a normal C function? > > I'm on the go so there might be something in the code I'm missing, but on the > surface...? I'm sure it's doable, though I'd be more than a little nervous about diverging from what FRED=y does, e.g. in case code somewhere expects the stack to look exactly like a real FRED event. And since we'd still need the assembly to support FRED=y, I don't see any point in adding more code when it's trivially easy to have asm_fred_entry_from_kvm() skip ERETS.
>> Something like so... except this is broken. Its reporting spurious >> interrupts on vector 0x00, so something is buggered passing that vector >> along. I'm a bit late to the party :) Peter kind of got what I had in the FRED patch set v8 or earlier: https://lore.kernel.org/lkml/20230410081438.1750-34-xin3.li@intel.com/ > Uh, aren't you making this way more complex than it needs to be? IIUC, KVM never > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > in place because they'll never be used. The only bits of code KVM needs is the > __fred_entry_from_kvm() glue. +1 > > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > > -- > From 664468143109ab7c525c0babeba62195fa4c657e Mon Sep 17 00:00:00 2001 > From: Sean Christopherson <seanjc@google.com> > Date: Thu, 1 May 2025 11:20:29 -0700 > Subject: [PATCH 1/2] x86/fred: Play nice with invoking > asm_fred_entry_from_kvm() on non-FRED hardware > > Modify asm_fred_entry_from_kvm() to allow it to be invoked by KVM even > when FRED isn't fully enabled, e.g. when running with CONFIG_X86_FRED=y > on non-FRED hardware. This will allow forcing KVM to always use the FRED > entry points for 64-bit kernels, which in turn will eliminate a rather > gross non-CFI indirect call that KVM uses to trampoline IRQs by doing IDT > lookups. > > When FRED isn't enabled, simply skip ERETS and restore RBP and RSP from > the stack frame prior to doing a "regular" RET back to KVM (in quotes > because of all the RET mitigation horrors). > > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > index 29c5c32c16c3..7aff2f0a285f 100644 > --- a/arch/x86/entry/entry_64_fred.S > +++ b/arch/x86/entry/entry_64_fred.S > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS > - ERETS > + > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED Neat! I ever had a plan to do this with "sub $0x8,%rsp; iret;" for non-FRED case. But obviously doing nothing here is the best. > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE > - pop %rbp > + leave This is a smart change. When !X86_FEATURE_FRED, the FRED stack frame set up for ERETS is implicitly skipped by leave. Maybe add a comment to explain LEAVE works for both cases?
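Something along these lines might do as the requested comment (hypothetical wording):

    /*
     * With X86_FEATURE_FRED, ERETS has consumed the FRED stack frame and
     * restored RSP to the frame base, so RSP == RBP at the label below
     * and LEAVE degenerates to 'pop %rbp'.  Without FRED, the
     * ALTERNATIVE is NOPs and the stale FRED frame is still on the
     * stack; LEAVE ('mov %rbp, %rsp; pop %rbp') discards it and
     * restores RBP in one go before the RET.
     */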
On 5/1/2025 11:30 AM, Sean Christopherson wrote: > From c50fb5a8a46058bbcfdcac0a100c2aa0f7f68f1c Mon Sep 17 00:00:00 2001 > From: Sean Christopherson<seanjc@google.com> > Date: Thu, 1 May 2025 11:10:39 -0700 > Subject: [PATCH 2/2] x86/fred: KVM: VMX: Always use FRED for IRQ+NMI when > CONFIG_X86_FRED=y > > Now that FRED provides C-code entry points for handling IRQ and NMI exits, > use the FRED infrastructure for forwarding all such events even if FRED > isn't supported in hardware. Avoiding the non-FRED assembly trampolines > into the IDT handlers for IRQs eliminates the associated non-CFI indirect > call (KVM performs a CALL by doing a lookup on the IDT using the IRQ > vector). > > Force FRED for 64-bit kernels if KVM_INTEL is enabled, as the benefits of > eliminating the IRQ trampoline usage far outweigh the code overhead for > FRED. > > Suggested-by: Peter Zijlstra<peterz@infradead.org> > Signed-off-by: Sean Christopherson<seanjc@google.com> > --- > arch/x86/kvm/Kconfig | 1 + > arch/x86/kvm/vmx/vmx.c | 4 ++-- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 2eeffcec5382..712a2ff28ce4 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM > config KVM_INTEL > tristate "KVM for Intel (and compatible) processors support" > depends on KVM && IA32_FEAT_CTL > + select X86_FRED if X86_64 I LOVE this change, but not sure if everyone is happy with it. > select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST > select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST > help > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index ef2d7208dd20..2ea89985107d 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -6995,7 +6995,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, > return; > > kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); > - if (cpu_feature_enabled(X86_FEATURE_FRED)) > + if (IS_ENABLED(CONFIG_X86_FRED)) "if (IS_ENABLED(CONFIG_X86_64))"? > fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); > else > vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); > @@ -7268,7 +7268,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) > return; > > kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); > - if (cpu_feature_enabled(X86_FEATURE_FRED)) > + if (IS_ENABLED(CONFIG_X86_FRED)) Ditto.
On 5/1/2025 11:59 AM, Sean Christopherson wrote: >> Ok maybe I'm being dense, but what is left other than simply calling >> __fred_entry_from_kvm() as a normal C function? >> >> I'm on the go so there might be something in the code I'm missing, but on the >> surface...? > I'm sure it's doable, though I'd be more than a little nervous about diverging > from what FRED=y does, e.g. in case code somewhere expects the stack to look > exactly like a real FRED event. __fred_entry_from_kvm() accepts a pt_regs structure pointer, with the event type and vector in the FRED stack frame. They are set up in the assembly. > And since we'd still need the assembly to support FRED=y, I don't see any point > in adding more code when it's trivially easy to have asm_fred_entry_from_kvm() > skip ERETS. Yeah, your change seems minimal to me.
On Thu, May 01, 2025 at 11:30:18AM -0700, Sean Christopherson wrote: > Uh, aren't you making this way more complex than it needs to be? Possibly :-) > IIUC, KVM never > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > in place because they'll never be used. The only bits of code KVM needs is the > __fred_entry_from_kvm() glue. But __fred_entry_from_kvm() calls into fred_extint(), which then directly uses the fred sysvec_table[] for dispatch. How would we not have to set up that table? > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. So the FRED NMI code is significantly different from the IDT NMI code and I really didn't want to go mixing those. If we get a nested NMI I don't think it'll behave well.
On Thu, May 01, 2025 at 08:33:57PM +0200, Paolo Bonzini wrote: > On Wed, Apr 30, 2025 at 1:26 PM Peter Zijlstra <peterz@infradead.org> wrote: > > Notably the KVM fastop emulation stuff -- which I've completely rewritten for > > this version -- the generated code doesn't look horrific, but is slightly more > > verbose. I'm running on the assumption that instruction emulation is not super > > performance critical these days of zero VM-exit VMs etc. > > It's definitely going to be slower, but I guess it's okay these days. Yeah, it adds a bunch of branches at the very least. But that was indeed the argument, we shouldn't care much these days. > It's really only somewhat hot with really old processors > (pre-Westmere) and only when running big real mode code. Right, really old stuff :-)
On May 1, 2025 10:48:42 PM PDT, Xin Li <xin@zytor.com> wrote: >On 5/1/2025 11:30 AM, Sean Christopherson wrote: >> From c50fb5a8a46058bbcfdcac0a100c2aa0f7f68f1c Mon Sep 17 00:00:00 2001 >> From: Sean Christopherson<seanjc@google.com> >> Date: Thu, 1 May 2025 11:10:39 -0700 >> Subject: [PATCH 2/2] x86/fred: KVM: VMX: Always use FRED for IRQ+NMI when >> CONFIG_X86_FRED=y >> >> Now that FRED provides C-code entry points for handling IRQ and NMI exits, >> use the FRED infrastructure for forwarding all such events even if FRED >> isn't supported in hardware. Avoiding the non-FRED assembly trampolines >> into the IDT handlers for IRQs eliminates the associated non-CFI indirect >> call (KVM performs a CALL by doing a lookup on the IDT using the IRQ >> vector). >> >> Force FRED for 64-bit kernels if KVM_INTEL is enabled, as the benefits of >> eliminating the IRQ trampoline usage far outwieghts the code overhead for >> FRED. >> >> Suggested-by: Peter Zijlstra<peterz@infradead.org> >> Signed-off-by: Sean Christopherson<seanjc@google.com> >> --- >> arch/x86/kvm/Kconfig | 1 + >> arch/x86/kvm/vmx/vmx.c | 4 ++-- >> 2 files changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig >> index 2eeffcec5382..712a2ff28ce4 100644 >> --- a/arch/x86/kvm/Kconfig >> +++ b/arch/x86/kvm/Kconfig >> @@ -95,6 +95,7 @@ config KVM_SW_PROTECTED_VM >> config KVM_INTEL >> tristate "KVM for Intel (and compatible) processors support" >> depends on KVM && IA32_FEAT_CTL >> + select X86_FRED if X86_64 > >I LOVE this change, but not sure if everyone is happy with it. > >> select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST >> select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST >> help >> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c >> index ef2d7208dd20..2ea89985107d 100644 >> --- a/arch/x86/kvm/vmx/vmx.c >> +++ b/arch/x86/kvm/vmx/vmx.c >> @@ -6995,7 +6995,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu, >> return; >> kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ); >> - if (cpu_feature_enabled(X86_FEATURE_FRED)) >> + if (IS_ENABLED(CONFIG_X86_FRED)) > >"if (IS_ENABLED(CONFIG_X86_64))"? > >> fred_entry_from_kvm(EVENT_TYPE_EXTINT, vector); >> else >> vmx_do_interrupt_irqoff(gate_offset((gate_desc *)host_idt_base + vector)); >> @@ -7268,7 +7268,7 @@ noinstr void vmx_handle_nmi(struct kvm_vcpu *vcpu) >> return; >> kvm_before_interrupt(vcpu, KVM_HANDLING_NMI); >> - if (cpu_feature_enabled(X86_FEATURE_FRED)) >> + if (IS_ENABLED(CONFIG_X86_FRED)) > >Ditto. I don't think anyone will have a problem with compiling it in... it is such a small amount of code.
On Fri, May 02, 2025, Peter Zijlstra wrote: > On Thu, May 01, 2025 at 11:30:18AM -0700, Sean Christopherson wrote: > > > Uh, aren't you making this way more complex than it needs to be? > > Possibly :-) > > > IIUC, KVM never > > uses the FRED hardware entry points, i.e. the FRED entry tables don't need to be > > in place because they'll never be used. The only bits of code KVM needs is the > > __fred_entry_from_kvm() glue. > > But __fred_entry_from_kvm() calls into fred_extint(), which then > directly uses the fred sysvec_table[] for dispatch. How would we not > have to set up that table? I missed that the first time around. From my self-reply: : Hrm, and now I see that fred_extint() relies on fred_install_sysvec(), which makes : me quite curious as to why IRQs didn't go sideways. Oh, because sysvec_table[] : is statically defined at compile time except for PV crud. : : So yeah, I think my patches are correct, they just need a small bit of : prep work to support dynamic setup of sysvec_table. > > Lightly tested, but this combo works for IRQs and NMIs on non-FRED hardware. > > So the FRED NMI code is significantly different from the IDT NMI code > and I really didn't want to go mixing those. > > If we get a nested NMI I don't think it'll behave well. Ah, because FRED hardware doesn't have the crazy NMI unblocking behavior, and the FRED NMI entry code relies on that. But I don't see why we need to care about NMIs from KVM: while they do bounce through assembly to create a stack frame, the actual CALL is direct to asm_exc_nmi_kvm_vmx(). If it's just the unwind hint that's needed, that seems easy enough. The attached patches handle the IRQ case and are passing my testing.
On Fri, May 02, 2025 at 12:53:36PM -0700, Sean Christopherson wrote: > > So the FRED NMI code is significantly different from the IDT NMI code > > and I really didn't want to go mixing those. > > > > If we get a nested NMI I don't think it'll behave well. > > Ah, because FRED hardwware doesn't have the crazy NMI unblocking behavior, and > the FRED NMI entry code relies on that. Just so. > But I don't see why we need to care about NMIs from KVM, while they do bounce > through assembly to create a stack frame, the actual CALL is direct to > asm_exc_nmi_kvm_vmx(). If it's just the unwind hint that's needed, that That part is all fine. > arch/x86/entry/entry_64_fred.S | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/entry/entry_64_fred.S b/arch/x86/entry/entry_64_fred.S > index 29c5c32c16c3..7aff2f0a285f 100644 > --- a/arch/x86/entry/entry_64_fred.S > +++ b/arch/x86/entry/entry_64_fred.S > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > movq %rsp, %rdi /* %rdi -> pt_regs */ > call __fred_entry_from_kvm /* Call the C entry point */ > POP_REGS > - ERETS > + > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > 1: > /* > * Objtool doesn't understand what ERETS does, this hint tells it that > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > * isn't strictly needed, but it's the simplest form. > */ > UNWIND_HINT_RESTORE > - pop %rbp > + leave > RET So this, while clever, might be a problem with ORC unwinding. Because now the stack is different depending on the alternative, and we can't deal with that. Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to be a bank holiday :-).
On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > +++ b/arch/x86/entry/entry_64_fred.S > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > call __fred_entry_from_kvm /* Call the C entry point */ > > POP_REGS > > - ERETS > > + > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > 1: > > /* > > * Objtool doesn't understand what ERETS does, this hint tells it that > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > * isn't strictly needed, but it's the simplest form. > > */ > > UNWIND_HINT_RESTORE > > - pop %rbp > > + leave > > RET > > So this, while clever, might be a problem with ORC unwinding. Because > now the stack is different depending on the alternative, and we can't > deal with that. > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > be a bank holiday :-). Can we just adjust the stack in the alternative? ALTERNATIVE "add $64, %rsp", __stringify(ERETS), X86_FEATURE_FRED 1: UNWIND_HINT_RESTORE pop %rbp RET
On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > > +++ b/arch/x86/entry/entry_64_fred.S > > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > > call __fred_entry_from_kvm /* Call the C entry point */ > > > POP_REGS > > > - ERETS > > > + > > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > > 1: > > > /* > > > * Objtool doesn't understand what ERETS does, this hint tells it that > > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > * isn't strictly needed, but it's the simplest form. > > > */ > > > UNWIND_HINT_RESTORE > > > - pop %rbp > > > + leave > > > RET > > > > So this, while clever, might be a problem with ORC unwinding. Because > > now the stack is different depending on the alternative, and we can't > > deal with that. > > > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > > be a bank holiday :-). > > Can we just adjust the stack in the alternative? > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED Yes, that should work. But I wanted to have a poke at objtool, so it will properly complain about the mistake first.
On Tue, May 06, 2025 at 09:31:00AM +0200, Peter Zijlstra wrote: > On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > > On Sat, May 03, 2025 at 11:50:23AM +0200, Peter Zijlstra wrote: > > > > +++ b/arch/x86/entry/entry_64_fred.S > > > > @@ -116,7 +116,8 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > > movq %rsp, %rdi /* %rdi -> pt_regs */ > > > > call __fred_entry_from_kvm /* Call the C entry point */ > > > > POP_REGS > > > > - ERETS > > > > + > > > > + ALTERNATIVE "", __stringify(ERETS), X86_FEATURE_FRED > > > > 1: > > > > /* > > > > * Objtool doesn't understand what ERETS does, this hint tells it that > > > > @@ -124,7 +125,7 @@ SYM_FUNC_START(asm_fred_entry_from_kvm) > > > > * isn't strictly needed, but it's the simplest form. > > > > */ > > > > UNWIND_HINT_RESTORE > > > > - pop %rbp > > > > + leave > > > > RET > > > > > > So this, while clever, might be a problem with ORC unwinding. Because > > > now the stack is different depending on the alternative, and we can't > > > deal with that. > > > > > > Anyway, I'll go have a poke on Monday (or Tuesday if Monday turns out to > > > be a bank holiday :-). > > > > Can we just adjust the stack in the alternative? > > > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED > > Yes, that should work. Nope, it needs to be "mov %rbp, %rsp". Because that is the actual rsp value after ERETS-to-self. > But I wanted to have a poke at objtool, so it > will properly complain about the mistake first. So a metric ton of fail here :/ The biggest problem is the UNWIND_HINT_RESTORE right after the alternative. This ensures that objtool thinks all paths through the alternative end up with the same stack. And hence won't actually complain. Second being of course, that in order to get IRET and co correct, we'd need far more of an emulator. Also, it actually chokes on this variant, and I've not yet figured out why. Whatever state should be created by that mov, the restore hint should wipe it all. But still the ORC generation bails with unknown base reg -1.
On Tue, May 06, 2025 at 03:32:34PM +0200, Peter Zijlstra wrote: > On Tue, May 06, 2025 at 09:31:00AM +0200, Peter Zijlstra wrote: > > On Sat, May 03, 2025 at 11:28:37AM -0700, Josh Poimboeuf wrote: > > > Can we just adjust the stack in the alternative? > > > > > > ALTERNATIVE "add $64 %rsp", __stringify(ERETS), X86_FEATURE_FRED > > > > Yes, that should work. > > Nope, it needs to be "mov %rbp, %rsp". Because that is the actual rsp > value after ERETS-to-self. > > > But I wanted to have a poke at objtool, so it > > will properly complain about the mistake first. > > So a metric ton of fail here :/ > > The biggest problem is the UNWIND_HINT_RESTORE right after the > alternative. This ensures that objtool thinks all paths through the > alternative end up with the same stack. And hence won't actually > complain. Right, that's what the unwind hints are made for, it's up to the human to get it right. > Second being of course, that in order to get IRET and co correct, we'd > need far more of an emulator. At least finding RSP should be pretty easy, it's at a known location on the stack. We already have an ORC type for doing that, but that would again require an unwind hint, unless we make objtool smart enough to know that. But then the ORC would be inconsistent between the two alternative paths. > Also, it actually chokes on this variant, and I've not yet figured out > why. Whatever state should be created by that mov, the restore hint > should wipe it all. But still the ORC generation bails with unknown base > reg -1. Weird, I'm not seeing that.
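For completeness, folding Peter's correction into Josh's suggestion gives the variant Peter was testing when he hit the "unknown base reg -1" ORC failure above; an untested sketch:

    /* On !FRED, restore RSP from RBP so both paths reach the label with
     * the same stack layout; on FRED, ERETS-to-self has already done so. */
    ALTERNATIVE "mov %rbp, %rsp", __stringify(ERETS), X86_FEATURE_FRED
    1:
    UNWIND_HINT_RESTORE
    pop %rbp
    RET

Whether the hint should instead become an SP-relative ORC entry -- Josh notes RSP sits at a known location on the stack and an ORC type for that already exists -- is where the thread leaves off.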