diff mbox series

efi/x86: use naked RET on mixed mode call wrapper

Message ID 20220715194550.793957-1-cascardo@canonical.com
State New
Headers show
Series efi/x86: use naked RET on mixed mode call wrapper | expand

Commit Message

Thadeu Lima de Souza Cascardo July 15, 2022, 7:45 p.m. UTC
When running with return thunks enabled under 32-bit EFI, the system
crashes with:

[    0.137688] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[    0.138136] BUG: unable to handle page fault for address: 000000005bc02900
[    0.138136] #PF: supervisor instruction fetch in kernel mode
[    0.138136] #PF: error_code(0x0011) - permissions violation
[    0.138136] PGD 18f7063 P4D 18f7063 PUD 18ff063 PMD 190e063 PTE 800000005bc02063
[    0.138136] Oops: 0011 [#1] PREEMPT SMP PTI
[    0.138136] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.19.0-rc6+ #166
[    0.138136] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.138136] RIP: 0010:0x5bc02900
[    0.138136] Code: Unable to access opcode bytes at RIP 0x5bc028d6.
[    0.138136] RSP: 0018:ffffffffb3203e10 EFLAGS: 00010046
[    0.138136] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000048
[    0.138136] RDX: 000000000190dfac RSI: 0000000000001710 RDI: 000000007eae823b
[    0.138136] RBP: ffffffffb3203e70 R08: 0000000001970000 R09: ffffffffb3203e28
[    0.138136] R10: 747563657865206c R11: 6c6977203a696665 R12: 0000000000001710
[    0.138136] R13: 0000000000000030 R14: 0000000001970000 R15: 0000000000000001
[    0.138136] FS:  0000000000000000(0000) GS:ffff8e013ca00000(0000) knlGS:0000000000000000
[    0.138136] CS:  0010 DS: 0018 ES: 0018 CR0: 0000000080050033
[    0.138136] CR2: 000000005bc02900 CR3: 0000000001930000 CR4: 00000000000006f0
[    0.138136] Call Trace:
[    0.138136]  <TASK>
[    0.138136]  ? efi_set_virtual_address_map+0x9c/0x175
[    0.138136]  efi_enter_virtual_mode+0x4a6/0x53e
[    0.138136]  start_kernel+0x67c/0x71e
[    0.138136]  x86_64_start_reservations+0x24/0x2a
[    0.138136]  x86_64_start_kernel+0xe9/0xf4
[    0.138136]  secondary_startup_64_no_verify+0xe5/0xeb
[    0.138136]  </TASK>

That's because it cannot jump to the return thunk from the 32-bit code.
Using a naked RET and marking it as safe allows the system to proceed
booting.

Fixes: aa3d480315ba ("x86: Use return-thunk in asm code")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: <stable@vger.kernel.org>
---

Does this leave one potential attack vector open? Perhaps, since this is
running under a different mapping (AFAIU), the risk is reduced? Or rather, the
attacker could attack using the firmware RETs anyway?

Alternatively, we could use IBPB when available when using the wrapper.

Thoughts?

---
 arch/x86/platform/efi/efi_thunk_64.S | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra July 18, 2022, 4:10 p.m. UTC | #1
On Mon, Jul 18, 2022 at 10:59:18AM -0300, Thadeu Lima de Souza Cascardo wrote:

> And I think you nailed what I had in mind for using IBPB when doing firmware
> calls, and perhaps this is wanted even when we ignore this naked RET here.
> 
> There is a typo on your patch below, but I will give it a try and see if it
> doesn't blow up on AMD systems without IBPB (by way of emulation).

Yeah, Boris pointed out the same. Typing is hard ;-)
Peter Zijlstra July 18, 2022, 4:38 p.m. UTC | #2
On Mon, Jul 18, 2022 at 06:28:27PM +0200, Borislav Petkov wrote:
> On Mon, Jul 18, 2022 at 01:41:37PM +0200, Peter Zijlstra wrote:
> > diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> > index 10a3bfc1eb23..f934dcdb7c0d 100644
> > --- a/arch/x86/include/asm/nospec-branch.h
> > +++ b/arch/x86/include/asm/nospec-branch.h
> > @@ -297,6 +297,8 @@ do {									\
> >  	alternative_msr_write(MSR_IA32_SPEC_CTRL,			\
> >  			      spec_ctrl_current() | SPEC_CTRL_IBRS,	\
> >  			      X86_FEATURE_USE_IBRS_FW);			\
> > +	altnerative_msr_write(MSR_IA32_PRED_CMD, PRED_CMD_IBPB,		\
> > +			      X86_FEATURE_USE_IBPB_FW);			\
> >  } while (0)
> 
> So I'm being told we need to untrain on return from EFI to protect the
> kernel from it. Ontop of yours.

I don't think there's any credible way we can protect against EFI taking
over the system if it wants to. It runs at CPL0 and has access to the
direct map. If EFI wants it can take over the system without trying.
Borislav Petkov July 18, 2022, 6:46 p.m. UTC | #3
On Mon, Jul 18, 2022 at 11:34:02AM -0700, Linus Torvalds wrote:
> Why would we have to protect the kernel from EFI?

Yes, we cleared this up on IRC in the meantime.

This was raised as a concern in case we don't trust EFI. But we cannot
not (double negation on purpose) trust EFI because it can do whatever it
likes anyway, "underneath" the OS.

I'm keeping the UNTRAIN_RET-in-C diff in my patches/ folder, though - I
get the feeling we might need it soon for something else.

:-)
Ard Biesheuvel July 19, 2022, 3:22 p.m. UTC | #4
On Mon, 18 Jul 2022 at 20:46, Borislav Petkov <bp@suse.de> wrote:
>
> On Mon, Jul 18, 2022 at 11:34:02AM -0700, Linus Torvalds wrote:
> > Why would we have to protect the kernel from EFI?
>
> Yes, we cleared this up on IRC in the meantime.
>
> This was raised as a concern in case we don't trust EFI. But we cannot
> not (double negation on purpose) trust EFI because it can do whatever it
> likes anyway, "underneath" the OS.
>
> I'm keeping the UNTRAIN_RET-in-C diff in my patches/ folder, though - I
> get the feeling we might need it soon for something else.
>
> :-)
>

The code in question is a little trampoline that executes from the EFI
mixed mode 1:1 mapping of the kernel text, and never via the kernel
mapping, so we should just move it into .rodata instead (and fix up
the mixed mode virtual address map logic accordingly). I don't think
mapping the kernel text and rodata into the 1;1 EFI map is needed at
all, tbh, and the only thing we ever access via that mapping is that
little trampoline.

Something like

--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -176,7 +176,7 @@ virt_to_phys_or_null_size(void *va, unsigned long size)

 int __init efi_setup_page_tables(unsigned long pa_memmap, unsigned num_pages)
 {
-       unsigned long pfn, text, pf, rodata;
+       unsigned long pfn, pf, rodata;
        struct page *page;
        unsigned npages;
        pgd_t *pgd = efi_mm.pgd;
@@ -222,7 +222,7 @@ int __init efi_setup_page_tables(unsigned long
pa_memmap, unsigned num_pages)
        /*
         * When making calls to the firmware everything needs to be 1:1
         * mapped and addressable with 32-bit pointers. Map the kernel
-        * text and allocate a new stack because we can't rely on the
+        * rodata and allocate a new stack because we can't rely on the
         * stack pointer being < 4GB.
         */
        if (!efi_is_mixed())
@@ -236,21 +236,11 @@ int __init efi_setup_page_tables(unsigned long
pa_memmap, unsigned num_pages)

        efi_mixed_mode_stack_pa = page_to_phys(page + 1); /* stack grows down */

-       npages = (_etext - _text) >> PAGE_SHIFT;
-       text = __pa(_text);
-       pfn = text >> PAGE_SHIFT;
-
-       pf = _PAGE_ENC;
-       if (kernel_map_pages_in_pgd(pgd, pfn, text, npages, pf)) {
-               pr_err("Failed to map kernel text 1:1\n");
-               return 1;
-       }
-
        npages = (__end_rodata - __start_rodata) >> PAGE_SHIFT;
        rodata = __pa(__start_rodata);
        pfn = rodata >> PAGE_SHIFT;

-       pf = _PAGE_NX | _PAGE_ENC;
+       pf = _PAGE_ENC;
        if (kernel_map_pages_in_pgd(pgd, pfn, rodata, npages, pf)) {
                pr_err("Failed to map kernel rodata 1:1\n");
                return 1;
diff --git a/arch/x86/platform/efi/efi_thunk_64.S
b/arch/x86/platform/efi/efi_thunk_64.S
index 854dd81804b7..d4ee75ebfad6 100644
--- a/arch/x86/platform/efi/efi_thunk_64.S
+++ b/arch/x86/platform/efi/efi_thunk_64.S
@@ -71,7 +71,9 @@ STACK_FRAME_NON_STANDARD __efi64_thunk
        pushq   $__KERNEL32_CS
        pushq   %rdi                    /* EFI runtime service address */
        lretq
+SYM_FUNC_END(__efi64_thunk)

+       .section ".rodata", "a"
 1:     movq    0x20(%rsp), %rsp
        pop     %rbx
        pop     %rbp
@@ -81,7 +83,6 @@ STACK_FRAME_NON_STANDARD __efi64_thunk
 2:     pushl   $__KERNEL_CS
        pushl   %ebp
        lret
-SYM_FUNC_END(__efi64_thunk)

        .bss
        .balign 8
Borislav Petkov July 19, 2022, 5:37 p.m. UTC | #5
On Tue, Jul 19, 2022 at 05:22:28PM +0200, Ard Biesheuvel wrote:
> The code in question is a little trampoline that executes from the EFI
> mixed mode 1:1 mapping of the kernel text, and never via the kernel
> mapping, so we should just move it into .rodata instead (and fix up
> the mixed mode virtual address map logic accordingly). I don't think
> mapping the kernel text and rodata into the 1;1 EFI map is needed at
> all, tbh, and the only thing we ever access via that mapping is that
> little trampoline.
> 
> Something like

I'm obviously always for simplifications like that. I'm guessing this
should be tested for a full next release before it goes to Linus?

Thx.
diff mbox series

Patch

diff --git a/arch/x86/platform/efi/efi_thunk_64.S b/arch/x86/platform/efi/efi_thunk_64.S
index 9ffe2bad27d5..4e5257a4811b 100644
--- a/arch/x86/platform/efi/efi_thunk_64.S
+++ b/arch/x86/platform/efi/efi_thunk_64.S
@@ -23,6 +23,7 @@ 
 #include <linux/objtool.h>
 #include <asm/page_types.h>
 #include <asm/segment.h>
+#include <asm/nospec-branch.h>
 
 	.text
 	.code64
@@ -75,7 +76,9 @@  STACK_FRAME_NON_STANDARD __efi64_thunk
 1:	movq	0x20(%rsp), %rsp
 	pop	%rbx
 	pop	%rbp
-	RET
+	ANNOTATE_UNRET_SAFE
+	ret
+	int3
 
 	.code32
 2:	pushl	$__KERNEL_CS