
[v5,0/7] arm64: efi: leave MMU and caches on at boot

Message ID 20221108182204.2447664-1-ardb@kernel.org

Message

Ard Biesheuvel Nov. 8, 2022, 6:21 p.m. UTC
The purpose of this series is to remove any explicit cache maintenance
for coherency during early boot that becomes unnecessary if we simply
retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
and use it to populate the ID map page tables. After setting up this
preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
TCR and SCTLR registers as before, and proceed as usual, avoiding the
need for any manipulations of memory while the MMU and caches are off.

The only properties of the firmware-provided 1:1 map we rely on are that
it does not require any explicit cache maintenance for coherency, and
that it covers the entire memory footprint of the image, including the
BSS and padding at the end - all else is under control of the kernel
itself, as before.
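As a toy illustration of the coherency argument (plain Python with invented names, not kernel code): with the MMU and caches off, every page-table store has to be followed by an explicit clean to the PoC, whereas entering through EFI's cacheable 1:1 mapping makes plain stores sufficient.

```python
# Toy model of the cache-maintenance cost being removed. Not kernel
# code: the function name and the counting are invented for
# illustration only.

def populate_idmap(num_entries, mmu_and_caches_on):
    """Count the explicit clean-to-PoC operations needed while
    writing page-table entries.

    With the MMU and caches off, each store must be cleaned so the
    MMU-on page-table walker sees it; with EFI's cacheable 1:1 map
    retained, cacheable stores are already coherent and no
    maintenance is needed.
    """
    maintenance_ops = 0
    for _ in range(num_entries):
        # a write of one page-table entry would happen here
        if not mmu_and_caches_on:
            maintenance_ops += 1  # one "dc civac"-style clean per write
    return maintenance_ops

print(populate_idmap(512, mmu_and_caches_on=False))  # 512
print(populate_idmap(512, mmu_and_caches_on=True))   # 0
```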

Changes since v4:
- add patch to align the callers of finalise_el2()
- also clean HYP text to the PoC when booting at EL2 with the MMU on
- add a warning and a taint when doing non-EFI boot with the MMU and
  caches enabled
- rebase onto zboot changes in efi/next - this means that patches #6 and
  #7 will not apply onto arm64/for-next so a shared stable branch will
  be needed if we want to queue this up for v6.2

Changes since v3:
- drop EFI_LOADER_CODE memory type patch that has been queued in the
  meantime
- rebased onto [partial] series that moves efi-entry.S into the libstub/
  source directory
- fixed a correctness issue in patch #2

Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>

Ard Biesheuvel (7):
  arm64: head: Move all finalise_el2 calls to after __enable_mmu
  arm64: kernel: move identity map out of .text mapping
  arm64: head: record the MMU state at primary entry
  arm64: head: avoid cache invalidation when entering with the MMU on
  arm64: head: Clean the ID map and the HYP text to the PoC if needed
  arm64: lds: reduce effective minimum image alignment to 64k
  efi: arm64: enter with MMU and caches enabled

 arch/arm64/include/asm/efi.h               |  9 +-
 arch/arm64/kernel/head.S                   | 93 +++++++++++++++-----
 arch/arm64/kernel/image-vars.h             |  5 +-
 arch/arm64/kernel/setup.c                  |  9 +-
 arch/arm64/kernel/sleep.S                  |  6 +-
 arch/arm64/kernel/vmlinux.lds.S            | 13 ++-
 arch/arm64/mm/cache.S                      |  5 +-
 arch/arm64/mm/proc.S                       |  2 -
 drivers/firmware/efi/libstub/Makefile      |  4 +-
 drivers/firmware/efi/libstub/arm64-entry.S | 67 --------------
 drivers/firmware/efi/libstub/arm64-stub.c  | 26 ++++--
 drivers/firmware/efi/libstub/arm64.c       | 41 +++++++--
 include/linux/efi.h                        |  6 +-
 13 files changed, 159 insertions(+), 127 deletions(-)
 delete mode 100644 drivers/firmware/efi/libstub/arm64-entry.S

Comments

Mark Rutland Nov. 11, 2022, 5:36 p.m. UTC | #1
Hi Ard,

Sorry for the late-in-the-day reply here...

On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> The purpose of this series is to remove any explicit cache maintenance
> for coherency during early boot that becomes unnecessary if we simply
> retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> and use it to populate the ID map page tables. After setting up this
> preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> TCR and SCTLR registers as before, and proceed as usual, avoiding the
> need for any manipulations of memory while the MMU and caches are off.
> 
> The only properties of the firmware provided 1:1 map we rely on is that
> it does not require any explicit cache maintenance for coherency, and
> that it covers the entire memory footprint of the image, including the
> BSS and padding at the end - all else is under control of the kernel
> itself, as before.

As a high-level thing, I'm still very much not keen on entering the kernel with
the MMU on. Given that we have to support booting with the MMU off for !EFI
boot (including kexec when EFI is in use), I think this makes it harder to
reason about the boot code overall (e.g. due to the conditional maintenance
added to head.S), and adds more scope for error, even if it simplifies the EFI
stub itself.

I reckon that (sticking with entering with the MMU off), there's more that we
can do to split the table creation into more stages, and to minimize the early
portion of that which has to run with the MMU off. That would benefit non-EFI
boot and kexec, and retain the single-boot-flow that we currently have.

My rough thinking was:

1) Reduce the idmap down to a single page, such that we only need to clear
   NR_PAGETABLE_LEVELS pages to initialize this.

2) Create a small stub at a fixed TTBR1 VA which we use to create a new initial
   mapping of the kernel image (either in TTBR0 as with the current idmap, or
   in TTBR1 directly). The stub logic could be small enough that it could be
   mapped at page granularity, and we'd only need to initialize
   NR_PAGETABLE_LEVELS pages before enabling the MMU.

   This would then bounce onto the next stage, either in TTBR0 directly, or
   bouncing through there as with the TTBR1 replacement logic.

   We could plausibly write that in C, and the early page table asm logic could
   be simplified.
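The page-count arithmetic behind the sketch above can be made concrete (a back-of-envelope Python sketch; `nr_pagetable_levels` is an invented helper here, not a kernel symbol): a table page resolves `page_shift - 3` VA bits per level, so a mapping covering a single page needs one table page per translation level.

```python
import math

# Back-of-envelope: how many table pages must be zeroed for a mapping
# that covers a single page, i.e. one table per translation level.
# nr_pagetable_levels() is an invented helper, not a kernel symbol.

def nr_pagetable_levels(va_bits=48, page_shift=12):
    # A table page holds 2**(page_shift - 3) eight-byte descriptors,
    # so each level resolves (page_shift - 3) VA bits.
    bits_per_level = page_shift - 3
    return math.ceil((va_bits - page_shift) / bits_per_level)

# 4 KiB granule, 48-bit VA: 4 levels -> 4 pages to clear up front.
print(nr_pagetable_levels())            # 4
# 4 KiB granule, 39-bit VA: 3 levels.
print(nr_pagetable_levels(va_bits=39))  # 3
```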

Thanks,
Mark.

> Changes since v4:
> - add patch to align the callers of finalise_el2()
> - also clean HYP text to the PoC when booting at EL2 with the MMU on
> - add a warning and a taint when doing non-EFI boot with the MMU and
>   caches enabled
> - rebase onto zboot changes in efi/next - this means that patches #6 and
>   #7 will not apply onto arm64/for-next so a shared stable branch will
>   be needed if we want to queue this up for v6.2
> 
> Changes since v3:
> - drop EFI_LOADER_CODE memory type patch that has been queued in the
>   mean time
> - rebased onto [partial] series that moves efi-entry.S into the libstub/
>   source directory
> - fixed a correctness issue in patch #2
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> 
> Ard Biesheuvel (7):
>   arm64: head: Move all finalise_el2 calls to after __enable_mmu
>   arm64: kernel: move identity map out of .text mapping
>   arm64: head: record the MMU state at primary entry
>   arm64: head: avoid cache invalidation when entering with the MMU on
>   arm64: head: Clean the ID map and the HYP text to the PoC if needed
>   arm64: lds: reduce effective minimum image alignment to 64k
>   efi: arm64: enter with MMU and caches enabled
> 
>  arch/arm64/include/asm/efi.h               |  9 +-
>  arch/arm64/kernel/head.S                   | 93 +++++++++++++++-----
>  arch/arm64/kernel/image-vars.h             |  5 +-
>  arch/arm64/kernel/setup.c                  |  9 +-
>  arch/arm64/kernel/sleep.S                  |  6 +-
>  arch/arm64/kernel/vmlinux.lds.S            | 13 ++-
>  arch/arm64/mm/cache.S                      |  5 +-
>  arch/arm64/mm/proc.S                       |  2 -
>  drivers/firmware/efi/libstub/Makefile      |  4 +-
>  drivers/firmware/efi/libstub/arm64-entry.S | 67 --------------
>  drivers/firmware/efi/libstub/arm64-stub.c  | 26 ++++--
>  drivers/firmware/efi/libstub/arm64.c       | 41 +++++++--
>  include/linux/efi.h                        |  6 +-
>  13 files changed, 159 insertions(+), 127 deletions(-)
>  delete mode 100644 drivers/firmware/efi/libstub/arm64-entry.S
> 
> -- 
> 2.35.1
>
Will Deacon Nov. 15, 2022, 11:17 a.m. UTC | #2
On Fri, Nov 11, 2022 at 05:36:19PM +0000, Mark Rutland wrote:
> On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> > The purpose of this series is to remove any explicit cache maintenance
> > for coherency during early boot that becomes unnecessary if we simply
> > retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> > and use it to populate the ID map page tables. After setting up this
> > preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> > TCR and SCTLR registers as before, and proceed as usual, avoiding the
> > need for any manipulations of memory while the MMU and caches are off.
> > 
> > The only properties of the firmware provided 1:1 map we rely on is that
> > it does not require any explicit cache maintenance for coherency, and
> > that it covers the entire memory footprint of the image, including the
> > BSS and padding at the end - all else is under control of the kernel
> > itself, as before.
> 
> As a high-level thing, I'm still very much not keen on entering the kernel with
> the MMU on. Given that we have to support booting with the MMU off for !EFI
> boot (including kexec when EFI is in use), I think this makes it harder to
> reason about the boot code overall (e.g. due to the conditional maintenance
> added to head.S), and adds more scope for error, even if it simplifies the EFI
> stub itself.

As discussed offline, two things that would help the current series are:

  (1) Some performance numbers comparing MMU off vs MMU on boot

  (2) Use of a separate entry point for the MMU on case, potentially failing
      the boot if the MMU is on and we're not using EFI

Will
Ard Biesheuvel Nov. 15, 2022, 11:21 a.m. UTC | #3
On Tue, 15 Nov 2022 at 12:17, Will Deacon <will@kernel.org> wrote:
>
> On Fri, Nov 11, 2022 at 05:36:19PM +0000, Mark Rutland wrote:
> > On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> > > The purpose of this series is to remove any explicit cache maintenance
> > > for coherency during early boot that becomes unnecessary if we simply
> > > retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> > > and use it to populate the ID map page tables. After setting up this
> > > preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> > > TCR and SCTLR registers as before, and proceed as usual, avoiding the
> > > need for any manipulations of memory while the MMU and caches are off.
> > >
> > > The only properties of the firmware provided 1:1 map we rely on is that
> > > it does not require any explicit cache maintenance for coherency, and
> > > that it covers the entire memory footprint of the image, including the
> > > BSS and padding at the end - all else is under control of the kernel
> > > itself, as before.
> >
> > As a high-level thing, I'm still very much not keen on entering the kernel with
> > the MMU on. Given that we have to support booting with the MMU off for !EFI
> > boot (including kexec when EFI is in use), I think this makes it harder to
> > reason about the boot code overall (e.g. due to the conditional maintenance
> > added to head.S), and adds more scope for error, even if it simplifies the EFI
> > stub itself.
>
> As discussed offline, two things that would help the current series are:
>
>   (1) Some performance numbers comparing MMU off vs MMU on boot
>
>   (2) Use of a separate entry point for the MMU on case, potentially failing
>       the boot if the MMU is on and we're not using EFI
>

Ack.

But thinking about (2) again, failing the boot is better done at a
time when you can inform the user about it, no?

IOW, just going into a deadloop really early if you enter the bare
metal entry point with the MMU on is going to be hard to distinguish
from other issues, whereas panicking after the console up is more
likely to help getting the actual issue diagnosed.

So perhaps we should panic() instead of warn+taint when this condition
occurs, and do it from an early initcall instead of from setup_arch().
Will Deacon Nov. 15, 2022, 11:31 a.m. UTC | #4
On Tue, Nov 15, 2022 at 12:21:55PM +0100, Ard Biesheuvel wrote:
> On Tue, 15 Nov 2022 at 12:17, Will Deacon <will@kernel.org> wrote:
> >
> > On Fri, Nov 11, 2022 at 05:36:19PM +0000, Mark Rutland wrote:
> > > On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> > > > The purpose of this series is to remove any explicit cache maintenance
> > > > for coherency during early boot that becomes unnecessary if we simply
> > > > retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> > > > and use it to populate the ID map page tables. After setting up this
> > > > preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> > > > TCR and SCTLR registers as before, and proceed as usual, avoiding the
> > > > need for any manipulations of memory while the MMU and caches are off.
> > > >
> > > > The only properties of the firmware provided 1:1 map we rely on is that
> > > > it does not require any explicit cache maintenance for coherency, and
> > > > that it covers the entire memory footprint of the image, including the
> > > > BSS and padding at the end - all else is under control of the kernel
> > > > itself, as before.
> > >
> > > As a high-level thing, I'm still very much not keen on entering the kernel with
> > > the MMU on. Given that we have to support booting with the MMU off for !EFI
> > > boot (including kexec when EFI is in use), I think this makes it harder to
> > > reason about the boot code overall (e.g. due to the conditional maintenance
> > > added to head.S), and adds more scope for error, even if it simplifies the EFI
> > > stub itself.
> >
> > As discussed offline, two things that would help the current series are:
> >
> >   (1) Some performance numbers comparing MMU off vs MMU on boot
> >
> >   (2) Use of a separate entry point for the MMU on case, potentially failing
> >       the boot if the MMU is on and we're not using EFI
> >
> 
> Ack.
> 
> But thinking about (2) again, failing the boot is better done at a
> time when you can inform the user about it, no?
> 
> IOW, just going into a deadloop really early if you enter the bare
> metal entry point with the MMU on is going to be hard to distinguish
> from other issues, whereas panicking after the console up is more
> likely to help getting the actual issue diagnosed.

Agreed.

> So perhaps we should panic() instead of warn+taint when this condition
> occurs, and do it from an early initcall instead of from setup_arch().

To be honest, and I appreciate that this is unhelpful, I'm fine with
the warn+taint and prefer that to a fatal stop.

Will
Ard Biesheuvel Nov. 26, 2022, 2:16 p.m. UTC | #5
On Tue, 15 Nov 2022 at 12:31, Will Deacon <will@kernel.org> wrote:
>
> On Tue, Nov 15, 2022 at 12:21:55PM +0100, Ard Biesheuvel wrote:
> > On Tue, 15 Nov 2022 at 12:17, Will Deacon <will@kernel.org> wrote:
> > >
> > > On Fri, Nov 11, 2022 at 05:36:19PM +0000, Mark Rutland wrote:
> > > > On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> > > > > The purpose of this series is to remove any explicit cache maintenance
> > > > > for coherency during early boot that becomes unnecessary if we simply
> > > > > retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> > > > > and use it to populate the ID map page tables. After setting up this
> > > > > preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> > > > > TCR and SCTLR registers as before, and proceed as usual, avoiding the
> > > > > need for any manipulations of memory while the MMU and caches are off.
> > > > >
> > > > > The only properties of the firmware provided 1:1 map we rely on is that
> > > > > it does not require any explicit cache maintenance for coherency, and
> > > > > that it covers the entire memory footprint of the image, including the
> > > > > BSS and padding at the end - all else is under control of the kernel
> > > > > itself, as before.
> > > >
> > > > As a high-level thing, I'm still very much not keen on entering the kernel with
> > > > the MMU on. Given that we have to support booting with the MMU off for !EFI
> > > > boot (including kexec when EFI is in use), I think this makes it harder to
> > > > reason about the boot code overall (e.g. due to the conditional maintenance
> > > > added to head.S), and adds more scope for error, even if it simplifies the EFI
> > > > stub itself.
> > >
> > > As discussed offline, two things that would help the current series are:
> > >
> > >   (1) Some performance numbers comparing MMU off vs MMU on boot
> > >

Finally got around to measuring this - I lost access to my TX2 machine
for a couple of days during the past week.

With the patch below applied to mainline, I measure ~6 ms spent
cleaning the entire image to the PoC (which is the bulk of it) and
subsequently populating the initial ID map and activating it.

This drops to about 0.6 ms with my changes applied. This is unlikely
to ever matter in practice, perhaps, but I will note that booting a VM
in EFI mode using Tianocore/EDK2 from the point where KVM clears the
counter to the point where we start user space can be done (on the
same machine) in 500-700 ms so it is not entirely insignificant
either.

I could try and measure it on bare metal as well, but I suppose that
launch times are even less relevant there so I didn't bother.
Ard Biesheuvel Nov. 26, 2022, 2:17 p.m. UTC | #6
On Sat, 26 Nov 2022 at 15:16, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Tue, 15 Nov 2022 at 12:31, Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Nov 15, 2022 at 12:21:55PM +0100, Ard Biesheuvel wrote:
> > > On Tue, 15 Nov 2022 at 12:17, Will Deacon <will@kernel.org> wrote:
> > > >
> > > > On Fri, Nov 11, 2022 at 05:36:19PM +0000, Mark Rutland wrote:
> > > > > On Tue, Nov 08, 2022 at 07:21:57PM +0100, Ard Biesheuvel wrote:
> > > > > > The purpose of this series is to remove any explicit cache maintenance
> > > > > > for coherency during early boot that becomes unnecessary if we simply
> > > > > > retain the cacheable 1:1 mapping of all of system RAM provided by EFI,
> > > > > > and use it to populate the ID map page tables. After setting up this
> > > > > > preliminary ID map, we disable the MMU, drop to EL1, reprogram the MAIR,
> > > > > > TCR and SCTLR registers as before, and proceed as usual, avoiding the
> > > > > > need for any manipulations of memory while the MMU and caches are off.
> > > > > >
> > > > > > The only properties of the firmware provided 1:1 map we rely on is that
> > > > > > it does not require any explicit cache maintenance for coherency, and
> > > > > > that it covers the entire memory footprint of the image, including the
> > > > > > BSS and padding at the end - all else is under control of the kernel
> > > > > > itself, as before.
> > > > >
> > > > > As a high-level thing, I'm still very much not keen on entering the kernel with
> > > > > the MMU on. Given that we have to support booting with the MMU off for !EFI
> > > > > boot (including kexec when EFI is in use), I think this makes it harder to
> > > > > reason about the boot code overall (e.g. due to the conditional maintenance
> > > > > added to head.S), and adds more scope for error, even if it simplifies the EFI
> > > > > stub itself.
> > > >
> > > > As discussed offline, two things that would help the current series are:
> > > >
> > > >   (1) Some performance numbers comparing MMU off vs MMU on boot
> > > >
>
> Finally got around to measuring this - I lost access to my TX2 machine
> for a couple of days during the past week,
>
> With the patch below applied to mainline, I measure ~6 ms spent
> cleaning the entire image to the PoC (which is the bulk of it) and
> subsequently populating the initial ID map and activating it.
>
> This drops to about 0.6 ms with my changes applied. This is unlikely
> to ever matter in practice, perhaps, but I will note that booting a VM
> in EFI mode using Tianocore/EDK2 from the point where KVM clears the
> counter to the point where we start user space can be done (on the
> same machine) in 500-700 ms so it is not entirely insignificant
> either.
>
> I could try and measure it on bare metal as well, but I suppose that
> launch times are even less relevant there so I didn't bother.

diff --git a/arch/arm64/kernel/efi-entry.S b/arch/arm64/kernel/efi-entry.S
index 61a87fa1c3055e26..27f59784a1c0be2c 100644
--- a/arch/arm64/kernel/efi-entry.S
+++ b/arch/arm64/kernel/efi-entry.S
@@ -22,6 +22,7 @@ SYM_CODE_START(efi_enter_kernel)
        ldr     w2, =primary_entry_offset
        add     x19, x0, x2             // relocated Image entrypoint
        mov     x20, x1                 // DTB address
+       mrs     x27, cntvct_el0

        /*
         * Clean the copied Image to the PoC, and ensure it is not shadowed by
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 2196aad7b55bcef0..068a7d111836382b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -430,6 +430,8 @@ SYM_FUNC_START_LOCAL(__primary_switched)

        str_l   x21, __fdt_pointer, x5          // Save FDT pointer

+       str_l   x27, boot_args + 8, x5
+
        ldr_l   x4, kimage_vaddr                // Save the offset between
        sub     x4, x4, x0                      // the kernel virtual and
        str_l   x4, kimage_voffset, x5          // physical mappings
@@ -797,6 +799,10 @@ SYM_FUNC_START_LOCAL(__primary_switch)
        adrp    x1, reserved_pg_dir
        adrp    x2, init_idmap_pg_dir
        bl      __enable_mmu
+
+       mrs     x0, cntvct_el0
+       sub     x27, x0, x27
+
 #ifdef CONFIG_RELOCATABLE
        adrp    x23, KERNEL_START
        and     x23, x23, MIN_KIMG_ALIGN - 1
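For completeness, converting the x27 tick delta captured above into wall-clock time is a division by the generic-timer frequency (CNTFRQ_EL0). A sketch in Python, where the 25 MHz frequency is an assumed illustrative value, not a measured one:

```python
# Convert a CNTVCT_EL0 tick delta to milliseconds. On real hardware
# the frequency must be read from CNTFRQ_EL0; the 25 MHz used below
# is only an assumed example value.

def cntvct_delta_to_ms(delta_ticks, cntfrq_hz):
    return delta_ticks * 1000.0 / cntfrq_hz

# e.g. 150000 ticks at an assumed 25 MHz -> 6 ms
print(cntvct_delta_to_ms(150_000, 25_000_000))  # 6.0
```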