[RFCv2,0/9] UEFI emulator for kexec

Message ID 20240819145417.23367-1-piliu@redhat.com

Pingfan Liu Aug. 19, 2024, 2:53 p.m. UTC
*** Background ***

As more PE format kernel images are introduced, they pose a challenge for
kexec, which has to cope with the new format.

In my attempt to add support for arm64 zboot image in the kernel [1],
Ard suggested using an emulator to tackle this issue.  Last year, when
Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
emulator approach again [3]

After discussion, Ard's approach seems to be a more promising solution
to handle PE format kernels once and for all.  This series follows that
approach and implements an emulator to emulate EFI boot time services,
allowing the efistub kernel to self-extract and boot.

Another year has passed, and UKI kernels are used more and more frequently
in products. I think it is time to make the effort to resolve this issue.


*** Overview of implementation ***
The whole model consists of three parts:

-1. The emulator
It is self-relocatable PIC code that is ultimately linked into the kernel but
does not export any internal symbols to it.  It mainly contains a PE file
parser, which loads the PE format kernel, and a group of functions that
emulate EFI boot services.

-2. The PE-format loader inside the kernel
Its main task is to set up two extra kexec_segments: one for the emulator, the
other for passing information from the first kernel to the emulator.

-3. Identity mapping set up only for the memory used by the emulator
It relies on kimage_alloc_control_pages() to get pages that will not be
overwritten during kexec relocation (the copy from source to destination).
Since the mapping covers only a small range of memory, the page tables cost
little memory.
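
To illustrate the first step any PE file parser has to take, here is a
minimal sketch in plain C. It is not taken from pe_loader.c or
kexec_pe_image.c; the helper names read_le32() and looks_like_pe() are
made up for this example. It relies only on the PE/COFF layout: an "MZ"
magic at offset 0 and, at offset 0x3c, the file offset of the "PE\0\0"
signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the e_lfanew field in the DOS header; it holds the file
 * offset of the "PE\0\0" signature. */
#define DOS_LFANEW_OFFSET 0x3c

/* Read a little-endian 32-bit value without assuming alignment. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: return 1 if buf looks like a PE image, else 0. */
static int looks_like_pe(const uint8_t *buf, size_t len)
{
	uint32_t pe_off;

	/* The buffer must at least contain the e_lfanew field. */
	if (len < DOS_LFANEW_OFFSET + 4)
		return 0;
	/* DOS header magic. */
	if (buf[0] != 'M' || buf[1] != 'Z')
		return 0;
	/* The PE signature must lie entirely inside the buffer. */
	pe_off = read_le32(buf + DOS_LFANEW_OFFSET);
	if (pe_off > len || len - pe_off < 4)
		return 0;
	return memcmp(buf + pe_off, "PE\0\0", 4) == 0;
}
```

A real loader would go on to validate the COFF header, the optional header
and the section table before mapping anything; this sketch stops at the
signature check.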


*** To do ***

Currently, it only works on the arm64 virt machine. x86 needs some slight
changes, which I plan to make in the next version.

Also, this series does not implement a memory allocator; I plan to
implement one based on a bitmap.

As for the console, it is currently hard-coded for the arm64 virt machine;
later it should extract the information from the ACPI tables.

The kdump code is not implemented yet, but it should share the majority of
this series.


*** Test of this series ***
I have tested this series on the arm64 virt machine. There I booted
vmlinuz.efi, loaded a UKI image with kexec_file_load, and then switched to the
second kernel.

I used a modified kexec-tools [4], which just skips the file format check and
passes the file directly to the kernel.

[1]: https://lore.kernel.org/linux-arm-kernel/ZBvKSis+dfnqa+Vz@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
[2]: https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/
[3]: https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/#mc60aa591cb7616ceb39e1c98f352383f9ba6e985
[4]: https://github.com/pfliu/kexec-tools.git branch: kexec_uefi_emulator


RFCv1 -> RFCv2:
-1. Support running a UKI kernel: add LoadImage() and StartImage(), add
    PE file relocation support, add InstallMultiProtocol()
-2. Also set up the idmap for the EFI runtime memory descriptors, since
    UKI's systemd-stub calls runtime services
-3. Move kexec_pe_image.c from arch/arm64/kernel to kernel/, since it
    aims to provide more general architecture support.

RFCv1: https://lore.kernel.org/linux-efi/20240718085759.13247-1-piliu@redhat.com/
RFCv2: https://github.com/pfliu/linux.git  branch kexec_uefi_emulator_RFCv2

Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Jan Hendrik Farr <kernel@jfarr.cc>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: kexec@lists.infradead.org
Cc: linux-efi@vger.kernel.org
Cc: linux-kernel@vger.kernel.org



Pingfan Liu (9):
  efi/libstub: Ask efi_random_alloc() to skip unusable memory
  efi/libstub: Complete efi_simple_text_output_protocol
  efi/emulator: Initial rountines to emulate EFI boot time service
  efi/emulator: Turn on mmu for arm64
  kexec: Introduce kexec_pe_image to parse and load PE file
  arm64: kexec: Introduce a new member param_mem to kimage_arch
  arm64: mm: Change to prototype of
  arm64: kexec: Prepare page table for emulator
  arm64: kexec: Enable kexec_pe_image

 arch/arm64/Kconfig                            |   4 +
 arch/arm64/include/asm/kexec.h                |   1 +
 arch/arm64/include/asm/mmu.h                  |   6 +
 arch/arm64/kernel/asm-offsets.c               |   2 +-
 arch/arm64/kernel/machine_kexec.c             | 103 +++-
 arch/arm64/kernel/machine_kexec_file.c        |   4 +
 arch/arm64/kernel/relocate_kernel.S           |   2 +-
 arch/arm64/mm/mmu.c                           |  67 ++-
 drivers/firmware/efi/Makefile                 |   1 +
 drivers/firmware/efi/efi_emulator/Makefile    |  99 ++++
 .../firmware/efi/efi_emulator/amba-pl011.c    |  81 +++
 .../efi_emulator/arm64_emulator_service.lds   |  45 ++
 .../firmware/efi/efi_emulator/arm64_proc.S    | 175 ++++++
 .../firmware/efi/efi_emulator/config_table.c  |  25 +
 drivers/firmware/efi/efi_emulator/core.c      | 376 +++++++++++++
 .../firmware/efi/efi_emulator/device_handle.c | 138 +++++
 drivers/firmware/efi/efi_emulator/earlycon.h  |  19 +
 .../firmware/efi/efi_emulator/efi_emulator.S  |  12 +
 drivers/firmware/efi/efi_emulator/emulator.h  | 106 ++++
 drivers/firmware/efi/efi_emulator/entry.c     |  68 +++
 drivers/firmware/efi/efi_emulator/head.S      |  10 +
 drivers/firmware/efi/efi_emulator/lib.c       |  73 +++
 drivers/firmware/efi/efi_emulator/memory.c    |  27 +
 .../firmware/efi/efi_emulator/memory_api.c    |  74 +++
 drivers/firmware/efi/efi_emulator/misc.c      |  43 ++
 drivers/firmware/efi/efi_emulator/pe_loader.c | 173 ++++++
 drivers/firmware/efi/efi_emulator/printf.c    | 373 +++++++++++++
 .../efi/efi_emulator/protocol_device_path.c   |  75 +++
 .../protocol_simple_text_output.c             |  50 ++
 drivers/firmware/efi/libstub/efistub.h        |   7 +
 drivers/firmware/efi/libstub/randomalloc.c    |   5 +
 include/linux/efi_emulator.h                  |  46 ++
 include/linux/kexec.h                         |   6 +
 kernel/Makefile                               |   1 +
 kernel/kexec_pe_image.c                       | 503 ++++++++++++++++++
 35 files changed, 2764 insertions(+), 36 deletions(-)
 create mode 100644 drivers/firmware/efi/efi_emulator/Makefile
 create mode 100644 drivers/firmware/efi/efi_emulator/amba-pl011.c
 create mode 100644 drivers/firmware/efi/efi_emulator/arm64_emulator_service.lds
 create mode 100644 drivers/firmware/efi/efi_emulator/arm64_proc.S
 create mode 100644 drivers/firmware/efi/efi_emulator/config_table.c
 create mode 100644 drivers/firmware/efi/efi_emulator/core.c
 create mode 100644 drivers/firmware/efi/efi_emulator/device_handle.c
 create mode 100644 drivers/firmware/efi/efi_emulator/earlycon.h
 create mode 100644 drivers/firmware/efi/efi_emulator/efi_emulator.S
 create mode 100644 drivers/firmware/efi/efi_emulator/emulator.h
 create mode 100644 drivers/firmware/efi/efi_emulator/entry.c
 create mode 100644 drivers/firmware/efi/efi_emulator/head.S
 create mode 100644 drivers/firmware/efi/efi_emulator/lib.c
 create mode 100644 drivers/firmware/efi/efi_emulator/memory.c
 create mode 100644 drivers/firmware/efi/efi_emulator/memory_api.c
 create mode 100644 drivers/firmware/efi/efi_emulator/misc.c
 create mode 100644 drivers/firmware/efi/efi_emulator/pe_loader.c
 create mode 100644 drivers/firmware/efi/efi_emulator/printf.c
 create mode 100644 drivers/firmware/efi/efi_emulator/protocol_device_path.c
 create mode 100644 drivers/firmware/efi/efi_emulator/protocol_simple_text_output.c
 create mode 100644 include/linux/efi_emulator.h
 create mode 100644 kernel/kexec_pe_image.c

Comments

Pingfan Liu Aug. 20, 2024, 12:58 a.m. UTC | #1
On Tue, Aug 20, 2024 at 2:00 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
>
> On Mon Aug 19, 2024 at 5:53 PM EEST, Pingfan Liu wrote:
> > efi_random_alloc() demands EFI_ALLOCATE_ADDRESS when allocate_pages(),
> > but the current implement can not ensure the selected target locates
> > inside free area, that is to exclude EFI_BOOT_SERVICES_*,
> > EFI_RUNTIME_SERVICES_* etc.
> >
> > Fix the issue by checking md->type.
>
> If it is a fix shouldn't this have a fixes tag?
>
Yes, I will add the following tag in the next version:
Fixes: 2ddbfc81eac8 ("efi: stub: add implementation of efi_random_alloc()")

> >
> > Signed-off-by: Pingfan Liu <piliu@redhat.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > To: linux-efi@vger.kernel.org
> > ---
> >  drivers/firmware/efi/libstub/randomalloc.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/firmware/efi/libstub/randomalloc.c b/drivers/firmware/efi/libstub/randomalloc.c
> > index c41e7b2091cdd..7304e767688f2 100644
> > --- a/drivers/firmware/efi/libstub/randomalloc.c
> > +++ b/drivers/firmware/efi/libstub/randomalloc.c
> > @@ -79,6 +79,8 @@ efi_status_t efi_random_alloc(unsigned long size,
> >               efi_memory_desc_t *md = (void *)map->map + map_offset;
> >               unsigned long slots;
> >
>
> I'd add this inline comment:
>
> /* Skip "unconventional" memory: */
>

Will do.

Thanks for your kind review.

Best Regards,

Pingfan

> > +             if (!(md->type & (EFI_CONVENTIONAL_MEMORY || EFI_PERSISTENT_MEMORY)))
> > +                     continue;
> >               slots = get_entry_num_slots(md, size, ilog2(align), alloc_min,
> >                                           alloc_max);
> >               MD_NUM_SLOTS(md) = slots;
> > @@ -111,6 +113,9 @@ efi_status_t efi_random_alloc(unsigned long size,
> >               efi_physical_addr_t target;
> >               unsigned long pages;
> >
> > +             if (!(md->type & (EFI_CONVENTIONAL_MEMORY || EFI_PERSISTENT_MEMORY)))
> > +                     continue;
> > +
> >               if (total_mirrored_slots > 0 &&
> >                   !(md->attribute & EFI_MEMORY_MORE_RELIABLE))
> >                       continue;
>
> BR, Jarkko
>
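
A side note on the check in the hunks above, written as a small standalone
C sketch rather than as a statement about what the next version will
contain. In the UEFI specification the memory type is an enumeration
value, not a bit mask, and "||" is a logical OR that evaluates to 1, so
the posted expression only tests bit 0 of md->type. The function names
below are made up for this example; the numeric values are the ones the
UEFI specification assigns to these memory types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory type values as defined by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE   3
#define EFI_CONVENTIONAL_MEMORY  7
#define EFI_PERSISTENT_MEMORY   14

/* The expression from the patch: (7 || 14) is 1, so this tests bit 0. */
static bool usable_as_posted(uint32_t type)
{
	return type & (EFI_CONVENTIONAL_MEMORY || EFI_PERSISTENT_MEMORY);
}

/* Comparing the enumeration value against each accepted type. */
static bool usable_by_compare(uint32_t type)
{
	return type == EFI_CONVENTIONAL_MEMORY ||
	       type == EFI_PERSISTENT_MEMORY;
}
```

With the posted expression, EFI_BOOT_SERVICES_CODE (3) is accepted because
it is odd, and EFI_PERSISTENT_MEMORY (14) is rejected because it is even;
the explicit comparison accepts exactly the two intended types.
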
Lennart Poettering Aug. 21, 2024, 2:27 p.m. UTC | #2
On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:

> *** Background ***
>
> As more PE format kernel images are introduced, it post challenge to kexec to
> cope with the new format.
>
> In my attempt to add support for arm64 zboot image in the kernel [1],
> Ard suggested using an emulator to tackle this issue.  Last year, when
> Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> emulator approach again [3]

Hmm, systemd's systemd-stub code tries to load certain "side-car"
files placed next to the UKI, via the UEFI file system APIs. What's
your intention with the UEFI emulator regarding that? The sidecars are
somewhat important, because that's how we parameterize otherwise
strictly sealed, immutable UKIs.

Hence, what's the story there? implement some form of fs driver (for
what fs precisely?) in the emulator too?

And regarding tpm? tpms require drivers and i guess at the moment uefi
emulator would run those aren't available anymore? but we really
should do a separator measurement then. (also there needs to be some
way to pass over measurement log of that measurement?)

Lennart

--
Lennart Poettering, Berlin
Pingfan Liu Aug. 22, 2024, 5:42 a.m. UTC | #3
On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
<mzxreary@0pointer.de> wrote:
>
> On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
>
> > *** Background ***
> >
> > As more PE format kernel images are introduced, it post challenge to kexec to
> > cope with the new format.
> >
> > In my attempt to add support for arm64 zboot image in the kernel [1],
> > Ard suggested using an emulator to tackle this issue.  Last year, when
> > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > emulator approach again [3]
>
> Hmm, systemd's systemd-stub code tries to load certain "side-car"
> files placed next to the UKI, via the UEFI file system APIs. What's
> your intention with the UEFI emulator regarding that? The sidecars are
> somewhat important, because that's how we parameterize otherwise
> strictly sealed, immutable UKIs.
>
IIUC, you are referring to UKI addons.

> Hence, what's the story there? implement some form of fs driver (for
> what fs precisely?) in the emulator too?
>
As for addons, that part is missing from this series; I overlooked the
issue. Originally, I thought there was no need to implement a disk
driver and a vfat file system: just preload the files into memory and
present them through the UEFI API. I will take a closer look at it and
think it over.

> And regarding tpm? tpms require drivers and i guess at the moment uefi
> emulator would run those aren't available anymore? but we really
> should do a separator measurement then. (also there needs to be some
> way to pass over measurement log of that measurement?)
>

Unfortunately, this is a common issue that persists with kexec-rebooted
kernels today.
I am not familiar with TPM and have no clear idea for the time being
(emulating Platform Configuration Registers?).  But since this
emulator is held inside the Linux kernel image, and the UKI's signature
is checked during kexec_file_load, both are safe from modification, so
this security aspect is not an urgent issue.

Thanks for sharing your thoughts and insights.

Best Regards,

Pingfan

> Lennart
>
> --
> Lennart Poettering, Berlin
>
Dave Young Aug. 22, 2024, 6:16 a.m. UTC | #4
On Thu, 22 Aug 2024 at 13:42, Pingfan Liu <piliu@redhat.com> wrote:
>
> On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> <mzxreary@0pointer.de> wrote:
> >
> > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> >
> > > *** Background ***
> > >
> > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > cope with the new format.
> > >
> > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > emulator approach again [3]
> >
> > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > files placed next to the UKI, via the UEFI file system APIs. What's
> > your intention with the UEFI emulator regarding that? The sidecars are
> > somewhat important, because that's how we parameterize otherwise
> > strictly sealed, immutable UKIs.
> >
> IIUC, you are referring to UKI addons.
>
> > Hence, what's the story there? implement some form of fs driver (for
> > what fs precisely?) in the emulator too?
> >
> As for addon, that is a missing part in this series. I have overlooked
> this issue. Originally, I thought that there was no need to implement
> a disk driver and vfat file system, just preload them into memory, and
> finally present them through the uefi API. I will take a closer look
> at it and chew on it.
>

Hi Pingfan,

If more and more stuff needs to come in, beyond the limited boot
services, then it will be way too complicated and hard to maintain and
debug. Also, the two kexec code paths are duplicated somehow. It is
really bad.

I forgot why we cannot just extract the kernel from the UKI and then load
it directly. If the embedded kernel is also signed, it should be good?

Thanks
Dave
Lennart Poettering Aug. 22, 2024, 8:23 a.m. UTC | #5
On Do, 22.08.24 13:42, Pingfan Liu (piliu@redhat.com) wrote:

 > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> <mzxreary@0pointer.de> wrote:
> >
> > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> >
> > > *** Background ***
> > >
> > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > cope with the new format.
> > >
> > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > emulator approach again [3]
> >
> > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > files placed next to the UKI, via the UEFI file system APIs. What's
> > your intention with the UEFI emulator regarding that? The sidecars are
> > somewhat important, because that's how we parameterize otherwise
> > strictly sealed, immutable UKIs.
> >
> IIUC, you are referring to UKI addons.

Yeah, UKI addons, as well as credential files, and sysext/confext
DDIs.

The addons are the most interesting btw, because we load them into
memory as PE files, and ask the UEFI to authenticate them.

> > Hence, what's the story there? implement some form of fs driver (for
> > what fs precisely?) in the emulator too?
> >
> As for addon, that is a missing part in this series. I have overlooked
> this issue. Originally, I thought that there was no need to implement
> a disk driver and vfat file system, just preload them into memory, and
> finally present them through the uefi API. I will take a closer look
> at it and chew on it.

It doesn't have to be VFAT btw. It just has to be something. For
example, it might suffice to take these files, pack them up as cpio or
so and pass them along with the UEFI execution. The UEFI emulator
would then have to expose them as a file system.

We are not talking of a bazillion of files here, it's mostly a
smallish number of sidecar files I'd expect.

> > And regarding tpm? tpms require drivers and i guess at the moment uefi
> > emulator would run those aren't available anymore? but we really
> > should do a separator measurement then. (also there needs to be some
> > way to pass over measurement log of that measurement?)
>
> It is a pity that it is a common issue persistent with kexec-reboot
> kernel nowadays.
> I am not familiar with TPM and have no clear idea for the time being.
> (emulating Platform Configuration Registers ?).  But since this
> emulator is held inside a linux kernel image, and the UKI's signature
> is checked during kexec_file_load. All of them are safe from
> modification, this security is not an urgent issue.

Hmm, I'd really think about this with some priority. The measurement
stuff should not be an afterthought, it typically has major
implications on how you design your transitions, because measurements
of some component always need to happen *before* you pass control to
it, otherwise they are pointless.

Lennart

--
Lennart Poettering, Berlin
Pingfan Liu Aug. 22, 2024, 10:45 a.m. UTC | #6
On Thu, Aug 22, 2024 at 4:23 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Do, 22.08.24 13:42, Pingfan Liu (piliu@redhat.com) wrote:
>
>  > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> > <mzxreary@0pointer.de> wrote:
> > >
> > > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> > >
> > > > *** Background ***
> > > >
> > > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > > cope with the new format.
> > > >
> > > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > > emulator approach again [3]
> > >
> > > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > > files placed next to the UKI, via the UEFI file system APIs. What's
> > > your intention with the UEFI emulator regarding that? The sidecars are
> > > somewhat important, because that's how we parameterize otherwise
> > > strictly sealed, immutable UKIs.
> > >
> > IIUC, you are referring to UKI addons.
>
> Yeah, UKI addons, as well as credential files, and sysext/confext
> DDIs.
>
> The addons are the most interesting btw, because we load them into
> memory as PE files, and ask the UEFI to authenticate them.
>
> > > Hence, what's the story there? implement some form of fs driver (for
> > > what fs precisely?) in the emulator too?
> > >
> > As for addon, that is a missing part in this series. I have overlooked
> > this issue. Originally, I thought that there was no need to implement
> > a disk driver and vfat file system, just preload them into memory, and
> > finally present them through the uefi API. I will take a closer look
> > at it and chew on it.
>
> It doesn't have to be VFAT btw. It just has to be something. For
> example, it might suffice to take these files, pack them up as cpio or
> so and pass them along with the UEFI execution. The UEFI emulator
> would then have to expose them as a file system then.
>
> We are not talking of a bazillion of files here, it's mostly a
> smallish number of sidecar files I'd expect.
>
Yes, I am thinking about using <key, value> pairs, where the key is the
file path and the value is the file content.
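
As a rough illustration of the <key, value> idea, here is a minimal sketch
in plain C. Nothing in it comes from the series: struct sidecar_file and
sidecar_lookup() are hypothetical names, and the keys are plain ASCII
strings for simplicity, whereas the UEFI file protocol passes UCS-2 paths
with backslash separators.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry: one sidecar file preloaded by the first kernel. */
struct sidecar_file {
	const char *path;   /* key: path as the stub would request it */
	const void *data;   /* value: file content, already in memory */
	size_t size;        /* length of the content in bytes */
};

/* Linear lookup; fine for the small number of files expected here. */
static const struct sidecar_file *
sidecar_lookup(const struct sidecar_file *tbl, size_t n, const char *path)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].path, path) == 0)
			return &tbl[i];
	return NULL;
}
```

An emulated Open() call could then resolve a path with sidecar_lookup() and
serve Read() requests straight from the data pointer, so no disk driver or
on-disk file system format would be needed in the emulator.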

> > > And regarding tpm? tpms require drivers and i guess at the moment uefi
> > > emulator would run those aren't available anymore? but we really
> > > should do a separator measurement then. (also there needs to be some
> > > way to pass over measurement log of that measurement?)
> >
> > It is a pity that it is a common issue persistent with kexec-reboot
> > kernel nowadays.
> > I am not familiar with TPM and have no clear idea for the time being.
> > (emulating Platform Configuration Registers ?).  But since this
> > emulator is held inside a linux kernel image, and the UKI's signature
> > is checked during kexec_file_load. All of them are safe from
> > modification, this security is not an urgent issue.
>
> Hmm, I'd really think about this with some priority. The measurement
> stuff should not be an afterthought, it typically has major
> implications on how you design your transitions, because measurements
> of some component always need to happen *before* you pass control to
> it, otherwise they are pointless.
>

OK, I will look into the details of TPM to see how to work this out.

Thanks,

Pingfan
Pingfan Liu Aug. 22, 2024, 10:51 a.m. UTC | #7
On Thu, Aug 22, 2024 at 2:17 PM Dave Young <dyoung@redhat.com> wrote:
>
> On Thu, 22 Aug 2024 at 13:42, Pingfan Liu <piliu@redhat.com> wrote:
> >
> > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> > <mzxreary@0pointer.de> wrote:
> > >
> > > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> > >
> > > > *** Background ***
> > > >
> > > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > > cope with the new format.
> > > >
> > > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > > emulator approach again [3]
> > >
> > > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > > files placed next to the UKI, via the UEFI file system APIs. What's
> > > your intention with the UEFI emulator regarding that? The sidecars are
> > > somewhat important, because that's how we parameterize otherwise
> > > strictly sealed, immutable UKIs.
> > >
> > IIUC, you are referring to UKI addons.
> >
> > > Hence, what's the story there? implement some form of fs driver (for
> > > what fs precisely?) in the emulator too?
> > >
> > As for addon, that is a missing part in this series. I have overlooked
> > this issue. Originally, I thought that there was no need to implement
> > a disk driver and vfat file system, just preload them into memory, and
> > finally present them through the uefi API. I will take a closer look
> > at it and chew on it.
> >
>
> Hi Pingfan,
>
> If more and more stuff needs coming in,  not only the limited boot
> services then it will be way too complicated and hard to maintain and
> debug,  also the two kexec code paths are duplicated somehow. It is
> really bad..
>
OK, I will try to keep things simple. And what do you mean by "two
kexec code paths"?

> I forgot why we can not just extract the kernel from UKI and then load
> it directly,  if the embedded kernel is also signed it should be good?
>

I think the main concern is about the signature.

Thanks,

Pingfan
Jan Hendrik Farr Aug. 22, 2024, 10:56 a.m. UTC | #8
Hi Dave,

> I forgot why we can not just extract the kernel from UKI and then load
> it directly,  if the embedded kernel is also signed it should be good?

The problem is that in the basic usecase for UKI you only sign the entire
UKI PE file and not the included kernel, because you only want that kernel
to be run with that one initrd and that one kernel cmdline.

So at a minimum you have to have the signature on the whole UKI checked by
the kernel and then have the kernel extract the UKI into its parts, unless you
somehow want to extend trust into userspace to have a helper program do that.

That's what my UKI support implementation from last year did.

v1: https://lore.kernel.org/lkml/20230909161851.223627-1-kernel@jfarr.cc/
v2: https://lore.kernel.org/lkml/20230911052535.335770-1-kernel@jfarr.cc/
v3-wip: https://github.com/Cydox/linux/blob/2908db6d8556fa617298cfb713355edaa9e4b095/arch/x86/kernel/kexec-uki.c

It however also lacks support for the "side-car" files. One option to add them
would be to load them using subsequent calls to kexec_file_load with a special
flag maybe.

TPM measurements are also not done although they are way easier to
implement with this approach as we still have the rest of the kernel around.

However TPM measurements in this case would be implemented by the kexec loader
in the kernel not by the UKI deciding what to measure. So we would have to
have a very firm agreement on what to measure.

Going the UEFI emulator route gives the UKI format (and other (future) formats)
way more flexibility. The cost is potentially having to implement a large
portion of the UEFI spec, especially if the goal is to support future unknown
formats, which IIRC was one of the reasons this approach was suggested.

Kind regards,
Jan
Jan Hendrik Farr Aug. 22, 2024, 11:42 a.m. UTC | #9
On 22 18:45:38, Pingfan Liu wrote:
> On Thu, Aug 22, 2024 at 4:23 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Do, 22.08.24 13:42, Pingfan Liu (piliu@redhat.com) wrote:
> >
> >  > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> > > <mzxreary@0pointer.de> wrote:
> > > >
> > > > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> > > >
> > > > > *** Background ***
> > > > >
> > > > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > > > cope with the new format.
> > > > >
> > > > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > > > emulator approach again [3]
> > > >
> > > > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > > > files placed next to the UKI, via the UEFI file system APIs. What's
> > > > your intention with the UEFI emulator regarding that? The sidecars are
> > > > somewhat important, because that's how we parameterize otherwise
> > > > strictly sealed, immutable UKIs.
> > > >
> > > IIUC, you are referring to UKI addons.
> >
> > Yeah, UKI addons, as well as credential files, and sysext/confext
> > DDIs.
> >
> > The addons are the most interesting btw, because we load them into
> > memory as PE files, and ask the UEFI to authenticate them.
> >
> > > > Hence, what's the story there? implement some form of fs driver (for
> > > > what fs precisely?) in the emulator too?
> > > >
> > > As for addon, that is a missing part in this series. I have overlooked
> > > this issue. Originally, I thought that there was no need to implement
> > > a disk driver and vfat file system, just preload them into memory, and
> > > finally present them through the uefi API. I will take a closer look
> > > at it and chew on it.
> >
> > It doesn't have to be VFAT btw. It just has to be something. For
> > example, it might suffice to take these files, pack them up as cpio or
> > so and pass them along with the UEFI execution. The UEFI emulator
> > would then have to expose them as a file system then.
> >
> > We are not talking of a bazillion of files here, it's mostly a
> > smallish number of sidecar files I'd expect.
> >
> Yes, I think about using <key, value>, where key is the file path,
> value is the file content.
> 
> > > > And regarding tpm? tpms require drivers and i guess at the moment uefi
> > > > emulator would run those aren't available anymore? but we really
> > > > should do a separator measurement then. (also there needs to be some
> > > > way to pass over measurement log of that measurement?)
> > >
> > > It is a pity that it is a common issue persistent with kexec-reboot
> > > kernel nowadays.
> > > I am not familiar with TPM and have no clear idea for the time being.
> > > (emulating Platform Configuration Registers ?).  But since this
> > > emulator is held inside a linux kernel image, and the UKI's signature
> > > is checked during kexec_file_load. All of them are safe from
> > > modification, this security is not an urgent issue.
> >
> > Hmm, I'd really think about this with some priority. The measurement
> > stuff should not be an afterthought, it typically has major
> > implications on how you design your transitions, because measurements
> > of some component always need to happen *before* you pass control to
> > it, otherwise they are pointless.
> >
> 
> OK, I will look into the details of TPM to see how to bail out.

This issue is why I thought a different approach to the UEFI emulator
might be useful:

(1) On "kexec -l" execute the EFI binary inside the kernel as a kthread
until it exits boot services and record all TPM measurements into a
buffer
(2) On "kexec -e" use the kernel's TPM driver to actually perform all the
prerecorded measurements.
(3) Transition into a "purgatory" that will clean up the address space to
make sure we get to an identity mapping.
(4) Return control to the EFI app at the point it exited boot services

Additional advantage is that we have filesystem access during (1) so it's
simple to load additional sidecar files for the UKI.
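
To make the record-then-replay idea in steps (1) and (2) concrete, here is
a minimal sketch in plain C. It is only an illustration under assumptions:
all names are hypothetical, the digest is assumed to be SHA-256, and
count_extend() is a stand-in that merely counts calls where a real TPM
driver would extend a PCR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN   32   /* SHA-256 digest size */
#define LOG_CAPACITY 16   /* arbitrary bound for this sketch */

struct measurement {
	uint32_t pcr;                 /* PCR index to extend */
	uint8_t digest[DIGEST_LEN];   /* digest of the measured data */
};

struct measurement_log {
	struct measurement entries[LOG_CAPACITY];
	size_t count;
};

/* Step (1): record a measurement instead of touching the TPM.
 * Returns 0 on success, -1 if the log is full. */
static int log_record(struct measurement_log *log, uint32_t pcr,
		      const uint8_t digest[DIGEST_LEN])
{
	if (log->count >= LOG_CAPACITY)
		return -1;
	log->entries[log->count].pcr = pcr;
	memcpy(log->entries[log->count].digest, digest, DIGEST_LEN);
	log->count++;
	return 0;
}

typedef int (*extend_fn)(uint32_t pcr, const uint8_t *digest, void *ctx);

/* Step (2): replay the entries in recording order through the given
 * extend callback; stop at the first failure and return its error. */
static int log_replay(const struct measurement_log *log, extend_fn extend,
		      void *ctx)
{
	for (size_t i = 0; i < log->count; i++) {
		int ret = extend(log->entries[i].pcr,
				 log->entries[i].digest, ctx);
		if (ret)
			return ret;
	}
	return 0;
}

/* Stand-in for a real TPM driver call: only counts invocations. */
static int count_extend(uint32_t pcr, const uint8_t *digest, void *ctx)
{
	(void)pcr;
	(void)digest;
	(*(size_t *)ctx)++;
	return 0;
}
```

Replaying in recording order matters because a PCR extend is not
commutative; the final PCR value depends on the sequence of digests.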

I have two questions:

1: Does the systemd-stub only perform measurements before exiting boot
services or also afterwards?

2: Is it okay to just go to an identity mapping when boot services are
exited, or is the identity mapping actually required for the entire
execution of the EFI app (I know the UEFI spec calls for this, but I
think it should be possible to clean up the address space in a purgatory)?


I played around with this approach last year and got the start of the
kernel's EFI stub executing in a kthread and calling into the provided boot
services, but the difficult part is memory allocation and cleaning up
the address space in a purgatory.

> 
> Thanks,
> 
> Pingfan
> 

Kind regards,
Jan
Lennart Poettering Aug. 22, 2024, 11:45 a.m. UTC | #10
On Do, 22.08.24 13:42, Jan Hendrik Farr (kernel@jfarr.cc) wrote:

> I have two questions:
>
> 1: Does the systemd-stub only perform measurements before exiting boot
> services or also afterwards?

Nope. we pass control to the kernel's own stub, and that calls EBS(),
not systemd-stub. Hence, no, we are just measuring things before
EBS(), not after.

Lennart

--
Lennart Poettering, Berlin
Dave Young Aug. 22, 2024, 11:54 a.m. UTC | #11
On Thu, 22 Aug 2024 at 18:51, Pingfan Liu <piliu@redhat.com> wrote:
>
> On Thu, Aug 22, 2024 at 2:17 PM Dave Young <dyoung@redhat.com> wrote:
> >
> > On Thu, 22 Aug 2024 at 13:42, Pingfan Liu <piliu@redhat.com> wrote:
> > >
> > > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> > > <mzxreary@0pointer.de> wrote:
> > > >
> > > > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> > > >
> > > > > *** Background ***
> > > > >
> > > > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > > > cope with the new format.
> > > > >
> > > > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > > > emulator approach again [3]
> > > >
> > > > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > > > files placed next to the UKI, via the UEFI file system APIs. What's
> > > > your intention with the UEFI emulator regarding that? The sidecars are
> > > > somewhat important, because that's how we parameterize otherwise
> > > > strictly sealed, immutable UKIs.
> > > >
> > > IIUC, you are referring to UKI addons.
> > >
> > > > Hence, what's the story there? implement some form of fs driver (for
> > > > what fs precisely?) in the emulator too?
> > > >
> > > As for addon, that is a missing part in this series. I have overlooked
> > > this issue. Originally, I thought that there was no need to implement
> > > a disk driver and vfat file system, just preload them into memory, and
> > > finally present them through the uefi API. I will take a closer look
> > > at it and chew on it.
> > >
> >
> > Hi Pingfan,
> >
> > If more and more stuff needs coming in,  not only the limited boot
> > services then it will be way too complicated and hard to maintain and
> > debug,  also the two kexec code paths are duplicated somehow. It is
> > really bad..
> >
> OK, I will try to keep things easier. And what do you mean about " two
> kexec code paths"?

I mean we have both the EFI and the non-EFI kexec implementations. Also,
the EFI kexec code for x86 passes some data from the 1st kernel to the
2nd kernel because there is no EFI firmware phase. Anyway, this part can
be cleaned up if the emulator can be done gracefully.

>
> > I forgot why we can not just extract the kernel from UKI and then load
> > it directly,  if the embedded kernel is also signed it should be good?
> >
>
> I think the main concern is about the signature.

I thought that for the minimal kdump case we could live with only the
kernel being signed and leave the initrd/cmdline unsigned. For kexec
reboot, though, it is a problem.

>
> Thanks,
>
> Pingfan
>
Dave Young Aug. 22, 2024, 12:04 p.m. UTC | #12
On Thu, 22 Aug 2024 at 18:56, Jan Hendrik Farr <kernel@jfarr.cc> wrote:
>
> Hi Dave,
>
> > I forgot why we can not just extract the kernel from UKI and then load
> > it directly,  if the embedded kernel is also signed it should be good?
>
> The problem is that in the basic usecase for UKI you only sign the entire
> UKI PE file and not the included kernel, because you only want that kernel
> to be run with that one initrd and that one kernel cmdline.

Hmm, as I replied to Pingfan, I thought that both the included kernel
and the UKI can be signed, and for the kdump case kexec_file_load can
simply be used.

>
> So at a minimum you have to have the signature on the whole UKI checked by
> the kernel and than have the kernel extract UKI into its parts unless you
> somehow want to extent trust into userspace to have a helper program do that.

Extending trust into userspace is hard. When Vivek created
kexec_file_load, this option was explored and he gave up on it. :(

Pingfan, nice to see you have something done as a POC at least, and
good to see this topic is alive. I just have some worries about the
complexity of the emulator though.

Thanks
Dave
Pingfan Liu Aug. 22, 2024, 2:29 p.m. UTC | #13
On Thu, Aug 22, 2024 at 4:23 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Do, 22.08.24 13:42, Pingfan Liu (piliu@redhat.com) wrote:
>
>  > On Wed, Aug 21, 2024 at 10:27 PM Lennart Poettering
> > <mzxreary@0pointer.de> wrote:
> > >
> > > On Mo, 19.08.24 22:53, Pingfan Liu (piliu@redhat.com) wrote:
> > >
> > > > *** Background ***
> > > >
> > > > As more PE format kernel images are introduced, it post challenge to kexec to
> > > > cope with the new format.
> > > >
> > > > In my attempt to add support for arm64 zboot image in the kernel [1],
> > > > Ard suggested using an emulator to tackle this issue.  Last year, when
> > > > Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> > > > emulator approach again [3]
> > >
> > > Hmm, systemd's systemd-stub code tries to load certain "side-car"
> > > files placed next to the UKI, via the UEFI file system APIs. What's
> > > your intention with the UEFI emulator regarding that? The sidecars are
> > > somewhat important, because that's how we parameterize otherwise
> > > strictly sealed, immutable UKIs.
> > >
> > IIUC, you are referring to UKI addons.
>
> Yeah, UKI addons, as well as credential files, and sysext/confext
> DDIs.
>
> The addons are the most interesting btw, because we load them into
> memory as PE files, and ask the UEFI to authenticate them.
>
> > > Hence, what's the story there? implement some form of fs driver (for
> > > what fs precisely?) in the emulator too?
> > >
> > As for addon, that is a missing part in this series. I have overlooked
> > this issue. Originally, I thought that there was no need to implement
> > a disk driver and vfat file system, just preload them into memory, and
> > finally present them through the uefi API. I will take a closer look
> > at it and chew on it.
>
> It doesn't have to be VFAT btw. It just has to be something. For
> example, it might suffice to take these files, pack them up as cpio or
> so and pass them along with the UEFI execution. The UEFI emulator
> would then have to expose them as a file system then.
>
> We are not talking of a bazillion of files here, it's mostly a
> smallish number of sidecar files I'd expect.
>
> > > And regarding tpm? tpms require drivers and i guess at the moment uefi
> > > emulator would run those aren't available anymore? but we really
> > > should do a separator measurement then. (also there needs to be some
> > > way to pass over measurement log of that measurement?)
> >
> > It is a pity that it is a common issue persistent with kexec-reboot
> > kernel nowadays.
> > I am not familiar with TPM and have no clear idea for the time being.
> > (emulating Platform Configuration Registers ?).  But since this
> > emulator is held inside a linux kernel image, and the UKI's signature
> > is checked during kexec_file_load. All of them are safe from
> > modification, this security is not an urgent issue.
>
> Hmm, I'd really think about this with some priority. The measurement
> stuff should not be an afterthought, it typically has major
> implications on how you design your transitions, because measurements
> of some component always need to happen *before* you pass control to
> it, otherwise they are pointless.
>

At present, my emulator returns false to is_efi_secure_boot(), so
systemd-stub does not care about the measurement, and moves on.

Could you enlighten me about how systemd utilizes the measurement? I
grepped for 'TPM2_PCR_KERNEL_CONFIG' and saw that systemd-stub asks to
extend the PCR. But where is the value checked? I guess systemd will
hang if the check fails.

Thanks,

Pingfan
Lennart Poettering Aug. 26, 2024, 1:39 p.m. UTC | #14
On Do, 22.08.24 22:29, Pingfan Liu (piliu@redhat.com) wrote:

> > Hmm, I'd really think about this with some priority. The measurement
> > stuff should not be an afterthought, it typically has major
> > implications on how you design your transitions, because measurements
> > of some component always need to happen *before* you pass control to
> > it, otherwise they are pointless.
> >
>
> At present, my emulator returns false to is_efi_secure_boot(), so
> systemd-stub does not care about the measurement, and moves on.
>
> Could you enlighten me about how systemd utilizes the measurement? I
> grepped 'TPM2_PCR_KERNEL_CONFIG', and saw the systemd-stub asks to
> extend PCR. But where is the value checked? I guess the systemd will
> hang if the check fails.

systemd's "systemd-pcrlock" tool will look for measurements like that
and generate disk encryption TPM policies from that.
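
For readers less familiar with measured boot, here is a minimal sketch of
why the order and content of measurements matter. A TPM PCR can only be
extended: the new value is the hash of the old value concatenated with the
digest of the measured data. The Python below models a single SHA-256 PCR
in software. The function names are illustrative only and are not taken
from systemd or the kernel.

```python
import hashlib

PCR_SIZE = 32  # size of one register in the SHA-256 bank


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """Return the PCR value after measuring `data` into `pcr`."""
    digest = hashlib.sha256(data).digest()
    # Extend operation: new = H(old || H(data))
    return hashlib.sha256(pcr + digest).digest()


def replay(measurements) -> bytes:
    """Replay a list of measured blobs, starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for blob in measurements:
        pcr = pcr_extend(pcr, blob)
    return pcr
```

A policy bound to a predicted PCR value is only satisfied if the same
blobs are measured in the same order, so a kexec path that skips or
reorders measurements ends up with a different value.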

Lennart

--
Lennart Poettering, Berlin
Ard Biesheuvel Aug. 28, 2024, 5:08 p.m. UTC | #15
On Mon, 19 Aug 2024 at 16:55, Pingfan Liu <piliu@redhat.com> wrote:
>
> *** Background ***
>
> As more PE format kernel images are introduced, it post challenge to kexec to
> cope with the new format.
>
> In my attempt to add support for arm64 zboot image in the kernel [1],
> Ard suggested using an emulator to tackle this issue.  Last year, when
> Jan tried to introduce UKI support in the kernel [2], Ard mentioned the
> emulator approach again [3]
>
> After discussion, Ard's approach seems to be a more promising solution
> to handle PE format kernels once and for all.  This series follows that
> approach and implements an emulator to emulate EFI boot time services,
> allowing the efistub kernel to self-extract and boot.
>
> Another year has passed, and UKI kernel is more and more frequently used
> in product. I think it is time to pay effort to resolve this issue.
>
>
> *** Overview of implement ***
> The whole model consits of three parts:
>
> -1. The emulator
> It is a self-relocatable PIC code, which is finally linked into kernel, but not
> export any internal symbol to kernel.  It mainly contains: a PE file parser,
> which loads PE format kernel, a group of functions to emulate efi boot service.
>
> -2. inside kernel, PE-format loader
> Its main task is to set up two extra kexec_segment, one for emulator, the other
> for passing information from the first kernel to emulator.
>
> -3. set up identity mapping only for the memory used by the emulator.
> Here it relies on kimage_alloc_control_pages() to get pages, which will not
> stamped during the process of kexec relocate (cp from src to dst). And since the
> mapping only covers a small range of memory, it cost small amount memory.
>
>
> *** To do ***
>
> Currently, it only works on arm64 virt machine. For x86, it needs some slightly
> changes. (I plan to do it in the next version)
>
> Also, this series does not implement a memory allocator, which I plan to
> implement with the help of bitmap.
>
> About console, currently it hard code for arm64 virt machine, later it should
> extract the information through ACPI table.
>
> For kdump code, it is not implmented yet. But it should share the majority of
> this series.
>
>
> *** Test of this series ***
> I have tested this series on arm64 virt machine. There I booted the vmlinuz.efi
> and kexec_file_load a UKI image, then switch to the second kernel.
>
> I used a modified kexec-tools [4], which just skips the check of the file format and passes the file directly to kernel.
>
> [1]: https://lore.kernel.org/linux-arm-kernel/ZBvKSis+dfnqa+Vz@piliu.users.ipa.redhat.com/T/#m42abb0ad3c10126b8b3bfae8a596deb707d6f76e
> [2]: https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/
> [3]: https://lore.kernel.org/lkml/20230918173607.421d2616@rotkaeppchen/T/#mc60aa591cb7616ceb39e1c98f352383f9ba6e985
> [4]: https://github.com/pfliu/kexec-tools.git branch: kexec_uefi_emulator
>
>
> RFCv1 -> RFCv2:
> -1.Support to run UKI kernel by: add LoadImage() and StartImage(), add
>    PE file relocation support, add InstallMultiProtocol()
> -2.Also set up idmap for EFI runtime memory descriptor since UKI's
>    systemd-stub calls runtime service
> -3.Move kexec_pe_image.c from arch/arm64/kernel to kernel/, since it
>    aims to provide a more general architecture support.
>
> RFCv1: https://lore.kernel.org/linux-efi/20240718085759.13247-1-piliu@redhat.com/
> RFCv2: https://github.com/pfliu/linux.git  branch kexec_uefi_emulator_RFCv2
>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: Jan Hendrik Farr <kernel@jfarr.cc>
> Cc: Philipp Rudo <prudo@redhat.com>
> Cc: Lennart Poettering <mzxreary@0pointer.de>
> Cc: Jarkko Sakkinen <jarkko@kernel.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Baoquan He <bhe@redhat.com>
> Cc: Dave Young <dyoung@redhat.com>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: kexec@lists.infradead.org
> Cc: linux-efi@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
>
>
>
> Pingfan Liu (9):
>   efi/libstub: Ask efi_random_alloc() to skip unusable memory
>   efi/libstub: Complete efi_simple_text_output_protocol
>   efi/emulator: Initial rountines to emulate EFI boot time service
>   efi/emulator: Turn on mmu for arm64
>   kexec: Introduce kexec_pe_image to parse and load PE file
>   arm64: kexec: Introduce a new member param_mem to kimage_arch
>   arm64: mm: Change to prototype of
>   arm64: kexec: Prepare page table for emulator
>   arm64: kexec: Enable kexec_pe_image
>

Thanks for putting this RFC together. This is useful work, and gives
us food for thought and discussion.

There are a few problems that become apparent when going through these changes.

1. Implementing UEFI entirely is intractable, and unnecessary.
Implementing the subset of UEFI that is actually needed to boot Linux
*is* tractable, though, but we need to work together to write this
down somewhere.
  - the EFI stub needs the boot services for the EFI memory map and
the allocation routines
  - GRUB needs block I/O
  - systemd-stub/UKI needs file I/O to look for sidecars
  - etc etc

I implemented a Rust 'efiloader' crate a while ago that encapsulates
most of this (it can boot Linux/arm64 on QEMU and boot x86 via GRUB in
user space **). Adding file I/O to this should be straight-forward -
as Lennart points out, we only need the protocol, it doesn't need to
be backed by an actual file system, it just needs to be able to expose
other files in the right way.

2. Running the UEFI emulator on bare metal is not going to scale.
Cloning UART driver code and MMU code etc is a can of worms that you
want to leave closed. And as Lennart points out, there is other
hardware (TPM) that needs to be accessible as well. Providing a
separate set of drivers for all hardware that the EFI emulator may
need to access is not a tractable problem either.

The fix for this, as I see it, is to run the EFI emulator in user
space, to the point where the payload calls ExitBootServices(). This
will allow all I/O and memory protocol to be implemented trivially,
using C library routines. I have a crude prototype** of this running
to the point where ExitBootServices() is called (and then it crashes).
The tricky yet interesting bit here is how we migrate a chunk of user
space memory to the bare metal context that will be created by the
kexec syscall later (in which the call to ExitBootServices() would
return and proceed with the boot). But the principle is rather
straight-forward, and would permit us, e.g., to kexec an OS installer
too.

3. We need to figure out how to support TPM and PCRs in the context of
kexec. This is a fundamental issue with verified boot, given that the
kexec PCR state is necessarily different from the boot state, and so
we cannot reuse the TPM directly if we want to pretend that we are
doing an ordinary boot in kexec. The alternative is to leave the TPM
in a state where the kexec kernel can access its sealed secrets, and
mock up the TCG2 EFI protocols using a shim that sits between the TPM
hardware (as the real TCG2 protocols will be long gone) and the EFI
payload. But as I said, this is a fundamental issue, as the ability to
pretend that a kexec boot is a pristine boot would mean that verified
boot is broken.
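
To illustrate point 1 above, that the file I/O protocol does not need to
be backed by an actual file system: a minimal sketch of a table of
preloaded files that answers open and read requests from memory. This is
a Python model only, not the efiloader crate's API; the class names, the
path normalisation, and the example path used with it are all
assumptions made for illustration.

```python
class FileHandle:
    """Read cursor over one preloaded file."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def read(self, size: int) -> bytes:
        # Return at most `size` bytes; an empty result means end of file.
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk


class PreloadedFiles:
    """Maps paths to file contents that were loaded before the payload runs."""

    def __init__(self, files):
        self._files = {self._norm(p): d for p, d in files.items()}

    @staticmethod
    def _norm(path: str) -> str:
        # Accept backslash-separated paths and compare case-insensitively
        # (an assumption mirroring FAT-like behaviour).
        return path.replace("\\", "/").strip("/").lower()

    def open(self, path: str):
        data = self._files.get(self._norm(path))
        if data is None:
            return None  # a real emulator would report "not found" here
        return FileHandle(data)
```

The payload only ever sees the open/read behaviour, so the files could
equally come from a cpio archive or from buffers handed over by the
first kernel.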


As future work, I'd like to propose to collaborate on some alignment
regarding a UEFI baseline for Linux, i.e., the parts that we actually
need to boot Linux.

For this series in particular, I don't see a way forward where we
adopt this approach, and carry all this code inside the kernel.

Thanks.
Ard.
Pingfan Liu Sept. 2, 2024, 5:40 a.m. UTC | #16
On Thu, Aug 29, 2024 at 1:08 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
[...]

Hi Ard,

Thanks for sharing your insight and thoughts.

>
> Thanks for putting this RFC together. This is useful work, and gives
> us food for thought and discussion.
>
> There are a few problems that become apparent when going through these changes.
>
> 1. Implementing UEFI entirely is intractable, and unnecessary.
> Implementing the subset of UEFI that is actually needed to boot Linux
> *is* tractable, though, but we need to work together to write this
> down somewhere.
>   - the EFI stub needs the boot services for the EFI memory map and
> the allocation routines
>   - GRUB needs block I/O
>   - systemd-stub/UKI needs file I/O to look for sidecars
>   - etc etc
>
> I implemented a Rust 'efiloader' crate a while ago that encapsulates
> most of this (it can boot Linux/arm64 on QEMU and boot x86 via GRUB in
> user space **). Adding file I/O to this should be straight-forward -
> as Lennart points out, we only need the protocol, it doesn't need to
> be backed by an actual file system, it just needs to be able to expose
> other files in the right way.
>
> 2. Running the UEFI emulator on bare metal is not going to scale.
> Cloning UART driver code and MMU code etc is a can of worms that you
> want to leave closed. And as Lennart points out, there is other

As for the MMU code: if the 1st kernel does not turn the MMU off, that
code can be dropped from the emulator. This should not be hard to
implement on arm64, and it is already done on x86.

> hardware (TPM) that needs to be accessible as well. Providing a
> separate set of drivers for all hardware that the EFI emulator may
> need to access is not a tractable problem either.
>
> The fix for this, as I see it, is to run the EFI emulator in user
> space, to the point where the payload calls ExitBootServices(). This
> will allow all I/O and memory protocol to be implemented trivially,
> using C library routines. I have a crude prototype** of this running

Yes, that is definitely a promising method. This way, we can handle
device operations more elegantly. In fact, I used it to develop and
debug part of my emulator service code.

But when debugging the x86 efi-stub, I ran into a problem with a
privileged instruction, which causes a segmentation fault. It originates
from kaslr_get_random_long().  I think it can be worked around by
emulating the instruction if the instruction only reads system state.
But if the instruction tries to update system state, it cannot be fixed,
since the system is still owned by the kernel rather than exclusively by
the emulator.

So here we need another agreement on the stub's behavior before
ExitBootServices().

> to the point where ExitBootServices() is called (and then it crashes).
> The tricky yet interesting bit here is how we migrate a chunk of user
> space memory to the bare metal context that will be created by the
> kexec syscall later (in which the call to ExitBootServices() would
> return and proceed with the boot). But the principle is rather
> straight-forward, and would permit us, e.g., to kexec an OS installer
> too.
>
> 3. We need to figure out how to support TPM and PCRs in the context of
> kexec. This is a fundamental issue with verified boot, given that the
> kexec PCR state is necessarily different from the boot state, and so
> we cannot reuse the TPM directly if we want to pretend that we are
> doing an ordinary boot in kexec. The alternative is to leave the TPM

Here I am missing the big picture. Could you enlighten me more about this?
As I understand it, the Linux kernel does not lock itself down to a
specific firmware, so the trust goes in one direction, i.e. from the
bootloader to the kernel. In the UKI case, systemd-stub takes the
measurements and extends PCRs 11/12/13 as described in
https://uapi-group.org/specifications/specs/linux_tpm_pcr_registry/

Later systemd-pcrlock appraises the value in those registers. If the
sections in UKI are intact, the kexec reboot will go smoothly.


> in a state where the kexec kernel can access its sealed secrets, and
> mock up the TCG2 EFI protocols using a shim that sits between the TPM
> hardware (as the real TCG2 protocols will be long gone) and the EFI
> payload. But as I said, this is a fundamental issue, as the ability to
> pretend that a kexec boot is a pristine boot would mean that verified
> boot is broken.
>
>
> As future work, I'd like to propose to collaborate on some alignment
> regarding a UEFI baseline for Linux, i.e., the parts that we actually
> need to boot Linux.
>

Do you mean both user space code and kernel code? For the user space
code, I think it would be better to integrate it into kexec-tools so
that we have a uniform interface for kexec boot.

Looking forward to the collaboration to make kexec able to boot UKI soon.


Thanks,

Pingfan
Philipp Rudo Sept. 6, 2024, 10:54 a.m. UTC | #17
Hi Ard,
Hi Jan,

On Wed, 28 Aug 2024 19:08:14 +0200
Ard Biesheuvel <ardb@kernel.org> wrote:

[...]

> Thanks for putting this RFC together. This is useful work, and gives
> us food for thought and discussion.
> 
> There are a few problems that become apparent when going through these changes.
> 
> 1. Implementing UEFI entirely is intractable, and unnecessary.
> Implementing the subset of UEFI that is actually needed to boot Linux
> *is* tractable, though, but we need to work together to write this
> down somewhere.
>   - the EFI stub needs the boot services for the EFI memory map and
> the allocation routines
>   - GRUB needs block I/O
>   - systemd-stub/UKI needs file I/O to look for sidecars
>   - etc etc
> 
> I implemented a Rust 'efiloader' crate a while ago that encapsulates
> most of this (it can boot Linux/arm64 on QEMU and boot x86 via GRUB in
> user space **). Adding file I/O to this should be straight-forward -
> as Lennart points out, we only need the protocol, it doesn't need to
> be backed by an actual file system, it just needs to be able to expose
> other files in the right way.
> 
> 2. Running the UEFI emulator on bare metal is not going to scale.
> Cloning UART driver code and MMU code etc is a can of worms that you
> want to leave closed. And as Lennart points out, there is other
> hardware (TPM) that needs to be accessible as well. Providing a
> separate set of drivers for all hardware that the EFI emulator may
> need to access is not a tractable problem either.
> 
> The fix for this, as I see it, is to run the EFI emulator in user
> space, to the point where the payload calls ExitBootServices(). This
> will allow all I/O and memory protocol to be implemented trivially,
> using C library routines. I have a crude prototype** of this running
> to the point where ExitBootServices() is called (and then it crashes).
> The tricky yet interesting bit here is how we migrate a chunk of user
> space memory to the bare metal context that will be created by the
> kexec syscall later (in which the call to ExitBootServices() would
> return and proceed with the boot). But the principle is rather
> straight-forward, and would permit us, e.g., to kexec an OS installer
> too.

I mostly agree with what you have written. But I see a big problem in
running the EFI emulator in user space when it comes to secure boot.
The chain of trust ends in the kernel. So it's the kernel that needs to
verify that the image to be loaded can be trusted. But when the EFI
runtime is in user space the kernel simply cannot do that. Which means,
if we want to go this way, we would need to extend the chain of trust
to user space. Which will be a whole bucket of worms, not just a can.

That's why I lean more towards Jan's suggestion to include the EFI runtime
in the kernel. Alas, that comes with its own problem, as that requires
running code in the kernel that was never intended to run in kernel
context. So even when we can trust the code not to be malicious, we
cannot trust it to not accidentally change the system state in a way
the kernel doesn't expect...

Let me throw another wild idea into the ring. Instead of implementing
an EFI runtime we could also include an eBPF version of the stub in the
images. kexec could then extract the eBPF program and let it run just
like any other eBPF program with all the pros (and cons) that come with
it. That won't be as generic as the EFI runtime, e.g. you couldn't
simply kexec any OS installer. On the other hand it would make it
easier to port UKIs et al. to non-EFI systems. What do you think?

Thanks
Philipp

> 3. We need to figure out how to support TPM and PCRs in the context of
> kexec. This is a fundamental issue with verified boot, given that the
> kexec PCR state is necessarily different from the boot state, and so
> we cannot reuse the TPM directly if we want to pretend that we are
> doing an ordinary boot in kexec. The alternative is to leave the TPM
> in a state where the kexec kernel can access its sealed secrets, and
> mock up the TCG2 EFI protocols using a shim that sits between the TPM
> hardware (as the real TCG2 protocols will be long gone) and the EFI
> payload. But as I said, this is a fundamental issue, as the ability to
> pretend that a kexec boot is a pristine boot would mean that verified
> boot is broken.
> 
> 
> As future work, I'd like to propose to collaborate on some alignment
> regarding a UEFI baseline for Linux, i.e., the parts that we actually
> need to boot Linux.
> 
> For this series in particular, I don't see a way forward where we
> adopt this approach, and carry all this code inside the kernel.
> 
> Thanks.
> Ard.
>
Jarkko Sakkinen Sept. 7, 2024, 11:27 a.m. UTC | #18
On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote:
> Let me throw an other wild idea in the ring. Instead of implementing
> a EFI runtime we could also include a eBPF version of the stub into the
> images. kexec could then extract the eBPF program and let it run just
> like any other eBPF program with all the pros (and cons) that come with
> it. That won't be as generic as the EFI runtime, e.g. you couldn't
> simply kexec any OS installer. On the other hand it would make it
> easier to port UKIs et al. to non-EFI systems. What do you think?

BPF would give some favorable guarantees, such as that programs always
terminate, even faulty ones. It always has an implicit "ExitBootServices".

Just a remark.

BR, Jarkko
Jarkko Sakkinen Sept. 7, 2024, 11:31 a.m. UTC | #19
On Sat Sep 7, 2024 at 2:27 PM EEST, Jarkko Sakkinen wrote:
> On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote:
> > Let me throw an other wild idea in the ring. Instead of implementing
> > a EFI runtime we could also include a eBPF version of the stub into the
> > images. kexec could then extract the eBPF program and let it run just
> > like any other eBPF program with all the pros (and cons) that come with
> > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > simply kexec any OS installer. On the other hand it would make it
> > easier to port UKIs et al. to non-EFI systems. What do you think?
>
> BPF would have some guarantees that are favorable such as programs
> always end, even faulty ones. It always has implicit "ExitBootServices".
>
> Just a remark.

Some days ago I was wondering whether some of the kernel functionality
could be eBPF, at least in theory, because most of it is amortized,
i.e. does a fixed chunk of work. I am not going down that rabbit hole,
but I really like this idea, and it could be a good experimentation
ground for such innovation.

BR, Jarkko
Jarkko Sakkinen Sept. 7, 2024, 11:41 a.m. UTC | #20
On Sat Sep 7, 2024 at 2:31 PM EEST, Jarkko Sakkinen wrote:
> On Sat Sep 7, 2024 at 2:27 PM EEST, Jarkko Sakkinen wrote:
> > On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote:
> > > Let me throw an other wild idea in the ring. Instead of implementing
> > > a EFI runtime we could also include a eBPF version of the stub into the
> > > images. kexec could then extract the eBPF program and let it run just
> > > like any other eBPF program with all the pros (and cons) that come with
> > > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > > simply kexec any OS installer. On the other hand it would make it
> > > easier to port UKIs et al. to non-EFI systems. What do you think?
> >
> > BPF would have some guarantees that are favorable such as programs
> > always end, even faulty ones. It always has implicit "ExitBootServices".
> >
> > Just a remark.
>
> Some days ago I was thinking could some of the kernel functionality be
> eBPF at least like in formal theory because most of it is amortized,
> i.e. does a fixed chunk of work. Not going into that rabbit hole but
> I really like this idea and could be good experimentation ground for
> such innovation.

E.g. let's imagine there were an eBPF-TPM driver framework.

The way I would go about it would be to take the existing TPM driver
functionality, make extra functions and resources available to a
subsystem-specific BPF environment, and have the orchestration code as
eBPF. I pretty much concluded that there is a chance that this could
work out.

Not something on my immediate table, but it is still a really interesting
idea, as instead of using the language to separate "safe" and "unsafe"
regions you would use "VM" environments to create the walls. At the
end of the day that would also be a great venture for Rust in the kernel,
i.e. compile that BPF from Rust.

Sorry for going off on a tangent, the comment triggered me ;-)

BR, Jarkko
Lennart Poettering Sept. 9, 2024, 9:48 a.m. UTC | #21
On Fr, 06.09.24 12:54, Philipp Rudo (prudo@redhat.com) wrote:

> I mostly agree on what you have wrote. But I see a big problem in
> running the EFI emulator in user space when it comes to secure boot.
> The chain of trust ends in the kernel. So it's the kernel that needs to
> verify that the image to be loaded can be trusted. But when the EFI
> runtime is in user space the kernel simply cannot do that. Which means,
> if we want to go this way, we would need to extend the chain of trust
> to user space. Which will be a whole bucket of worms, not just a
> can.

Maybe it would be nice to have a way to "zap" userspace away, i.e. allow
the kernel to get rid of all processes in some way, reliably. And then
simply start a new userspace, from a trusted definition. Or in other
words: if you don't want to trust the usual userspace, then let's
maybe just terminate it, and create it anew, with a clean, pristine
definition the old userspace cannot get access to.

> Let me throw an other wild idea in the ring. Instead of implementing
> a EFI runtime we could also include a eBPF version of the stub into the
> images. kexec could then extract the eBPF program and let it run just
> like any other eBPF program with all the pros (and cons) that come with
> it. That won't be as generic as the EFI runtime, e.g. you couldn't
> simply kexec any OS installer. On the other hand it would make it
> easier to port UKIs et al. to non-EFI systems. What do you think?

eBPF is not Turing complete, so I am not sure how far you will get
with this. The various implementations of EFI payloads contain
plenty of loops, sometimes I/O loops, sometimes hash loops over huge
data (for measurements). As I understand it, eBPF is not really
compatible with such code.

Lennart

--
Lennart Poettering, Berlin
Jan Hendrik Farr Sept. 9, 2024, 10:42 a.m. UTC | #22
On 09 11:48:30, Lennart Poettering wrote:
> On Fr, 06.09.24 12:54, Philipp Rudo (prudo@redhat.com) wrote:
> 
> > I mostly agree on what you have wrote. But I see a big problem in
> > running the EFI emulator in user space when it comes to secure boot.
> > The chain of trust ends in the kernel. So it's the kernel that needs to
> > verify that the image to be loaded can be trusted. But when the EFI
> > runtime is in user space the kernel simply cannot do that. Which means,
> > if we want to go this way, we would need to extend the chain of trust
> > to user space. Which will be a whole bucket of worms, not just a
> > can.
> 
> May it would be nice to have a way to "zap" userspace away, i.e. allow
> the kernel to get rid of all processes in some way, reliable. And then
> simply start a new userspace, from a trusted definition. Or in other
> words: if you don't want to trust the usual userspace, then let's
> maybe just terminate it, and create it anew, with a clean, pristine
> definition the old userspace cannot get access to.

Well, this is an interesting idea!

However, I'm sceptical whether this could be done in a secure way. How do we
ensure that nothing the old userspace did with the various interfaces to
the kernel has any impact on the new userspace? Maybe others can chime in
on this? Does kernel_lockdown give more guarantees related to this?

Even if this is possible in a secure way, there is a problem with doing
this for kernels that are to be kexec'd on kernel panic. In this
approach we can't pre-run them until EBS(), so we would rely on the old
kernel to still be intact when we want to kexec reboot.



You could do a system where you kexec into an intermediate kernel. That
kernel gets kexec'd with a signed initrd that can do any kind of
preparation in userspace and use the normal kexec_load syscall to load.
Problem: For that intermediate environment we already need a format
that combines kernel image, initrd, cmdline all signed in one package
aka UKI. Was it the chicken or the egg?

But this shows that if we implemented UKIs the easy way (kernel simply
checks signature, extracts the pieces, and kexecs them like normal),
this approach could always be used to support kexec for other future
formats. They could use the kernel's UKI support to boot into an
intermediate kernel with UEFI implemented in userspace in the initrd.

So basically support UKIs the easy way and use them to be able to
securely zap away userspace and start with a fresh kernel and signed
userspace as a way to support other UEFI formats that are not UKI.
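
To make the "extracts the pieces" step a bit more concrete, here is a
minimal sketch in Python. It assumes the UKI uses the usual layout where
kernel, initrd and command line sit in PE sections named .linux, .initrd
and .cmdline. Signature verification is deliberately left out, and
pe_sections() and split_uki() are made-up names for illustration, not
anything that exists in kexec today.

```python
import struct

def pe_sections(image):
    """Return {section name: section bytes} for a PE/COFF image."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ magic")
    # e_lfanew at offset 0x3c points at the "PE\0\0" signature.
    (pe_off,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # COFF header: Machine, NumberOfSections, TimeDateStamp,
    # PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
    # Characteristics (20 bytes in total).
    _, nsect, _, _, _, opt_size, _ = struct.unpack_from(
        "<HHIIIHH", image, pe_off + 4)
    table = pe_off + 4 + 20 + opt_size   # section table follows the optional header
    sections = {}
    for i in range(nsect):
        hdr = image[table + 40 * i:table + 40 * (i + 1)]
        name = hdr[:8].rstrip(b"\0").decode("ascii")
        vsize, _vaddr, raw_size, raw_ptr = struct.unpack_from("<IIII", hdr, 8)
        # Raw data may be padded to the file alignment; VirtualSize is
        # the unpadded length when it is set.
        size = min(vsize, raw_size) if vsize else raw_size
        sections[name] = image[raw_ptr:raw_ptr + size]
    return sections

def split_uki(image):
    """Hypothetical helper: pull kernel, initrd and cmdline out of a UKI."""
    s = pe_sections(image)
    cmdline = s.get(".cmdline", b"").rstrip(b"\0").decode()
    return s[".linux"], s.get(".initrd", b""), cmdline
```

The three returned pieces are what would then be handed to the existing
kexec path, once the signature over the whole image has been checked.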

> 
> > Let me throw an other wild idea in the ring. Instead of implementing
> > a EFI runtime we could also include a eBPF version of the stub into the
> > images. kexec could then extract the eBPF program and let it run just
> > like any other eBPF program with all the pros (and cons) that come with
> > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > simply kexec any OS installer. On the other hand it would make it
> > easier to port UKIs et al. to non-EFI systems. What do you think?
> 
> ebpf is not turing complete, I am not sure how far you will make it
> with this, in the various implementations of EFI payloads there are
> plenty of loops, sometimes IO loops, sometimes hash loops of huge data
> (for measurements). As I understand ebpf is not really compatible such
> code.
> 
> Lennart
> 
> --
> Lennart Poettering, Berlin
Pingfan Liu Sept. 9, 2024, 1:38 p.m. UTC | #23
Hi Lennart,

I spent some time understanding systemd-pcrlock and the TPM stuff, and
have a rough idea of how it works. Could you correct me if I'm wrong?
Please see my comments inline below.

On Mon, Aug 26, 2024 at 9:40 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
>
> On Do, 22.08.24 22:29, Pingfan Liu (piliu@redhat.com) wrote:
>
> > > Hmm, I'd really think about this with some priority. The measurement
> > > stuff should not be an afterthought, it typically has major
> > > implications on how you design your transitions, because measurements
> > > of some component always need to happen *before* you pass control to
> > > it, otherwise they are pointless.
> > >
> >
> > At present, my emulator returns false to is_efi_secure_boot(), so
> > systemd-stub does not care about the measurement, and moves on.
> >
> > Could you enlighten me about how systemd utilizes the measurement? I
> > grepped 'TPM2_PCR_KERNEL_CONFIG', and saw the systemd-stub asks to
> > extend PCR. But where is the value checked? I guess the systemd will
> > hang if the check fails.
>
> systemd's "systemd-pcrlock" tool will look for measurements like that
> and generate disk encryption TPM policies from that.
>

Before kexec reboots into the new kernel, systemd-pcrlock can predict
the expected PCR value and store it in the file system.
One thing to note is that the PCR value cannot be affected.

Then the kexec reboot happens, and systemd-stub extends the PCR value.
When the system is up, systemd checks the real PCR value against the
expected value rendered by systemd-pcrlock? If they match, all related
policies succeed.

Do I understand correctly?

Thanks,

Pingfan
Philipp Rudo Sept. 9, 2024, 1:49 p.m. UTC | #24
Hi Lennart,
Hi Jan,

On Mon, 9 Sep 2024 12:42:45 +0200
Jan Hendrik Farr <kernel@jfarr.cc> wrote:

> On 09 11:48:30, Lennart Poettering wrote:
> > On Fr, 06.09.24 12:54, Philipp Rudo (prudo@redhat.com) wrote:
> >   
> > > I mostly agree on what you have wrote. But I see a big problem in
> > > running the EFI emulator in user space when it comes to secure boot.
> > > The chain of trust ends in the kernel. So it's the kernel that needs to
> > > verify that the image to be loaded can be trusted. But when the EFI
> > > runtime is in user space the kernel simply cannot do that. Which means,
> > > if we want to go this way, we would need to extend the chain of trust
> > > to user space. Which will be a whole bucket of worms, not just a
> > > can.  
> > 
> > May it would be nice to have a way to "zap" userspace away, i.e. allow
> > the kernel to get rid of all processes in some way, reliable. And then
> > simply start a new userspace, from a trusted definition. Or in other
> > words: if you don't want to trust the usual userspace, then let's
> > maybe just terminate it, and create it anew, with a clean, pristine
> > definition the old userspace cannot get access to.  
> 
> Well, this is an interesting idea!
> 
> However, I'm sceptical if this could be done in a secure way. How do we
> ensure that nothing the old userspace did with the various interfaces to
> the kernel has no impact on the new userspace? Maybe others can chime in
> on this? Does kernel_lockdown give more guarantees related to this?
> 
> Even if this is possible in a secure way, there is a problem with doing
> this for kernels that are to be kexec'd on kernel panic. In this
> approach we can't pre-run them until EBS(), so we would rely on the old
> kernel to still be intact when we want to kexec reboot.

I don't believe there's a way to do that on running kernels. As Jan
pointed out, this cannot be done during reboot, as for kdump that would
mean running after a panic. So it would need to run when the new image
is loaded. But at that time your user space is running. Plus you also
always have a user space component that triggers kexec. So you cannot
simply "zap" user space but have to somehow stash it away, run your
trusted user space, and then restore the old user space again. That
sounds pretty error-prone to me. Plus it will tank your performance
every time you do a kexec, which for kdump is every boot...

> You could do a system where you kexec into an intermediate kernel. That
> kernel get's kexec'd with a signed initrd that can use the normal
> kexec_load syscall to load do any kind of preparation in userspace.
> Problem: For that intermediate enviroment we already need a format
> that combines kernel image, initrd, cmdline all signed in one package
> aka UKI. Was it the chicken or the egg?
> 
> But this shows that if we implemented UKIs the easy way (kernel simply
> checks signature, extracts the pieces, and kexecs them like normal),
> this approach could always be used to support kexec for other future
> formats. They could use the kernels UKI support to boot into an
> intermediate kernel with UEFI implemented in userspace in the initrd.
> 
> So basically support UKIs the easy way and use them to be able to
> securely zap away userspace and start with a fresh kernel and signed
> userspace as a way to support other UEFI formats that are not UKI.

Well, in theory that should work. But I see several problems:

1) How does the first kernel tell the intermediate kernel which
file(s) with which command line to load? In fact, how does the first
kernel get the information itself? You would need a new system call
for that, one that takes two kernel images: one for the intermediate
kernel and one for the kernel to load.

Of course you could also build the intermediate UKI during kernel build
and include it into the image. Similar to what is done with the
purgatory. But that would totally bloat the kernel image. 

2) I expect that to be extremely painful to debug, if the intermediate
kernel runs into a panic. For sure kdump won't work in that case...

3) Distros would need to maintain and test the additional UKI.

4) This approach basically needs to boot twice. But there are people
out there who fight to reduce boot times extremely hard. For them every
millisecond counts. Telling them that they will need to wait twice as
long will be very hard to sell.

> >   
> > > Let me throw an other wild idea in the ring. Instead of implementing
> > > a EFI runtime we could also include a eBPF version of the stub into the
> > > images. kexec could then extract the eBPF program and let it run just
> > > like any other eBPF program with all the pros (and cons) that come with
> > > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > > simply kexec any OS installer. On the other hand it would make it
> > > easier to port UKIs et al. to non-EFI systems. What do you think?  
> > 
> > ebpf is not turing complete, I am not sure how far you will make it
> > with this, in the various implementations of EFI payloads there are
> > plenty of loops, sometimes IO loops, sometimes hash loops of huge data
> > (for measurements). As I understand ebpf is not really compatible such
> > code.

I don't believe we can simply take all those payloads and recompile
them to eBPF. There definitely needs to be some refactoring done first.
For example, the IO loops can be dropped for eBPF and simply mapped to the
corresponding kernel function, letting it do the full IO in one go.
There will be cases where that is more difficult, like hash loops,
where you have to end up with the same hash. But I believe that even
for those, ways could be found to get it to work.

Anyway, I'm sure that the picture I have in my head is way
oversimplified. There will be many pitfalls to handle for sure. Still I
believe it would be a nice experiment.
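
To make the hash-loop case a bit more concrete, here is a minimal sketch
in plain Python (not eBPF). It assumes a verifier-style limit on the
number of loop iterations per invocation; CHUNK, MAX_ITERS, hash_step()
and bounded_sha256() are hypothetical names. The point is only that the
hash state can be carried across bounded invocations, so the final
digest is the same as hashing everything in one go.

```python
import hashlib

CHUNK = 4096     # bytes hashed per loop iteration (assumed value)
MAX_ITERS = 8    # assumed upper bound on iterations per invocation

def hash_step(state, data, offset):
    """Hash at most MAX_ITERS chunks starting at offset; return the new offset."""
    for _ in range(MAX_ITERS):        # loop with a fixed upper bound
        if offset >= len(data):
            break
        state.update(data[offset:offset + CHUNK])
        offset += CHUNK
    return min(offset, len(data))

def bounded_sha256(data):
    """Drive hash_step() repeatedly; return (hex digest, number of invocations)."""
    state = hashlib.sha256()
    offset = 0
    invocations = 0
    while offset < len(data):         # the caller re-invokes the bounded step
        offset = hash_step(state, data, offset)
        invocations += 1
    return state.hexdigest(), invocations
```

With these numbers, each invocation covers at most 32 KiB, so a large
image needs many invocations, but the result matches
hashlib.sha256(data).hexdigest().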

Thanks
Philipp
Philipp Rudo Sept. 9, 2024, 1:55 p.m. UTC | #25
Hi Jarkko,

On Sat, 07 Sep 2024 14:41:38 +0300
"Jarkko Sakkinen" <jarkko@kernel.org> wrote:

> On Sat Sep 7, 2024 at 2:31 PM EEST, Jarkko Sakkinen wrote:
> > On Sat Sep 7, 2024 at 2:27 PM EEST, Jarkko Sakkinen wrote:  
> > > On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote:  
> > > > Let me throw an other wild idea in the ring. Instead of implementing
> > > > a EFI runtime we could also include a eBPF version of the stub into the
> > > > images. kexec could then extract the eBPF program and let it run just
> > > > like any other eBPF program with all the pros (and cons) that come with
> > > > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > > > simply kexec any OS installer. On the other hand it would make it
> > > > easier to port UKIs et al. to non-EFI systems. What do you think?  
> > >
> > > BPF would have some guarantees that are favorable such as programs
> > > always end, even faulty ones. It always has implicit "ExitBootServices".
> > >
> > > Just a remark.  
> >
> > Some days ago I was thinking could some of the kernel functionality be
> > eBPF at least like in formal theory because most of it is amortized,
> > i.e. does a fixed chunk of work. Not going into that rabbit hole but
> > I really like this idea and could be good experimentation ground for
> > such innovation.  
> 
> E.g. let's imagine there would imaginary eBPF-TPM driver framework.
> 
> How I would go doing that would be to take the existing TPM driver
> functionality and provide extra functions and resources available for
> subsystem specific BPF environment, and have the orhestration code as
> eBPF. I pretty much concluded that there is a chance that such could
> work out.
> 
> Not something in my immediate table but it is still really interesting
> idea, as instead of using language to separate "safe" and unsafe"
> regions you would use "VM" environments to create the walls. In the
> end of the day that would also great venture for Rust in kernel, i.e.
> compile that BPF from Rust.
> 
> Sorry going of the hook the comment triggered me ;-)

I'm glad you like the idea :-)

Sounds like an interesting idea you are having there!

Thanks
Philipp
Ard Biesheuvel Sept. 9, 2024, 2:04 p.m. UTC | #26
On Mon, 9 Sept 2024 at 15:49, Philipp Rudo <prudo@redhat.com> wrote:
>
> Hi Lennart,
> Hi Jan,
>
> On Mon, 9 Sep 2024 12:42:45 +0200
> Jan Hendrik Farr <kernel@jfarr.cc> wrote:
>
> > On 09 11:48:30, Lennart Poettering wrote:
> > > On Fr, 06.09.24 12:54, Philipp Rudo (prudo@redhat.com) wrote:
> > >
> > > > I mostly agree on what you have wrote. But I see a big problem in
> > > > running the EFI emulator in user space when it comes to secure boot.
> > > > The chain of trust ends in the kernel. So it's the kernel that needs to
> > > > verify that the image to be loaded can be trusted. But when the EFI
> > > > runtime is in user space the kernel simply cannot do that. Which means,
> > > > if we want to go this way, we would need to extend the chain of trust
> > > > to user space. Which will be a whole bucket of worms, not just a
> > > > can.
> > >
> > > May it would be nice to have a way to "zap" userspace away, i.e. allow
> > > the kernel to get rid of all processes in some way, reliable. And then
> > > simply start a new userspace, from a trusted definition. Or in other
> > > words: if you don't want to trust the usual userspace, then let's
> > > maybe just terminate it, and create it anew, with a clean, pristine
> > > definition the old userspace cannot get access to.
> >
> > Well, this is an interesting idea!
> >
> > However, I'm sceptical if this could be done in a secure way. How do we
> > ensure that nothing the old userspace did with the various interfaces to
> > the kernel has no impact on the new userspace? Maybe others can chime in
> > on this? Does kernel_lockdown give more guarantees related to this?
> >
> > Even if this is possible in a secure way, there is a problem with doing
> > this for kernels that are to be kexec'd on kernel panic. In this
> > approach we can't pre-run them until EBS(), so we would rely on the old
> > kernel to still be intact when we want to kexec reboot.
>
> I don't believe there's a way to do that on running kernels. As Jan
> pointed out, this cannot be done during reboot, as for kdump that would
> mean to run after a panic. So it would need to run when the new image
> is loaded. But at that time your user space is running. Plus you also
> always have a user space component that triggers kexec. So you cannot
> simply "zap" user space but have to somehow stash it away, run your
> trusted user space and, then restore the old user space again. That
> sounds pretty error prone to me. Plus it will tank your performance
> every time you do a kexec, which for kdump is every boot...
>

kdump has a kexec kernel 'standby' to launch when the kernel panics.
So for the UKI/EFI payload case, this would imply that the load
involves running the payload until EBS() and freezing the state.

Whether execution occurs in true user space or in a deprivileged
kernel context is an implementation detail, imho. We don't want to run
external code in privileged mode inside the kernel in any case, as
this would violate lockdown already. But it should be feasible to have
an EFI-compatible layer in the kernel that invokes the EFI entrypoint
of an image in a way that protects the host kernel. This could be user
mode on the CPU or perhaps a minimal KVM virtual machine.

The advantage of this approach is that the whole concept of purgatory
can be avoided - the EFI boot phase runs in parallel with the previous
kernel, which has full control over authentication and [emulated] PCR
extension, and has ultimate control over whether the kexec reboot is
permitted.
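
As a rough illustration of "run the payload until EBS() and freeze the
state", here is a toy model in Python. A generator stands in for the EFI
payload, and the host services its boot-service requests until it sees
ExitBootServices. Everything here (toy_payload(), load_until_ebs(),
resume(), the service names as strings, the addresses) is invented for
the sketch and says nothing about how a real implementation would
isolate or snapshot the payload.

```python
PAGE_SIZE = 4096

def toy_payload():
    """Stand-in for an EFI payload: yields (service, argument) requests."""
    buf = yield ("AllocatePages", 4)
    yield ("ExtendPCR", b"kernel-image")    # measurement request to the host
    yield ("ExitBootServices", None)
    # Everything from here on would run on bare metal after the handover.
    yield ("Booted", buf)

def load_until_ebs(payload):
    """Run payload under host control until it calls ExitBootServices.

    Returns the suspended payload (the frozen state) and a log of the
    services the host saw, which it could use to allow or veto the reboot.
    """
    log = []
    next_addr = 0x100000                    # arbitrary toy allocation base
    request = next(payload)
    while True:
        service, arg = request
        log.append(service)
        if service == "ExitBootServices":
            return payload, log             # generator stays suspended here
        if service == "AllocatePages":
            reply = next_addr
            next_addr += arg * PAGE_SIZE
        elif service == "ExtendPCR":
            reply = None                    # host would record the measurement
        else:
            raise RuntimeError(f"unsupported boot service: {service}")
        request = payload.send(reply)

def resume(frozen):
    """Continue the payload past ExitBootServices (the kexec handover)."""
    return next(frozen)
```

At load time the host would call load_until_ebs() and keep the result
around; resume() corresponds to the later kexec or panic path.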

> > You could do a system where you kexec into an intermediate kernel. That
> > kernel get's kexec'd with a signed initrd that can use the normal
> > kexec_load syscall to load do any kind of preparation in userspace.
> > Problem: For that intermediate enviroment we already need a format
> > that combines kernel image, initrd, cmdline all signed in one package
> > aka UKI. Was it the chicken or the egg?
> >
> > But this shows that if we implemented UKIs the easy way (kernel simply
> > checks signature, extracts the pieces, and kexecs them like normal),
> > this approach could always be used to support kexec for other future
> > formats. They could use the kernels UKI support to boot into an
> > intermediate kernel with UEFI implemented in userspace in the initrd.
> >
> > So basically support UKIs the easy way and use them to be able to
> > securely zap away userspace and start with a fresh kernel and signed
> > userspace as a way to support other UEFI formats that are not UKI.
>
> Well, in theory that should work. But I see several problems:
>
> 1) How does the first kernel tell the intermediate kernel which
> file(s) with wich command line to load? In fact, how does the first
> kernel get the information itself? You would need a new system call
> that takes two kernel images, one for the intermediate and one for the
> kernel to load,for that.
>
> Of course you could also build the intermediate UKI during kernel build
> and include it into the image. Similar to what is done with the
> purgatory. But that would totally bloat the kernel image.
>
> 2) I expect that to be extremely painful to debug, if the intermediate
> kernel runs into a panic. For sure kdump won't work in that case...
>
> 3) Distros would need maintain and test the additional UKI.
>
> 4) This approach basically needs to boot twice. But there are people
> out there who fight to reduce boot times extremely hard. For them every
> millisecond counts. Telling them that they will need to wait twice as
> long will be very hard to sell.
>

I don't think intermediate kernels are the solution here. We need to
run as much as possible under the control of the preceding kernel, and
minimize the bare metal handover that occurs after EBS(). Adding more
code to the purgatory (as this series does) is not acceptable to me,
as it is extremely difficult to debug, and duplicates drivers and
other logic (making it an 'intermediate kernel' of sorts already).

> > >
> > > > Let me throw an other wild idea in the ring. Instead of implementing
> > > > a EFI runtime we could also include a eBPF version of the stub into the
> > > > images. kexec could then extract the eBPF program and let it run just
> > > > like any other eBPF program with all the pros (and cons) that come with
> > > > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > > > simply kexec any OS installer. On the other hand it would make it
> > > > easier to port UKIs et al. to non-EFI systems. What do you think?
> > >
> > > ebpf is not turing complete, I am not sure how far you will make it
> > > with this, in the various implementations of EFI payloads there are
> > > plenty of loops, sometimes IO loops, sometimes hash loops of huge data
> > > (for measurements). As I understand ebpf is not really compatible such
> > > code.
>
> I don't believe we can simply take all those payloads and recompile
> them to eBPF. There definitely needs to be some refactoring done first.
> For example the IO loops you can drop for eBPF and simply map to the
> corresponding kernel function, letting them do the full IO in one go.
> There will be cases where that will be more difficult like for hash
> loops when you have to have the same hash at the end. But I believe
> even for that ways could be found to get it to work.
>
> Anyway, I'm sure that the picture I have in my head is way
> oversimplified. There will be many pitfalls to handle for sure. Still I
> believe it would be a nice experiment.
>

Today, UKI functionality is implemented in terms of EFI API calls. Any
solution that needs either a parallel implementation (eBPF vs EFI) or
needs to unpack the UKI in order to perform the steps that the UKI
would perform itself if it were executed in an EFI environment is a
no-go in my opinion.

So either we provide some EFI compatible runtime sufficient to run a
UKI, or we re-engineer UKI to be built on top of an abstraction that
can be implemented straight-forwardly both on system firmware and in
the EFI context.
Jan Hendrik Farr Sept. 9, 2024, 2:37 p.m. UTC | #27
On 09 16:04:50, Ard Biesheuvel wrote:
> 
> [...]
>
> kdump has a kexec kernel 'standby' to launch when the kernel panics.
> So for the UKI/EFI payload case, this would imply that the load
> involves running the payload until EBS() and freezing the state.
> 
> Whether execution occurs in true user space or in a deprivileged
> kernel context is an implementation detail, imho. We don't want to run
> external code in privileged mode inside the kernel in any case, as
> this would violate lockdown already. But it should be feasible to have
> a EFI compatible layer in the kernel that invokes the EFI entrypoint
> of an image in a way that protects the host kernel. This could be user
> mode on the CPU or perhaps a minimal KVM virtual machine.

This solution is what I'm currently in favor of (besides my original
approach), see: https://lore.kernel.org/kexec/Zt7EbvWjF9WPCYfn@gardel-login/T/#md4f02b7cb6c694cb28aa8d36fe47a02bd4dc17a4
Jarkko Sakkinen Sept. 9, 2024, 5:09 p.m. UTC | #28
On Mon Sep 9, 2024 at 4:55 PM EEST, Philipp Rudo wrote:
> Hi Jarkko,
>
> On Sat, 07 Sep 2024 14:41:38 +0300
> "Jarkko Sakkinen" <jarkko@kernel.org> wrote:
>
> > On Sat Sep 7, 2024 at 2:31 PM EEST, Jarkko Sakkinen wrote:
> > > On Sat Sep 7, 2024 at 2:27 PM EEST, Jarkko Sakkinen wrote:  
> > > > On Fri Sep 6, 2024 at 1:54 PM EEST, Philipp Rudo wrote:  
> > > > > Let me throw an other wild idea in the ring. Instead of implementing
> > > > > a EFI runtime we could also include a eBPF version of the stub into the
> > > > > images. kexec could then extract the eBPF program and let it run just
> > > > > like any other eBPF program with all the pros (and cons) that come with
> > > > > it. That won't be as generic as the EFI runtime, e.g. you couldn't
> > > > > simply kexec any OS installer. On the other hand it would make it
> > > > > easier to port UKIs et al. to non-EFI systems. What do you think?  
> > > >
> > > > BPF would have some guarantees that are favorable such as programs
> > > > always end, even faulty ones. It always has implicit "ExitBootServices".
> > > >
> > > > Just a remark.  
> > >
> > > Some days ago I was thinking could some of the kernel functionality be
> > > eBPF at least like in formal theory because most of it is amortized,
> > > i.e. does a fixed chunk of work. Not going into that rabbit hole but
> > > I really like this idea and could be good experimentation ground for
> > > such innovation.  
> > 
> > E.g. let's imagine there would imaginary eBPF-TPM driver framework.
> > 
> > How I would go doing that would be to take the existing TPM driver
> > functionality and provide extra functions and resources available for
> > subsystem specific BPF environment, and have the orhestration code as
> > eBPF. I pretty much concluded that there is a chance that such could
> > work out.
> > 
> > Not something in my immediate table but it is still really interesting
> > idea, as instead of using language to separate "safe" and unsafe"
> > regions you would use "VM" environments to create the walls. In the
> > end of the day that would also great venture for Rust in kernel, i.e.
> > compile that BPF from Rust.
> > 
> > Sorry going of the hook the comment triggered me ;-)
>
> I'm glad you like the idea :-)
>
> Sounds like an interesting idea you are having there!

Yeah, if you go forward with this, please CC me on any possible
follow-ups :-)

BR, Jarkko
Lennart Poettering Sept. 10, 2024, 7:06 a.m. UTC | #29
On Mo, 09.09.24 21:38, Pingfan Liu (piliu@redhat.com) wrote:

> Hi Lennart,
>
> I spent some time understanding the systemd-pcrlock and TPM stuff, and
> got some idea about it. Could you correct me if I'm wrong? Please see
> the following comments inlined.
>
> On Mon, Aug 26, 2024 at 9:40 PM Lennart Poettering <mzxreary@0pointer.de> wrote:
> >
> > On Do, 22.08.24 22:29, Pingfan Liu (piliu@redhat.com) wrote:
> >
> > > > Hmm, I'd really think about this with some priority. The measurement
> > > > stuff should not be an afterthought, it typically has major
> > > > implications on how you design your transitions, because measurements
> > > > of some component always need to happen *before* you pass control to
> > > > it, otherwise they are pointless.
> > > >
> > >
> > > At present, my emulator returns false to is_efi_secure_boot(), so
> > > systemd-stub does not care about the measurement, and moves on.
> > >
> > > Could you enlighten me about how systemd utilizes the measurement? I
> > > grepped 'TPM2_PCR_KERNEL_CONFIG', and saw the systemd-stub asks to
> > > extend PCR. But where is the value checked? I guess the systemd will
> > > hang if the check fails.
> >
> > systemd's "systemd-pcrlock" tool will look for measurements like that
> > and generate disk encryption TPM policies from that.
> >
>
> Before kexec reboots to the new kernel
> systemd-pcrlock can predict the expected PCR value and store it in the
> file system.

It's a set of PCR values pcrlock predicts, one or more for each PCR. It
then compiles a TPM "policy" from that, which is identified by a hash,
and that hash is then stored in a TPM "nvindex" (which is a bit of
memory a TPM provides).

> One thing should be noticed is that PCR value can not be affected.

Well, a kexec *should* affect some PCRs. Replacement of the kernel
*must* be visible in the measurement logs somehow, in a predictable
fashion.

> And kexec rebooting happens. systemd-stub extends the PCR value. When
> the system is up, systemd checks the real PCR value against the
> expected value rendered by systemd-pcrlock? If matching, all related
> policies succeed.

Well, it's not systemd that checks that, but the TPM, i.e. not the
untrusted OS but the supposedly more trusted TPM.

So, the key is that we want measurements to take place: the kexec
operation *must* be made visible in the measurement logs. But it must
be in a well-defined way, and ideally as an extension of the
measurements sd-stub currently makes.
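
For readers less familiar with PCRs, here is a minimal sketch of the
extend operation and of predicting a PCR value from a list of
measurements, in Python. It models a single SHA-256 bank only; the
measurement strings are made up, and this is not how systemd-pcrlock is
implemented (it compiles a TPM policy rather than comparing values
itself).

```python
import hashlib

PCR_SIZE = 32    # size of a PCR in the SHA-256 bank

def pcr_extend(pcr, data):
    """TPM-style extend: new PCR = SHA256(old PCR || SHA256(data))."""
    digest = hashlib.sha256(data).digest()
    return hashlib.sha256(pcr + digest).digest()

def predict_pcr(measurements):
    """Replay a list of measurements, starting from an all-zero PCR."""
    pcr = bytes(PCR_SIZE)
    for m in measurements:
        pcr = pcr_extend(pcr, m)
    return pcr

# Made-up measurement sequences: a normal boot, and the same boot
# followed by a kexec into another kernel.
boot = [b"firmware", b"sd-stub", b"kernel-A"]
boot_then_kexec = boot + [b"kexec: kernel-B"]
```

Because extend only ever folds new data into the old value,
predict_pcr(boot_then_kexec) differs from predict_pcr(boot), and the
kexec cannot be hidden after the fact. A policy can only match if the
kexec measurement is made in a predictable way.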

(BTW, I personally don't think emulating EFI is really that
important. As long as we get the key functionality that sd-stub
provides also when doing kexec I am happy. i.e. whether it is sd-stub
that does this or some other piece of code doesn't really matter to
me. What I do care about is that we can parameterize the invoked
kernel in a similar fashion as we can parameterize sd-stub, and that
the measurements applied are also equivalent.)

Lennart

--
Lennart Poettering, Berlin
Lennart Poettering Sept. 10, 2024, 7:54 a.m. UTC | #30
On Mo, 09.09.24 12:42, Jan Hendrik Farr (kernel@jfarr.cc) wrote:

> On 09 11:48:30, Lennart Poettering wrote:
> > On Fr, 06.09.24 12:54, Philipp Rudo (prudo@redhat.com) wrote:
> >
> > > I mostly agree on what you have wrote. But I see a big problem in
> > > running the EFI emulator in user space when it comes to secure boot.
> > > The chain of trust ends in the kernel. So it's the kernel that needs to
> > > verify that the image to be loaded can be trusted. But when the EFI
> > > runtime is in user space the kernel simply cannot do that. Which means,
> > > if we want to go this way, we would need to extend the chain of trust
> > > to user space. Which will be a whole bucket of worms, not just a
> > > can.
> >
> > May it would be nice to have a way to "zap" userspace away, i.e. allow
> > the kernel to get rid of all processes in some way, reliable. And then
> > simply start a new userspace, from a trusted definition. Or in other
> > words: if you don't want to trust the usual userspace, then let's
> > maybe just terminate it, and create it anew, with a clean, pristine
> > definition the old userspace cannot get access to.
>
> Well, this is an interesting idea!
>
> However, I'm sceptical if this could be done in a secure way. How do we
> ensure that nothing the old userspace did with the various interfaces to
> the kernel has no impact on the new userspace? Maybe others can chime in
> on this? Does kernel_lockdown give more guarantees related to this?

Yeah, it's not a trivial thing. I.e. I guess things like sysfs and
procfs will retain ownership/access mode. sysctls and sysfs attrs are
going to retain their most recently written contents and things like
that. Synthetic network interfaces, DM devices, loopback devices all
would survive this.

So, no idea how realistic this is, but I would *love* it, not only for
this purpose here, but also for the "soft-reboot" logic we have in
systemd these days, which shuts down userspace and starts it up again,
as a form of super-fast reboot that doesn't replace the kernel. If we
could reliably reset sysfs/sysctl/procfs/… during this, this would be
really lovely.

Lennart

--
Lennart Poettering, Berlin
Pingfan Liu Oct. 8, 2024, 11:59 a.m. UTC | #31
On Thu, Aug 29, 2024 at 1:08 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
[...]
>
> Thanks for putting this RFC together. This is useful work, and gives
> us food for thought and discussion.
>
> There are a few problems that become apparent when going through these changes.
>
> 1. Implementing UEFI entirely is intractable, and unnecessary.
> Implementing the subset of UEFI that is actually needed to boot Linux
> *is* tractable, though, but we need to work together to write this
> down somewhere.
>   - the EFI stub needs the boot services for the EFI memory map and
> the allocation routines
>   - GRUB needs block I/O
>   - systemd-stub/UKI needs file I/O to look for sidecars
>   - etc etc
>

I have created a git repo to record the current status:
[https://github.com/rhkdump/kexec_uefi.git]
In it, uefi_subset.md records the minimal UEFI requirements.

But I have a question about "GRUB needs block I/O": is it required? As
far as I know, kernel images such as UKI and zboot will be supported.
But why should GRUB be supported too?

Thanks,

Pingfan

> I implemented a Rust 'efiloader' crate a while ago that encapsulates
> most of this (it can boot Linux/arm64 on QEMU and boot x86 via GRUB in
> user space **). Adding file I/O to this should be straight-forward -
> as Lennart points out, we only need the protocol, it doesn't need to
> be backed by an actual file system, it just needs to be able to expose
> other files in the right way.
>
> 2. Running the UEFI emulator on bare metal is not going to scale.
> Cloning UART driver code and MMU code etc is a can of worms that you
> want to leave closed. And as Lennart points out, there is other
> hardware (TPM) that needs to be accessible as well. Providing a
> separate set of drivers for all hardware that the EFI emulator may
> need to access is not a tractable problem either.
>
> The fix for this, as I see it, is to run the EFI emulator in user
> space, to the point where the payload calls ExitBootServices(). This
> will allow all I/O and memory protocol to be implemented trivially,
> using C library routines. I have a crude prototype** of this running
> to the point where ExitBootServices() is called (and then it crashes).
> The tricky yet interesting bit here is how we migrate a chunk of user
> space memory to the bare metal context that will be created by the
> kexec syscall later (in which the call to ExitBootServices() would
> return and proceed with the boot). But the principle is rather
> straight-forward, and would permit us, e.g., to kexec an OS installer
> too.
>
> 3. We need to figure out how to support TPM and PCRs in the context of
> kexec. This is a fundamental issue with verified boot, given that the
> kexec PCR state is necessarily different from the boot state, and so
> we cannot reuse the TPM directly if we want to pretend that we are
> doing an ordinary boot in kexec. The alternative is to leave the TPM
> in a state where the kexec kernel can access its sealed secrets, and
> mock up the TCG2 EFI protocols using a shim that sits between the TPM
> hardware (as the real TCG2 protocols will be long gone) and the EFI
> payload. But as I said, this is a fundamental issue, as the ability to
> pretend that a kexec boot is a pristine boot would mean that verified
> boot is broken.
>
>
> As future work, I'd like to propose to collaborate on some alignment
> regarding a UEFI baseline for Linux, i.e., the parts that we actually
> need to boot Linux.
>
> For this series in particular, I don't see a way forward where we
> adopt this approach, and carry all this code inside the kernel.
>
> Thanks.
> Ard.
>