mbox series

[RFC,v2,0/3] Coroutines

Message ID cover.1738059345.git.jerome.forissier@linaro.org
Headers show
Series Coroutines | expand

Message

Jerome Forissier Jan. 28, 2025, 10:19 a.m. UTC
This series introduces a simple coroutines framework and uses it to
speed up the efi_init_obj_list() function. It is just an example to
trigger discussions; hopefully other places in U-Boot can benefit from
a similar treatment. Suggestions are welcome.

I came up with this idea after analyzing some profiling data (ftrace)
taken during the execution of the "printenv -e" command on a Kria KV260
board. When the command is first invoked, efi_init_obj_list() is called
which takes a significant amount of time (2.872 seconds). This time can
be split into 0.570 s for efi_disks_register(), 0.811 s for
efi_tcg2_register() and 1.418 s for efi_tcg2_do_initial_measurement().
All the other child functions are much quicker. Another interesting
observation is that a large part of the time is actually spent waiting
in __udelay(). More precisely:
- For efi_disks_register(): 421 ms / 570 ms = 73.8 % spent in __udelay()
- For efi_tcg2_register(): 805 ms / 811 ms = 99.1 % spent in __udelay()
- For efi_tcg2_do_initial_measurement(): 1395.025 ms / 1418.372 ms =
  98.3 % spent in __udelay()

Given the above data, it is reasonable to think that a nice speedup
could be obtained if these functions could somehow be run in parallel.
efi_tcg2_do_initial_measurement() unfortunately needs to be excluded
because it depends on the first two. But efi_disks_register() and
efi_tcg2_register() are clearly independant and are therefore good
candidates for concurrent execution.

So, I considered two options:

- Spin up a secondary core to take care of one function while the other
one runs on the main core
- Introduce some kind of software scheduling

I quickly ruled out the first one for several reasons: initializing a
secondary core is typically quite hardware-specific, it would not scale
well if more functions than available cores would need to be run in
parallel, it would make debugging harder etc.
Software scheduling however can be accomplished quite easily, especially
since we don't need to consider preemptive multitasking. Coroutines [1]
for example can perfectly do the job. They provide a way to save and
restore the execution context (registers and stack). Here is how it
look like:

static void do_some_work(int v)
{
	int i;

	for (i = 0; i < 5; i++) {
		printf("%d", v);
		/* Save context and resume main thread */
		co_yield();
	}
}

static void co1(void *arg)
{
	do_some_work(1);
	/* Mark coroutine as "done" and resume main thread */
	co_exit();
}

static void co2(void *arg)
{
	do_some_work(2);
	co_exit();
}

void main_thread(void)
{
	struct co *co1, *co2;

	co1 = co_create(co1, ...);
	co2 = co_create(co2, ...);

	do {
		printf("A");
		if (!co1->done) {
			/* Invoke or resume first coroutine */
			co_resume(co1);
		}
		printf("B");
		if (!cor21->done) {
			/* Invoke or resume second coroutine */
			co_resume(co2);
		}
	} while (!(co1->done && co2->done));

	/* At this point, co1 and co2 have both called co_exit() */
}

The above example would print: A1B2A1B2A1B2A1B2A1B2.

- The first commit introduces the coroutine framework, loosely based on
libaco [2]. The code was simplified and reformatted to better suit
U-Boot.
- The second commit modifies efi_init_obj_list() in order to turn
efi_disks_register() and efi_tcg2_register() into coroutines when
COROUTINES in enabled. On a KV260 board with a SD card inserted, this
saves about .6 second (2.2 s instead of 2.8 s).
- The third commit applies coroutines to usb_init(), which can
significantly reduce the time it takes to scan multiple buses. Tested
on arm64 QEMU with 4 XHCI buses: the USB scan takes  2.2 s instead of
5.8 s.

[1] https://en.wikipedia.org/wiki/Coroutine
[2] https://github.com/hnes/libaco/

Changes in v2

- Remove x86 and x86_64 arch code since it is untested
- Add missing SPDX license tag to arch/arm/cpu/armv8/co_switch.S
- Change Apache-2.0 SPDX license tag to "Apache-2.0 OR GPL-2.0-or-later"
- Apply coroutines to the USB bus scan in usb_init()
Jerome Forissier (3):
  Introduce coroutines framework
  efi_loader: optimize efi_init_obj_list() with coroutines
  usb: scan multiple buses simultaneously with coroutines

 arch/arm/cpu/armv8/Makefile    |   1 +
 arch/arm/cpu/armv8/co_switch.S |  36 +++++++
 drivers/usb/host/usb-uclass.c  | 152 +++++++++++++++++++++++++++++-
 include/coroutines.h           | 130 ++++++++++++++++++++++++++
 lib/Kconfig                    |  10 ++
 lib/Makefile                   |   2 +
 lib/coroutines.c               | 165 +++++++++++++++++++++++++++++++++
 lib/efi_loader/efi_setup.c     | 113 +++++++++++++++++++++-
 lib/time.c                     |  14 ++-
 9 files changed, 615 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm/cpu/armv8/co_switch.S
 create mode 100644 include/coroutines.h
 create mode 100644 lib/coroutines.c

Comments

Simon Glass Jan. 28, 2025, 2:04 p.m. UTC | #1
Hi Jerome,

On Tue, 28 Jan 2025 at 03:20, Jerome Forissier
<jerome.forissier@linaro.org> wrote:
>
> This series introduces a simple coroutines framework and uses it to
> speed up the efi_init_obj_list() function. It is just an example to
> trigger discussions; hopefully other places in U-Boot can benefit from
> a similar treatment. Suggestions are welcome.
>
> I came up with this idea after analyzing some profiling data (ftrace)
> taken during the execution of the "printenv -e" command on a Kria KV260
> board. When the command is first invoked, efi_init_obj_list() is called
> which takes a significant amount of time (2.872 seconds). This time can
> be split into 0.570 s for efi_disks_register(), 0.811 s for
> efi_tcg2_register() and 1.418 s for efi_tcg2_do_initial_measurement().
> All the other child functions are much quicker. Another interesting
> observation is that a large part of the time is actually spent waiting
> in __udelay(). More precisely:
> - For efi_disks_register(): 421 ms / 570 ms = 73.8 % spent in __udelay()
> - For efi_tcg2_register(): 805 ms / 811 ms = 99.1 % spent in __udelay()
> - For efi_tcg2_do_initial_measurement(): 1395.025 ms / 1418.372 ms =
>   98.3 % spent in __udelay()
>
> Given the above data, it is reasonable to think that a nice speedup
> could be obtained if these functions could somehow be run in parallel.
> efi_tcg2_do_initial_measurement() unfortunately needs to be excluded
> because it depends on the first two. But efi_disks_register() and
> efi_tcg2_register() are clearly independant and are therefore good
> candidates for concurrent execution.
>
> So, I considered two options:
>
> - Spin up a secondary core to take care of one function while the other
> one runs on the main core
> - Introduce some kind of software scheduling
>
> I quickly ruled out the first one for several reasons: initializing a
> secondary core is typically quite hardware-specific, it would not scale
> well if more functions than available cores would need to be run in
> parallel, it would make debugging harder etc.

Yes this is fairly brittle IMO. Aide: Years ago I did a prototype of
using a second core to speed up loading from MMC. It turned out that
just 'kicking' the mmc controller at the start was enough to speed
things up. I suspect that these days the controllers are faster and
there is less of a delay getting going.

> Software scheduling however can be accomplished quite easily, especially
> since we don't need to consider preemptive multitasking. Coroutines [1]
> for example can perfectly do the job. They provide a way to save and
> restore the execution context (registers and stack). Here is how it
> look like:
>
> static void do_some_work(int v)
> {
>         int i;
>
>         for (i = 0; i < 5; i++) {
>                 printf("%d", v);
>                 /* Save context and resume main thread */
>                 co_yield();
>         }
> }
>
> static void co1(void *arg)
> {
>         do_some_work(1);
>         /* Mark coroutine as "done" and resume main thread */
>         co_exit();
> }
>
> static void co2(void *arg)
> {
>         do_some_work(2);
>         co_exit();
> }
>
> void main_thread(void)
> {
>         struct co *co1, *co2;
>
>         co1 = co_create(co1, ...);
>         co2 = co_create(co2, ...);
>
>         do {
>                 printf("A");
>                 if (!co1->done) {
>                         /* Invoke or resume first coroutine */
>                         co_resume(co1);
>                 }
>                 printf("B");
>                 if (!cor21->done) {
>                         /* Invoke or resume second coroutine */
>                         co_resume(co2);
>                 }
>         } while (!(co1->done && co2->done));
>
>         /* At this point, co1 and co2 have both called co_exit() */
> }
>
> The above example would print: A1B2A1B2A1B2A1B2A1B2.
>
> - The first commit introduces the coroutine framework, loosely based on
> libaco [2]. The code was simplified and reformatted to better suit
> U-Boot.
> - The second commit modifies efi_init_obj_list() in order to turn
> efi_disks_register() and efi_tcg2_register() into coroutines when
> COROUTINES in enabled. On a KV260 board with a SD card inserted, this
> saves about .6 second (2.2 s instead of 2.8 s).
> - The third commit applies coroutines to usb_init(), which can
> significantly reduce the time it takes to scan multiple buses. Tested
> on arm64 QEMU with 4 XHCI buses: the USB scan takes  2.2 s instead of
> 5.8 s.
>
> [1] https://en.wikipedia.org/wiki/Coroutine
> [2] https://github.com/hnes/libaco/
>
> Changes in v2
>
> - Remove x86 and x86_64 arch code since it is untested
> - Add missing SPDX license tag to arch/arm/cpu/armv8/co_switch.S
> - Change Apache-2.0 SPDX license tag to "Apache-2.0 OR GPL-2.0-or-later"
> - Apply coroutines to the USB bus scan in usb_init()
> Jerome Forissier (3):
>   Introduce coroutines framework
>   efi_loader: optimize efi_init_obj_list() with coroutines
>   usb: scan multiple buses simultaneously with coroutines
>
>  arch/arm/cpu/armv8/Makefile    |   1 +
>  arch/arm/cpu/armv8/co_switch.S |  36 +++++++
>  drivers/usb/host/usb-uclass.c  | 152 +++++++++++++++++++++++++++++-
>  include/coroutines.h           | 130 ++++++++++++++++++++++++++
>  lib/Kconfig                    |  10 ++
>  lib/Makefile                   |   2 +
>  lib/coroutines.c               | 165 +++++++++++++++++++++++++++++++++
>  lib/efi_loader/efi_setup.c     | 113 +++++++++++++++++++++-
>  lib/time.c                     |  14 ++-
>  9 files changed, 615 insertions(+), 8 deletions(-)
>  create mode 100644 arch/arm/cpu/armv8/co_switch.S
>  create mode 100644 include/coroutines.h
>  create mode 100644 lib/coroutines.c
>
> --
> 2.43.0
>

Another idea would be to use the cyclic framework.

To use cyclic for USB, we would need a struct that holds the state of
the USB scanning. This is likely just a few index variables. The nice
thing about cyclic is that we are a bit more in control of what is
going on and we know at any point what state things are in. We have a
way of aborting when we want to. Arguably many of the high-level
functions in U-Boot should be written as:

int do_more(struct my_state *state)

and just go until they run out of things to immediately do*

It would collect the state for each subsystem in one place.

The other nice thing is that we can progress the boot with multiple
subsystems at the same time, with less risk of strange behaviour (I
suspect?). In other words, start USB scanning 'in the background'
while continuing to allow the user to type stuff at the cmdline.

I believe that we would very quickly want to get progress info from
the operation, but it's hard to be sure when it is safe to read it, if
the coroutines are doing their own thing on their own time with
limited visibility from the 'main' process.

You can see this sort of API with libfdt (fdt_first/next_subnode()),
driver model (uclass_first/next_device() and bootstd
(bootflow_scan_first/next()).

If we are going to have coroutines, then we should have sandbox
support, at least. How does it look inside the debugger?

Regards,
Simon

* BTW, with EFI, do the disks and TPM actually all need to be probed
right at the start? I'm not quite convinced that anyone has seriously
looked at this.