mbox series

[v2,00/20] crypto: crypto API library interfaces for WireGuard

Message ID 20191002141713.31189-1-ard.biesheuvel@linaro.org
Headers show
Series crypto: crypto API library interfaces for WireGuard | expand

Message

Ard Biesheuvel Oct. 2, 2019, 2:16 p.m. UTC
This is a followup to RFC 'crypto: wireguard with crypto API library interface'
[0]. Since no objections were raised to my approach, I've proceeded to fix up
some minor issues, and incorporate [most of] the missing MIPS code.

Changes since RFC/v1:
- dropped the WireGuard patch itself, and the followup patches - since the
  purpose was to illustrate the extent of the required changes, there is no
  reason to keep including them.
- import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the
  crypto API and library interfaces so that not only WireGuard but also IPsec
  and Adiantum can benefit immediately. (The latter required adding support for
  the reduced round version of ChaCha to the MIPS asm code)
- fix up various minor kconfig/build issues found in randconfig testing
  (thanks Arnd!)

In the future, I would like to extend these interfaces to use static calls,
so that the accelerated implementations can be [un]plugged at runtime. For
the time being, we rely on weak aliases and conditional exports so that the
users of the library interfaces link directly to the accelerated versions,
but without the ability to unplug them.

Patches can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=wireguard-crypto-library-api-v2

Cc: Herbert Xu <herbert@gondor.apana.org.au> 
Cc: David Miller <davem@davemloft.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Samuel Neves <sneves@dei.uc.pt>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Martin Willi <martin@strongswan.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>

[0] https://lore.kernel.org/linux-crypto/20190929173850.26055-1-ard.biesheuvel@linaro.org/

Ard Biesheuvel (14):
  crypto: chacha - move existing library code into lib/crypto
  crypto: x86/chacha - expose SIMD ChaCha routine as library function
  crypto: arm64/chacha - expose arm64 ChaCha routine as library function
  crypto: arm/chacha - expose ARM ChaCha routine as library function
  crypto: mips/chacha - import accelerated 32r2 code from Zinc
  crypto: poly1305 - move into lib/crypto and refactor into library
  crypto: x86/poly1305 - expose existing driver as poly1305 library
  crypto: arm64/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: arm/poly1305 - incorporate OpenSSL/CRYPTOGAMS NEON
    implementation
  crypto: mips/poly1305 - import accelerated 32r2 code from Zinc
  int128: move __uint128_t compiler test to Kconfig
  crypto: lib/curve25519 - work around Clang stack spilling issue
  crypto: chacha20poly1305 - import construction and selftest from Zinc
  crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine

Jason A. Donenfeld (6):
  crypto: BLAKE2s - generic C library implementation and selftest
  crypto: BLAKE2s - x86_64 library implementation
  crypto: Curve25519 - generic C library implementations and selftest
  crypto: Curve25519 - x86_64 library implementation
  crypto: arm - import Bernstein and Schwabe's Curve25519 ARM
    implementation
  crypto: arm/Curve25519 - wire up NEON implementation

 arch/arm/crypto/Kconfig                   |   11 +
 arch/arm/crypto/Makefile                  |   13 +-
 arch/arm/crypto/chacha-neon-glue.c        |   42 +-
 arch/arm/crypto/curve25519-core.S         | 2062 ++++++
 arch/arm/crypto/curve25519-glue.c         |   45 +
 arch/arm/crypto/poly1305-armv4.pl         | 1236 ++++
 arch/arm/crypto/poly1305-core.S_shipped   | 1158 +++
 arch/arm/crypto/poly1305-glue.c           |  274 +
 arch/arm64/Kconfig                        |    2 +-
 arch/arm64/crypto/Kconfig                 |    6 +
 arch/arm64/crypto/Makefile                |   10 +-
 arch/arm64/crypto/chacha-neon-glue.c      |   32 +-
 arch/arm64/crypto/poly1305-armv8.pl       |  913 +++
 arch/arm64/crypto/poly1305-core.S_shipped |  835 +++
 arch/arm64/crypto/poly1305-glue.c         |  229 +
 arch/mips/Makefile                        |    2 +-
 arch/mips/crypto/Makefile                 |    6 +
 arch/mips/crypto/chacha-core.S            |  424 ++
 arch/mips/crypto/chacha-glue.c            |  161 +
 arch/mips/crypto/poly1305-core.S          |  407 ++
 arch/mips/crypto/poly1305-glue.c          |  203 +
 arch/riscv/Kconfig                        |    2 +-
 arch/x86/Kconfig                          |    2 +-
 arch/x86/crypto/Makefile                  |    3 +
 arch/x86/crypto/blake2s-core.S            |  685 ++
 arch/x86/crypto/blake2s-glue.c            |   76 +
 arch/x86/crypto/chacha_glue.c             |   38 +-
 arch/x86/crypto/curve25519-x86_64.c       | 2381 +++++++
 arch/x86/crypto/poly1305_glue.c           |  145 +-
 crypto/Kconfig                            |   70 +
 crypto/adiantum.c                         |    5 +-
 crypto/chacha_generic.c                   |   44 +-
 crypto/ecc.c                              |    2 +-
 crypto/nhpoly1305.c                       |    3 +-
 crypto/poly1305_generic.c                 |  196 +-
 include/crypto/blake2s.h                  |   56 +
 include/crypto/chacha.h                   |   36 +-
 include/crypto/chacha20poly1305.h         |   48 +
 include/crypto/curve25519.h               |   28 +
 include/crypto/internal/chacha.h          |   25 +
 include/crypto/internal/poly1305.h        |   45 +
 include/crypto/poly1305.h                 |   43 +-
 init/Kconfig                              |    4 +
 lib/Makefile                              |    3 +-
 lib/crypto/Makefile                       |   40 +-
 lib/crypto/blake2s-selftest.c             | 2093 ++++++
 lib/crypto/blake2s.c                      |  281 +
 lib/{ => crypto}/chacha.c                 |   25 +-
 lib/crypto/chacha20poly1305-selftest.c    | 7394 ++++++++++++++++++++
 lib/crypto/chacha20poly1305.c             |  369 +
 lib/crypto/curve25519-fiat32.c            |  864 +++
 lib/crypto/curve25519-hacl64.c            |  788 +++
 lib/crypto/curve25519-selftest.c          | 1321 ++++
 lib/crypto/curve25519.c                   |   86 +
 lib/crypto/libchacha.c                    |   67 +
 lib/crypto/poly1305.c                     |  248 +
 lib/ubsan.c                               |    2 +-
 lib/ubsan.h                               |    2 +-
 58 files changed, 25213 insertions(+), 378 deletions(-)
 create mode 100644 arch/arm/crypto/curve25519-core.S
 create mode 100644 arch/arm/crypto/curve25519-glue.c
 create mode 100644 arch/arm/crypto/poly1305-armv4.pl
 create mode 100644 arch/arm/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm/crypto/poly1305-glue.c
 create mode 100644 arch/arm64/crypto/poly1305-armv8.pl
 create mode 100644 arch/arm64/crypto/poly1305-core.S_shipped
 create mode 100644 arch/arm64/crypto/poly1305-glue.c
 create mode 100644 arch/mips/crypto/chacha-core.S
 create mode 100644 arch/mips/crypto/chacha-glue.c
 create mode 100644 arch/mips/crypto/poly1305-core.S
 create mode 100644 arch/mips/crypto/poly1305-glue.c
 create mode 100644 arch/x86/crypto/blake2s-core.S
 create mode 100644 arch/x86/crypto/blake2s-glue.c
 create mode 100644 arch/x86/crypto/curve25519-x86_64.c
 create mode 100644 include/crypto/blake2s.h
 create mode 100644 include/crypto/chacha20poly1305.h
 create mode 100644 include/crypto/curve25519.h
 create mode 100644 include/crypto/internal/chacha.h
 create mode 100644 include/crypto/internal/poly1305.h
 create mode 100644 lib/crypto/blake2s-selftest.c
 create mode 100644 lib/crypto/blake2s.c
 rename lib/{ => crypto}/chacha.c (84%)
 create mode 100644 lib/crypto/chacha20poly1305-selftest.c
 create mode 100644 lib/crypto/chacha20poly1305.c
 create mode 100644 lib/crypto/curve25519-fiat32.c
 create mode 100644 lib/crypto/curve25519-hacl64.c
 create mode 100644 lib/crypto/curve25519-selftest.c
 create mode 100644 lib/crypto/curve25519.c
 create mode 100644 lib/crypto/libchacha.c
 create mode 100644 lib/crypto/poly1305.c

-- 
2.20.1

Comments

Ard Biesheuvel Oct. 3, 2019, 8:43 a.m. UTC | #1
On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>

...
>

> In the future, I would like to extend these interfaces to use static calls,

> so that the accelerated implementations can be [un]plugged at runtime. For

> the time being, we rely on weak aliases and conditional exports so that the

> users of the library interfaces link directly to the accelerated versions,

> but without the ability to unplug them.

>


As it turns out, we don't actually need static calls for this.
Instead, we can simply permit weak symbol references to go unresolved
between modules (as we already do in the kernel itself, due to the
fact that ELF permits it), and have the accelerated code live in
separate modules that may not be loadable on certain systems, or be
blacklisted by the user.
Jason A. Donenfeld Oct. 4, 2019, 1:16 p.m. UTC | #2
Hi Ard,

On Wed, Oct 02, 2019 at 04:16:53PM +0200, Ard Biesheuvel wrote:
> This is a followup to RFC 'crypto: wireguard with crypto API library interface'

> [0]. Since no objections were raised to my approach, I've proceeded to fix up

> some minor issues, and incorporate [most of] the missing MIPS code.

> 

> Changes since RFC/v1:

> - dropped the WireGuard patch itself, and the followup patches - since the

>   purpose was to illustrate the extent of the required changes, there is no

>   reason to keep including them.

> - import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the

>   crypto API and library interfaces so that not only WireGuard but also IPsec

>   and Adiantum can benefit immediately. (The latter required adding support for

>   the reduced round version of ChaCha to the MIPS asm code)

> - fix up various minor kconfig/build issues found in randconfig testing

>   (thanks Arnd!)


Thanks for working on this. By wiring up the accelerated primitives,
you're essentially implementing Zinc, and I expect that by the end of
this, we'll wind up with something pretty close to where I started, with
the critical difference that the directory names are different. Despite
my initial email about WireGuard moving to the crypto API, it sounds
like in the end it is likely to stay with Zinc, just without that name.

Jason
Jason A. Donenfeld Oct. 4, 2019, 1:42 p.m. UTC | #3
On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >

> ...

> >

> > In the future, I would like to extend these interfaces to use static calls,

> > so that the accelerated implementations can be [un]plugged at runtime. For

> > the time being, we rely on weak aliases and conditional exports so that the

> > users of the library interfaces link directly to the accelerated versions,

> > but without the ability to unplug them.

> >

> 

> As it turns out, we don't actually need static calls for this.

> Instead, we can simply permit weak symbol references to go unresolved

> between modules (as we already do in the kernel itself, due to the

> fact that ELF permits it), and have the accelerated code live in

> separate modules that may not be loadable on certain systems, or be

> blacklisted by the user.


You're saying that at module insertion time, the kernel will override
weak symbols with those provided by the module itself? At runtime?

Do you know offhand how this patching works? Is there a PLT that gets
patched, and so the calls all go through a layer of function pointer
indirection? Or are all call sites fixed up at insertion time and the
call instructions rewritten with some runtime patching magic?

Unless the method is the latter, I would assume that static calls are
much faster in general? Or the approach already in this series is
perhaps fine enough, and we don't need to break this apart into
individual modules complicating everything?
Ard Biesheuvel Oct. 4, 2019, 1:52 p.m. UTC | #4
On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

> > On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> > >

> > ...

> > >

> > > In the future, I would like to extend these interfaces to use static calls,

> > > so that the accelerated implementations can be [un]plugged at runtime. For

> > > the time being, we rely on weak aliases and conditional exports so that the

> > > users of the library interfaces link directly to the accelerated versions,

> > > but without the ability to unplug them.

> > >

> >

> > As it turns out, we don't actually need static calls for this.

> > Instead, we can simply permit weak symbol references to go unresolved

> > between modules (as we already do in the kernel itself, due to the

> > fact that ELF permits it), and have the accelerated code live in

> > separate modules that may not be loadable on certain systems, or be

> > blacklisted by the user.

>

> You're saying that at module insertion time, the kernel will override

> weak symbols with those provided by the module itself? At runtime?

>


Yes.

> Do you know offhand how this patching works? Is there a PLT that gets

> patched, and so the calls all go through a layer of function pointer

> indirection? Or are all call sites fixed up at insertion time and the

> call instructions rewritten with some runtime patching magic?

>


No magic. Take curve25519 for example, when built for ARM:

00000000 <curve25519>:
   0:   f240 0300       movw    r3, #0
                        0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch
   4:   f2c0 0300       movt    r3, #0
                        4: R_ARM_THM_MOVT_ABS   curve25519_arch
   8:   b570            push    {r4, r5, r6, lr}
   a:   4604            mov     r4, r0
   c:   460d            mov     r5, r1
   e:   4616            mov     r6, r2
  10:   b173            cbz     r3, 30 <curve25519+0x30>
  12:   f7ff fffe       bl      0 <curve25519_arch>
                        12: R_ARM_THM_CALL      curve25519_arch
  16:   b158            cbz     r0, 30 <curve25519+0x30>
  18:   4620            mov     r0, r4
  1a:   2220            movs    r2, #32
  1c:   f240 0100       movw    r1, #0
                        1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0
  20:   f2c0 0100       movt    r1, #0
                        20: R_ARM_THM_MOVT_ABS  .LANCHOR0
  24:   f7ff fffe       bl      0 <__crypto_memneq>
                        24: R_ARM_THM_CALL      __crypto_memneq
  28:   3000            adds    r0, #0
  2a:   bf18            it      ne
  2c:   2001            movne   r0, #1
  2e:   bd70            pop     {r4, r5, r6, pc}
  30:   4632            mov     r2, r6
  32:   4629            mov     r1, r5
  34:   4620            mov     r0, r4
  36:   f7ff fffe       bl      0 <curve25519_generic>
                        36: R_ARM_THM_CALL      curve25519_generic
  3a:   e7ed            b.n     18 <curve25519+0x18>

curve25519_arch is a weak reference. It either gets satisfied at
module load time, or it doesn't.

If it does get satisfied, the relocations covering the movw/movt pair
and the one covering the bl instruction get updated so that they point
to the arch routine.

If it does not get satisfied, the relocations are disregarded, in
which case the cbz instruction at offset 0x10 jumps over the bl call.

Note that this does not involve any memory accesses. It does involve
some code patching, but only of the kind the module loader already
does.

> Unless the method is the latter, I would assume that static calls are

> much faster in general? Or the approach already in this series is

> perhaps fine enough, and we don't need to break this apart into

> individual modules complicating everything?


The big advantage this approach has over Zinc is that the accelerated
code does not have to stay resident in memory if the cpu is incapable
of executing it.
Jason A. Donenfeld Oct. 4, 2019, 2:12 p.m. UTC | #5
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> This is a followup to RFC 'crypto: wireguard with crypto API library interface'

> [0]. Since no objections were raised to my approach, I've proceeded to fix up

> some minor issues, and incorporate [most of] the missing MIPS code.


Could you get the mips64 code into the next version? WireGuard is
widely used on the Ubiquiti EdgeRouter, which runs on the Octeon, a
64-bit mips chips.
Andy Lutomirski Oct. 4, 2019, 2:50 p.m. UTC | #6
> On Oct 4, 2019, at 6:42 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> 

> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>>> 

>> ...

>>> 

>>> In the future, I would like to extend these interfaces to use static calls,

>>> so that the accelerated implementations can be [un]plugged at runtime. For

>>> the time being, we rely on weak aliases and conditional exports so that the

>>> users of the library interfaces link directly to the accelerated versions,

>>> but without the ability to unplug them.

>>> 

>> 

>> As it turns out, we don't actually need static calls for this.

>> Instead, we can simply permit weak symbol references to go unresolved

>> between modules (as we already do in the kernel itself, due to the

>> fact that ELF permits it), and have the accelerated code live in

>> separate modules that may not be loadable on certain systems, or be

>> blacklisted by the user.

> 

> You're saying that at module insertion time, the kernel will override

> weak symbols with those provided by the module itself? At runtime?

> 

> Do you know offhand how this patching works? Is there a PLT that gets

> patched, and so the calls all go through a layer of function pointer

> indirection? Or are all call sites fixed up at insertion time and the

> call instructions rewritten with some runtime patching magic?

> 

> Unless the method is the latter, I would assume that static calls are

> much faster in general? Or the approach already in this series is

> perhaps fine enough, and we don't need to break this apart into

> individual modules complicating everything?


I admit I’m a bit mystified too. I think it would be great to have a feature where a special type of static call could be redirected by the module loader when a module that exports a corresponding symbol is loaded.  Unloading such a module would be interesting.

Ard, what exactly are you imagining?
Andy Lutomirski Oct. 4, 2019, 2:53 p.m. UTC | #7
> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> 

> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>> 

>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>>>> 

>>> ...

>>>> 

>>>> In the future, I would like to extend these interfaces to use static calls,

>>>> so that the accelerated implementations can be [un]plugged at runtime. For

>>>> the time being, we rely on weak aliases and conditional exports so that the

>>>> users of the library interfaces link directly to the accelerated versions,

>>>> but without the ability to unplug them.

>>>> 

>>> 

>>> As it turns out, we don't actually need static calls for this.

>>> Instead, we can simply permit weak symbol references to go unresolved

>>> between modules (as we already do in the kernel itself, due to the

>>> fact that ELF permits it), and have the accelerated code live in

>>> separate modules that may not be loadable on certain systems, or be

>>> blacklisted by the user.

>> 

>> You're saying that at module insertion time, the kernel will override

>> weak symbols with those provided by the module itself? At runtime?

>> 

> 

> Yes.

> 

>> Do you know offhand how this patching works? Is there a PLT that gets

>> patched, and so the calls all go through a layer of function pointer

>> indirection? Or are all call sites fixed up at insertion time and the

>> call instructions rewritten with some runtime patching magic?

>> 

> 

> No magic. Take curve25519 for example, when built for ARM:

> 

> 00000000 <curve25519>:

>   0:   f240 0300       movw    r3, #0

>                        0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch

>   4:   f2c0 0300       movt    r3, #0

>                        4: R_ARM_THM_MOVT_ABS   curve25519_arch

>   8:   b570            push    {r4, r5, r6, lr}

>   a:   4604            mov     r4, r0

>   c:   460d            mov     r5, r1

>   e:   4616            mov     r6, r2

>  10:   b173            cbz     r3, 30 <curve25519+0x30>

>  12:   f7ff fffe       bl      0 <curve25519_arch>

>                        12: R_ARM_THM_CALL      curve25519_arch

>  16:   b158            cbz     r0, 30 <curve25519+0x30>

>  18:   4620            mov     r0, r4

>  1a:   2220            movs    r2, #32

>  1c:   f240 0100       movw    r1, #0

>                        1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0

>  20:   f2c0 0100       movt    r1, #0

>                        20: R_ARM_THM_MOVT_ABS  .LANCHOR0

>  24:   f7ff fffe       bl      0 <__crypto_memneq>

>                        24: R_ARM_THM_CALL      __crypto_memneq

>  28:   3000            adds    r0, #0

>  2a:   bf18            it      ne

>  2c:   2001            movne   r0, #1

>  2e:   bd70            pop     {r4, r5, r6, pc}

>  30:   4632            mov     r2, r6

>  32:   4629            mov     r1, r5

>  34:   4620            mov     r0, r4

>  36:   f7ff fffe       bl      0 <curve25519_generic>

>                        36: R_ARM_THM_CALL      curve25519_generic

>  3a:   e7ed            b.n     18 <curve25519+0x18>

> 

> curve25519_arch is a weak reference. It either gets satisfied at

> module load time, or it doesn't.

> 

> If it does get satisfied, the relocations covering the movw/movt pair

> and the one covering the bl instruction get updated so that they point

> to the arch routine.

> 

> If it does not get satisfied, the relocations are disregarded, in

> which case the cbz instruction at offset 0x10 jumps over the bl call.

> 

> Note that this does not involve any memory accesses. It does involve

> some code patching, but only of the kind the module loader already

> does.


Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?

I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.
Jason A. Donenfeld Oct. 4, 2019, 2:55 p.m. UTC | #8
On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski <luto@amacapital.net> wrote:
> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.


Static calls perform well and are well understood. This would be my preference.
Ard Biesheuvel Oct. 4, 2019, 2:56 p.m. UTC | #9
On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote:
>

>

>

> > On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >

> > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> >>

> >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

> >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >>>>

> >>> ...

> >>>>

> >>>> In the future, I would like to extend these interfaces to use static calls,

> >>>> so that the accelerated implementations can be [un]plugged at runtime. For

> >>>> the time being, we rely on weak aliases and conditional exports so that the

> >>>> users of the library interfaces link directly to the accelerated versions,

> >>>> but without the ability to unplug them.

> >>>>

> >>>

> >>> As it turns out, we don't actually need static calls for this.

> >>> Instead, we can simply permit weak symbol references to go unresolved

> >>> between modules (as we already do in the kernel itself, due to the

> >>> fact that ELF permits it), and have the accelerated code live in

> >>> separate modules that may not be loadable on certain systems, or be

> >>> blacklisted by the user.

> >>

> >> You're saying that at module insertion time, the kernel will override

> >> weak symbols with those provided by the module itself? At runtime?

> >>

> >

> > Yes.

> >

> >> Do you know offhand how this patching works? Is there a PLT that gets

> >> patched, and so the calls all go through a layer of function pointer

> >> indirection? Or are all call sites fixed up at insertion time and the

> >> call instructions rewritten with some runtime patching magic?

> >>

> >

> > No magic. Take curve25519 for example, when built for ARM:

> >

> > 00000000 <curve25519>:

> >   0:   f240 0300       movw    r3, #0

> >                        0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch

> >   4:   f2c0 0300       movt    r3, #0

> >                        4: R_ARM_THM_MOVT_ABS   curve25519_arch

> >   8:   b570            push    {r4, r5, r6, lr}

> >   a:   4604            mov     r4, r0

> >   c:   460d            mov     r5, r1

> >   e:   4616            mov     r6, r2

> >  10:   b173            cbz     r3, 30 <curve25519+0x30>

> >  12:   f7ff fffe       bl      0 <curve25519_arch>

> >                        12: R_ARM_THM_CALL      curve25519_arch

> >  16:   b158            cbz     r0, 30 <curve25519+0x30>

> >  18:   4620            mov     r0, r4

> >  1a:   2220            movs    r2, #32

> >  1c:   f240 0100       movw    r1, #0

> >                        1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0

> >  20:   f2c0 0100       movt    r1, #0

> >                        20: R_ARM_THM_MOVT_ABS  .LANCHOR0

> >  24:   f7ff fffe       bl      0 <__crypto_memneq>

> >                        24: R_ARM_THM_CALL      __crypto_memneq

> >  28:   3000            adds    r0, #0

> >  2a:   bf18            it      ne

> >  2c:   2001            movne   r0, #1

> >  2e:   bd70            pop     {r4, r5, r6, pc}

> >  30:   4632            mov     r2, r6

> >  32:   4629            mov     r1, r5

> >  34:   4620            mov     r0, r4

> >  36:   f7ff fffe       bl      0 <curve25519_generic>

> >                        36: R_ARM_THM_CALL      curve25519_generic

> >  3a:   e7ed            b.n     18 <curve25519+0x18>

> >

> > curve25519_arch is a weak reference. It either gets satisfied at

> > module load time, or it doesn't.

> >

> > If it does get satisfied, the relocations covering the movw/movt pair

> > and the one covering the bl instruction get updated so that they point

> > to the arch routine.

> >

> > If it does not get satisfied, the relocations are disregarded, in

> > which case the cbz instruction at offset 0x10 jumps over the bl call.

> >

> > Note that this does not involve any memory accesses. It does involve

> > some code patching, but only of the kind the module loader already

> > does.

>

> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?

>


Indeed, the arch module needs to be loaded first

> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded.


That is what I am doing for chacha and poly

> Or use static calls.
Ard Biesheuvel Oct. 4, 2019, 2:59 p.m. UTC | #10
On Fri, 4 Oct 2019 at 16:55, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>

> On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski <luto@amacapital.net> wrote:

> > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.

>

> Static calls perform well and are well understood. This would be my preference.


How so? No code exists yet in mainline, and the x86 code is still
heavily being discussed, requires voodoo code patching and tooling
changes. Implementations for other architectures have not even been
proposed yet, with the exception of the code I wrote for arm64.
Ard Biesheuvel Oct. 5, 2019, 7:24 a.m. UTC | #11
On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>

> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote:

> >

> >

> >

> > > On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> > >

> > > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> > >>

> > >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

> > >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> > >>>>

> > >>> ...

> > >>>>

> > >>>> In the future, I would like to extend these interfaces to use static calls,

> > >>>> so that the accelerated implementations can be [un]plugged at runtime. For

> > >>>> the time being, we rely on weak aliases and conditional exports so that the

> > >>>> users of the library interfaces link directly to the accelerated versions,

> > >>>> but without the ability to unplug them.

> > >>>>

> > >>>

> > >>> As it turns out, we don't actually need static calls for this.

> > >>> Instead, we can simply permit weak symbol references to go unresolved

> > >>> between modules (as we already do in the kernel itself, due to the

> > >>> fact that ELF permits it), and have the accelerated code live in

> > >>> separate modules that may not be loadable on certain systems, or be

> > >>> blacklisted by the user.

> > >>

> > >> You're saying that at module insertion time, the kernel will override

> > >> weak symbols with those provided by the module itself? At runtime?

> > >>

> > >

> > > Yes.

> > >

> > >> Do you know offhand how this patching works? Is there a PLT that gets

> > >> patched, and so the calls all go through a layer of function pointer

> > >> indirection? Or are all call sites fixed up at insertion time and the

> > >> call instructions rewritten with some runtime patching magic?

> > >>

> > >

> > > No magic. Take curve25519 for example, when built for ARM:

> > >

> > > 00000000 <curve25519>:

> > >   0:   f240 0300       movw    r3, #0

> > >                        0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch

> > >   4:   f2c0 0300       movt    r3, #0

> > >                        4: R_ARM_THM_MOVT_ABS   curve25519_arch

> > >   8:   b570            push    {r4, r5, r6, lr}

> > >   a:   4604            mov     r4, r0

> > >   c:   460d            mov     r5, r1

> > >   e:   4616            mov     r6, r2

> > >  10:   b173            cbz     r3, 30 <curve25519+0x30>

> > >  12:   f7ff fffe       bl      0 <curve25519_arch>

> > >                        12: R_ARM_THM_CALL      curve25519_arch

> > >  16:   b158            cbz     r0, 30 <curve25519+0x30>

> > >  18:   4620            mov     r0, r4

> > >  1a:   2220            movs    r2, #32

> > >  1c:   f240 0100       movw    r1, #0

> > >                        1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0

> > >  20:   f2c0 0100       movt    r1, #0

> > >                        20: R_ARM_THM_MOVT_ABS  .LANCHOR0

> > >  24:   f7ff fffe       bl      0 <__crypto_memneq>

> > >                        24: R_ARM_THM_CALL      __crypto_memneq

> > >  28:   3000            adds    r0, #0

> > >  2a:   bf18            it      ne

> > >  2c:   2001            movne   r0, #1

> > >  2e:   bd70            pop     {r4, r5, r6, pc}

> > >  30:   4632            mov     r2, r6

> > >  32:   4629            mov     r1, r5

> > >  34:   4620            mov     r0, r4

> > >  36:   f7ff fffe       bl      0 <curve25519_generic>

> > >                        36: R_ARM_THM_CALL      curve25519_generic

> > >  3a:   e7ed            b.n     18 <curve25519+0x18>

> > >

> > > curve25519_arch is a weak reference. It either gets satisfied at

> > > module load time, or it doesn't.

> > >

> > > If it does get satisfied, the relocations covering the movw/movt pair

> > > and the one covering the bl instruction get updated so that they point

> > > to the arch routine.

> > >

> > > If it does not get satisfied, the relocations are disregarded, in

> > > which case the cbz instruction at offset 0x10 jumps over the bl call.

> > >

> > > Note that this does not involve any memory accesses. It does involve

> > > some code patching, but only of the kind the module loader already

> > > does.

> >

> > Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?

> >

>

> Indeed, the arch module needs to be loaded first

>


Actually, this can be addressed by retaining the module dependencies
as before, but permitting the arch module to be omitted at load time.

> > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded.

>

> That is what I am doing for chacha and poly

>

> > Or use static calls.


Given that static calls don't actually exist yet, I propose to proceed
with the approach above, and switch to static calls once all
architectures where it matters have an implementation that does not
use function pointers (which is how static calls will be implemented
generically)
Andy Lutomirski Oct. 7, 2019, 4:44 a.m. UTC | #12
> On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> 

> On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>> 

>>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote:

>>> 

>>> 

>>> 

>>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>>>> 

>>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

>>>>> 

>>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

>>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

>>>>>>> 

>>>>>> ...

>>>>>>> 

>>>>>>> In the future, I would like to extend these interfaces to use static calls,

>>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For

>>>>>>> the time being, we rely on weak aliases and conditional exports so that the

>>>>>>> users of the library interfaces link directly to the accelerated versions,

>>>>>>> but without the ability to unplug them.

>>>>>>> 

>>>>>> 

>>>>>> As it turns out, we don't actually need static calls for this.

>>>>>> Instead, we can simply permit weak symbol references to go unresolved

>>>>>> between modules (as we already do in the kernel itself, due to the

>>>>>> fact that ELF permits it), and have the accelerated code live in

>>>>>> separate modules that may not be loadable on certain systems, or be

>>>>>> blacklisted by the user.

>>>>> 

>>>>> You're saying that at module insertion time, the kernel will override

>>>>> weak symbols with those provided by the module itself? At runtime?

>>>>> 

>>>> 

>>>> Yes.

>>>> 

>>>>> Do you know offhand how this patching works? Is there a PLT that gets

>>>>> patched, and so the calls all go through a layer of function pointer

>>>>> indirection? Or are all call sites fixed up at insertion time and the

>>>>> call instructions rewritten with some runtime patching magic?

>>>>> 

>>>> 

>>>> No magic. Take curve25519 for example, when built for ARM:

>>>> 

>>>> 00000000 <curve25519>:

>>>>  0:   f240 0300       movw    r3, #0

>>>>                       0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch

>>>>  4:   f2c0 0300       movt    r3, #0

>>>>                       4: R_ARM_THM_MOVT_ABS   curve25519_arch

>>>>  8:   b570            push    {r4, r5, r6, lr}

>>>>  a:   4604            mov     r4, r0

>>>>  c:   460d            mov     r5, r1

>>>>  e:   4616            mov     r6, r2

>>>> 10:   b173            cbz     r3, 30 <curve25519+0x30>

>>>> 12:   f7ff fffe       bl      0 <curve25519_arch>

>>>>                       12: R_ARM_THM_CALL      curve25519_arch

>>>> 16:   b158            cbz     r0, 30 <curve25519+0x30>

>>>> 18:   4620            mov     r0, r4

>>>> 1a:   2220            movs    r2, #32

>>>> 1c:   f240 0100       movw    r1, #0

>>>>                       1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0

>>>> 20:   f2c0 0100       movt    r1, #0

>>>>                       20: R_ARM_THM_MOVT_ABS  .LANCHOR0

>>>> 24:   f7ff fffe       bl      0 <__crypto_memneq>

>>>>                       24: R_ARM_THM_CALL      __crypto_memneq

>>>> 28:   3000            adds    r0, #0

>>>> 2a:   bf18            it      ne

>>>> 2c:   2001            movne   r0, #1

>>>> 2e:   bd70            pop     {r4, r5, r6, pc}

>>>> 30:   4632            mov     r2, r6

>>>> 32:   4629            mov     r1, r5

>>>> 34:   4620            mov     r0, r4

>>>> 36:   f7ff fffe       bl      0 <curve25519_generic>

>>>>                       36: R_ARM_THM_CALL      curve25519_generic

>>>> 3a:   e7ed            b.n     18 <curve25519+0x18>

>>>> 

>>>> curve25519_arch is a weak reference. It either gets satisfied at

>>>> module load time, or it doesn't.

>>>> 

>>>> If it does get satisfied, the relocations covering the movw/movt pair

>>>> and the one covering the bl instruction get updated so that they point

>>>> to the arch routine.

>>>> 

>>>> If it does not get satisfied, the relocations are disregarded, in

>>>> which case the cbz instruction at offset 0x10 jumps over the bl call.

>>>> 

>>>> Note that this does not involve any memory accesses. It does involve

>>>> some code patching, but only of the kind the module loader already

>>>> does.

>>> 

>>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?

>>> 

>> 

>> Indeed, the arch module needs to be loaded first

>> 

> 

> Actually, this can be addressed by retaining the module dependencies

> as before, but permitting the arch module to be omitted at load time.


I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

> 

>>> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded.

>> 

>> That is what I am doing for chacha and poly

>> 

>>> Or use static calls.

> 

> Given that static calls don't actually exist yet, I propose to proceed

> with the approach above, and switch to static calls once all

> architectures where it matters have an implementation that does not

> use function pointers (which is how static calls will be implemented

> generically)
Ard Biesheuvel Oct. 7, 2019, 5:23 a.m. UTC | #13
On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote:
>

>

>

> > On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >

> > On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >>

> >>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote:

> >>>

> >>>

> >>>

> >>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >>>>

> >>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> >>>>>

> >>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:

> >>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> >>>>>>>

> >>>>>> ...

> >>>>>>>

> >>>>>>> In the future, I would like to extend these interfaces to use static calls,

> >>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For

> >>>>>>> the time being, we rely on weak aliases and conditional exports so that the

> >>>>>>> users of the library interfaces link directly to the accelerated versions,

> >>>>>>> but without the ability to unplug them.

> >>>>>>>

> >>>>>>

> >>>>>> As it turns out, we don't actually need static calls for this.

> >>>>>> Instead, we can simply permit weak symbol references to go unresolved

> >>>>>> between modules (as we already do in the kernel itself, due to the

> >>>>>> fact that ELF permits it), and have the accelerated code live in

> >>>>>> separate modules that may not be loadable on certain systems, or be

> >>>>>> blacklisted by the user.

> >>>>>

> >>>>> You're saying that at module insertion time, the kernel will override

> >>>>> weak symbols with those provided by the module itself? At runtime?

> >>>>>

> >>>>

> >>>> Yes.

> >>>>

> >>>>> Do you know offhand how this patching works? Is there a PLT that gets

> >>>>> patched, and so the calls all go through a layer of function pointer

> >>>>> indirection? Or are all call sites fixed up at insertion time and the

> >>>>> call instructions rewritten with some runtime patching magic?

> >>>>>

> >>>>

> >>>> No magic. Take curve25519 for example, when built for ARM:

> >>>>

> >>>> 00000000 <curve25519>:

> >>>>  0:   f240 0300       movw    r3, #0

> >>>>                       0: R_ARM_THM_MOVW_ABS_NC        curve25519_arch

> >>>>  4:   f2c0 0300       movt    r3, #0

> >>>>                       4: R_ARM_THM_MOVT_ABS   curve25519_arch

> >>>>  8:   b570            push    {r4, r5, r6, lr}

> >>>>  a:   4604            mov     r4, r0

> >>>>  c:   460d            mov     r5, r1

> >>>>  e:   4616            mov     r6, r2

> >>>> 10:   b173            cbz     r3, 30 <curve25519+0x30>

> >>>> 12:   f7ff fffe       bl      0 <curve25519_arch>

> >>>>                       12: R_ARM_THM_CALL      curve25519_arch

> >>>> 16:   b158            cbz     r0, 30 <curve25519+0x30>

> >>>> 18:   4620            mov     r0, r4

> >>>> 1a:   2220            movs    r2, #32

> >>>> 1c:   f240 0100       movw    r1, #0

> >>>>                       1c: R_ARM_THM_MOVW_ABS_NC       .LANCHOR0

> >>>> 20:   f2c0 0100       movt    r1, #0

> >>>>                       20: R_ARM_THM_MOVT_ABS  .LANCHOR0

> >>>> 24:   f7ff fffe       bl      0 <__crypto_memneq>

> >>>>                       24: R_ARM_THM_CALL      __crypto_memneq

> >>>> 28:   3000            adds    r0, #0

> >>>> 2a:   bf18            it      ne

> >>>> 2c:   2001            movne   r0, #1

> >>>> 2e:   bd70            pop     {r4, r5, r6, pc}

> >>>> 30:   4632            mov     r2, r6

> >>>> 32:   4629            mov     r1, r5

> >>>> 34:   4620            mov     r0, r4

> >>>> 36:   f7ff fffe       bl      0 <curve25519_generic>

> >>>>                       36: R_ARM_THM_CALL      curve25519_generic

> >>>> 3a:   e7ed            b.n     18 <curve25519+0x18>

> >>>>

> >>>> curve25519_arch is a weak reference. It either gets satisfied at

> >>>> module load time, or it doesn't.

> >>>>

> >>>> If it does get satisfied, the relocations covering the movw/movt pair

> >>>> and the one covering the bl instruction get updated so that they point

> >>>> to the arch routine.

> >>>>

> >>>> If it does not get satisfied, the relocations are disregarded, in

> >>>> which case the cbz instruction at offset 0x10 jumps over the bl call.

> >>>>

> >>>> Note that this does not involve any memory accesses. It does involve

> >>>> some code patching, but only of the kind the module loader already

> >>>> does.

> >>>

> >>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress?

> >>>

> >>

> >> Indeed, the arch module needs to be loaded first

> >>

> >

> > Actually, this can be addressed by retaining the module dependencies

> > as before, but permitting the arch module to be omitted at load time.

>

> I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

>


Most arch code depends on CPU features that may not be available given
the context, either because they are SIMD or because they are optional
CPU instructions. So we need both modules at the same time anyway, so
that we can fall back to the generic code at runtime.

So what I'd like is to have the generic module provide the library
interface, but rely on arch modules that are optional.

We already have 95% of what we need with weak references. We have the
ability to test for presence of the arch code at runtime, and we even
have code patching for all architectures (through static relocations).

However, one could argue that this is more a [space] optimization than
anything else, so I am willing to park this discussion until support
for static calls has been merged, and proceed with something simpler.
Andy Lutomirski Oct. 7, 2019, 3:01 p.m. UTC | #14
On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>

> On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote:

> >


> > > Actually, this can be addressed by retaining the module dependencies

> > > as before, but permitting the arch module to be omitted at load time.

> >

> > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

> >

>

> Most arch code depends on CPU features that may not be available given

> the context, either because they are SIMD or because they are optional

> CPU instructions. So we need both modules at the same time anyway, so

> that we can fall back to the generic code at runtime.

>

> So what I'd like is to have the generic module provide the library

> interface, but rely on arch modules that are optional.

>

> We already have 95% of what we need with weak references. We have the

> ability to test for presence of the arch code at runtime, and we even

> have code patching for all architectures (through static relocations).

>

> However, one could argue that this is more a [space] optimization than

> anything else, so I am willing to park this discussion until support

> for static calls has been merged, and proceed with something simpler.


I'd suggest tabling it until static calls are merged.  PeterZ just
sent a new patchbomb for it anyway.

What I'm trying to get at here and apparently saying badly is that I
want to avoid a situation where lsmod shows the arch module loaded but
the arch code isn't actually executing.  Regardless of how everything
gets wired up (static calls, weak refs, etc), the system's behavior
should match the system's configuration, which means that we should
not allow any improper order of loading things so that everything
*appears* to be loaded but does not actually function.

Saying "modprobe will do the right thing and let's not worry about
silly admins using insmod directly" is not a good solution.
Ard Biesheuvel Oct. 7, 2019, 3:12 p.m. UTC | #15
On Mon, 7 Oct 2019 at 17:02, Andy Lutomirski <luto@amacapital.net> wrote:
>

> On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel

> <ard.biesheuvel@linaro.org> wrote:

> >

> > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote:

> > >

>

> > > > Actually, this can be addressed by retaining the module dependencies

> > > > as before, but permitting the arch module to be omitted at load time.

> > >

> > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

> > >

> >

> > Most arch code depends on CPU features that may not be available given

> > the context, either because they are SIMD or because they are optional

> > CPU instructions. So we need both modules at the same time anyway, so

> > that we can fall back to the generic code at runtime.

> >

> > So what I'd like is to have the generic module provide the library

> > interface, but rely on arch modules that are optional.

> >

> > We already have 95% of what we need with weak references. We have the

> > ability to test for presence of the arch code at runtime, and we even

> > have code patching for all architectures (through static relocations).

> >

> > However, one could argue that this is more a [space] optimization than

> > anything else, so I am willing to park this discussion until support

> > for static calls has been merged, and proceed with something simpler.

>

> I'd suggest tabling it until static calls are merged.  PeterZ just

> sent a new patchbomb for it anyway.

>


As it turns out, static calls are a poor fit for this. Imagine an interface like

poly1305_init(state)
poly1305_update(state, input)
poly1305_fina(state, digest)

which can be implemented by different libraries. The problem is that
state is opaque, and so it is generally not guaranteed that a sequence
that was started using one implementation can be completed using
another one.

Since the whole point is having a simple library interface,
complicating this with RCU hooks or other crazy plumbing to ensure
that no calls are in progress when you switch one out for another one,
I don't think static calls are suitable for this use case.

> What I'm trying to get at here and apparently saying badly is that I

> want to avoid a situation where lsmod shows the arch module loaded but

> the arch code isn't actually executing.


My goal here is to allow the generic library to be loaded with or
without the arch code, with the arch code always being used when it is
loaded. This is what I implemented using weak references, but it
requires a tweak in the module loader (two lines but not pretty).
Using weak references preserves the dependency order, since the
generic module will depend on the arch module (and up the refcount) it
any of its weak references were fulfilled by the arch module in
question. Using static calls will invert the dependency relation,
since the arch code will need to perform a static_call_update() to
make [users of] the generic library point to its code. How this works
with managing the module refcounts and unload order is an open
question afaict.

> Regardless of how everything

> gets wired up (static calls, weak refs, etc), the system's behavior

> should match the system's configuration, which means that we should

> not allow any improper order of loading things so that everything

> *appears* to be loaded but does not actually function.

>


Yes. that is the whole point.

> Saying "modprobe will do the right thing and let's not worry about

> silly admins using insmod directly" is not a good solution.


Agreed.

I have disregarded static calls and weak references for the time
being, and I will proceed with an implementation that uses neither.
The downside of this is that, e.g., we are forced to load the NEON
module on non-NEON capable hardware without calling any of its code,
but this is basically the Zinc situation only with the NEON and the
generic code living in different modules.
Andy Lutomirski Oct. 7, 2019, 4:05 p.m. UTC | #16
On Mon, Oct 7, 2019 at 8:12 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>

> On Mon, 7 Oct 2019 at 17:02, Andy Lutomirski <luto@amacapital.net> wrote:

> >

> > On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel

> > <ard.biesheuvel@linaro.org> wrote:

> > >

> > > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote:

> > > >

> >

> > > > > Actually, this can be addressed by retaining the module dependencies

> > > > > as before, but permitting the arch module to be omitted at load time.

> > > >

> > > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too.

> > > >

> > >

> > > Most arch code depends on CPU features that may not be available given

> > > the context, either because they are SIMD or because they are optional

> > > CPU instructions. So we need both modules at the same time anyway, so

> > > that we can fall back to the generic code at runtime.

> > >

> > > So what I'd like is to have the generic module provide the library

> > > interface, but rely on arch modules that are optional.

> > >

> > > We already have 95% of what we need with weak references. We have the

> > > ability to test for presence of the arch code at runtime, and we even

> > > have code patching for all architectures (through static relocations).

> > >

> > > However, one could argue that this is more a [space] optimization than

> > > anything else, so I am willing to park this discussion until support

> > > for static calls has been merged, and proceed with something simpler.

> >

> > I'd suggest tabling it until static calls are merged.  PeterZ just

> > sent a new patchbomb for it anyway.

> >

>

> As it turns out, static calls are a poor fit for this. Imagine an interface like

>

> poly1305_init(state)

> poly1305_update(state, input)

> poly1305_fina(state, digest)

>

> which can be implemented by different libraries. The problem is that

> state is opaque, and so it is generally not guaranteed that a sequence

> that was started using one implementation can be completed using

> another one.

>

> Since the whole point is having a simple library interface,

> complicating this with RCU hooks or other crazy plumbing to ensure

> that no calls are in progress when you switch one out for another one,

> I don't think static calls are suitable for this use case.

>

> > What I'm trying to get at here and apparently saying badly is that I

> > want to avoid a situation where lsmod shows the arch module loaded but

> > the arch code isn't actually executing.

>

> My goal here is to allow the generic library to be loaded with or

> without the arch code, with the arch code always being used when it is

> loaded. This is what I implemented using weak references, but it

> requires a tweak in the module loader (two lines but not pretty).

> Using weak references preserves the dependency order, since the

> generic module will depend on the arch module (and up the refcount) it

> any of its weak references were fulfilled by the arch module in

> question. Using static calls will invert the dependency relation,

> since the arch code will need to perform a static_call_update() to

> make [users of] the generic library point to its code. How this works

> with managing the module refcounts and unload order is an open

> question afaict.


Indeed.  Dealing with unloading when static calls are in use may be messy.

>

> > Regardless of how everything

> > gets wired up (static calls, weak refs, etc), the system's behavior

> > should match the system's configuration, which means that we should

> > not allow any improper order of loading things so that everything

> > *appears* to be loaded but does not actually function.

> >

>

> Yes. that is the whole point.

>

> > Saying "modprobe will do the right thing and let's not worry about

> > silly admins using insmod directly" is not a good solution.

>

> Agreed.

>

> I have disregarded static calls and weak references for the time

> being, and I will proceed with an implementation that uses neither.

> The downside of this is that, e.g., we are forced to load the NEON

> module on non-NEON capable hardware without calling any of its code,

> but this is basically the Zinc situation only with the NEON and the

> generic code living in different modules.


Makes sense.  It can always be improved after the initial
implementation is merged.