Message ID | 20191002141713.31189-1-ard.biesheuvel@linaro.org |
---|---|
Headers | show |
Series | crypto: crypto API library interfaces for WireGuard | expand |
On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > ... > > In the future, I would like to extend these interfaces to use static calls, > so that the accelerated implementations can be [un]plugged at runtime. For > the time being, we rely on weak aliases and conditional exports so that the > users of the library interfaces link directly to the accelerated versions, > but without the ability to unplug them. > As it turns out, we don't actually need static calls for this. Instead, we can simply permit weak symbol references to go unresolved between modules (as we already do in the kernel itself, due to the fact that ELF permits it), and have the accelerated code live in separate modules that may not be loadable on certain systems, or be blacklisted by the user.
Hi Ard, On Wed, Oct 02, 2019 at 04:16:53PM +0200, Ard Biesheuvel wrote: > This is a followup to RFC 'crypto: wireguard with crypto API library interface' > [0]. Since no objections were raised to my approach, I've proceeded to fix up > some minor issues, and incorporate [most of] the missing MIPS code. > > Changes since RFC/v1: > - dropped the WireGuard patch itself, and the followup patches - since the > purpose was to illustrate the extent of the required changes, there is no > reason to keep including them. > - import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the > crypto API and library interfaces so that not only WireGuard but also IPsec > and Adiantum can benefit immediately. (The latter required adding support for > the reduced round version of ChaCha to the MIPS asm code) > - fix up various minor kconfig/build issues found in randconfig testing > (thanks Arnd!) Thanks for working on this. By wiring up the accelerated primitives, you're essentially implementing Zinc, and I expect that by the end of this, we'll wind up with something pretty close to where I started, with the critical difference that the directory names are different. Despite my initial email about WireGuard moving to the crypto API, it sounds like in the end it is likely to stay with Zinc, just without that name. Jason
On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > > ... > > > > In the future, I would like to extend these interfaces to use static calls, > > so that the accelerated implementations can be [un]plugged at runtime. For > > the time being, we rely on weak aliases and conditional exports so that the > > users of the library interfaces link directly to the accelerated versions, > > but without the ability to unplug them. > > > > As it turns out, we don't actually need static calls for this. > Instead, we can simply permit weak symbol references to go unresolved > between modules (as we already do in the kernel itself, due to the > fact that ELF permits it), and have the accelerated code live in > separate modules that may not be loadable on certain systems, or be > blacklisted by the user. You're saying that at module insertion time, the kernel will override weak symbols with those provided by the module itself? At runtime? Do you know offhand how this patching works? Is there a PLT that gets patched, and so the calls all go through a layer of function pointer indirection? Or are all call sites fixed up at insertion time and the call instructions rewritten with some runtime patching magic? Unless the method is the latter, I would assume that static calls are much faster in general? Or the approach already in this series is perhaps fine enough, and we don't need to break this apart into individual modules complicating everything?
On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > > On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > > > > ... > > > > > > In the future, I would like to extend these interfaces to use static calls, > > > so that the accelerated implementations can be [un]plugged at runtime. For > > > the time being, we rely on weak aliases and conditional exports so that the > > > users of the library interfaces link directly to the accelerated versions, > > > but without the ability to unplug them. > > > > > > > As it turns out, we don't actually need static calls for this. > > Instead, we can simply permit weak symbol references to go unresolved > > between modules (as we already do in the kernel itself, due to the > > fact that ELF permits it), and have the accelerated code live in > > separate modules that may not be loadable on certain systems, or be > > blacklisted by the user. > > You're saying that at module insertion time, the kernel will override > weak symbols with those provided by the module itself? At runtime? > Yes. > Do you know offhand how this patching works? Is there a PLT that gets > patched, and so the calls all go through a layer of function pointer > indirection? Or are all call sites fixed up at insertion time and the > call instructions rewritten with some runtime patching magic? > No magic. Take curve25519 for example, when built for ARM: 00000000 <curve25519>: 0: f240 0300 movw r3, #0 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch 4: f2c0 0300 movt r3, #0 4: R_ARM_THM_MOVT_ABS curve25519_arch 8: b570 push {r4, r5, r6, lr} a: 4604 mov r4, r0 c: 460d mov r5, r1 e: 4616 mov r6, r2 10: b173 cbz r3, 30 <curve25519+0x30> 12: f7ff fffe bl 0 <curve25519_arch> 12: R_ARM_THM_CALL curve25519_arch 16: b158 cbz r0, 30 <curve25519+0x30> 18: 4620 mov r0, r4 1a: 2220 movs r2, #32 1c: f240 0100 movw r1, #0 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 20: f2c0 0100 movt r1, #0 20: R_ARM_THM_MOVT_ABS .LANCHOR0 24: f7ff fffe bl 0 <__crypto_memneq> 24: R_ARM_THM_CALL __crypto_memneq 28: 3000 adds r0, #0 2a: bf18 it ne 2c: 2001 movne r0, #1 2e: bd70 pop {r4, r5, r6, pc} 30: 4632 mov r2, r6 32: 4629 mov r1, r5 34: 4620 mov r0, r4 36: f7ff fffe bl 0 <curve25519_generic> 36: R_ARM_THM_CALL curve25519_generic 3a: e7ed b.n 18 <curve25519+0x18> curve25519_arch is a weak reference. It either gets satisfied at module load time, or it doesn't. If it does get satisfied, the relocations covering the movw/movt pair and the one covering the bl instruction get updated so that they point to the arch routine. If it does not get satisfied, the relocations are disregarded, in which case the cbz instruction at offset 0x10 jumps over the bl call. Note that this does not involve any memory accesses. It does involve some code patching, but only of the kind the module loader already does. > Unless the method is the latter, I would assume that static calls are > much faster in general? Or the approach already in this series is > perhaps fine enough, and we don't need to break this apart into > individual modules complicating everything? The big advantage this approach has over Zinc is that the accelerated code does not have to stay resident in memory if the cpu is incapable of executing it.
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > This is a followup to RFC 'crypto: wireguard with crypto API library interface' > [0]. Since no objections were raised to my approach, I've proceeded to fix up > some minor issues, and incorporate [most of] the missing MIPS code. Could you get the mips64 code into the next version? WireGuard is widely used on the Ubiquiti EdgeRouter, which runs on the Octeon, a 64-bit mips chips.
> On Oct 4, 2019, at 6:42 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>> >> ... >>> >>> In the future, I would like to extend these interfaces to use static calls, >>> so that the accelerated implementations can be [un]plugged at runtime. For >>> the time being, we rely on weak aliases and conditional exports so that the >>> users of the library interfaces link directly to the accelerated versions, >>> but without the ability to unplug them. >>> >> >> As it turns out, we don't actually need static calls for this. >> Instead, we can simply permit weak symbol references to go unresolved >> between modules (as we already do in the kernel itself, due to the >> fact that ELF permits it), and have the accelerated code live in >> separate modules that may not be loadable on certain systems, or be >> blacklisted by the user. > > You're saying that at module insertion time, the kernel will override > weak symbols with those provided by the module itself? At runtime? > > Do you know offhand how this patching works? Is there a PLT that gets > patched, and so the calls all go through a layer of function pointer > indirection? Or are all call sites fixed up at insertion time and the > call instructions rewritten with some runtime patching magic? > > Unless the method is the latter, I would assume that static calls are > much faster in general? Or the approach already in this series is > perhaps fine enough, and we don't need to break this apart into > individual modules complicating everything? I admit I’m a bit mystified too. I think it would be great to have a feature where a special type of static call could be redirected by the module loader when a module that exports a corresponding symbol is loaded. Unloading such a module would be interesting. Ard, what exactly are you imagining?
> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: >> >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>> >>> ... >>>> >>>> In the future, I would like to extend these interfaces to use static calls, >>>> so that the accelerated implementations can be [un]plugged at runtime. For >>>> the time being, we rely on weak aliases and conditional exports so that the >>>> users of the library interfaces link directly to the accelerated versions, >>>> but without the ability to unplug them. >>>> >>> >>> As it turns out, we don't actually need static calls for this. >>> Instead, we can simply permit weak symbol references to go unresolved >>> between modules (as we already do in the kernel itself, due to the >>> fact that ELF permits it), and have the accelerated code live in >>> separate modules that may not be loadable on certain systems, or be >>> blacklisted by the user. >> >> You're saying that at module insertion time, the kernel will override >> weak symbols with those provided by the module itself? At runtime? >> > > Yes. > >> Do you know offhand how this patching works? Is there a PLT that gets >> patched, and so the calls all go through a layer of function pointer >> indirection? Or are all call sites fixed up at insertion time and the >> call instructions rewritten with some runtime patching magic? >> > > No magic. Take curve25519 for example, when built for ARM: > > 00000000 <curve25519>: > 0: f240 0300 movw r3, #0 > 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch > 4: f2c0 0300 movt r3, #0 > 4: R_ARM_THM_MOVT_ABS curve25519_arch > 8: b570 push {r4, r5, r6, lr} > a: 4604 mov r4, r0 > c: 460d mov r5, r1 > e: 4616 mov r6, r2 > 10: b173 cbz r3, 30 <curve25519+0x30> > 12: f7ff fffe bl 0 <curve25519_arch> > 12: R_ARM_THM_CALL curve25519_arch > 16: b158 cbz r0, 30 <curve25519+0x30> > 18: 4620 mov r0, r4 > 1a: 2220 movs r2, #32 > 1c: f240 0100 movw r1, #0 > 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 > 20: f2c0 0100 movt r1, #0 > 20: R_ARM_THM_MOVT_ABS .LANCHOR0 > 24: f7ff fffe bl 0 <__crypto_memneq> > 24: R_ARM_THM_CALL __crypto_memneq > 28: 3000 adds r0, #0 > 2a: bf18 it ne > 2c: 2001 movne r0, #1 > 2e: bd70 pop {r4, r5, r6, pc} > 30: 4632 mov r2, r6 > 32: 4629 mov r1, r5 > 34: 4620 mov r0, r4 > 36: f7ff fffe bl 0 <curve25519_generic> > 36: R_ARM_THM_CALL curve25519_generic > 3a: e7ed b.n 18 <curve25519+0x18> > > curve25519_arch is a weak reference. It either gets satisfied at > module load time, or it doesn't. > > If it does get satisfied, the relocations covering the movw/movt pair > and the one covering the bl instruction get updated so that they point > to the arch routine. > > If it does not get satisfied, the relocations are disregarded, in > which case the cbz instruction at offset 0x10 jumps over the bl call. > > Note that this does not involve any memory accesses. It does involve > some code patching, but only of the kind the module loader already > does. Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.
On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski <luto@amacapital.net> wrote:
> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls.
Static calls perform well and are well understood. This would be my preference.
On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > > > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > >> > >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > >>>> > >>> ... > >>>> > >>>> In the future, I would like to extend these interfaces to use static calls, > >>>> so that the accelerated implementations can be [un]plugged at runtime. For > >>>> the time being, we rely on weak aliases and conditional exports so that the > >>>> users of the library interfaces link directly to the accelerated versions, > >>>> but without the ability to unplug them. > >>>> > >>> > >>> As it turns out, we don't actually need static calls for this. > >>> Instead, we can simply permit weak symbol references to go unresolved > >>> between modules (as we already do in the kernel itself, due to the > >>> fact that ELF permits it), and have the accelerated code live in > >>> separate modules that may not be loadable on certain systems, or be > >>> blacklisted by the user. > >> > >> You're saying that at module insertion time, the kernel will override > >> weak symbols with those provided by the module itself? At runtime? > >> > > > > Yes. > > > >> Do you know offhand how this patching works? Is there a PLT that gets > >> patched, and so the calls all go through a layer of function pointer > >> indirection? Or are all call sites fixed up at insertion time and the > >> call instructions rewritten with some runtime patching magic? > >> > > > > No magic. Take curve25519 for example, when built for ARM: > > > > 00000000 <curve25519>: > > 0: f240 0300 movw r3, #0 > > 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch > > 4: f2c0 0300 movt r3, #0 > > 4: R_ARM_THM_MOVT_ABS curve25519_arch > > 8: b570 push {r4, r5, r6, lr} > > a: 4604 mov r4, r0 > > c: 460d mov r5, r1 > > e: 4616 mov r6, r2 > > 10: b173 cbz r3, 30 <curve25519+0x30> > > 12: f7ff fffe bl 0 <curve25519_arch> > > 12: R_ARM_THM_CALL curve25519_arch > > 16: b158 cbz r0, 30 <curve25519+0x30> > > 18: 4620 mov r0, r4 > > 1a: 2220 movs r2, #32 > > 1c: f240 0100 movw r1, #0 > > 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 > > 20: f2c0 0100 movt r1, #0 > > 20: R_ARM_THM_MOVT_ABS .LANCHOR0 > > 24: f7ff fffe bl 0 <__crypto_memneq> > > 24: R_ARM_THM_CALL __crypto_memneq > > 28: 3000 adds r0, #0 > > 2a: bf18 it ne > > 2c: 2001 movne r0, #1 > > 2e: bd70 pop {r4, r5, r6, pc} > > 30: 4632 mov r2, r6 > > 32: 4629 mov r1, r5 > > 34: 4620 mov r0, r4 > > 36: f7ff fffe bl 0 <curve25519_generic> > > 36: R_ARM_THM_CALL curve25519_generic > > 3a: e7ed b.n 18 <curve25519+0x18> > > > > curve25519_arch is a weak reference. It either gets satisfied at > > module load time, or it doesn't. > > > > If it does get satisfied, the relocations covering the movw/movt pair > > and the one covering the bl instruction get updated so that they point > > to the arch routine. > > > > If it does not get satisfied, the relocations are disregarded, in > > which case the cbz instruction at offset 0x10 jumps over the bl call. > > > > Note that this does not involve any memory accesses. It does involve > > some code patching, but only of the kind the module loader already > > does. > > Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? > Indeed, the arch module needs to be loaded first > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. That is what I am doing for chacha and poly > Or use static calls.
On Fri, 4 Oct 2019 at 16:55, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski <luto@amacapital.net> wrote: > > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. Or use static calls. > > Static calls perform well and are well understood. This would be my preference. How so? No code exists yet in mainline, and the x86 code is still heavily being discussed, requires voodoo code patching and tooling changes. Implementations for other architectures have not even been proposed yet, with the exception of the code I wrote for arm64.
On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > > > > On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > > > > > On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > >> > > >>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > > >>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > >>>> > > >>> ... > > >>>> > > >>>> In the future, I would like to extend these interfaces to use static calls, > > >>>> so that the accelerated implementations can be [un]plugged at runtime. For > > >>>> the time being, we rely on weak aliases and conditional exports so that the > > >>>> users of the library interfaces link directly to the accelerated versions, > > >>>> but without the ability to unplug them. > > >>>> > > >>> > > >>> As it turns out, we don't actually need static calls for this. > > >>> Instead, we can simply permit weak symbol references to go unresolved > > >>> between modules (as we already do in the kernel itself, due to the > > >>> fact that ELF permits it), and have the accelerated code live in > > >>> separate modules that may not be loadable on certain systems, or be > > >>> blacklisted by the user. > > >> > > >> You're saying that at module insertion time, the kernel will override > > >> weak symbols with those provided by the module itself? At runtime? > > >> > > > > > > Yes. > > > > > >> Do you know offhand how this patching works? Is there a PLT that gets > > >> patched, and so the calls all go through a layer of function pointer > > >> indirection? Or are all call sites fixed up at insertion time and the > > >> call instructions rewritten with some runtime patching magic? > > >> > > > > > > No magic. Take curve25519 for example, when built for ARM: > > > > > > 00000000 <curve25519>: > > > 0: f240 0300 movw r3, #0 > > > 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch > > > 4: f2c0 0300 movt r3, #0 > > > 4: R_ARM_THM_MOVT_ABS curve25519_arch > > > 8: b570 push {r4, r5, r6, lr} > > > a: 4604 mov r4, r0 > > > c: 460d mov r5, r1 > > > e: 4616 mov r6, r2 > > > 10: b173 cbz r3, 30 <curve25519+0x30> > > > 12: f7ff fffe bl 0 <curve25519_arch> > > > 12: R_ARM_THM_CALL curve25519_arch > > > 16: b158 cbz r0, 30 <curve25519+0x30> > > > 18: 4620 mov r0, r4 > > > 1a: 2220 movs r2, #32 > > > 1c: f240 0100 movw r1, #0 > > > 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 > > > 20: f2c0 0100 movt r1, #0 > > > 20: R_ARM_THM_MOVT_ABS .LANCHOR0 > > > 24: f7ff fffe bl 0 <__crypto_memneq> > > > 24: R_ARM_THM_CALL __crypto_memneq > > > 28: 3000 adds r0, #0 > > > 2a: bf18 it ne > > > 2c: 2001 movne r0, #1 > > > 2e: bd70 pop {r4, r5, r6, pc} > > > 30: 4632 mov r2, r6 > > > 32: 4629 mov r1, r5 > > > 34: 4620 mov r0, r4 > > > 36: f7ff fffe bl 0 <curve25519_generic> > > > 36: R_ARM_THM_CALL curve25519_generic > > > 3a: e7ed b.n 18 <curve25519+0x18> > > > > > > curve25519_arch is a weak reference. It either gets satisfied at > > > module load time, or it doesn't. > > > > > > If it does get satisfied, the relocations covering the movw/movt pair > > > and the one covering the bl instruction get updated so that they point > > > to the arch routine. > > > > > > If it does not get satisfied, the relocations are disregarded, in > > > which case the cbz instruction at offset 0x10 jumps over the bl call. > > > > > > Note that this does not involve any memory accesses. It does involve > > > some code patching, but only of the kind the module loader already > > > does. > > > > Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? > > > > Indeed, the arch module needs to be loaded first > Actually, this can be addressed by retaining the module dependencies as before, but permitting the arch module to be omitted at load time. > > I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. > > That is what I am doing for chacha and poly > > > Or use static calls. Given that static calls don't actually exist yet, I propose to proceed with the approach above, and switch to static calls once all architectures where it matters have an implementation that does not use function pointers (which is how static calls will be implemented generically)
> On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >> >>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote: >>> >>> >>> >>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>> >>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: >>>>> >>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: >>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: >>>>>>> >>>>>> ... >>>>>>> >>>>>>> In the future, I would like to extend these interfaces to use static calls, >>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For >>>>>>> the time being, we rely on weak aliases and conditional exports so that the >>>>>>> users of the library interfaces link directly to the accelerated versions, >>>>>>> but without the ability to unplug them. >>>>>>> >>>>>> >>>>>> As it turns out, we don't actually need static calls for this. >>>>>> Instead, we can simply permit weak symbol references to go unresolved >>>>>> between modules (as we already do in the kernel itself, due to the >>>>>> fact that ELF permits it), and have the accelerated code live in >>>>>> separate modules that may not be loadable on certain systems, or be >>>>>> blacklisted by the user. >>>>> >>>>> You're saying that at module insertion time, the kernel will override >>>>> weak symbols with those provided by the module itself? At runtime? >>>>> >>>> >>>> Yes. >>>> >>>>> Do you know offhand how this patching works? Is there a PLT that gets >>>>> patched, and so the calls all go through a layer of function pointer >>>>> indirection? Or are all call sites fixed up at insertion time and the >>>>> call instructions rewritten with some runtime patching magic? >>>>> >>>> >>>> No magic. Take curve25519 for example, when built for ARM: >>>> >>>> 00000000 <curve25519>: >>>> 0: f240 0300 movw r3, #0 >>>> 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch >>>> 4: f2c0 0300 movt r3, #0 >>>> 4: R_ARM_THM_MOVT_ABS curve25519_arch >>>> 8: b570 push {r4, r5, r6, lr} >>>> a: 4604 mov r4, r0 >>>> c: 460d mov r5, r1 >>>> e: 4616 mov r6, r2 >>>> 10: b173 cbz r3, 30 <curve25519+0x30> >>>> 12: f7ff fffe bl 0 <curve25519_arch> >>>> 12: R_ARM_THM_CALL curve25519_arch >>>> 16: b158 cbz r0, 30 <curve25519+0x30> >>>> 18: 4620 mov r0, r4 >>>> 1a: 2220 movs r2, #32 >>>> 1c: f240 0100 movw r1, #0 >>>> 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 >>>> 20: f2c0 0100 movt r1, #0 >>>> 20: R_ARM_THM_MOVT_ABS .LANCHOR0 >>>> 24: f7ff fffe bl 0 <__crypto_memneq> >>>> 24: R_ARM_THM_CALL __crypto_memneq >>>> 28: 3000 adds r0, #0 >>>> 2a: bf18 it ne >>>> 2c: 2001 movne r0, #1 >>>> 2e: bd70 pop {r4, r5, r6, pc} >>>> 30: 4632 mov r2, r6 >>>> 32: 4629 mov r1, r5 >>>> 34: 4620 mov r0, r4 >>>> 36: f7ff fffe bl 0 <curve25519_generic> >>>> 36: R_ARM_THM_CALL curve25519_generic >>>> 3a: e7ed b.n 18 <curve25519+0x18> >>>> >>>> curve25519_arch is a weak reference. It either gets satisfied at >>>> module load time, or it doesn't. >>>> >>>> If it does get satisfied, the relocations covering the movw/movt pair >>>> and the one covering the bl instruction get updated so that they point >>>> to the arch routine. >>>> >>>> If it does not get satisfied, the relocations are disregarded, in >>>> which case the cbz instruction at offset 0x10 jumps over the bl call. >>>> >>>> Note that this does not involve any memory accesses. It does involve >>>> some code patching, but only of the kind the module loader already >>>> does. >>> >>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? >>> >> >> Indeed, the arch module needs to be loaded first >> > > Actually, this can be addressed by retaining the module dependencies > as before, but permitting the arch module to be omitted at load time. I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > >>> I think it might be better to allow two different modules to export the same symbol but only allow one of them to be loaded. >> >> That is what I am doing for chacha and poly >> >>> Or use static calls. > > Given that static calls don't actually exist yet, I propose to proceed > with the approach above, and switch to static calls once all > architectures where it matters have an implementation that does not > use function pointers (which is how static calls will be implemented > generically)
On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > On Oct 5, 2019, at 12:24 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > > > On Fri, 4 Oct 2019 at 16:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > >> > >>> On Fri, 4 Oct 2019 at 16:53, Andy Lutomirski <luto@amacapital.net> wrote: > >>> > >>> > >>> > >>>> On Oct 4, 2019, at 6:52 AM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > >>>> > >>>> On Fri, 4 Oct 2019 at 15:42, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > >>>>> > >>>>>> On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > >>>>>> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > >>>>>>> > >>>>>> ... > >>>>>>> > >>>>>>> In the future, I would like to extend these interfaces to use static calls, > >>>>>>> so that the accelerated implementations can be [un]plugged at runtime. For > >>>>>>> the time being, we rely on weak aliases and conditional exports so that the > >>>>>>> users of the library interfaces link directly to the accelerated versions, > >>>>>>> but without the ability to unplug them. > >>>>>>> > >>>>>> > >>>>>> As it turns out, we don't actually need static calls for this. > >>>>>> Instead, we can simply permit weak symbol references to go unresolved > >>>>>> between modules (as we already do in the kernel itself, due to the > >>>>>> fact that ELF permits it), and have the accelerated code live in > >>>>>> separate modules that may not be loadable on certain systems, or be > >>>>>> blacklisted by the user. > >>>>> > >>>>> You're saying that at module insertion time, the kernel will override > >>>>> weak symbols with those provided by the module itself? At runtime? > >>>>> > >>>> > >>>> Yes. > >>>> > >>>>> Do you know offhand how this patching works? Is there a PLT that gets > >>>>> patched, and so the calls all go through a layer of function pointer > >>>>> indirection? Or are all call sites fixed up at insertion time and the > >>>>> call instructions rewritten with some runtime patching magic? > >>>>> > >>>> > >>>> No magic. Take curve25519 for example, when built for ARM: > >>>> > >>>> 00000000 <curve25519>: > >>>> 0: f240 0300 movw r3, #0 > >>>> 0: R_ARM_THM_MOVW_ABS_NC curve25519_arch > >>>> 4: f2c0 0300 movt r3, #0 > >>>> 4: R_ARM_THM_MOVT_ABS curve25519_arch > >>>> 8: b570 push {r4, r5, r6, lr} > >>>> a: 4604 mov r4, r0 > >>>> c: 460d mov r5, r1 > >>>> e: 4616 mov r6, r2 > >>>> 10: b173 cbz r3, 30 <curve25519+0x30> > >>>> 12: f7ff fffe bl 0 <curve25519_arch> > >>>> 12: R_ARM_THM_CALL curve25519_arch > >>>> 16: b158 cbz r0, 30 <curve25519+0x30> > >>>> 18: 4620 mov r0, r4 > >>>> 1a: 2220 movs r2, #32 > >>>> 1c: f240 0100 movw r1, #0 > >>>> 1c: R_ARM_THM_MOVW_ABS_NC .LANCHOR0 > >>>> 20: f2c0 0100 movt r1, #0 > >>>> 20: R_ARM_THM_MOVT_ABS .LANCHOR0 > >>>> 24: f7ff fffe bl 0 <__crypto_memneq> > >>>> 24: R_ARM_THM_CALL __crypto_memneq > >>>> 28: 3000 adds r0, #0 > >>>> 2a: bf18 it ne > >>>> 2c: 2001 movne r0, #1 > >>>> 2e: bd70 pop {r4, r5, r6, pc} > >>>> 30: 4632 mov r2, r6 > >>>> 32: 4629 mov r1, r5 > >>>> 34: 4620 mov r0, r4 > >>>> 36: f7ff fffe bl 0 <curve25519_generic> > >>>> 36: R_ARM_THM_CALL curve25519_generic > >>>> 3a: e7ed b.n 18 <curve25519+0x18> > >>>> > >>>> curve25519_arch is a weak reference. It either gets satisfied at > >>>> module load time, or it doesn't. > >>>> > >>>> If it does get satisfied, the relocations covering the movw/movt pair > >>>> and the one covering the bl instruction get updated so that they point > >>>> to the arch routine. > >>>> > >>>> If it does not get satisfied, the relocations are disregarded, in > >>>> which case the cbz instruction at offset 0x10 jumps over the bl call. > >>>> > >>>> Note that this does not involve any memory accesses. It does involve > >>>> some code patching, but only of the kind the module loader already > >>>> does. > >>> > >>> Won’t this have the counterintuitive property that, if you load the modules in the opposite order, the reference won’t be re-resolved and performance will silently regress? > >>> > >> > >> Indeed, the arch module needs to be loaded first > >> > > > > Actually, this can be addressed by retaining the module dependencies > > as before, but permitting the arch module to be omitted at load time. > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > Most arch code depends on CPU features that may not be available given the context, either because they are SIMD or because they are optional CPU instructions. So we need both modules at the same time anyway, so that we can fall back to the generic code at runtime. So what I'd like is to have the generic module provide the library interface, but rely on arch modules that are optional. We already have 95% of what we need with weak references. We have the ability to test for presence of the arch code at runtime, and we even have code patching for all architectures (through static relocations). However, one could argue that this is more a [space] optimization than anything else, so I am willing to park this discussion until support for static calls has been merged, and proceed with something simpler.
On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > Actually, this can be addressed by retaining the module dependencies > > > as before, but permitting the arch module to be omitted at load time. > > > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > > > > Most arch code depends on CPU features that may not be available given > the context, either because they are SIMD or because they are optional > CPU instructions. So we need both modules at the same time anyway, so > that we can fall back to the generic code at runtime. > > So what I'd like is to have the generic module provide the library > interface, but rely on arch modules that are optional. > > We already have 95% of what we need with weak references. We have the > ability to test for presence of the arch code at runtime, and we even > have code patching for all architectures (through static relocations). > > However, one could argue that this is more a [space] optimization than > anything else, so I am willing to park this discussion until support > for static calls has been merged, and proceed with something simpler. I'd suggest tabling it until static calls are merged. PeterZ just sent a new patchbomb for it anyway. What I'm trying to get at here and apparently saying badly is that I want to avoid a situation where lsmod shows the arch module loaded but the arch code isn't actually executing. Regardless of how everything gets wired up (static calls, weak refs, etc), the system's behavior should match the system's configuration, which means that we should not allow any improper order of loading things so that everything *appears* to be loaded but does not actually function. Saying "modprobe will do the right thing and let's not worry about silly admins using insmod directly" is not a good solution.
On Mon, 7 Oct 2019 at 17:02, Andy Lutomirski <luto@amacapital.net> wrote: > > On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel > <ard.biesheuvel@linaro.org> wrote: > > > > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > > > Actually, this can be addressed by retaining the module dependencies > > > > as before, but permitting the arch module to be omitted at load time. > > > > > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > > > > > > > Most arch code depends on CPU features that may not be available given > > the context, either because they are SIMD or because they are optional > > CPU instructions. So we need both modules at the same time anyway, so > > that we can fall back to the generic code at runtime. > > > > So what I'd like is to have the generic module provide the library > > interface, but rely on arch modules that are optional. > > > > We already have 95% of what we need with weak references. We have the > > ability to test for presence of the arch code at runtime, and we even > > have code patching for all architectures (through static relocations). > > > > However, one could argue that this is more a [space] optimization than > > anything else, so I am willing to park this discussion until support > > for static calls has been merged, and proceed with something simpler. > > I'd suggest tabling it until static calls are merged. PeterZ just > sent a new patchbomb for it anyway. > As it turns out, static calls are a poor fit for this. Imagine an interface like poly1305_init(state) poly1305_update(state, input) poly1305_fina(state, digest) which can be implemented by different libraries. The problem is that state is opaque, and so it is generally not guaranteed that a sequence that was started using one implementation can be completed using another one. Since the whole point is having a simple library interface, complicating this with RCU hooks or other crazy plumbing to ensure that no calls are in progress when you switch one out for another one, I don't think static calls are suitable for this use case. > What I'm trying to get at here and apparently saying badly is that I > want to avoid a situation where lsmod shows the arch module loaded but > the arch code isn't actually executing. My goal here is to allow the generic library to be loaded with or without the arch code, with the arch code always being used when it is loaded. This is what I implemented using weak references, but it requires a tweak in the module loader (two lines but not pretty). Using weak references preserves the dependency order, since the generic module will depend on the arch module (and up the refcount) it any of its weak references were fulfilled by the arch module in question. Using static calls will invert the dependency relation, since the arch code will need to perform a static_call_update() to make [users of] the generic library point to its code. How this works with managing the module refcounts and unload order is an open question afaict. > Regardless of how everything > gets wired up (static calls, weak refs, etc), the system's behavior > should match the system's configuration, which means that we should > not allow any improper order of loading things so that everything > *appears* to be loaded but does not actually function. > Yes. that is the whole point. > Saying "modprobe will do the right thing and let's not worry about > silly admins using insmod directly" is not a good solution. Agreed. I have disregarded static calls and weak references for the time being, and I will proceed with an implementation that uses neither. The downside of this is that, e.g., we are forced to load the NEON module on non-NEON capable hardware without calling any of its code, but this is basically the Zinc situation only with the NEON and the generic code living in different modules.
On Mon, Oct 7, 2019 at 8:12 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On Mon, 7 Oct 2019 at 17:02, Andy Lutomirski <luto@amacapital.net> wrote: > > > > On Sun, Oct 6, 2019 at 10:24 PM Ard Biesheuvel > > <ard.biesheuvel@linaro.org> wrote: > > > > > > On Mon, 7 Oct 2019 at 06:44, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > > > > > > Actually, this can be addressed by retaining the module dependencies > > > > > as before, but permitting the arch module to be omitted at load time. > > > > > > > > I think that, to avoid surprises, you should refuse to load the arch module if the generic module is loaded, too. > > > > > > > > > > Most arch code depends on CPU features that may not be available given > > > the context, either because they are SIMD or because they are optional > > > CPU instructions. So we need both modules at the same time anyway, so > > > that we can fall back to the generic code at runtime. > > > > > > So what I'd like is to have the generic module provide the library > > > interface, but rely on arch modules that are optional. > > > > > > We already have 95% of what we need with weak references. We have the > > > ability to test for presence of the arch code at runtime, and we even > > > have code patching for all architectures (through static relocations). > > > > > > However, one could argue that this is more a [space] optimization than > > > anything else, so I am willing to park this discussion until support > > > for static calls has been merged, and proceed with something simpler. > > > > I'd suggest tabling it until static calls are merged. PeterZ just > > sent a new patchbomb for it anyway. > > > > As it turns out, static calls are a poor fit for this. Imagine an interface like > > poly1305_init(state) > poly1305_update(state, input) > poly1305_fina(state, digest) > > which can be implemented by different libraries. The problem is that > state is opaque, and so it is generally not guaranteed that a sequence > that was started using one implementation can be completed using > another one. > > Since the whole point is having a simple library interface, > complicating this with RCU hooks or other crazy plumbing to ensure > that no calls are in progress when you switch one out for another one, > I don't think static calls are suitable for this use case. > > > What I'm trying to get at here and apparently saying badly is that I > > want to avoid a situation where lsmod shows the arch module loaded but > > the arch code isn't actually executing. > > My goal here is to allow the generic library to be loaded with or > without the arch code, with the arch code always being used when it is > loaded. This is what I implemented using weak references, but it > requires a tweak in the module loader (two lines but not pretty). > Using weak references preserves the dependency order, since the > generic module will depend on the arch module (and up the refcount) it > any of its weak references were fulfilled by the arch module in > question. Using static calls will invert the dependency relation, > since the arch code will need to perform a static_call_update() to > make [users of] the generic library point to its code. How this works > with managing the module refcounts and unload order is an open > question afaict. Indeed. Dealing with unloading when static calls are in use may be messy. > > > Regardless of how everything > > gets wired up (static calls, weak refs, etc), the system's behavior > > should match the system's configuration, which means that we should > > not allow any improper order of loading things so that everything > > *appears* to be loaded but does not actually function. > > > > Yes. that is the whole point. > > > Saying "modprobe will do the right thing and let's not worry about > > silly admins using insmod directly" is not a good solution. > > Agreed. > > I have disregarded static calls and weak references for the time > being, and I will proceed with an implementation that uses neither. > The downside of this is that, e.g., we are forced to load the NEON > module on non-NEON capable hardware without calling any of its code, > but this is basically the Zinc situation only with the NEON and the > generic code living in different modules. Makes sense. It can always be improved after the initial implementation is merged.