Message ID | 20191002141713.31189-3-ard.biesheuvel@linaro.org
---|---
State | New
Series | crypto: crypto API library interfaces for WireGuard
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> interface.

Again, this says _what_, but not _why_.

Yes, changelogs are hard :(

thanks,

greg k-h
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> interface.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
>  crypto/Kconfig                |  1 +
>  include/crypto/chacha.h       |  6 ++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> index bc62daa8dafd..fd9ef42842cf 100644
> --- a/arch/x86/crypto/chacha_glue.c
> +++ b/arch/x86/crypto/chacha_glue.c
> @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
>          }
>  }
>
> +void hchacha_block(const u32 *state, u32 *stream, int nrounds)
> +{
> +        state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
> +
> +        if (!crypto_simd_usable()) {
> +                hchacha_block_generic(state, stream, nrounds);
> +        } else {
> +                kernel_fpu_begin();
> +                hchacha_block_ssse3(state, stream, nrounds);
> +                kernel_fpu_end();
> +        }
> +}
> +EXPORT_SYMBOL(hchacha_block);

Please correct me if I'm wrong:

The approach here is slightly different from Zinc. In Zinc, I had one
entry point that conditionally called into the architecture-specific
implementation, and I did it inline using #includes so that in some
cases it could be optimized out.

Here, you override the original symbol defined by the generic module
from the architecture-specific implementation, and in there you decide
which way to branch.

Your approach has the advantage that you don't need to #include a .c
file like I did, an ugly yet very effective approach.

But it has two disadvantages:

1. For architecture-specific code that _always_ runs, such as the
MIPS32r2 implementation of chacha, the compiler no longer has an
opportunity to remove the generic code entirely from the binary,
which under Zinc resulted in a smaller module.

2. The inliner can't make optimizations for that call.

Disadvantage (2) might not make much of a difference. Disadvantage (1)
seems like a bigger deal. However, perhaps the linker is smart and can
remove the code and symbol? Or if not, is there a way to make the
linker smart? Or would all this require crazy LTO which isn't going to
happen any time soon?
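[Editorial note: to make the contrast described above concrete, here is a
rough sketch of the Zinc-style dispatch, for comparison with the
hchacha_block() wrapper quoted from the patch. The file name
chacha-generic.c and the helpers chacha_use_simd(), chacha_arch_block()
and chacha_generic_block() are hypothetical placeholders, not identifiers
from the patch or the kernel tree.]

/*
 * Zinc-style dispatch (sketch): the generic implementation is #include'd
 * into the arch glue, so the compiler sees both call sites.  On an arch
 * whose accelerated path is unconditional, chacha_use_simd() can collapse
 * to a constant and the generic body can be dropped from the object file.
 */
#include "chacha-generic.c"     /* hypothetical: provides chacha_generic_block() */

void chacha_block(u32 *state, u8 *out, int nrounds)
{
        if (chacha_use_simd())          /* hypothetical capability check */
                chacha_arch_block(state, out, nrounds);
        else
                chacha_generic_block(state, out, nrounds);
}

[The patch instead exports hchacha_block() from the x86 glue itself and
calls the separately exported hchacha_block_generic() as the runtime
fallback; that difference is what the two numbered disadvantages above
refer to.]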
On Fri, 4 Oct 2019 at 15:36, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> > Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> > interface.
> >
> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > ---
> >  arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
> >  crypto/Kconfig                |  1 +
> >  include/crypto/chacha.h       |  6 ++++
> >  3 files changed, 43 insertions(+)
> >
> > diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> > index bc62daa8dafd..fd9ef42842cf 100644
> > --- a/arch/x86/crypto/chacha_glue.c
> > +++ b/arch/x86/crypto/chacha_glue.c
> > @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
> >          }
> >  }
> >
> > +void hchacha_block(const u32 *state, u32 *stream, int nrounds)
> > +{
> > +        state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
> > +
> > +        if (!crypto_simd_usable()) {
> > +                hchacha_block_generic(state, stream, nrounds);
> > +        } else {
> > +                kernel_fpu_begin();
> > +                hchacha_block_ssse3(state, stream, nrounds);
> > +                kernel_fpu_end();
> > +        }
> > +}
> > +EXPORT_SYMBOL(hchacha_block);
>
> Please correct me if I'm wrong:
>
> The approach here is slightly different from Zinc. In Zinc, I had one
> entry point that conditionally called into the architecture-specific
> implementation, and I did it inline using #includes so that in some
> cases it could be optimized out.
>
> Here, you override the original symbol defined by the generic module
> from the architecture-specific implementation, and in there you decide
> which way to branch.
>
> Your approach has the advantage that you don't need to #include a .c
> file like I did, an ugly yet very effective approach.
>
> But it has two disadvantages:
>
> 1. For architecture-specific code that _always_ runs, such as the
> MIPS32r2 implementation of chacha, the compiler no longer has an
> opportunity to remove the generic code entirely from the binary,
> which under Zinc resulted in a smaller module.
>

It does. If you don't call hchacha_block_generic() in your code, the
library that exposes it never gets loaded in the first place.

Note that in this particular case, hchacha_block_generic() is exposed
by code that is always builtin so it doesn't matter.

> 2. The inliner can't make optimizations for that call.
>
> Disadvantage (2) might not make much of a difference. Disadvantage (1)
> seems like a bigger deal. However, perhaps the linker is smart and can
> remove the code and symbol? Or if not, is there a way to make the
> linker smart? Or would all this require crazy LTO which isn't going to
> happen any time soon?
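[Editorial note: a minimal sketch of the case described in the reply
above, assuming a hypothetical architecture whose ChaCha core always
runs; hchacha_block_arch_asm() is a made-up name. Because the generic
fallback symbol is never referenced, no dependency on the module that
exports hchacha_block_generic() is recorded and it is never loaded.]

/*
 * Hypothetical always-on arch glue: the generic fallback is never
 * referenced, so the library module exporting hchacha_block_generic()
 * is never pulled in.
 */
void hchacha_block(const u32 *state, u32 *stream, int nrounds)
{
        hchacha_block_arch_asm(state, stream, nrounds); /* hypothetical */
}
EXPORT_SYMBOL(hchacha_block);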
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index bc62daa8dafd..fd9ef42842cf 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
         }
 }
 
+void hchacha_block(const u32 *state, u32 *stream, int nrounds)
+{
+        state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+        if (!crypto_simd_usable()) {
+                hchacha_block_generic(state, stream, nrounds);
+        } else {
+                kernel_fpu_begin();
+                hchacha_block_ssse3(state, stream, nrounds);
+                kernel_fpu_end();
+        }
+}
+EXPORT_SYMBOL(hchacha_block);
+
+void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+        state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+        chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+                  int nrounds)
+{
+        state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+        if (bytes <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
+                return chacha_crypt_generic(state, dst, src, bytes, nrounds);
+
+        kernel_fpu_begin();
+        chacha_dosimd(state, dst, src, bytes, nrounds);
+        kernel_fpu_end();
+}
+EXPORT_SYMBOL(chacha_crypt);
+
 static int chacha_simd_stream_xor(struct skcipher_walk *walk,
                                   const struct chacha_ctx *ctx, const u8 *iv)
 {
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 074b125819b0..f90b53a526ba 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1436,6 +1436,7 @@ config CRYPTO_CHACHA20_X86_64
         depends on X86 && 64BIT
         select CRYPTO_BLKCIPHER
         select CRYPTO_CHACHA20
+        select CRYPTO_ARCH_HAVE_LIB_CHACHA
         help
           SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
           XChaCha20, and XChaCha12 stream ciphers.
diff --git a/include/crypto/chacha.h b/include/crypto/chacha.h
index 95a4a0ff4f7d..58192096679d 100644
--- a/include/crypto/chacha.h
+++ b/include/crypto/chacha.h
@@ -25,6 +25,12 @@
 #define CHACHA_BLOCK_SIZE       64
 #define CHACHAPOLY_IV_SIZE      12
 
+#ifdef CONFIG_X86_64
+#define CHACHA_STATE_WORDS      ((CHACHA_BLOCK_SIZE + 12) / sizeof(u32))
+#else
+#define CHACHA_STATE_WORDS      (CHACHA_BLOCK_SIZE / sizeof(u32))
+#endif
+
 /* 192-bit nonce, then 64-bit stream position */
 #define XCHACHA_IV_SIZE         32
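[Editorial note: for reference, a rough caller-side sketch of the library
interface this patch wires up, based only on the signatures visible in the
diff. The function example_encrypt(), the assumption that the key is passed
as eight 32-bit words and the IV as a 16-byte counter/nonce block, and the
choice of 20 rounds are all illustrative, not taken from the patch.]

#include <crypto/chacha.h>

static void example_encrypt(u8 *dst, const u8 *src, unsigned int len,
                            const u32 *key, const u8 *iv)
{
        /* CHACHA_STATE_WORDS is oversized on x86-64 so that the SIMD glue
         * can realign the state internally via PTR_ALIGN(). */
        u32 state[CHACHA_STATE_WORDS];

        chacha_init(state, key, iv);
        chacha_crypt(state, dst, src, len, 20);
}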
Wire the existing x86 SIMD ChaCha code into the new ChaCha library
interface.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
 crypto/Kconfig                |  1 +
 include/crypto/chacha.h       |  6 ++++
 3 files changed, 43 insertions(+)

--
2.20.1