
[v2,02/20] crypto: x86/chacha - expose SIMD ChaCha routine as library function

Message ID 20191002141713.31189-3-ard.biesheuvel@linaro.org
State New
Series: crypto: crypto API library interfaces for WireGuard

Commit Message

Ard Biesheuvel Oct. 2, 2019, 2:16 p.m. UTC
Wire the existing x86 SIMD ChaCha code into the new ChaCha library
interface.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
 crypto/Kconfig                |  1 +
 include/crypto/chacha.h       |  6 ++++
 3 files changed, 43 insertions(+)

-- 
2.20.1

Comments

Greg KH Oct. 2, 2019, 2:31 p.m. UTC | #1
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> interface.

Again, this says _what_, but not _why_.

Yes, changelogs are hard :(

thanks,

greg k-h
Jason A. Donenfeld Oct. 4, 2019, 1:36 p.m. UTC | #2
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> interface.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
>  crypto/Kconfig                |  1 +
>  include/crypto/chacha.h       |  6 ++++
>  3 files changed, 43 insertions(+)
>
> diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> index bc62daa8dafd..fd9ef42842cf 100644
> --- a/arch/x86/crypto/chacha_glue.c
> +++ b/arch/x86/crypto/chacha_glue.c
> @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
>  	}
>  }
>  
> +void hchacha_block(const u32 *state, u32 *stream, int nrounds)
> +{
> +	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
> +
> +	if (!crypto_simd_usable()) {
> +		hchacha_block_generic(state, stream, nrounds);
> +	} else {
> +		kernel_fpu_begin();
> +		hchacha_block_ssse3(state, stream, nrounds);
> +		kernel_fpu_end();
> +	}
> +}
> +EXPORT_SYMBOL(hchacha_block);

Please correct me if I'm wrong:

The approach here is slightly different from Zinc. In Zinc, I had one
entry point that conditionally called into the architecture-specific
implementation, and I did it inline using #includes so that in some
cases it could be optimized out.

Here, you override the original symbol defined by the generic module
from the architecture-specific implementation, and in there you decide
which way to branch.

Your approach has the advantage that you don't need to #include a .c
file like I did, an ugly yet very effective approach.

But it has two disadvantages:

1. For architecture-specific code that _always_ runs, such as the
  MIPS32r2 implementation of chacha, the compiler no longer has an
  opportunity to remove the generic code entirely from the binary,
  which under Zinc resulted in a smaller module.

2. The inliner can't make optimizations for that call.

Disadvantage (2) might not make much of a difference. Disadvantage (1)
seems like a bigger deal. However, perhaps the linker is smart and can
remove the code and symbol? Or if not, is there a way to make the linker
smart? Or would all this require crazy LTO which isn't going to happen
any time soon?
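
For comparison, a minimal sketch of the Zinc-style shape described above: one generic entry point with an arch hook that the compiler can inline or discard. The chacha_arch() name and file layout are placeholders for illustration, not code from either series.

#include <crypto/chacha.h>

/* The arch glue would supply the real SIMD path here; returning false
 * means "fall back to the generic implementation". */
static bool chacha_arch(u32 *state, u8 *dst, const u8 *src,
			unsigned int bytes, int nrounds)
{
	return false;
}

void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
		  int nrounds)
{
	/* If chacha_arch() is known at compile time to always succeed,
	 * the generic fallback below can be dropped from the binary. */
	if (!chacha_arch(state, dst, src, bytes, nrounds))
		chacha_crypt_generic(state, dst, src, bytes, nrounds);
}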
Ard Biesheuvel Oct. 4, 2019, 1:54 p.m. UTC | #3
On Fri, 4 Oct 2019 at 15:36, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> > Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> > interface.
> >
> > Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > ---
> >  arch/x86/crypto/chacha_glue.c | 36 ++++++++++++++++++++
> >  crypto/Kconfig                |  1 +
> >  include/crypto/chacha.h       |  6 ++++
> >  3 files changed, 43 insertions(+)
> >
> > diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> > index bc62daa8dafd..fd9ef42842cf 100644
> > --- a/arch/x86/crypto/chacha_glue.c
> > +++ b/arch/x86/crypto/chacha_glue.c
> > @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
> >       }
> >  }
> >
> > +void hchacha_block(const u32 *state, u32 *stream, int nrounds)
> > +{
> > +     state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
> > +
> > +     if (!crypto_simd_usable()) {
> > +             hchacha_block_generic(state, stream, nrounds);
> > +     } else {
> > +             kernel_fpu_begin();
> > +             hchacha_block_ssse3(state, stream, nrounds);
> > +             kernel_fpu_end();
> > +     }
> > +}
> > +EXPORT_SYMBOL(hchacha_block);
>
> Please correct me if I'm wrong:
>
> The approach here is slightly different from Zinc. In Zinc, I had one
> entry point that conditionally called into the architecture-specific
> implementation, and I did it inline using #includes so that in some
> cases it could be optimized out.
>
> Here, you override the original symbol defined by the generic module
> from the architecture-specific implementation, and in there you decide
> which way to branch.
>
> Your approach has the advantage that you don't need to #include a .c
> file like I did, an ugly yet very effective approach.
>
> But it has two disadvantages:
>
> 1. For architecture-specific code that _always_ runs, such as the
>   MIPS32r2 implementation of chacha, the compiler no longer has an
>   opportunity to remove the generic code entirely from the binary,
>   which under Zinc resulted in a smaller module.
>

It does. If you don't call hchacha_block_generic() in your code, the
library that exposes it never gets loaded in the first place.

Note that in this particular case, hchacha_block_generic() is exposed
by code that is always builtin so it doesn't matter.

> 2. The inliner can't make optimizations for that call.
>
> Disadvantage (2) might not make much of a difference. Disadvantage (1)
> seems like a bigger deal. However, perhaps the linker is smart and can
> remove the code and symbol? Or if not, is there a way to make the linker
> smart? Or would all this require crazy LTO which isn't going to happen
> any time soon?

Patch

diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index bc62daa8dafd..fd9ef42842cf 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -123,6 +123,42 @@  static void chacha_dosimd(u32 *state, u8 *dst, const u8 *src,
 	}
 }
 
+void hchacha_block(const u32 *state, u32 *stream, int nrounds)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	if (!crypto_simd_usable()) {
+		hchacha_block_generic(state, stream, nrounds);
+	} else {
+		kernel_fpu_begin();
+		hchacha_block_ssse3(state, stream, nrounds);
+		kernel_fpu_end();
+	}
+}
+EXPORT_SYMBOL(hchacha_block);
+
+void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds)
+{
+	state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
+
+	if (bytes <= CHACHA_BLOCK_SIZE || !crypto_simd_usable())
+		return chacha_crypt_generic(state, dst, src, bytes, nrounds);
+
+	kernel_fpu_begin();
+	chacha_dosimd(state, dst, src, bytes, nrounds);
+	kernel_fpu_end();
+}
+EXPORT_SYMBOL(chacha_crypt);
+
 static int chacha_simd_stream_xor(struct skcipher_walk *walk,
 				  const struct chacha_ctx *ctx, const u8 *iv)
 {
diff --git a/crypto/Kconfig b/crypto/Kconfig
index 074b125819b0..f90b53a526ba 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1436,6 +1436,7 @@  config CRYPTO_CHACHA20_X86_64
 	depends on X86 && 64BIT
 	select CRYPTO_BLKCIPHER
 	select CRYPTO_CHACHA20
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
 	help
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
diff --git a/include/crypto/chacha.h b/include/crypto/chacha.h
index 95a4a0ff4f7d..58192096679d 100644
--- a/include/crypto/chacha.h
+++ b/include/crypto/chacha.h
@@ -25,6 +25,12 @@ 
 #define CHACHA_BLOCK_SIZE	64
 #define CHACHAPOLY_IV_SIZE	12
 
+#ifdef CONFIG_X86_64
+#define CHACHA_STATE_WORDS	((CHACHA_BLOCK_SIZE + 12) / sizeof(u32))
+#else
+#define CHACHA_STATE_WORDS	(CHACHA_BLOCK_SIZE / sizeof(u32))
+#endif
+
 /* 192-bit nonce, then 64-bit stream position */
 #define XCHACHA_IV_SIZE		32
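
A minimal usage sketch of the exported interface (not part of the patch): the caller allocates CHACHA_STATE_WORDS of state, which on x86_64 includes slack so the arch code can realign it with PTR_ALIGN(), and drives the library functions directly. It assumes the CHACHA_KEY_SIZE and CHACHA_IV_SIZE definitions already present in <crypto/chacha.h> and memzero_explicit() from <linux/string.h>.

#include <crypto/chacha.h>
#include <linux/string.h>

static void chacha20_crypt_example(u8 *dst, const u8 *src, unsigned int len,
				   const u32 *key, const u8 *iv)
{
	/* CHACHA_STATE_WORDS reserves extra words on x86_64 for realignment. */
	u32 state[CHACHA_STATE_WORDS];

	chacha_init(state, key, iv);		/* key: 8 u32 words, iv: 16 bytes */
	chacha_crypt(state, dst, src, len, 20);	/* 20 rounds = ChaCha20 */

	memzero_explicit(state, sizeof(state));
}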