[v3,00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s

Message ID 20201223081003.373663-1-ebiggers@kernel.org

Message

Eric Biggers Dec. 23, 2020, 8:09 a.m. UTC
This patchset adds 32-bit ARM assembly language implementations of
BLAKE2b and BLAKE2s.

As a prerequisite to adding these without copy-and-pasting lots of code,
this patchset also reworks the existing BLAKE2b and BLAKE2s code to
provide helper functions that make implementing "shash" providers for
these algorithms much easier.  These changes also eliminate unnecessary
differences between the BLAKE2b and BLAKE2s code.

The new BLAKE2b implementation is NEON-accelerated, while the new
BLAKE2s implementation uses scalar instructions since NEON doesn't work
very well for it.  The BLAKE2b implementation is faster and is expected
to be useful as a replacement for SHA-1 in dm-verity, while the BLAKE2s
implementation would be useful for WireGuard, which uses BLAKE2s.

Both new implementations are wired up to the shash API, while the new
BLAKE2s implementation is also wired up to the library API.
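
For illustration, kernel code would reach these implementations through
the generic hash interface roughly as follows.  This is only a sketch
(not part of this patchset; the helper name is made up and error paths
are minimal).  "blake2b-256" resolves to the highest-priority
registered provider, e.g. blake2b-256-neon once that module is loaded:

        #include <crypto/hash.h>
        #include <linux/err.h>
        #include <linux/slab.h>

        static int blake2b_256_digest(const u8 *data, unsigned int len,
                                      u8 out[32])
        {
                struct crypto_shash *tfm;
                struct shash_desc *desc;
                int err;

                tfm = crypto_alloc_shash("blake2b-256", 0, 0);
                if (IS_ERR(tfm))
                        return PTR_ERR(tfm);

                /* descsize covers the per-request blake2b_state. */
                desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm),
                               GFP_KERNEL);
                if (!desc) {
                        crypto_free_shash(tfm);
                        return -ENOMEM;
                }
                desc->tfm = tfm;

                err = crypto_shash_digest(desc, data, len, out);

                kfree(desc);
                crypto_free_shash(tfm);
                return err;
        }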

See the individual commits for full details, including benchmarks.

This patchset was tested on a Raspberry Pi 2 (which uses a Cortex-A7
processor) with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y, plus other tests.

This patchset applies to mainline commit 614cb5894306.

Changed since v2:
   - Reworked the shash helpers again.  Now they are inline functions,
     and for BLAKE2s they now share more code with the library API.
   - Made the BLAKE2b code more consistent with the BLAKE2s code.
   - Moved the BLAKE2s changes to the start of the patchset so that the
     BLAKE2b changes can be made just by syncing the code with BLAKE2s.
   - Added a few BLAKE2s cleanups (which get included in BLAKE2b too).
   - Improved some comments in the new asm files.

Changed since v1:
   - Added ARM scalar implementation of BLAKE2s.
   - Adjusted the BLAKE2b helper functions to be consistent with what I
     decided to do for BLAKE2s.
   - Fixed build error in blake2b-neon-core.S in some configurations.

Eric Biggers (14):
  crypto: blake2s - define shash_alg structs using macros
  crypto: x86/blake2s - define shash_alg structs using macros
  crypto: blake2s - remove unneeded includes
  crypto: blake2s - move update and final logic to internal/blake2s.h
  crypto: blake2s - share the "shash" API boilerplate code
  crypto: blake2s - optimize blake2s initialization
  crypto: blake2s - add comment for blake2s_state fields
  crypto: blake2s - adjust include guard naming
  crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
  crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
  wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
  crypto: blake2b - sync with blake2s implementation
  crypto: blake2b - update file comment
  crypto: arm/blake2b - add NEON-accelerated BLAKE2b

 arch/arm/crypto/Kconfig             |  19 ++
 arch/arm/crypto/Makefile            |   4 +
 arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
 arch/arm/crypto/blake2s-core.S      | 285 +++++++++++++++++++++++
 arch/arm/crypto/blake2s-glue.c      |  78 +++++++
 arch/x86/crypto/blake2s-glue.c      | 150 +++---------
 crypto/blake2b_generic.c            | 249 +++++---------------
 crypto/blake2s_generic.c            | 158 +++----------
 drivers/net/Kconfig                 |   1 +
 include/crypto/blake2b.h            |  67 ++++++
 include/crypto/blake2s.h            |  63 ++---
 include/crypto/internal/blake2b.h   | 115 +++++++++
 include/crypto/internal/blake2s.h   | 109 ++++++++-
 lib/crypto/blake2s.c                |  48 +---
 15 files changed, 1278 insertions(+), 520 deletions(-)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
 create mode 100644 arch/arm/crypto/blake2s-core.S
 create mode 100644 arch/arm/crypto/blake2s-glue.c
 create mode 100644 include/crypto/blake2b.h
 create mode 100644 include/crypto/internal/blake2b.h


base-commit: 614cb5894306cfa2c7d9b6168182876ff5948735

Comments

Ard Biesheuvel Dec. 23, 2020, 9:06 a.m. UTC | #1
On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> If no key was provided, then don't waste time initializing the block
> buffer, as its initial contents won't be used.
>
> Also, make crypto_blake2s_init() and blake2s() call a single internal
> function __blake2s_init() which treats the key as optional, rather than
> conditionally calling blake2s_init() or blake2s_init_key().  This
> reduces the compiled code size, as previously both blake2s_init() and
> blake2s_init_key() were being inlined into these two callers, except
> when the key size passed to blake2s() was a compile-time constant.
>
> These optimizations aren't that significant for BLAKE2s.  However, the
> equivalent optimizations will be more significant for BLAKE2b, as
> everything is twice as big in BLAKE2b.  And it's good to keep things
> consistent rather than making optimizations for BLAKE2b but not BLAKE2s.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h          | 53 ++++++++++++++++---------------
>  include/crypto/internal/blake2s.h |  5 +--
>  2 files changed, 28 insertions(+), 30 deletions(-)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index b471deac28ff8..734ed22b7a6aa 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -43,29 +43,34 @@ enum blake2s_iv {
>         BLAKE2S_IV7 = 0x5BE0CD19UL,
>  };
>
> -void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> -void blake2s_final(struct blake2s_state *state, u8 *out);
> -
> -static inline void blake2s_init_param(struct blake2s_state *state,
> -                                     const u32 param)
> +static inline void __blake2s_init(struct blake2s_state *state, size_t outlen,
> +                                 const void *key, size_t keylen)
>  {
> -       *state = (struct blake2s_state){{
> -               BLAKE2S_IV0 ^ param,
> -               BLAKE2S_IV1,
> -               BLAKE2S_IV2,
> -               BLAKE2S_IV3,
> -               BLAKE2S_IV4,
> -               BLAKE2S_IV5,
> -               BLAKE2S_IV6,
> -               BLAKE2S_IV7,
> -       }};
> +       state->h[0] = BLAKE2S_IV0 ^ (0x01010000 | keylen << 8 | outlen);
> +       state->h[1] = BLAKE2S_IV1;
> +       state->h[2] = BLAKE2S_IV2;
> +       state->h[3] = BLAKE2S_IV3;
> +       state->h[4] = BLAKE2S_IV4;
> +       state->h[5] = BLAKE2S_IV5;
> +       state->h[6] = BLAKE2S_IV6;
> +       state->h[7] = BLAKE2S_IV7;
> +       state->t[0] = 0;
> +       state->t[1] = 0;
> +       state->f[0] = 0;
> +       state->f[1] = 0;
> +       state->buflen = 0;
> +       state->outlen = outlen;
> +       if (keylen) {
> +               memcpy(state->buf, key, keylen);
> +               memset(&state->buf[keylen], 0, BLAKE2S_BLOCK_SIZE - keylen);
> +               state->buflen = BLAKE2S_BLOCK_SIZE;
> +       }
>  }
>
>  static inline void blake2s_init(struct blake2s_state *state,
>                                 const size_t outlen)
>  {
> -       blake2s_init_param(state, 0x01010000 | outlen);
> -       state->outlen = outlen;
> +       __blake2s_init(state, outlen, NULL, 0);
>  }
>
>  static inline void blake2s_init_key(struct blake2s_state *state,
> @@ -75,12 +80,12 @@ static inline void blake2s_init_key(struct blake2s_state *state,
>         WARN_ON(IS_ENABLED(DEBUG) && (!outlen || outlen > BLAKE2S_HASH_SIZE ||
>                 !key || !keylen || keylen > BLAKE2S_KEY_SIZE));
>
> -       blake2s_init_param(state, 0x01010000 | keylen << 8 | outlen);
> -       memcpy(state->buf, key, keylen);
> -       state->buflen = BLAKE2S_BLOCK_SIZE;
> -       state->outlen = outlen;
> +       __blake2s_init(state, outlen, key, keylen);
>  }
>
> +void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> +void blake2s_final(struct blake2s_state *state, u8 *out);
> +
>  static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>                            const size_t outlen, const size_t inlen,
>                            const size_t keylen)
> @@ -91,11 +96,7 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>                 outlen > BLAKE2S_HASH_SIZE || keylen > BLAKE2S_KEY_SIZE ||
>                 (!key && keylen)));
>
> -       if (keylen)
> -               blake2s_init_key(&state, outlen, key, keylen);
> -       else
> -               blake2s_init(&state, outlen);
> -
> +       __blake2s_init(&state, outlen, key, keylen);
>         blake2s_update(&state, in, inlen);
>         blake2s_final(&state, out);
>  }
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 2ea0a8f5e7f41..867ef3753f5c1 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -93,10 +93,7 @@ static inline int crypto_blake2s_init(struct shash_desc *desc)
>         struct blake2s_state *state = shash_desc_ctx(desc);
>         unsigned int outlen = crypto_shash_digestsize(desc->tfm);
>
> -       if (tctx->keylen)
> -               blake2s_init_key(state, outlen, tctx->key, tctx->keylen);
> -       else
> -               blake2s_init(state, outlen);
> +       __blake2s_init(state, outlen, tctx->key, tctx->keylen);
>         return 0;
>  }
>
> --
> 2.29.2
>
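
For context on the patch quoted above: the one-shot blake2s() now
funnels both keyed and unkeyed hashing through the single inline
__blake2s_init(), which is what eliminates the duplicated inlined
copies of blake2s_init() and blake2s_init_key().  A minimal caller
using only the library API from the diff (a hypothetical sketch, not
part of the series):

        #include <crypto/blake2s.h>

        static void blake2s_example(u8 digest[BLAKE2S_HASH_SIZE],
                                    const u8 *msg, size_t len,
                                    const u8 *key, size_t keylen)
        {
                /*
                 * keylen == 0 selects unkeyed hashing; either way the
                 * state is set up by one __blake2s_init() call.
                 */
                blake2s(digest, msg, key, BLAKE2S_HASH_SIZE, len, keylen);
        }

(The constant 0x01010000 | keylen << 8 | outlen XORed into h[0] is the
first little-endian word of the BLAKE2s parameter block: digest_length,
key_length, fanout = 1, and depth = 1.)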
Ard Biesheuvel Dec. 23, 2020, 9:07 a.m. UTC | #2
On Wed, 23 Dec 2020 at 09:12, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Use the full path in the include guards for the BLAKE2s headers to avoid
> ambiguity and to match the convention for most files in include/crypto/.
>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Acked-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  include/crypto/blake2s.h          | 6 +++---
>  include/crypto/internal/blake2s.h | 6 +++---
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/include/crypto/blake2s.h b/include/crypto/blake2s.h
> index f1c8330a61a91..3f06183c2d804 100644
> --- a/include/crypto/blake2s.h
> +++ b/include/crypto/blake2s.h
> @@ -3,8 +3,8 @@
>   * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>   */
>
> -#ifndef BLAKE2S_H
> -#define BLAKE2S_H
> +#ifndef _CRYPTO_BLAKE2S_H
> +#define _CRYPTO_BLAKE2S_H
>
>  #include <linux/types.h>
>  #include <linux/kernel.h>
> @@ -105,4 +105,4 @@ static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
>  void blake2s256_hmac(u8 *out, const u8 *in, const u8 *key, const size_t inlen,
>                      const size_t keylen);
>
> -#endif /* BLAKE2S_H */
> +#endif /* _CRYPTO_BLAKE2S_H */
> diff --git a/include/crypto/internal/blake2s.h b/include/crypto/internal/blake2s.h
> index 867ef3753f5c1..8e50d487500f2 100644
> --- a/include/crypto/internal/blake2s.h
> +++ b/include/crypto/internal/blake2s.h
> @@ -4,8 +4,8 @@
>   * Keep this in sync with the corresponding BLAKE2b header.
>   */
>
> -#ifndef BLAKE2S_INTERNAL_H
> -#define BLAKE2S_INTERNAL_H
> +#ifndef _CRYPTO_INTERNAL_BLAKE2S_H
> +#define _CRYPTO_INTERNAL_BLAKE2S_H
>
>  #include <crypto/blake2s.h>
>  #include <crypto/internal/hash.h>
> @@ -116,4 +116,4 @@ static inline int crypto_blake2s_final(struct shash_desc *desc, u8 *out,
>         return 0;
>  }
>
> -#endif /* BLAKE2S_INTERNAL_H */
> +#endif /* _CRYPTO_INTERNAL_BLAKE2S_H */
> --
> 2.29.2
>
Ard Biesheuvel Dec. 23, 2020, 9:10 a.m. UTC | #3
On Wed, 23 Dec 2020 at 09:13, Eric Biggers <ebiggers@kernel.org> wrote:
>
> From: Eric Biggers <ebiggers@google.com>
>
> Add a NEON-accelerated implementation of BLAKE2b.
>
> On Cortex-A7 (which these days is the most common ARM processor that
> doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
> SHA-256, and slightly faster than SHA-1.  It is also almost three times
> as fast as the generic implementation of BLAKE2b:
>
>         Algorithm            Cycles per byte (on 4096-byte messages)
>         ===================  =======================================
>         blake2b-256-neon     14.0
>         sha1-neon            16.3
>         blake2s-256-arm      18.8
>         sha1-asm             20.8
>         blake2s-256-generic  26.0
>         sha256-neon          28.9
>         sha256-asm           32.0
>         blake2b-256-generic  38.9
>
> This implementation isn't directly based on any other implementation,
> but it borrows some ideas from previous NEON code I've written as well
> as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
> the other NEON implementations of BLAKE2b I'm aware of (the
> implementation in the BLAKE2 official repository using intrinsics, and
> Andrew Moon's implementation which can be found in SUPERCOP).  It does
> only one block at a time, so it performs well on short messages too.
>
> NEON-accelerated BLAKE2b is useful because there is interest in using
> BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
> devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
> these devices, the performance cost of upgrading to SHA-256 may be
> unacceptable, whereas BLAKE2b-256 would actually improve performance.
>
> Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
> is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
> BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
> 64-bit operations, and because BLAKE2s's block size is too small for
> NEON to be helpful for it.  The best I've been able to do with BLAKE2s
> on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
>
> (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but
> they're more complex as they require running multiple hashes at once.
> Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
> so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
> from the smaller number of rounds, not from the extra parallelism.)
>
> For now this BLAKE2b implementation is only wired up to the shash API,
> since there is no library API for BLAKE2b yet.  However, I've tried to
> keep things consistent with BLAKE2s, e.g. by defining
> blake2b_compress_arch() which is analogous to blake2s_compress_arch()
> and could be exported for use by the library API later if needed.
>
> Acked-by: Ard Biesheuvel <ardb@kernel.org>
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Tested-by: Ard Biesheuvel <ardb@kernel.org>

> ---
>  arch/arm/crypto/Kconfig             |  10 +
>  arch/arm/crypto/Makefile            |   2 +
>  arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
>  arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
>  4 files changed, 464 insertions(+)
>  create mode 100644 arch/arm/crypto/blake2b-neon-core.S
>  create mode 100644 arch/arm/crypto/blake2b-neon-glue.c
>
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index 281c829c12d0b..2b575792363e5 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -71,6 +71,16 @@ config CRYPTO_BLAKE2S_ARM
>           slower than the NEON implementation of BLAKE2b.  (There is no NEON
>           implementation of BLAKE2s, since NEON doesn't really help with it.)
>
> +config CRYPTO_BLAKE2B_NEON
> +       tristate "BLAKE2b digest algorithm (ARM NEON)"
> +       depends on KERNEL_MODE_NEON
> +       select CRYPTO_BLAKE2B
> +       help
> +         BLAKE2b digest algorithm optimized with ARM NEON instructions.
> +         On ARM processors that have NEON support but not the ARMv8
> +         Crypto Extensions, typically this BLAKE2b implementation is
> +         much faster than SHA-2 and slightly faster than SHA-1.
> +
>  config CRYPTO_AES_ARM
>         tristate "Scalar AES cipher for ARM"
>         select CRYPTO_ALGAPI
> diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
> index 5ad1e985a718b..8f26c454ea12e 100644
> --- a/arch/arm/crypto/Makefile
> +++ b/arch/arm/crypto/Makefile
> @@ -10,6 +10,7 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
>  obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
>  obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
>  obj-$(CONFIG_CRYPTO_BLAKE2S_ARM) += blake2s-arm.o
> +obj-$(CONFIG_CRYPTO_BLAKE2B_NEON) += blake2b-neon.o
>  obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha-neon.o
>  obj-$(CONFIG_CRYPTO_POLY1305_ARM) += poly1305-arm.o
>  obj-$(CONFIG_CRYPTO_NHPOLY1305_NEON) += nhpoly1305-neon.o
> @@ -31,6 +32,7 @@ sha256-arm-y  := sha256-core.o sha256_glue.o $(sha256-arm-neon-y)
>  sha512-arm-neon-$(CONFIG_KERNEL_MODE_NEON) := sha512-neon-glue.o
>  sha512-arm-y   := sha512-core.o sha512-glue.o $(sha512-arm-neon-y)
>  blake2s-arm-y   := blake2s-core.o blake2s-glue.o
> +blake2b-neon-y  := blake2b-neon-core.o blake2b-neon-glue.o
>  sha1-arm-ce-y  := sha1-ce-core.o sha1-ce-glue.o
>  sha2-arm-ce-y  := sha2-ce-core.o sha2-ce-glue.o
>  aes-arm-ce-y   := aes-ce-core.o aes-ce-glue.o
> diff --git a/arch/arm/crypto/blake2b-neon-core.S b/arch/arm/crypto/blake2b-neon-core.S
> new file mode 100644
> index 0000000000000..0406a186377fb
> --- /dev/null
> +++ b/arch/arm/crypto/blake2b-neon-core.S
> @@ -0,0 +1,347 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * BLAKE2b digest algorithm, NEON accelerated
> + *
> + * Copyright 2020 Google LLC
> + *
> + * Author: Eric Biggers <ebiggers@google.com>
> + */
> +
> +#include <linux/linkage.h>
> +
> +       .text
> +       .fpu            neon
> +
> +       // The arguments to blake2b_compress_neon()
> +       STATE           .req    r0
> +       BLOCK           .req    r1
> +       NBLOCKS         .req    r2
> +       INC             .req    r3
> +
> +       // Pointers to the rotation tables
> +       ROR24_TABLE     .req    r4
> +       ROR16_TABLE     .req    r5
> +
> +       // The original stack pointer
> +       ORIG_SP         .req    r6
> +
> +       // NEON registers which contain the message words of the current block.
> +       // M_0-M_3 are occasionally used for other purposes too.
> +       M_0             .req    d16
> +       M_1             .req    d17
> +       M_2             .req    d18
> +       M_3             .req    d19
> +       M_4             .req    d20
> +       M_5             .req    d21
> +       M_6             .req    d22
> +       M_7             .req    d23
> +       M_8             .req    d24
> +       M_9             .req    d25
> +       M_10            .req    d26
> +       M_11            .req    d27
> +       M_12            .req    d28
> +       M_13            .req    d29
> +       M_14            .req    d30
> +       M_15            .req    d31
> +
> +       .align          4
> +       // Tables for computing ror64(x, 24) and ror64(x, 16) using the vtbl.8
> +       // instruction.  This is the most efficient way to implement these
> +       // rotation amounts with NEON.  (On Cortex-A53 it's the same speed as
> +       // vshr.u64 + vsli.u64, while on Cortex-A7 it's faster.)
> +.Lror24_table:
> +       .byte           3, 4, 5, 6, 7, 0, 1, 2
> +.Lror16_table:
> +       .byte           2, 3, 4, 5, 6, 7, 0, 1
> +       // The BLAKE2b initialization vector
> +.Lblake2b_IV:
> +       .quad           0x6a09e667f3bcc908, 0xbb67ae8584caa73b
> +       .quad           0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1
> +       .quad           0x510e527fade682d1, 0x9b05688c2b3e6c1f
> +       .quad           0x1f83d9abfb41bd6b, 0x5be0cd19137e2179
> +
> +// Execute one round of BLAKE2b by updating the state matrix v[0..15] in the
> +// NEON registers q0-q7.  The message block is in q8..q15 (M_0-M_15).  The stack
> +// pointer points to a 32-byte aligned buffer containing a copy of q8 and q9
> +// (M_0-M_3), so that they can be reloaded if they are used as temporary
> +// registers.  The macro arguments s0-s15 give the order in which the message
> +// words are used in this round.  'final' is 1 if this is the final round.
> +.macro _blake2b_round  s0, s1, s2, s3, s4, s5, s6, s7, \
> +                       s8, s9, s10, s11, s12, s13, s14, s15, final=0
> +
> +       // Mix the columns:
> +       // (v[0], v[4], v[8], v[12]), (v[1], v[5], v[9], v[13]),
> +       // (v[2], v[6], v[10], v[14]), and (v[3], v[7], v[11], v[15]).
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 0]];
> +       vadd.u64        q0, q0, q2
> +       vadd.u64        q1, q1, q3
> +       vadd.u64        d0, d0, M_\s0
> +       vadd.u64        d1, d1, M_\s2
> +       vadd.u64        d2, d2, M_\s4
> +       vadd.u64        d3, d3, M_\s6
> +
> +       // d = ror64(d ^ a, 32);
> +       veor            q6, q6, q0
> +       veor            q7, q7, q1
> +       vrev64.32       q6, q6
> +       vrev64.32       q7, q7
> +
> +       // c += d;
> +       vadd.u64        q4, q4, q6
> +       vadd.u64        q5, q5, q7
> +
> +       // b = ror64(b ^ c, 24);
> +       vld1.8          {M_0}, [ROR24_TABLE, :64]
> +       veor            q2, q2, q4
> +       veor            q3, q3, q5
> +       vtbl.8          d4, {d4}, M_0
> +       vtbl.8          d5, {d5}, M_0
> +       vtbl.8          d6, {d6}, M_0
> +       vtbl.8          d7, {d7}, M_0
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 1]];
> +       //
> +       // M_0 got clobbered above, so we have to reload it if any of the four
> +       // message words this step needs happens to be M_0.  Otherwise we don't
> +       // need to reload it here, as it will just get clobbered again below.
> +.if \s1 == 0 || \s3 == 0 || \s5 == 0 || \s7 == 0
> +       vld1.8          {M_0}, [sp, :64]
> +.endif
> +       vadd.u64        q0, q0, q2
> +       vadd.u64        q1, q1, q3
> +       vadd.u64        d0, d0, M_\s1
> +       vadd.u64        d1, d1, M_\s3
> +       vadd.u64        d2, d2, M_\s5
> +       vadd.u64        d3, d3, M_\s7
> +
> +       // d = ror64(d ^ a, 16);
> +       vld1.8          {M_0}, [ROR16_TABLE, :64]
> +       veor            q6, q6, q0
> +       veor            q7, q7, q1
> +       vtbl.8          d12, {d12}, M_0
> +       vtbl.8          d13, {d13}, M_0
> +       vtbl.8          d14, {d14}, M_0
> +       vtbl.8          d15, {d15}, M_0
> +
> +       // c += d;
> +       vadd.u64        q4, q4, q6
> +       vadd.u64        q5, q5, q7
> +
> +       // b = ror64(b ^ c, 63);
> +       //
> +       // This rotation amount isn't a multiple of 8, so it has to be
> +       // implemented using a pair of shifts, which requires temporary
> +       // registers.  Use q8-q9 (M_0-M_3) for this, and reload them afterwards.
> +       veor            q8, q2, q4
> +       veor            q9, q3, q5
> +       vshr.u64        q2, q8, #63
> +       vshr.u64        q3, q9, #63
> +       vsli.u64        q2, q8, #1
> +       vsli.u64        q3, q9, #1
> +       vld1.8          {q8-q9}, [sp, :256]
> +
> +       // Mix the diagonals:
> +       // (v[0], v[5], v[10], v[15]), (v[1], v[6], v[11], v[12]),
> +       // (v[2], v[7], v[8], v[13]), and (v[3], v[4], v[9], v[14]).
> +       //
> +       // There are two possible ways to do this: use 'vext' instructions to
> +       // shift the rows of the matrix so that the diagonals become columns,
> +       // and undo it afterwards; or just use 64-bit operations on 'd'
> +       // registers instead of 128-bit operations on 'q' registers.  We use the
> +       // latter approach, as it performs much better on Cortex-A7.
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 0]];
> +       vadd.u64        d0, d0, d5
> +       vadd.u64        d1, d1, d6
> +       vadd.u64        d2, d2, d7
> +       vadd.u64        d3, d3, d4
> +       vadd.u64        d0, d0, M_\s8
> +       vadd.u64        d1, d1, M_\s10
> +       vadd.u64        d2, d2, M_\s12
> +       vadd.u64        d3, d3, M_\s14
> +
> +       // d = ror64(d ^ a, 32);
> +       veor            d15, d15, d0
> +       veor            d12, d12, d1
> +       veor            d13, d13, d2
> +       veor            d14, d14, d3
> +       vrev64.32       d15, d15
> +       vrev64.32       d12, d12
> +       vrev64.32       d13, d13
> +       vrev64.32       d14, d14
> +
> +       // c += d;
> +       vadd.u64        d10, d10, d15
> +       vadd.u64        d11, d11, d12
> +       vadd.u64        d8, d8, d13
> +       vadd.u64        d9, d9, d14
> +
> +       // b = ror64(b ^ c, 24);
> +       vld1.8          {M_0}, [ROR24_TABLE, :64]
> +       veor            d5, d5, d10
> +       veor            d6, d6, d11
> +       veor            d7, d7, d8
> +       veor            d4, d4, d9
> +       vtbl.8          d5, {d5}, M_0
> +       vtbl.8          d6, {d6}, M_0
> +       vtbl.8          d7, {d7}, M_0
> +       vtbl.8          d4, {d4}, M_0
> +
> +       // a += b + m[blake2b_sigma[r][2*i + 1]];
> +.if \s9 == 0 || \s11 == 0 || \s13 == 0 || \s15 == 0
> +       vld1.8          {M_0}, [sp, :64]
> +.endif
> +       vadd.u64        d0, d0, d5
> +       vadd.u64        d1, d1, d6
> +       vadd.u64        d2, d2, d7
> +       vadd.u64        d3, d3, d4
> +       vadd.u64        d0, d0, M_\s9
> +       vadd.u64        d1, d1, M_\s11
> +       vadd.u64        d2, d2, M_\s13
> +       vadd.u64        d3, d3, M_\s15
> +
> +       // d = ror64(d ^ a, 16);
> +       vld1.8          {M_0}, [ROR16_TABLE, :64]
> +       veor            d15, d15, d0
> +       veor            d12, d12, d1
> +       veor            d13, d13, d2
> +       veor            d14, d14, d3
> +       vtbl.8          d12, {d12}, M_0
> +       vtbl.8          d13, {d13}, M_0
> +       vtbl.8          d14, {d14}, M_0
> +       vtbl.8          d15, {d15}, M_0
> +
> +       // c += d;
> +       vadd.u64        d10, d10, d15
> +       vadd.u64        d11, d11, d12
> +       vadd.u64        d8, d8, d13
> +       vadd.u64        d9, d9, d14
> +
> +       // b = ror64(b ^ c, 63);
> +       veor            d16, d4, d9
> +       veor            d17, d5, d10
> +       veor            d18, d6, d11
> +       veor            d19, d7, d8
> +       vshr.u64        q2, q8, #63
> +       vshr.u64        q3, q9, #63
> +       vsli.u64        q2, q8, #1
> +       vsli.u64        q3, q9, #1
> +       // Reloading q8-q9 can be skipped on the final round.
> +.if ! \final
> +       vld1.8          {q8-q9}, [sp, :256]
> +.endif
> +.endm
> +
> +//
> +// void blake2b_compress_neon(struct blake2b_state *state,
> +//                           const u8 *block, size_t nblocks, u32 inc);
> +//
> +// Only the first three fields of struct blake2b_state are used:
> +//     u64 h[8];       (inout)
> +//     u64 t[2];       (inout)
> +//     u64 f[2];       (in)
> +//
> +       .align          5
> +ENTRY(blake2b_compress_neon)
> +       push            {r4-r10}
> +
> +       // Allocate a 32-byte stack buffer that is 32-byte aligned.
> +       mov             ORIG_SP, sp
> +       sub             ip, sp, #32
> +       bic             ip, ip, #31
> +       mov             sp, ip
> +
> +       adr             ROR24_TABLE, .Lror24_table
> +       adr             ROR16_TABLE, .Lror16_table
> +
> +       mov             ip, STATE
> +       vld1.64         {q0-q1}, [ip]!          // Load h[0..3]
> +       vld1.64         {q2-q3}, [ip]!          // Load h[4..7]
> +.Lnext_block:
> +         adr           r10, .Lblake2b_IV
> +       vld1.64         {q14-q15}, [ip]         // Load t[0..1] and f[0..1]
> +       vld1.64         {q4-q5}, [r10]!         // Load IV[0..3]
> +         vmov          r7, r8, d28             // Copy t[0] to (r7, r8)
> +       vld1.64         {q6-q7}, [r10]          // Load IV[4..7]
> +         adds          r7, r7, INC             // Increment counter
> +       bcs             .Lslow_inc_ctr
> +       vmov.i32        d28[0], r7
> +       vst1.64         {d28}, [ip]             // Update t[0]
> +.Linc_ctr_done:
> +
> +       // Load the next message block and finish initializing the state matrix
> +       // 'v'.  Fortunately, there are exactly enough NEON registers to fit the
> +       // entire state matrix in q0-q7 and the entire message block in q8-15.
> +       //
> +       // However, _blake2b_round also needs some extra registers for rotates,
> +       // so we have to spill some registers.  It's better to spill the message
> +       // registers than the state registers, as the message doesn't change.
> +       // Therefore we store a copy of the first 32 bytes of the message block
> +       // (q8-q9) in an aligned buffer on the stack so that they can be
> +       // reloaded when needed.  (We could just reload directly from the
> +       // message buffer, but it's faster to use aligned loads.)
> +       vld1.8          {q8-q9}, [BLOCK]!
> +         veor          q6, q6, q14     // v[12..13] = IV[4..5] ^ t[0..1]
> +       vld1.8          {q10-q11}, [BLOCK]!
> +         veor          q7, q7, q15     // v[14..15] = IV[6..7] ^ f[0..1]
> +       vld1.8          {q12-q13}, [BLOCK]!
> +       vst1.8          {q8-q9}, [sp, :256]
> +         mov           ip, STATE
> +       vld1.8          {q14-q15}, [BLOCK]!
> +
> +       // Execute the rounds.  Each round is provided the order in which it
> +       // needs to use the message words.
> +       _blake2b_round  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +       _blake2b_round  14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
> +       _blake2b_round  11, 8, 12, 0, 5, 2, 15, 13, 10, 14, 3, 6, 7, 1, 9, 4
> +       _blake2b_round  7, 9, 3, 1, 13, 12, 11, 14, 2, 6, 5, 10, 4, 0, 15, 8
> +       _blake2b_round  9, 0, 5, 7, 2, 4, 10, 15, 14, 1, 11, 12, 6, 8, 3, 13
> +       _blake2b_round  2, 12, 6, 10, 0, 11, 8, 3, 4, 13, 7, 5, 15, 14, 1, 9
> +       _blake2b_round  12, 5, 1, 15, 14, 13, 4, 10, 0, 7, 6, 3, 9, 2, 8, 11
> +       _blake2b_round  13, 11, 7, 14, 12, 1, 3, 9, 5, 0, 15, 4, 8, 6, 2, 10
> +       _blake2b_round  6, 15, 14, 9, 11, 3, 0, 8, 12, 2, 13, 7, 1, 4, 10, 5
> +       _blake2b_round  10, 2, 8, 4, 7, 6, 1, 5, 15, 11, 9, 14, 3, 12, 13, 0
> +       _blake2b_round  0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
> +       _blake2b_round  14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 \
> +                       final=1
> +
> +       // Fold the final state matrix into the hash chaining value:
> +       //
> +       //      for (i = 0; i < 8; i++)
> +       //              h[i] ^= v[i] ^ v[i + 8];
> +       //
> +         vld1.64       {q8-q9}, [ip]!          // Load old h[0..3]
> +       veor            q0, q0, q4              // v[0..1] ^= v[8..9]
> +       veor            q1, q1, q5              // v[2..3] ^= v[10..11]
> +         vld1.64       {q10-q11}, [ip]         // Load old h[4..7]
> +       veor            q2, q2, q6              // v[4..5] ^= v[12..13]
> +       veor            q3, q3, q7              // v[6..7] ^= v[14..15]
> +       veor            q0, q0, q8              // v[0..1] ^= h[0..1]
> +       veor            q1, q1, q9              // v[2..3] ^= h[2..3]
> +         mov           ip, STATE
> +         subs          NBLOCKS, NBLOCKS, #1    // nblocks--
> +         vst1.64       {q0-q1}, [ip]!          // Store new h[0..3]
> +       veor            q2, q2, q10             // v[4..5] ^= h[4..5]
> +       veor            q3, q3, q11             // v[6..7] ^= h[6..7]
> +         vst1.64       {q2-q3}, [ip]!          // Store new h[4..7]
> +
> +       // Advance to the next block, if there is one.
> +       bne             .Lnext_block            // nblocks != 0?
> +
> +       mov             sp, ORIG_SP
> +       pop             {r4-r10}
> +       mov             pc, lr
> +
> +.Lslow_inc_ctr:
> +       // Handle the case where the counter overflowed its low 32 bits, by
> +       // carrying the overflow bit into the full 128-bit counter.
> +       vmov            r9, r10, d29
> +       adcs            r8, r8, #0
> +       adcs            r9, r9, #0
> +       adc             r10, r10, #0
> +       vmov            d28, r7, r8
> +       vmov            d29, r9, r10
> +       vst1.64         {q14}, [ip]             // Update t[0] and t[1]
> +       b               .Linc_ctr_done
> +ENDPROC(blake2b_compress_neon)
> diff --git a/arch/arm/crypto/blake2b-neon-glue.c b/arch/arm/crypto/blake2b-neon-glue.c
> new file mode 100644
> index 0000000000000..34d73200e7fa6
> --- /dev/null
> +++ b/arch/arm/crypto/blake2b-neon-glue.c
> @@ -0,0 +1,105 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BLAKE2b digest algorithm, NEON accelerated
> + *
> + * Copyright 2020 Google LLC
> + */
> +
> +#include <crypto/internal/blake2b.h>
> +#include <crypto/internal/hash.h>
> +#include <crypto/internal/simd.h>
> +
> +#include <linux/module.h>
> +#include <linux/sizes.h>
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +
> +asmlinkage void blake2b_compress_neon(struct blake2b_state *state,
> +                                     const u8 *block, size_t nblocks, u32 inc);
> +
> +static void blake2b_compress_arch(struct blake2b_state *state,
> +                                 const u8 *block, size_t nblocks, u32 inc)
> +{
> +       if (!crypto_simd_usable()) {
> +               blake2b_compress_generic(state, block, nblocks, inc);
> +               return;
> +       }
> +
> +       do {
> +               const size_t blocks = min_t(size_t, nblocks,
> +                                           SZ_4K / BLAKE2B_BLOCK_SIZE);
> +
> +               kernel_neon_begin();
> +               blake2b_compress_neon(state, block, blocks, inc);
> +               kernel_neon_end();
> +
> +               nblocks -= blocks;
> +               block += blocks * BLAKE2B_BLOCK_SIZE;
> +       } while (nblocks);
> +}
> +
> +static int crypto_blake2b_update_neon(struct shash_desc *desc,
> +                                     const u8 *in, unsigned int inlen)
> +{
> +       return crypto_blake2b_update(desc, in, inlen, blake2b_compress_arch);
> +}
> +
> +static int crypto_blake2b_final_neon(struct shash_desc *desc, u8 *out)
> +{
> +       return crypto_blake2b_final(desc, out, blake2b_compress_arch);
> +}
> +
> +#define BLAKE2B_ALG(name, driver_name, digest_size)                    \
> +       {                                                               \
> +               .base.cra_name          = name,                         \
> +               .base.cra_driver_name   = driver_name,                  \
> +               .base.cra_priority      = 200,                          \
> +               .base.cra_flags         = CRYPTO_ALG_OPTIONAL_KEY,      \
> +               .base.cra_blocksize     = BLAKE2B_BLOCK_SIZE,           \
> +               .base.cra_ctxsize       = sizeof(struct blake2b_tfm_ctx), \
> +               .base.cra_module        = THIS_MODULE,                  \
> +               .digestsize             = digest_size,                  \
> +               .setkey                 = crypto_blake2b_setkey,        \
> +               .init                   = crypto_blake2b_init,          \
> +               .update                 = crypto_blake2b_update_neon,   \
> +               .final                  = crypto_blake2b_final_neon,    \
> +               .descsize               = sizeof(struct blake2b_state), \
> +       }
> +
> +static struct shash_alg blake2b_neon_algs[] = {
> +       BLAKE2B_ALG("blake2b-160", "blake2b-160-neon", BLAKE2B_160_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-256", "blake2b-256-neon", BLAKE2B_256_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-384", "blake2b-384-neon", BLAKE2B_384_HASH_SIZE),
> +       BLAKE2B_ALG("blake2b-512", "blake2b-512-neon", BLAKE2B_512_HASH_SIZE),
> +};
> +
> +static int __init blake2b_neon_mod_init(void)
> +{
> +       if (!(elf_hwcap & HWCAP_NEON))
> +               return -ENODEV;
> +
> +       return crypto_register_shashes(blake2b_neon_algs,
> +                                      ARRAY_SIZE(blake2b_neon_algs));
> +}
> +
> +static void __exit blake2b_neon_mod_exit(void)
> +{
> +       return crypto_unregister_shashes(blake2b_neon_algs,
> +                                        ARRAY_SIZE(blake2b_neon_algs));
> +}
> +
> +module_init(blake2b_neon_mod_init);
> +module_exit(blake2b_neon_mod_exit);
> +
> +MODULE_DESCRIPTION("BLAKE2b digest algorithm, NEON accelerated");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Eric Biggers <ebiggers@google.com>");
> +MODULE_ALIAS_CRYPTO("blake2b-160");
> +MODULE_ALIAS_CRYPTO("blake2b-160-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-256");
> +MODULE_ALIAS_CRYPTO("blake2b-256-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-384");
> +MODULE_ALIAS_CRYPTO("blake2b-384-neon");
> +MODULE_ALIAS_CRYPTO("blake2b-512");
> +MODULE_ALIAS_CRYPTO("blake2b-512-neon");
> --
> 2.29.2
>
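
The rotation tables at the top of blake2b-neon-core.S work because
rotating a 64-bit lane by a multiple of 8 bits is a pure byte
permutation, which vtbl.8 performs in a single instruction per
d-register.  A host-side C model of the trick (a sketch for
illustration only, assuming a little-endian host):

        #include <stdint.h>
        #include <string.h>

        /* dst byte i = src byte tbl[i], i.e. vtbl.8 semantics. */
        static uint64_t ror64_by_table(uint64_t x, const uint8_t tbl[8])
        {
                uint8_t src[8], dst[8];
                int i;

                memcpy(src, &x, 8);     /* little-endian byte order */
                for (i = 0; i < 8; i++)
                        dst[i] = src[tbl[i]];
                memcpy(&x, dst, 8);
                return x;
        }

        /* The tables from the assembly: ror64(x, 8*k) moves source byte
         * (i + k) % 8 into destination byte i. */
        static const uint8_t ror24_table[8] = { 3, 4, 5, 6, 7, 0, 1, 2 };
        static const uint8_t ror16_table[8] = { 2, 3, 4, 5, 6, 7, 0, 1 };

        /* So ror64_by_table(x, ror24_table) == (x >> 24) | (x << 40). */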
Herbert Xu Jan. 2, 2021, 10:09 p.m. UTC | #4
On Wed, Dec 23, 2020 at 12:09:49AM -0800, Eric Biggers wrote:
> This patchset adds 32-bit ARM assembly language implementations of
> BLAKE2b and BLAKE2s.
> [...]

All applied.  Thanks.
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt