Message ID | 20201215234708.105527-1-ebiggers@kernel.org |
---|---|
Headers | show |
Series | crypto: add NEON-optimized BLAKE2b | expand |
On Tue, Dec 15, 2020 at 03:47:03PM -0800, Eric Biggers wrote: > This patchset adds a NEON implementation of BLAKE2b for 32-bit ARM. > Patches 1-4 prepare for it by making some updates to the generic > implementation, while patch 5 adds the actual NEON implementation. > > On Cortex-A7 (which these days is the most common ARM processor that > doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as > SHA-256, and slightly faster than SHA-1. It is also almost three times > as fast as the generic implementation of BLAKE2b: > > Algorithm Cycles per byte (on 4096-byte messages) > =================== ======================================= > blake2b-256-neon 14.1 > sha1-neon 16.4 > sha1-asm 20.8 > blake2s-256-generic 26.1 > sha256-neon 28.9 > sha256-asm 32.1 > blake2b-256-generic 39.9 > > This implementation isn't directly based on any other implementation, > but it borrows some ideas from previous NEON code I've written as well > as from chacha-neon-core.S. At least on Cortex-A7, it is faster than > the other NEON implementations of BLAKE2b I'm aware of (the > implementation in the BLAKE2 official repository using intrinsics, and > Andrew Moon's implementation which can be found in SUPERCOP). > > NEON-optimized BLAKE2b is useful because there is interest in using > BLAKE2b-256 for dm-verity on low-end Android devices (specifically, > devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On > these devices, the performance cost of upgrading to SHA-256 may be > unacceptable, whereas BLAKE2b-256 would actually improve performance. > > Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which > is intended for 32-bit platforms), on 32-bit ARM processors with NEON, > BLAKE2b is actually faster than BLAKE2s. This is because NEON supports > 64-bit operations, and because BLAKE2s's block size is too small for > NEON to be helpful for it. The best I've been able to do with BLAKE2s > on Cortex-A7 is 19.0 cpb with an optimized scalar implementation. By the way, if people are interested in having my ARM scalar implementation of BLAKE2s in the kernel too, I can send a patchset for that too. It just ended up being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case mentioned above. If it were to be added as "blake2s-256-arm", we'd have: Algorithm Cycles per byte (on 4096-byte messages) =================== ======================================= blake2b-256-neon 14.1 sha1-neon 16.4 blake2s-256-arm 19.0 sha1-asm 20.8 blake2s-256-generic 26.1 sha256-neon 28.9 sha256-asm 32.1 blake2b-256-generic 39.9
Hi Eric, On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote: > By the way, if people are interested in having my ARM scalar implementation of > BLAKE2s in the kernel too, I can send a patchset for that too. It just ended up > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case > mentioned above. If it were to be added as "blake2s-256-arm", we'd have: I'd certainly be interested in this. Any rough idea how it performs for pretty small messages compared to the generic implementation? 100-140 byte ranges? Is the speedup about the same as for longer messages because this doesn't parallelize across multiple blocks? Jason
On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote: > Hi Eric, > > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote: > > By the way, if people are interested in having my ARM scalar implementation of > > BLAKE2s in the kernel too, I can send a patchset for that too. It just ended up > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case > > mentioned above. If it were to be added as "blake2s-256-arm", we'd have: > > I'd certainly be interested in this. Any rough idea how it performs > for pretty small messages compared to the generic implementation? > 100-140 byte ranges? Is the speedup about the same as for longer > messages because this doesn't parallelize across multiple blocks? > It does one block at a time, and there isn't much overhead, so yes the speedup on short messages should be about the same as on long messages. I did a couple quick userspace benchmarks and got (still on Cortex-A7): 100-byte messages: BLAKE2s ARM: 28.9 cpb BLAKE2s generic: 42.4 cpb 140-byte messages: BLAKE2s ARM: 29.5 cpb BLAKE2s generic: 44.0 cpb The results in the kernel may differ a bit, but probably not by much. - Eric
On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers <ebiggers@kernel.org> wrote: > > On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote: > > Hi Eric, > > > > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers <ebiggers@kernel.org> wrote: > > > By the way, if people are interested in having my ARM scalar implementation of > > > BLAKE2s in the kernel too, I can send a patchset for that too. It just ended up > > > being slower than BLAKE2b and SHA-1, so it wasn't as good for the use case > > > mentioned above. If it were to be added as "blake2s-256-arm", we'd have: > > > > I'd certainly be interested in this. Any rough idea how it performs > > for pretty small messages compared to the generic implementation? > > 100-140 byte ranges? Is the speedup about the same as for longer > > messages because this doesn't parallelize across multiple blocks? > > > > It does one block at a time, and there isn't much overhead, so yes the speedup > on short messages should be about the same as on long messages. > > I did a couple quick userspace benchmarks and got (still on Cortex-A7): > > 100-byte messages: > BLAKE2s ARM: 28.9 cpb > BLAKE2s generic: 42.4 cpb > > 140-byte messages: > BLAKE2s ARM: 29.5 cpb > BLAKE2s generic: 44.0 cpb > > The results in the kernel may differ a bit, but probably not by much. That's certainly a nice improvement though, and I'd very much welcome the faster implementation. Jason
On Tue, Dec 15, 2020 at 03:47:04PM -0800, Eric Biggers wrote: > From: Eric Biggers <ebiggers@google.com> > > Rename some BLAKE2b-related constants to be consistent with the names > used in the BLAKE2s implementation (see include/crypto/blake2s.h): > > BLAKE2B_*_DIGEST_SIZE => BLAKE2B_*_HASH_SIZE > BLAKE2B_BLOCKBYTES => BLAKE2B_BLOCK_SIZE > BLAKE2B_KEYBYTES => BLAKE2B_KEY_SIZE > > Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com>
On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote: > From: Eric Biggers <ebiggers@google.com> > > In preparation for adding architecture-specific implementations of > BLAKE2b, create a header <crypto/blake2b.h> that contains common > constants, structs, and helper functions for BLAKE2b. > > Furthermore, export the BLAKE2b generic setkey(), init(), update(), and > final() functions, and add functions __crypto_blake2b_update() and > __crypto_blake2b_final() which take a pointer to a > blake2b_compress_blocks_t function. > > This way, optimized implementations of BLAKE2b only have to provide an > implementation of blake2b_compress_blocks_t. (This is modeled on how > the nhpoly1305 implementations work. Also, the prototype of > blake2b_compress_blocks_t is meant to be similar to that of > blake2s_compress_arch().) > > Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com>
On Tue, Dec 15, 2020 at 03:47:07PM -0800, Eric Biggers wrote: > From: Eric Biggers <ebiggers@google.com> > > The file comment for blake2b_generic.c makes it sound like it's the > reference implementation of BLAKE2b with only minor changes. But it's > actually been changed a lot. Update the comment to make this clearer. After your changes I agree it appropriate to reword it. > Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: David Sterba <dsterba@suse.com>
On Thu, Dec 17, 2020 at 06:15:40PM +0100, David Sterba wrote: > On Tue, Dec 15, 2020 at 03:47:06PM -0800, Eric Biggers wrote: > > From: Eric Biggers <ebiggers@google.com> > > > > In preparation for adding architecture-specific implementations of > > BLAKE2b, create a header <crypto/blake2b.h> that contains common > > constants, structs, and helper functions for BLAKE2b. > > > > Furthermore, export the BLAKE2b generic setkey(), init(), update(), and > > final() functions, and add functions __crypto_blake2b_update() and > > __crypto_blake2b_final() which take a pointer to a > > blake2b_compress_blocks_t function. > > > > This way, optimized implementations of BLAKE2b only have to provide an > > implementation of blake2b_compress_blocks_t. (This is modeled on how > > the nhpoly1305 implementations work. Also, the prototype of > > blake2b_compress_blocks_t is meant to be similar to that of > > blake2s_compress_arch().) > > > > Signed-off-by: Eric Biggers <ebiggers@google.com> > > Reviewed-by: David Sterba <dsterba@suse.com> I ended up making some changes to this patch in v2, to keep things consistent with what I decided to do for BLAKE2s, so I didn't add your Reviewed-by to this patch. If you could review the new version too that would be great -- thanks! - Eric