Message ID: 20191002141713.31189-6-ard.biesheuvel@linaro.org
State: New
Series: crypto: crypto API library interfaces for WireGuard
On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> This integrates the accelerated MIPS 32r2 implementation of ChaCha
> into both the API and library interfaces of the kernel crypto stack.
>
> The significance of this is that, in addition to becoming available
> as an accelerated library implementation, it can also be used by
> existing crypto API code such as Adiantum (for block encryption on
> ultra low performance cores) or IPsec using chacha20poly1305. These
> are use cases that have already opted into using the abstract crypto
> API. In order to support Adiantum, the core assembler routine has
> been adapted to take the round count as a function argument rather
> than hardcoding it to 20.

Could you resubmit this with my original commit first and your changes
on top? I'd like to see and be able to review exactly what's changed.
If I recall correctly, René and I were really starved for registers and
tried pretty hard to avoid spilling to the stack, so I'm interested to
learn how you crammed a bit more sauce in there.

I also wonder if maybe it'd be better to just leave this as is with 20
rounds, which it was previously optimized for, and just not do
accelerated Adiantum for MIPS. Android has long since given up on the
ISA entirely.
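(For orientation: the change under discussion replaces a hardcoded round
count of 20 with a function argument. In C terms, the shape of the change
mirrors the kernel's generic ChaCha permutation, roughly as in the sketch
below; this is illustrative, not code from the patch itself.)

#include <linux/bitops.h>	/* rol32() */
#include <linux/types.h>

/* One ChaCha quarter-round on state words a, b, c, d. */
#define QR(a, b, c, d) (			\
	a += b, d = rol32(d ^ a, 16),		\
	c += d, b = rol32(b ^ c, 12),		\
	a += b, d = rol32(d ^ a, 8),		\
	c += d, b = rol32(b ^ c, 7))

/* nrounds is 20 for ChaCha20, or 12 for the ChaCha12 used by Adiantum. */
static void chacha_permute(u32 x[16], int nrounds)
{
	int i;

	for (i = 0; i < nrounds; i += 2) {
		/* column rounds */
		QR(x[0], x[4], x[8],  x[12]);
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		/* diagonal rounds */
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}
}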
On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> > This integrates the accelerated MIPS 32r2 implementation of ChaCha
> > into both the API and library interfaces of the kernel crypto stack.
<snip>
>
> Could you resubmit this with my original commit first and your changes
> on top? I'd like to see and be able to review exactly what's changed.
> If I recall correctly, René and I were really starved for registers and
> tried pretty hard to avoid spilling to the stack, so I'm interested to
> learn how you crammed a bit more sauce in there.
>

The round count is passed via the fifth function parameter, so it is
already on the stack. Reloading it for every block doesn't sound like
a huge deal to me.

> I also wonder if maybe it'd be better to just leave this as is with 20
> rounds, which it was previously optimized for, and just not do
> accelerated Adiantum for MIPS. Android has long since given up on the
> ISA entirely.

Adiantum does not depend on Android - anyone running Linux on their
MIPS router can use it if they want encrypted storage.
On Fri, 4 Oct 2019 at 16:38, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Could you resubmit this with my original commit first and your changes
> > on top? I'd like to see and be able to review exactly what's changed.
<snip>
>
> Adiantum does not depend on Android - anyone running Linux on their
> MIPS router can use it if they want encrypted storage.

But to answer your first question: sure, I will split off the changes.
On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> The round count is passed via the fifth function parameter, so it is
> already on the stack. Reloading it for every block doesn't sound like
> a huge deal to me.

Please benchmark it to show that it really isn't a big deal. I recall
finding that memory accesses on common mips32r2 commodity router
hardware were extremely inefficient. The whole thing is designed to
minimize memory accesses, which are the primary bottleneck on that
platform.

Seems like this thing might be best deferred until after this all
lands. IOW, let's get this in with the 20-round original now, and later
you can submit a change for the 12-round variant, and René and I can
spend time dusting off our test rigs and seeing which strategy works
best. I very nearly tossed out a bunch of old router hardware last
night when cleaning up. Glad I saved it!
On Fri, 4 Oct 2019 at 16:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > The round count is passed via the fifth function parameter, so it is
> > already on the stack. Reloading it for every block doesn't sound like
> > a huge deal to me.
>
> Please benchmark it to show that it really isn't a big deal. I recall
> finding that memory accesses on common mips32r2 commodity router
> hardware were extremely inefficient. The whole thing is designed to
> minimize memory accesses, which are the primary bottleneck on that
> platform.
>

Reloading a single word from the stack each time we load, xor and store
64 bytes of data from/to memory is highly unlikely to be noticeable.

> Seems like this thing might be best deferred until after this all
> lands. IOW, let's get this in with the 20-round original now, and later
> you can submit a change for the 12-round variant, and René and I can
> spend time dusting off our test rigs and seeing which strategy works
> best. I very nearly tossed out a bunch of old router hardware last
> night when cleaning up. Glad I saved it!

I don't agree, but I don't care deeply enough to argue about it :-)
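(Background on the "already on the stack" point: under the MIPS o32
calling convention, the first four arguments travel in $a0..$a3 and the
fifth lands in the caller-allocated argument area at 16($sp) on entry.
The prototype below is taken from the patch at the end of this thread;
the register/offset annotations are per the o32 ABI.)

#include <linux/linkage.h>
#include <linux/types.h>

asmlinkage void chacha_mips(const u32 *state,	/* $a0 */
			    u8 *dst,		/* $a1 */
			    const u8 *src,	/* $a2 */
			    unsigned int bytes,	/* $a3 */
			    int nrounds);	/* 16($sp) on entry */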
Hi Jason,

Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:

> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> The round count is passed via the fifth function parameter, so it is
>> already on the stack. Reloading it for every block doesn't sound like
>> a huge deal to me.
>
> Please benchmark it to show that it really isn't a big deal. I recall
> finding that memory accesses on common mips32r2 commodity router
> hardware were extremely inefficient. The whole thing is designed to
> minimize memory accesses, which are the primary bottleneck on that
> platform.

I also think it isn't a big deal, but I shall benchmark it this
weekend. If I am correct, a memory write will first be put in the
cache, so if you read it again while it is still in the cache, the read
is very fast: 1 or 2 clock cycles. Also, the value isn't used directly
after it is read, so the CPU doesn't have to stall on this read.

Greets,

René

> Seems like this thing might be best deferred until after this all
> lands. IOW, let's get this in with the 20-round original now, and later
> you can submit a change for the 12-round variant, and René and I can
> spend time dusting off our test rigs and seeing which strategy works
> best. I very nearly tossed out a bunch of old router hardware last
> night when cleaning up. Glad I saved it!
On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
>
> Hi Jason,
>
> Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:
<snip>
>
> I also think it isn't a big deal, but I shall benchmark it this
> weekend. If I am correct, a memory write will first be put in the
> cache, so if you read it again while it is still in the cache, the read
> is very fast: 1 or 2 clock cycles. Also, the value isn't used directly
> after it is read, so the CPU doesn't have to stall on this read.
>

Thanks René.

Note that the round count is not being spilled. I [re]load it from the
stack as a function parameter.

So instead of

	li	$at, 20

I do

	lw	$at, 16($sp)

Thanks a lot for taking the time to double check this. I think it
would be nice to be able to expose xchacha12 like we do on other
architectures.

Note that for xchacha, I also added a hchacha_block() routine based on
your code (with the round count as the third argument) [0]. Please let
me know if you see anything wrong with that.

+.globl hchacha_block
+.ent hchacha_block
+hchacha_block:
+	.frame	$sp, STACK_SIZE, $ra
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	lw	X0,   0(STATE)
+	lw	X1,   4(STATE)
+	lw	X2,   8(STATE)
+	lw	X3,  12(STATE)
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+	lw	X12, 48(STATE)
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_hchacha_xor_rounds:
+	addiu	$a2, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$a2, .Loop_hchacha_xor_rounds
+
+	sw	X0,   0(OUT)
+	sw	X1,   4(OUT)
+	sw	X2,   8(OUT)
+	sw	X3,  12(OUT)
+	sw	X12, 16(OUT)
+	sw	X13, 20(OUT)
+	sw	X14, 24(OUT)
+	sw	X15, 28(OUT)
+
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+.end hchacha_block
+.set at

[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f
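(For orientation: the asm above mirrors the kernel's generic HChaCha
helper, which runs the permutation over a copy of the state and emits
only words 0..3 and 12..15, hence the eight sw instructions to OUT. A C
sketch, assuming the chacha_permute() shape shown earlier in this
thread; the real generic version lives in the kernel's ChaCha library
code.)

#include <linux/string.h>	/* memcpy() */
#include <linux/types.h>

void hchacha_block_generic(const u32 *state, u32 *stream, int nrounds)
{
	u32 x[16];

	memcpy(x, state, 64);
	chacha_permute(x, nrounds);

	/* Output is words 0..3 and 12..15 of the permuted state. */
	memcpy(&stream[0], &x[0], 16);
	memcpy(&stream[4], &x[12], 16);
}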
Hi Ard and Jason,

Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

> On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
<snip>
>
> Thanks René.
>
> Note that the round count is not being spilled. I [re]load it from the
> stack as a function parameter.
>
> So instead of
>
>	li	$at, 20
>
> I do
>
>	lw	$at, 16($sp)
>
> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.

I dusted off my old benchmark code and put it on top of the latest
WireGuard source [0]. It benchmarks the chacha20poly1305_{de,en}crypt
functions with different data block sizes (x bytes). It runs two tests:
the first measures how many runs we get in 1 second (reported in
MB/sec), the other measures the CPU cycles used per call. The test is
performed on a Mediatek MT7621A SoC running at 880MHz.

Baseline [1]:

root@OpenWrt:~# insmod wg-speed-baseline.ko
[ 2029.866393] wireguard: chacha20 self-tests: pass
[ 2029.894301] wireguard: poly1305 self-tests: pass
[ 2029.906428] wireguard: chacha20poly1305 self-tests: pass
[ 2030.121001] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.253 MB/sec, 1598 cycles
[ 2030.340786] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.178 MB/sec, 1554 cycles
[ 2030.561434] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1692 cycles
[ 2030.784635] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.106 MB/sec, 2381 cycles
[ 2031.081534] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.480 MB/sec, 16751 cycles
[ 2031.371369] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16712 cycles
[ 2031.589621] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1648 cycles
[ 2031.809392] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.064 MB/sec, 1598 cycles
[ 2032.030034] wireguard: chacha20poly1305_decrypt: 64 bytes, 14.990 MB/sec, 1738 cycles
[ 2032.253245] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.679 MB/sec, 2428 cycles
[ 2032.540150] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16793 cycles
[ 2032.829954] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16756 cycles
[ 2032.850563] wireguard: blake2s self-tests: pass
[ 2033.073767] wireguard: curve25519 self-tests: pass
[ 2033.083600] wireguard: allowedips self-tests: pass
[ 2033.097982] wireguard: nonce counter self-tests: pass
[ 2033.535726] wireguard: ratelimiter self-tests: pass
[ 2033.545615] wireguard: WireGuard 0.0.20190913-4-g5cca99692496 loaded. See www.wireguard.com for information.
[ 2033.565197] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

Modified chacha20-mips.S [2]:

root@OpenWrt:~# rmmod wireguard.ko
root@OpenWrt:~# insmod wg-speed-nround-stack.ko
[ 2045.129910] wireguard: chacha20 self-tests: pass
[ 2045.157824] wireguard: poly1305 self-tests: pass
[ 2045.169962] wireguard: chacha20poly1305 self-tests: pass
[ 2045.381034] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.251 MB/sec, 1607 cycles
[ 2045.600801] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.174 MB/sec, 1555 cycles
[ 2045.821437] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1691 cycles
[ 2046.044650] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.082 MB/sec, 2379 cycles
[ 2046.341509] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.615 MB/sec, 16739 cycles
[ 2046.631333] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16705 cycles
[ 2046.849614] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1647 cycles
[ 2047.069403] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.056 MB/sec, 1600 cycles
[ 2047.290036] wireguard: chacha20poly1305_decrypt: 64 bytes, 15.001 MB/sec, 1736 cycles
[ 2047.513253] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.666 MB/sec, 2429 cycles
[ 2047.800102] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16785 cycles
[ 2048.089967] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16759 cycles
[ 2048.110580] wireguard: blake2s self-tests: pass
[ 2048.333719] wireguard: curve25519 self-tests: pass
[ 2048.343547] wireguard: allowedips self-tests: pass
[ 2048.357926] wireguard: nonce counter self-tests: pass
[ 2048.785837] wireguard: ratelimiter self-tests: pass
[ 2048.795781] wireguard: WireGuard 0.0.20190913-5-gee7c7eec8deb loaded. See www.wireguard.com for information.
[ 2048.815389] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

I don't see the extra store/load on the stack reflected in the results,
so I think this test proves well enough that taking the round count
from the stack is not a problem.

Ard, I shall take a look at your hchacha code later this weekend.

Greets,

René

[0]: https://github.com/vDorst/wireguard/commits/mips-bench
[1]: https://github.com/vDorst/wireguard/commit/5cca9969249632820cb96548813a65d1f297aa8c
[2]: https://github.com/vDorst/wireguard/commit/ee7c7eec8deb3d5d5dae2eec0be0aafca3fddbc2

> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
<snip>
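(René's harness at [0] above is the authoritative source for these
numbers. For orientation, a minimal sketch of this kind of in-kernel
cycle measurement; the bench_one() name and trial count are made up for
this sketch, and the chacha20poly1305_encrypt() signature assumed here
is the WireGuard library one. The real module also derives MB/sec from
a one-second run count.)

#include <linux/timex.h>	/* get_cycles(), cycles_t */
#include <linux/types.h>
#include <zinc/chacha20poly1305.h>

/* Average cycles per chacha20poly1305_encrypt() call for one buffer
 * size; dst must have room for len plus the 16-byte Poly1305 tag.
 */
static unsigned long bench_one(u8 *dst, const u8 *src, size_t len,
			       const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	enum { TRIALS = 256 };
	cycles_t start, end;
	int i;

	start = get_cycles();
	for (i = 0; i < TRIALS; ++i)
		chacha20poly1305_encrypt(dst, src, len, NULL, 0, i, key);
	end = get_cycles();

	return (unsigned long)(end - start) / TRIALS;
}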
Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

<snip>

Hi Ard,

> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.
>
> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
>
> +.globl hchacha_block
> +.ent hchacha_block
> +hchacha_block:
> +	.frame	$sp, STACK_SIZE, $ra
> +
> +	addiu	$sp, -STACK_SIZE
> +
> +	/* Save s0-s7 */
> +	sw	$s0,  0($sp)
> +	sw	$s1,  4($sp)
> +	sw	$s2,  8($sp)
> +	sw	$s3, 12($sp)
> +	sw	$s4, 16($sp)
> +	sw	$s5, 20($sp)
> +	sw	$s6, 24($sp)
> +	sw	$s7, 28($sp)

We only have to preserve the s registers that are actually used.
Currently X11 to X15 are using registers s6 down to s2. But by
shuffling/redefining the needed registers so that we use all the
non-preserved registers first, I can reduce the number of used s
registers to one. Registers we don't use and don't have to preserve are
a3, at and v0. Also STATE (a0) can be reused, because we only need that
pointer while loading the values from memory.

So:

#undef X12
#undef X13
#undef X14
#undef X15
#define X12	$a3
#define X13	$at
#define X14	$v0
#define X15	STATE

And save X11 (s6) on the stack. See the full code here [0].

For the rest, the code looks good!

Greets,

René

[0]: https://github.com/vDorst/wireguard/commit/562a516ae3b282b32f57d3239369360bc926df60

<snip>
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index cdc09b71febe..8584c047ea59 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -323,7 +323,7 @@ libs-$(CONFIG_MIPS_FP_SUPPORT) += arch/mips/math-emu/
 # See arch/mips/Kbuild for content of core part of the kernel
 core-y += arch/mips/
 
-drivers-$(CONFIG_MIPS_CRC_SUPPORT) += arch/mips/crypto/
+drivers-y += arch/mips/crypto/
 drivers-$(CONFIG_OPROFILE) += arch/mips/oprofile/
 
 # suspend and hibernation support
diff --git a/arch/mips/crypto/Makefile b/arch/mips/crypto/Makefile
index e07aca572c2e..7f7ea0020cc2 100644
--- a/arch/mips/crypto/Makefile
+++ b/arch/mips/crypto/Makefile
@@ -4,3 +4,6 @@
 #
 
 obj-$(CONFIG_CRYPTO_CRC32_MIPS) += crc32-mips.o
+
+obj-$(CONFIG_CRYPTO_CHACHA_MIPS) += chacha-mips.o
+chacha-mips-y := chacha-core.o chacha-glue.o
diff --git a/arch/mips/crypto/chacha-core.S b/arch/mips/crypto/chacha-core.S
new file mode 100644
index 000000000000..42150d15fc88
--- /dev/null
+++ b/arch/mips/crypto/chacha-core.S
@@ -0,0 +1,424 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2016-2018 René van Dorst <opensource@vdorst.com>. All Rights Reserved.
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#define MASK_U32		0x3c
+#define CHACHA20_BLOCK_SIZE	64
+#define STACK_SIZE		32
+
+#define X0	$t0
+#define X1	$t1
+#define X2	$t2
+#define X3	$t3
+#define X4	$t4
+#define X5	$t5
+#define X6	$t6
+#define X7	$t7
+#define X8	$t8
+#define X9	$t9
+#define X10	$v1
+#define X11	$s6
+#define X12	$s5
+#define X13	$s4
+#define X14	$s3
+#define X15	$s2
+/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */
+#define T0	$s1
+#define T1	$s0
+#define T(n)	T ## n
+#define X(n)	X ## n
+
+/* Input arguments */
+#define STATE		$a0
+#define OUT		$a1
+#define IN		$a2
+#define BYTES		$a3
+
+/* Output argument */
+/* NONCE[0] is kept in a register and not in memory.
+ * We don't want to touch original value in memory.
+ * Must be incremented every loop iteration.
+ */
+#define NONCE_0		$v0
+
+/* SAVED_X and SAVED_CA are set in the jump table.
+ * Use regs which are overwritten on exit else we don't leak clear data.
+ * They are used to handle the last bytes which are not a multiple of 4.
+ */
+#define SAVED_X		X15
+#define SAVED_CA	$s7
+
+#define IS_UNALIGNED	$s7
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#define ROTx rotl
+#define ROTR(n) rotr n, 24
+#define CPU_TO_LE32(n) \
+	wsbh	n; \
+	rotr	n, 16;
+#else
+#define MSB 3
+#define LSB 0
+#define ROTx rotr
+#define CPU_TO_LE32(n)
+#define ROTR(n)
+#endif
+
+#define FOR_EACH_WORD(x) \
+	x( 0); \
+	x( 1); \
+	x( 2); \
+	x( 3); \
+	x( 4); \
+	x( 5); \
+	x( 6); \
+	x( 7); \
+	x( 8); \
+	x( 9); \
+	x(10); \
+	x(11); \
+	x(12); \
+	x(13); \
+	x(14); \
+	x(15);
+
+#define FOR_EACH_WORD_REV(x) \
+	x(15); \
+	x(14); \
+	x(13); \
+	x(12); \
+	x(11); \
+	x(10); \
+	x( 9); \
+	x( 8); \
+	x( 7); \
+	x( 6); \
+	x( 5); \
+	x( 4); \
+	x( 3); \
+	x( 2); \
+	x( 1); \
+	x( 0);
+
+#define PLUS_ONE_0	1
+#define PLUS_ONE_1	2
+#define PLUS_ONE_2	3
+#define PLUS_ONE_3	4
+#define PLUS_ONE_4	5
+#define PLUS_ONE_5	6
+#define PLUS_ONE_6	7
+#define PLUS_ONE_7	8
+#define PLUS_ONE_8	9
+#define PLUS_ONE_9	10
+#define PLUS_ONE_10	11
+#define PLUS_ONE_11	12
+#define PLUS_ONE_12	13
+#define PLUS_ONE_13	14
+#define PLUS_ONE_14	15
+#define PLUS_ONE_15	16
+#define PLUS_ONE(x)	PLUS_ONE_ ## x
+#define _CONCAT3(a,b,c)	a ## b ## c
+#define CONCAT3(a,b,c)	_CONCAT3(a,b,c)
+
+#define STORE_UNALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lwl	T1, (x*4)+MSB ## (IN); \
+	lwr	T1, (x*4)+LSB ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	swl	X ## x, (x*4)+MSB ## (OUT); \
+	swr	X ## x, (x*4)+LSB ## (OUT);
+
+#define STORE_ALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lw	T1, (x*4) ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	sw	X ## x, (x*4) ## (OUT);
+
+/* Jump table macro.
+ * Used for setup and handling the last bytes, which are not a multiple of 4.
+ * X15 is free to store Xn
+ * Every jumptable entry must be equal in size.
+ */
+#define JMPTBL_ALIGNED(x) \
+.Lchacha_mips_jmptbl_aligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_aligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define JMPTBL_UNALIGNED(x) \
+.Lchacha_mips_jmptbl_unaligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_unaligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define AXR(A, B, C, D,  K, L, M, N,  V, W, Y, Z,  S) \
+	addu	X(A), X(K); \
+	addu	X(B), X(L); \
+	addu	X(C), X(M); \
+	addu	X(D), X(N); \
+	xor	X(V), X(A); \
+	xor	X(W), X(B); \
+	xor	X(Y), X(C); \
+	xor	X(Z), X(D); \
+	rotl	X(V), S;    \
+	rotl	X(W), S;    \
+	rotl	X(Y), S;    \
+	rotl	X(Z), S;
+
+.text
+.set	reorder
+.set	noat
+.globl	chacha_mips
+.ent	chacha_mips
+chacha_mips:
+	.frame	$sp, STACK_SIZE, $ra
+
+	/* Load number of rounds */
+	lw	$at, 16($sp)
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Return bytes = 0. */
+	beqz	BYTES, .Lchacha_mips_end
+
+	lw	NONCE_0, 48(STATE)
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	/* Test if IN or OUT is unaligned.
+	 * IS_UNALIGNED = ( IN | OUT ) & 0x00000003
+	 */
+	or	IS_UNALIGNED, IN, OUT
+	andi	IS_UNALIGNED, 0x3
+
+	b	.Lchacha_rounds_start
+
+.align 4
+.Loop_chacha_rounds:
+	addiu	IN,  CHACHA20_BLOCK_SIZE
+	addiu	OUT, CHACHA20_BLOCK_SIZE
+	addiu	NONCE_0, 1
+
+.Lchacha_rounds_start:
+	lw	X0,   0(STATE)
+	lw	X1,   4(STATE)
+	lw	X2,   8(STATE)
+	lw	X3,  12(STATE)
+
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+
+	move	X12, NONCE_0
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_chacha_xor_rounds:
+	addiu	$at, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$at, .Loop_chacha_xor_rounds
+
+	addiu	BYTES, -(CHACHA20_BLOCK_SIZE)
+
+	/* Is data src/dst unaligned? Jump */
+	bnez	IS_UNALIGNED, .Loop_chacha_unaligned
+
+	/* Set number rounds here to fill delayslot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_aligned
+
+	FOR_EACH_WORD_REV(STORE_ALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+
+	/* BYTES < 0? Handle last bytes */
+	bltz	BYTES, .Lchacha_mips_xor_bytes
+
+.Lchacha_mips_xor_done:
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+.Lchacha_mips_end:
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+
+.Lchacha_mips_no_full_block_aligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_ALIGNED)
+
+
+.Loop_chacha_unaligned:
+	/* Set number rounds here to fill delayslot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_unaligned
+
+	FOR_EACH_WORD_REV(STORE_UNALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+	.set noreorder
+	/* Fall through to byte handling */
+	bgez	BYTES, .Lchacha_mips_xor_done
+.Lchacha_mips_xor_unaligned_0_b:
+.Lchacha_mips_xor_aligned_0_b:
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+	.set reorder
+
+.Lchacha_mips_xor_bytes:
+	addu	IN, $at
+	addu	OUT, $at
+	/* First byte */
+	lbu	T1, 0(IN)
+	addiu	$at, BYTES, 1
+	CPU_TO_LE32(SAVED_X)
+	ROTR(SAVED_X)
+	xor	T1, SAVED_X
+	sb	T1, 0(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Second byte */
+	lbu	T1, 1(IN)
+	addiu	$at, BYTES, 2
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 1(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Third byte */
+	lbu	T1, 2(IN)
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 2(OUT)
+	b	.Lchacha_mips_xor_done
+
+.Lchacha_mips_no_full_block_unaligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_UNALIGNED)
+.end chacha_mips
+.set at
diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
new file mode 100644
index 000000000000..de01dc57751e
--- /dev/null
+++ b/arch/mips/crypto/chacha-glue.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MIPS accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
+ *
+ * Copyright (C) 2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+asmlinkage void chacha_mips(const u32 *state, u8 *dst, const u8 *src,
+			    unsigned int bytes, int nrounds);
+
+void hchacha_block(const u32 *state, u32 *stream, int nrounds)
+{
+	hchacha_block_generic(state, stream, nrounds);
+}
+EXPORT_SYMBOL(hchacha_block);
+
+void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds)
+{
+	chacha_mips(state, dst, src, bytes, nrounds);
+}
+EXPORT_SYMBOL(chacha_crypt);
+
+static int chacha_mips_stream_xor(struct skcipher_request *req,
+				  const struct chacha_ctx *ctx, const u8 *iv)
+{
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	crypto_chacha_init(state, ctx, iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha_mips(state, walk.dst.virt.addr, walk.src.virt.addr,
+			    nbytes, ctx->nrounds);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	return err;
+}
+
+static int __chacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return chacha_mips_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	crypto_chacha_init(state, ctx, req->iv);
+
+	hchacha_block_generic(state, subctx.key, ctx->nrounds);
+	subctx.nrounds = ctx->nrounds;
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha_mips_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= __chacha_mips,
+		.decrypt		= __chacha_mips,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha12_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}
+};
+
+static int __init chacha_simd_mod_init(void)
+{
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+static void __exit chacha_simd_mod_fini(void)
+{
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (MIPS accelerated)");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-mips");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index f90b53a526ba..43e94ac5d117 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1441,6 +1441,12 @@ config CRYPTO_CHACHA20_X86_64
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
 
+config CRYPTO_CHACHA_MIPS
+	tristate "ChaCha stream cipher algorithms (MIPS 32r2 optimized)"
+	depends on CPU_MIPS32_R2
+	select CRYPTO_CHACHA20
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
+
 config CRYPTO_SEED
 	tristate "SEED cipher algorithm"
 	select CRYPTO_ALGAPI
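(Usage note: besides the skcipher registrations, the glue code above
exports chacha_init(), hchacha_block() and chacha_crypt() as a library
interface. A sketch of driving those exports to produce one
XChaCha12-encrypted block, mirroring what xchacha_mips() does on the
skcipher path; the xchacha12_crypt_one_block() helper itself is
hypothetical.)

#include <crypto/chacha.h>
#include <linux/string.h>

static void xchacha12_crypt_one_block(const u32 key[8], const u8 iv[32],
				      u8 *dst, const u8 *src)
{
	u32 state[16], subkey[8];
	u8 real_iv[16];

	/* Derive the subkey from the key and the first 16 IV bytes. */
	chacha_init(state, key, iv);
	hchacha_block(state, subkey, 12);

	/* Remaining IV bytes form the IV for the inner ChaCha,
	 * laid out exactly as in xchacha_mips() above.
	 */
	memcpy(&real_iv[0], iv + 24, 8);
	memcpy(&real_iv[8], iv + 16, 8);

	chacha_init(state, subkey, real_iv);
	chacha_crypt(state, dst, src, CHACHA_BLOCK_SIZE, 12);
}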