Message ID: 20191002141713.31189-6-ard.biesheuvel@linaro.org
State: New
Series: crypto: crypto API library interfaces for WireGuard
On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> This integrates the accelerated MIPS 32r2 implementation of ChaCha
> into both the API and library interfaces of the kernel crypto stack.
>
> The significance of this is that, in addition to becoming available
> as an accelerated library implementation, it can also be used by
> existing crypto API code such as Adiantum (for block encryption on
> ultra low performance cores) or IPsec using chacha20poly1305. These
> are use cases that have already opted into using the abstract crypto
> API. In order to support Adiantum, the core assembler routine has
> been adapted to take the round count as a function argument rather
> than hardcoding it to 20.

Could you resubmit this with my original commit first and your changes
on top? I'd like to see and be able to review exactly what's changed.
If I recall correctly, René and I were really starved for registers and
tried pretty hard to avoid spilling to the stack, so I'm interested to
learn how you crammed a bit more sauce in there.

I also wonder if maybe it'd be better to just leave this as is with 20
rounds, which it was previously optimized for, and just not do
accelerated Adiantum for MIPS. Android has long since given up on the
ISA entirely.
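(For orientation: the change under discussion replaces a hardcoded round
count of 20 with a function argument. In C terms, the shape of the change
mirrors the kernel's generic ChaCha permutation, roughly as in the sketch
below; this is illustrative, not code from the patch itself.)

#include <linux/bitops.h>	/* rol32() */
#include <linux/types.h>

/* One ChaCha quarter-round on state words a, b, c, d. */
#define QR(a, b, c, d) (			\
	a += b, d = rol32(d ^ a, 16),		\
	c += d, b = rol32(b ^ c, 12),		\
	a += b, d = rol32(d ^ a, 8),		\
	c += d, b = rol32(b ^ c, 7))

/* nrounds is 20 for ChaCha20, or 12 for the ChaCha12 used by Adiantum. */
static void chacha_permute(u32 x[16], int nrounds)
{
	int i;

	for (i = 0; i < nrounds; i += 2) {
		/* column rounds */
		QR(x[0], x[4], x[8],  x[12]);
		QR(x[1], x[5], x[9],  x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		/* diagonal rounds */
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8],  x[13]);
		QR(x[3], x[4], x[9],  x[14]);
	}
}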
On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> > This integrates the accelerated MIPS 32r2 implementation of ChaCha
> > into both the API and library interfaces of the kernel crypto stack.
<snip>
>
> Could you resubmit this with my original commit first and your changes
> on top? I'd like to see and be able to review exactly what's changed.
> If I recall correctly, René and I were really starved for registers and
> tried pretty hard to avoid spilling to the stack, so I'm interested to
> learn how you crammed a bit more sauce in there.
>

The round count is passed via the fifth function parameter, so it is
already on the stack. Reloading it for every block doesn't sound like
a huge deal to me.

> I also wonder if maybe it'd be better to just leave this as is with 20
> rounds, which it was previously optimized for, and just not do
> accelerated Adiantum for MIPS. Android has long since given up on the
> ISA entirely.

Adiantum does not depend on Android - anyone running Linux on their
MIPS router can use it if they want encrypted storage.
On Fri, 4 Oct 2019 at 16:38, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> On Fri, 4 Oct 2019 at 15:46, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Could you resubmit this with my original commit first and your changes
> > on top? I'd like to see and be able to review exactly what's changed.
<snip>
>
> Adiantum does not depend on Android - anyone running Linux on their
> MIPS router can use it if they want encrypted storage.

But to answer your first question: sure, I will split off the changes.
On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> The round count is passed via the fifth function parameter, so it is
> already on the stack. Reloading it for every block doesn't sound like
> a huge deal to me.

Please benchmark it to show that it really isn't a big deal. I recall
finding that memory accesses on common mips32r2 commodity router
hardware were extremely inefficient. The whole thing is designed to
minimize memory accesses, which are the primary bottleneck on that
platform.

Seems like this thing might be best deferred until after this all
lands. IOW, let's get this in with the 20-round original now, and later
you can submit a change for the 12-round variant, and René and I can
spend time dusting off our test rigs and seeing which strategy works
best. I very nearly tossed out a bunch of old router hardware last
night when cleaning up. Glad I saved it!
On Fri, 4 Oct 2019 at 16:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > The round count is passed via the fifth function parameter, so it is
> > already on the stack. Reloading it for every block doesn't sound like
> > a huge deal to me.
>
> Please benchmark it to show that it really isn't a big deal. I recall
> finding that memory accesses on common mips32r2 commodity router
> hardware were extremely inefficient. The whole thing is designed to
> minimize memory accesses, which are the primary bottleneck on that
> platform.
>

Reloading a single word from the stack each time we load, xor and store
64 bytes of data from/to memory is highly unlikely to be noticeable.

> Seems like this thing might be best deferred until after this all
> lands. IOW, let's get this in with the 20-round original now, and later
> you can submit a change for the 12-round variant, and René and I can
> spend time dusting off our test rigs and seeing which strategy works
> best. I very nearly tossed out a bunch of old router hardware last
> night when cleaning up. Glad I saved it!

I don't agree, but I don't care deeply enough to argue about it :-)
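(Background on the "already on the stack" point: under the MIPS o32
calling convention, the first four arguments travel in $a0..$a3 and the
fifth lands in the caller-allocated argument area at 16($sp) on entry.
The prototype below is taken from the patch at the end of this thread;
the register/offset annotations are per the o32 ABI.)

#include <linux/linkage.h>
#include <linux/types.h>

asmlinkage void chacha_mips(const u32 *state,	/* $a0 */
			    u8 *dst,		/* $a1 */
			    const u8 *src,	/* $a2 */
			    unsigned int bytes,	/* $a3 */
			    int nrounds);	/* 16($sp) on entry */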
Hi Jason,

Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:

> On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> The round count is passed via the fifth function parameter, so it is
>> already on the stack. Reloading it for every block doesn't sound like
>> a huge deal to me.
>
> Please benchmark it to show that it really isn't a big deal. I recall
> finding that memory accesses on common mips32r2 commodity router
> hardware were extremely inefficient. The whole thing is designed to
> minimize memory accesses, which are the primary bottleneck on that
> platform.

I also think it isn't a big deal, but I shall benchmark it this
weekend. If I am correct, a memory write will first be put in the
cache, so if you read it again while it is still in the cache, the read
is very fast: 1 or 2 clock cycles. Also, the value isn't used directly
after it is read, so the CPU doesn't have to stall on this read.

Greets,

René

> Seems like this thing might be best deferred until after this all
> lands. IOW, let's get this in with the 20-round original now, and later
> you can submit a change for the 12-round variant, and René and I can
> spend time dusting off our test rigs and seeing which strategy works
> best. I very nearly tossed out a bunch of old router hardware last
> night when cleaning up. Glad I saved it!
On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
>
> Hi Jason,
>
> Quoting "Jason A. Donenfeld" <Jason@zx2c4.com>:
<snip>
>
> I also think it isn't a big deal, but I shall benchmark it this
> weekend. If I am correct, a memory write will first be put in the
> cache, so if you read it again while it is still in the cache, the read
> is very fast: 1 or 2 clock cycles. Also, the value isn't used directly
> after it is read, so the CPU doesn't have to stall on this read.
>

Thanks René.

Note that the round count is not being spilled. I [re]load it from the
stack as a function parameter.

So instead of

	li	$at, 20

I do

	lw	$at, 16($sp)

Thanks a lot for taking the time to double check this. I think it
would be nice to be able to expose xchacha12 like we do on other
architectures.

Note that for xchacha, I also added a hchacha_block() routine based on
your code (with the round count as the third argument) [0]. Please let
me know if you see anything wrong with that.

+.globl hchacha_block
+.ent hchacha_block
+hchacha_block:
+	.frame	$sp, STACK_SIZE, $ra
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	lw	X0,   0(STATE)
+	lw	X1,   4(STATE)
+	lw	X2,   8(STATE)
+	lw	X3,  12(STATE)
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+	lw	X12, 48(STATE)
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_hchacha_xor_rounds:
+	addiu	$a2, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$a2, .Loop_hchacha_xor_rounds
+
+	sw	X0,   0(OUT)
+	sw	X1,   4(OUT)
+	sw	X2,   8(OUT)
+	sw	X3,  12(OUT)
+	sw	X12, 16(OUT)
+	sw	X13, 20(OUT)
+	sw	X14, 24(OUT)
+	sw	X15, 28(OUT)
+
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+.end hchacha_block
+.set at

[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/commit/?h=wireguard-crypto-library-api-v3&id=cc74a037f8152d52bd17feaf8d9142b61761484f
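(For orientation: the asm above mirrors the kernel's generic HChaCha
helper, which runs the permutation over a copy of the state and emits
only words 0..3 and 12..15, hence the eight sw instructions to OUT. A C
sketch, assuming the chacha_permute() shape shown earlier in this
thread; the real generic version lives in the kernel's ChaCha library
code.)

#include <linux/string.h>	/* memcpy() */
#include <linux/types.h>

void hchacha_block_generic(const u32 *state, u32 *stream, int nrounds)
{
	u32 x[16];

	memcpy(x, state, 64);
	chacha_permute(x, nrounds);

	/* Output is words 0..3 and 12..15 of the permuted state. */
	memcpy(&stream[0], &x[0], 16);
	memcpy(&stream[4], &x[12], 16);
}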
Hi Ard and Jason,

Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

> On Fri, 4 Oct 2019 at 17:15, René van Dorst <opensource@vdorst.com> wrote:
<snip>
>
> Thanks René.
>
> Note that the round count is not being spilled. I [re]load it from the
> stack as a function parameter.
>
> So instead of
>
>	li	$at, 20
>
> I do
>
>	lw	$at, 16($sp)
>
> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.

I dusted off my old benchmark code and put it on top of the latest
WireGuard source [0]. It benchmarks the chacha20poly1305_{de,en}crypt
functions with different data block sizes (x bytes). It runs two tests:
the first measures how many runs we get in 1 second (reported in
MB/sec), the other measures the CPU cycles used per call. The test is
performed on a Mediatek MT7621A SoC running at 880MHz.

Baseline [1]:

root@OpenWrt:~# insmod wg-speed-baseline.ko
[ 2029.866393] wireguard: chacha20 self-tests: pass
[ 2029.894301] wireguard: poly1305 self-tests: pass
[ 2029.906428] wireguard: chacha20poly1305 self-tests: pass
[ 2030.121001] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.253 MB/sec, 1598 cycles
[ 2030.340786] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.178 MB/sec, 1554 cycles
[ 2030.561434] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1692 cycles
[ 2030.784635] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.106 MB/sec, 2381 cycles
[ 2031.081534] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.480 MB/sec, 16751 cycles
[ 2031.371369] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16712 cycles
[ 2031.589621] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1648 cycles
[ 2031.809392] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.064 MB/sec, 1598 cycles
[ 2032.030034] wireguard: chacha20poly1305_decrypt: 64 bytes, 14.990 MB/sec, 1738 cycles
[ 2032.253245] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.679 MB/sec, 2428 cycles
[ 2032.540150] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16793 cycles
[ 2032.829954] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16756 cycles
[ 2032.850563] wireguard: blake2s self-tests: pass
[ 2033.073767] wireguard: curve25519 self-tests: pass
[ 2033.083600] wireguard: allowedips self-tests: pass
[ 2033.097982] wireguard: nonce counter self-tests: pass
[ 2033.535726] wireguard: ratelimiter self-tests: pass
[ 2033.545615] wireguard: WireGuard 0.0.20190913-4-g5cca99692496 loaded. See www.wireguard.com for information.
[ 2033.565197] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

Modified chacha20-mips.S [2]:

root@OpenWrt:~# rmmod wireguard.ko
root@OpenWrt:~# insmod wg-speed-nround-stack.ko
[ 2045.129910] wireguard: chacha20 self-tests: pass
[ 2045.157824] wireguard: poly1305 self-tests: pass
[ 2045.169962] wireguard: chacha20poly1305 self-tests: pass
[ 2045.381034] wireguard: chacha20poly1305_encrypt: 1 bytes, 0.251 MB/sec, 1607 cycles
[ 2045.600801] wireguard: chacha20poly1305_encrypt: 16 bytes, 4.174 MB/sec, 1555 cycles
[ 2045.821437] wireguard: chacha20poly1305_encrypt: 64 bytes, 15.392 MB/sec, 1691 cycles
[ 2046.044650] wireguard: chacha20poly1305_encrypt: 128 bytes, 22.082 MB/sec, 2379 cycles
[ 2046.341509] wireguard: chacha20poly1305_encrypt: 1420 bytes, 35.615 MB/sec, 16739 cycles
[ 2046.631333] wireguard: chacha20poly1305_encrypt: 1440 bytes, 36.117 MB/sec, 16705 cycles
[ 2046.849614] wireguard: chacha20poly1305_decrypt: 1 bytes, 0.246 MB/sec, 1647 cycles
[ 2047.069403] wireguard: chacha20poly1305_decrypt: 16 bytes, 4.056 MB/sec, 1600 cycles
[ 2047.290036] wireguard: chacha20poly1305_decrypt: 64 bytes, 15.001 MB/sec, 1736 cycles
[ 2047.513253] wireguard: chacha20poly1305_decrypt: 128 bytes, 21.666 MB/sec, 2429 cycles
[ 2047.800102] wireguard: chacha20poly1305_decrypt: 1420 bytes, 35.480 MB/sec, 16785 cycles
[ 2048.089967] wireguard: chacha20poly1305_decrypt: 1440 bytes, 35.979 MB/sec, 16759 cycles
[ 2048.110580] wireguard: blake2s self-tests: pass
[ 2048.333719] wireguard: curve25519 self-tests: pass
[ 2048.343547] wireguard: allowedips self-tests: pass
[ 2048.357926] wireguard: nonce counter self-tests: pass
[ 2048.785837] wireguard: ratelimiter self-tests: pass
[ 2048.795781] wireguard: WireGuard 0.0.20190913-5-gee7c7eec8deb loaded. See www.wireguard.com for information.
[ 2048.815389] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.

I don't see the extra store/load on the stack reflected in the results,
so I think this test proves well enough that taking the round count
from the stack is not a problem.

Ard, I shall take a look at your hchacha code later this weekend.

Greets,

René

[0]: https://github.com/vDorst/wireguard/commits/mips-bench
[1]: https://github.com/vDorst/wireguard/commit/5cca9969249632820cb96548813a65d1f297aa8c
[2]: https://github.com/vDorst/wireguard/commit/ee7c7eec8deb3d5d5dae2eec0be0aafca3fddbc2

> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
<snip>
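(René's harness at [0] above is the authoritative source for these
numbers. For orientation, a minimal sketch of this kind of in-kernel
cycle measurement; the bench_one() name and trial count are made up for
this sketch, and the chacha20poly1305_encrypt() signature assumed here
is the WireGuard library one. The real module also derives MB/sec from
a one-second run count.)

#include <linux/timex.h>	/* get_cycles(), cycles_t */
#include <linux/types.h>
#include <zinc/chacha20poly1305.h>

/* Average cycles per chacha20poly1305_encrypt() call for one buffer
 * size; dst must have room for len plus the 16-byte Poly1305 tag.
 */
static unsigned long bench_one(u8 *dst, const u8 *src, size_t len,
			       const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	enum { TRIALS = 256 };
	cycles_t start, end;
	int i;

	start = get_cycles();
	for (i = 0; i < TRIALS; ++i)
		chacha20poly1305_encrypt(dst, src, len, NULL, 0, i, key);
	end = get_cycles();

	return (unsigned long)(end - start) / TRIALS;
}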
Quoting Ard Biesheuvel <ard.biesheuvel@linaro.org>:

<snip>

Hi Ard,

> Thanks a lot for taking the time to double check this. I think it
> would be nice to be able to expose xchacha12 like we do on other
> architectures.
>
> Note that for xchacha, I also added a hchacha_block() routine based on
> your code (with the round count as the third argument) [0]. Please let
> me know if you see anything wrong with that.
>
> +.globl hchacha_block
> +.ent hchacha_block
> +hchacha_block:
> +	.frame	$sp, STACK_SIZE, $ra
> +
> +	addiu	$sp, -STACK_SIZE
> +
> +	/* Save s0-s7 */
> +	sw	$s0,  0($sp)
> +	sw	$s1,  4($sp)
> +	sw	$s2,  8($sp)
> +	sw	$s3, 12($sp)
> +	sw	$s4, 16($sp)
> +	sw	$s5, 20($sp)
> +	sw	$s6, 24($sp)
> +	sw	$s7, 28($sp)

We only have to preserve the s registers that are actually used.
Currently X11 to X15 are using registers s6 down to s2. But by
shuffling/redefining the needed registers so that we use all the
non-preserved registers first, I can reduce the number of used s
registers to one. Registers we don't use and don't have to preserve are
a3, at and v0. Also STATE (a0) can be reused, because we only need that
pointer while loading the values from memory.

So:

#undef X12
#undef X13
#undef X14
#undef X15
#define X12	$a3
#define X13	$at
#define X14	$v0
#define X15	STATE

And save X11 (s6) on the stack. See the full code here [0].

For the rest, the code looks good!

Greets,

René

[0]: https://github.com/vDorst/wireguard/commit/562a516ae3b282b32f57d3239369360bc926df60

<snip>
diff --git a/arch/mips/Makefile b/arch/mips/Makefile
index cdc09b71febe..8584c047ea59 100644
--- a/arch/mips/Makefile
+++ b/arch/mips/Makefile
@@ -323,7 +323,7 @@ libs-$(CONFIG_MIPS_FP_SUPPORT) += arch/mips/math-emu/
 # See arch/mips/Kbuild for content of core part of the kernel
 core-y += arch/mips/
 
-drivers-$(CONFIG_MIPS_CRC_SUPPORT) += arch/mips/crypto/
+drivers-y += arch/mips/crypto/
 drivers-$(CONFIG_OPROFILE) += arch/mips/oprofile/
 
 # suspend and hibernation support
diff --git a/arch/mips/crypto/Makefile b/arch/mips/crypto/Makefile
index e07aca572c2e..7f7ea0020cc2 100644
--- a/arch/mips/crypto/Makefile
+++ b/arch/mips/crypto/Makefile
@@ -4,3 +4,6 @@
 #
 
 obj-$(CONFIG_CRYPTO_CRC32_MIPS) += crc32-mips.o
+
+obj-$(CONFIG_CRYPTO_CHACHA_MIPS) += chacha-mips.o
+chacha-mips-y := chacha-core.o chacha-glue.o
diff --git a/arch/mips/crypto/chacha-core.S b/arch/mips/crypto/chacha-core.S
new file mode 100644
index 000000000000..42150d15fc88
--- /dev/null
+++ b/arch/mips/crypto/chacha-core.S
@@ -0,0 +1,424 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2016-2018 René van Dorst <opensource@vdorst.com>. All Rights Reserved.
+ * Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
+ */
+
+#define MASK_U32		0x3c
+#define CHACHA20_BLOCK_SIZE	64
+#define STACK_SIZE		32
+
+#define X0	$t0
+#define X1	$t1
+#define X2	$t2
+#define X3	$t3
+#define X4	$t4
+#define X5	$t5
+#define X6	$t6
+#define X7	$t7
+#define X8	$t8
+#define X9	$t9
+#define X10	$v1
+#define X11	$s6
+#define X12	$s5
+#define X13	$s4
+#define X14	$s3
+#define X15	$s2
+/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */
+#define T0	$s1
+#define T1	$s0
+#define T(n)	T ## n
+#define X(n)	X ## n
+
+/* Input arguments */
+#define STATE		$a0
+#define OUT		$a1
+#define IN		$a2
+#define BYTES		$a3
+
+/* Output argument */
+/* NONCE[0] is kept in a register and not in memory.
+ * We don't want to touch original value in memory.
+ * Must be incremented every loop iteration.
+ */
+#define NONCE_0		$v0
+
+/* SAVED_X and SAVED_CA are set in the jump table.
+ * Use regs which are overwritten on exit else we don't leak clear data.
+ * They are used to handle the last bytes which are not a multiple of 4.
+ */
+#define SAVED_X		X15
+#define SAVED_CA	$s7
+
+#define IS_UNALIGNED	$s7
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#define ROTx rotl
+#define ROTR(n) rotr n, 24
+#define CPU_TO_LE32(n) \
+	wsbh	n; \
+	rotr	n, 16;
+#else
+#define MSB 3
+#define LSB 0
+#define ROTx rotr
+#define CPU_TO_LE32(n)
+#define ROTR(n)
+#endif
+
+#define FOR_EACH_WORD(x) \
+	x( 0); \
+	x( 1); \
+	x( 2); \
+	x( 3); \
+	x( 4); \
+	x( 5); \
+	x( 6); \
+	x( 7); \
+	x( 8); \
+	x( 9); \
+	x(10); \
+	x(11); \
+	x(12); \
+	x(13); \
+	x(14); \
+	x(15);
+
+#define FOR_EACH_WORD_REV(x) \
+	x(15); \
+	x(14); \
+	x(13); \
+	x(12); \
+	x(11); \
+	x(10); \
+	x( 9); \
+	x( 8); \
+	x( 7); \
+	x( 6); \
+	x( 5); \
+	x( 4); \
+	x( 3); \
+	x( 2); \
+	x( 1); \
+	x( 0);
+
+#define PLUS_ONE_0	1
+#define PLUS_ONE_1	2
+#define PLUS_ONE_2	3
+#define PLUS_ONE_3	4
+#define PLUS_ONE_4	5
+#define PLUS_ONE_5	6
+#define PLUS_ONE_6	7
+#define PLUS_ONE_7	8
+#define PLUS_ONE_8	9
+#define PLUS_ONE_9	10
+#define PLUS_ONE_10	11
+#define PLUS_ONE_11	12
+#define PLUS_ONE_12	13
+#define PLUS_ONE_13	14
+#define PLUS_ONE_14	15
+#define PLUS_ONE_15	16
+#define PLUS_ONE(x)	PLUS_ONE_ ## x
+#define _CONCAT3(a,b,c)	a ## b ## c
+#define CONCAT3(a,b,c)	_CONCAT3(a,b,c)
+
+#define STORE_UNALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_unaligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lwl	T1, (x*4)+MSB ## (IN); \
+	lwr	T1, (x*4)+LSB ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	swl	X ## x, (x*4)+MSB ## (OUT); \
+	swr	X ## x, (x*4)+LSB ## (OUT);
+
+#define STORE_ALIGNED(x) \
+CONCAT3(.Lchacha_mips_xor_aligned_, PLUS_ONE(x), _b: ;) \
+	.if (x != 12); \
+		lw	T0, (x*4)(STATE); \
+	.endif; \
+	lw	T1, (x*4) ## (IN); \
+	.if (x == 12); \
+		addu	X ## x, NONCE_0; \
+	.else; \
+		addu	X ## x, T0; \
+	.endif; \
+	CPU_TO_LE32(X ## x); \
+	xor	X ## x, T1; \
+	sw	X ## x, (x*4) ## (OUT);
+
+/* Jump table macro.
+ * Used for setup and handling the last bytes, which are not a multiple of 4.
+ * X15 is free to store Xn
+ * Every jumptable entry must be equal in size.
+ */
+#define JMPTBL_ALIGNED(x) \
+.Lchacha_mips_jmptbl_aligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_aligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define JMPTBL_UNALIGNED(x) \
+.Lchacha_mips_jmptbl_unaligned_ ## x: ; \
+	.set	noreorder; \
+	b	.Lchacha_mips_xor_unaligned_ ## x ## _b; \
+	.if (x == 12); \
+		addu	SAVED_X, X ## x, NONCE_0; \
+	.else; \
+		addu	SAVED_X, X ## x, SAVED_CA; \
+	.endif; \
+	.set	reorder
+
+#define AXR(A, B, C, D,  K, L, M, N,  V, W, Y, Z,  S) \
+	addu	X(A), X(K); \
+	addu	X(B), X(L); \
+	addu	X(C), X(M); \
+	addu	X(D), X(N); \
+	xor	X(V), X(A); \
+	xor	X(W), X(B); \
+	xor	X(Y), X(C); \
+	xor	X(Z), X(D); \
+	rotl	X(V), S;    \
+	rotl	X(W), S;    \
+	rotl	X(Y), S;    \
+	rotl	X(Z), S;
+
+.text
+.set	reorder
+.set	noat
+.globl	chacha_mips
+.ent	chacha_mips
+chacha_mips:
+	.frame	$sp, STACK_SIZE, $ra
+
+	/* Load number of rounds */
+	lw	$at, 16($sp)
+
+	addiu	$sp, -STACK_SIZE
+
+	/* Return bytes = 0. */
+	beqz	BYTES, .Lchacha_mips_end
+
+	lw	NONCE_0, 48(STATE)
+
+	/* Save s0-s7 */
+	sw	$s0,  0($sp)
+	sw	$s1,  4($sp)
+	sw	$s2,  8($sp)
+	sw	$s3, 12($sp)
+	sw	$s4, 16($sp)
+	sw	$s5, 20($sp)
+	sw	$s6, 24($sp)
+	sw	$s7, 28($sp)
+
+	/* Test if IN or OUT is unaligned.
+	 * IS_UNALIGNED = ( IN | OUT ) & 0x00000003
+	 */
+	or	IS_UNALIGNED, IN, OUT
+	andi	IS_UNALIGNED, 0x3
+
+	b	.Lchacha_rounds_start
+
+.align 4
+.Loop_chacha_rounds:
+	addiu	IN,  CHACHA20_BLOCK_SIZE
+	addiu	OUT, CHACHA20_BLOCK_SIZE
+	addiu	NONCE_0, 1
+
+.Lchacha_rounds_start:
+	lw	X0,   0(STATE)
+	lw	X1,   4(STATE)
+	lw	X2,   8(STATE)
+	lw	X3,  12(STATE)
+
+	lw	X4,  16(STATE)
+	lw	X5,  20(STATE)
+	lw	X6,  24(STATE)
+	lw	X7,  28(STATE)
+	lw	X8,  32(STATE)
+	lw	X9,  36(STATE)
+	lw	X10, 40(STATE)
+	lw	X11, 44(STATE)
+
+	move	X12, NONCE_0
+	lw	X13, 52(STATE)
+	lw	X14, 56(STATE)
+	lw	X15, 60(STATE)
+
+.Loop_chacha_xor_rounds:
+	addiu	$at, -2
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15, 16);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7, 12);
+	AXR( 0, 1, 2, 3,  4, 5, 6, 7, 12,13,14,15,  8);
+	AXR( 8, 9,10,11, 12,13,14,15,  4, 5, 6, 7,  7);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14, 16);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4, 12);
+	AXR( 0, 1, 2, 3,  5, 6, 7, 4, 15,12,13,14,  8);
+	AXR(10,11, 8, 9, 15,12,13,14,  5, 6, 7, 4,  7);
+	bnez	$at, .Loop_chacha_xor_rounds
+
+	addiu	BYTES, -(CHACHA20_BLOCK_SIZE)
+
+	/* Is data src/dst unaligned? Jump */
+	bnez	IS_UNALIGNED, .Loop_chacha_unaligned
+
+	/* Set number rounds here to fill delayslot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_aligned
+
+	FOR_EACH_WORD_REV(STORE_ALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+
+	/* BYTES < 0? Handle last bytes */
+	bltz	BYTES, .Lchacha_mips_xor_bytes
+
+.Lchacha_mips_xor_done:
+	/* Restore used registers */
+	lw	$s0,  0($sp)
+	lw	$s1,  4($sp)
+	lw	$s2,  8($sp)
+	lw	$s3, 12($sp)
+	lw	$s4, 16($sp)
+	lw	$s5, 20($sp)
+	lw	$s6, 24($sp)
+	lw	$s7, 28($sp)
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+.Lchacha_mips_end:
+	addiu	$sp, STACK_SIZE
+	jr	$ra
+
+.Lchacha_mips_no_full_block_aligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_aligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_ALIGNED)
+
+
+.Loop_chacha_unaligned:
+	/* Set number rounds here to fill delayslot. */
+	lw	$at, (STACK_SIZE+16)($sp)
+
+	/* BYTES < 0, it has no full block. */
+	bltz	BYTES, .Lchacha_mips_no_full_block_unaligned
+
+	FOR_EACH_WORD_REV(STORE_UNALIGNED)
+
+	/* BYTES > 0? Loop again. */
+	bgtz	BYTES, .Loop_chacha_rounds
+
+	/* Write NONCE_0 back to right location in state */
+	sw	NONCE_0, 48(STATE)
+
+	.set noreorder
+	/* Fall through to byte handling */
+	bgez	BYTES, .Lchacha_mips_xor_done
+.Lchacha_mips_xor_unaligned_0_b:
+.Lchacha_mips_xor_aligned_0_b:
+	/* Place this here to fill delay slot */
+	addiu	NONCE_0, 1
+	.set reorder
+
+.Lchacha_mips_xor_bytes:
+	addu	IN, $at
+	addu	OUT, $at
+	/* First byte */
+	lbu	T1, 0(IN)
+	addiu	$at, BYTES, 1
+	CPU_TO_LE32(SAVED_X)
+	ROTR(SAVED_X)
+	xor	T1, SAVED_X
+	sb	T1, 0(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Second byte */
+	lbu	T1, 1(IN)
+	addiu	$at, BYTES, 2
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 1(OUT)
+	beqz	$at, .Lchacha_mips_xor_done
+	/* Third byte */
+	lbu	T1, 2(IN)
+	ROTx	SAVED_X, 8
+	xor	T1, SAVED_X
+	sb	T1, 2(OUT)
+	b	.Lchacha_mips_xor_done
+
+.Lchacha_mips_no_full_block_unaligned:
+	/* Restore the offset on BYTES */
+	addiu	BYTES, CHACHA20_BLOCK_SIZE
+
+	/* Get number of full WORDS */
+	andi	$at, BYTES, MASK_U32
+
+	/* Load upper half of jump table addr */
+	lui	T0, %hi(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Calculate lower half jump table offset */
+	ins	T0, $at, 1, 6
+
+	/* Add offset to STATE */
+	addu	T1, STATE, $at
+
+	/* Add lower half jump table addr */
+	addiu	T0, %lo(.Lchacha_mips_jmptbl_unaligned_0)
+
+	/* Read value from STATE */
+	lw	SAVED_CA, 0(T1)
+
+	/* Store remaining bytecounter as negative value */
+	subu	BYTES, $at, BYTES
+
+	jr	T0
+
+	/* Jump table */
+	FOR_EACH_WORD(JMPTBL_UNALIGNED)
+.end chacha_mips
+.set at
diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
new file mode 100644
index 000000000000..de01dc57751e
--- /dev/null
+++ b/arch/mips/crypto/chacha-glue.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * MIPS accelerated ChaCha and XChaCha stream ciphers,
+ * including ChaCha20 (RFC7539)
+ *
+ * Copyright (C) 2019 Linaro, Ltd. <ard.biesheuvel@linaro.org>
+ */
+
+#include <crypto/algapi.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+asmlinkage void chacha_mips(const u32 *state, u8 *dst, const u8 *src,
+			    unsigned int bytes, int nrounds);
+
+void hchacha_block(const u32 *state, u32 *stream, int nrounds)
+{
+	hchacha_block_generic(state, stream, nrounds);
+}
+EXPORT_SYMBOL(hchacha_block);
+
+void chacha_init(u32 *state, const u32 *key, const u8 *iv)
+{
+	chacha_init_generic(state, key, iv);
+}
+EXPORT_SYMBOL(chacha_init);
+
+void chacha_crypt(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
+		  int nrounds)
+{
+	chacha_mips(state, dst, src, bytes, nrounds);
+}
+EXPORT_SYMBOL(chacha_crypt);
+
+static int chacha_mips_stream_xor(struct skcipher_request *req,
+				  const struct chacha_ctx *ctx, const u8 *iv)
+{
+	struct skcipher_walk walk;
+	u32 state[16];
+	int err;
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	crypto_chacha_init(state, ctx, iv);
+
+	while (walk.nbytes > 0) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, walk.stride);
+
+		chacha_mips(state, walk.dst.virt.addr, walk.src.virt.addr,
+			    nbytes, ctx->nrounds);
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	return err;
+}
+
+static int __chacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+
+	return chacha_mips_stream_xor(req, ctx, req->iv);
+}
+
+static int xchacha_mips(struct skcipher_request *req)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct chacha_ctx subctx;
+	u32 state[16];
+	u8 real_iv[16];
+
+	crypto_chacha_init(state, ctx, req->iv);
+
+	hchacha_block_generic(state, subctx.key, ctx->nrounds);
+	subctx.nrounds = ctx->nrounds;
+
+	memcpy(&real_iv[0], req->iv + 24, 8);
+	memcpy(&real_iv[8], req->iv + 16, 8);
+	return chacha_mips_stream_xor(req, &subctx, real_iv);
+}
+
+static struct skcipher_alg algs[] = {
+	{
+		.base.cra_name		= "chacha20",
+		.base.cra_driver_name	= "chacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= CHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= __chacha_mips,
+		.decrypt		= __chacha_mips,
+	}, {
+		.base.cra_name		= "xchacha20",
+		.base.cra_driver_name	= "xchacha20-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha20_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}, {
+		.base.cra_name		= "xchacha12",
+		.base.cra_driver_name	= "xchacha12-mips",
+		.base.cra_priority	= 200,
+		.base.cra_blocksize	= 1,
+		.base.cra_ctxsize	= sizeof(struct chacha_ctx),
+		.base.cra_module	= THIS_MODULE,
+
+		.min_keysize		= CHACHA_KEY_SIZE,
+		.max_keysize		= CHACHA_KEY_SIZE,
+		.ivsize			= XCHACHA_IV_SIZE,
+		.chunksize		= CHACHA_BLOCK_SIZE,
+		.walksize		= 4 * CHACHA_BLOCK_SIZE,
+		.setkey			= crypto_chacha12_setkey,
+		.encrypt		= xchacha_mips,
+		.decrypt		= xchacha_mips,
+	}
+};
+
+static int __init chacha_simd_mod_init(void)
+{
+	return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+static void __exit chacha_simd_mod_fini(void)
+{
+	crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
+}
+
+module_init(chacha_simd_mod_init);
+module_exit(chacha_simd_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha and XChaCha stream ciphers (MIPS accelerated)");
+MODULE_AUTHOR("Ard Biesheuvel <ard.biesheuvel@linaro.org>");
+MODULE_LICENSE("GPL v2");
+MODULE_ALIAS_CRYPTO("chacha20");
+MODULE_ALIAS_CRYPTO("chacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha20");
+MODULE_ALIAS_CRYPTO("xchacha20-mips");
+MODULE_ALIAS_CRYPTO("xchacha12");
+MODULE_ALIAS_CRYPTO("xchacha12-mips");
diff --git a/crypto/Kconfig b/crypto/Kconfig
index f90b53a526ba..43e94ac5d117 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1441,6 +1441,12 @@ config CRYPTO_CHACHA20_X86_64
 	  SSSE3, AVX2, and AVX-512VL optimized implementations of the ChaCha20,
 	  XChaCha20, and XChaCha12 stream ciphers.
 
+config CRYPTO_CHACHA_MIPS
+	tristate "ChaCha stream cipher algorithms (MIPS 32r2 optimized)"
+	depends on CPU_MIPS32_R2
+	select CRYPTO_CHACHA20
+	select CRYPTO_ARCH_HAVE_LIB_CHACHA
+
 config CRYPTO_SEED
 	tristate "SEED cipher algorithm"
 	select CRYPTO_ALGAPI
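(Usage note: besides the skcipher registrations, the glue code above
exports chacha_init(), hchacha_block() and chacha_crypt() as a library
interface. A sketch of driving those exports to produce one
XChaCha12-encrypted block, mirroring what xchacha_mips() does on the
skcipher path; the xchacha12_crypt_one_block() helper itself is
hypothetical.)

#include <crypto/chacha.h>
#include <linux/string.h>

static void xchacha12_crypt_one_block(const u32 key[8], const u8 iv[32],
				      u8 *dst, const u8 *src)
{
	u32 state[16], subkey[8];
	u8 real_iv[16];

	/* Derive the subkey from the key and the first 16 IV bytes. */
	chacha_init(state, key, iv);
	hchacha_block(state, subkey, 12);

	/* Remaining IV bytes form the IV for the inner ChaCha,
	 * laid out exactly as in xchacha_mips() above.
	 */
	memcpy(&real_iv[0], iv + 24, 8);
	memcpy(&real_iv[8], iv + 16, 8);

	chacha_init(state, subkey, real_iv);
	chacha_crypt(state, dst, src, CHACHA_BLOCK_SIZE, 12);
}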