Message-ID: 20180918161646.19105-1-Jason@zx2c4.com
Series: WireGuard: Secure Network Tunnel
Hi David,

On Tue, Sep 18, 2018 at 06:01:47PM +0100, David Howells wrote:
> Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> > A while back, I noticed that the crypto and crypto API usage in big_keys
> > were entirely broken in multiple ways, so I rewrote it. Now, I'm
> > rewriting it again, but this time using Zinc's ChaCha20Poly1305
> > function.
>
> warthog>git grep chacha20poly1305_decrypt net-next/master
> warthog1>
>
> Where do I find this?
>
> David

Previously in the patchset:

https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard
https://lore.kernel.org/lkml/20180918161646.19105-1-Jason@zx2c4.com/

Regards,
Jason
Hi Ard,

On Tue, Sep 18, 2018 at 11:28:50AM -0700, Ard Biesheuvel wrote:
> On 18 September 2018 at 09:16, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > - While I initially wasn't going to do this for the initial
> >   patchset, it was just so simple to do: now there's a nosimd
> >   module parameter that can be used to disable simd instructions
> >   for debugging and testing, or on weird systems.
>
> I was going to respond in the other thread but it is probably better
> to move the discussion here.
>
> My concern about the monolithic nature of each algo module is not only
> about SIMD, and it has nothing to do with weird systems. It has to do
> with micro-architectural differences which are more common on ARM than
> on other architectures *, I suppose. But generalizing from that, it
> has to do with policy which is currently owned by userland and not by
> the kernel. This will also be important for choosing between the time
> variant but less safe table based scalar AES and the much slower time
> invariant version (which is substantially slower, especially on
> decryption) once we move AES into this library.
>
> So a command line option for the kernel is not the solution here. If
> we can't have separate modules, could we at least have per-module
> options that put the policy decisions back into userland?
>
> * as an example, the SHA256 NEON code I collaborated on with Andy
> Polyakov 2 years ago is significantly faster on some cores and not on
> others

Interesting concern. There are micro-architectural quirks on x86 too that
the current code actually already considers. Notably, we use an AVX-512VL
path for Skylake-X but an AVX-512F path for Knights Landing and Coffee
Lake and others, due to thermal throttling when touching the zmm registers
on Skylake-X. So, in the code, we have it automatically select the right
thing based on the micro-architecture. Is the same thing not possible with
ARM?

Do you not have access to this information already, such that the module
can just always do the right thing and not require any user intervention?
If so, that would be ideal. If not (and I'm curious to learn why not
exactly), then indeed we could add some runtime knobs in
/sys/module/{algo}/parameters/{knob}, or the like. This would be super
easy to do, should we ever encounter a situation where we're unable to
auto-detect the correct thing.

Regards,
Jason
> +#define push_rcu(stack, p, len) ({                                     \
> +               if (rcu_access_pointer(p)) {                           \
> +                       BUG_ON(len >= 128);                            \
> +                       stack[len++] = rcu_dereference_raw(p);         \
> +               }                                                      \
> +               true;                                                  \
> +       })
> +static void root_free_rcu(struct rcu_head *rcu)
> +{
> +       struct allowedips_node *node, *stack[128] = {
> +               container_of(rcu, struct allowedips_node, rcu) };
> +       unsigned int len = 1;
> +
> +       while (len > 0 && (node = stack[--len]) &&
> +              push_rcu(stack, node->bit[0], len) &&
> +              push_rcu(stack, node->bit[1], len))
> +               kfree(node);
> +}

Hi Jason

I see this BUG_ON() is still here. It really needs to be removed. It
does not look like you need to crash the kernel here. Can you add in a
test of len >= 128, do a WARN and then return. I think you then leak
some memory, but I would much prefer that to a crashed machine.

        Andrew
Hi Eric,

On Wed, Sep 19, 2018 at 12:55 AM Eric Biggers <ebiggers@kernel.org> wrote:
> This will compute the wrong digest if called with simd_context=HAVE_FULL_SIMD
> and then later with simd_context=HAVE_NO_SIMD, since poly1305_blocks_neon()
> converts the accumulator from base 32 to base 26, whereas poly1305_blocks_arm()
> assumes it is still in base 32. Is that intentional? I'm sure this is a rare
> case, but my understanding is that the existing crypto API doesn't preclude
> calling successive steps in different contexts. And I'm concerned that it could
> be relevant in some cases, e.g. especially if people are importing a hash state
> that was exported earlier. Handling it by silently computing the wrong digest
> is not a great idea...

Indeed you're right; Samuel and I were just discussing that recently. I'd
rather handle this correctly even if the contexts change, so I'll see if
I can fix this up properly for that unlikely case in the next revision.

Jason
On Tue, Sep 18, 2018 at 06:16:38PM +0200, Jason A. Donenfeld wrote:
> The C implementation was originally based on Samuel Neves' public
> domain reference implementation but has since been heavily modified
> for the kernel. We're able to do compile-time optimizations by moving
> some scaffolding around the final function into the header file.
>
> Information: https://blake2.net/
>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
> ---
>  include/zinc/blake2s.h      |   95 ++
>  lib/zinc/Kconfig            |    3 +
>  lib/zinc/Makefile           |    3 +
>  lib/zinc/blake2s/blake2s.c  |  301 +++++
>  lib/zinc/selftest/blake2s.h | 2095 +++++++++++++++++++++++++++++++++++
>  5 files changed, 2497 insertions(+)
>  create mode 100644 include/zinc/blake2s.h
>  create mode 100644 lib/zinc/blake2s/blake2s.c
>  create mode 100644 lib/zinc/selftest/blake2s.h
>
> diff --git a/include/zinc/blake2s.h b/include/zinc/blake2s.h
> new file mode 100644
> index 000000000000..951281596274
> --- /dev/null
> +++ b/include/zinc/blake2s.h
> @@ -0,0 +1,95 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + */
> +
> +#ifndef _ZINC_BLAKE2S_H
> +#define _ZINC_BLAKE2S_H
> +
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <crypto/algapi.h>
> +
> +enum blake2s_lengths {
> +       BLAKE2S_BLOCKBYTES = 64,
> +       BLAKE2S_OUTBYTES = 32,
> +       BLAKE2S_KEYBYTES = 32
> +};
> +
> +struct blake2s_state {
> +       u32 h[8];
> +       u32 t[2];
> +       u32 f[2];
> +       u8 buf[BLAKE2S_BLOCKBYTES];
> +       size_t buflen;
> +       u8 last_node;
> +};
> +
> +void blake2s_init(struct blake2s_state *state, const size_t outlen);
> +void blake2s_init_key(struct blake2s_state *state, const size_t outlen,
> +                     const void *key, const size_t keylen);
> +void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
> +void __blake2s_final(struct blake2s_state *state);
> +static inline void blake2s_final(struct blake2s_state *state, u8 *out,
> +                                const size_t outlen)
> +{
> +       int i;
> +
> +#ifdef DEBUG
> +       BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
> +#endif
> +       __blake2s_final(state);
> +
> +       if (__builtin_constant_p(outlen) && !(outlen % sizeof(u32))) {
> +               if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
> +                   IS_ALIGNED((unsigned long)out, __alignof__(u32))) {
> +                       __le32 *outwords = (__le32 *)out;
> +
> +                       for (i = 0; i < outlen / sizeof(u32); ++i)
> +                               outwords[i] = cpu_to_le32(state->h[i]);
> +               } else {
> +                       __le32 buffer[BLAKE2S_OUTBYTES];

This buffer is 4 times too long.

> +
> +                       for (i = 0; i < outlen / sizeof(u32); ++i)
> +                               buffer[i] = cpu_to_le32(state->h[i]);
> +                       memcpy(out, buffer, outlen);
> +                       memzero_explicit(buffer, sizeof(buffer));
> +               }
> +       } else {
> +               u8 buffer[BLAKE2S_OUTBYTES] __aligned(__alignof__(u32));
> +               __le32 *outwords = (__le32 *)buffer;
> +
> +               for (i = 0; i < 8; ++i)
> +                       outwords[i] = cpu_to_le32(state->h[i]);
> +               memcpy(out, buffer, outlen);
> +               memzero_explicit(buffer, sizeof(buffer));
> +       }
> +
> +       memzero_explicit(state, sizeof(*state));
> +}

Or how about something much simpler:

static inline void blake2s_final(struct blake2s_state *state, u8 *out,
                                 const size_t outlen)
{
#ifdef DEBUG
        BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
#endif
        __blake2s_final(state);

        cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
        memcpy(out, state->h, outlen);

        memzero_explicit(state, sizeof(*state));
}
Hey Eric,

On Wed, Sep 19, 2018 at 2:41 AM Eric Biggers <ebiggers@kernel.org> wrote:
> This buffer is 4 times too long.

Nice catch.

> Or how about something much simpler:
>
> static inline void blake2s_final(struct blake2s_state *state, u8 *out,
>                                  const size_t outlen)
> {
> #ifdef DEBUG
>         BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
> #endif
>         __blake2s_final(state);
>
>         cpu_to_le32_array(state->h, ARRAY_SIZE(state->h));
>         memcpy(out, state->h, outlen);
>
>         memzero_explicit(state, sizeof(*state));
> }

Oh, that's excellent, thanks. Much better than prior. I'll do exactly
that.

Jason
On Tue, Sep 18, 2018 at 06:16:33PM +0200, Jason A. Donenfeld wrote:
> diff --git a/lib/zinc/poly1305/poly1305.c b/lib/zinc/poly1305/poly1305.c
> new file mode 100644
> index 000000000000..dbab82f33aa7
> --- /dev/null
> +++ b/lib/zinc/poly1305/poly1305.c
> @@ -0,0 +1,155 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + *
> + * Implementation of the Poly1305 message authenticator.
> + *
> + * Information: https://cr.yp.to/mac.html
> + */
> +
> +#include <zinc/poly1305.h>
> +
> +#include <asm/unaligned.h>
> +#include <linux/kernel.h>
> +#include <linux/string.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +
> +#ifndef HAVE_POLY1305_ARCH_IMPLEMENTATION
> +static inline bool poly1305_init_arch(void *ctx,
> +                                     const u8 key[POLY1305_KEY_SIZE])
> +{
> +       return false;
> +}
> +static inline bool poly1305_blocks_arch(void *ctx, const u8 *input,
> +                                       const size_t len, const u32 padbit,
> +                                       simd_context_t *simd_context)
> +{
> +       return false;
> +}
> +static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
> +                                     const u32 nonce[4],
> +                                     simd_context_t *simd_context)
> +{
> +       return false;
> +}
> +void __init poly1305_fpu_init(void)
> +{
> +}
> +#endif
> +
> +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> +#include "poly1305-donna64.h"
> +#else
> +#include "poly1305-donna32.h"
> +#endif
> +
> +void poly1305_init(struct poly1305_ctx *ctx, const u8 key[POLY1305_KEY_SIZE])
> +{
> +       ctx->nonce[0] = get_unaligned_le32(&key[16]);
> +       ctx->nonce[1] = get_unaligned_le32(&key[20]);
> +       ctx->nonce[2] = get_unaligned_le32(&key[24]);
> +       ctx->nonce[3] = get_unaligned_le32(&key[28]);
> +
> +       if (!poly1305_init_arch(ctx->opaque, key))
> +               poly1305_init_generic(ctx->opaque, key);
> +
> +       ctx->num = 0;
> +}
> +EXPORT_SYMBOL(poly1305_init);
> +
> +static inline void poly1305_blocks(void *ctx, const u8 *input, const size_t len,
> +                                  const u32 padbit,
> +                                  simd_context_t *simd_context)
> +{
> +       if (!poly1305_blocks_arch(ctx, input, len, padbit, simd_context))
> +               poly1305_blocks_generic(ctx, input, len, padbit);
> +}
> +
> +static inline void poly1305_emit(void *ctx, u8 mac[POLY1305_KEY_SIZE],
> +                                const u32 nonce[4],
> +                                simd_context_t *simd_context)
> +{
> +       if (!poly1305_emit_arch(ctx, mac, nonce, simd_context))
> +               poly1305_emit_generic(ctx, mac, nonce);
> +}
> +
> +void poly1305_update(struct poly1305_ctx *ctx, const u8 *input, size_t len,
> +                    simd_context_t *simd_context)
> +{
> +       const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> +       size_t rem;

0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.

> +
> +       if (num) {
> +               rem = POLY1305_BLOCK_SIZE - num;
> +               if (len < rem) {
> +                       memcpy(ctx->data + num, input, len);
> +                       ctx->num = num + len;
> +                       return;
> +               }
> +               memcpy(ctx->data + num, input, rem);
> +               poly1305_blocks(ctx->opaque, ctx->data, POLY1305_BLOCK_SIZE, 1,
> +                               simd_context);
> +               input += rem;
> +               len -= rem;
> +       }
> +
> +       rem = len % POLY1305_BLOCK_SIZE;
> +       len -= rem;
> +
> +       if (len >= POLY1305_BLOCK_SIZE) {
> +               poly1305_blocks(ctx->opaque, input, len, 1, simd_context);
> +               input += len;
> +       }
> +
> +       if (rem)
> +               memcpy(ctx->data, input, rem);
> +
> +       ctx->num = rem;
> +}
> +EXPORT_SYMBOL(poly1305_update);
> +
> +void poly1305_final(struct poly1305_ctx *ctx, u8 mac[POLY1305_MAC_SIZE],
> +                   simd_context_t *simd_context)
> +{
> +       size_t num = ctx->num % POLY1305_BLOCK_SIZE;

Same here.

> +++ b/lib/zinc/selftest/poly1305.h
> @@ -0,0 +1,875 @@
> +/* SPDX-License-Identifier: MIT
> + *
> + * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
> + */
> +
> +#ifdef DEBUG
> +struct poly1305_testvec {
> +       u8 input[600];
> +       u8 output[POLY1305_MAC_SIZE];
> +       u8 key[POLY1305_KEY_SIZE];
> +       size_t ilen;
> +};
> +
> +static const struct poly1305_testvec poly1305_testvecs[] __initconst = {
> +{ /* RFC7539 */
> +       .input  = { 0x43, 0x72, 0x79, 0x70, 0x74, 0x6f, 0x67, 0x72,
> +                   0x61, 0x70, 0x68, 0x69, 0x63, 0x20, 0x46, 0x6f,
> +                   0x72, 0x75, 0x6d, 0x20, 0x52, 0x65, 0x73, 0x65,
> +                   0x61, 0x72, 0x63, 0x68, 0x20, 0x47, 0x72, 0x6f,
> +                   0x75, 0x70 },
> +       .ilen   = 34,
> +       .output = { 0xa8, 0x06, 0x1d, 0xc1, 0x30, 0x51, 0x36, 0xc6,
> +                   0xc2, 0x2b, 0x8b, 0xaf, 0x0c, 0x01, 0x27, 0xa9 },
> +       .key    = { 0x85, 0xd6, 0xbe, 0x78, 0x57, 0x55, 0x6d, 0x33,
> +                   0x7f, 0x44, 0x52, 0xfe, 0x42, 0xd5, 0x06, 0xa8,
> +                   0x01, 0x03, 0x80, 0x8a, 0xfb, 0x0d, 0xb2, 0xfd,
> +                   0x4a, 0xbf, 0xf6, 0xaf, 0x41, 0x49, 0xf5, 0x1b },
> +}, { /* "The Poly1305-AES message-authentication code" */

Hardcoding the 'input' array to 600 bytes forces the full amount of space to be
reserved in the kernel image for every test vector. Also, if anyone adds a
longer test vector they will need to remember to increase the value.

It should be a const pointer instead, like the test vectors in crypto/testmgr.h.

- Eric
On Wed, Sep 19, 2018 at 2:50 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Hardcoding the 'input' array to 600 bytes forces the full amount of space to be
> reserved in the kernel image for every test vector. Also, if anyone adds a
> longer test vector they will need to remember to increase the value.
>
> It should be a const pointer instead, like the test vectors in crypto/testmgr.h.

I know. The agony. This has been really annoying me. I originally did it
the right way, but removed it last week, when I noticed that gcc failed
to put it in the initconst section:

https://git.zx2c4.com/WireGuard/commit/?id=f4698d20f13946afc6ce99e98685ba3f9adc4474

Even changing the (u8[]){ ... } into a (const u8[]){ ... } or even into
a const string literal does not do the trick. It makes it into the
constant data section with const, but it does not make it into the
initconst section. What a bummer. I went asking about this on the gcc
mailing list, to see if there was just some aspect of C that I had
overlooked:

https://gcc.gnu.org/ml/gcc/2018-09/msg00043.html

So far, it looks like we're SOL. I could probably make some macros to do
this in a .S file but... that's pretty unacceptably gross. Or I could
define lots and lots of variables in __initconst, and then connect them
all together at the end, but that's pretty gross too. Or I could have
all this data in one variable and record offsets into it, but that's
even more atrocious. So I think it comes down to these two non-ugly
options:

- We use fixed sized buffers, waste a lot of space, and be happy that
  it's cleared from memory immediately after init/insertion anyway, so
  it's not actually wasting ram.
- We use const string literals / constant compound literals, save space
  on disk, but not benefit from having it cleared from memory after
  init/insertion.

Of these, which would you prefer? I can see the argument both ways, but
in the end opted for the first. Or perhaps you have a better third
option?

Jason
> > +     const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
>
> 0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.
>
> > +     size_t num = ctx->num % POLY1305_BLOCK_SIZE;
>
> Same here.

I know, but I was having a hard time convincing gcc-8 of that invariant,
and it was warning me. Perhaps this is something they fixed between 8.1
and 8.2, though. I'll check back and adjust accordingly.
On Wed, Sep 19, 2018 at 3:39 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > > +     const size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> > 0 <= ctx->num < POLY1305_BLOCK_SIZE, so no need to mod by POLY1305_BLOCK_SIZE.
> > > +     size_t num = ctx->num % POLY1305_BLOCK_SIZE;
> > Same here.
>
> I know, but I was having a hard time convincing gcc-8 of that
> invariant, and it was warning me. Perhaps this is something they fixed
> between 8.1 and 8.2, though. I'll check back and adjust accordingly.

This was changed here:

https://git.zx2c4.com/WireGuard/commit/?id=37f114a73ba37219b00a66f0a51219a696599745

I can't reproduce with 8.2 anymore, so perhaps I should remove it now.
Unless you'd like to avoid a warning on old compilers. Since there's no
difference in speed, probably we should avoid the 8.1 warning and leave
it be?
On Wed, Sep 19, 2018 at 3:08 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Does this consistently perform as well as an implementation that organizes the
> operations such that the quarterrounds for all columns/diagonals are
> interleaved? As-is, there are tight dependencies in QUARTER_ROUND() (as well as
> in the existing chacha20_block() in lib/chacha20.c, for that matter), so we're
> heavily depending on the compiler to do the needed interleaving so as to not get
> potentially disastrous performance. Making it explicit could be a good idea.

It does perform as well, and the compiler outputs good code, even on
older compilers. Notably that's all a single statement (via the comma
operator).

> > +}
> > +
> > +static void chacha20_generic(u8 *out, const u8 *in, u32 len, const u32 key[8],
> > +                            const u32 counter[4])
> > +{
> > +     __le32 buf[CHACHA20_BLOCK_WORDS];
> > +     u32 x[] = {
> > +             EXPAND_32_BYTE_K,
> > +             key[0], key[1], key[2], key[3],
> > +             key[4], key[5], key[6], key[7],
> > +             counter[0], counter[1], counter[2], counter[3]
> > +     };
> > +
> > +     if (out != in)
> > +             memmove(out, in, len);
> > +
> > +     while (len >= CHACHA20_BLOCK_SIZE) {
> > +             chacha20_block_generic(buf, x);
> > +             crypto_xor(out, (u8 *)buf, CHACHA20_BLOCK_SIZE);
> > +             len -= CHACHA20_BLOCK_SIZE;
> > +             out += CHACHA20_BLOCK_SIZE;
> > +     }
> > +     if (len) {
> > +             chacha20_block_generic(buf, x);
> > +             crypto_xor(out, (u8 *)buf, len);
> > +     }
> > +}
>
> If crypto_xor_cpy() is used instead of crypto_xor(), and 'in' is incremented
> along with 'out', then the memmove() is not needed.

Nice idea, thanks. Implemented.

Jason
Hi Andrew,

On Wed, Sep 19, 2018 at 1:34 AM Andrew Lunn <andrew@lunn.ch> wrote:
> I see this BUG_ON() is still here. It really needs to be removed. It
> does not look like you need to crash the kernel here. Can you add in a
> test of len >= 128, do a WARN and then return. I think you then leak
> some memory, but I would much prefer that to a crashed machine.

Sure, I'll change it to that.

Regards,
Jason
Hi Thomas,

On Wed, Sep 19, 2018 at 12:30 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> I'm a bit confused by this SOB chain. So above you write that it's from
> Andy Polyakov's implementation and Samuel did the changes. But here it
> seems you are the main author. If Samuel just did some modifications
> then you want to use the Co-developed-by tag along with his SOB.

Thanks, I'll use that tag.

> Also I'd recommend to add a Originally-by or Based-on-code-from: Andy
> Polyakov tag. Both are not formal tags but widely in use for attributions.

Great idea.

> > +++ b/lib/zinc/chacha20/chacha20-x86_64-glue.h
> > @@ -0,0 +1,100 @@
> > +/* SPDX-License-Identifier: MIT
>
> Please put that into a separate one liner comment. Also this should be
> 'GPL-2.0[+] or MIT' I think.

I had that originally, but changed it to just MIT, since MIT is a subset
of GPL-2.0. And looking at the tree repo, it appears this is what others
do too.

Jason
Jason,

On Wed, 19 Sep 2018, Jason A. Donenfeld wrote:
> On Wed, Sep 19, 2018 at 12:30 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > +++ b/lib/zinc/chacha20/chacha20-x86_64-glue.h
> > > @@ -0,0 +1,100 @@
> > > +/* SPDX-License-Identifier: MIT
> >
> > Please put that into a separate one liner comment. Also this should be
> > 'GPL-2.0[+] or MIT' I think.
>
> I had that originally, but changed it to just MIT, since MIT is a
> subset of GPL-2.0. And looking at the tree repo, it appears this is
> what others do too.

Subset? Not really. Both MIT and BSD3-Clause are GPL2.0 compatible
licenses. And if your intention is to have those files MIT/BSD only, yes
then the single license identifier is the right thing. If you want it
dual licensed then it should be expressed there clearly.

Thanks,

        tglx
Hi Thomas,

On Wed, Sep 19, 2018 at 8:13 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> Subset? Not really. Both MIT and BSD3-Clause are GPL2.0 compatible
> licenses. And if your intention is to have those files MIT/BSD only, yes
> then the single license identifier is the right thing. If you want it
> dual licensed then it should be expressed there clearly.

I always thought "GPL2 compatible" was the same as "subset" because of
the restrictions clause of GPL2. But IANAL, so for the avoidance of
doubt I'll take your advice and put both.

Jason
On Wed, Sep 19, 2018 at 1:50 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> On Wed, Sep 19, 2018 at 6:19 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > Can you not uglify the code a bit by using normal (non-compound)
> > literals as described in the response to that email?
>
> I can define a bunch of variables, and then string them all together
> in an array at the end. This is a bit ugly and less maintainable, but
> would be the best of both worlds.

Alright, I wrote a little program to do this for me in a way that
actually doesn't look half bad. Thanks for suggesting.

Jason
On Wed, Sep 19, 2018 at 04:04:01AM +0200, Jason A. Donenfeld wrote:
> Hi Andrew,
>
> On Wed, Sep 19, 2018 at 1:34 AM Andrew Lunn <andrew@lunn.ch> wrote:
> > I see this BUG_ON() is still here. It really needs to be removed. It
> > does not look like you need to crash the kernel here. Can you add in a
> > test of len >= 128, do a WARN and then return. I think you then leak
> > some memory, but I would much prefer that to a crashed machine.
>
> Sure, I'll change it to that.

Great, thanks. I noticed there is at least one more BUG() statement. It
would be good to remove them all. BUG() should only be used when
something bad has already happened and we want to minimise the damage by
killing the machine immediately.

        Andrew
(+ Arnd, Eric)

On 18 September 2018 at 09:16, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
...
> diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
> new file mode 100644
> index 000000000000..83dfd63988c0
> --- /dev/null
> +++ b/lib/zinc/Makefile
> @@ -0,0 +1,4 @@

Apologies for not spotting these before:

> +ccflags-y := -O3

-O3 optimization has been problematic in the past, at least on x86 but
I think on other architectures as well. Please stick with -O2.

> +ccflags-y += -Wframe-larger-than=$(if (CONFIG_KASAN),16384,8192)

There is no way we can support code in the kernel with that kind of
stack space requirements. I will let Arnd comment on what we typically
allow, since he deals with such issues on a regular basis.

> +ccflags-y += -D'pr_fmt(fmt)="zinc: " fmt'
> +ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
> --
> 2.19.0
Hey Arnd,

On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
> Right, if you hit a stack requirement like this, it's usually the compiler
> doing something bad, not just using too much stack but also generating
> rather slow object code in the process. It's better to fix the bug by
> optimizing the code to not spill registers to the stack.
>
> In the long run, I'd like to reduce the stack frame size further, so
> best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
> (on 64-bit) is a bug in the code, and stay below that.
>
> For prototyping, you can just mark the broken functions individually
> by setting the warning limit for a specific function that is known to
> be misoptimized by the compiler (with a comment about which compiler
> and architectures are affected), but not override the limit for the
> entire file.

Thanks for the explanation. Fortunately in my case, the issues were
trivially fixable to get it under 1024/1280. (By the way, why does
64-bit have a slightly larger stack frame? To account for 32 pointers
taking double the space or something?) That will be rectified in v6.

There is one exception though: sometimes KASAN bloats the frame on
64-bit compiles. How would you feel about me adding
'ccflags-$(CONFIG_KASAN) += -Wframe-larger-than=16384' in my makefile?
I'm not remotely close to reaching that limit (it's just a tiny bit over
1280), but it does seem like telling gcc to quiet down about stack
frames when KASAN is being used might make sense. Alternatively, I see
the defaults for FRAME_WARN are:

config FRAME_WARN
        int "Warn for stack frames larger than (needs gcc 4.4)"
        range 0 8192
        default 3072 if KASAN_EXTRA
        default 2048 if GCC_PLUGIN_LATENT_ENTROPY
        default 1280 if (!64BIT && PARISC)
        default 1024 if (!64BIT && !PARISC)
        default 2048 if 64BIT

What about changing that KASAN_EXTRA into just KASAN? This seems
cleanest; I'll send a patch for it.

On the other hand, this KASAN behavior is only observable on 64-bit
systems when I force it to be 1280, whereas the default is still 2048,
so probably this isn't a problem *now* for this patchset. But it is
something to think about for down the road when you lower the default
frame.

Regards,
Jason
Hi Ard,

On Thu, Sep 20, 2018 at 5:42 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Apologies for not spotting these before:
>
> > +ccflags-y := -O3
>
> -O3 optimization has been problematic in the past, at least on x86 but
> I think on other architectures as well. Please stick with -O2.

Fixed up for v6.

> > +ccflags-y += -Wframe-larger-than=$(if (CONFIG_KASAN),16384,8192)
>
> There is no way we can support code in the kernel with that kind of
> stack space requirements. I will let Arnd comment on what we typically
> allow, since he deals with such issues on a regular basis.

Also fixed up for v6 by just removing that line entirely.

Jason
On Fri, Sep 21, 2018 at 02:11:43AM +0200, Jason A. Donenfeld wrote:
> Hey Arnd,
>
> On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > Right, if you hit a stack requirement like this, it's usually the compiler
> > doing something bad, not just using too much stack but also generating
> > rather slow object code in the process. It's better to fix the bug by
> > optimizing the code to not spill registers to the stack.
> >
> > In the long run, I'd like to reduce the stack frame size further, so
> > best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
> > (on 64-bit) is a bug in the code, and stay below that.
> >
> > For prototyping, you can just mark the broken functions individually
> > by setting the warning limit for a specific function that is known to
> > be misoptimized by the compiler (with a comment about which compiler
> > and architectures are affected), but not override the limit for the
> > entire file.
>
> Thanks for the explanation. Fortunately in my case, the issues were
> trivially fixable to get it under 1024/1280. (By the way, why does
> 64-bit have a slightly larger stack frame? To account for 32 pointers
> taking double the space or something?) That will be rectified in v6.

Hi Jason

Do you have any stack usage information?

A VPN can be at the bottom of some deep stack calls. Swap on NFS over
the VPN? If you have one frame of 1K, you might be O.K. But if you have
a few of these, I can see there might be issues of overflowing the
stack.

        Andrew
> On Sep 20, 2018, at 8:12 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>
>> On Fri, Sep 21, 2018 at 02:11:43AM +0200, Jason A. Donenfeld wrote:
>> Hey Arnd,
>>
>>> On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
>>> Right, if you hit a stack requirement like this, it's usually the compiler
>>> doing something bad, not just using too much stack but also generating
>>> rather slow object code in the process. It's better to fix the bug by
>>> optimizing the code to not spill registers to the stack.
>>>
>>> In the long run, I'd like to reduce the stack frame size further, so
>>> best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
>>> (on 64-bit) is a bug in the code, and stay below that.
>>>
>>> For prototyping, you can just mark the broken functions individually
>>> by setting the warning limit for a specific function that is known to
>>> be misoptimized by the compiler (with a comment about which compiler
>>> and architectures are affected), but not override the limit for the
>>> entire file.
>>
>> Thanks for the explanation. Fortunately in my case, the issues were
>> trivially fixable to get it under 1024/1280. (By the way, why does
>> 64-bit have a slightly larger stack frame? To account for 32 pointers
>> taking double the space or something?) That will be rectified in v6.
>
> Hi Jason
>
> Do you have any stack usage information?
>
> A VPN can be at the bottom of some deep stack calls. Swap on NFS over
> the VPN? If you have one frame of 1K, you might be O.K. But if you
> have a few of these, I can see there might be issues of overflowing
> the stack.

At the risk of suggesting something awful: on x86_64, since we turn
preemption off for simd, it wouldn't be *completely* insane to do the
crypto on the irq stack. It would look like:

kernel_fpu_call(func, arg);

And this helper would disable preemption, enable FPU, switch to the irq
stack, call func(arg), disable FPU, enable preemption, and return. And
we can have large IRQ stacks.

I refuse to touch this with a ten-foot pole until the lazy FPU restore
patches land.

All that being said, why are these frames so large? It sounds like
something may be spilling that ought not to.
Hi Andy,

On Fri, Sep 21, 2018 at 5:23 AM Andy Lutomirski <luto@amacapital.net> wrote:
> At the risk of suggesting something awful: on x86_64, since we turn
> preemption off for simd, it wouldn't be *completely* insane to do the
> crypto on the irq stack. It would look like:
>
> kernel_fpu_call(func, arg);
>
> And this helper would disable preemption, enable FPU, switch to the irq
> stack, call func(arg), disable FPU, enable preemption, and return. And
> we can have large IRQ stacks.
>
> I refuse to touch this with a ten-foot pole until the lazy FPU restore
> patches land.

Haha. That's fun, and maybe we'll do that at some point, but I have some
other reasons too for being on a workqueue now.

> All that being said, why are these frames so large? It sounds like
> something may be spilling that ought not to.

They're not. Well, they're not anymore. I had a silly thing before like
"u8 buffer[1 << 12]" in some debugging code, which is what prompted the
ccflags-y addition. I cleaned up the mistakes like that and frames are
now reasonable everywhere. Non-issue.

Jason
Hi Ard,

On Fri, Sep 21, 2018 at 6:30 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> Did any of the fuzzing/testing you mention in the cover letter occur
> on the kernel versions of these algorithms?

Yes, of course.

Jason
> On Sep 20, 2018, at 9:30 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> On 20 September 2018 at 21:15, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Hi Andy,
>>
>>> On Fri, Sep 21, 2018 at 5:23 AM Andy Lutomirski <luto@amacapital.net> wrote:
>>> At the risk of suggesting something awful: on x86_64, since we turn
>>> preemption off for simd, it wouldn't be *completely* insane to do the
>>> crypto on the irq stack. It would look like:
>>>
>>> kernel_fpu_call(func, arg);
>>>
>>> And this helper would disable preemption, enable FPU, switch to the irq
>>> stack, call func(arg), disable FPU, enable preemption, and return. And
>>> we can have large IRQ stacks.
>>>
>>> I refuse to touch this with a ten-foot pole until the lazy FPU restore
>>> patches land.
>>
>> Haha. That's fun, and maybe we'll do that at some point, but I have
>> some other reasons too for being on a workqueue now.
>
> Kernel mode crypto is callable from any context, and SIMD can be used
> in softirq context on arm64 (and on x86, even from hardirq context
> IIRC if the interrupt is taken from userland), in which case we'd
> already be on the irq stack.

The x86_64 irq stack handles nesting already.
On Sat, Sep 22, 2018 at 6:11 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Sep 20, 2018 at 5:12 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > Hey Arnd,
> >
> > On Thu, Sep 20, 2018 at 6:02 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > Right, if you hit a stack requirement like this, it's usually the compiler
> > > doing something bad, not just using too much stack but also generating
> > > rather slow object code in the process. It's better to fix the bug by
> > > optimizing the code to not spill registers to the stack.
> > >
> > > In the long run, I'd like to reduce the stack frame size further, so
> > > best assume that anything over 1024 bytes (on 32-bit) or 1280 bytes
> > > (on 64-bit) is a bug in the code, and stay below that.
> > >
> > > For prototyping, you can just mark the broken functions individually
> > > by setting the warning limit for a specific function that is known to
> > > be misoptimized by the compiler (with a comment about which compiler
> > > and architectures are affected), but not override the limit for the
> > > entire file.
> >
> > Thanks for the explanation. Fortunately in my case, the issues were
> > trivially fixable to get it under 1024/1280
>
> A lot of these bugs are not trivial, but we still need a full analysis of
> what failed and what the possible mitigations are. Can you describe more
> specifically what causes it?

I think I misread your earlier sentence and thought you had said the
exact opposite. For confirmation, I've downloaded your git tree and
built it with my collection of compilers (gcc-4.6 through 8.1) and tried
building it in various configurations.

Nothing alarming stood out; the only thing that I think might warrant
some investigation is this one:

lib/zinc/curve25519/curve25519-hacl64.h: In function 'curve25519_generic':
lib/zinc/curve25519/curve25519-hacl64.h:785:1: warning: the frame size of 1536 bytes is larger than 500 bytes [-Wframe-larger-than=]

Without KASAN, this takes 832 bytes, which is still more than it should
use from a look at the source code. I first suspected some
misoptimization around the get/put_unaligned_le64() calls, but playing
around with it some more led me to this patch:

diff --git a/lib/zinc/curve25519/curve25519-hacl64.h b/lib/zinc/curve25519/curve25519-hacl64.h
index c7b2924a68c2..1f6eb5708a0e 100644
--- a/lib/zinc/curve25519/curve25519-hacl64.h
+++ b/lib/zinc/curve25519/curve25519-hacl64.h
@@ -182,8 +182,7 @@ static __always_inline void fmul_mul_shift_reduce_(u128 *output, u64 *input,
 static __always_inline void fmul_fmul(u64 *output, u64 *input, u64 *input21)
 {
-       u64 tmp[5];
-       memcpy(tmp, input, 5 * sizeof(*input));
+       u64 tmp[5] = { input[0], input[1], input[2], input[3], input[4] };
        {
                u128 b4;
                u128 b0;

That change gets it down to

lib/zinc/curve25519/curve25519-hacl64.h: In function 'curve25519_generic':
lib/zinc/curve25519/curve25519-hacl64.h:788:1: warning: the frame size of 640 bytes is larger than 500 bytes [-Wframe-larger-than=]

with KASAN, or 496 bytes without it. This indicates that there might be
something wrong with either gcc-8 or with our fortified memset()
function that requires more investigation. Maybe you can see something
there that I missed.

        Arnd