Message ID | 20190925161255.1871-1-ard.biesheuvel@linaro.org
---|---
Series | crypto: wireguard using the existing crypto API
Hi Ard,

Thanks for taking the initiative on this. When I first discussed with DaveM porting WireGuard to the crypto API and doing Zinc later, yesterday, I thought to myself, “I wonder if Ard might want to work on this with me…” and sent you a message on IRC. It didn’t occur to me that you were the one who had pushed this endeavor!

I must admit, though, I’m a bit surprised to see how it’s appearing. When I wrote [1], I had really imagined postponing the goals of Zinc entirely, and instead writing a small shim that calls into the existing crypto API machinery. I imagined the series to look like this:

1. Add blake2s generic as a crypto API shash object.
2. Add blake2s x86_64 as a crypto API shash object.
3. Add curve25519 generic as a crypto API dh object.
4. Add curve25519 x86_64 as a crypto API dh object.
5. Add curve25519 arm as a crypto API dh object.
6. The unmodified WireGuard commit.
7. A “cryptoapi.c” file for WireGuard that provides definitions of the “just functions” approach of Zinc, but does so in terms of the crypto API’s infrastructure, with global per-cpu lists and a few locks to handle quick buffer and tfm reuse.

I wouldn’t expect (7) to be pretty, for the various reasons that most people dislike the crypto API, but at least it would somewhat “work”, not affect the general integrity of WireGuard, and provide a clear path forward in an evolutionary manner for gradually, piecemeal, swapping out pieces of that for a Zinc-like thing, however that winds up appearing.

Instead what we’ve wound up with in this series is a Frankenstein’s monster of Zinc, which appears to have basically the same goal as Zinc, and even much of the same implementation just moved to a different directory, but then skimps on making it actually work well and introduces problems. (I’ll elucidate on some specific issues later in this email so that we can get on the same page with regards to security requirements for WireGuard.)
I surmise from this Zinc-but-not series that what is actually going on here is mostly some kind of power or leadership situation, which is what you’ve described to me also at various other points and in person. I also recognize that I am at least partway to blame for whatever dynamic has stagnated this process; let me try to rectify that:

A principal objection you’ve had is that Zinc moves to its own directory, with its own name, and tries to segment itself off from the rest of the crypto API’s infrastructure. You’ve always felt this should be mixed in with the rest of the crypto API’s infrastructure and directory structures in one way or another. Let’s do both of those things – put this in a directory structure you find appropriate and hook this into the rest of the crypto API’s infrastructure in a way you find appropriate. I might disagree, which is why Zinc does things the way it does, but I’m open to compromise and doing things more your way.

Another objection you’ve had is that Zinc replaces many existing implementations with its own. Martin wasn’t happy about that either. So let’s not do that, and we’ll have some wholesale replacement of implementations in future patchsets at future dates, discussed and benched and bikeshedded independently from this.

Finally, perhaps most importantly, Zinc’s been my design rather than our design. Let’s do this together instead of me git-send-email(1)-ing a v37.

If the process of doing that together will be fraught with difficulty, I’m still open to the “7 patch series” with the ugly cryptoapi.c approach, as described at the top. But I think if we start with Zinc and whittle it down in accordance with the above, we’ll get something mutually acceptable, and somewhat similar to this series, with a few important exceptions, which illustrate some of the issues I see in this RFC:

Issue 1) No fast implementations for the “it’s just functions” interface.

This is a deal breaker.
I know you disagree here and perhaps think all dynamic dispatch should be by loadable modules configured with userspace policy and lots of function pointers and dynamically composable DSL strings, as the current crypto API does it. But I think a lot of other people agree with me here (and they’ve chimed in before) that the branch predictor does things better, doesn’t have Spectre issues, and is very simple to read and understand. For reference, here’s what that kind of thing looks like: [2].

In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective? The protocol succeeds at this goal, but in many ways just by a hair when at scale, and so I’m really quite loath to decrease handshake performance. Here’s where that matters specifically:

- Curve25519 does indeed always appear to be taking tiny 32 byte stack inputs in WireGuard. However, your statement, “the fact that they operate on small, fixed size buffers means that there is really no point in providing alternative, SIMD based implementations of these, and we can limit ourselves to generic C library version,” is just plain wrong in this case. Curve25519 only ever operates on 32 byte inputs, because these represent curve scalars and points. It’s not like a block cipher, where parallelism helps with larger inputs or something. In this case, there are some pretty massive speed improvements between the generic C implementations and the optimized ones. Like huge. On both ARM and on Intel. And Curve25519 is the most expensive operation in WireGuard, and each handshake message invokes a few of them.
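For readers who don't want to chase [2], the glue-code style being referenced boils down to a capability flag set once at init plus a plain branch in the entry point, rather than function pointers. Here is a minimal, self-contained C sketch of that pattern; the function names and the trivial XOR body are illustrative stand-ins, not the actual Zinc or kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability flag, set once at module init
 * (e.g. from CPUID / HWCAP probing). */
static bool have_simd;

/* Reference C implementation: always available on every platform.
 * The body is a trivial stand-in for the real cipher. */
static void chacha20_generic(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src[i] ^ 0x5a;
}

/* Optimized implementation, only valid when have_simd is true.
 * It must produce output identical to the generic version. */
static void chacha20_simd(uint8_t *dst, const uint8_t *src, size_t len)
{
	chacha20_generic(dst, src, len);  /* stand-in for SSSE3/AVX2 code */
}

/* The "it's just functions" entry point: a plain branch, no function
 * pointers.  The branch predictor learns the direction after one call,
 * and there is no indirect-call Spectre surface. */
void chacha20(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (have_simd)
		chacha20_simd(dst, src, len);
	else
		chacha20_generic(dst, src, len);
}
```

Callers simply invoke `chacha20()`; which implementation runs is an internal detail decided by the flag, which is exactly what makes the interface "just functions" from the caller's perspective.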
(Aside – Something to look forward to: I’m in the process of getting a formally verified x86_64 ADX implementation ready for kernel usage, to replace our existing heavily-fuzzed one, which will be cool.)

- Blake2s actually does benefit from the optimized code even for relatively short inputs. While you might have been focused on the super-super small inputs in noise.c, there are slightly larger ones in cookie.c, and these are the most sensitive computations to make in terms of DoS resistance; they’re on the “front lines” of the battle, if you will. (Aside – Arguably WireGuard may have benefited from using siphash with 128-bit outputs here, or calculated some security metrics for DoS resistance in the face of forged 64-bit outputs or something, or a different custom MAC, but hindsight is 20/20.)

- While 25519 and Blake2s are already in use, the optimized versions of ChaPoly wind up being faster as well, even if it’s just hitting the boring SSE code.

- On MIPS, the optimized versions of ChaPoly are a necessity. They’re boring integer/scalar code, but they do things that the compiler simply cannot do on the platform, and we benefit immensely from it.

Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.

Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.

I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shudder. But there’s actually another issue at play: wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all part of the handshake. We cannot afford to allocate a brand new crypto object, parse the DSL string, connect all those function pointers, etc.
The allocations involved here aren’t really okay at all in that path. That’s why the cryptoapi.c idea above involves just using a pool of pre-allocated objects if we’re going to be using that API at all. Also keep in mind that WireGuard instances sometimes have hundreds of thousands of peers.

I’d recommend leaving this synchronous as it exists now, as Linus suggested, and we can revisit that later down the road. There are a number of improvements to the async API we could make down the line that could make this viable in WireGuard. For example, I could imagine decoupling the creation of the cipher object from its keys and intermediate buffers, so that we could in fact allocate the cipher objects with their DSLs globally in a safe way, while allowing the keys and working buffers to come from elsewhere. This is deep plumbing into the async API, but I think we could get there in time.

There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API. At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

Issue 3) WireGuard patch is out of date.

This is my fault, because I haven’t posted in a long time. There are some important changes in the main WireGuard repo. I’ll roll another patch soon for this so we have something recent to work off of. Sorry about that.

Issue 4) FPU register batching?

When I introduced the simd_get/simd_put/simd_relax thing, people seemed to think it was a good idea. My benchmarks of it showed significant throughput improvements. Your patchset doesn’t have anything similar to this. But on the other hand, last I spoke with the x86 FPU guys, I thought they might actually be in the process of making simd_get/put obsolete with some internal plumbing to make restoration lazier.
I’ll see tglx later today and will poke him about this, as this might already be a non-issue.

So given the above, how would you like to proceed? My personal preference would be to see you start with the Zinc patchset and rename things and change the infrastructure to something that fits your preferences, and we can see what that looks like. Less appealing would be to do several iterations of you reworking Zinc from scratch and going through the exercises all over again, but if you prefer that, I guess I could cope. Alternatively, maybe this is a lot to chew on, and we should just throw caution into the wind, implement cryptoapi.c for WireGuard (as described at the top), and add C functions to the crypto API sometime later? This is what I had envisioned in [1].

And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues. Also, if you’re into chat, we can always spec some of the nitty-gritty aspects out over IRC or even the old-fashioned telephone. Thanks again for pushing this forward.

Regards,
Jason

[1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54
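The pool of pre-allocated objects mentioned under Issue 2 above could look roughly like the following. This is a simplified, single-threaded C sketch with hypothetical names; the real cryptoapi.c would hold real crypto tfm objects and guard the free list with a spinlock or keep per-cpu lists, as the email describes:

```c
#include <stddef.h>

/* Stand-in for a heavyweight, expensive-to-construct crypto
 * transform object (the thing whose DSL gets parsed once). */
struct tfm {
	struct tfm *next;
	int id;
};

#define POOL_SIZE 4
static struct tfm pool_storage[POOL_SIZE];
static struct tfm *free_list;

/* Construct every object up front, so the handshake hot path
 * never has to allocate or parse anything. */
void pool_init(void)
{
	free_list = NULL;
	for (int i = 0; i < POOL_SIZE; i++) {
		pool_storage[i].id = i;
		pool_storage[i].next = free_list;
		free_list = &pool_storage[i];
	}
}

/* O(1) grab from the free list; returns NULL when exhausted.
 * In the kernel this would be protected by a lock or be per-cpu. */
struct tfm *pool_get(void)
{
	struct tfm *t = free_list;
	if (t)
		free_list = t->next;
	return t;
}

/* O(1) return to the free list. */
void pool_put(struct tfm *t)
{
	t->next = free_list;
	free_list = t;
}
```

The point of the design is that the expensive construction cost is paid once at init time, and the per-handshake cost is just a pointer swap.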
> There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API.

Actually, that assumption is factually wrong. I don't know if anything is *publicly* available, but I can assure you the silicon is running in labs already. And something will be publicly available early next year at the latest. Which could nicely coincide with having Wireguard support in the kernel (which I would also like to see happen, BTW) ...

> At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

Not "at some point". It will. Very soon. Maybe not in consumer or server CPUs, but definitely in the embedded (networking) space. And it *will* be much faster than the embedded CPU next to it, so it will be worth using it for something like bulk packet encryption.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > Actually, that assumption is factually wrong. I don't know if anything > is *publicly* available, but I can assure you the silicon is running in > labs already. Great to hear, and thanks for the information. I'll follow up with some questions on this in another thread.
On Thu, 26 Sep 2019 at 10:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> ...
> Instead what we’ve wound up with in this series is a Frankenstein’s monster of Zinc, which appears to have basically the same goal as Zinc, and even much of the same implementation just moved to a different directory, but then skimps on making it actually work well and introduces problems. (I’ll elucidate on some specific issues later in this email so that we can get on the same page with regards to security requirements for WireGuard.) I surmise from this Zinc-but-not series is that what actually is going on here is mostly some kind of power or leadership situation, which is what you’ve described to me also at various other points and in person.

I'm not sure what you are alluding to here. I have always been very clear about what I like about Zinc and what I don't like about Zinc. I agree that it makes absolutely no sense for casual, in-kernel crypto to jump through all the hoops that the crypto API requires. But for operating on big chunks of data on the kernel heap, we have an existing API that we should leverage if we can, and fix if we need to, so that all its users can benefit.

> I also recognize that I am at least part way to blame for whatever dynamic there has stagnated this process; let me try to rectify that:
>
> A principle objection you’ve had is that Zinc moves to its own directory, with its own name, and tries to segment itself off from the rest of the crypto API’s infrastructure. You’ve always felt this should be mixed in with the rest of the crypto API’s infrastructure and directory structures in one way or another. Let’s do both of those things – put this in a directory structure you find appropriate and hook this into the rest of the crypto API’s infrastructure in a way you find appropriate. I might disagree, which is why Zinc does things the way it does, but I’m open to compromise and doing things more your way.
It doesn't have to be your way or my way. The whole point of being part of this community is that we find solutions that work for everyone, through discussion and iterative prototyping. Turning up out of the blue with a 50,000 line patch set and a take-it-or-leave-it attitude goes counter to that, and this is why we have made so little progress over the past year. But I am happy with your willingness to collaborate and find common ground, which was also my motivation for spending a considerable amount of time to prepare this patch set.

> Another objection you’ve had is that Zinc replaces many existing implementations with its own. Martin wasn’t happy about that either. So let’s not do that, and we’ll have some wholesale replacement of implementations in future patchsets at future dates discussed and benched and bikeshedded independently from this.
>
> Finally, perhaps most importantly, Zinc’s been my design rather than our design. Let’s do this together instead of me git-send-email(1)-ing a v37.
>
> If the process of doing that together will be fraught with difficulty, I’m still open to the “7 patch series” with the ugly cryptoapi.c approach, as described at the top.

If your aim is to write ugly code and use that as a munition

> But I think if we start with Zinc and whittle it down in accordance with the above, we’ll get something mutually acceptable, and somewhat similar to this series, with a few important exceptions, which illustrate some of the issues I see in this RFC:
>
> Issue 1) No fast implementations for the “it’s just functions” interface.
>
> This is a deal breaker. I know you disagree here and perhaps think all dynamic dispatch should be by loadable modules configured with userspace policy and lots of function pointers and dynamically composable DSL strings, as the current crypto API does it.
> But I think a lot of other people agree with me here (and they’ve chimed in before) that the branch predictor does things better, doesn’t have Spectre issues, and is very simple to read and understand. For reference, here’s what that kind of thing looks like: [2].

This is one of the issues in the 'fix it for everyone else as well' category. If we can improve the crypto API to be less susceptible to these issues (e.g., using static calls), everybody benefits. I'd be happy to collaborate on that.

> In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective. The protocol succeeds at this goal, but in many ways, just by a hair when at scale, and so I’m really quite loathe to decrease handshake performance.
> ...
> Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.

So are you saying that the handshake timing constraints in the WireGuard protocol are so stringent that we can't run it securely on, e.g., an ARM CPU that lacks a NEON unit? Or, given that you are not providing accelerated implementations of blake2s or Curve25519 for arm64, we can't run it securely on arm64 at all?

Typically, I would prefer to only introduce different versions of the same algorithm if there is a clear performance benefit for an actual use case.

Framing this as a security issue rather than a performance issue is slightly disingenuous, since people are less likely to challenge it.
But the security of any VPN protocol worth its salt should not hinge on the performance delta between the reference C code and a version that was optimized for a particular CPU.

> Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.
>
> I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shutter.

Your implementation of RFC7539 truncates the nonce to 64 bits, while RFC7539 defines a clear purpose for the bits you omit. Since the Zinc library is intended to be standalone (and you are proposing its use in other places, like big_keys.c), you might want to document your justification for doing so in the general case, instead of ridiculing the code I needed to write to work around this limitation.

> But there’s actually another issue at play:
>
> wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all part of the handshake. We cannot afford to allocate a brand new crypto object, parse the DSL string, connect all those function pointers, etc.

Parsing the string and connecting the function pointers happens only once, and only when the transform needs to be instantiated from its constituent parts. Subsequent invocations will just grab the existing object.

> The allocations involved here aren’t really okay at all in that path. That’s why the cryptoapi.c idea above involves just using a pool of pre-allocated objects if we’re going to be using that API at all. Also keep in mind that WireGuard instances sometimes have hundreds of thousands of peers.

My preference would be to address this by permitting per-request keys in the AEAD layer.
That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).

> I’d recommend leaving this synchronous as it exists now, as Linus suggested, and we can revisit that later down the road. There are a number of improvements to the async API we could make down the line that could make this viable in WireGuard. For example, I could imagine decoupling the creation of the cipher object from its keys and intermediate buffers, so that we could in fact allocate the cipher objects with their DSLs globally in a safe way, while allowing the keys and working buffers to come from elsewhere. This is deep plumbing into the async API, but I think we could get there in time.

My changes actually move all the rfc7539() intermediate buffers to the stack, so the only remaining allocation is the per-keypair one.

> There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API. At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

As I already pointed out, we have supported hardware already: CAAM is in mainline, and Inside Secure patches are on the list.

> Issue 3) WireGuard patch is out of date.
>
> This is my fault, because I haven’t posted in a long time. There are some important changes in the main WireGuard repo. I’ll roll another patch soon for this so we have something recent to work off of. Sorry about that.

This is the reason I included your WG patch verbatim: to make it easier to rebase to newer versions. In fact, I never intended or expected anything but discussion from this submission, let alone anyone actually merging it :-)

> Issue 4) FPU register batching?
> When I introduced the simd_get/simd_put/simd_relax thing, people seemed to think it was a good idea. My benchmarks of it showed significant throughput improvements. Your patchset doesn’t have anything similar to this.

It uses the existing SIMD batching, and enhances it slightly for the Poly1305/shash case.

> But on the other hand, last I spoke with the x86 FPU guys, I thought they might actually be in the process of making simd_get/put obsolete with some internal plumbing to make restoration lazier. I’ll see tglx later today and will poke him about this, as this might already be a non-issue.

We've already made improvements here for arm64 as well (and ARM already used lazy restore). But I think it still makes sense to amortize these calls over a reasonable chunk of data, i.e., a packet.

> So given the above, how would you like to proceed? My personal preference would be to see you start with the Zinc patchset and rename things and change the infrastructure to something that fits your preferences, and we can see what that looks like. Less appealing would be to do several iterations of you reworking Zinc from scratch and going through the exercises all over again, but if you prefer that I guess I could cope. Alternatively, maybe this is a lot to chew on, and we should just throw caution into the wind, implement cryptoapi.c for WireGuard (as described at the top), and add C functions to the crypto API sometime later? This is what I had envisioned in [1].

It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.

I am not convinced that we need accelerated implementations of blake2s and curve25519, but if we do, I'd like those to be implemented as individual modules under arch/*/crypto, with some module loader fu for weak symbols or static calls thrown in if we have to. Exposing them as shashes seems unnecessary to me at this point.
My only objection to your simd get/put interface is that it uses a typedef rather than a struct definition. (I also wonder how we can avoid two instances living on the same call stack, unless we forbid functions that take a struct simd* from calling functions that don't take one, but these are details we should be able to work out.)

What I *don't* want is to merge WireGuard with its own library-based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.

> And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues. Also, if you’re into chat, we can always spec some of the nitty-gritty aspects out over IRC or even the old-fashioned telephone. Thanks again for pushing this forward.

My pleasure :-)
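For reference, the simd_get/simd_put/simd_relax batching being discussed amounts to amortizing one expensive FPU save/restore over a whole batch of packets, while still giving the scheduler a chance to interrupt between packets. The following is a toy C model of that idea, with mock counters standing in for kernel_fpu_begin()/kernel_fpu_end(); it is not the kernel's implementation, just an illustration of why the per-batch cost is one save instead of one per packet:

```c
#include <stdbool.h>

/* Mock cost counters standing in for kernel_fpu_begin()/end(). */
static int fpu_begin_calls, fpu_end_calls;

/* Zinc exposes this as a typedef (one of the points of contention). */
typedef struct {
	bool in_use;
} simd_context_t;

static void simd_get(simd_context_t *ctx)
{
	ctx->in_use = false;          /* lazy: FPU not yet claimed */
}

/* Claim the FPU on first actual use; subsequent calls are free. */
static void simd_ensure(simd_context_t *ctx)
{
	if (!ctx->in_use) {
		fpu_begin_calls++;    /* models kernel_fpu_begin() */
		ctx->in_use = true;
	}
}

/* Called between packets: would drop the FPU only if, e.g.,
 * preemption is needed; here we model the common fast path
 * where it is kept. */
static void simd_relax(simd_context_t *ctx)
{
	(void)ctx;
}

static void simd_put(simd_context_t *ctx)
{
	if (ctx->in_use) {
		fpu_end_calls++;      /* models kernel_fpu_end() */
		ctx->in_use = false;
	}
}

/* Encrypt a batch of packets under a single FPU section.
 * Returns the cumulative number of FPU saves performed so far. */
int encrypt_batch(int npackets)
{
	simd_context_t simd;

	simd_get(&simd);
	for (int i = 0; i < npackets; i++) {
		simd_ensure(&simd);   /* only the first packet pays */
		/* ... SIMD ChaCha20-Poly1305 on packet i ... */
		simd_relax(&simd);
	}
	simd_put(&simd);
	return fpu_begin_calls;
}
```

Without the batching, every packet would pay the begin/end cost; with it, a batch of any size pays it once.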
> > In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective. The protocol succeeds at this goal, but in many ways, just by a hair when at scale, and so I’m really quite loathe to decrease handshake performance.
> ...
> > Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.
>
> So are you saying that the handshake timing constraints in the WireGuard protocol are so stringent that we can't run it securely on, e.g., an ARM CPU that lacks a NEON unit? Or given that you are not providing accelerated implementations of blake2s or Curve25519 for arm64, we can't run it securely on arm64 at all?
>
> Typically, I would prefer to only introduce different versions of the same algorithm if there is a clear performance benefit for an actual use case.
>
> Framing this as a security issue rather than a performance issue is slightly disingenuous, since people are less likely to challenge it. But the security of any VPN protocol worth its salt should not hinge on the performance delta between the reference C code and a version that was optimized for a particular CPU.

Fully agree with that last statement. Security of a protocol should *never* depend on the performance of a particular implementation. I may want to run this on a very constrained embedded system that would necessarily be very slow, and I would still want that to be secure. If this is true, it's pretty much a deal-breaker to me ...
Which would be a shame, because I really do like some of the other things Wireguard does, and just the effort of improving VPN in general.

> > Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.
> >
> > I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shutter.
>
> Your implementation of RFC7539 truncates the nonce to 64-bits, while RFC7539 defines a clear purpose for the bits you omit. Since the Zinc library is intended to be standalone (and you are proposing its use in other places, like big_keys.c), you might want to document your justification for doing so in the general case, instead of ridiculing the code I needed to write to work around this limitation.

From RFC7539: "Some protocols may have unique per-invocation inputs that are not 96 bits in length. For example, IPsec may specify a 64-bit nonce. In such a case, it is up to the protocol document to define how to transform the protocol nonce into a 96-bit nonce, <<for example, by concatenating a constant value.>>"

So concatenating zeroes within the protocol is fine (if you can live with the security consequences), but a generic library function should of course take all 96 bits as input(!)

Actually, the rfc7539esp variant already takes that part of the nonce from the key, not the IV. This may be more convenient for use with Wireguard as well? Just force the trailing nonce portion of the key to zeroes when calling setkey().

> My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).

This part I do not really understand. Why would you need to allocate a new transform if you change the key?
Why can't you just call setkey() on the already allocated transform?

> It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.

Maybe not for an initial upstream, but it should be a longer-term goal.

> What I *don't* want is to merge WireGuard with its own library based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.

What's wrong with a step-by-step approach, though? I.e., merge it with library calls now and then gradually work towards the goal of integrating (a tweaked version of) the Crypto API where that actually makes sense? Rome wasn't built in one day either ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
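To make the RFC 7539 passage quoted above concrete: a protocol that only has a 64-bit per-invocation value can widen it to the 96-bit AEAD nonce by concatenating a 32-bit constant in front of the little-endian counter. A small illustrative helper (the function name is hypothetical; WireGuard's data channel effectively uses a zero constant here):

```c
#include <stdint.h>

/* Build a 96-bit RFC 7539 AEAD nonce from a 32-bit constant and a
 * 64-bit per-invocation counter, both serialized little-endian.
 * With constant == 0 this matches the "concatenate zeroes" case
 * discussed in the thread. */
void chacha20poly1305_nonce(uint8_t nonce[12], uint32_t constant,
			    uint64_t counter)
{
	nonce[0] = (uint8_t)(constant);
	nonce[1] = (uint8_t)(constant >> 8);
	nonce[2] = (uint8_t)(constant >> 16);
	nonce[3] = (uint8_t)(constant >> 24);
	for (int i = 0; i < 8; i++)
		nonce[4 + i] = (uint8_t)(counter >> (8 * i));
}
```

A generic library entry point, per Pascal's argument, would instead take all 12 nonce bytes directly and leave the widening policy to the protocol.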
On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote:
...
> > My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).
> >
> This part I do not really understand. Why would you need to allocate a new transform if you change the key? Why can't you just call setkey() on the already allocated transform?

Because the single transform will be shared between all users running on different CPUs etc., and so the key should not be part of the TFM state but of the request state.

> > It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.
> >
> Maybe not for an initial upstream, but it should be a longer-term goal.
>
> > What I *don't* want is to merge WireGuard with its own library based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.
> >
> What's wrong with a step-by-step approach though? i.e. merge it with library calls now and then gradually work towards the goal of integrating (a tweaked version of) the Crypto API where that actually makes sense? Rome wasn't built in one day either ...

I should clarify: what I don't want is two frameworks in the kernel for doing async crypto, the existing one plus a new library-based one.
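The per-request-key idea being debated here can be sketched with plain data structures. In this mock-up (illustrative types only, not the actual crypto API), the transform holds only the immutable algorithm choice and is shared by everyone, while each request carries its own key, so no setkey() on shared state is needed on the hot path:

```c
#include <stdint.h>
#include <string.h>

/* One shared, immutable transform: algorithm identity only, no key.
 * In the real crypto API this is the expensive-to-instantiate tfm. */
struct tfm {
	const char *alg;  /* e.g. "rfc7539(chacha20,poly1305)" */
};

/* Per-request state carries the key, so many CPUs/peers can use the
 * same tfm concurrently without mutating it. */
struct aead_request {
	const struct tfm *tfm;
	uint8_t key[32];          /* per-keypair session key */
	uint64_t nonce;
};

void request_init(struct aead_request *req, const struct tfm *tfm,
		  const uint8_t key[32])
{
	req->tfm = tfm;
	memcpy(req->key, key, 32);
	req->nonce = 0;
}

/* The hot path reads the key from the request, never from the tfm. */
const uint8_t *request_key(const struct aead_request *req)
{
	return req->key;
}
```

This is why "different CPUs" matters in Ard's reply: with the key in the shared TFM, two CPUs wanting different keys would race on setkey(); with the key in the request, the shared object is read-only.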
> -----Original Message-----
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Sent: Thursday, September 26, 2019 3:16 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
>
> On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote:
> ...
> > > My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations)
> > >
> > This part I do not really understand. Why would you need to allocate a new transform if you change the key? Why can't you just call setkey() on the already allocated transform?
> >
> Because the single transform will be shared between all users running on different CPUs etc, and so the key should not be part of the TFM state but of the request state.

So you need a transform per user, such that each user can have his own key. But you shouldn't need to reallocate it when the user changes his key. I also don't see how the "different CPUs" is relevant here? I can share a single key across multiple CPUs here just fine ...
Regards, Pascal van Leeuwen Silicon IP Architect, Multi-Protocol Engines @ Verimatrix www.insidesecure.com
On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > > > -----Original Message----- > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > > > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen > > <pvanleeuwen@verimatrix.com> wrote: > > ... > > > > > > > > My preference would be to address this by permitting per-request keys > > > > in the AEAD layer. That way, we can instantiate the transform only > > > > once, and just invoke it with the appropriate key on the hot path (and > > > > avoid any per-keypair allocations) > > > > > > > This part I do not really understand. Why would you need to allocate a > > > > new transform if you change the key? Why can't you just call setkey() > > > > on the already allocated transform? > > > > > > > > Because the single transform will be shared between all users running > > > on different CPUs etc, and so the key should not be part of the TFM > > > state but of the request state. > > > > > So you need a transform per user, such that each user can have his own > > key. But you shouldn't need to reallocate it when the user changes his > > key. I also don't see how the "different CPUs" is relevant here?
I can > share a single key across multiple CPUs here just fine ... > We need two transforms per connection, one for each direction. That is how I currently implemented it, and if allocating/freeing those on the same path where the keypair object itself is allocated is too costly, I wonder why allocating the keypair object itself is fine. But what I am suggesting is to use a single TFM that gets shared by all the connections, where the key for each operation is provided per-request. That TFM cannot have a key set, because each user may use a different key.
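The distinction Ard is drawing can be sketched in a few lines of plain C (illustrative only; `tfm`, `request`, and `effective_key_byte` are made-up names, not the actual kernel crypto API): either the key is TFM state, which forces a transform per connection direction or setkey() serialization, or it rides along with each request, so one shared transform can serve every connection on every CPU.

```c
#include <stddef.h>

/* Illustrative only: `tfm` and `request` are stand-ins for a crypto
 * transform and a per-operation request, not the kernel crypto API. */
struct tfm {
    unsigned char key[32];      /* key as TFM state: shared and racy
                                   across CPUs, or one TFM per user */
};

struct request {
    const unsigned char *key;   /* key as request state: travels with
                                   each operation; TFM stays keyless */
    const unsigned char *src;
    size_t len;
};

/* Pick the key an operation would use: the per-request key when one is
 * supplied, otherwise whatever setkey() stored in the transform. */
static unsigned char effective_key_byte(const struct tfm *t,
                                        const struct request *req)
{
    return req->key ? req->key[0] : t->key[0];
}
```

Under the per-request model, WireGuard could keep one global AEAD transform and hand each peer's session key over in the request itself, which is what would eliminate both the per-keypair transform allocations and cross-CPU setkey() races being discussed.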
> -----Original Message----- > From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Ard Biesheuvel > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen > <pvanleeuwen@verimatrix.com> wrote: > > > > > -----Original Message----- > > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > > > > > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen > > > <pvanleeuwen@verimatrix.com> wrote: > > > ... > > > > > > > > > > My preference would be to address this by permitting per-request keys > > > > > in the AEAD layer. That way, we can instantiate the transform only > > > > > once, and just invoke it with the appropriate key on the hot path (and > > > > > avoid any per-keypair allocations) > > > > > > > > > This part I do not really understand. Why would you need to allocate a > > > > new transform if you change the key? Why can't you just call setkey() > > > > on the already allocated transform? > > > > > > > > > > Because the single transform will be shared between all users running > > > on different CPUs etc, and so the key should not be part of the TFM > > > state but of the request state. > > > > > So you need a transform per user, such that each user can have his own > > key. But you shouldn't need to reallocate it when the user changes his > > key. I also don't see how the "different CPUs" is relevant here? I can > > share a single key across multiple CPUs here just fine ... > > > > We need two transforms per connection, one for each direction.
That is > how I currently implemented it, and it seems to me that, if > allocating/freeing those on the same path as where the keypair object > itself is allocated is too costly, I wonder why allocating the keypair > object itself is fine. > I guess that keypair object is a WireGuard-specific thing? In that case it may not make a difference performance-wise. I just would not expect a new (pair of) transforms to be allocated just for a rekey, only when a new connection is made. Thinking about this some more: allocating a transform is about more than just allocating the object; there may be all kinds of side effects, like fallback ciphers being allocated, specific HW initialization being done, etc. I just feel that if you only need to change the key, you should only change the key, as that's what the driver would be optimized for.
Hi Ard, On Thu, Sep 26, 2019 at 2:07 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > attitude goes counter to that, and this is why we have made so little > progress over the past year. I also just haven't submitted much in the past year, taking a bit of a break to see how things would settle. Seemed like rushing things wasn't prudent, so I slowed down. > But I am happy with your willingness to collaborate and find common > ground, which was also my motivation for spending a considerable > amount of time to prepare this patch set. Super. > > If the process of doing that together will be fraught with difficulty, > > I’m still open to the “7 patch series” with the ugly cryptoapi.c > > approach, as described at the top. > > If your aim is to write ugly code and use that as a munition No, this is not a matter of munition at all. Please take my words seriously; I am entirely genuine here. Three people I greatly respect made a very compelling argument to me, prompting the decision in [1]. The argument was that trying to fix the crypto layer AND trying to get WireGuard merged at the same time was ambitious and crazy. Maybe instead, they argued, I should just use the old crypto API, get WireGuard in, and then begin the Zinc process after. I think procedurally, that's probably good advice, and the people I was talking to seemed to have a really firm grasp on what works and what doesn't in the mainlining process. Now it's possible their judgement is wrong, but I really am open, in earnest, to following it. And the way that would look would be not trying to fix the crypto API now, getting WireGuard in based on whatever we can cobble together based on the current foundations with some intermediate file (cryptoapi.c in my previous email) to prevent it from infecting WireGuard. This isn't "munition"; it's a serious proposal. 
The funny thing, though, is that all the while I was under the impression somebody had figured out a great way to do this, it turns out you were busy with basically Zinc-but-not. So we're back to square one: you and I both want the crypto API to change, and now we have to figure out a way forward together on how to do this, prompting my last email to you, indicating that I was open to all sorts of compromises. However, I still remain fully open to following the prior suggestion, of not doing that at all right now, and instead basing this on the existing crypto API as-is. [1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/ > > reference, here’s what that kind of thing looks like: [2]. > > This is one of the issues in the 'fix it for everyone else as well' > category. If we can improve the crypto API to be less susceptible to > these issues (e.g., using static calls), everybody benefits. I'd be > happy to collaborate on that. Good. I'm happy to learn that you're all for having fast implementations that underlie the simple function calls. > > Taken together, we simply can’t skimp on the implementations available > > on the handshake layer, so we’ll need to add some form of > > implementation selection, whether it’s the method Zinc uses ([2]), or > > something else we cook up together. > > So are you saying that the handshake timing constraints in the > WireGuard protocol are so stringent that we can't run it securely on, > e.g., an ARM CPU that lacks a NEON unit? Or given that you are not > providing accelerated implementations of blake2s or Curve25519 for > arm64, we can't run it securely on arm64 at all? Deployed at scale, the handshake must have a certain performance to not be DoS'd. I've spent a long time benching these and attacking my own code. I won't be comfortable with this going in without the fast implementations for the handshake. 
And down the line, too, we can look into how to improve the DoS resistance even further. I think there's room for improvement, and I hope at some point you'll join us in discussions on WireGuard internals. But the bottom line is that we need those fast primitives. > Typically, I would prefer to only introduce different versions of the > same algorithm if there is a clear performance benefit for an actual > use case. As I was saying, this is indeed the case. > Framing this as a security issue rather than a performance issue is > slightly disingenuous, since people are less likely to challenge it. I'm not being disingenuous. DoS resistance is a real issue with WireGuard. You might argue that FourQ and SipHash would have made better choices, and that's an interesting discussion, but it is what it is. The thing needs fast implementations. And we're going to have to implement that code anyway for other things, so might as well get it working well now. > But the security of any VPN protocol worth its salt You're not required to use WireGuard. > Parsing the string and connecting the function pointers happens only > once, and only when the transform needs to be instantiated from its > constituent parts. Subsequent invocations will just grab the existing > object. That's good to know. It doesn't fully address the issue, though. > My preference would be to address this by permitting per-request keys > in the AEAD layer. That way, we can instantiate the transform only > once, and just invoke it with the appropriate key on the hot path (and > avoid any per-keypair allocations) That'd be a major improvement to the async interface, yes. > > So given the above, how would you like to proceed? My personal > > preference would be to see you start with the Zinc patchset and rename > > things and change the infrastructure to something that fits your > > preferences, and we can see what that looks like.
Less appealing would > be to do several iterations of you reworking Zinc from scratch and > going through the exercises all over again, but if you prefer that I > guess I could cope. Alternatively, maybe this is a lot to chew on, and > we should just throw caution to the wind, implement cryptoapi.c for > WireGuard (as described at the top), and add C functions to the crypto > API sometime later? This is what I had envisioned in [1]. > It all depends on whether we are interested in supporting async > accelerators or not, and it is clear what my position is on this > point. For a first version of WireGuard, no, I'm really not interested in that. Adding it in there is more ambitious than it looks to get it right. Async means more buffers, which means the queuing system for WireGuard needs to be changed. There's already ongoing research into this, and I'm happy to consider that research with an eye toward maybe having async stuff in the future. But sticking it into the code now as-is simply does not work from a buffering/queueing perspective. So again, let's take an iterative approach here: first we do stuff with the simple synchronous API. After the dust has settled, hardware is available for testing, Van Jacobson has been taken off the bookshelf for a fresh reading, and we've all sat down for a few interesting conversations at netdev on queueing and bufferbloat, then let's start working this in. In other words, just because technically you can glue those APIs together, sort of, doesn't mean that approach makes sense for the system as a whole. > I am not convinced that we need accelerated implementations of blake2s > and curve25519, but if we do, I'd like those to be implemented as > individual modules under arch/*/crypto, with some moduleloader fu for > weak symbols or static calls thrown in if we have to. Exposing them as > shashes seems unnecessary to me at this point. We need the accelerated implementations. And we'll need them for chapoly too, obviously.
So let's work out a good way to hook that all into the Zinc-style interface. [2] does it in a very effective way that's overall quite good for performance and easy to follow. The chacha20-x86_64-glue.c code itself gets called via the static symbol chacha20_arch. This is implemented for each platform with a fallback to one that returns false, so that the generic code is called. The Zinc stuff here is obvious, simple, and I'm pretty sure you know what's up with it. I prefer each of these glue implementations to live in lib/zinc/chacha20/chacha20-${ARCH}-glue.c. You don't like that and want things in arch/${ARCH}/crypto/chacha20-glue.c. Okay, sure, fine, let's do all the naming and organization and political stuff how you like, and I'll leave aside my arguments about why I disagree. Let's take stock of where that leaves us, in terms of files: - lib/crypto/chacha20.c: this has a generic implementation, but at the top of the generic implementation, it has some code like "if (chacha20_arch(..., ..., ...)) return;" - arch/x86_64/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/arm/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/mips/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which contains an assembly version that's always run unconditionally. Our goal is that chacha20_arch() from each of these arch glues gets included in the lib/crypto/chacha20.c compilation unit. The reason why we want it in the same unit is so that the inliner can get rid of unreached code and more tightly integrate the branches. For the MIPS case, the advantage is clear. Here's how Zinc handles it: [3]. Some simple ifdefs and includes. Easy and straightforward.
Some people might object, though, and it sounds like you might. So let's talk about some alternative mechanisms with their pros and cons: - The zinc method: straightforward, but not everybody likes ifdefs. - Stick the filename to be included into a Kconfig variable and let the configuration system do the logic: also straightforward. Not sure it's kosher, but it would work. - Weak symbols: we don't get inlining or the dead code elimination. - Function pointers: ditto, plus spectre. - Other ideas you might have? I'm open to suggestions here. [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54 [3] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19 > What I *don't* want is to merge WireGuard with its own library based > crypto now, and extend that later for async accelerators once people > realize that we really do need that as well. I wouldn't worry so much about that. Zinc/library-based-crypto is just trying to fulfill the boring synchronous pure-code part of crypto implementations. For the async stuff, we can work together on improving the existing crypto API to be more appealing, in tandem with some interesting research into packet queuing systems. From the other thread, you might have seen that already Toke has cool ideas that I hope we can all sit down and talk about. I'm certainly not interested in "bolting" anything on to Zinc/library-based-crypto, and I'm happy to work to improve the asynchronous API for doing asynchronous crypto. Jason
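For reference, the dispatch shape described above — the arch hook and the generic entry point compiled into one unit so dead branches disappear — can be sketched in plain C (illustrative; `HAVE_ARCH_CHACHA20` is a made-up stand-in for whatever ifdef/Kconfig plumbing is chosen, and the cipher bodies are stubs):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for arch/${ARCH}/crypto/chacha20-glue.c being compiled into
 * the same unit as the generic code. With HAVE_ARCH_CHACHA20 defined
 * and the hook returning true unconditionally (the MIPS case), the
 * compiler can drop the generic body as dead code; with it undefined,
 * the branch folds away and only the generic body remains. */
#ifdef HAVE_ARCH_CHACHA20
static bool chacha20_arch(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;
    /* arch-specific assembly/SIMD would run here */
    return true;
}
#else
static bool chacha20_arch(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;
    return false;   /* no arch code configured: use the generic path */
}
#endif

static int generic_calls;   /* instrumentation for this sketch only */

static void chacha20_generic(unsigned char *dst, const unsigned char *src,
                             size_t len)
{
    (void)dst; (void)src; (void)len;
    generic_calls++;
    /* portable scalar implementation would live here */
}

/* The library entry point, mirroring the
 * "if (chacha20_arch(...)) return;" pattern. */
void chacha20(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (chacha20_arch(dst, src, len))
        return;
    chacha20_generic(dst, src, len);
}
```

Because both functions sit in one translation unit, the `if` costs nothing once the hook's return value is known at compile time.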
> > So are you saying that the handshake timing constraints in the > > WireGuard protocol are so stringent that we can't run it securely on, > > e.g., an ARM CPU that lacks a NEON unit? Or given that you are not > > providing accelerated implementations of blake2s or Curve25519 for > > arm64, we can't run it securely on arm64 at all? > > Deployed at scale, the handshake must have a certain performance to > not be DoS'd. I've spent a long time benching these and attacking my > own code. I won't be comfortable with this going in without the fast > implementations for the handshake. As a networking guy, the relation between fast handshake crypto and DoS resistance is not obvious. Could you explain this a bit? It seems like a lot of people would like an OpenWRT box to be their VPN gateway. And most of them are small ARM or MIPS processors. Are you saying WireGuard will not be usable on such devices? Thanks Andrew
On Thu, Sep 26, 2019 at 1:52 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Ard, > > > Our goals are that chacha20_arch() from each of these arch glues gets > included in the lib/crypto/chacha20.c compilation unit. The reason why > we want it in its own unit is so that the inliner can get rid of > unreached code and more tightly integrate the branches. For the MIPS > case, the advantage is clear. IMO this needs numbers. My suggestion from way back, which is at least a good deal of the way toward being doable, is to do static calls. This means that the common code will call out to the arch code via a regular CALL instruction and will *not* inline the arch code. This means that the arch code could live in its own module, it can be selected at boot time, etc. For x86, inlining seems a bit nuts to avoid a whole mess of: if (use avx2) do_avx2_thing(); else if (use avx1) do_avx1_thing(); else etc; On x86, direct calls are pretty cheap. Certainly for operations like curve25519, I doubt you will ever see a real-world effect from inlining. I'd be surprised for chacha20. If you really want inlining to dictate the overall design, I think you need some real numbers for why it's necessary. There also needs to be a clear story for how exactly making everything inline plays with the actual decision of which implementation to use. I think it's also worth noting that LTO is coming. --Andy
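Andy's static-call suggestion, reduced to a userspace sketch with a plain function pointer (names are made up; the kernel mechanism differs precisely in that it patches the call site at update time, so the hot path pays a direct CALL instead of this indirect, retpoline-afflicted one):

```c
#include <stddef.h>

typedef void (*chacha20_fn)(unsigned char *dst, const unsigned char *src,
                            size_t len);

static void chacha20_scalar(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    (void)dst; (void)src; (void)len;   /* portable fallback stub */
}

static void chacha20_avx2(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;   /* SIMD implementation stub */
}

/* Single call target, chosen once; hot-path callers invoke
 * chacha20_call(...) with no per-call CPU-feature tests. */
static chacha20_fn chacha20_call = chacha20_scalar;

/* One-time selection at boot/module load, analogous to updating a
 * static call; afterward the implementation never changes. */
static void chacha20_select(int have_avx2)
{
    chacha20_call = have_avx2 ? chacha20_avx2 : chacha20_scalar;
}
```

This keeps the "whole mess of" feature branches out of the hot path while still allowing the arch code to live in its own module and be selected at boot.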
On Thu, 26 Sep 2019 13:06:51 +0200, Jason A. Donenfeld wrote: > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote: > > Actually, that assumption is factually wrong. I don't know if anything > > is *publicly* available, but I can assure you the silicon is running in > > labs already. And something will be publicly available early next year > > at the latest. Which could nicely coincide with having Wireguard support > > in the kernel (which I would also like to see happen BTW) ... > > > > Not "at some point". It will. Very soon. Maybe not in consumer or server > > CPUs, but definitely in the embedded (networking) space. > > And it *will* be much faster than the embedded CPU next to it, so it will > > be worth using it for something like bulk packet encryption. > > Super! I was wondering if you could speak a bit more about the > interface. My biggest questions surround latency. Will it be > synchronous or asynchronous? If the latter, why? What will its > latencies be? How deep will its buffers be? The reason I ask is that a > lot of crypto acceleration hardware of the past has been fast and > has had very deep buffers, but at great expense of latency. In the > networking context, keeping latency low is pretty important. FWIW, are you familiar with the existing kTLS and IPsec offloads in the networking stack? They offload the crypto into the NIC, inline, which helps with the latency and processing overhead. There is also NIC silicon that can do some ChaCha/Poly, although I'm not familiar enough with WireGuard to know if offload to existing silicon will be possible.
On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > > > -----Original Message----- > > From: Jason A. Donenfeld <Jason@zx2c4.com> > > Sent: Thursday, September 26, 2019 1:07 PM > > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> > > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux- > > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; > > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH > > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel > > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann > > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; > > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas > > <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev <netdev@vger.kernel.org>; > > Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com> > > Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard > > using the existing crypto API] > > > > [CC +willy, toke, dave, netdev] > > > > Hi Pascal > > > > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen > > <pvanleeuwen@verimatrix.com> wrote: > > > Actually, that assumption is factually wrong. I don't know if anything > > > is *publicly* available, but I can assure you the silicon is running in > > > labs already. And something will be publicly available early next year > > > at the latest. Which could nicely coincide with having Wireguard support > > > in the kernel (which I would also like to see happen BTW) ... > > > > > > Not "at some point". It will. Very soon. Maybe not in consumer or server > > > CPUs, but definitely in the embedded (networking) space. 
> > > And it *will* be much faster than the embedded CPU next to it, so it will > > > be worth using it for something like bulk packet encryption. > > > > Super! I was wondering if you could speak a bit more about the > > interface. My biggest questions surround latency. Will it be > > synchronous or asynchronous? > > > The hardware being external to the CPU and running in parallel with it, > obviously asynchronous. > > > If the latter, why? > > > Because, as you probably already guessed, the round-trip latency is way > longer than the actual processing time, at least for small packets. > > Partly because the only way to communicate between the CPU and the HW > accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't > keep the CPU busy moving data is through memory, with the HW doing DMA. > And, as any programmer should know, round-trip times to memory are huge > relative to the processing speed. > > And partly because these accelerators are very similar to CPUs in > terms of architecture, doing pipelined processing and having multiple > of such pipelines in parallel. Except that these pipelines are not > working on low-level instructions but on full packets/blocks. So they > need to have many packets in flight to keep those pipelines fully > occupied. And packets need to move through the various pipeline stages, > so they incur the time needed to process them multiple times. (just > like e.g. a multiply instruction with a throughput of 1 per cycle > may need 4 or more cycles to actually provide its result) > > Could you do that from a synchronous interface? In theory, probably, > if you would spawn a new thread for every new packet arriving and > rely on the scheduler to preempt the waiting threads.
But you'd need > as many threads as the HW accelerator can have packets in flight, > while an async would need only 2 threads: one to handle the input to > the accelerator and one to handle the output (or at most one thread > per CPU, if you want to divide the workload) > > Such a many-thread approach seems very inefficient to me. > > > What will its latencies be? > > > Depends very much on the specific integration scenario (i.e. bus > speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on > the order of a few thousand CPU clocks is not unheard of. > Which is an eternity for the CPU, but still only a few uSec in > human time. Not a problem unless you're a high-frequency trader and > every ns counts ... > It's not like the CPU would process those packets in zero time. > > > How deep will its buffers be? > > > That of course depends on the specific accelerator implementation, > but possibly dozens of small packets in our case, as you'd need > at least width x depth packets in there to keep the pipes busy. > Just like a modern CPU needs hundreds of instructions in flight > to keep all its resources busy. > > > The reason I ask is that a > > lot of crypto acceleration hardware of the past has been fast and > > having very deep buffers, but at great expense of latency. > > > Define "great expense". Everything is relative. The latency is very > high compared to per-packet processing time but at the same time it's > only on the order of a few uSec. Which may not even be significant on > the total time it takes for the packet to travel from input MAC to > output MAC, considering the CPU will still need to parse and classify > it and do pre- and postprocessing on it. > > > In the networking context, keeping latency low is pretty important. > > > I've been doing this for IPsec for nearly 20 years now and I've never > heard anyone complain about our latency, so it must be OK. Well, it depends on where your bottlenecks are. 
On low-end hardware you can and do tend to bottleneck on the crypto step, and with uncontrolled, non-fq'd non-aqm'd buffering you get results like this: http://blog.cerowrt.org/post/wireguard/ so in terms of "threads" I would prefer to think of flows entering the tunnel and attempting to multiplex them as best as possible across the crypto hard/software so that minimal in-hw latencies are experienced for most packets and that the coupled queue length does not grow out of control, Adding fq_codel's hashing algo and queuing to ipsec as was done in commit: 264b87fa617e758966108db48db220571ff3d60e to leverage the inner hash... Had some nice results: before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes) After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes) I'd love to see more vpn vendors using the rrul test or something even nastier to evaluate their results, rather than dragstrip bulk throughput tests, steering multiple flows over multiple cores. > We're also doing (fully inline, no CPU involved) MACsec cores, which > operate at layer 2 and I know it's a concern there for very specific > use cases (high frequency trading, precision time protocol, ...). > For "normal" VPN's though, a few uSec more or less should be a non-issue. Measured buffering is typically 1000 packets in userspace vpns. If you can put data in, faster than you can get it out, well.... > > Already > > WireGuard is multi-threaded which isn't super great all the time for > > latency (improvements are a work in progress). If you're involved with > > the design of the hardware, perhaps this is something you can help > > ensure winds up working well? For example, AES-NI is straightforward > > and good, but Intel can do that because they are the CPU. It sounds > > like your silicon will be adjacent. How do you envision this working > > in a low latency environment? > > > Depends on how low low-latency is. 
If you really need minimal latency, > you need an inline implementation. Which we can also provide, BTW :-) > > Regards, > Pascal van Leeuwen > Silicon IP Architect, Multi-Protocol Engines @ Verimatrix > www.insidesecure.com -- Dave Täht CTO, TekLibre, LLC http://www.teklibre.com Tel: 1-831-205-9740
Hey Andy, Thanks for weighing in. > inlining. I'd be surprised for chacha20. If you really want inlining > to dictate the overall design, I think you need some real numbers for > why it's necessary. There also needs to be a clear story for how > exactly making everything inline plays with the actual decision of > which implementation to use. Take a look at my description for the MIPS case: when on MIPS, the arch code is *always* used since it's just straight up scalar assembly. In this case, the chacha20_arch function *never* returns false [1], which means it's always included [2], so the generic implementation gets optimized out, saving disk and memory, which I assume MIPS people care about. [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13 [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118 I'm fine with considering this a form of "premature optimization", though, and ditching the motivation there. On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote: > My suggestion from way back, which is at > least a good deal of the way toward being doable, is to do static > calls. This means that the common code will call out to the arch code > via a regular CALL instruction and will *not* inline the arch code. > This means that the arch code could live in its own module, it can be > selected at boot time, etc. Alright, let's do static calls, then, to deal with the case of going from the entry point implementation in lib/zinc (or lib/crypto, if you want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto. And then within each arch, we can keep it simple, since everything is already in the same directory. Sound good? Jason
On Fri, 27 Sep 2019 at 09:21, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hey Andy, > > Thanks for weighing in. > > > inlining. I'd be surprised for chacha20. If you really want inlining > > to dictate the overall design, I think you need some real numbers for > > why it's necessary. There also needs to be a clear story for how > > exactly making everything inline plays with the actual decision of > > which implementation to use. > > Take a look at my description for the MIPS case: when on MIPS, the > arch code is *always* used since it's just straight up scalar > assembly. In this case, the chacha20_arch function *never* returns > false [1], which means it's always included [2], so the generic > implementation gets optimized out, saving disk and memory, which I > assume MIPS people care about. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13 > [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118 > > I'm fine with considering this a form of "premature optimization", > though, and ditching the motivation there. > > On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote: > > My suggestion from way back, which is at > > least a good deal of the way toward being doable, is to do static > > calls. This means that the common code will call out to the arch code > > via a regular CALL instruction and will *not* inline the arch code. > > This means that the arch code could live in its own module, it can be > > selected at boot time, etc. > > Alright, let's do static calls, then, to deal with the case of going > from the entry point implementation in lib/zinc (or lib/crypto, if you > want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto. > And then within each arch, we can keep it simple, since everything is > already in the same directory. > > Sound good? > Yup. 
I posted something to this effect - I am ironing out some wrinkles doing randconfig builds (with Arnd's help) but the general picture shouldn't change.