Message ID | 20190925161255.1871-1-ard.biesheuvel@linaro.org
---|---
Series | crypto: wireguard using the existing crypto API
Hi Ard,

Thanks for taking the initiative on this. When I first discussed with DaveM porting WireGuard to the crypto API and doing Zinc later, yesterday, I thought to myself, “I wonder if Ard might want to work on this with me…” and sent you a message on IRC. It didn’t occur to me that you were the one who had pushed this endeavor!

I must admit, though, I’m a bit surprised to see how it’s appearing. When I wrote [1], I had really imagined postponing the goals of Zinc entirely, and instead writing a small shim that calls into the existing crypto API machinery. I imagined the series to look like this:

1. Add blake2s generic as a crypto API shash object.
2. Add blake2s x86_64 as a crypto API shash object.
3. Add curve25519 generic as a crypto API dh object.
4. Add curve25519 x86_64 as a crypto API dh object.
5. Add curve25519 arm as a crypto API dh object.
6. The unmodified WireGuard commit.
7. A “cryptoapi.c” file for WireGuard that provides definitions of the “just functions” approach of Zinc, but does so in terms of the crypto API’s infrastructure, with global per-cpu lists and a few locks to handle quick buffer and tfm reuse.

I wouldn’t expect (7) to be pretty, for the various reasons that most people dislike the crypto API, but at least it would somewhat “work”, not affect the general integrity of WireGuard, and provide a clear path forward in an evolutionary manner for gradually, piecemeal, swapping out pieces of that for a Zinc-like thing, however that winds up appearing.

Instead what we’ve wound up with in this series is a Frankenstein’s monster of Zinc, which appears to have basically the same goal as Zinc, and even much of the same implementation just moved to a different directory, but then skimps on making it actually work well and introduces problems. (I’ll elucidate on some specific issues later in this email so that we can get on the same page with regards to security requirements for WireGuard.)
I surmise from this Zinc-but-not series that what is actually going on here is mostly some kind of power or leadership situation, which is what you’ve described to me also at various other points and in person. I also recognize that I am at least partway to blame for whatever dynamic has stagnated this process; let me try to rectify that:

A principal objection you’ve had is that Zinc moves to its own directory, with its own name, and tries to segment itself off from the rest of the crypto API’s infrastructure. You’ve always felt this should be mixed in with the rest of the crypto API’s infrastructure and directory structures in one way or another. Let’s do both of those things – put this in a directory structure you find appropriate and hook this into the rest of the crypto API’s infrastructure in a way you find appropriate. I might disagree, which is why Zinc does things the way it does, but I’m open to compromise and doing things more your way.

Another objection you’ve had is that Zinc replaces many existing implementations with its own. Martin wasn’t happy about that either. So let’s not do that, and we’ll have some wholesale replacement of implementations in future patchsets at future dates, discussed and benched and bikeshedded independently from this.

Finally, perhaps most importantly, Zinc’s been my design rather than our design. Let’s do this together instead of me git-send-email(1)-ing a v37.

If the process of doing that together will be fraught with difficulty, I’m still open to the “7 patch series” with the ugly cryptoapi.c approach, as described at the top. But I think if we start with Zinc and whittle it down in accordance with the above, we’ll get something mutually acceptable, and somewhat similar to this series, with a few important exceptions, which illustrate some of the issues I see in this RFC:

Issue 1) No fast implementations for the “it’s just functions” interface.

This is a deal breaker.
I know you disagree here and perhaps think all dynamic dispatch should be by loadable modules configured with userspace policy and lots of function pointers and dynamically composable DSL strings, as the current crypto API does it. But I think a lot of other people agree with me here (and they’ve chimed in before) that the branch predictor does things better, doesn’t have Spectre issues, and is very simple to read and understand. For reference, here’s what that kind of thing looks like: [2].

In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective? The protocol succeeds at this goal, but in many ways just by a hair when at scale, and so I’m really quite loath to decrease handshake performance. Here’s where that matters specifically:

- Curve25519 does indeed always appear to be taking tiny 32 byte stack inputs in WireGuard. However, your statement, “the fact that they operate on small, fixed size buffers means that there is really no point in providing alternative, SIMD based implementations of these, and we can limit ourselves to generic C library version,” is just plain wrong in this case. Curve25519 only ever operates on 32 byte inputs, because these represent curve scalars and points. It’s not like a block cipher, where parallelism helps with larger inputs or something. In this case, there are some pretty massive speed improvements between the generic C implementations and the optimized ones. Like huge. On both ARM and on Intel. And Curve25519 is the most expensive operation in WireGuard, and each handshake message invokes a few of them.
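For readers who don't want to chase [2], the glue-code style being referenced boils down to a capability flag set once at init plus a plain branch in the entry point, rather than function pointers. Here is a minimal, self-contained C sketch of that pattern; the function names and the trivial XOR body are illustrative stand-ins, not the actual Zinc or kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability flag, set once at module init
 * (e.g. from CPUID / HWCAP probing). */
static bool have_simd;

/* Reference C implementation: always available on every platform.
 * The body is a trivial stand-in for the real cipher. */
static void chacha20_generic(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = src[i] ^ 0x5a;
}

/* Optimized implementation, only valid when have_simd is true.
 * It must produce output identical to the generic version. */
static void chacha20_simd(uint8_t *dst, const uint8_t *src, size_t len)
{
	chacha20_generic(dst, src, len);  /* stand-in for SSSE3/AVX2 code */
}

/* The "it's just functions" entry point: a plain branch, no function
 * pointers.  The branch predictor learns the direction after one call,
 * and there is no indirect-call Spectre surface. */
void chacha20(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (have_simd)
		chacha20_simd(dst, src, len);
	else
		chacha20_generic(dst, src, len);
}
```

Callers simply invoke `chacha20()`; which implementation runs is an internal detail decided by the flag, which is exactly what makes the interface "just functions" from the caller's perspective.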
(Aside – Something to look forward to: I’m in the process of getting a formally verified x86_64 ADX implementation ready for kernel usage, to replace our existing heavily-fuzzed one, which will be cool.)

- Blake2s actually does benefit from the optimized code even for relatively short inputs. While you might have been focused on the super-super small inputs in noise.c, there are slightly larger ones in cookie.c, and these are the most sensitive computations to make in terms of DoS resistance; they’re on the “front lines” of the battle, if you will. (Aside – Arguably WireGuard may have benefited from using siphash with 128-bit outputs here, or calculated some security metrics for DoS resistance in the face of forged 64-bit outputs or something, or a different custom MAC, but hindsight is 20/20.)

- While 25519 and Blake2s are already in use, the optimized versions of ChaPoly wind up being faster as well, even if it’s just hitting the boring SSE code.

- On MIPS, the optimized versions of ChaPoly are a necessity. They’re boring integer/scalar code, but they do things that the compiler simply cannot do on the platform, and we benefit immensely from it.

Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.

Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.

I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shudder. But there’s actually another issue at play: wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all part of the handshake. We cannot afford to allocate a brand new crypto object, parse the DSL string, connect all those function pointers, etc.
The allocations involved here aren’t really okay at all in that path. That’s why the cryptoapi.c idea above involves just using a pool of pre-allocated objects if we’re going to be using that API at all. Also keep in mind that WireGuard instances sometimes have hundreds of thousands of peers.

I’d recommend leaving this synchronous as it exists now, as Linus suggested, and we can revisit that later down the road. There are a number of improvements to the async API we could make down the line that could make this viable in WireGuard. For example, I could imagine decoupling the creation of the cipher object from its keys and intermediate buffers, so that we could in fact allocate the cipher objects with their DSLs globally in a safe way, while allowing the keys and working buffers to come from elsewhere. This is deep plumbing into the async API, but I think we could get there in time.

There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API. At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

Issue 3) WireGuard patch is out of date.

This is my fault, because I haven’t posted in a long time. There are some important changes in the main WireGuard repo. I’ll roll another patch soon for this so we have something recent to work off of. Sorry about that.

Issue 4) FPU register batching?

When I introduced the simd_get/simd_put/simd_relax thing, people seemed to think it was a good idea. My benchmarks of it showed significant throughput improvements. Your patchset doesn’t have anything similar to this. But on the other hand, last I spoke with the x86 FPU guys, I thought they might actually be in the process of making simd_get/put obsolete with some internal plumbing to make restoration lazier.
I’ll see tglx later today and will poke him about this, as this might already be a non-issue.

So given the above, how would you like to proceed? My personal preference would be to see you start with the Zinc patchset and rename things and change the infrastructure to something that fits your preferences, and we can see what that looks like. Less appealing would be to do several iterations of you reworking Zinc from scratch and going through the exercises all over again, but if you prefer that, I guess I could cope. Alternatively, maybe this is a lot to chew on, and we should just throw caution into the wind, implement cryptoapi.c for WireGuard (as described at the top), and add C functions to the crypto API sometime later? This is what I had envisioned in [1].

And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues. Also, if you’re into chat, we can always spec some of the nitty-gritty aspects out over IRC or even the old-fashioned telephone. Thanks again for pushing this forward.

Regards,
Jason

[1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54
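The pool of pre-allocated objects mentioned under Issue 2 above could look roughly like the following. This is a simplified, single-threaded C sketch with hypothetical names; the real cryptoapi.c would hold real crypto tfm objects and guard the free list with a spinlock or keep per-cpu lists, as the email describes:

```c
#include <stddef.h>

/* Stand-in for a heavyweight, expensive-to-construct crypto
 * transform object (the thing whose DSL gets parsed once). */
struct tfm {
	struct tfm *next;
	int id;
};

#define POOL_SIZE 4
static struct tfm pool_storage[POOL_SIZE];
static struct tfm *free_list;

/* Construct every object up front, so the handshake hot path
 * never has to allocate or parse anything. */
void pool_init(void)
{
	free_list = NULL;
	for (int i = 0; i < POOL_SIZE; i++) {
		pool_storage[i].id = i;
		pool_storage[i].next = free_list;
		free_list = &pool_storage[i];
	}
}

/* O(1) grab from the free list; returns NULL when exhausted.
 * In the kernel this would be protected by a lock or be per-cpu. */
struct tfm *pool_get(void)
{
	struct tfm *t = free_list;
	if (t)
		free_list = t->next;
	return t;
}

/* O(1) return to the free list. */
void pool_put(struct tfm *t)
{
	t->next = free_list;
	free_list = t;
}
```

The point of the design is that the expensive construction cost is paid once at init time, and the per-handshake cost is just a pointer swap.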
> There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API.

Actually, that assumption is factually wrong. I don't know if anything is *publicly* available, but I can assure you the silicon is running in labs already. And something will be publicly available early next year at the latest. Which could nicely coincide with having Wireguard support in the kernel (which I would also like to see happen, BTW) ...

> At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

Not "at some point". It will. Very soon. Maybe not in consumer or server CPUs, but definitely in the embedded (networking) space. And it *will* be much faster than the embedded CPU next to it, so it will be worth using it for something like bulk packet encryption.

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > Actually, that assumption is factually wrong. I don't know if anything > is *publicly* available, but I can assure you the silicon is running in > labs already. Great to hear, and thanks for the information. I'll follow up with some questions on this in another thread.
On Thu, 26 Sep 2019 at 10:59, Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> ...
> Instead what we’ve wound up with in this series is a Frankenstein’s monster of Zinc, which appears to have basically the same goal as Zinc, and even much of the same implementation just moved to a different directory, but then skimps on making it actually work well and introduces problems. (I’ll elucidate on some specific issues later in this email so that we can get on the same page with regards to security requirements for WireGuard.) I surmise from this Zinc-but-not series is that what actually is going on here is mostly some kind of power or leadership situation, which is what you’ve described to me also at various other points and in person.

I'm not sure what you are alluding to here. I have always been very clear about what I like about Zinc and what I don't like about Zinc. I agree that it makes absolutely no sense for casual, in-kernel crypto to jump through all the hoops that the crypto API requires. But for operating on big chunks of data on the kernel heap, we have an existing API that we should leverage if we can, and fix if we need to, so that all its users can benefit.

> I also recognize that I am at least part way to blame for whatever dynamic there has stagnated this process; let me try to rectify that:
>
> A principle objection you’ve had is that Zinc moves to its own directory, with its own name, and tries to segment itself off from the rest of the crypto API’s infrastructure. You’ve always felt this should be mixed in with the rest of the crypto API’s infrastructure and directory structures in one way or another. Let’s do both of those things – put this in a directory structure you find appropriate and hook this into the rest of the crypto API’s infrastructure in a way you find appropriate. I might disagree, which is why Zinc does things the way it does, but I’m open to compromise and doing things more your way.
It doesn't have to be your way or my way. The whole point of being part of this community is that we find solutions that work for everyone, through discussion and iterative prototyping. Turning up out of the blue with a 50,000 line patch set and a take-it-or-leave-it attitude goes counter to that, and this is why we have made so little progress over the past year. But I am happy with your willingness to collaborate and find common ground, which was also my motivation for spending a considerable amount of time to prepare this patch set.

> Another objection you’ve had is that Zinc replaces many existing implementations with its own. Martin wasn’t happy about that either. So let’s not do that, and we’ll have some wholesale replacement of implementations in future patchsets at future dates discussed and benched and bikeshedded independently from this.
>
> Finally, perhaps most importantly, Zinc’s been my design rather than our design. Let’s do this together instead of me git-send-email(1)-ing a v37.
>
> If the process of doing that together will be fraught with difficulty, I’m still open to the “7 patch series” with the ugly cryptoapi.c approach, as described at the top.

If your aim is to write ugly code and use that as a munition

> But I think if we start with Zinc and whittle it down in accordance with the above, we’ll get something mutually acceptable, and somewhat similar to this series, with a few important exceptions, which illustrate some of the issues I see in this RFC:
>
> Issue 1) No fast implementations for the “it’s just functions” interface.
>
> This is a deal breaker. I know you disagree here and perhaps think all dynamic dispatch should be by loadable modules configured with userspace policy and lots of function pointers and dynamically composable DSL strings, as the current crypto API does it.
> But I think a lot of other people agree with me here (and they’ve chimed in before) that the branch predictor does things better, doesn’t have Spectre issues, and is very simple to read and understand. For reference, here’s what that kind of thing looks like: [2].

This is one of the issues in the 'fix it for everyone else as well' category. If we can improve the crypto API to be less susceptible to these issues (e.g., using static calls), everybody benefits. I'd be happy to collaborate on that.

> In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective. The protocol succeeds at this goal, but in many ways, just by a hair when at scale, and so I’m really quite loathe to decrease handshake performance.
> ...
> Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.

So are you saying that the handshake timing constraints in the WireGuard protocol are so stringent that we can't run it securely on, e.g., an ARM CPU that lacks a NEON unit? Or, given that you are not providing accelerated implementations of blake2s or Curve25519 for arm64, we can't run it securely on arm64 at all?

Typically, I would prefer to only introduce different versions of the same algorithm if there is a clear performance benefit for an actual use case.

Framing this as a security issue rather than a performance issue is slightly disingenuous, since people are less likely to challenge it.
But the security of any VPN protocol worth its salt should not hinge on the performance delta between the reference C code and a version that was optimized for a particular CPU.

> Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.
>
> I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shutter.

Your implementation of RFC7539 truncates the nonce to 64 bits, while RFC7539 defines a clear purpose for the bits you omit. Since the Zinc library is intended to be standalone (and you are proposing its use in other places, like big_keys.c), you might want to document your justification for doing so in the general case, instead of ridiculing the code I needed to write to work around this limitation.

> But there’s actually another issue at play:
>
> wg_noise_handshake_begin_session→derive_keys→symmetric_key_init is all part of the handshake. We cannot afford to allocate a brand new crypto object, parse the DSL string, connect all those function pointers, etc.

Parsing the string and connecting the function pointers happens only once, and only when the transform needs to be instantiated from its constituent parts. Subsequent invocations will just grab the existing object.

> The allocations involved here aren’t really okay at all in that path. That’s why the cryptoapi.c idea above involves just using a pool of pre-allocated objects if we’re going to be using that API at all. Also keep in mind that WireGuard instances sometimes have hundreds of thousands of peers.

My preference would be to address this by permitting per-request keys in the AEAD layer.
That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).

> I’d recommend leaving this synchronous as it exists now, as Linus suggested, and we can revisit that later down the road. There are a number of improvements to the async API we could make down the line that could make this viable in WireGuard. For example, I could imagine decoupling the creation of the cipher object from its keys and intermediate buffers, so that we could in fact allocate the cipher objects with their DSLs globally in a safe way, while allowing the keys and working buffers to come from elsewhere. This is deep plumbing into the async API, but I think we could get there in time.

My changes actually move all the rfc7539() intermediate buffers to the stack, so the only remaining allocation is the per-keypair one.

> There’s also a degree of practicality: right now there is zero ChaPoly async acceleration hardware anywhere that would fit into the crypto API. At some point, it might come to exist and have incredible performance, and then we’ll both feel very motivated to make this work for WireGuard. But it might also not come to be (AES seems to have won over most of the industry), in which case, why hassle?

As I already pointed out, we have supported hardware already: CAAM is in mainline, and Inside Secure patches are on the list.

> Issue 3) WireGuard patch is out of date.
>
> This is my fault, because I haven’t posted in a long time. There are some important changes in the main WireGuard repo. I’ll roll another patch soon for this so we have something recent to work off of. Sorry about that.

This is the reason I included your WG patch verbatim: to make it easier to rebase to newer versions. In fact, I never intended or expected anything but discussion from this submission, let alone anyone actually merging it :-)

> Issue 4) FPU register batching?
> When I introduced the simd_get/simd_put/simd_relax thing, people seemed to think it was a good idea. My benchmarks of it showed significant throughput improvements. Your patchset doesn’t have anything similar to this.

It uses the existing SIMD batching, and enhances it slightly for the Poly1305/shash case.

> But on the other hand, last I spoke with the x86 FPU guys, I thought they might actually be in the process of making simd_get/put obsolete with some internal plumbing to make restoration lazier. I’ll see tglx later today and will poke him about this, as this might already be a non-issue.

We've already made improvements here for arm64 as well (and ARM already used lazy restore). But I think it still makes sense to amortize these calls over a reasonable chunk of data, i.e., a packet.

> So given the above, how would you like to proceed? My personal preference would be to see you start with the Zinc patchset and rename things and change the infrastructure to something that fits your preferences, and we can see what that looks like. Less appealing would be to do several iterations of you reworking Zinc from scratch and going through the exercises all over again, but if you prefer that I guess I could cope. Alternatively, maybe this is a lot to chew on, and we should just throw caution into the wind, implement cryptoapi.c for WireGuard (as described at the top), and add C functions to the crypto API sometime later? This is what I had envisioned in [1].

It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.

I am not convinced that we need accelerated implementations of blake2s and curve25519, but if we do, I'd like those to be implemented as individual modules under arch/*/crypto, with some module loader fu for weak symbols or static calls thrown in if we have to. Exposing them as shashes seems unnecessary to me at this point.
My only objection to your simd get/put interface is that it uses a typedef rather than a struct definition. (I also wonder how we can avoid two instances living on the same call stack, unless we forbid functions that take a struct simd* from calling functions that don't take one, but these are details we should be able to work out.)

What I *don't* want is to merge WireGuard with its own library-based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.

> And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues. Also, if you’re into chat, we can always spec some of the nitty-gritty aspects out over IRC or even the old-fashioned telephone. Thanks again for pushing this forward.

My pleasure :-)
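For reference, the simd_get/simd_put/simd_relax batching being discussed amounts to amortizing one expensive FPU save/restore over a whole batch of packets, while still giving the scheduler a chance to interrupt between packets. The following is a toy C model of that idea, with mock counters standing in for kernel_fpu_begin()/kernel_fpu_end(); it is not the kernel's implementation, just an illustration of why the per-batch cost is one save instead of one per packet:

```c
#include <stdbool.h>

/* Mock cost counters standing in for kernel_fpu_begin()/end(). */
static int fpu_begin_calls, fpu_end_calls;

/* Zinc exposes this as a typedef (one of the points of contention). */
typedef struct {
	bool in_use;
} simd_context_t;

static void simd_get(simd_context_t *ctx)
{
	ctx->in_use = false;          /* lazy: FPU not yet claimed */
}

/* Claim the FPU on first actual use; subsequent calls are free. */
static void simd_ensure(simd_context_t *ctx)
{
	if (!ctx->in_use) {
		fpu_begin_calls++;    /* models kernel_fpu_begin() */
		ctx->in_use = true;
	}
}

/* Called between packets: would drop the FPU only if, e.g.,
 * preemption is needed; here we model the common fast path
 * where it is kept. */
static void simd_relax(simd_context_t *ctx)
{
	(void)ctx;
}

static void simd_put(simd_context_t *ctx)
{
	if (ctx->in_use) {
		fpu_end_calls++;      /* models kernel_fpu_end() */
		ctx->in_use = false;
	}
}

/* Encrypt a batch of packets under a single FPU section.
 * Returns the cumulative number of FPU saves performed so far. */
int encrypt_batch(int npackets)
{
	simd_context_t simd;

	simd_get(&simd);
	for (int i = 0; i < npackets; i++) {
		simd_ensure(&simd);   /* only the first packet pays */
		/* ... SIMD ChaCha20-Poly1305 on packet i ... */
		simd_relax(&simd);
	}
	simd_put(&simd);
	return fpu_begin_calls;
}
```

Without the batching, every packet would pay the begin/end cost; with it, a batch of any size pays it once.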
> > In this case, the relevance is that the handshake in WireGuard is extremely performance sensitive, in order to fend off DoS. One of the big design gambits in WireGuard is – can we make it 1-RTT to reduce the complexity of the state machine, but keep the crypto efficient enough that this is still safe to do from a DoS perspective. The protocol succeeds at this goal, but in many ways, just by a hair when at scale, and so I’m really quite loathe to decrease handshake performance.
> ...
> > Taken together, we simply can’t skimp on the implementations available on the handshake layer, so we’ll need to add some form of implementation selection, whether it’s the method Zinc uses ([2]), or something else we cook up together.
>
> So are you saying that the handshake timing constraints in the WireGuard protocol are so stringent that we can't run it securely on, e.g., an ARM CPU that lacks a NEON unit? Or given that you are not providing accelerated implementations of blake2s or Curve25519 for arm64, we can't run it securely on arm64 at all?
>
> Typically, I would prefer to only introduce different versions of the same algorithm if there is a clear performance benefit for an actual use case.
>
> Framing this as a security issue rather than a performance issue is slightly disingenuous, since people are less likely to challenge it. But the security of any VPN protocol worth its salt should not hinge on the performance delta between the reference C code and a version that was optimized for a particular CPU.

Fully agree with that last statement. Security of a protocol should *never* depend on the performance of a particular implementation. I may want to run this on a very constrained embedded system that would necessarily be very slow, and I would still want that to be secure. If this is true, it's pretty much a deal-breaker to me ...
Which would be a shame, because I really do like some of the other things Wireguard does, and just the effort of improving VPN in general.

> > Issue 2) Linus’ objection to the async API invasion is more correct than he realizes.
> >
> > I could re-enumerate my objections to the API there, but I think we all get it. It’s horrendous looking. Even the introduction of the ivpad member (what on earth?) in the skb cb made me shutter.
>
> Your implementation of RFC7539 truncates the nonce to 64-bits, while RFC7539 defines a clear purpose for the bits you omit. Since the Zinc library is intended to be standalone (and you are proposing its use in other places, like big_keys.c), you might want to document your justification for doing so in the general case, instead of ridiculing the code I needed to write to work around this limitation.

From RFC7539: "Some protocols may have unique per-invocation inputs that are not 96 bits in length. For example, IPsec may specify a 64-bit nonce. In such a case, it is up to the protocol document to define how to transform the protocol nonce into a 96-bit nonce, <<for example, by concatenating a constant value.>>"

So concatenating zeroes within the protocol is fine (if you can live with the security consequences), but a generic library function should of course take all 96 bits as input(!)

Actually, the rfc7539esp variant already takes that part of the nonce from the key, not the IV. This may be more convenient for use with Wireguard as well? Just force the trailing nonce portion of the key to zeroes when calling setkey().

> My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).

This part I do not really understand. Why would you need to allocate a new transform if you change the key?
Why can't you just call setkey() on the already allocated transform?

> It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.

Maybe not for an initial upstream, but it should be a longer-term goal.

> What I *don't* want is to merge WireGuard with its own library based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.

What's wrong with a step-by-step approach, though? I.e., merge it with library calls now and then gradually work towards the goal of integrating (a tweaked version of) the Crypto API where that actually makes sense? Rome wasn't built in one day either ...

Regards,
Pascal van Leeuwen
Silicon IP Architect, Multi-Protocol Engines @ Verimatrix
www.insidesecure.com
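To make the RFC 7539 passage quoted above concrete: a protocol that only has a 64-bit per-invocation value can widen it to the 96-bit AEAD nonce by concatenating a 32-bit constant in front of the little-endian counter. A small illustrative helper (the function name is hypothetical; WireGuard's data channel effectively uses a zero constant here):

```c
#include <stdint.h>

/* Build a 96-bit RFC 7539 AEAD nonce from a 32-bit constant and a
 * 64-bit per-invocation counter, both serialized little-endian.
 * With constant == 0 this matches the "concatenate zeroes" case
 * discussed in the thread. */
void chacha20poly1305_nonce(uint8_t nonce[12], uint32_t constant,
			    uint64_t counter)
{
	nonce[0] = (uint8_t)(constant);
	nonce[1] = (uint8_t)(constant >> 8);
	nonce[2] = (uint8_t)(constant >> 16);
	nonce[3] = (uint8_t)(constant >> 24);
	for (int i = 0; i < 8; i++)
		nonce[4 + i] = (uint8_t)(counter >> (8 * i));
}
```

A generic library entry point, per Pascal's argument, would instead take all 12 nonce bytes directly and leave the widening policy to the protocol.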
On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote:
...
> > My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations).
> >
> This part I do not really understand. Why would you need to allocate a new transform if you change the key? Why can't you just call setkey() on the already allocated transform?

Because the single transform will be shared between all users running on different CPUs etc., and so the key should not be part of the TFM state but of the request state.

> > It all depends on whether we are interested in supporting async accelerators or not, and it is clear what my position is on this point.
> >
> Maybe not for an initial upstream, but it should be a longer-term goal.
>
> > What I *don't* want is to merge WireGuard with its own library based crypto now, and extend that later for async accelerators once people realize that we really do need that as well.
> >
> What's wrong with a step-by-step approach though? i.e. merge it with library calls now and then gradually work towards the goal of integrating (a tweaked version of) the Crypto API where that actually makes sense? Rome wasn't built in one day either ...

I should clarify: what I don't want is two frameworks in the kernel for doing async crypto, the existing one plus a new library-based one.
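The per-request-key idea being debated here can be sketched with plain data structures. In this mock-up (illustrative types only, not the actual crypto API), the transform holds only the immutable algorithm choice and is shared by everyone, while each request carries its own key, so no setkey() on shared state is needed on the hot path:

```c
#include <stdint.h>
#include <string.h>

/* One shared, immutable transform: algorithm identity only, no key.
 * In the real crypto API this is the expensive-to-instantiate tfm. */
struct tfm {
	const char *alg;  /* e.g. "rfc7539(chacha20,poly1305)" */
};

/* Per-request state carries the key, so many CPUs/peers can use the
 * same tfm concurrently without mutating it. */
struct aead_request {
	const struct tfm *tfm;
	uint8_t key[32];          /* per-keypair session key */
	uint64_t nonce;
};

void request_init(struct aead_request *req, const struct tfm *tfm,
		  const uint8_t key[32])
{
	req->tfm = tfm;
	memcpy(req->key, key, 32);
	req->nonce = 0;
}

/* The hot path reads the key from the request, never from the tfm. */
const uint8_t *request_key(const struct aead_request *req)
{
	return req->key;
}
```

This is why "different CPUs" matters in Ard's reply: with the key in the shared TFM, two CPUs wanting different keys would race on setkey(); with the key in the request, the shared object is read-only.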
> -----Original Message-----
> From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Sent: Thursday, September 26, 2019 3:16 PM
> To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com>
> Cc: Jason A. Donenfeld <Jason@zx2c4.com>; Linux Crypto Mailing List <linux-crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas <catalin.marinas@arm.com>
> Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
>
> On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote:
> ...
> > > My preference would be to address this by permitting per-request keys in the AEAD layer. That way, we can instantiate the transform only once, and just invoke it with the appropriate key on the hot path (and avoid any per-keypair allocations)
> > >
> > This part I do not really understand. Why would you need to allocate a new transform if you change the key? Why can't you just call setkey() on the already allocated transform?
> >
> Because the single transform will be shared between all users running on different CPUs etc, and so the key should not be part of the TFM state but of the request state.

So you need a transform per user, such that each user can have his own key. But you shouldn't need to reallocate it when the user changes his key. I also don't see how the "different CPUs" is relevant here? I can share a single key across multiple CPUs here just fine ...
Regards, Pascal van Leeuwen Silicon IP Architect, Multi-Protocol Engines @ Verimatrix www.insidesecure.com
On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > > > -----Original Message----- > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > > > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen > > <pvanleeuwen@verimatrix.com> wrote: > > ... > > > > > > > > My preference would be to address this by permitting per-request keys > > > > in the AEAD layer. That way, we can instantiate the transform only > > > > once, and just invoke it with the appropriate key on the hot path (and > > > > avoid any per-keypair allocations) > > > > > > > This part I do not really understand. Why would you need to allocate a > > > > new transform if you change the key? Why can't you just call setkey() > > > > on the already allocated transform? > > > > > > > > Because the single transform will be shared between all users running > > > on different CPUs etc, and so the key should not be part of the TFM > > > state but of the request state. > > > > > So you need a transform per user, such that each user can have his own > > key. But you shouldn't need to reallocate it when the user changes his > > key. I also don't see how the "different CPUs" is relevant here?
I can > share a single key across multiple CPUs here just fine ... > We need two transforms per connection, one for each direction. That is how I currently implemented it, and if allocating/freeing those on the same path where the keypair object itself is allocated is too costly, I wonder why allocating the keypair object itself is fine. But what I am suggesting is to use a single TFM that gets shared by all the connections, where the key for each operation is provided per-request. That TFM cannot have a key set, because each user may use a different key.
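The distinction Ard is drawing can be sketched in a few lines of plain C (illustrative only; `tfm`, `request`, and `effective_key_byte` are made-up names, not the actual kernel crypto API): either the key is TFM state, which forces a transform per connection direction or setkey() serialization, or it rides along with each request, so one shared transform can serve every connection on every CPU.

```c
#include <stddef.h>

/* Illustrative only: `tfm` and `request` are stand-ins for a crypto
 * transform and a per-operation request, not the kernel crypto API. */
struct tfm {
    unsigned char key[32];      /* key as TFM state: shared and racy
                                   across CPUs, or one TFM per user */
};

struct request {
    const unsigned char *key;   /* key as request state: travels with
                                   each operation; TFM stays keyless */
    const unsigned char *src;
    size_t len;
};

/* Pick the key an operation would use: the per-request key when one is
 * supplied, otherwise whatever setkey() stored in the transform. */
static unsigned char effective_key_byte(const struct tfm *t,
                                        const struct request *req)
{
    return req->key ? req->key[0] : t->key[0];
}
```

Under the per-request model, WireGuard could keep one global AEAD transform and hand each peer's session key over in the request itself, which is what would eliminate both the per-keypair transform allocations and cross-CPU setkey() races being discussed.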
> -----Original Message----- > From: linux-crypto-owner@vger.kernel.org <linux-crypto-owner@vger.kernel.org> On Behalf Of Ard Biesheuvel > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > On Thu, 26 Sep 2019 at 16:03, Pascal Van Leeuwen > <pvanleeuwen@verimatrix.com> wrote: > > > > > -----Original Message----- > > > From: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > > Subject: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API > > > > > > On Thu, 26 Sep 2019 at 15:06, Pascal Van Leeuwen > > > <pvanleeuwen@verimatrix.com> wrote: > > > ... > > > > > > > > > > My preference would be to address this by permitting per-request keys > > > > > in the AEAD layer. That way, we can instantiate the transform only > > > > > once, and just invoke it with the appropriate key on the hot path (and > > > > > avoid any per-keypair allocations) > > > > > > > > > This part I do not really understand. Why would you need to allocate a > > > > new transform if you change the key? Why can't you just call setkey() > > > > on the already allocated transform? > > > > > > > > > > Because the single transform will be shared between all users running > > > on different CPUs etc, and so the key should not be part of the TFM > > > state but of the request state. > > > > > So you need a transform per user, such that each user can have his own > > key. But you shouldn't need to reallocate it when the user changes his > > key. I also don't see how the "different CPUs" is relevant here? I can > > share a single key across multiple CPUs here just fine ... > > > > We need two transforms per connection, one for each direction.
That is > how I currently implemented it, and it seems to me that, if > allocating/freeing those on the same path as where the keypair object > itself is allocated is too costly, I wonder why allocating the keypair > object itself is fine. > I guess that keypair object is a WireGuard-specific thing? In that case it may not make a difference performance-wise. I just would not expect a new (pair of) transforms to be allocated just for a rekey, only when a new connection is made. Thinking about this some more: allocating a transform is about more than just allocating the object; there may be all kinds of side effects, like fallback ciphers being allocated, specific HW initialization being done, etc. I just feel that if you only need to change the key, you should only change the key, as that's what the driver would be optimized for.
Hi Ard, On Thu, Sep 26, 2019 at 2:07 PM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > attitude goes counter to that, and this is why we have made so little > progress over the past year. I also just haven't submitted much in the past year, taking a bit of a break to see how things would settle. Seemed like rushing things wasn't prudent, so I slowed down. > But I am happy with your willingness to collaborate and find common > ground, which was also my motivation for spending a considerable > amount of time to prepare this patch set. Super. > > If the process of doing that together will be fraught with difficulty, > > I’m still open to the “7 patch series” with the ugly cryptoapi.c > > approach, as described at the top. > > If your aim is to write ugly code and use that as a munition No, this is not a matter of munition at all. Please take my words seriously; I am entirely genuine here. Three people I greatly respect made a very compelling argument to me, prompting the decision in [1]. The argument was that trying to fix the crypto layer AND trying to get WireGuard merged at the same time was ambitious and crazy. Maybe instead, they argued, I should just use the old crypto API, get WireGuard in, and then begin the Zinc process after. I think procedurally, that's probably good advice, and the people I was talking to seemed to have a really firm grasp on what works and what doesn't in the mainlining process. Now it's possible their judgement is wrong, but I really am open, in earnest, to following it. And the way that would look would be not trying to fix the crypto API now, getting WireGuard in based on whatever we can cobble together based on the current foundations with some intermediate file (cryptoapi.c in my previous email) to prevent it from infecting WireGuard. This isn't "munition"; it's a serious proposal. 
The funny thing, though, is that all the while I was under the impression somebody had figured out a great way to do this, it turns out you were busy with basically Zinc-but-not. So we're back to square one: you and I both want the crypto API to change, and now we have to figure out a way forward together on how to do this, prompting my last email to you, indicating that I was open to all sorts of compromises. However, I still remain fully open to following the prior suggestion, of not doing that at all right now, and instead basing this on the existing crypto API as-is. [1] https://lore.kernel.org/wireguard/CAHmME9pmfZAp5zd9BDLFc2fWUhtzZcjYZc2atTPTyNFFmEdHLg@mail.gmail.com/ > > reference, here’s what that kind of thing looks like: [2]. > > This is one of the issues in the 'fix it for everyone else as well' > category. If we can improve the crypto API to be less susceptible to > these issues (e.g., using static calls), everybody benefits. I'd be > happy to collaborate on that. Good. I'm happy to learn that you're all for having fast implementations that underlie the simple function calls. > > Taken together, we simply can’t skimp on the implementations available > > on the handshake layer, so we’ll need to add some form of > > implementation selection, whether it’s the method Zinc uses ([2]), or > > something else we cook up together. > > So are you saying that the handshake timing constraints in the > WireGuard protocol are so stringent that we can't run it securely on, > e.g., an ARM CPU that lacks a NEON unit? Or given that you are not > providing accelerated implementations of blake2s or Curve25519 for > arm64, we can't run it securely on arm64 at all? Deployed at scale, the handshake must have a certain performance to not be DoS'd. I've spent a long time benching these and attacking my own code. I won't be comfortable with this going in without the fast implementations for the handshake. 
And down the line, too, we can look into how to improve the DoS resistance even further. I think there's room for improvement, and I hope at some point you'll join us in discussions on WireGuard internals. But the bottom line is that we need those fast primitives. > Typically, I would prefer to only introduce different versions of the > same algorithm if there is a clear performance benefit for an actual > use case. As I was saying, this is indeed the case. > Framing this as a security issue rather than a performance issue is > slightly disingenuous, since people are less likely to challenge it. I'm not being disingenuous. DoS resistance is a real issue with WireGuard. You might argue that FourQ and SipHash would have made better choices, and that's an interesting discussion, but it is what it is. The thing needs fast implementations. And we're going to have to implement that code anyway for other things, so might as well get it working well now. > But the security of any VPN protocol worth its salt You're not required to use WireGuard. > Parsing the string and connecting the function pointers happens only > once, and only when the transform needs to be instantiated from its > constituent parts. Subsequent invocations will just grab the existing > object. That's good to know. It doesn't fully address the issue, though. > My preference would be to address this by permitting per-request keys > in the AEAD layer. That way, we can instantiate the transform only > once, and just invoke it with the appropriate key on the hot path (and > avoid any per-keypair allocations) That'd be a major improvement to the async interface, yes. > > So given the above, how would you like to proceed? My personal > > preference would be to see you start with the Zinc patchset and rename > > things and change the infrastructure to something that fits your > > preferences, and we can see what that looks like.
Less appealing would > be to do several iterations of you reworking Zinc from scratch and > going through the exercises all over again, but if you prefer that I > guess I could cope. Alternatively, maybe this is a lot to chew on, and > we should just throw caution to the wind, implement cryptoapi.c for > WireGuard (as described at the top), and add C functions to the crypto > API sometime later? This is what I had envisioned in [1]. > It all depends on whether we are interested in supporting async > accelerators or not, and it is clear what my position is on this > point. For a first version of WireGuard, no, I'm really not interested in that. Adding it in there is more ambitious than it looks to get it right. Async means more buffers, which means the queuing system for WireGuard needs to be changed. There's already ongoing research into this, and I'm happy to consider that research with an eye toward maybe having async stuff in the future. But sticking it into the code now as-is simply does not work from a buffering/queueing perspective. So again, let's take an iterative approach here: first we do stuff with the simple synchronous API. After the dust has settled, hardware is available for testing, Van Jacobson has been taken off the bookshelf for a fresh reading, and we've all sat down for a few interesting conversations at netdev on queueing and bufferbloat, then let's start working this in. In other words, just because technically you can glue those APIs together, sort of, doesn't mean that approach makes sense for the system as a whole. > I am not convinced that we need accelerated implementations of blake2s > and curve25519, but if we do, I'd like those to be implemented as > individual modules under arch/*/crypto, with some moduleloader fu for > weak symbols or static calls thrown in if we have to. Exposing them as > shashes seems unnecessary to me at this point. We need the accelerated implementations. And we'll need them for chapoly too, obviously.
So let's work out a good way to hook that all into the Zinc-style interface. [2] does it in a very effective way that's overall quite good for performance and easy to follow. The chacha20-x86_64-glue.c code itself gets called via the static symbol chacha20_arch. This is implemented for each platform with a fallback to one that returns false, so that the generic code is called. The Zinc stuff here is obvious, simple, and I'm pretty sure you know what's up with it. I prefer each of these glue implementations to live in lib/zinc/chacha20/chacha20-${ARCH}-glue.c. You don't like that and want things in arch/${ARCH}/crypto/chacha20-glue.c. Okay, sure, fine, let's do all the naming and organization and political stuff how you like, and I'll leave aside my arguments about why I disagree. Let's take stock of where that leaves us, in terms of files: - lib/crypto/chacha20.c: this has a generic implementation, but at the top of the generic implementation, it has some code like "if (chacha20_arch(..., ..., ...)) return;" - arch/x86_64/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/arm/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which branches out to the various SIMD implementations depending on some booleans calculated at module load time. - arch/mips/crypto/chacha20-glue.c: this has the chacha20_arch() implementation, which contains an assembly version that's always run unconditionally. Our goal is that chacha20_arch() from each of these arch glues gets included in the lib/crypto/chacha20.c compilation unit. The reason why we want it in the same unit is so that the inliner can get rid of unreached code and more tightly integrate the branches. For the MIPS case, the advantage is clear. Here's how Zinc handles it: [3]. Some simple ifdefs and includes. Easy and straightforward.
Some people might object, though, and it sounds like you might. So let's talk about some alternative mechanisms with their pros and cons: - The zinc method: straightforward, but not everybody likes ifdefs. - Stick the filename to be included into a Kconfig variable and let the configuration system do the logic: also straightforward. Not sure it's kosher, but it would work. - Weak symbols: we don't get inlining or the dead code elimination. - Function pointers: ditto, plus spectre. - Other ideas you might have? I'm open to suggestions here. [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54 [3] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19 > What I *don't* want is to merge WireGuard with its own library based > crypto now, and extend that later for async accelerators once people > realize that we really do need that as well. I wouldn't worry so much about that. Zinc/library-based-crypto is just trying to fulfill the boring synchronous pure-code part of crypto implementations. For the async stuff, we can work together on improving the existing crypto API to be more appealing, in tandem with some interesting research into packet queuing systems. From the other thread, you might have seen that already Toke has cool ideas that I hope we can all sit down and talk about. I'm certainly not interested in "bolting" anything on to Zinc/library-based-crypto, and I'm happy to work to improve the asynchronous API for doing asynchronous crypto. Jason
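For reference, the dispatch shape described above — the arch hook and the generic entry point compiled into one unit so dead branches disappear — can be sketched in plain C (illustrative; `HAVE_ARCH_CHACHA20` is a made-up stand-in for whatever ifdef/Kconfig plumbing is chosen, and the cipher bodies are stubs):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for arch/${ARCH}/crypto/chacha20-glue.c being compiled into
 * the same unit as the generic code. With HAVE_ARCH_CHACHA20 defined
 * and the hook returning true unconditionally (the MIPS case), the
 * compiler can drop the generic body as dead code; with it undefined,
 * the branch folds away and only the generic body remains. */
#ifdef HAVE_ARCH_CHACHA20
static bool chacha20_arch(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;
    /* arch-specific assembly/SIMD would run here */
    return true;
}
#else
static bool chacha20_arch(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;
    return false;   /* no arch code configured: use the generic path */
}
#endif

static int generic_calls;   /* instrumentation for this sketch only */

static void chacha20_generic(unsigned char *dst, const unsigned char *src,
                             size_t len)
{
    (void)dst; (void)src; (void)len;
    generic_calls++;
    /* portable scalar implementation would live here */
}

/* The library entry point, mirroring the
 * "if (chacha20_arch(...)) return;" pattern. */
void chacha20(unsigned char *dst, const unsigned char *src, size_t len)
{
    if (chacha20_arch(dst, src, len))
        return;
    chacha20_generic(dst, src, len);
}
```

Because both functions sit in one translation unit, the `if` costs nothing once the hook's return value is known at compile time.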
> > So are you saying that the handshake timing constraints in the > > WireGuard protocol are so stringent that we can't run it securely on, > > e.g., an ARM CPU that lacks a NEON unit? Or given that you are not > > providing accelerated implementations of blake2s or Curve25519 for > > arm64, we can't run it securely on arm64 at all? > > Deployed at scale, the handshake must have a certain performance to > not be DoS'd. I've spent a long time benching these and attacking my > own code. I won't be comfortable with this going in without the fast > implementations for the handshake. As a networking guy, the relation between fast handshake crypto and DoS resistance is not obvious. Could you explain this a bit? It seems like a lot of people would like an OpenWRT box to be their VPN gateway. And most of them are small ARM or MIPS processors. Are you saying WireGuard will not be usable on such devices? Thanks Andrew
On Thu, Sep 26, 2019 at 1:52 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Ard, > > > Our goals are that chacha20_arch() from each of these arch glues gets > included in the lib/crypto/chacha20.c compilation unit. The reason why > we want it in its own unit is so that the inliner can get rid of > unreached code and more tightly integrate the branches. For the MIPS > case, the advantage is clear. IMO this needs numbers. My suggestion from way back, which is at least a good deal of the way toward being doable, is to do static calls. This means that the common code will call out to the arch code via a regular CALL instruction and will *not* inline the arch code. This means that the arch code could live in its own module, it can be selected at boot time, etc. For x86, inlining seems a bit nuts to avoid a whole mess of: if (use avx2) do_avx2_thing(); else if (use avx1) do_avx1_thing(); else etc; On x86, direct calls are pretty cheap. Certainly for operations like curve25519, I doubt you will ever see a real-world effect from inlining. I'd be surprised for chacha20. If you really want inlining to dictate the overall design, I think you need some real numbers for why it's necessary. There also needs to be a clear story for how exactly making everything inline plays with the actual decision of which implementation to use. I think it's also worth noting that LTO is coming. --Andy
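Andy's static-call suggestion, reduced to a userspace sketch with a plain function pointer (names are made up; the kernel mechanism differs precisely in that it patches the call site at update time, so the hot path pays a direct CALL instead of this indirect, retpoline-afflicted one):

```c
#include <stddef.h>

typedef void (*chacha20_fn)(unsigned char *dst, const unsigned char *src,
                            size_t len);

static void chacha20_scalar(unsigned char *dst, const unsigned char *src,
                            size_t len)
{
    (void)dst; (void)src; (void)len;   /* portable fallback stub */
}

static void chacha20_avx2(unsigned char *dst, const unsigned char *src,
                          size_t len)
{
    (void)dst; (void)src; (void)len;   /* SIMD implementation stub */
}

/* Single call target, chosen once; hot-path callers invoke
 * chacha20_call(...) with no per-call CPU-feature tests. */
static chacha20_fn chacha20_call = chacha20_scalar;

/* One-time selection at boot/module load, analogous to updating a
 * static call; afterward the implementation never changes. */
static void chacha20_select(int have_avx2)
{
    chacha20_call = have_avx2 ? chacha20_avx2 : chacha20_scalar;
}
```

This keeps the "whole mess of" feature branches out of the hot path while still allowing the arch code to live in its own module and be selected at boot.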
On Thu, 26 Sep 2019 13:06:51 +0200, Jason A. Donenfeld wrote: > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote: > > Actually, that assumption is factually wrong. I don't know if anything > > is *publicly* available, but I can assure you the silicon is running in > > labs already. And something will be publicly available early next year > > at the latest. Which could nicely coincide with having Wireguard support > > in the kernel (which I would also like to see happen BTW) ... > > > > Not "at some point". It will. Very soon. Maybe not in consumer or server > > CPUs, but definitely in the embedded (networking) space. > > And it *will* be much faster than the embedded CPU next to it, so it will > > be worth using it for something like bulk packet encryption. > > Super! I was wondering if you could speak a bit more about the > interface. My biggest questions surround latency. Will it be > synchronous or asynchronous? If the latter, why? What will its > latencies be? How deep will its buffers be? The reason I ask is that a > lot of crypto acceleration hardware of the past has been fast and > has had very deep buffers, but at great expense of latency. In the > networking context, keeping latency low is pretty important. FWIW, are you familiar with the existing kTLS and IPsec offloads in the networking stack? They offload the crypto into the NIC, inline, which helps with the latency and processing overhead. There is also NIC silicon that can do some ChaCha/Poly, although I'm not familiar enough with WireGuard to know if offload to existing silicon will be possible.
On Thu, Sep 26, 2019 at 6:52 AM Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> wrote: > > > -----Original Message----- > > From: Jason A. Donenfeld <Jason@zx2c4.com> > > Sent: Thursday, September 26, 2019 1:07 PM > > To: Pascal Van Leeuwen <pvanleeuwen@verimatrix.com> > > Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>; Linux Crypto Mailing List <linux- > > crypto@vger.kernel.org>; linux-arm-kernel <linux-arm-kernel@lists.infradead.org>; > > Herbert Xu <herbert@gondor.apana.org.au>; David Miller <davem@davemloft.net>; Greg KH > > <gregkh@linuxfoundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Samuel > > Neves <sneves@dei.uc.pt>; Dan Carpenter <dan.carpenter@oracle.com>; Arnd Bergmann > > <arnd@arndb.de>; Eric Biggers <ebiggers@google.com>; Andy Lutomirski <luto@kernel.org>; > > Will Deacon <will@kernel.org>; Marc Zyngier <maz@kernel.org>; Catalin Marinas > > <catalin.marinas@arm.com>; Willy Tarreau <w@1wt.eu>; Netdev <netdev@vger.kernel.org>; > > Toke Høiland-Jørgensen <toke@toke.dk>; Dave Taht <dave.taht@gmail.com> > > Subject: chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard > > using the existing crypto API] > > > > [CC +willy, toke, dave, netdev] > > > > Hi Pascal > > > > On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen > > <pvanleeuwen@verimatrix.com> wrote: > > > Actually, that assumption is factually wrong. I don't know if anything > > > is *publicly* available, but I can assure you the silicon is running in > > > labs already. And something will be publicly available early next year > > > at the latest. Which could nicely coincide with having Wireguard support > > > in the kernel (which I would also like to see happen BTW) ... > > > > > > Not "at some point". It will. Very soon. Maybe not in consumer or server > > > CPUs, but definitely in the embedded (networking) space. 
> > > And it *will* be much faster than the embedded CPU next to it, so it will > > > be worth using it for something like bulk packet encryption. > > > > Super! I was wondering if you could speak a bit more about the > > interface. My biggest questions surround latency. Will it be > > synchronous or asynchronous? > > > The hardware being external to the CPU and running in parallel with it, > obviously asynchronous. > > > If the latter, why? > > > Because, as you probably already guessed, the round-trip latency is way > longer than the actual processing time, at least for small packets. > > Partly because the only way to communicate between the CPU and the HW > accelerator (whether that is crypto, a GPU, a NIC, etc.) that doesn't > keep the CPU busy moving data is through memory, with the HW doing DMA. > And, as any programmer should know, round-trip times to memory are huge > relative to the processing speed. > > And partly because these accelerators are very similar to CPUs in > terms of architecture, doing pipelined processing and having multiple > of such pipelines in parallel. Except that these pipelines are not > working on low-level instructions but on full packets/blocks. So they > need to have many packets in flight to keep those pipelines fully > occupied. And packets need to move through the various pipeline stages, > so they incur the time needed to process them multiple times. (just > like e.g. a multiply instruction with a throughput of 1 per cycle > may need 4 or more cycles to actually provide its result) > > Could you do that from a synchronous interface? In theory, probably, > if you would spawn a new thread for every new packet arriving and > rely on the scheduler to preempt the waiting threads.
But you'd need > as many threads as the HW accelerator can have packets in flight, > while an async would need only 2 threads: one to handle the input to > the accelerator and one to handle the output (or at most one thread > per CPU, if you want to divide the workload) > > Such a many-thread approach seems very inefficient to me. > > > What will its latencies be? > > > Depends very much on the specific integration scenario (i.e. bus > speed, bus hierarchy, cache hierarchy, memory speed, etc.) but on > the order of a few thousand CPU clocks is not unheard of. > Which is an eternity for the CPU, but still only a few uSec in > human time. Not a problem unless you're a high-frequency trader and > every ns counts ... > It's not like the CPU would process those packets in zero time. > > > How deep will its buffers be? > > > That of course depends on the specific accelerator implementation, > but possibly dozens of small packets in our case, as you'd need > at least width x depth packets in there to keep the pipes busy. > Just like a modern CPU needs hundreds of instructions in flight > to keep all its resources busy. > > > The reason I ask is that a > > lot of crypto acceleration hardware of the past has been fast and > > having very deep buffers, but at great expense of latency. > > > Define "great expense". Everything is relative. The latency is very > high compared to per-packet processing time but at the same time it's > only on the order of a few uSec. Which may not even be significant on > the total time it takes for the packet to travel from input MAC to > output MAC, considering the CPU will still need to parse and classify > it and do pre- and postprocessing on it. > > > In the networking context, keeping latency low is pretty important. > > > I've been doing this for IPsec for nearly 20 years now and I've never > heard anyone complain about our latency, so it must be OK. Well, it depends on where your bottlenecks are. 
On low-end hardware you can and do tend to bottleneck on the crypto step, and with uncontrolled, non-fq'd non-aqm'd buffering you get results like this: http://blog.cerowrt.org/post/wireguard/ so in terms of "threads" I would prefer to think of flows entering the tunnel and attempting to multiplex them as best as possible across the crypto hard/software so that minimal in-hw latencies are experienced for most packets and that the coupled queue length does not grow out of control, Adding fq_codel's hashing algo and queuing to ipsec as was done in commit: 264b87fa617e758966108db48db220571ff3d60e to leverage the inner hash... Had some nice results: before: http://www.taht.net/~d/ipsec_fq_codel/oldqos.png (100ms spikes) After: http://www.taht.net/~d/ipsec_fq_codel/newqos.png (2ms spikes) I'd love to see more vpn vendors using the rrul test or something even nastier to evaluate their results, rather than dragstrip bulk throughput tests, steering multiple flows over multiple cores. > We're also doing (fully inline, no CPU involved) MACsec cores, which > operate at layer 2 and I know it's a concern there for very specific > use cases (high frequency trading, precision time protocol, ...). > For "normal" VPN's though, a few uSec more or less should be a non-issue. Measured buffering is typically 1000 packets in userspace vpns. If you can put data in, faster than you can get it out, well.... > > Already > > WireGuard is multi-threaded which isn't super great all the time for > > latency (improvements are a work in progress). If you're involved with > > the design of the hardware, perhaps this is something you can help > > ensure winds up working well? For example, AES-NI is straightforward > > and good, but Intel can do that because they are the CPU. It sounds > > like your silicon will be adjacent. How do you envision this working > > in a low latency environment? > > > Depends on how low low-latency is. 
If you really need minimal latency, > you need an inline implementation. Which we can also provide, BTW :-) > > Regards, > Pascal van Leeuwen > Silicon IP Architect, Multi-Protocol Engines @ Verimatrix > www.insidesecure.com -- Dave Täht CTO, TekLibre, LLC http://www.teklibre.com Tel: 1-831-205-9740
Hey Andy, Thanks for weighing in. > inlining. I'd be surprised for chacha20. If you really want inlining > to dictate the overall design, I think you need some real numbers for > why it's necessary. There also needs to be a clear story for how > exactly making everything inline plays with the actual decision of > which implementation to use. Take a look at my description for the MIPS case: when on MIPS, the arch code is *always* used since it's just straight up scalar assembly. In this case, the chacha20_arch function *never* returns false [1], which means it's always included [2], so the generic implementation gets optimized out, saving disk and memory, which I assume MIPS people care about. [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13 [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118 I'm fine with considering this a form of "premature optimization", though, and ditching the motivation there. On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote: > My suggestion from way back, which is at > least a good deal of the way toward being doable, is to do static > calls. This means that the common code will call out to the arch code > via a regular CALL instruction and will *not* inline the arch code. > This means that the arch code could live in its own module, it can be > selected at boot time, etc. Alright, let's do static calls, then, to deal with the case of going from the entry point implementation in lib/zinc (or lib/crypto, if you want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto. And then within each arch, we can keep it simple, since everything is already in the same directory. Sound good? Jason
On Fri, 27 Sep 2019 at 09:21, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hey Andy, > > Thanks for weighing in. > > > inlining. I'd be surprised for chacha20. If you really want inlining > > to dictate the overall design, I think you need some real numbers for > > why it's necessary. There also needs to be a clear story for how > > exactly making everything inline plays with the actual decision of > > which implementation to use. > > Take a look at my description for the MIPS case: when on MIPS, the > arch code is *always* used since it's just straight up scalar > assembly. In this case, the chacha20_arch function *never* returns > false [1], which means it's always included [2], so the generic > implementation gets optimized out, saving disk and memory, which I > assume MIPS people care about. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13 > [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118 > > I'm fine with considering this a form of "premature optimization", > though, and ditching the motivation there. > > On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski <luto@kernel.org> wrote: > > My suggestion from way back, which is at > > least a good deal of the way toward being doable, is to do static > > calls. This means that the common code will call out to the arch code > > via a regular CALL instruction and will *not* inline the arch code. > > This means that the arch code could live in its own module, it can be > > selected at boot time, etc. > > Alright, let's do static calls, then, to deal with the case of going > from the entry point implementation in lib/zinc (or lib/crypto, if you > want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto. > And then within each arch, we can keep it simple, since everything is > already in the same directory. > > Sound good? > Yup. 
I posted something to this effect - I am ironing out some wrinkles doing randconfig builds (with Arnd's help) but the general picture shouldn't change.