Message ID | 20230118214111.394416-1-dima@arista.com |
---|---|
Headers | show |
Series | net/crypto: Introduce crypto_pool | expand |
Hi Herbert, On 1/19/23 09:51, Herbert Xu wrote: > On Wed, Jan 18, 2023 at 09:41:08PM +0000, Dmitry Safonov wrote: >> Introduce a per-CPU pool of async crypto requests that can be used >> in bh-disabled contexts (designed with net RX/TX softirqs as users in >> mind). Allocation can sleep and is a slow-path. >> Initial implementation has only ahash as a backend and a fix-sized array >> of possible algorithms used in parallel. >> >> Signed-off-by: Dmitry Safonov <dima@arista.com> >> --- >> crypto/Kconfig | 3 + >> crypto/Makefile | 1 + >> crypto/crypto_pool.c | 333 ++++++++++++++++++++++++++++++++++++++++++ >> include/crypto/pool.h | 46 ++++++ >> 4 files changed, 383 insertions(+) >> create mode 100644 crypto/crypto_pool.c >> create mode 100644 include/crypto/pool.h > > I'm still nacking this. > > I'm currently working on per-request keys which should render > this unnecessary. With per-request keys you can simply do an > atomic kmalloc when you compute the hash. Adding per-request keys sounds like a real improvement to me. But that is not the same issue I'm addressing here. I'm maybe bad at describing or maybe I just don't see how per-request keys would help. Let me describe the problem I'm solving again and please feel free to correct inline or suggest alternatives. The initial need for crypto_pool comes from TCP-AO implementation that I'm pusing upstream, see RFC5925 that describes the option and the latest version of patch set is in [1]. In that patch set hashing is used in a similar way to TCP-MD5: crypto_alloc_ahash() is a slow-path in setsockopt() and the use of pre-allocated requests in fast path, TX/RX softirqs. For TCP-AO 2 algorithms are "must have" in any compliant implementation, according to RFC5926: HMAC-SHA-1-96 and AES-128-CMAC-96, other algorithms are optional. But having in mind that sha1, as you know, is not secure to collision attacks, some customers prefer to have/use stronger hashes. In other words, TCP-AO implementation needs 2+ hashing algorithms to be used in a similar manner as TCP-MD5 uses MD5 hashing. And than, I look around and I see that the same pattern (slow allocation of crypto request and usage on a fast-path with bh disabled) is used in other places over kernel: - here I convert to crypto_pool seg6_hmac & tcp-md5 - net/ipv4/ah4.c could benefit from it: currently it allocates crypto_alloc_ahash() per every connection, allocating user-specified hash algorithm with ahash = crypto_alloc_ahash(x->aalg->alg_name, 0, 0), which are not shared between each other and it doesn't provide pre-allocated temporary/scratch buffer to calculate hash, so it uses GFP_ATOMIC in ah_alloc_tmp() - net/ipv6/ah6.c is copy'n'paste of the above - net/ipv4/esp4.c and net/ipv6/esp6.c are more-or-less also copy'n'paste with crypto_alloc_aead() instead of crypto_alloc_ahash() - net/mac80211/ - another example of the same pattern, see even the comment in ieee80211_key_alloc() where the keys are allocated and the usage in net/mac80211/{rx,tx}.c with bh-disabled - net/xfrm/xfrm_ipcomp.c has its own manager for different compression algorithms that are used in quite the same fashion. The significant exception is scratch area: it's IPCOMP_SCRATCH_SIZE=65400. So, if it could be shared with other crypto users that do the same pattern (bh-disabled usage), it would save some memory. And those are just fast-grep examples from net/, looking closer it may be possible to find more potential users. So, in crypto_pool.c it's 333 lines where is a manager that let a user share pre-allocated ahash requests [comp, aead, may be added on top] inside bh-disabled section as well as share a temporary/scratch buffer. It will make it possible to remove some if not all custom managers of the very same code pattern, some of which don't even try to share pre-allocated tfms. That's why I see some value in this crypto-pool thing. If you NACK it, the alternative for TCP-AO patches would be to add just another pool into net/ipv4/tcp.c that either copies TCP-MD5 code or re-uses it. I fail to see how your per-request keys patches would provide an API alternative to this patch set. Still, users will have to manage pre-allocated tfms and buffers. I can actually see how your per-request keys would benefit *from* this patch set: it will be much easier to wire per-req keys up to crypto_pool to avoid per-CPU tfm allocation for algorithms you'll add support for. In that case you won't have to patch crypto-pool users. [1]: https://lore.kernel.org/all/20221027204347.529913-1-dima@arista.com/T/#u Thanks, waiting for your input, Dmitry
On 1/20/23 08:49, Herbert Xu wrote: > On Thu, Jan 19, 2023 at 06:03:40PM +0000, Dmitry Safonov wrote: >> >> - net/ipv4/ah4.c could benefit from it: currently it allocates >> crypto_alloc_ahash() per every connection, allocating user-specified >> hash algorithm with ahash = crypto_alloc_ahash(x->aalg->alg_name, 0, 0), >> which are not shared between each other and it doesn't provide >> pre-allocated temporary/scratch buffer to calculate hash, so it uses >> GFP_ATOMIC in ah_alloc_tmp() >> - net/ipv6/ah6.c is copy'n'paste of the above >> - net/ipv4/esp4.c and net/ipv6/esp6.c are more-or-less also copy'n'paste >> with crypto_alloc_aead() instead of crypto_alloc_ahash() > > No they should definitely not switch over to the pool model. In > fact, these provide the correct model that you should follow. > > The correct model is to allocate the tfm on the control/slow path, > and allocate requests on the fast path (or reuse existing memory, > e.g., from the skb). Ok, I see. Do you think, it's worth having a pool of tfms? If not, I can proceed with TCP-AO patches set and implement pool of ahash tfms that will be used only for TCP-MD5 and TCP-AO, does that sound good to you? I see that ahash tfm allocation doesn't eat a lot of memory, rather little more than 100 bytes, but even so, I don't see why not saving some memory "for free", especially if one can have thousands of keys over different sockets. Where there's not much complexity in sharing tfms & scratch buffers? Thanks, Dmitry