[v3,00/17] crypto: x86 - fix RCU stalls

Message ID 20221103042740.6556-1-elliott@hpe.com

Message

Elliott, Robert (Servers) Nov. 3, 2022, 4:27 a.m. UTC
This series fixes the RCU stalls triggered by the x86 crypto
modules discussed in
https://lore.kernel.org/all/MW5PR84MB18426EBBA3303770A8BC0BDFAB759@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

Two root causes were:
- too much data processed between kernel_fpu_begin and
  kernel_fpu_end calls (which are heavily used by the x86
  optimized drivers)
- tcrypt not calling cond_resched during speed test loops

These problems have always been lurking, but improving the
loading of the x86/sha512 module made the stalls appear
frequently during boot when SHA-512 is used for module
signature checking.

Fixing these problems makes it safer to improve the loading of
the rest of the x86 modules the way the sha512 module was improved.

This series only handles the x86 modules.

Except for the tcrypt change, v3 only tackles the hash functions
as discussed in
https://lore.kernel.org/lkml/MW5PR84MB184284FBED63E2D043C93A6FAB369@MW5PR84MB1842.NAMPRD84.PROD.OUTLOOK.COM/

The limits are implemented as static const unsigned ints at the
module level, which makes them easy to expose as module parameters
for testing like this:
   -static const unsigned int bytes_per_fpu = 655 * 1024;
   +static unsigned int bytes_per_fpu = 655 * 1024;
   +module_param(bytes_per_fpu, uint, 0644);
   +MODULE_PARM_DESC(bytes_per_fpu, "Bytes per FPU context");
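
For reference, each driver applies its limit roughly like this (a
sketch of the pattern with illustrative function names, not the
exact patch contents):

static int sha512_update_x86(struct shash_desc *desc, const u8 *data,
                             unsigned int len)
{
        /* Hold the FPU context for at most bytes_per_fpu bytes so
         * the scheduler and RCU can run between chunks.
         */
        while (len) {
                unsigned int chunk = min(len, bytes_per_fpu);

                kernel_fpu_begin();
                sha512_base_do_update(desc, data, chunk, sha512_xform);
                kernel_fpu_end();

                data += chunk;
                len -= chunk;
        }
        return 0;
}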


Robert Elliott (17):
  crypto: tcrypt - test crc32
  crypto: tcrypt - test nhpoly1305
  crypto: tcrypt - reschedule during cycles speed tests
  crypto: x86/sha - limit FPU preemption
  crypto: x86/crc - limit FPU preemption
  crypto: x86/sm3 - limit FPU preemption
  crypto: x86/ghash - use u8 rather than char
  crypto: x86/ghash - restructure FPU context saving
  crypto: x86/ghash - limit FPU preemption
  crypto: x86/*poly* - limit FPU preemption
  crypto: x86/sha - register all variations
  crypto: x86/sha - minimize time in FPU context
  crypto: x86/sha1, sha256 - load based on CPU features
  crypto: x86/crc - load based on CPU features
  crypto: x86/sm3 - load based on CPU features
  crypto: x86/ghash,polyval - load based on CPU features
  crypto: x86/nhpoly1305, poly1305 - load based on CPU features

 arch/x86/crypto/crc32-pclmul_asm.S         |   6 +-
 arch/x86/crypto/crc32-pclmul_glue.c        |  36 ++-
 arch/x86/crypto/crc32c-intel_glue.c        |  58 +++--
 arch/x86/crypto/crct10dif-pclmul_glue.c    |  54 ++--
 arch/x86/crypto/ghash-clmulni-intel_asm.S  |   4 +-
 arch/x86/crypto/ghash-clmulni-intel_glue.c |  43 ++--
 arch/x86/crypto/nhpoly1305-avx2-glue.c     |  21 +-
 arch/x86/crypto/nhpoly1305-sse2-glue.c     |  21 +-
 arch/x86/crypto/poly1305_glue.c            |  49 +++-
 arch/x86/crypto/polyval-clmulni_glue.c     |  14 +-
 arch/x86/crypto/sha1_ssse3_glue.c          | 276 +++++++++++++--------
 arch/x86/crypto/sha256_ssse3_glue.c        | 268 +++++++++++++-------
 arch/x86/crypto/sha512_ssse3_glue.c        | 191 ++++++++------
 arch/x86/crypto/sm3_avx_glue.c             |  45 +++-
 crypto/tcrypt.c                            |  56 +++--
 15 files changed, 764 insertions(+), 378 deletions(-)

Comments

Jason A. Donenfeld Nov. 16, 2022, 11:13 a.m. UTC | #1
On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> +static const unsigned int bytes_per_fpu = 337 * 1024;
> +

Use an enum for constants like this:

    enum { BYTES_PER_FPU = ... };

You can even make it function-local, so it's near the code that uses it,
which will better justify its existence.

Also, where did you get this number? Seems kind of weird.

>  asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
>  			u8 hash[NH_HASH_BYTES]);
>  
> @@ -26,18 +29,20 @@ static void _nh_avx2(const u32 *key, const u8 *message, size_t message_len,
>  static int nhpoly1305_avx2_update(struct shash_desc *desc,
>  				  const u8 *src, unsigned int srclen)
>  {
> +	BUILD_BUG_ON(bytes_per_fpu == 0);

Make the constant function local and remove this check.

> +
>  	if (srclen < 64 || !crypto_simd_usable())
>  		return crypto_nhpoly1305_update(desc, src, srclen);
>  
> -	do {
> -		unsigned int n = min_t(unsigned int, srclen, SZ_4K);
> +	while (srclen) {

Does this add a needless additional check or does it generate better
code? Would be nice to have some explanation of the rationale.

Same comments as above apply to the rest of this patch and series.
Jason A. Donenfeld Nov. 16, 2022, 11:26 a.m. UTC | #2
On Tue, Nov 15, 2022 at 10:13:39PM -0600, Robert Elliott wrote:
> For modules that have multiple choices, add read-only module parameters
> reporting which CPU features a module is using.
> 
> The parameters show up as follows for modules that modify the behavior
> of their registered drivers or register additional drivers for
> each choice:
>     /sys/module/aesni_intel/parameters/using_x86_avx:1
>     /sys/module/aesni_intel/parameters/using_x86_avx2:1
>     /sys/module/aria_aesni_avx_x86_64/parameters/using_x86_gfni:0
>     /sys/module/chacha_x86_64/parameters/using_x86_avx2:1
>     /sys/module/chacha_x86_64/parameters/using_x86_avx512:1
>     /sys/module/crc32c_intel/parameters/using_x86_pclmulqdq:1
>     /sys/module/curve25519_x86_64/parameters/using_x86_adx:1
>     /sys/module/libblake2s_x86_64/parameters/using_x86_avx512:1
>     /sys/module/libblake2s_x86_64/parameters/using_x86_ssse3:1
>     /sys/module/poly1305_x86_64/parameters/using_x86_avx:1
>     /sys/module/poly1305_x86_64/parameters/using_x86_avx2:1
>     /sys/module/poly1305_x86_64/parameters/using_x86_avx512:0
>     /sys/module/sha1_ssse3/parameters/using_x86_avx:1
>     /sys/module/sha1_ssse3/parameters/using_x86_avx2:1
>     /sys/module/sha1_ssse3/parameters/using_x86_shani:0
>     /sys/module/sha1_ssse3/parameters/using_x86_ssse3:1
>     /sys/module/sha256_ssse3/parameters/using_x86_avx:1
>     /sys/module/sha256_ssse3/parameters/using_x86_avx2:1
>     /sys/module/sha256_ssse3/parameters/using_x86_shani:0
>     /sys/module/sha256_ssse3/parameters/using_x86_ssse3:1
>     /sys/module/sha512_ssse3/parameters/using_x86_avx:1
>     /sys/module/sha512_ssse3/parameters/using_x86_avx2:1
>     /sys/module/sha512_ssse3/parameters/using_x86_ssse3:1

Isn't chacha missing?

However, what's the point of any of this? Who benefits from this info?
If something seems slow, I'll generally look at perf top, which provides
this same thing.

Also, "using" isn't quite correct. Some AVX2 machines will never use any
ssse3 instructions, despite the code being executable.

> 
> Delete the aesni_intel prints reporting those selections:
>     pr_info("AVX2 version of gcm_enc/dec engaged.\n");

This part I like.

> +module_param_named(using_x86_adx, curve25519_use_bmi2_adx.key.enabled.counter, int, 0444);
> +MODULE_PARM_DESC(using_x86_adx, "Using x86 instruction set extensions: ADX");

And BMI2, not just ADX.
Elliott, Robert (Servers) Nov. 28, 2022, 6:48 p.m. UTC | #3
> On Fri, Nov 25, 2022 at 09:59:17AM +0100, Ard Biesheuvel wrote:
> > Note that it is only used in shash implementations, given that they
> > are the only ones that may receive unbounded inputs.
> 
> Although typical usage probably doesn't stress this, the length of the
> additional associated data presented to aead implementations is
> unconstrained as well. At least in x86, they can end up processing
> multiple megabytes in one chunk like the hash functions (if the
> associated data is a big buffer described by a sg list created
> with sg_init_one()).
> 

Reviewing the two arm64 aead drivers, aes-ce-ccm-glue.c solves that by
including this in the do/while loop in ccm_calculate_auth_mac():
                n = min_t(u32, n, SZ_4K); /* yield NEON at least every 4k */

That was added by 36a916af641 ("crypto: arm64/aes-ccm - yield NEON
when processing auth-only data") in 2021, also relying on
41691c44606b ("crypto: arm64/aes-ccm - reduce NEON begin/end calls
for common case").

ghash-ce-glue.c seems to be missing that in its similar function named
gcm_calculate_auth_mac().
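
A fix along the same lines would clamp each pass before the NEON
section (a sketch of the pattern only, with hypothetical names; not
a tested patch against gcm_calculate_auth_mac()):

static void auth_mac_clamped(const u8 *src, unsigned int len)
{
        do {
                /* yield NEON at least every 4k, as aes-ce-ccm-glue.c does */
                unsigned int n = min_t(unsigned int, len, SZ_4K);

                kernel_neon_begin();
                /* ... hash n bytes of associated data ... */
                kernel_neon_end();

                src += n;
                len -= n;
        } while (len);
}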
Elliott, Robert (Servers) Dec. 2, 2022, 6:21 a.m. UTC | #4
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, November 25, 2022 2:41 AM
> To: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Elliott, Robert (Servers) <elliott@hpe.com>; davem@davemloft.net;
> tim.c.chen@linux.intel.com; ap420073@gmail.com; ardb@kernel.org;
> David.Laight@aculab.com; ebiggers@kernel.org; linux-
> crypto@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> 
> On Wed, Nov 16, 2022 at 12:13:51PM +0100, Jason A. Donenfeld wrote:
> > On Tue, Nov 15, 2022 at 10:13:28PM -0600, Robert Elliott wrote:
> > > +/* avoid kernel_fpu_begin/end scheduler/rcu stalls */
> > > +static const unsigned int bytes_per_fpu = 337 * 1024;
> > > +
> >
> > Use an enum for constants like this:
> >
> >     enum { BYTES_PER_FPU = ... };
> >
> > You can even make it function-local, so it's near the code that uses it,
> > which will better justify its existence.
> >
> > Also, where did you get this number? Seems kind of weird.
> 
> These numbers are highly dependent on hardware and I think having
> them hard-coded is wrong.
> 
> Perhaps we should try a different approach.  How about just limiting
> the size to 4K, and then depending on need_resched we break out of
> the loop? Something like:
> 
> 	if (!len)
> 		return 0;
> 
> 	kernel_fpu_begin();
> 	for (;;) {
> 		unsigned int chunk = min(len, 4096);
> 
> 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> 
> 		len -= chunk;
> 		data += chunk;
> 
> 		if (!len)
> 			break;
> 
> 		if (need_resched()) {
> 			kernel_fpu_end();
> 			cond_resched();
> 			kernel_fpu_begin();
> 		}
> 	}
> 	kernel_fpu_end();
> 

I implemented that conditional approach in the sha algorithms. 

The results of a boot (using sha512 for module signatures, with
crypto extra tests enabled, comparing to sha512 with a 20 KiB
fixed limit) are:

  sha1  cond:  14479 calls;   784256 cycles doing begin/end; longest FPU context     35828 cycles
sha256  cond:  26763 calls;  1273570 cycles doing begin/end; longest FPU context    118612 cycles
sha512  cond:  26957 calls;  1680046 cycles doing begin/end; longest FPU context 169140982 cycles
sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context   4049644 cycles

NOTE: I didn't have a patch in place to isolate the counts for each variation
(ssse3 vs. avx vs. avx2) and
- for sha512: sha512 vs. sha384
- for sha256: sha256 vs. sha224
so the numbers include sha256 and sha512 running twice as many tests
as sha1.

This approach looks very good:
- 16% of the number of begin/end calls
- 10% of the CPU cycles spent making the calls
- the FPU context is held for a long time (77 ms) but only while
  it's not needed.

That's much more efficient than releasing it every 30 us just in case.

I'll keep testing this to make sure RCU stalls stay away, and apply
the approach to the other algorithms.

In x86, need_resched() has to deal with a PER_CPU variable, so I'm
not sure it's worth the hassle to figure out how to do that from
assembly code.
Herbert Xu Dec. 2, 2022, 9:25 a.m. UTC | #5
On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
>
> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

Thanks for doing all the hard work!

BTW, just a minor nit but you can delete the cond_resched() call
because kernel_fpu_end()/preempt_enable() will do it anyway.

Cheers,
Elliott, Robert (Servers) Dec. 2, 2022, 4:15 p.m. UTC | #6
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Friday, December 2, 2022 3:25 AM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> 
> On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
...
> BTW, just a minor nit but you can delete the cond_resched() call
> because kernel_fpu_end()/preempt_enable() will do it anyway.

That happens under
	CONFIG_PREEMPTION=y
(from include/linux/preempt.h and arch/x86/include/asm/preempt.h)

Is calling cond_resched() still helpful if that is not the configuration?
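
For reference, the difference boils down to this (simplified from
include/linux/preempt.h, not verbatim):

#ifdef CONFIG_PREEMPTION
#define preempt_enable() \
do { \
        barrier(); \
        if (unlikely(preempt_count_dec_and_test())) \
                __preempt_schedule(); \
} while (0)
#else
#define preempt_enable() \
do { \
        barrier(); \
        preempt_count_dec(); \
} while (0)
#endif

Without CONFIG_PREEMPTION, kernel_fpu_end() never reschedules on its
own, so an explicit cond_resched() is the only voluntary yield point.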
Herbert Xu Dec. 6, 2022, 4:27 a.m. UTC | #7
On Fri, Dec 02, 2022 at 04:15:23PM +0000, Elliott, Robert (Servers) wrote:
> 
> 
> > -----Original Message-----
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Sent: Friday, December 2, 2022 3:25 AM
> > To: Elliott, Robert (Servers) <elliott@hpe.com>
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > 
> > On Fri, Dec 02, 2022 at 06:21:02AM +0000, Elliott, Robert (Servers) wrote:
> ...
> > BTW, just a minor nit but you can delete the cond_resched() call
> > because kernel_fpu_end()/preempt_enable() will do it anyway.
> 
> That happens under
> 	CONFIG_PREEMPTION=y
> (from include/linux/preempt.h and arch/x86/include/asm/preempt.h)
> 
> Is calling cond_resched() still helpful if that is not the configuration?

Perhaps, but then again perhaps if preemption is off, maybe we
shouldn't even bother with the 4K split.  Were the initial
warnings with or without preemption?

Personally I don't really care since I always use preemption.

The PREEMPT Kconfigs do provide a bit of nuance with the split
between PREEMPT_NONE vs. PREEMPT_VOLUNTARY.  But perhaps that is
just overkill for our situation.

I'll leave it to you to decide :)

Thanks,
Peter Lafreniere Dec. 6, 2022, 2:03 p.m. UTC | #8
> > > BTW, just a minor nit but you can delete the cond_resched() call
> > > because kernel_fpu_end()/preempt_enable() will do it anyway.
> > 
> > That happens under
> > CONFIG_PREEMPTION=y
> > (from include/linux/preempt.h and arch/x86/include/asm/preempt.h)
> > 
> > Is calling cond_resched() still helpful if that is not the configuration?
> 
> 
> Perhaps, but then again perhaps if preemption is off, maybe we
> shouldn't even bother with the 4K split. Were the initial
> warnings with or without preemption?
> 
> Personally I don't really care since I always use preemption.
> 
> The PREEMPT Kconfigs do provide a bit of nuance with the split
> between PREEMPT_NONE vs. PREEMPT_VOLUNTARY. But perhaps that is
> just overkill for our situation.

I was thinking about this a few days ago, and my 2¢ is that it's 
probably best to not preempt the kernel in the middle of a crypto 
operation under PREEMPT_VOLUNTARY. We're already not preempting during 
these operations, and there haven't been complaints of excessive latency 
because of these crypto operations.

If we skip the kernel_fpu_{begin,end} pair when not under 
CONFIG_PREEMPT, we'll save a significant cycle count that is wasted 
currently. See Robert Elliott's numbers on conditional begin/end in sha 
to see the benefits of not saving/restoring unnecessarily: "10% of the 
CPU cycles spent making the [kernel_fpu_{begin,end}] calls".

> I'll leave it to you to decide :)

One extra thought: commit 827ee47: "crypto: x86 - add some helper macros 
for ECB and CBC modes" makes a mention of fpu save/restore being done 
lazily. I don't know the details, so would that change this discussion?

Thanks for listening,

Peter Lafreniere <peter@n8pjl.ca>
David Laight Dec. 6, 2022, 2:44 p.m. UTC | #9
From: Peter Lafreniere
> Sent: 06 December 2022 14:04
>
> > > > BTW, just a minor nit but you can delete the cond_resched() call
> > > > because kernel_fpu_end()/preempt_enable() will do it anyway.
> > >
> > > That happens under
> > > CONFIG_PREEMPTION=y
> > > (from include/linux/preempt.h and arch/x86/include/asm/preempt.h)
> > >
> > > Is calling cond_resched() still helpful if that is not the configuration?
> >
> >
> > Perhaps, but then again perhaps if preemption is off, maybe we
> > shouldn't even bother with the 4K split. Were the initial
> > warnings with or without preemption?
> >
> > Personally I don't really care since I always use preemption.
> >
> > The PREEMPT Kconfigs do provide a bit of nuance with the split
> > between PREEMPT_NONE vs. PREEMPT_VOLUNTARY. But perhaps that is
> > just overkill for our situation.
> 
> I was thinking about this a few days ago, and my 2¢ is that it's
> probably best to not preempt the kernel in the middle of a crypto
> operation under PREEMPT_VOLUNTARY. We're already not preempting during
> these operations, and there haven't been complaints of excessive latency
> because of these crypto operations.
...

Probably because the people who have been suffering from (and
looking for) latency issues aren't running crypto tests.

I've found some terrible pre-emption latency issues trying
to get RT processes scheduled in a sensible timeframe.
I wouldn't worry about 100us - I'm doing audio processing
every 10ms, but anything much longer causes problems when
trying to use 90+% of the cpu time for lots of audio channels.

I didn't try a CONFIG_RT kernel, the application needs to run
on a standard 'distro' kernel. In any case I suspect all the
extra processes switches (etc) the RT kernel adds will completely
kill performance.

I wonder how much it would cost to measure the time spent with
pre-empt disabled (and not checked) and to trace long intervals.

	David

Peter Lafreniere Dec. 6, 2022, 11:06 p.m. UTC | #10
> > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > Perhaps we should try a different approach. How about just limiting
> > the size to 4K, and then depending on need_resched we break out of
> > the loop? Something like:
> > 
> > 	if (!len)
> > 		return 0;
> > 
> > 	kernel_fpu_begin();
> > 	for (;;) {
> > 		unsigned int chunk = min(len, 4096);
> > 
> > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> > 
> > 		len -= chunk;
> > 		data += chunk;
> > 
> > 		if (!len)
> > 			break;
> > 
> > 		if (need_resched()) {
> > 			kernel_fpu_end();
> > 			cond_resched();
> > 			kernel_fpu_begin();
> > 		}
> > 	}
> > 	kernel_fpu_end();
> 
> 
> I implemented that conditional approach in the sha algorithms.
> 
> The results of a boot (using sha512 for module signatures, with
> crypto extra tests enabled, comparing to sha512 with a 20 KiB
> fixed limit) are:
> 
> sha1 cond: 14479 calls; 784256 cycles doing begin/end; longest FPU context 35828 cycles
> sha256 cond: 26763 calls; 1273570 cycles doing begin/end; longest FPU context 118612 cycles
> sha512 cond: 26957 calls; 1680046 cycles doing begin/end; longest FPU context 169140982 cycles
> sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU context 4049644 cycles
> 
> NOTE: I didn't have a patch in place to isolate the counts for each variation
> (ssse3 vs. avx vs. avx2) and
> - for sha512: sha512 vs. sha384
> - for sha256: sha256 vs. sha224
> so the numbers include sha256 and sha512 running twice as many tests
> as sha1.
> 
> This approach looks very good:
> - 16% of the number of begin/end calls
> - 10% of the CPU cycles spent making the calls
> - the FPU context is held for a long time (77 ms) but only while
> it's not needed.
> 
> That's much more efficient than releasing it every 30 us just in case.

How recently did you make this change? I implemented this conditional 
approach for ecb_cbc_helpers.h, but saw no changes at all to performance 
on serpent-avx2 and twofish-avx.

kernel_fpu_{begin,end} (after the first call to begin) don't do anything 
more than enable/disable preemption and make a few writes to the mxcsr. 
It's likely that the above approach has the tiniest bit less overhead, 
and it will preempt on non CONFIG_PREEMPT kernels, but nothing suggests 
a performance uplift.

This brings us back to this question: should crypto routines be 
preempted under PREEMPT_VOLUNTARY or not?

> I'll keep testing this to make sure RCU stalls stay away, and apply
> the approach to the other algorithms.

I missed the earlier discussions. Have you seen issues with RCU 
stalls/latency spikes because of crypto routines? If so, what preemption 
model were you running?
 
> In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> not sure it's worth the hassle to figure out how to do that from
> assembly code.

Leave it in c. It'll be more maintainable that way.

Cheers,
Peter Lafreniere <peter@n8pjl.ca>
Elliott, Robert (Servers) Dec. 10, 2022, 12:34 a.m. UTC | #11
> -----Original Message-----
> From: Peter Lafreniere <peter@n8pjl.ca>
> Sent: Tuesday, December 6, 2022 5:06 PM
> To: Elliott, Robert (Servers) <elliott@hpe.com>
> Subject: RE: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> 
> > > Subject: Re: [PATCH v4 10/24] crypto: x86/poly - limit FPU preemption
> > > Perhaps we should try a different approach. How about just limiting
> > > the size to 4K, and then depending on need_resched we break out of
> > > the loop? Something like:
> > >
> > > 	if (!len)
> > > 		return 0;
> > > 
> > > 	kernel_fpu_begin();
> > > 	for (;;) {
> > > 		unsigned int chunk = min(len, 4096);
> > > 
> > > 		sha1_base_do_update(desc, data, chunk, sha1_xform);
> > > 
> > > 		len -= chunk;
> > > 		data += chunk;
> > > 
> > > 		if (!len)
> > > 			break;
> > > 
> > > 		if (need_resched()) {
> > > 			kernel_fpu_end();
> > > 			cond_resched();
> > > 			kernel_fpu_begin();
> > > 		}
> > > 	}
> > > 	kernel_fpu_end();
> >
> >
> > I implemented that conditional approach in the sha algorithms.
> >
> > The results of a boot (using sha512 for module signatures, with
> > crypto extra tests enabled, comparing to sha512 with a 20 KiB
> > fixed limit) are:
> >
> > sha1 cond: 14479 calls; 784256 cycles doing begin/end; longest FPU
> context 35828 cycles
> > sha256 cond: 26763 calls; 1273570 cycles doing begin/end; longest FPU
> context 118612 cycles
> > sha512 cond: 26957 calls; 1680046 cycles doing begin/end; longest FPU
> context 169140982 cycles
> > sha512 20KiB: 161011 calls; 16232280 cycles doing begin/end; longest FPU
> context 4049644 cycles
> >
> > NOTE: I didn't have a patch in place to isolate the counts for each
> variation
> > (ssse3 vs. avx vs. avx2) and
> > - for sha512: sha512 vs. sha384
> > - for sha256: sha256 vs. sha224
> > so the numbers include sha256 and sha512 running twice as many tests
> > as sha1.
> >
> > This approach looks very good:
> > - 16% of the number of begin/end calls
> > - 10% of the CPU cycles spent making the calls
> > - the FPU context is held for a long time (77 ms) but only while
> > it's not needed.
> >
> > That's much more efficient than releasing it every 30 us just in case.
> 
> How recently did you make this change? I implemented this conditional
> approach for ecb_cbc_helpers.h, but saw no changes at all to performance
> on serpent-avx2 and twofish-avx.

The hash functions are the main problem; the skciphers receive
requests already broken into 4 KiB chunks by the SG list helpers.
 
> kernel_fpu_{begin,end} (after the first call to begin) don't do anything
> more than enable/disable preemption and make a few writes to the mxcsr.
> It's likely that the above approach has the tiniest bit less overhead,
> and it will preempt on non CONFIG_PREEMPT kernels, but nothing suggests
> a performance uplift.
> 
> > I'll keep testing this to make sure RCU stalls stay away, and apply
> > the approach to the other algorithms.
> 
> I missed the earlier discussions. Have you seen issues with RCU
> stalls/latency spikes because of crypto routines? If so, what preemption
> model were you running?

While running Wireshark in Fedora, I noticed the top function consuming
CPU cycles (per "perf top") was sha512_generic.

Although Fedora and RHEL have the x86 optimized driver compiled as a
module, nothing in the distro or application space noticed it was there
and loaded it. The only x86 optimized drivers that do get used are the
ones built-in to the kernel.

After making changes to load the x86 sha512 module, I noticed several
boots over the next few weeks reported RCU stalls, all in the sha512_avx2
function. Because the stack traces take a long time to print to the
serial port, these can trigger soft lockups as well. Fedora and RHEL
default to "Voluntary Kernel Preemption (Desktop)": 
    # CONFIG_PREEMPT_NONE is not set
    CONFIG_PREEMPT_VOLUNTARY=y
    # CONFIG_PREEMPT is not set

The reason was that sha512 and all the other x86 crypto hash functions
process the entire data in one kernel_fpu_begin()/end() block, which
blocks preemption. Each boot checks module signatures for about 4000
files, totaling about 2.4 GB. Breaking the loops into smaller chunks
fixes the problem. However, since functions like crc32c are 20x faster
than sha1, one value like 4 KiB is not ideal.

A few non-hash functions have issues too. Although most skciphers are
broken up into 4 KiB chunks by the sg list walking functions, aegis
packs everything inside one kernel_fpu_begin()/end() block. All the
aead functions handle the main data with sg list walking functions,
but handle all the associated data inside one kernel_fpu_begin()/end()
block.

> > In x86, need_resched() has to deal with a PER_CPU variable, so I'm
> > not sure it's worth the hassle to figure out how to do that from
> > assembly code.
> 
> Leave it in c. It'll be more maintainable that way.

I'm testing a new kernel_fpu_yield() utility function that looks nice:

void __sha1_transform_avx2(struct sha1_state *state, const u8 *data, int blocks)
{
        if (blocks <= 0)
                return;

        kernel_fpu_begin();
        for (;;) {
                const int chunks = min(blocks, 4096 / SHA1_BLOCK_SIZE);

                sha1_transform_avx2(state->state, data, chunks);
                blocks -= chunks;

                if (blocks <= 0)
                        break;

                data += chunks * SHA1_BLOCK_SIZE;
                kernel_fpu_yield();
        }
        kernel_fpu_end();
}

This construction also makes it easy to add debug counters to
observe what is happening.
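
The helper itself is just the conditional drop-and-reacquire pattern
from earlier in this thread (a sketch; the actual patch may differ):

static inline void kernel_fpu_yield(void)
{
        if (need_resched()) {
                kernel_fpu_end();
                cond_resched(); /* redundant under CONFIG_PREEMPTION */
                kernel_fpu_begin();
        }
}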

In a boot with preempt=none and the crypto extra self-tests
enabled, two modules benefitted from that new yield call:
/sys/module/sha256_ssse3/parameters/fpu_rescheds:3
/sys/module/sha512_ssse3/parameters/fpu_rescheds:515

10 passes of 1 MiB buffer tests on all the drivers
shows several others benefitting:
/sys/module/aegis128_aesni/parameters/fpu_rescheds:1
/sys/module/aesni_intel/parameters/fpu_rescheds:0
/sys/module/aria_aesni_avx_x86_64/parameters/fpu_rescheds:45
/sys/module/camellia_aesni_avx2/parameters/fpu_rescheds:0
/sys/module/camellia_aesni_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/camellia_x86_64/parameters/fpu_rescheds:0
/sys/module/cast5_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/cast6_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/chacha_x86_64/parameters/fpu_rescheds:0
/sys/module/crc32c_intel/parameters/fpu_rescheds:1
/sys/module/crc32_pclmul/parameters/fpu_rescheds:1
/sys/module/crct10dif_pclmul/parameters/fpu_rescheds:1
/sys/module/ghash_clmulni_intel/parameters/fpu_rescheds:1
/sys/module/libblake2s_x86_64/parameters/fpu_rescheds:0
/sys/module/nhpoly1305_avx2/parameters/fpu_rescheds:1
/sys/module/nhpoly1305_sse2/parameters/fpu_rescheds:1
/sys/module/poly1305_x86_64/parameters/fpu_rescheds:1
/sys/module/polyval_clmulni/parameters/fpu_rescheds:1
/sys/module/serpent_avx2/parameters/fpu_rescheds:0
/sys/module/serpent_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/serpent_sse2_x86_64/parameters/fpu_rescheds:0
/sys/module/sha1_ssse3/parameters/fpu_rescheds:3
/sys/module/sha256_ssse3/parameters/fpu_rescheds:9
/sys/module/sha512_ssse3/parameters/fpu_rescheds:723
/sys/module/sm3_avx_x86_64/parameters/fpu_rescheds:171
/sys/module/sm4_aesni_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/twofish_avx_x86_64/parameters/fpu_rescheds:0
/sys/module/twofish_x86_64_3way/parameters/fpu_rescheds:0


I'll keep experimenting with all the preempt modes, heavier
workloads, and shorter RCU timeouts to confirm this solution
is robust. It might even be appropriate for the generic
drivers, if they suffer from the problems that sm4 shows here.

> This brings us back to this question: should crypto routines be
> preempted under PREEMPT_VOLUNTARY or not?

I think so. The RCU stall and soft lockup detectors aren't disabled,
so there is still an expectation of sharing the CPUs even in
PREEMPT=none mode.

1 MiB tests under CONFIG_PREEMPT=none triggered soft lockups while
running CBC mode for SM4, Camellia, and Serpent:

[  208.975253] tcrypt: PERL "cfb-sm4-aesni-avx2" => 22499840,
[  218.187217] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [modprobe:3433]
...
[  219.391776] tcrypt: PERL "cbc-sm4-aesni-avx2" => 22528138,

[  244.471181] tcrypt: PERL "ecb-sm4-aesni-avx" => 4469626,
[  246.181064] watchdog: BUG: soft lockup - CPU#1 stuck for 52s! [modprobe:3433]
...
[  250.168239] tcrypt: PERL "cbc-camellia-aesni-avx2" => 12202738,


[  264.047440] tcrypt: PERL "cbc-cast5-avx" => 17744280,
[  273.091258] tcrypt: PERL "cbc-cast6-avx" => 19375400,
[  274.183249] watchdog: BUG: soft lockup - CPU#1 stuck for 78s! [modprobe:3433]
...
[  283.066260] tcrypt: PERL "cbc-serpent-avx2" => 21454930,

SM4 falls back to the generic driver for encryption; it only has
optimized decryption functions. Therefore, it doesn't make any
kernel_fpu_end() calls and thus makes no rescheduling calls.

This shows the CPU cycles for 1 MiB of encrypt and decrypt for
each algorithm (no soft lockups this time). SM4, Serpent, Cast5,
and Cast6 encryption in CBC mode are the slowest by far.

[ 2233.362748] tcrypt: PERL my %speeds_skcipher = (
[ 2234.427387] tcrypt: PERL            "cbc-aes-aesni" =>  2178586,
[ 2234.738823] tcrypt: PERL            "cbc-aes-aesni" =>   538752,
[ 2235.064335] tcrypt: PERL            "ctr-aes-aesni" =>   574026,
[ 2235.389427] tcrypt: PERL            "ctr-aes-aesni" =>   574060,
[ 2236.451594] tcrypt: PERL        "cts-cbc-aes-aesni" =>  2178946,
[ 2236.762174] tcrypt: PERL        "cts-cbc-aes-aesni" =>   540066,
[ 2237.070371] tcrypt: PERL            "ecb-aes-aesni" =>   536970,
[ 2237.379549] tcrypt: PERL            "ecb-aes-aesni" =>   538012,
[ 2237.686137] tcrypt: PERL           "xctr-aes-aesni" =>   534690,
[ 2237.993315] tcrypt: PERL           "xctr-aes-aesni" =>   534632,
[ 2238.304077] tcrypt: PERL            "xts-aes-aesni" =>   542590,
[ 2238.615057] tcrypt: PERL            "xts-aes-aesni" =>   541296,
[ 2240.233298] tcrypt: PERL             "ctr-aria-avx" =>  3393212,
[ 2241.849000] tcrypt: PERL             "ctr-aria-avx" =>  3391982,
[ 2242.081296] tcrypt: PERL           "xchacha12-simd" =>   370794,
[ 2242.316868] tcrypt: PERL           "xchacha12-simd" =>   373788,
[ 2242.626165] tcrypt: PERL           "xchacha20-simd" =>   536310,
[ 2242.936646] tcrypt: PERL           "xchacha20-simd" =>   537094,
[ 2243.250356] tcrypt: PERL            "chacha20-simd" =>   540542,
[ 2243.559396] tcrypt: PERL            "chacha20-simd" =>   536604,
[ 2244.831594] tcrypt: PERL       "ctr-sm4-aesni-avx2" =>  2642674,
[ 2246.106143] tcrypt: PERL       "ctr-sm4-aesni-avx2" =>  2640350,
[ 2256.475661] tcrypt: PERL       "cfb-sm4-aesni-avx2" => 22496346,
[ 2257.732511] tcrypt: PERL       "cfb-sm4-aesni-avx2" =>  2604932,
[ 2268.123821] tcrypt: PERL       "cbc-sm4-aesni-avx2" => 22528268,
[ 2269.378028] tcrypt: PERL       "cbc-sm4-aesni-avx2" =>  2601090,
[ 2271.533556] tcrypt: PERL        "ctr-sm4-aesni-avx" =>  4559648,
[ 2273.688772] tcrypt: PERL        "ctr-sm4-aesni-avx" =>  4561300,
[ 2284.073187] tcrypt: PERL        "cfb-sm4-aesni-avx" => 22499496,
[ 2286.177732] tcrypt: PERL        "cfb-sm4-aesni-avx" =>  4457588,
[ 2296.569751] tcrypt: PERL        "cbc-sm4-aesni-avx" => 22529182,
[ 2298.677312] tcrypt: PERL        "cbc-sm4-aesni-avx" =>  4457226,
[ 2300.789931] tcrypt: PERL        "ecb-sm4-aesni-avx" =>  4464282,
[ 2302.899974] tcrypt: PERL        "ecb-sm4-aesni-avx" =>  4466052,
[ 2308.589365] tcrypt: PERL  "cbc-camellia-aesni-avx2" => 12260426,
[ 2309.737064] tcrypt: PERL  "cbc-camellia-aesni-avx2" =>  2350988,
[ 2315.433319] tcrypt: PERL       "cbc-camellia-aesni" => 12248986,
[ 2317.262589] tcrypt: PERL       "cbc-camellia-aesni" =>  3814202,
[ 2325.460542] tcrypt: PERL            "cbc-cast5-avx" => 17739828,
[ 2327.856127] tcrypt: PERL            "cbc-cast5-avx" =>  5061992,
[ 2336.668992] tcrypt: PERL            "cbc-cast6-avx" => 19066440,
[ 2340.470787] tcrypt: PERL            "cbc-cast6-avx" =>  8147336,
[ 2350.376676] tcrypt: PERL         "cbc-serpent-avx2" => 21466002,
[ 2351.646295] tcrypt: PERL         "cbc-serpent-avx2" =>  2611362,
[ 2361.562736] tcrypt: PERL          "cbc-serpent-avx" => 21471118,
[ 2364.019693] tcrypt: PERL          "cbc-serpent-avx" =>  5201506,
[ 2373.930747] tcrypt: PERL         "cbc-serpent-sse2" => 21465594,
[ 2376.697210] tcrypt: PERL         "cbc-serpent-sse2" =>  5855766,
[ 2380.944596] tcrypt: PERL          "cbc-twofish-avx" =>  9058090,
[ 2383.308215] tcrypt: PERL          "cbc-twofish-avx" =>  4989064,
[ 2384.904158] tcrypt: PERL             "ecb-aria-avx" =>  3299260,
[ 2386.498365] tcrypt: PERL             "ecb-aria-avx" =>  3297534,
[ 2387.625226] tcrypt: PERL  "ecb-camellia-aesni-avx2" =>  2306326,
[ 2388.757749] tcrypt: PERL  "ecb-camellia-aesni-avx2" =>  2312876,
[ 2390.549340] tcrypt: PERL       "ecb-camellia-aesni" =>  3752534,
[ 2392.335240] tcrypt: PERL       "ecb-camellia-aesni" =>  3751896,
[ 2394.724956] tcrypt: PERL            "ecb-cast5-avx" =>  5032914,
[ 2397.116268] tcrypt: PERL            "ecb-cast5-avx" =>  5041908,
[ 2400.935093] tcrypt: PERL            "ecb-cast6-avx" =>  8148418,
[ 2404.754816] tcrypt: PERL            "ecb-cast6-avx" =>  8150448,
[ 2406.025861] tcrypt: PERL         "ecb-serpent-avx2" =>  2613024,
[ 2407.286682] tcrypt: PERL         "ecb-serpent-avx2" =>  2602556,
[ 2409.732474] tcrypt: PERL          "ecb-serpent-avx" =>  5191944,
[ 2412.161829] tcrypt: PERL          "ecb-serpent-avx" =>  5165230,
[ 2414.678835] tcrypt: PERL         "ecb-serpent-sse2" =>  5345630,
[ 2417.217632] tcrypt: PERL         "ecb-serpent-sse2" =>  5331110,
[ 2419.545136] tcrypt: PERL          "ecb-twofish-avx" =>  4917424,
[ 2421.870457] tcrypt: PERL          "ecb-twofish-avx" =>  4915194,
[ 2421.870564] tcrypt: PERL );
Elliott, Robert (Servers) Dec. 16, 2022, 10:12 p.m. UTC | #12
> I'll keep experimenting with all the preempt modes, heavier
> workloads, and shorter RCU timeouts to confirm this solution
> is robust. It might even be appropriate for the generic
> drivers, if they suffer from the problems that sm4 shows here.

I have a set of patches that's looking promising. It's no longer
generating RCU stall warnings or soft lockups with either x86
drivers or generic drivers (sm4 is particularly taxing).

Test case: 
* added 28 clones of the tcrypt module so modprobe can run it
  many times in parallel (1 thread per CPU core)
* added 1 MiB big buffer functional tests (compare to
  generic results)
* added 1 MiB big buffer speed tests
* 3 windows running
  * 28 threads running
    * modprobe with each defined test mode in order 1, 2, 3, etc.
* RCU stall timeouts set to shortest supported values
* run in preempt=none, preempt=voluntary, preempt=full modes

Patches include:
* Ard's kmap_local() patch
* Suppress RCU stall warnings during speed tests. Change the
  rcu_sysrq_start()/end() functions to be general purpose and
  call them from tcrypt test functions that measure time of
  a crypto operation
* add crypto_yield() unilaterally in skcipher_walk_done so
  it is run even if data is aligned
* add crypto_yield() in aead_encrypt/decrypt so they always
  call it like skcipher
* add crypto_yield() at the end of each hash update(), digest(),
  and finup() function so they always call it like skcipher does
  (see the sketch below)
* add kernel_fpu_yield() calls every 4 KiB inside x86
  kernel_fpu_begin()/end() blocks, so the x86 functions always
  yield to the scheduler even when they're bypassing those
  helper functions (that now call crypto_yield() more
  consistently)
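
A sketch of the hash-side crypto_yield() placement (hypothetical
driver names; crypto_yield() from <crypto/algapi.h> only calls
cond_resched() when CRYPTO_TFM_REQ_MAY_SLEEP is set):

static int example_shash_update(struct shash_desc *desc, const u8 *data,
                                unsigned int len)
{
        int ret = example_do_update(desc, data, len); /* hypothetical */

        crypto_yield(crypto_shash_get_flags(desc->tfm) &
                     CRYPTO_TFM_REQ_MAY_SLEEP);
        return ret;
}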

I'll keep trying to break it over the weekend. If it holds
up I'll post the patches next week.