[0/6] x86: new optimized CRC functions, with VPCLMULQDQ support

Message ID 20241125041129.192999-1-ebiggers@kernel.org

Message

Eric Biggers Nov. 25, 2024, 4:11 a.m. UTC
This patchset is also available in git via:

    git fetch https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git crc-x86-v1

This patchset applies on top of my other recent CRC patchsets
https://lore.kernel.org/r/20241103223154.136127-1-ebiggers@kernel.org/ and
https://lore.kernel.org/r/20241117002244.105200-1-ebiggers@kernel.org/ .
Consider it a preview of what may be coming next, as my priority is
getting those two other patchsets merged first.

This patchset adds a new assembly macro that expands into the body of a
CRC function for x86 for the specified number of bits, bit order, vector
length, and AVX level.  There's also a new script that generates the
constants needed by this function, given a CRC generator polynomial.
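
(For illustration, here is a minimal sketch of the kind of computation
such a constants script performs: a [V]PCLMULQDQ folding loop multiplies
chunks of the message by constants of the form x^k mod G(x), and the
final reduction typically also needs a Barrett constant.  This is not
the actual gen-crc-consts.py; the fold distances below are placeholder
assumptions, and only the CRC-16 example polynomial 0x8bb7 is real.)

    # Hypothetical sketch, not the actual gen-crc-consts.py.
    def xpow_mod(k, poly, nbits):
        """Return x^k mod G(x) over GF(2).  'poly' holds the low 'nbits'
        coefficients of G(x); the leading x^nbits term is implicit."""
        r = 1                                  # start from x^0
        for _ in range(k):
            carry = r >> (nbits - 1)           # coefficient of x^(nbits-1)
            r = (r << 1) & ((1 << nbits) - 1)  # multiply by x
            if carry:
                r ^= poly                      # reduce modulo G(x)
        return r

    # Example: CRC-16 (T10 DIF), MSB-first, generator polynomial 0x8bb7.
    # The fold distances (4, 2, and 1 128-bit vectors) are illustrative.
    for nvecs in (4, 2, 1):
        k = nvecs * 128
        print("x^%d mod G(x) = 0x%04x" % (k, xpow_mod(k, 0x8bb7, 16)))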

This approach makes it easy to wire up an x86-optimized implementation of
any variant of CRC-8, CRC-16, CRC-32, or CRC-64, including full support
for VPCLMULQDQ.  On long messages the resulting functions are up to 4x
faster than the existing PCLMULQDQ-optimized functions where they exist,
or up to 29x faster than the existing table-based functions.
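
(As a point of reference for what "any variant" means: a CRC variant is
determined by its width, its generator polynomial, and its bit order,
i.e. reflected/LSB-first vs. non-reflected/MSB-first.  The generic
bit-serial checker below is a hypothetical helper, not code from this
patchset, but it is enough to cross-check any such variant against an
optimized implementation.)

    # Generic bit-serial reference CRC, for cross-checking only.
    # Hypothetical helper, not part of this patchset.
    def crc_ref(data, nbits, poly, init, lsb_first):
        mask = (1 << nbits) - 1
        crc = init & mask
        for byte in data:
            if lsb_first:
                # Reflected (LSB-first) convention, e.g. crc32_le;
                # 'poly' is the bit-reversed generator polynomial.
                crc ^= byte
                for _ in range(8):
                    crc = (crc >> 1) ^ (poly if crc & 1 else 0)
            else:
                # Non-reflected (MSB-first) convention, e.g. crc32_be
                # and crc_t10dif.
                crc ^= byte << (nbits - 8)
                for _ in range(8):
                    msb = crc & (1 << (nbits - 1))
                    crc = ((crc << 1) ^ (poly if msb else 0)) & mask
        return crc

    # Standard "123456789" check values:
    # CRC-32 (reflected, poly 0xedb88320, init/xorout 0xffffffff)
    assert (crc_ref(b"123456789", 32, 0xedb88320, 0xffffffff, True)
            ^ 0xffffffff) == 0xcbf43926
    # CRC-16/T10-DIF (non-reflected, poly 0x8bb7, init 0)
    assert crc_ref(b"123456789", 16, 0x8bb7, 0, False) == 0xd0db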

This patchset starts by wiring up the new macro for crc32_le,
crc_t10dif, and crc32_be.  Later I'd also like to wire up crc64_be and
crc64_rocksoft, once the design of the library functions for those has
been fixed to be like what I'm doing for crc32* and crc_t10dif.

A similar approach of sharing code between CRC variants, and between
vector lengths where applicable, should work for other architectures.
The CRC constant generation script should be mostly reusable.

Eric Biggers (6):
  x86: move zmm exclusion list into CPU feature flag
  scripts/crc: add gen-crc-consts.py
  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
  x86/crc32: implement crc32_le using new template
  x86/crc-t10dif: implement crc_t10dif using new template
  x86/crc32: implement crc32_be using new template

 arch/x86/Kconfig                        |   2 +-
 arch/x86/crypto/aesni-intel_glue.c      |  22 +-
 arch/x86/include/asm/cpufeatures.h      |   1 +
 arch/x86/kernel/cpu/intel.c             |  22 +
 arch/x86/lib/Makefile                   |   2 +-
 arch/x86/lib/crc-pclmul-consts.h        | 148 ++++++
 arch/x86/lib/crc-pclmul-template-glue.h |  84 ++++
 arch/x86/lib/crc-pclmul-template.S      | 588 ++++++++++++++++++++++++
 arch/x86/lib/crc-t10dif-glue.c          |  22 +-
 arch/x86/lib/crc16-msb-pclmul.S         |   6 +
 arch/x86/lib/crc32-glue.c               |  38 +-
 arch/x86/lib/crc32-pclmul.S             | 220 +--------
 arch/x86/lib/crct10dif-pcl-asm_64.S     | 332 -------------
 scripts/crc/gen-crc-consts.py           | 207 +++++++++
 14 files changed, 1087 insertions(+), 607 deletions(-)
 create mode 100644 arch/x86/lib/crc-pclmul-consts.h
 create mode 100644 arch/x86/lib/crc-pclmul-template-glue.h
 create mode 100644 arch/x86/lib/crc-pclmul-template.S
 create mode 100644 arch/x86/lib/crc16-msb-pclmul.S
 delete mode 100644 arch/x86/lib/crct10dif-pcl-asm_64.S
 create mode 100755 scripts/crc/gen-crc-consts.py

Comments

Eric Biggers Nov. 25, 2024, 6:08 p.m. UTC | #1
On Mon, Nov 25, 2024 at 09:33:46AM +0100, Ingo Molnar wrote:
> 
> * Eric Biggers <ebiggers@kernel.org> wrote:
> 
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > Lift zmm_exclusion_list in aesni-intel_glue.c into the x86 CPU setup
> > code, and add a new x86 CPU feature flag X86_FEATURE_PREFER_YMM that is
> > set when the CPU is on this list.
> > 
> > This allows other code in arch/x86/, such as the CRC library code, to
> > apply the same exclusion list when deciding whether to execute 256-bit
> > or 512-bit optimized functions.
> > 
> > Note that full AVX512 support including zmm registers is still exposed
> > to userspace and is still supported for in-kernel use.  This flag just
> > indicates whether in-kernel code should prefer to use ymm registers.
> > 
> > Signed-off-by: Eric Biggers <ebiggers@google.com>
> > ---
> >  arch/x86/crypto/aesni-intel_glue.c | 22 +---------------------
> >  arch/x86/include/asm/cpufeatures.h |  1 +
> >  arch/x86/kernel/cpu/intel.c        | 22 ++++++++++++++++++++++
> >  3 files changed, 24 insertions(+), 21 deletions(-)
> 
> Acked-by: Ingo Molnar <mingo@kernel.org>
> 
> I suppose you'd like to carry this in the crypto tree?

I am planning to carry CRC-related patches myself
(https://lore.kernel.org/lkml/20241117002244.105200-12-ebiggers@kernel.org/).

> 
> > +/*
> > + * This is a list of Intel CPUs that are known to suffer from downclocking when
> > + * zmm registers (512-bit vectors) are used.  On these CPUs, when the kernel
> > + * executes SIMD-optimized code such as cryptography functions or CRCs, it
> > + * should prefer 256-bit (ymm) code to 512-bit (zmm) code.
> > + */
> 
> One speling nit, could you please do:
> 
>   s/ymm/YMM
>   s/zmm/ZMM
> 
> ... to make it consistent with how the rest of the x86 code is 
> capitalizing the names of FPU vector register classes. Just like
> we are capitalizing CPU and CRC properly ;-)
> 

Will do, thanks.

- Eric
Ard Biesheuvel Nov. 29, 2024, 4:16 p.m. UTC | #2
On Mon, 25 Nov 2024 at 05:12, Eric Biggers <ebiggers@kernel.org> wrote:
> [...]
> This patchset starts by wiring up the new macro for crc32_le,
> crc_t10dif, and crc32_be.  Later I'd also like to wire up crc64_be and
> crc64_rocksoft, once the design of the library functions for those has
> been fixed to be like what I'm doing for crc32* and crc_t10dif.
> [...]

Good stuff!

Acked-by: Ard Biesheuvel <ardb@kernel.org>

Would indeed be nice to get CRC-64 implemented this way as well, so we
can use it on both x86 and arm64.
Eric Biggers Nov. 29, 2024, 5:50 p.m. UTC | #3
On Fri, Nov 29, 2024 at 05:16:42PM +0100, Ard Biesheuvel wrote:
> On Mon, 25 Nov 2024 at 05:12, Eric Biggers <ebiggers@kernel.org> wrote:
> > [...]
> 
> Good stuff!
> 
> Acked-by: Ard Biesheuvel <ardb@kernel.org>
> 
> Would indeed be nice to get CRC-64 implemented this way as well, so we
> can use it on both x86 and arm64.

Thanks!  The template actually supports CRC-64 already (both LSB-first and
MSB-first variants) and I've tested it in userspace.  I just haven't wired it
up to the kernel's CRC-64 functions yet.
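
(For reference, both CRC-64 bit orders can be exercised in userspace
with the generic crc_ref() checker sketched earlier in this thread.
The polynomial below is the ECMA-182 generator and its bit-reversed
form; the init/xorout conventions shown are illustrative assumptions,
not necessarily those of the kernel's crc64_be/crc64_rocksoft.)

    M = (1 << 64) - 1
    msg = b"123456789"
    # MSB-first (non-reflected) CRC-64 with the ECMA-182 polynomial.
    crc64_msb = crc_ref(msg, 64, 0x42F0E1EBA9EA3693, 0, lsb_first=False)
    # LSB-first (reflected) CRC-64 with the bit-reversed polynomial.
    crc64_lsb = crc_ref(msg, 64, 0xC96C5795D7870F42, M, lsb_first=True) ^ M
    print(hex(crc64_msb), hex(crc64_lsb))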

- Eric