[RFC,V1,0/7] Introduce AVX512 optimized crypto algorithms

Message ID 1608325864-4033-1-git-send-email-megha.dey@intel.com

Message

Megha Dey Dec. 18, 2020, 9:10 p.m. UTC
Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
(first implemented on Intel's Icelake client and Xeon CPUs).

These algorithms take advantage of the AVX512 registers to keep the CPU
busy and increase memory bandwidth utilization. They provide substantial
(2-10x) improvements over existing crypto algorithms when update data size
is greater than 128 bytes and do not have any significant impact when used
on small amounts of data.

However, these algorithms may also incur a frequency penalty and cause
collateral damage to other workloads running on the same core (co-scheduled
threads). These frequency drops are also known as bin drops, where 1 bin
drop is around 100MHz. With the SpecCPU and ffmpeg benchmarks, a 0-1 bin
drop (0-100MHz) is observed on the Icelake desktop and 0-2 bin drops
(0-200MHz) are observed on the Icelake server.

The AVX512 optimizations are disabled by default to avoid impact on other
workloads. In order to use these optimized algorithms:
1. At compile time:
   a. User must enable the CONFIG_CRYPTO_AVX512 option
   b. Toolchain (assembler) must support VAES or VPCLMULQDQ instructions
2. At run time:
   a. User must set the module parameter use_avx512 at boot time
   (<module_name>.use_avx512=1) or post boot using sysfs
   (echo 1 > /sys/module/<module_name>/parameters/use_avx512)
   (except for aesni ctr and gcm, which require boot time initialization
   because of commit 0fbafd06bdde ("crypto: aesni - fix failing setkey for
   rfc4106-gcm-aesni"))
   b. Platform must support the VPCLMULQDQ or VAES features

N.B. It is unclear whether these coarse-grained controls (global module
parameter) would meet all user needs. Perhaps some per-thread control might
be useful? Looking for guidance here.
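
To make the gating concrete, here is a minimal sketch (not the patch code
itself) of how such a use_avx512 parameter combines the compile-time option,
the runtime CPU feature check and the sysfs-writable knob. The routine names
crc_t10dif_pcl()/crc_t10dif_avx512() are stand-ins for the asm back ends
(the second one is hypothetical):

  #include <linux/linkage.h>
  #include <linux/module.h>
  #include <linux/types.h>
  #include <asm/cpufeature.h>

  /* Stand-ins for the existing PCLMULQDQ asm routine and the new
   * VPCLMULQDQ one; the real names in the series may differ.
   */
  asmlinkage u16 crc_t10dif_pcl(u16 init, const u8 *buf, size_t len);
  asmlinkage u16 crc_t10dif_avx512(u16 init, const u8 *buf, size_t len);

  /* 0644 makes the knob writable after boot via
   * /sys/module/<module_name>/parameters/use_avx512
   */
  static bool use_avx512;
  module_param(use_avx512, bool, 0644);
  MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");

  static u16 crc_t10dif_update(u16 crc, const u8 *buf, size_t len)
  {
  	/* The knob is checked on every call so a sysfs write takes effect
  	 * immediately; the CPU feature bit never changes at runtime.
  	 */
  	if (IS_ENABLED(CONFIG_CRYPTO_AVX512) && use_avx512 &&
  	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
  		return crc_t10dif_avx512(crc, buf, len);
  	return crc_t10dif_pcl(crc, buf, len);
  }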
 
Other implementations of these crypto algorithms are possible, which would
result in lower crypto performance but would not cause collateral damage
from frequency drops (AVX512L vs AVX512VL).

The following crypto algorithms are optimized using AVX512 registers:

1. "by16" implementation of T10 Data Integrity Field CRC16 (CRC T10 DIF)
   The "by16" means the main loop processes 256 bytes (16 * 16 bytes) at
   a time in CRC T10 DIF calculation. This algorithm is optimized using
   the VPCLMULQDQ instruction which is the encoded 512 bit version of
   PCLMULQDQ instruction. On an Icelake desktop, with constant frequency
   set, the "by16" CRC T10 DIF AVX512 optimization shows about 1.5X
   improvement when the bytes per update size is 1KB or above as measured
   by the tcrypt module.

2. GHASH computations with vectorized instructions.
   The VPCLMULQDQ instruction is used to accelerate the most time-consuming
   part of GHASH, carry-less multiplication. For best parallelism and
   deeper out-of-order execution, the main loop of the code works on 16 x 16
   byte blocks at a time and performs a reduction every 48 x 16 byte blocks.
   Optimized GHASH computations show a 4x to 10x speedup when the bytes
   per update size is 256B or above (see the identity sketched after this
   list).

3. "by16" implementation of the AES CTR mode using VAES instructions
   "by16" means that 16 independent blocks (each 128 bits) can be ciphered
   simultaneously. On an Icelake desktop, with constant frequency set, the
   "by16" AES CTR mode shows about 2X improvement when the bytes per update
   size is 256B or above as measured by the tcrypt module.

4. AES GCM using VPCLMULQDQ instructions
   Using AVX512 registers, an average increase of 2X is observed when the
   bytes per update size is 256B or above as measured by the tcrypt module.
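
For item 2 above, the 48-subkey batching rests on the standard GHASH
identity (a sketch of the math, not text from the patch): unrolling the
per-block fold over 48 blocks replaces a serial dependency chain with 48
independent carry-less multiplies by precomputed powers of H, followed by
a single modular reduction:

  % GHASH folds each 16-byte block X_i into the digest Y in GF(2^{128}):
  %   Y_i = (Y_{i-1} \oplus X_i) \cdot H
  % Unrolled over 48 blocks, the expensive reduction runs once per 48:
  Y_{48} = (Y_0 \oplus X_1)\,H^{48} \oplus X_2\,H^{47} \oplus \cdots
           \oplus X_{47}\,H^{2} \oplus X_{48}\,H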
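
And for item 3, a rough C model of the "by16" CTR main loop (the real
implementation is hand-written assembly; aes_encrypt_16_blocks() is a
hypothetical helper standing in for the VAES path that ciphers 16 blocks
across four 512-bit registers):

  #include <crypto/aes.h>		/* AES_BLOCK_SIZE */
  #include <crypto/algapi.h>	/* crypto_inc() */
  #include <linux/string.h>
  #include <linux/types.h>

  /* Hypothetical: encrypt 16 independent AES blocks in one shot (VAES) */
  void aes_encrypt_16_blocks(const void *key, u8 blocks[16][AES_BLOCK_SIZE]);

  static void ctr_crypt_by16(const void *key, u8 *dst, const u8 *src,
  			   size_t len, u8 ctr[AES_BLOCK_SIZE])
  {
  	u8 ks[16][AES_BLOCK_SIZE];
  	size_t i, j;

  	while (len >= 16 * AES_BLOCK_SIZE) {
  		/* 16 independent counter blocks: no dependency chain */
  		for (i = 0; i < 16; i++) {
  			memcpy(ks[i], ctr, AES_BLOCK_SIZE);
  			crypto_inc(ctr, AES_BLOCK_SIZE);
  		}
  		aes_encrypt_16_blocks(key, ks);
  		for (i = 0; i < 16; i++)
  			for (j = 0; j < AES_BLOCK_SIZE; j++)
  				*dst++ = *src++ ^ ks[i][j];
  		len -= 16 * AES_BLOCK_SIZE;
  	}
  	/* any tail (< 16 blocks) falls back to a narrower path */
  }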

Patch 1 checks for assembler support for VPCLMULQDQ instruction
Patch 2 introduces CRC T10 DIF calculation with VPCLMULQDQ instructions
Patch 3 introduces optimized GHASH computation with VPCLMULQDQ instructions
Patch 4 adds new speed test for optimized GHASH computations
Patch 5 introduces "by16" version of AES CTR mode using VAES instructions
Patch 6 fixes coding style in existing if else block
Patch 7 introduces the AES GCM mode using VPCLMULQDQ instructions

There is a complex sign-off chain in patches 2 and 3. The original
implementation (non-kernel) was done by Intel's IPsec team. Kyung Min Park
is the author of Patch 2 and co-author of Patch 3 along with me.

Also, most of this code is related to the crypto subsystem. The x86 mailing
list is copied here because of Patch 1.

Cc: x86@kernel.org
Reviewed-by: Tony Luck <tony.luck@intel.com>

Kyung Min Park (3):
  crypto: crct10dif - Accelerated CRC T10 DIF with vectorized
    instruction
  crypto: ghash - Optimized GHASH computations
  crypto: tcrypt - Add speed test for optimized GHASH computations

Megha Dey (4):
  x86: Probe assembler capabilities for VAES and VPCLMULQDQ support
  crypto: aesni - AES CTR x86_64 "by16" AVX512 optimization
  crypto: aesni - fix coding style for if/else block
  crypto: aesni - AVX512 version of AESNI-GCM using VPCLMULQDQ

 arch/x86/Kconfig.assembler                   |   10 +
 arch/x86/crypto/Makefile                     |    4 +
 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S  |  856 ++++++++++++
 arch/x86/crypto/aesni-intel_avx512-x86_64.S  | 1788 ++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c           |  122 +-
 arch/x86/crypto/avx512_vaes_common.S         | 1633 +++++++++++++++++++++++
 arch/x86/crypto/crct10dif-avx512-asm_64.S    |  482 +++++++
 arch/x86/crypto/crct10dif-pclmul_glue.c      |   24 +-
 arch/x86/crypto/ghash-clmulni-intel_avx512.S |   68 +
 arch/x86/crypto/ghash-clmulni-intel_glue.c   |   39 +-
 arch/x86/include/asm/disabled-features.h     |   14 +-
 crypto/Kconfig                               |   59 +
 crypto/tcrypt.c                              |    5 +
 13 files changed, 5091 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/crypto/aes_ctrby16_avx512-x86_64.S
 create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
 create mode 100644 arch/x86/crypto/avx512_vaes_common.S
 create mode 100644 arch/x86/crypto/crct10dif-avx512-asm_64.S
 create mode 100644 arch/x86/crypto/ghash-clmulni-intel_avx512.S

Comments

Ard Biesheuvel Dec. 19, 2020, 5:03 p.m. UTC | #1
On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> From: Kyung Min Park <kyung.min.park@intel.com>
>
> Optimize GHASH computations with the 512 bit wide VPCLMULQDQ instructions.
> The new instruction allows to work on 4 x 16 byte blocks at the time.
> For best parallelism and deeper out of order execution, the main loop of
> the code works on 16 x 16 byte blocks at the time and performs reduction
> every 48 x 16 byte blocks. Such approach needs 48 precomputed GHASH subkeys
> and the precompute operation has been optimized as well to leverage 512 bit
> registers, parallel carry less multiply and reduction.
>
> VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ instruction
> with AVX-512F adds EVEX encoded 512 bit version of PCLMULQDQ instruction.
>
> The glue code in ghash_clmulni_intel module overrides existing PCLMULQDQ
> version with the VPCLMULQDQ version when the following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform (currently
>    only Icelake)
> 2. If compiled as built-in module, ghash_clmulni_intel.use_avx512 is set at
>    boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is set
>    to 1 after boot.
>    If compiled as loadable module, use_avx512 module parameter must be set:
>    modprobe ghash_clmulni_intel use_avx512=1
>
> With new implementation, tcrypt ghash speed test shows about 4x to 10x
> speedup improvement for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes per update size is 256 Bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on Icelake based
> platform with constant frequency set for CPU.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see the average improvement of 882%(~8.8x).
> For bytes per update < 1KB, we see the average improvement of 370%(~3.7x).
>
> A typical run of tcrypt with GHASH calculation with PCLMULQDQ instruction
> and VPCLMULQDQ instruction shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |       (the lower the better)     |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |     16     |       144      |        233      |   -38.0    |
> |      64    |     16     |       535      |        709      |   -24.5    |
> |      64    |     64     |       210      |        146      |    43.8    |
> |     256    |     16     |      1808      |       1911      |    -5.4    |
> |     256    |     64     |       865      |        581      |    48.9    |
> |     256    |    256     |       682      |        170      |   301.0    |
> |    1024    |     16     |      6746      |       6935      |    -2.7    |
> |    1024    |    256     |      2829      |        714      |   296.0    |
> |    1024    |   1024     |      2543      |        341      |   645.0    |
> |    2048    |     16     |     13219      |      13403      |    -1.3    |
> |    2048    |    256     |      5435      |       1408      |   286.0    |
> |    2048    |   1024     |      5218      |        685      |   661.0    |
> |    2048    |   2048     |      5061      |        565      |   796.0    |
> |    4096    |     16     |     40793      |      27615      |    47.8    |
> |    4096    |    256     |     10662      |       2689      |   297.0    |
> |    4096    |   1024     |     10196      |       1333      |   665.0    |
> |    4096    |   4096     |     10049      |       1011      |   894.0    |
> |    8192    |     16     |     51672      |      54599      |    -5.3    |
> |    8192    |    256     |     21228      |       5284      |   301.0    |
> |    8192    |   1024     |     20306      |       2556      |   694.0    |
> |    8192    |   4096     |     20076      |       2044      |   882.0    |
> |    8192    |   8192     |     20071      |       2017      |   895.0    |
> ---------------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published
> in Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@intel.com>
> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Kyung Min Park <kyung.min.park@intel.com>
> Co-developed-by: Megha Dey <megha.dey@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well)
Eric Biggers Dec. 21, 2020, 11:20 p.m. UTC | #2
On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
> Optimize crypto algorithms using VPCLMULQDQ and VAES AVX512 instructions
> (first implemented on Intel's Icelake client and Xeon CPUs).
>
> These algorithms take advantage of the AVX512 registers to keep the CPU
> busy and increase memory bandwidth utilization. They provide substantial
> (2-10x) improvements over existing crypto algorithms when update data size
> is greater than 128 bytes and do not have any significant impact when used
> on small amounts of data.
>
> However, these algorithms may also incur a frequency penalty and cause
> collateral damage to other workloads running on the same core(co-scheduled
> threads). These frequency drops are also known as bin drops where 1 bin
> drop is around 100MHz. With the SpecCPU and ffmpeg benchmark, a 0-1 bin
> drop(0-100MHz) is observed on Icelake desktop and 0-2 bin drops (0-200Mhz)
> are observed on the Icelake server.
>

Do these new algorithms all pass the self-tests, including the fuzz tests that
are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?

- Eric
Megha Dey Dec. 28, 2020, 7:10 p.m. UTC | #3
Hi Eric,

On 12/21/2020 3:20 PM, Eric Biggers wrote:
> On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
>> [...]
>
> Do these new algorithms all pass the self-tests, including the fuzz tests that
> are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?


I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this
existed, my bad). I see a couple of errors after enabling it and am
working on fixing those.

Megha

Megha Dey Jan. 16, 2021, 12:14 a.m. UTC | #4
Hi Ard,

On 12/19/2020 9:03 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> From: Kyung Min Park <kyung.min.park@intel.com>
>>
>> [...]
>
> Hello Megha,
>
> What is the purpose of this separate GHASH module? GHASH is only used
> in combination with AES-CTR to produce GCM, and this series already
> contains a GCM driver.
>
> Do cores exist that implement PCLMULQDQ but not AES-NI?
>
> If not, I think we should be able to drop this patch (and remove the
> existing PCLMULQDQ GHASH driver as well)


AFAIK, dm-verity (authenticated but not encrypted file system) is one 
use case for authentication only.

Although I am not sure if GHASH is specifically used for this or SHA?

Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.

Megha
Dave Hansen Jan. 16, 2021, 12:20 a.m. UTC | #5
On 1/15/21 4:14 PM, Dey, Megha wrote:
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.


That's true, but it's also possible that a hypervisor could enumerate
support for PCLMULQDQ and not AES-NI.  In general, we've tried to
implement x86 CPU features independently, even if they never show up in
a real CPU independently.
Eric Biggers Jan. 16, 2021, 1:43 a.m. UTC | #6
On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
> > Hello Megha,
> >
> > What is the purpose of this separate GHASH module? GHASH is only used
> > in combination with AES-CTR to produce GCM, and this series already
> > contains a GCM driver.
> >
> > Do cores exist that implement PCLMULQDQ but not AES-NI?
> >
> > If not, I think we should be able to drop this patch (and remove the
> > existing PCLMULQDQ GHASH driver as well)
>
> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
> case for authentication only.
>
> Although I am not sure if GHASH is specifically used for this or SHA?
>
> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>


dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.

- Eric
Eric Biggers Jan. 16, 2021, 2:04 a.m. UTC | #7
On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> On 1/15/21 4:14 PM, Dey, Megha wrote:
> > Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>
> That's true, but it's also possible that a hypervisor could enumerate
> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> implement x86 CPU features independently, even if they never show up in
> a real CPU independently.


We only add optimized implementations of crypto algorithms if they are actually
useful, though.  If they would never be used in practice, that's not useful.

- Eric
Megha Dey Jan. 16, 2021, 5:07 a.m. UTC | #8
On 1/15/2021 5:43 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:14:40PM -0800, Dey, Megha wrote:
>> AFAIK, dm-verity (authenticated but not encrypted file system) is one use
>> case for authentication only.
>>
>> Although I am not sure if GHASH is specifically used for this or SHA?
>>
>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>>
> dm-verity only uses unkeyed hash algorithms.  So no, it doesn't use GHASH.


Hmm, I see. If that is the case, I am not aware of any other use case
apart from GCM.

I see that the existing GHASH module has been in the kernel since 2009. I am
not sure if there was a use case then which is no longer valid now.

There may be out-of-tree kernel modules using it, but again, that is only
speculation.

So, in the next version should I remove the existing GHASH module? (And
of course remove this patch as well?)

-Megha

Dave Hansen Jan. 16, 2021, 5:13 a.m. UTC | #9
On 1/15/21 6:04 PM, Eric Biggers wrote:
> On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
>> On 1/15/21 4:14 PM, Dey, Megha wrote:
>>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
>> That's true, but it's also possible that a hypervisor could enumerate
>> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
>> implement x86 CPU features independently, even if they never show up in
>> a real CPU independently.
> We only add optimized implementations of crypto algorithms if they are actually
> useful, though.  If they would never be used in practice, that's not useful.


Yes, totally agree.  If it's not of practical use, it doesn't get merged.

I just wanted to share what we do for other related but independent CPU
features.
Ard Biesheuvel Jan. 16, 2021, 4:48 p.m. UTC | #10
On Sat, 16 Jan 2021 at 06:13, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/21 6:04 PM, Eric Biggers wrote:
> > On Fri, Jan 15, 2021 at 04:20:44PM -0800, Dave Hansen wrote:
> >> On 1/15/21 4:14 PM, Dey, Megha wrote:
> >>> Also, I do not know of any cores that implement PCLMULQDQ and not AES-NI.
> >> That's true, but it's also possible that a hypervisor could enumerate
> >> support for PCLMULQDQ and not AES-NI.  In general, we've tried to
> >> implement x86 CPU features independently, even if they never show up in
> >> a real CPU independently.
> > We only add optimized implementations of crypto algorithms if they are actually
> > useful, though.  If they would never be used in practice, that's not useful.
>
> Yes, totally agree.  If it's not of practical use, it doesn't get merged.
>
> I just wanted to share what we do for other related but independent CPU
> features.


Thanks for the insight.

The issue with the current GHASH driver is that it uses infrastructure
that we may decide to remove (the async cryptd helper [0]). So adding
more dependencies on that without any proven benefit should obviously
be avoided at this time as well.

[0] https://lore.kernel.org/linux-arm-kernel/20201218170106.23280-1-ardb@kernel.org/
Ard Biesheuvel Jan. 16, 2021, 4:52 p.m. UTC | #11
On Mon, 28 Dec 2020 at 20:11, Dey, Megha <megha.dey@intel.com> wrote:
>
> Hi Eric,
>
> On 12/21/2020 3:20 PM, Eric Biggers wrote:
> > On Fri, Dec 18, 2020 at 01:10:57PM -0800, Megha Dey wrote:
> >> [...]
> > Do these new algorithms all pass the self-tests, including the fuzz tests that
> > are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
>
> I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
> tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this
> existed, my bad). I see a couple of errors after enabling it and am working
> on fixing those.
>


Hello Megha,

I think the GHASH changes can be dropped (as discussed in the other
thread), given the lack of a use case. The existing GHASH driver could
also be removed in the future, but I don't think it needs to be part
of this series.

Could you please rebase this onto the latest AES-NI changes that are
in Herbert's tree? (as well as the ones I sent out today) They address
some issues with indirect calls and excessive disabling of preemption,
and your GCM and CTR changes are definitely going to be affected by
this as well.
Ard Biesheuvel Jan. 16, 2021, 4:54 p.m. UTC | #12
On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>
> This is a preparatory patch to introduce the optimized crypto algorithms
> using AVX512 instructions which would require VAES and VPCLMULQDQ support.
>
> Check for VAES and VPCLMULQDQ assembler support using AVX512 registers.
>
> Cc: x86@kernel.org
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/Kconfig.assembler | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
> index 26b8c08..9ea0bc8 100644
> --- a/arch/x86/Kconfig.assembler
> +++ b/arch/x86/Kconfig.assembler
> @@ -1,6 +1,16 @@
>  # SPDX-License-Identifier: GPL-2.0
>  # Copyright (C) 2020 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
>
> +config AS_VAES_AVX512
> +       def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT

Is the '&& 64BIT' necessary here, but not below?

In any case, better to use a separate 'depends on' line, for legibility

> +       help
> +         Supported by binutils >= 2.30 and LLVM integrated assembler
> +
> +config AS_VPCLMULQDQ
> +       def_bool $(as-instr,vpclmulqdq \$0$(comma)%zmm2$(comma)%zmm6$(comma)%zmm4)
> +       help
> +         Supported by binutils >= 2.30 and LLVM integrated assembler
> +
>  config AS_AVX512
>         def_bool $(as-instr,vpmovm2b %k1$(comma)%zmm5)
>         help
> --
> 2.7.4
>
Ard Biesheuvel Jan. 16, 2021, 5:16 p.m. UTC | #13
On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>
> Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
> and decode routines using VPCLMULQDQ.
>
> The glue code in AESNI module overrides the existing AVX2 GCM mode
> encryption/decryption routines with the AVX512 AES GCM mode ones when the
> following criteria are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. toolchain(assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on a platform
>    (currently only Icelake)
> 2. aesni_intel.use_avx512 module parameter is set at boot time. For this
>    algorithm, switching from AVX512 optimized version is not possible
>    once set at boot time because of how the code is structured today. (Can
>    be changed later if required)
>
> The functions aesni_gcm_init_avx_512, aesni_gcm_enc_update_avx_512,
> aesni_gcm_dec_update_avx_512 and aesni_gcm_finalize_avx_512 are adapted
> from the Intel Optimized IPSEC Cryptographic library.
>
> On an Icelake desktop, with turbo disabled and all CPUs running at
> maximum frequency, the AVX512 GCM mode optimization shows better
> performance across data & key sizes as measured by tcrypt.
>
> The average performance improvement of the AVX512 version over the AVX2
> version is as follows:
> For all key sizes(128/192/256 bits),
>         data sizes < 128 bytes/block, negligible improvement (~7.5%)
>         data sizes > 128 bytes/block, there is an average improvement of
>         40% for both encryption and decryption.
>
> A typical run of tcrypt with AES GCM mode encryption/decryption of the
> AVX2 and AVX512 optimization on an Icelake desktop shows the following
> results:
>
>   ----------------------------------------------------------------------
>   |   key  | bytes | cycles/op (lower is better)   | Percentage gain/  |
>   | length |   per |   encryption  |  decryption   |      loss         |
>   | (bits) | block |-------------------------------|-------------------|
>   |        |       | avx2 | avx512 | avx2 | avx512 | Encrypt | Decrypt |
>   |---------------------------------------------------------------------
>   |  128   | 16    | 689  |  701   | 689  |  707   |  -1.7   |  -2.61  |
>   |  128   | 64    | 731  |  660   | 771  |  649   |   9.7   |  15.82  |
>   |  128   | 256   | 911  |  750   | 900  |  721   |  17.67  |  19.88  |
>   |  128   | 512   | 1181 |  814   | 1161 |  782   |  31.07  |  32.64  |
>   |  128   | 1024  | 1676 |  1052  | 1685 |  1030  |  37.23  |  38.87  |
>   |  128   | 2048  | 2475 |  1447  | 2456 |  1419  |  41.53  |  42.22  |
>   |  128   | 4096  | 3806 |  2154  | 3820 |  2119  |  43.41  |  44.53  |
>   |  128   | 8192  | 9169 |  3806  | 6997 |  3718  |  58.49  |  46.86  |
>   |  192   | 16    | 754  |  683   | 737  |  672   |   9.42  |   8.82  |
>   |  192   | 64    | 735  |  686   | 715  |  640   |   6.66  |  10.49  |
>   |  192   | 256   | 949  |  738   | 2435 |  729   |  22.23  |  70     |
>   |  192   | 512   | 1235 |  854   | 1200 |  833   |  30.85  |  30.58  |
>   |  192   | 1024  | 1777 |  1084  | 1763 |  1051  |  38.99  |  40.39  |
>   |  192   | 2048  | 2574 |  1497  | 2592 |  1459  |  41.84  |  43.71  |
>   |  192   | 4096  | 4086 |  2317  | 4091 |  2244  |  43.29  |  45.14  |
>   |  192   | 8192  | 7481 |  4054  | 7505 |  3953  |  45.81  |  47.32  |
>   |  256   | 16    | 755  |  682   | 720  |  683   |   9.68  |   5.14  |
>   |  256   | 64    | 744  |  677   | 719  |  658   |   9     |   8.48  |
>   |  256   | 256   | 962  |  758   | 948  |  749   |  21.21  |  21     |
>   |  256   | 512   | 1297 |  862   | 1276 |  836   |  33.54  |  34.48  |
>   |  256   | 1024  | 1831 |  1114  | 1819 |  1095  |  39.16  |  39.8   |
>   |  256   | 2048  | 2767 |  1566  | 2715 |  1524  |  43.4   |  43.87  |
>   |  256   | 4096  | 4378 |  2382  | 4368 |  2354  |  45.6   |  46.11  |
>   |  256   | 8192  | 8075 |  4262  | 8080 |  4186  |  47.22  |  48.19  |
>   ----------------------------------------------------------------------
>
> This work was inspired by the AES GCM mode optimization published in
> Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/blob/master/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
> Signed-off-by: Megha Dey <megha.dey@intel.com>
> ---
>  arch/x86/crypto/Makefile                    |    1 +
>  arch/x86/crypto/aesni-intel_avx512-x86_64.S | 1788 +++++++++++++++++++++++++++
>  arch/x86/crypto/aesni-intel_glue.c          |   62 +-
>  crypto/Kconfig                              |   12 +
>  4 files changed, 1858 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/crypto/aesni-intel_avx512-x86_64.S
>

...
> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
> index 9e56cdf..8fc5bac 100644
> --- a/arch/x86/crypto/aesni-intel_glue.c
> +++ b/arch/x86/crypto/aesni-intel_glue.c
> @@ -55,13 +55,16 @@ MODULE_PARM_DESC(use_avx512, "Use AVX512 optimized algorithm, if available");
>   * This needs to be 16 byte aligned.
>   */
>  struct aesni_rfc4106_gcm_ctx {
> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
> +       /* AVX512 optimized algorithms use 48 hash keys to conduct
> +        * multiple PCLMULQDQ operations in parallel
> +        */
> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>         struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>         u8 nonce[4];
>  };
>
>  struct generic_gcmaes_ctx {
> -       u8 hash_subkey[16] AESNI_ALIGN_ATTR;
> +       u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;
>         struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
>  };
>
> @@ -82,7 +85,7 @@ struct gcm_context_data {
>         u8 current_counter[GCM_BLOCK_LEN];
>         u64 partial_block_len;
>         u64 unused;
> -       u8 hash_keys[GCM_BLOCK_LEN * 16];
> +       u8 hash_keys[48 * 16];
>  };
>

This structure gets allocated on the stack, and gets inflated
significantly by this change, even though the code is not enabled by
default, and not even supported for most users.

Is it really necessary for this to be per-request data? If these are
precomputed powers of H, they can be moved into the TFM context
structure instead, which lives on the heap (and can be shared by all
concurrent users of the TFM)
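
A sketch of the suggested layout, assuming the structures quoted above
(the first few gcm_context_data field names are taken from the existing
aesni-intel_glue.c and may not match the final patch):

  /* Per-key, heap-allocated: computed once at setkey time and shared by
   * all requests using the same key.
   */
  struct generic_gcmaes_ctx {
  	u8 hash_subkey[16 * 48] AESNI_ALIGN_ATTR;	/* H^1 .. H^48 */
  	struct crypto_aes_ctx aes_key_expanded AESNI_ALIGN_ATTR;
  };

  /* Per-request, on the stack: scratch state only, no hash_keys[] copy.
   * The asm routines would instead take a pointer to the TFM's
   * hash_subkey[] array.
   */
  struct gcm_context_data {
  	u8 aad_hash[GCM_BLOCK_LEN];
  	u64 aad_length;
  	u64 in_length;
  	u8 partial_block_enc_key[GCM_BLOCK_LEN];
  	u8 orig_IV[GCM_BLOCK_LEN];
  	u8 current_counter[GCM_BLOCK_LEN];
  	u64 partial_block_len;
  	u64 unused;
  };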

>  asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
> @@ -266,6 +269,47 @@ static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_gen2 = {
>         .finalize = &aesni_gcm_finalize_avx_gen2,
>  };
>
> +#ifdef CONFIG_CRYPTO_AES_GCM_AVX512
> +/*
> + * asmlinkage void aesni_gcm_init_avx_512()
> + * gcm_data *my_ctx_data, context data
> + * u8 *hash_subkey,  the Hash sub key input. Data starts on a 16-byte boundary.
> + */
> +asmlinkage void aesni_gcm_init_avx_512(void *my_ctx_data,
> +                                      struct gcm_context_data *gdata,
> +                                      u8 *iv,
> +                                      u8 *hash_subkey,
> +                                      const u8 *aad,
> +                                      unsigned long aad_len);
> +asmlinkage void aesni_gcm_enc_update_avx_512(void *ctx,
> +                                            struct gcm_context_data *gdata,
> +                                            u8 *out,
> +                                            const u8 *in,
> +                                            unsigned long plaintext_len);
> +asmlinkage void aesni_gcm_dec_update_avx_512(void *ctx,
> +                                            struct gcm_context_data *gdata,
> +                                            u8 *out,
> +                                            const u8 *in,
> +                                            unsigned long ciphertext_len);
> +asmlinkage void aesni_gcm_finalize_avx_512(void *ctx,
> +                                          struct gcm_context_data *gdata,
> +                                          u8 *auth_tag,
> +                                          unsigned long auth_tag_len);
> +
> +asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey);
> +
> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {
> +       .init = &aesni_gcm_init_avx_512,
> +       .enc_update = &aesni_gcm_enc_update_avx_512,
> +       .dec_update = &aesni_gcm_dec_update_avx_512,
> +       .finalize = &aesni_gcm_finalize_avx_512,
> +};
> +#else
> +static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
> +{}
> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
> +#endif
> +

Please drop the alternative dummy definitions.
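
One way to drop them, sketched under the assumption that every call site
keeps an IS_ENABLED() guard like the one later in this patch: leave the
declaration visible unconditionally and let the compiler eliminate the
dead branch, so no stub definition is needed:

  /* Declaration stays outside the #ifdef; only the definition (in the
   * .S file / the #ifdef block above) is conditional.
   */
  asmlinkage void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx,
  					u8 *hash_subkey);

  static void gcm_set_hash_subkey(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
  {
  	/* When CONFIG_CRYPTO_AES_GCM_AVX512 is off, IS_ENABLED() makes
  	 * this branch statically dead, the reference is discarded at
  	 * compile time, and no dummy definition is required.
  	 */
  	if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
  	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
  		aes_gcm_precomp_avx_512(ctx, hash_subkey);
  	else
  		aes_encrypt(ctx, hash_subkey, hash_subkey);
  }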

>  /*
>   * asmlinkage void aesni_gcm_init_avx_gen4()
>   * gcm_data *my_ctx_data, context data
> @@ -669,7 +713,11 @@ rfc4106_set_hash_subkey(u8 *hash_subkey, const u8 *key, unsigned int key_len)
>         /* We want to cipher all zeros to create the hash sub key. */
>         memset(hash_subkey, 0, RFC4106_HASH_SUBKEY_SIZE);
>
> -       aes_encrypt(&ctx, hash_subkey, hash_subkey);
> +       if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
> +               aes_gcm_precomp_avx_512(&ctx, hash_subkey);
> +       else
> +               aes_encrypt(&ctx, hash_subkey, hash_subkey);
>


I suppose this answers my question about the subkeys. Please find a
way to move these out of struct gcm_context_data so they don't need to
be copied to the stack for each request.


>         memzero_explicit(&ctx, sizeof(ctx));
>         return 0;
> @@ -1114,7 +1162,11 @@ static int __init aesni_init(void)
>         if (!x86_match_cpu(aesni_cpu_id))
>                 return -ENODEV;
>  #ifdef CONFIG_X86_64
> -       if (boot_cpu_has(X86_FEATURE_AVX2)) {
> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
> +               pr_info("AVX512 version of gcm_enc/dec engaged.\n");
> +               aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;


This was changed in the cryptodev tree to use static keys.
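
For reference, a minimal sketch of the static-key pattern being referred
to; how it maps onto this code is an assumption (the actual cryptodev
changes may look different):

  #include <linux/jump_label.h>

  static DEFINE_STATIC_KEY_FALSE(gcm_use_avx512);

  /* in aesni_init(), after the existing feature/parameter checks: */
  	if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
  	    cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
  		static_branch_enable(&gcm_use_avx512);

  /* on the hot path, a patched jump replaces the indirect call: */
  	if (static_branch_likely(&gcm_use_avx512))
  		aesni_gcm_enc_update_avx_512(ctx, data, out, in, len);
  	else
  		aesni_gcm_enc_update_avx_gen4(ctx, data, out, in, len);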

> +       } else if (boot_cpu_has(X86_FEATURE_AVX2)) {
>                 pr_info("AVX2 version of gcm_enc/dec engaged.\n");
>                 aesni_gcm_tfm = &aesni_gcm_tfm_avx_gen4;
>         } else if (boot_cpu_has(X86_FEATURE_AVX)) {
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index 3043849..8c8a68d 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -661,6 +661,18 @@ config CRYPTO_AES_CTR_AVX512
>         depends on CRYPTO_AES_NI_INTEL
>         depends on AS_VAES_AVX512
>
> +# We default CRYPTO_AES_GCM_AVX512 to Y but depend on CRYPTO_AVX512 in
> +# order to have a singular option (CRYPTO_AVX512) select multiple algorithms
> +# when supported. Specifically, if the platform and/or toolset does not
> +# support VPCLMULQDQ. Then this algorithm should not be supported as part of
> +# the set that CRYPTO_AVX512 selects.
> +config CRYPTO_AES_GCM_AVX512
> +       bool
> +       default y
> +       depends on CRYPTO_AVX512
> +       depends on CRYPTO_AES_NI_INTEL
> +       depends on AS_VPCLMULQDQ
> +
>  config CRYPTO_CRC32C_SPARC64
>         tristate "CRC32c CRC algorithm (SPARC64)"
>         depends on SPARC64
> --
> 2.7.4
>
Megha Dey Jan. 16, 2021, 6:35 p.m. UTC | #14
Hi Ard,

On 1/16/2021 8:52 AM, Ard Biesheuvel wrote:
> On Mon, 28 Dec 2020 at 20:11, Dey, Megha <megha.dey@intel.com> wrote:
>> On 12/21/2020 3:20 PM, Eric Biggers wrote:
>>> Do these new algorithms all pass the self-tests, including the fuzz tests that
>>> are enabled when CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
>> I had tested these algorithms with CRYPTO_MANAGER_DISABLE_TESTS=n and
>> tcrypt, not with CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (I wasn't aware this
>> existed, my bad). I see a couple of errors after enabling it and am
>> working on fixing those.

> Hello Megha,
>
> I think the GHASH changes can be dropped (as discussed in the other
> thread), given the lack of a use case. The existing GHASH driver could
> also be removed in the future, but I don't think it needs to be part
> of this series.

Ok, I will remove the GHASH patch from the next series.
> Could you please rebase this onto the latest AES-NI changes that are
> in Herbert's tree? (as well as the ones I sent out today) They address
> some issues with indirect calls and excessive disabling of preemption,
> and your GCM and CTR changes are definitely going to be affected by
> this as well.

Yeah sure, will do, thanks for the heads-up!
Megha Dey Jan. 20, 2021, 10:38 p.m. UTC | #15
Hi Ard,

On 1/16/2021 8:54 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@intel.com> wrote:
>> This is a preparatory patch to introduce the optimized crypto algorithms
>> using AVX512 instructions which would require VAES and VPCLMULQDQ support.
>>
>> [...]
>>
>> +config AS_VAES_AVX512
>> +       def_bool $(as-instr,vaesenc %zmm0$(comma)%zmm1$(comma)%zmm1) && 64BIT
> Is the '&& 64BIT' necessary here, but not below?
>
> In any case, better to use a separate 'depends on' line, for legibility


yeah , I think the '&& 64 BIT' is not required. I will remove it in the 
next version.

-Megha

Megha Dey Jan. 20, 2021, 10:48 p.m. UTC | #16
Hi Ard,

On 1/16/2021 9:16 AM, Ard Biesheuvel wrote:
> On Fri, 18 Dec 2020 at 22:08, Megha Dey <megha.dey@intel.com> wrote:
>> Introduce the AVX512 implementation that optimizes the AESNI-GCM encode
>> and decode routines using VPCLMULQDQ.
>>
>> [...]

> ...

>> diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
>> [...]
>> @@ -82,7 +85,7 @@ struct gcm_context_data {
>>          u8 current_counter[GCM_BLOCK_LEN];
>>          u64 partial_block_len;
>>          u64 unused;
>> -       u8 hash_keys[GCM_BLOCK_LEN * 16];
>> +       u8 hash_keys[48 * 16];
>>   };
>>
> This structure gets allocated on the stack, and gets inflated
> significantly by this change, even though the code is not enabled by
> default, and not even supported for most users.

Hmm yeah, this makes sense. I will look into it.
> Is it really necessary for this to be per-request data? If these are
> precomputed powers of H, they can be moved into the TFM context
> structure instead, which lives on the heap (and can be shared by all
> concurrent users of the TFM)

Yeah, this is per-request data.
>> +#else
>> +static void aes_gcm_precomp_avx_512(struct crypto_aes_ctx *ctx, u8 *hash_subkey)
>> +{}
>> +static const struct aesni_gcm_tfm_s aesni_gcm_tfm_avx_512 = {};
>> +#endif
>> +
> Please drop the alternative dummy definitions.

ok
>> -       aes_encrypt(&ctx, hash_subkey, hash_subkey);
>> +       if (IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) && use_avx512 &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ))
>> +               aes_gcm_precomp_avx_512(&ctx, hash_subkey);
>> +       else
>> +               aes_encrypt(&ctx, hash_subkey, hash_subkey);
>>
> I suppose this answers my question about the subkeys. Please find a
> way to move these out of struct gcm_context_data so they don't need to
> be copied to the stack for each request.

Hmm yeah. I will move this allocation to the heap instead.
>> +       if (use_avx512 && IS_ENABLED(CONFIG_CRYPTO_AES_GCM_AVX512) &&
>> +           cpu_feature_enabled(X86_FEATURE_VPCLMULQDQ)) {
>> +               pr_info("AVX512 version of gcm_enc/dec engaged.\n");
>> +               aesni_gcm_tfm = &aesni_gcm_tfm_avx_512;
> This was changed in the cryptodev tree to use static keys.

yep, will make the necessary changes.

-Megha
