From patchwork Fri Mar 29 08:03:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784585 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77F1745016; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699560; cv=none; b=ikkIlxm8IA5DC76VwwWodXST10KRwOx3k3OjXYqWRO+gcBqOxbuziC/wCWE6jMsfYUzi2PxQDuv+b9zHIvabQeK5p4rvUZ1Efoz5BnNduEfUHTk6h8VAws94FxSCtvy2KOO2D09rKFEV8hkDqevw2FYxugvqoj8tUq049Z3b33o= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699560; c=relaxed/simple; bh=XX+VRU77adYKl58Xf5SH3Psy1Aivalng4w2owP5PZaM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=UauHUXgV5ed66seL5BShnnjnYLRPjUqOp+onFJU/xoTvw6aF0Fp0CU6hT7xzok70Pcva8MUEDemRsS90+yp0W5vjf2Ml6DSynaHfFm0Ik/RQvarWYLmmxcKSvd/dYZ6Uy9TzVy7S8UWtnICoMqfDLC28+jKliYoj6bbG9Gp+8G4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=pGWTgk/H; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="pGWTgk/H" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CB36DC433C7; Fri, 29 Mar 2024 08:05:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699560; bh=XX+VRU77adYKl58Xf5SH3Psy1Aivalng4w2owP5PZaM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=pGWTgk/HOt2BEKHYJrtDMwiPg7uV/Rq5N/hvnbapV1svXV7MqGHAG/zdfJxfwPfLx cpJB1vtpZFiBaKhsCUQ0+ySb4idOjLg//OMzEoTWdqIg/ZZLa2c3rLpFx8LcQ4k14o SwY2hZnZ/VGoT037asg0u5THFYPZOya/PVG84itX25eq4i8UVTBoCoUVk+keN6vfWd 8fyMbYjH2fRkyl5SPI/FWhA7ozrQXvzeGsBzTDhAw4E/+mPG+nFcOpKXGWox3+uwyK 1tvNOgxWaggTq4kGclnN0GlXOO9Q9ZKIPtg4PDeCTrToWVwK4jG3Wf6MVwSYXEiZ6H 45OLpA1qvm9pA== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" , Ingo Molnar Subject: [PATCH v2 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Date: Fri, 29 Mar 2024 01:03:49 -0700 Message-ID: <20240329080355.2871-2-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the assembler supports the vector AES and carryless multiplication cryptographic extensions. Reviewed-by: Ingo Molnar Signed-off-by: Eric Biggers --- arch/x86/Kconfig.assembler | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler index 8ad41da301e5..59aedf32c4ea 100644 --- a/arch/x86/Kconfig.assembler +++ b/arch/x86/Kconfig.assembler @@ -23,9 +23,19 @@ config AS_TPAUSE config AS_GFNI def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2) help Supported by binutils >= 2.30 and LLVM integrated assembler +config AS_VAES + def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2) + help + Supported by binutils >= 2.30 and LLVM integrated assembler + +config AS_VPCLMULQDQ + def_bool $(as-instr,vpclmulqdq \$0x10$(comma)%ymm0$(comma)%ymm1$(comma)%ymm2) + help + Supported by binutils >= 2.30 and LLVM integrated assembler + config AS_WRUSS def_bool $(as-instr,wrussq %rax$(comma)(%rbx)) help Supported by binutils >= 2.31 and LLVM integrated assembler From patchwork Fri Mar 29 08:03:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784170 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E81D148787; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; cv=none; b=V9Lgb8CrDCaKAhXT8kegaHb0xzQcmZNYraXIFw1MJx9Ws7Uat3pGoOnPPkcxnmC/VvaEue6t73/isZCuNhGwD+pMtTRAZnNx3RcBy1JZhD0EPyKAJc5aNMYHJeAIUFfeA67aDCd+0fhJbmqskqR3ORsF8T4bHAvEfGu7JXRmYWk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; c=relaxed/simple; bh=pWhmPihwILTv/IMwtfnHZMx2QGXDzXc1wxNtI5HBEb0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Z9bcWoIyXIXo1DQYz+HWtMEs45RJOBYR0NcY3tKTmAPCcLfz1J7aod2IIeb29YfTe8YCBgOOJvMzASdiLGtmUSQVXzcv8ym2ngKCkRsAjnlwWLqwHTgtBtTqfcb0whQGOk6cu5B1QOd+W+sAs2jgbpSdBbN6B3UYPNKSWVe4aw8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=cV3jBj7T; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="cV3jBj7T" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 372EEC433B2; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699560; bh=pWhmPihwILTv/IMwtfnHZMx2QGXDzXc1wxNtI5HBEb0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=cV3jBj7T4QdisTABQsUAGh0X+ZmzAg3hHe63DU7cqcUfKWehg+O7l8CqzNyIBpm9W ytQoilhFT3USXP8wYBQao/XKocdjUVBoB97WfecUihkQnldNWxTrKb6JHYDSjbSerX /uqS6gsxerontWIU7648GC7CeB0eJPP+a6kiqeXMWa4LQ5PRcNesN2CqW4WF/FNAxv 6Qihnd2JCgtarveJfC8duyOqttfZiEUwaLZGCXVT5ruofV+fx8bkrcWDZcXd7QDFK7 gsMO3i2ay+MXRsh93GvdTobtWKgLufsAfQtoyxhKJbh5opre//IiSPDuN4h2lcj7Yy 41YgBm+TaQIjA== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" Subject: [PATCH v2 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Date: Fri, 29 Mar 2024 01:03:50 -0700 Message-ID: <20240329080355.2871-3-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an assembly file aes-xts-avx-x86_64.S which contains a macro that expands into AES-XTS implementations for x86_64 CPUs that support at least AES-NI and AVX, optionally also taking advantage of VAES, VPCLMULQDQ, and AVX512 or AVX10. This patch doesn't expand the macro at all. Later patches will do so, adding each implementation individually so that the motivation and use case for each individual implementation can be fully presented. The file also provides a function aes_xts_encrypt_iv() which handles the encryption of the IV (tweak), using AES-NI and AVX. Signed-off-by: Eric Biggers --- arch/x86/crypto/Makefile | 3 +- arch/x86/crypto/aes-xts-avx-x86_64.S | 800 +++++++++++++++++++++++++++ 2 files changed, 802 insertions(+), 1 deletion(-) create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile index 9aa46093c91b..9c5ce5613738 100644 --- a/arch/x86/crypto/Makefile +++ b/arch/x86/crypto/Makefile @@ -46,11 +46,12 @@ obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o -aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o +aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o \ + aes_ctrby8_avx-x86_64.o aes-xts-avx-x86_64.o obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S new file mode 100644 index 000000000000..a5e2783c46ec --- /dev/null +++ b/arch/x86/crypto/aes-xts-avx-x86_64.S @@ -0,0 +1,800 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * AES-XTS for modern x86_64 CPUs + * + * Copyright 2024 Google LLC + * + * Author: Eric Biggers + */ + +/* + * This file implements AES-XTS for modern x86_64 CPUs. To handle the + * complexities of coding for x86 SIMD, e.g. where every vector length needs + * different code, it uses a macro to generate several implementations that + * share similar source code but are targeted at different CPUs, listed below: + * + * AES-NI + AVX + * - 128-bit vectors (1 AES block per vector) + * - VEX-coded instructions + * - xmm0-xmm15 + * - This is for older CPUs that lack VAES but do have AVX. + * + * VAES + VPCLMULQDQ + AVX2 + * - 256-bit vectors (2 AES blocks per vector) + * - VEX-coded instructions + * - ymm0-ymm15 + * - This is for CPUs that have VAES but lack AVX512 or AVX10, + * e.g. Intel's Alder Lake and AMD's Zen 3. + * + * VAES + VPCLMULQDQ + AVX10/256 + BMI2 + * - 256-bit vectors (2 AES blocks per vector) + * - EVEX-coded instructions + * - ymm0-ymm31 + * - This is for CPUs that have AVX512 but where using zmm registers causes + * downclocking, and for CPUs that have AVX10/256 but not AVX10/512. + * - By "AVX10/256" we really mean (AVX512BW + AVX512VL) || AVX10/256. + * To avoid confusion with 512-bit, we just write AVX10/256. + * + * VAES + VPCLMULQDQ + AVX10/512 + BMI2 + * - Same as the previous one, but upgrades to 512-bit vectors + * (4 AES blocks per vector) in zmm0-zmm31. + * - This is for CPUs that have good AVX512 or AVX10/512 support. + * + * This file doesn't have an implementation for AES-NI alone (without AVX), as + * the lack of VEX would make all the assembly code different. + * + * When we use VAES, we also use VPCLMULQDQ to parallelize the computation of + * the XTS tweaks. This avoids a bottleneck. Currently there don't seem to be + * any CPUs that support VAES but not VPCLMULQDQ. If that changes, we might + * need to start also providing an implementation using VAES alone. + * + * The AES-XTS implementations in this file support everything required by the + * crypto API, including support for arbitrary input lengths and multi-part + * processing. However, they are most heavily optimized for the common case of + * power-of-2 length inputs that are processed in a single part (disk sectors). + */ + +#include +#include + +.section .rodata +.p2align 4 +.Lgf_poly: + // The low 64 bits of this value represent the polynomial x^7 + x^2 + x + // + 1. It is the value that must be XOR'd into the low 64 bits of the + // tweak each time a 1 is carried out of the high 64 bits. + // + // The high 64 bits of this value is just the internal carry bit that + // exists when there's a carry out of the low 64 bits of the tweak. + .quad 0x87, 1 + + // This table contains constants for vpshufb and vpblendvb, used to + // handle variable byte shifts and blending during ciphertext stealing + // on CPUs that don't support AVX10-style masking. +.Lcts_permute_table: + .byte 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 + .byte 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 + .byte 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07 + .byte 0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f + .byte 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 + .byte 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80 +.text + +// Function parameters +.set KEY, %rdi // Initially points to crypto_aes_ctx, then is + // advanced to point directly to the round keys +.set SRC, %rsi // Pointer to next source data +.set DST, %rdx // Pointer to next destination data +.set LEN, %rcx // Remaining length in bytes +.set TWEAK, %r8 // Pointer to next tweak + +// %r9d holds the AES key length in bytes. +.set KEYLEN, %r9d + +// %rax and %r10-r11 are available as temporaries. + +.macro _define_Vi i +.if VL == 16 + .set V\i, %xmm\i +.elseif VL == 32 + .set V\i, %ymm\i +.elseif VL == 64 + .set V\i, %zmm\i +.else + .error "Unsupported Vector Length (VL)" +.endif +.endm + +.macro _define_aliases + // Define register aliases V0-V15, or V0-V31 if all 32 SIMD registers + // are available, that map to the xmm, ymm, or zmm registers according + // to the selected Vector Length (VL). + _define_Vi 0 + _define_Vi 1 + _define_Vi 2 + _define_Vi 3 + _define_Vi 4 + _define_Vi 5 + _define_Vi 6 + _define_Vi 7 + _define_Vi 8 + _define_Vi 9 + _define_Vi 10 + _define_Vi 11 + _define_Vi 12 + _define_Vi 13 + _define_Vi 14 + _define_Vi 15 +.if USE_AVX10 + _define_Vi 16 + _define_Vi 17 + _define_Vi 18 + _define_Vi 19 + _define_Vi 20 + _define_Vi 21 + _define_Vi 22 + _define_Vi 23 + _define_Vi 24 + _define_Vi 25 + _define_Vi 26 + _define_Vi 27 + _define_Vi 28 + _define_Vi 29 + _define_Vi 30 + _define_Vi 31 +.endif + + // V0-V3 hold the data blocks during the main loop, or temporary values + // otherwise. V4-V5 hold temporary values. + + // V6-V9 hold XTS tweaks. Each 128-bit lane holds one tweak. + .set TWEAK0_XMM, %xmm6 + .set TWEAK0, V6 + .set TWEAK1_XMM, %xmm7 + .set TWEAK1, V7 + .set TWEAK2, V8 + .set TWEAK3, V9 + + // V10-V13 are used for computing the next values of TWEAK[0-3]. + .set NEXT_TWEAK0, V10 + .set NEXT_TWEAK1, V11 + .set NEXT_TWEAK2, V12 + .set NEXT_TWEAK3, V13 + + // V14 holds the constant from .Lgf_poly, copied to all 128-bit lanes. + .set GF_POLY_XMM, %xmm14 + .set GF_POLY, V14 + + // V15 holds the first AES round key, copied to all 128-bit lanes. + .set KEY0_XMM, %xmm15 + .set KEY0, V15 + + // If 32 SIMD registers are available, then V16-V29 hold the remaining + // AES round keys, copied to all 128-bit lanes. +.if USE_AVX10 + .set KEY1_XMM, %xmm16 + .set KEY1, V16 + .set KEY2_XMM, %xmm17 + .set KEY2, V17 + .set KEY3_XMM, %xmm18 + .set KEY3, V18 + .set KEY4_XMM, %xmm19 + .set KEY4, V19 + .set KEY5_XMM, %xmm20 + .set KEY5, V20 + .set KEY6_XMM, %xmm21 + .set KEY6, V21 + .set KEY7_XMM, %xmm22 + .set KEY7, V22 + .set KEY8_XMM, %xmm23 + .set KEY8, V23 + .set KEY9_XMM, %xmm24 + .set KEY9, V24 + .set KEY10_XMM, %xmm25 + .set KEY10, V25 + .set KEY11_XMM, %xmm26 + .set KEY11, V26 + .set KEY12_XMM, %xmm27 + .set KEY12, V27 + .set KEY13_XMM, %xmm28 + .set KEY13, V28 + .set KEY14_XMM, %xmm29 + .set KEY14, V29 +.endif + // V30-V31 are currently unused. +.endm + +// Move a vector between memory and a register. +.macro _vmovdqu src, dst +.if VL < 64 + vmovdqu \src, \dst +.else + vmovdqu8 \src, \dst +.endif +.endm + +// Broadcast a 128-bit value into a vector. +.macro _vbroadcast128 src, dst +.if VL == 16 && !USE_AVX10 + vmovdqu \src, \dst +.elseif VL == 32 && !USE_AVX10 + vbroadcasti128 \src, \dst +.else + vbroadcasti32x4 \src, \dst +.endif +.endm + +// XOR two vectors together. +.macro _vpxor src1, src2, dst +.if USE_AVX10 + vpxord \src1, \src2, \dst +.else + vpxor \src1, \src2, \dst +.endif +.endm + +// XOR three vectors together. +.macro _xor3 src1, src2, src3_and_dst +.if USE_AVX10 + // vpternlogd with immediate 0x96 is a three-argument XOR. + vpternlogd $0x96, \src1, \src2, \src3_and_dst +.else + vpxor \src1, \src3_and_dst, \src3_and_dst + vpxor \src2, \src3_and_dst, \src3_and_dst +.endif +.endm + +// Given a 128-bit XTS tweak in the xmm register \src, compute the next tweak +// (by multiplying by the polynomial 'x') and write it to \dst. +.macro _next_tweak src, tmp, dst + vpshufd $0x13, \src, \tmp + vpaddq \src, \src, \dst + vpsrad $31, \tmp, \tmp + vpand GF_POLY_XMM, \tmp, \tmp + vpxor \tmp, \dst, \dst +.endm + +// Given the XTS tweak(s) in the vector \src, compute the next vector of +// tweak(s) (by multiplying by the polynomial 'x^(VL/16)') and write it to \dst. +// +// If VL > 16, then there are multiple tweaks, and we use vpclmulqdq to compute +// all tweaks in the vector in parallel. If VL=16, we just do the regular +// computation without vpclmulqdq, as it's the faster method for a single tweak. +.macro _next_tweakvec src, tmp1, tmp2, dst +.if VL == 16 + _next_tweak \src, \tmp1, \dst +.else + vpsrlq $64 - VL/16, \src, \tmp1 + vpclmulqdq $0x01, GF_POLY, \tmp1, \tmp2 + vpslldq $8, \tmp1, \tmp1 + vpsllq $VL/16, \src, \dst + _xor3 \tmp1, \tmp2, \dst +.endif +.endm + +// Given the first XTS tweak at (TWEAK), compute the first set of tweaks and +// store them in the vector registers TWEAK0-TWEAK3. Clobbers V0-V5. +.macro _compute_first_set_of_tweaks + vmovdqu (TWEAK), TWEAK0_XMM + _vbroadcast128 .Lgf_poly(%rip), GF_POLY +.if VL == 16 + // With VL=16, multiplying by x serially is fastest. + _next_tweak TWEAK0, %xmm0, TWEAK1 + _next_tweak TWEAK1, %xmm0, TWEAK2 + _next_tweak TWEAK2, %xmm0, TWEAK3 +.else +.if VL == 32 + // Compute the second block of TWEAK0. + _next_tweak TWEAK0_XMM, %xmm0, %xmm1 + vinserti128 $1, %xmm1, TWEAK0, TWEAK0 +.elseif VL == 64 + // Compute the remaining blocks of TWEAK0. + _next_tweak TWEAK0_XMM, %xmm0, %xmm1 + _next_tweak %xmm1, %xmm0, %xmm2 + _next_tweak %xmm2, %xmm0, %xmm3 + vinserti32x4 $1, %xmm1, TWEAK0, TWEAK0 + vinserti32x4 $2, %xmm2, TWEAK0, TWEAK0 + vinserti32x4 $3, %xmm3, TWEAK0, TWEAK0 +.endif + // Compute TWEAK[1-3] from TWEAK0. + vpsrlq $64 - 1*VL/16, TWEAK0, V0 + vpsrlq $64 - 2*VL/16, TWEAK0, V2 + vpsrlq $64 - 3*VL/16, TWEAK0, V4 + vpclmulqdq $0x01, GF_POLY, V0, V1 + vpclmulqdq $0x01, GF_POLY, V2, V3 + vpclmulqdq $0x01, GF_POLY, V4, V5 + vpslldq $8, V0, V0 + vpslldq $8, V2, V2 + vpslldq $8, V4, V4 + vpsllq $1*VL/16, TWEAK0, TWEAK1 + vpsllq $2*VL/16, TWEAK0, TWEAK2 + vpsllq $3*VL/16, TWEAK0, TWEAK3 +.if USE_AVX10 + vpternlogd $0x96, V0, V1, TWEAK1 + vpternlogd $0x96, V2, V3, TWEAK2 + vpternlogd $0x96, V4, V5, TWEAK3 +.else + vpxor V0, TWEAK1, TWEAK1 + vpxor V2, TWEAK2, TWEAK2 + vpxor V4, TWEAK3, TWEAK3 + vpxor V1, TWEAK1, TWEAK1 + vpxor V3, TWEAK2, TWEAK2 + vpxor V5, TWEAK3, TWEAK3 +.endif +.endif +.endm + +// Do one step in computing the next set of tweaks using the method of just +// multiplying by x repeatedly (the same method _next_tweak uses). +.macro _tweak_step_mulx i +.if \i == 0 + .set PREV_TWEAK, TWEAK3 + .set NEXT_TWEAK, NEXT_TWEAK0 +.elseif \i == 5 + .set PREV_TWEAK, NEXT_TWEAK0 + .set NEXT_TWEAK, NEXT_TWEAK1 +.elseif \i == 10 + .set PREV_TWEAK, NEXT_TWEAK1 + .set NEXT_TWEAK, NEXT_TWEAK2 +.elseif \i == 15 + .set PREV_TWEAK, NEXT_TWEAK2 + .set NEXT_TWEAK, NEXT_TWEAK3 +.endif +.if \i < 20 && \i % 5 == 0 + vpshufd $0x13, PREV_TWEAK, V5 +.elseif \i < 20 && \i % 5 == 1 + vpaddq PREV_TWEAK, PREV_TWEAK, NEXT_TWEAK +.elseif \i < 20 && \i % 5 == 2 + vpsrad $31, V5, V5 +.elseif \i < 20 && \i % 5 == 3 + vpand GF_POLY, V5, V5 +.elseif \i < 20 && \i % 5 == 4 + vpxor V5, NEXT_TWEAK, NEXT_TWEAK +.elseif \i == 1000 + vmovdqa NEXT_TWEAK0, TWEAK0 + vmovdqa NEXT_TWEAK1, TWEAK1 + vmovdqa NEXT_TWEAK2, TWEAK2 + vmovdqa NEXT_TWEAK3, TWEAK3 +.endif +.endm + +// Do one step in computing the next set of tweaks using the VPCLMULQDQ method +// (the same method _next_tweakvec uses for VL > 16). This means multiplying +// each tweak by x^(4*VL/16) independently. Since 4*VL/16 is a multiple of 8 +// when VL > 16 (which it is here), the needed shift amounts are byte-aligned, +// which allows the use of vpsrldq and vpslldq to do 128-bit wide shifts. +.macro _tweak_step_pclmul i +.if \i == 2 + vpsrldq $(128 - 4*VL/16) / 8, TWEAK0, NEXT_TWEAK0 +.elseif \i == 4 + vpsrldq $(128 - 4*VL/16) / 8, TWEAK1, NEXT_TWEAK1 +.elseif \i == 6 + vpsrldq $(128 - 4*VL/16) / 8, TWEAK2, NEXT_TWEAK2 +.elseif \i == 8 + vpsrldq $(128 - 4*VL/16) / 8, TWEAK3, NEXT_TWEAK3 +.elseif \i == 10 + vpclmulqdq $0x00, GF_POLY, NEXT_TWEAK0, NEXT_TWEAK0 +.elseif \i == 12 + vpclmulqdq $0x00, GF_POLY, NEXT_TWEAK1, NEXT_TWEAK1 +.elseif \i == 14 + vpclmulqdq $0x00, GF_POLY, NEXT_TWEAK2, NEXT_TWEAK2 +.elseif \i == 16 + vpclmulqdq $0x00, GF_POLY, NEXT_TWEAK3, NEXT_TWEAK3 +.elseif \i == 1000 + vpslldq $(4*VL/16) / 8, TWEAK0, TWEAK0 + vpslldq $(4*VL/16) / 8, TWEAK1, TWEAK1 + vpslldq $(4*VL/16) / 8, TWEAK2, TWEAK2 + vpslldq $(4*VL/16) / 8, TWEAK3, TWEAK3 + _vpxor NEXT_TWEAK0, TWEAK0, TWEAK0 + _vpxor NEXT_TWEAK1, TWEAK1, TWEAK1 + _vpxor NEXT_TWEAK2, TWEAK2, TWEAK2 + _vpxor NEXT_TWEAK3, TWEAK3, TWEAK3 +.endif +.endm + +// _tweak_step does one step of the computation of the next set of tweaks from +// TWEAK[0-3]. To complete all steps, this must be invoked with \i values 0 +// through at least 19, then 1000 which signals the last step. +// +// This is used to interleave the computation of the next set of tweaks with the +// AES en/decryptions, which increases performance in some cases. +.macro _tweak_step i +.if VL == 16 + _tweak_step_mulx \i +.else + _tweak_step_pclmul \i +.endif +.endm + +// Load the round keys: just the first one if !USE_AVX10, otherwise all of them. +.macro _load_round_keys + _vbroadcast128 0*16(KEY), KEY0 +.if USE_AVX10 + _vbroadcast128 1*16(KEY), KEY1 + _vbroadcast128 2*16(KEY), KEY2 + _vbroadcast128 3*16(KEY), KEY3 + _vbroadcast128 4*16(KEY), KEY4 + _vbroadcast128 5*16(KEY), KEY5 + _vbroadcast128 6*16(KEY), KEY6 + _vbroadcast128 7*16(KEY), KEY7 + _vbroadcast128 8*16(KEY), KEY8 + _vbroadcast128 9*16(KEY), KEY9 + _vbroadcast128 10*16(KEY), KEY10 + // Note: if it's AES-128 or AES-192, the last several round keys won't + // be used. We do the loads anyway to save a conditional jump. + _vbroadcast128 11*16(KEY), KEY11 + _vbroadcast128 12*16(KEY), KEY12 + _vbroadcast128 13*16(KEY), KEY13 + _vbroadcast128 14*16(KEY), KEY14 +.endif +.endm + +// Do a single round of AES encryption (if \enc==1) or decryption (if \enc==0) +// on the block(s) in \data using the round key(s) in \key. The register length +// determines the number of AES blocks en/decrypted. +.macro _vaes enc, last, key, data +.if \enc +.if \last + vaesenclast \key, \data, \data +.else + vaesenc \key, \data, \data +.endif +.else +.if \last + vaesdeclast \key, \data, \data +.else + vaesdec \key, \data, \data +.endif +.endif +.endm + +// Do a single round of AES en/decryption on the block(s) in \data, using the +// same key for all block(s). The round key is loaded from the appropriate +// register or memory location for round \i. May clobber V4. +.macro _vaes_1x enc, last, i, xmm_suffix, data +.if USE_AVX10 + _vaes \enc, \last, KEY\i\xmm_suffix, \data +.else +.ifnb \xmm_suffix + _vaes \enc, \last, \i*16(KEY), \data +.else + _vbroadcast128 \i*16(KEY), V4 + _vaes \enc, \last, V4, \data +.endif +.endif +.endm + +// Do a single round of AES en/decryption on the blocks in registers V0-V3, +// using the same key for all blocks. The round key is loaded from the +// appropriate register or memory location for round \i. In addition, does step +// \i of the computation of the next set of tweaks. May clobber V4. +.macro _vaes_4x enc, last, i +.if USE_AVX10 + _tweak_step (2*(\i-1)) + _vaes \enc, \last, KEY\i, V0 + _vaes \enc, \last, KEY\i, V1 + _tweak_step (2*(\i-1) + 1) + _vaes \enc, \last, KEY\i, V2 + _vaes \enc, \last, KEY\i, V3 +.else + _vbroadcast128 \i*16(KEY), V4 + _tweak_step (2*(\i-1)) + _vaes \enc, \last, V4, V0 + _vaes \enc, \last, V4, V1 + _tweak_step (2*(\i-1) + 1) + _vaes \enc, \last, V4, V2 + _vaes \enc, \last, V4, V3 +.endif +.endm + +// Do tweaked AES en/decryption (i.e., XOR with \tweak, then AES en/decrypt, +// then XOR with \tweak again) of the block(s) in \data. To process a single +// block, use xmm registers and set \xmm_suffix=_XMM. To process a vector of +// length VL, use V* registers and leave \xmm_suffix empty. May clobber V4. +.macro _aes_crypt enc, xmm_suffix, tweak, data + _xor3 KEY0\xmm_suffix, \tweak, \data + _vaes_1x \enc, 0, 1, \xmm_suffix, \data + _vaes_1x \enc, 0, 2, \xmm_suffix, \data + _vaes_1x \enc, 0, 3, \xmm_suffix, \data + _vaes_1x \enc, 0, 4, \xmm_suffix, \data + _vaes_1x \enc, 0, 5, \xmm_suffix, \data + _vaes_1x \enc, 0, 6, \xmm_suffix, \data + _vaes_1x \enc, 0, 7, \xmm_suffix, \data + _vaes_1x \enc, 0, 8, \xmm_suffix, \data + _vaes_1x \enc, 0, 9, \xmm_suffix, \data + cmp $24, KEYLEN + jle .Laes_128_or_192\@ + _vaes_1x \enc, 0, 10, \xmm_suffix, \data + _vaes_1x \enc, 0, 11, \xmm_suffix, \data + _vaes_1x \enc, 0, 12, \xmm_suffix, \data + _vaes_1x \enc, 0, 13, \xmm_suffix, \data + _vaes_1x \enc, 1, 14, \xmm_suffix, \data + jmp .Laes_done\@ +.Laes_128_or_192\@: + je .Laes_192\@ + _vaes_1x \enc, 1, 10, \xmm_suffix, \data + jmp .Laes_done\@ +.Laes_192\@: + _vaes_1x \enc, 0, 10, \xmm_suffix, \data + _vaes_1x \enc, 0, 11, \xmm_suffix, \data + _vaes_1x \enc, 1, 12, \xmm_suffix, \data +.Laes_done\@: + _vpxor \tweak, \data, \data +.endm + +.macro _aes_xts_crypt enc + _define_aliases + + // Load the AES key length: 16 (AES-128), 24 (AES-192), or 32 (AES-256). + movl 480(KEY), KEYLEN + + // If decrypting, advance KEY to the decryption round keys. +.if !\enc + add $240, KEY +.endif + + // Check whether the data length is a multiple of the AES block length. + test $15, LEN + jnz .Lneed_cts\@ +.Lxts_init\@: + + // Cache as many round keys as possible. + _load_round_keys + + // Compute the first set of tweaks TWEAK[0-3]. + _compute_first_set_of_tweaks + + sub $4*VL, LEN + jl .Lhandle_remainder\@ + +.Lmain_loop\@: + // This is the main loop, en/decrypting 4*VL bytes per iteration. + + // XOR each source block with its tweak and the first round key. +.if USE_AVX10 + vmovdqu8 0*VL(SRC), V0 + vmovdqu8 1*VL(SRC), V1 + vmovdqu8 2*VL(SRC), V2 + vmovdqu8 3*VL(SRC), V3 + vpternlogd $0x96, TWEAK0, KEY0, V0 + vpternlogd $0x96, TWEAK1, KEY0, V1 + vpternlogd $0x96, TWEAK2, KEY0, V2 + vpternlogd $0x96, TWEAK3, KEY0, V3 +.else + vpxor 0*VL(SRC), KEY0, V0 + vpxor 1*VL(SRC), KEY0, V1 + vpxor 2*VL(SRC), KEY0, V2 + vpxor 3*VL(SRC), KEY0, V3 + vpxor TWEAK0, V0, V0 + vpxor TWEAK1, V1, V1 + vpxor TWEAK2, V2, V2 + vpxor TWEAK3, V3, V3 +.endif + // Do all the AES rounds on the data blocks, interleaved with + // the computation of the next set of tweaks. + _vaes_4x \enc, 0, 1 + _vaes_4x \enc, 0, 2 + _vaes_4x \enc, 0, 3 + _vaes_4x \enc, 0, 4 + _vaes_4x \enc, 0, 5 + _vaes_4x \enc, 0, 6 + _vaes_4x \enc, 0, 7 + _vaes_4x \enc, 0, 8 + _vaes_4x \enc, 0, 9 + // Try to optimize for AES-256 by keeping the code for AES-128 and + // AES-192 out-of-line. + cmp $24, KEYLEN + jle .Lencrypt_4x_aes_128_or_192\@ + _vaes_4x \enc, 0, 10 + _vaes_4x \enc, 0, 11 + _vaes_4x \enc, 0, 12 + _vaes_4x \enc, 0, 13 + _vaes_4x \enc, 1, 14 +.Lencrypt_4x_done\@: + + // XOR in the tweaks again. + _vpxor TWEAK0, V0, V0 + _vpxor TWEAK1, V1, V1 + _vpxor TWEAK2, V2, V2 + _vpxor TWEAK3, V3, V3 + + // Store the destination blocks. + _vmovdqu V0, 0*VL(DST) + _vmovdqu V1, 1*VL(DST) + _vmovdqu V2, 2*VL(DST) + _vmovdqu V3, 3*VL(DST) + + // Finish computing the next set of tweaks. + _tweak_step 1000 + + add $4*VL, SRC + add $4*VL, DST + sub $4*VL, LEN + jge .Lmain_loop\@ + + // Check for the uncommon case where the data length isn't a multiple of + // 4*VL. Handle it out-of-line in order to optimize for the common + // case. In the common case, just fall through to the ret. + test $4*VL-1, LEN + jnz .Lhandle_remainder\@ +.Ldone\@: + // Store the next tweak back to *TWEAK to support continuation calls. + vmovdqu TWEAK0_XMM, (TWEAK) +.if VL > 16 + vzeroupper +.endif + RET + +.Lhandle_remainder\@: + add $4*VL, LEN // Undo the extra sub from earlier. + + // En/decrypt any remaining full blocks, one vector at a time. +.if VL > 16 + sub $VL, LEN + jl .Lvec_at_a_time_done\@ +.Lvec_at_a_time\@: + _vmovdqu (SRC), V0 + _aes_crypt \enc, , TWEAK0, V0 + _vmovdqu V0, (DST) + _next_tweakvec TWEAK0, V0, V1, TWEAK0 + add $VL, SRC + add $VL, DST + sub $VL, LEN + jge .Lvec_at_a_time\@ +.Lvec_at_a_time_done\@: + add $VL-16, LEN // Undo the extra sub from earlier. +.else + sub $16, LEN +.endif + + // En/decrypt any remaining full blocks, one at a time. + jl .Lblock_at_a_time_done\@ +.Lblock_at_a_time\@: + vmovdqu (SRC), %xmm0 + _aes_crypt \enc, _XMM, TWEAK0_XMM, %xmm0 + vmovdqu %xmm0, (DST) + _next_tweak TWEAK0_XMM, %xmm0, TWEAK0_XMM + add $16, SRC + add $16, DST + sub $16, LEN + jge .Lblock_at_a_time\@ +.Lblock_at_a_time_done\@: + add $16, LEN // Undo the extra sub from earlier. + +.Lfull_blocks_done\@: + // Now 0 <= LEN <= 15. If LEN is nonzero, do ciphertext stealing to + // process the last 16 + LEN bytes. If LEN is zero, we're done. + test LEN, LEN + jnz .Lcts\@ + jmp .Ldone\@ + + // Out-of-line handling of AES-128 and AES-192 +.Lencrypt_4x_aes_128_or_192\@: + jz .Lencrypt_4x_aes_192\@ + _vaes_4x \enc, 1, 10 + jmp .Lencrypt_4x_done\@ +.Lencrypt_4x_aes_192\@: + _vaes_4x \enc, 0, 10 + _vaes_4x \enc, 0, 11 + _vaes_4x \enc, 1, 12 + jmp .Lencrypt_4x_done\@ + +.Lneed_cts\@: + // The data length isn't a multiple of the AES block length, so + // ciphertext stealing (CTS) will be needed. Subtract one block from + // LEN so that the main loop doesn't process the last full block. The + // CTS step will process it specially along with the partial block. + sub $16, LEN + jmp .Lxts_init\@ + +.Lcts\@: + // Do ciphertext stealing (CTS) to en/decrypt the last full block and + // the partial block. CTS needs two tweaks. TWEAK0_XMM contains the + // next tweak; compute the one after that. Decryption uses these two + // tweaks in reverse order, so also define aliases to handle that. + _next_tweak TWEAK0_XMM, %xmm0, TWEAK1_XMM +.if \enc + .set CTS_TWEAK0, TWEAK0_XMM + .set CTS_TWEAK1, TWEAK1_XMM +.else + .set CTS_TWEAK0, TWEAK1_XMM + .set CTS_TWEAK1, TWEAK0_XMM +.endif + + // En/decrypt the last full block. + vmovdqu (SRC), %xmm0 + _aes_crypt \enc, _XMM, CTS_TWEAK0, %xmm0 + +.if USE_AVX10 + // Create a mask that has the first LEN bits set. + mov $-1, %rax + bzhi LEN, %rax, %rax + kmovq %rax, %k1 + + // Swap the first LEN bytes of the above result with the partial block. + // Note that to support in-place en/decryption, the load from the src + // partial block must happen before the store to the dst partial block. + vmovdqa %xmm0, %xmm1 + vmovdqu8 16(SRC), %xmm0{%k1} + vmovdqu8 %xmm1, 16(DST){%k1} +.else + lea .Lcts_permute_table(%rip), %rax + + // Load the src partial block, left-aligned. Note that to support + // in-place en/decryption, this must happen before the store to the dst + // partial block. + vmovdqu (SRC, LEN, 1), %xmm1 + + // Shift the first LEN bytes of the en/decryption of the last full block + // to the end of a register, then store it to DST+LEN. This stores the + // dst partial block. It also writes to the second part of the dst last + // full block, but that part is overwritten later. + vpshufb (%rax, LEN, 1), %xmm0, %xmm2 + vmovdqu %xmm2, (DST, LEN, 1) + + // Make xmm3 contain [16-LEN,16-LEN+1,...,14,15,0x80,0x80,...]. + sub LEN, %rax + vmovdqu 32(%rax), %xmm3 + + // Shift the src partial block to the beginning of its register. + vpshufb %xmm3, %xmm1, %xmm1 + + // Do a blend to generate the src partial block followed by the second + // part of the en/decryption of the last full block. + vpblendvb %xmm3, %xmm0, %xmm1, %xmm0 +.endif + // En/decrypt again and store the last full block. + _aes_crypt \enc, _XMM, CTS_TWEAK1, %xmm0 + vmovdqu %xmm0, (DST) + jmp .Ldone\@ +.endm + +// void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key, +// u8 iv[AES_BLOCK_SIZE]); +SYM_FUNC_START(aes_xts_encrypt_iv) + vmovdqu (%rsi), %xmm0 + vpxor 0*16(%rdi), %xmm0, %xmm0 + vaesenc 1*16(%rdi), %xmm0, %xmm0 + vaesenc 2*16(%rdi), %xmm0, %xmm0 + vaesenc 3*16(%rdi), %xmm0, %xmm0 + vaesenc 4*16(%rdi), %xmm0, %xmm0 + vaesenc 5*16(%rdi), %xmm0, %xmm0 + vaesenc 6*16(%rdi), %xmm0, %xmm0 + vaesenc 7*16(%rdi), %xmm0, %xmm0 + vaesenc 8*16(%rdi), %xmm0, %xmm0 + vaesenc 9*16(%rdi), %xmm0, %xmm0 + cmpl $24, 480(%rdi) + jle .Lencrypt_iv_aes_128_or_192 + vaesenc 10*16(%rdi), %xmm0, %xmm0 + vaesenc 11*16(%rdi), %xmm0, %xmm0 + vaesenc 12*16(%rdi), %xmm0, %xmm0 + vaesenc 13*16(%rdi), %xmm0, %xmm0 + vaesenclast 14*16(%rdi), %xmm0, %xmm0 +.Lencrypt_iv_done: + vmovdqu %xmm0, (%rsi) + RET + + // Out-of-line handling of AES-128 and AES-192 +.Lencrypt_iv_aes_128_or_192: + jz .Lencrypt_iv_aes_192 + vaesenclast 10*16(%rdi), %xmm0, %xmm0 + jmp .Lencrypt_iv_done +.Lencrypt_iv_aes_192: + vaesenc 10*16(%rdi), %xmm0, %xmm0 + vaesenc 11*16(%rdi), %xmm0, %xmm0 + vaesenclast 12*16(%rdi), %xmm0, %xmm0 + jmp .Lencrypt_iv_done +SYM_FUNC_END(aes_xts_encrypt_iv) + +// Below are the actual AES-XTS encryption and decryption functions, +// instantiated from the above macro. They all have the following prototype: +// +// void (*xts_asm_func)(const struct crypto_aes_ctx *key, +// const u8 *src, u8 *dst, size_t len, +// u8 tweak[AES_BLOCK_SIZE]); +// +// |key| is the data key. |tweak| contains the next tweak; the encryption of +// the original IV with the tweak key was already done. This function supports +// incremental computation, but |len| must always be >= 16 (AES_BLOCK_SIZE), and +// |len| must be a multiple of 16 except on the last call. If |len| is a +// multiple of 16, then this function updates |tweak| to contain the next tweak. From patchwork Fri Mar 29 08:03:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784584 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2C9B54AEC0; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; cv=none; b=cKs+WmVKM9l5Y/61f7Jn+z70ZVjysun5JLdTgWdVHIPIz5nRcKc+ChpF/B7oNDQqHZlMPuOiMp+qshXC7kQUqhtE7OgWWePmh3RwzoBfe31d9LzJJUv6FYufBDhMgtnWP20bIU1SiVy2KEwxiyenciMrMYyg6y+5EpWWAz9zOU8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; c=relaxed/simple; bh=oxsPdK3jkgwrvnm54g4hxrgazAKHmAjHGyqExsd3Gik=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=BlwFE8bGbOGm0bzrSnbdmDucf9uvQh9LkRk67v15rxn11hSKvzErWTbAwFelwgcPvXuLjG77/zOwe5jN17HCzBo9WvjjGfQZaFvOtnoSrMUMTIw3K64Z8pRu4bf3Udf0kWgMLYFKUfPE7bwvfpaUljs9M0SfiAFNEcT/kZWrfGA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=lreBZWb8; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="lreBZWb8" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 90FFBC433A6; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699560; bh=oxsPdK3jkgwrvnm54g4hxrgazAKHmAjHGyqExsd3Gik=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=lreBZWb8qbqkn1y3BoZ21x2K/Ibl7G0mPN1Fg9+fV54tnoVxdUoEtSKBLSsDaZPdc +8+vbaxxSXGy+shf5Tt6OEN9QI5HxpA/dNPaMwMVcmxkTBLk75tj4iuUVC+UXh3kal Q0reos5NibAxarbuLV5OYxFVCDULL9U9Enit7Fjsh9yE9nsiRG3WXwcC31ONJUqQ9r Hv4CU/0PzcvDXvWg/joMKGBZw7SRCZdhYZobaVrK+dfHfeCouAPAiSL2I62xiUipGS SmkrFnZ0V65BZJvjS3SyqPFVFUcJl1DPn4I3cRQoQoTF49IcdH9VaDCO78OW8ZL7VZ HurklXM60qsrg== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" Subject: [PATCH v2 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Date: Fri, 29 Mar 2024 01:03:51 -0700 Message-ID: <20240329080355.2871-4-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that have the AES-NI and AVX extensions but not VAES. It's similar to the existing xts-aes-aesni in that uses xmm registers to operate on one AES block at a time. It differs from xts-aes-aesni in the following ways: - It uses the VEX-coded (non-destructive) instructions from AVX. This improves performance slightly. - It incorporates some additional optimizations such as interleaving the tweak computation with AES en/decryption, handling single-page messages more efficiently, and caching the first round key. - It supports only 64-bit (x86_64). - It's generated by an assembly macro that will also be used to generate VAES-based implementations. The performance improvement over xts-aes-aesni varies from small to large, depending on the CPU and other factors such as the size of the messages en/decrypted. For example, the following increases in AES-256-XTS decryption throughput are seen on the following CPUs: | 4096-byte messages | 512-byte messages | ----------------------+--------------------+-------------------+ Intel Skylake | 6% | 31% | Intel Cascade Lake | 4% | 26% | AMD Zen 1 | 61% | 73% | AMD Zen 2 | 36% | 59% | (The above CPUs don't support VAES, so they can't use VAES instead.) While this isn't as large an improvement as what VAES provides, this still seems worthwhile. This implementation is fairly easy to provide based on the assembly macro that's needed for VAES anyway, and it will be the best implementation on a large number of CPUs (very roughly, the CPUs launched by Intel and AMD from 2011 to 2018). This makes the existing xts-aes-aesni *mostly* obsolete. For now, leave it in place to support 32-bit kernels and also CPUs like Intel Westmere that support AES-NI but not AVX. (We could potentially remove it anyway and just rely on the indirect acceleration via ecb-aes-aesni in those cases, but that change will need to be considered separately.) Signed-off-by: Eric Biggers --- arch/x86/crypto/aes-xts-avx-x86_64.S | 9 ++ arch/x86/crypto/aesni-intel_glue.c | 202 ++++++++++++++++++++++++++- 2 files changed, 209 insertions(+), 2 deletions(-) diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S index a5e2783c46ec..32e26f562cf0 100644 --- a/arch/x86/crypto/aes-xts-avx-x86_64.S +++ b/arch/x86/crypto/aes-xts-avx-x86_64.S @@ -796,5 +796,14 @@ SYM_FUNC_END(aes_xts_encrypt_iv) // |key| is the data key. |tweak| contains the next tweak; the encryption of // the original IV with the tweak key was already done. This function supports // incremental computation, but |len| must always be >= 16 (AES_BLOCK_SIZE), and // |len| must be a multiple of 16 except on the last call. If |len| is a // multiple of 16, then this function updates |tweak| to contain the next tweak. + +.set VL, 16 +.set USE_AVX10, 0 +SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx) + _aes_xts_crypt 1 +SYM_FUNC_END(aes_xts_encrypt_aesni_avx) +SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx) + _aes_xts_crypt 0 +SYM_FUNC_END(aes_xts_decrypt_aesni_avx) diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index b1d90c25975a..10e283721a85 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -1135,11 +1135,200 @@ static struct skcipher_alg aesni_xctr = { .encrypt = xctr_crypt, .decrypt = xctr_crypt, }; static struct simd_skcipher_alg *aesni_simd_xctr; -#endif /* CONFIG_X86_64 */ + +asmlinkage void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key, + u8 iv[AES_BLOCK_SIZE]); + +typedef void (*xts_asm_func)(const struct crypto_aes_ctx *key, + const u8 *src, u8 *dst, size_t len, + u8 tweak[AES_BLOCK_SIZE]); + +/* This handles cases where the source and/or destination span pages. */ +static noinline int +xts_crypt_slowpath(struct skcipher_request *req, xts_asm_func asm_func) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm); + int tail = req->cryptlen % AES_BLOCK_SIZE; + struct scatterlist sg_src[2], sg_dst[2]; + struct skcipher_request subreq; + struct skcipher_walk walk; + struct scatterlist *src, *dst; + int err; + + /* + * If the message length isn't divisible by the AES block size, then + * separate off the last full block and the partial block. This ensures + * that they are processed in the same call to the assembly function, + * which is required for ciphertext stealing. + */ + if (tail) { + skcipher_request_set_tfm(&subreq, tfm); + skcipher_request_set_callback(&subreq, + skcipher_request_flags(req), + NULL, NULL); + skcipher_request_set_crypt(&subreq, req->src, req->dst, + req->cryptlen - tail - AES_BLOCK_SIZE, + req->iv); + req = &subreq; + } + + err = skcipher_walk_virt(&walk, req, false); + + while (walk.nbytes) { + unsigned int nbytes = walk.nbytes; + + if (nbytes < walk.total) + nbytes = round_down(nbytes, AES_BLOCK_SIZE); + + kernel_fpu_begin(); + (*asm_func)(&ctx->crypt_ctx, walk.src.virt.addr, + walk.dst.virt.addr, nbytes, req->iv); + kernel_fpu_end(); + err = skcipher_walk_done(&walk, walk.nbytes - nbytes); + } + + if (err || !tail) + return err; + + /* Do ciphertext stealing with the last full block and partial block. */ + + dst = src = scatterwalk_ffwd(sg_src, req->src, req->cryptlen); + if (req->dst != req->src) + dst = scatterwalk_ffwd(sg_dst, req->dst, req->cryptlen); + + skcipher_request_set_crypt(req, src, dst, AES_BLOCK_SIZE + tail, + req->iv); + + err = skcipher_walk_virt(&walk, req, false); + if (err) + return err; + + kernel_fpu_begin(); + (*asm_func)(&ctx->crypt_ctx, walk.src.virt.addr, walk.dst.virt.addr, + walk.nbytes, req->iv); + kernel_fpu_end(); + + return skcipher_walk_done(&walk, 0); +} + +/* __always_inline to avoid indirect call in fastpath */ +static __always_inline int +xts_crypt2(struct skcipher_request *req, xts_asm_func asm_func) +{ + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req); + const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm); + const unsigned int cryptlen = req->cryptlen; + struct scatterlist *src = req->src; + struct scatterlist *dst = req->dst; + + if (unlikely(cryptlen < AES_BLOCK_SIZE)) + return -EINVAL; + + kernel_fpu_begin(); + aes_xts_encrypt_iv(&ctx->tweak_ctx, req->iv); + + /* + * In practice, virtually all XTS plaintexts and ciphertexts are either + * 512 or 4096 bytes, aligned such that they don't span page boundaries. + * To optimize the performance of these cases, and also any other case + * where no page boundary is spanned, the below fast-path handles + * single-page sources and destinations as efficiently as possible. + */ + if (likely(src->length >= cryptlen && dst->length >= cryptlen && + src->offset + cryptlen <= PAGE_SIZE && + dst->offset + cryptlen <= PAGE_SIZE)) { + struct page *src_page = sg_page(src); + struct page *dst_page = sg_page(dst); + void *src_virt = kmap_local_page(src_page) + src->offset; + void *dst_virt = kmap_local_page(dst_page) + dst->offset; + + (*asm_func)(&ctx->crypt_ctx, src_virt, dst_virt, cryptlen, + req->iv); + kunmap_local(dst_virt); + kunmap_local(src_virt); + kernel_fpu_end(); + return 0; + } + kernel_fpu_end(); + return xts_crypt_slowpath(req, asm_func); +} + +#define DEFINE_XTS_ALG(suffix, driver_name, priority) \ + \ +asmlinkage void aes_xts_encrypt_##suffix(const struct crypto_aes_ctx *key, \ + const u8 *src, u8 *dst, size_t len, \ + u8 tweak[AES_BLOCK_SIZE]); \ +asmlinkage void aes_xts_decrypt_##suffix(const struct crypto_aes_ctx *key, \ + const u8 *src, u8 *dst, size_t len, \ + u8 tweak[AES_BLOCK_SIZE]); \ + \ +static int xts_encrypt_##suffix(struct skcipher_request *req) \ +{ \ + return xts_crypt2(req, aes_xts_encrypt_##suffix); \ +} \ + \ +static int xts_decrypt_##suffix(struct skcipher_request *req) \ +{ \ + return xts_crypt2(req, aes_xts_decrypt_##suffix); \ +} \ + \ +static struct skcipher_alg aes_xts_alg_##suffix = { \ + .base = { \ + .cra_name = "__xts(aes)", \ + .cra_driver_name = "__" driver_name, \ + .cra_priority = priority, \ + .cra_flags = CRYPTO_ALG_INTERNAL, \ + .cra_blocksize = AES_BLOCK_SIZE, \ + .cra_ctxsize = XTS_AES_CTX_SIZE, \ + .cra_module = THIS_MODULE, \ + }, \ + .min_keysize = 2 * AES_MIN_KEY_SIZE, \ + .max_keysize = 2 * AES_MAX_KEY_SIZE, \ + .ivsize = AES_BLOCK_SIZE, \ + .walksize = 2 * AES_BLOCK_SIZE, \ + .setkey = xts_aesni_setkey, \ + .encrypt = xts_encrypt_##suffix, \ + .decrypt = xts_decrypt_##suffix, \ +}; \ + \ +static struct simd_skcipher_alg *aes_xts_simdalg_##suffix + +DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500); + +static int __init register_xts_algs(void) +{ + int err; + + if (!boot_cpu_has(X86_FEATURE_AVX)) + return 0; + err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1, + &aes_xts_simdalg_aesni_avx); + if (err) + return err; + return 0; +} + +static void unregister_xts_algs(void) +{ + if (aes_xts_simdalg_aesni_avx) + simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1, + &aes_xts_simdalg_aesni_avx); +} +#else /* CONFIG_X86_64 */ +static int __init register_xts_algs(void) +{ + return 0; +} + +static void unregister_xts_algs(void) +{ +} +#endif /* !CONFIG_X86_64 */ #ifdef CONFIG_X86_64 static int generic_gcmaes_set_key(struct crypto_aead *aead, const u8 *key, unsigned int key_len) { @@ -1274,17 +1463,25 @@ static int __init aesni_init(void) &aesni_simd_xctr); if (err) goto unregister_aeads; #endif /* CONFIG_X86_64 */ + err = register_xts_algs(); + if (err) + goto unregister_xts; + return 0; +unregister_xts: + unregister_xts_algs(); #ifdef CONFIG_X86_64 + if (aesni_simd_xctr) + simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr); unregister_aeads: +#endif /* CONFIG_X86_64 */ simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads), aesni_simd_aeads); -#endif /* CONFIG_X86_64 */ unregister_skciphers: simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers), aesni_simd_skciphers); unregister_cipher: @@ -1301,10 +1498,11 @@ static void __exit aesni_exit(void) crypto_unregister_alg(&aesni_cipher_alg); #ifdef CONFIG_X86_64 if (boot_cpu_has(X86_FEATURE_AVX)) simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr); #endif /* CONFIG_X86_64 */ + unregister_xts_algs(); } late_initcall(aesni_init); module_exit(aesni_exit); From patchwork Fri Mar 29 08:03:52 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784169 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3ECAD4AEE7; Fri, 29 Mar 2024 08:06:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; cv=none; b=gEDrmvYZUSZCodr6fHF6svTJ9rQwAaEXI8tbj8HYpVhJshEkkY2yQToWRLwNoQR3ldnzB/QnQCO40LlMuwd7kvwteaphPPAYpxuFGJZv3TDTG9OjNv8Bh/3armFdpAY4hP2nTZdm+uHdj1jXjWT3klRmfgkcNgEofAiyWbEdt3g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; c=relaxed/simple; bh=hTPh4ID7+uT08Pmb9pplREzEU5MITK5DEdwRz0lQl4k=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=KcQTcEXf+XN+tXAjkw/i6cfXFpQK9CBjdMSS7z346bGKJnEmKZJQVjNSHhGl9HuS+3HjjlBGNRJTxZgZl/9zuCm1U8nba8G1RQJGNCUVXQVO6HPALTX7SS8oH3wBmGneWDw0M55X/yotce487s3/ZbMY+jWoJTfVUib5rBg2YwM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=KFMPdobM; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="KFMPdobM" Received: by smtp.kernel.org (Postfix) with ESMTPSA id E814BC4166B; Fri, 29 Mar 2024 08:06:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699561; bh=hTPh4ID7+uT08Pmb9pplREzEU5MITK5DEdwRz0lQl4k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=KFMPdobMJkjyiBcIo8vJP6EswCgS5VZi3eC6JbUfvkR1DrIlSFkxEBl2Ve2zzsnjg fw0o2kJaFbgM4N3rpNsJmcUcx2TczISBqw8qI3viBOIeSptzOPIj23roRgKTHf1mR2 KqbMwKpsRKmxt0dK9+DieMm0kSL3HJqj2Q6PvnQquZocFPPaSZsNtgpBuc7N6vEC0V U9oZuZTrTnqJLlOLqwYSPUqqlG/+pZBdh7zqJ5GzWRrfIfU7cOL9VeTc5pBIsb+XA/ L6s8p4L0EEFniUJuGmO4jEJuRTge/B2YZtdz22OHcohcfnKzZeur0Kx0GMNhKMpF2Q 5jhlRiAijcZ0g== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" Subject: [PATCH v2 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Date: Fri, 29 Mar 2024 01:03:52 -0700 Message-ID: <20240329080355.2871-5-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10. This implementation uses ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on AMD Zen 3. It should also be the optimal implementation on Intel Alder Lake, which similarly supports VAES but not AVX512. Comparing to xts-aes-aesni-avx on Zen 3, xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput with 4096-byte messages, or 23% higher with 512-byte messages. A large improvement is also seen with CPUs that do support AVX512 (e.g., 98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte messages), though the following patches add AVX512 optimized implementations to get a bit more performance on those CPUs. Signed-off-by: Eric Biggers --- arch/x86/crypto/aes-xts-avx-x86_64.S | 11 +++++++++++ arch/x86/crypto/aesni-intel_glue.c | 20 ++++++++++++++++++++ 2 files changed, 31 insertions(+) diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S index 32e26f562cf0..43706213dfca 100644 --- a/arch/x86/crypto/aes-xts-avx-x86_64.S +++ b/arch/x86/crypto/aes-xts-avx-x86_64.S @@ -805,5 +805,16 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx) _aes_xts_crypt 1 SYM_FUNC_END(aes_xts_encrypt_aesni_avx) SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx) _aes_xts_crypt 0 SYM_FUNC_END(aes_xts_decrypt_aesni_avx) + +#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) +.set VL, 32 +.set USE_AVX10, 0 +SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2) + _aes_xts_crypt 1 +SYM_FUNC_END(aes_xts_encrypt_vaes_avx2) +SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2) + _aes_xts_crypt 0 +SYM_FUNC_END(aes_xts_decrypt_vaes_avx2) +#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 10e283721a85..4cc15c7207f3 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -1295,10 +1295,13 @@ static struct skcipher_alg aes_xts_alg_##suffix = { \ }; \ \ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500); +#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) +DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600); +#endif static int __init register_xts_algs(void) { int err; @@ -1306,18 +1309,35 @@ static int __init register_xts_algs(void) return 0; err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1, &aes_xts_simdalg_aesni_avx); if (err) return err; +#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) + if (!boot_cpu_has(X86_FEATURE_AVX2) || + !boot_cpu_has(X86_FEATURE_VAES) || + !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) || + !boot_cpu_has(X86_FEATURE_PCLMULQDQ) || + !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL)) + return 0; + err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1, + &aes_xts_simdalg_vaes_avx2); + if (err) + return err; +#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ return 0; } static void unregister_xts_algs(void) { if (aes_xts_simdalg_aesni_avx) simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1, &aes_xts_simdalg_aesni_avx); +#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) + if (aes_xts_simdalg_vaes_avx2) + simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1, + &aes_xts_simdalg_vaes_avx2); +#endif } #else /* CONFIG_X86_64 */ static int __init register_xts_algs(void) { return 0; From patchwork Fri Mar 29 08:03:53 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784583 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D0F8B4F606; Fri, 29 Mar 2024 08:06:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; cv=none; b=L1P6tAn+IlEFJjB1kFzS8+I2eju00SssDkAVvHQ5mxWCHaTlqW1tLnpWHeqgwcQRT8k/PmirspSLlXkiXlgOKdJ/4m1cYFBHJdrq89HtB5hF/e4neFecQYewQWMvFJkML62tmlcpBDbngtrXp+l8f5JOykCe6mcySNcsCZzsf0k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699561; c=relaxed/simple; bh=cwKbsRa5+4j6tLRcb2nECqjE6MrPiyU1VuI5Llnpq9k=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=SkFKP+uNE2WMAYH1MuMYymT2/9gnI6ZJ8cUwopPUZYkLngtoxyFUaphIRDBIIqU4vJ7aeUh3QvEssqmXKl/gcGgGhSapdCOUXatJSaCA218klo+vF3Cn/NKQNl0RMN0j2WCE7W27jZI2xasq+WUYKM03ZSSrHeveHR04/t+8Hes= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=KLGpHiJw; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="KLGpHiJw" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4927BC43330; Fri, 29 Mar 2024 08:06:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699561; bh=cwKbsRa5+4j6tLRcb2nECqjE6MrPiyU1VuI5Llnpq9k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=KLGpHiJwGP7HnGCEjv+ea2lYxIZ3b4yd9A+sA7N9evLlU0HhCYnT8eXjL1ReL7QgO uCTigWDeR1Lbm7SrU2AT5J6aPybD553fJnYlWO0rjZREklW9yH/NjweP+jbWcB6Swk 4u4Z8Dk3cg9Ctga9KktcJMLLf53SmGNDAJ/bFDLLHN+jGTB35f96yWpR5h0Jfx7PXM tYFIDXDszGcEZDoXB8xWiTw+TAPql5qVNsmGOmgPP4ABZTVuJ2LjpCRUtgksrwrrZM 85gwKD6B9Z19kVH/XSdpBP7h10hXuIY9TnfTQkdlqQ4fdmfG2A9itPlPyDbXvy62ag jwCUZSzSleqPw== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" Subject: [PATCH v2 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Date: Fri, 29 Mar 2024 01:03:53 -0700 Message-ID: <20240329080355.2871-6-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL extensions. This implementation avoids using zmm registers, instead using ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on CPUs that support VAES and AVX512 but where the zmm registers should not be used due to downclocking effects, for example Intel's Ice Lake. It should also be the optimal implementation on future CPUs that support AVX10/256 but not AVX10/512. The performance is slightly better than that of xts-aes-vaes-avx2, which uses the same 256-bit vector length, due to factors such as being able to use ymm16-ymm31 to cache the AES round keys, and being able to use the vpternlogd instruction to do XORs more efficiently. For example, on Ice Lake, the throughput of decrypting 4096-byte messages with AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with xts-aes-vaes-avx2. While this is a small improvement, it is straightforward to provide this implementation (xts-aes-vaes-avx10_256) as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512 anyway, due to the way the _aes_xts_crypt macro is structured. Signed-off-by: Eric Biggers --- arch/x86/crypto/aes-xts-avx-x86_64.S | 9 +++++++++ arch/x86/crypto/aesni-intel_glue.c | 16 ++++++++++++++++ 2 files changed, 25 insertions(+) diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S index 43706213dfca..71be474b22da 100644 --- a/arch/x86/crypto/aes-xts-avx-x86_64.S +++ b/arch/x86/crypto/aes-xts-avx-x86_64.S @@ -815,6 +815,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2) _aes_xts_crypt 1 SYM_FUNC_END(aes_xts_encrypt_vaes_avx2) SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2) _aes_xts_crypt 0 SYM_FUNC_END(aes_xts_decrypt_vaes_avx2) + +.set VL, 32 +.set USE_AVX10, 1 +SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256) + _aes_xts_crypt 1 +SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256) +SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256) + _aes_xts_crypt 0 +SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256) #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 4cc15c7207f3..914cbf5d1f5c 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -1297,10 +1297,11 @@ static struct skcipher_alg aes_xts_alg_##suffix = { \ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500); #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600); +DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700); #endif static int __init register_xts_algs(void) { int err; @@ -1320,10 +1321,22 @@ static int __init register_xts_algs(void) return 0; err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1, &aes_xts_simdalg_vaes_avx2); if (err) return err; + + if (!boot_cpu_has(X86_FEATURE_AVX512BW) || + !boot_cpu_has(X86_FEATURE_AVX512VL) || + !boot_cpu_has(X86_FEATURE_BMI2) || + !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM | + XFEATURE_MASK_AVX512, NULL)) + return 0; + + err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1, + &aes_xts_simdalg_vaes_avx10_256); + if (err) + return err; #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ return 0; } static void unregister_xts_algs(void) @@ -1333,10 +1346,13 @@ static void unregister_xts_algs(void) &aes_xts_simdalg_aesni_avx); #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) if (aes_xts_simdalg_vaes_avx2) simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1, &aes_xts_simdalg_vaes_avx2); + if (aes_xts_simdalg_vaes_avx10_256) + simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1, + &aes_xts_simdalg_vaes_avx10_256); #endif } #else /* CONFIG_X86_64 */ static int __init register_xts_algs(void) { From patchwork Fri Mar 29 08:03:54 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Eric Biggers X-Patchwork-Id: 784168 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F40F50A97; Fri, 29 Mar 2024 08:06:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699562; cv=none; b=GLvee0OC4/vaz9Q5gHg12MFuAiN4bYN1RjAVIMItjQHIby+yZvDGfgy8gwH9RIfNp6yDLmJlYEOGfwAuhXbWfnTI20Xso8agqBRpaglyj2exhPKXW9Ibpzv+xCH0FoiyQc2sInkyac5K/4AIjKN9KXOIPL3OZ+2MHqdMLyj5LeI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711699562; c=relaxed/simple; bh=RGIruG4V66PZQcVzovXySY0XyiLGcDc08NnJCgDt1EM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=LvrzPtKLQoDF/7lRgpDjKcSf+uBoOyupKQUyq3EexHzfyAiim6YrWsKGJqjb73yODxmbn8iGStQiSx9ZXRRtuL3WCBUh+LQMfZyAm4SYE3iDTZzE9dPi/HAuh0B+bNMvPXvyr68DdTzY3m1k1Up1ef787c+1CCFziTWbHu64+Mw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=BVth4ZHg; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="BVth4ZHg" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9FCDFC43143; Fri, 29 Mar 2024 08:06:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1711699561; bh=RGIruG4V66PZQcVzovXySY0XyiLGcDc08NnJCgDt1EM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=BVth4ZHgqbzBaZonyjaLYOQ+Oet0j5PB2Dcj6q2n2fz54LfXwCYqYreuQPLXFfhUq 7SKJGDVflqvBqbdVwTRCBPgPqkMCjdrYDa2gIGHlLdJ6WHtYmWGaV5i5OUDlDL9wox TDb0Y77nadEKS3jAZ3j1bSIk+zMdL9clRbtBagrCNaviILmy5XZUT1ktTmxB+QYCd8 eoHUUpbX/NSvlfS6ftYs9SnsjYRvWs1ZpiWygxvFE9jQuAypohCfdwnwgy0Q6aZQgd qk+xENZ5rBgGragXv0LX80yVKaXAbwHWTOSOxSR844CSCO3pdpQlLekfSfZNm4IHnH VmqOTd/DcEkmQ== From: Eric Biggers To: linux-crypto@vger.kernel.org, x86@kernel.org Cc: linux-kernel@vger.kernel.org, Ard Biesheuvel , Andy Lutomirski , "Chang S . Bae" Subject: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Date: Fri, 29 Mar 2024 01:03:54 -0700 Message-ID: <20240329080355.2871-7-ebiggers@kernel.org> X-Mailer: git-send-email 2.44.0 In-Reply-To: <20240329080355.2871-1-ebiggers@kernel.org> References: <20240329080355.2871-1-ebiggers@kernel.org> Precedence: bulk X-Mailing-List: linux-crypto@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Eric Biggers Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL extensions. This implementation uses zmm registers to operate on four AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. To avoid downclocking on older Intel CPU models, an exclusion list is used to prevent this 512-bit implementation from being used by default on some CPU models. They will use xts-aes-vaes-avx10_256 instead. For now, this exclusion list is simply coded into aesni-intel_glue.c. It may make sense to eventually move it into a more central location. xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on some current CPUs. E.g., on AMD Zen 4, AES-256-XTS decryption throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte inputs. On Intel Sapphire Rapids, AES-256-XTS decryption throughput increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs. Future CPUs may provide stronger 512-bit support, in which case a larger benefit should be seen. Signed-off-by: Eric Biggers --- arch/x86/crypto/aes-xts-avx-x86_64.S | 9 ++++++++ arch/x86/crypto/aesni-intel_glue.c | 32 ++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+) diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S index 71be474b22da..b8005d0205f8 100644 --- a/arch/x86/crypto/aes-xts-avx-x86_64.S +++ b/arch/x86/crypto/aes-xts-avx-x86_64.S @@ -824,6 +824,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256) _aes_xts_crypt 1 SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256) SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256) _aes_xts_crypt 0 SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256) + +.set VL, 64 +.set USE_AVX10, 1 +SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_512) + _aes_xts_crypt 1 +SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_512) +SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_512) + _aes_xts_crypt 0 +SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_512) #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c index 914cbf5d1f5c..0855ace8659c 100644 --- a/arch/x86/crypto/aesni-intel_glue.c +++ b/arch/x86/crypto/aesni-intel_glue.c @@ -1298,12 +1298,33 @@ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500); #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ) DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600); DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700); +DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800); #endif +/* + * This is a list of CPU models that are known to suffer from downclocking when + * zmm registers (512-bit vectors) are used. On these CPUs, the AES-XTS + * implementation with zmm registers won't be used by default. An + * implementation with ymm registers (256-bit vectors) will be used instead. + */ +static const struct x86_cpu_id zmm_exclusion_list[] = { + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L }, + { .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE }, + /* Allow Rocket Lake and later, and Sapphire Rapids and later. */ + /* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */ + {}, +}; + static int __init register_xts_algs(void) { int err; if (!boot_cpu_has(X86_FEATURE_AVX)) @@ -1333,10 +1354,18 @@ static int __init register_xts_algs(void) err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1, &aes_xts_simdalg_vaes_avx10_256); if (err) return err; + + if (x86_match_cpu(zmm_exclusion_list)) + aes_xts_alg_vaes_avx10_512.base.cra_priority = 1; + + err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1, + &aes_xts_simdalg_vaes_avx10_512); + if (err) + return err; #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */ return 0; } static void unregister_xts_algs(void) @@ -1349,10 +1378,13 @@ static void unregister_xts_algs(void) simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1, &aes_xts_simdalg_vaes_avx2); if (aes_xts_simdalg_vaes_avx10_256) simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1, &aes_xts_simdalg_vaes_avx10_256); + if (aes_xts_simdalg_vaes_avx10_512) + simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1, + &aes_xts_simdalg_vaes_avx10_512); #endif } #else /* CONFIG_X86_64 */ static int __init register_xts_algs(void) {