From patchwork Sat Aug 4 18:46:24 2018
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 143449
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: herbert@gondor.apana.org.au, linux-arm-kernel@lists.infradead.org, jerome.forissier@linaro.org, jens.wiklander@linaro.org, Ard Biesheuvel
Subject: [PATCH 1/2] crypto: arm64/ghash-ce - replace NEON yield check with block limit
Date: Sat, 4 Aug 2018 20:46:24 +0200
Message-Id: <20180804184625.28523-2-ard.biesheuvel@linaro.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180804184625.28523-1-ard.biesheuvel@linaro.org>
References: <20180804184625.28523-1-ard.biesheuvel@linaro.org>
X-Mailing-List: linux-crypto@vger.kernel.org

Checking the TIF_NEED_RESCHED flag is disproportionately costly on cores
with fast crypto instructions and comparatively slow memory accesses.
On algorithms such as GHASH, which executes at ~1 cycle per byte on cores
that implement support for 64-bit polynomial multiplication, there is
really no need to check the TIF_NEED_RESCHED flag particularly often, and
so we can remove the NEON yield check from the assembler routines.
However, unlike the AEAD or skcipher APIs, the shash/ahash APIs take
arbitrary input lengths, so some sanity check is needed to ensure that we
don't hog the CPU for excessive amounts of time. So let's simply cap the
maximum input size that is processed in one go to 64 KB.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/ghash-ce-core.S | 39 ++++++--------------
 arch/arm64/crypto/ghash-ce-glue.c | 16 ++++++--
 2 files changed, 23 insertions(+), 32 deletions(-)
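For context (not part of the patch): at the ~1 cycle per byte quoted
above, the 64 KB cap bounds the cost of each chunk roughly as

    SZ_64K = 65536 bytes x ~1 cycle/byte ~= 65536 cycles ~= 65 us at 1 GHz

so each non-preemptible NEON section stays short even without the
per-block TIF_NEED_RESCHED check that this patch removes.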
--
2.18.0

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 913e49932ae6..344811c6a0ca 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -213,31 +213,23 @@
 	.endm

 	.macro		__pmull_ghash, pn
-	frame_push	5
-
-	mov		x19, x0
-	mov		x20, x1
-	mov		x21, x2
-	mov		x22, x3
-	mov		x23, x4
-
-0:	ld1		{SHASH.2d}, [x22]
-	ld1		{XL.2d}, [x20]
+	ld1		{SHASH.2d}, [x3]
+	ld1		{XL.2d}, [x1]
 	ext		SHASH2.16b, SHASH.16b, SHASH.16b, #8
 	eor		SHASH2.16b, SHASH2.16b, SHASH.16b

 	__pmull_pre_\pn

 	/* do the head block first, if supplied */
-	cbz		x23, 1f
-	ld1		{T1.2d}, [x23]
-	mov		x23, xzr
-	b		2f
+	cbz		x4, 0f
+	ld1		{T1.2d}, [x4]
+	mov		x4, xzr
+	b		1f

-1:	ld1		{T1.2d}, [x21], #16
-	sub		w19, w19, #1
+0:	ld1		{T1.2d}, [x2], #16
+	sub		w0, w0, #1

-2:	/* multiply XL by SHASH in GF(2^128) */
+1:	/* multiply XL by SHASH in GF(2^128) */
 CPU_LE(	rev64		T1.16b, T1.16b	)

 	ext		T2.16b, XL.16b, XL.16b, #8
@@ -259,18 +251,9 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
 	eor		T2.16b, T2.16b, XH.16b
 	eor		XL.16b, XL.16b, T2.16b

-	cbz		w19, 3f
-
-	if_will_cond_yield_neon
-	st1		{XL.2d}, [x20]
-	do_cond_yield_neon
-	b		0b
-	endif_yield_neon
-
-	b		1b
+	cbnz		w0, 0b

-3:	st1		{XL.2d}, [x20]
-	frame_pop
+	st1		{XL.2d}, [x1]
 	ret
 	.endm

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 88e3d93fa7c7..03ce71ea81a2 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -113,6 +113,9 @@ static void ghash_do_update(int blocks, u64 dg[], const char *src,
 	}
 }

+/* avoid hogging the CPU for too long */
+#define MAX_BLOCKS	(SZ_64K / GHASH_BLOCK_SIZE)
+
 static int ghash_update(struct shash_desc *desc, const u8 *src,
 			unsigned int len)
 {
@@ -136,11 +139,16 @@ static int ghash_update(struct shash_desc *desc, const u8 *src,
 		blocks = len / GHASH_BLOCK_SIZE;
 		len %= GHASH_BLOCK_SIZE;

-		ghash_do_update(blocks, ctx->digest, src, key,
-				partial ? ctx->buf : NULL);
+		do {
+			int chunk = min(blocks, MAX_BLOCKS);
+
+			ghash_do_update(chunk, ctx->digest, src, key,
+					partial ? ctx->buf : NULL);

-		src += blocks * GHASH_BLOCK_SIZE;
-		partial = 0;
+			blocks -= chunk;
+			src += chunk * GHASH_BLOCK_SIZE;
+			partial = 0;
+		} while (unlikely(blocks > 0));
 	}
 	if (len)
 		memcpy(ctx->buf + partial, src, len);

From patchwork Sat Aug 4 18:46:25 2018
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 143450
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: herbert@gondor.apana.org.au, linux-arm-kernel@lists.infradead.org, jerome.forissier@linaro.org, jens.wiklander@linaro.org, Ard Biesheuvel
Subject: [PATCH 2/2] crypto: arm64/ghash-ce - implement 4-way aggregation
Date: Sat, 4 Aug 2018 20:46:25 +0200
Message-Id: <20180804184625.28523-3-ard.biesheuvel@linaro.org>
X-Mailer: git-send-email 2.18.0
In-Reply-To: <20180804184625.28523-1-ard.biesheuvel@linaro.org>
References: <20180804184625.28523-1-ard.biesheuvel@linaro.org>
X-Mailing-List: linux-crypto@vger.kernel.org

Enhance the GHASH implementation that uses 64-bit polynomial
multiplication by adding support for 4-way aggregation. This more than
doubles the performance, from 2.4 cycles per byte to 1.1 cpb on
Cortex-A53.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/ghash-ce-core.S | 122 +++++++++++++++++---
 arch/arm64/crypto/ghash-ce-glue.c |  71 ++++++------
 2 files changed, 142 insertions(+), 51 deletions(-)
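For context (not part of the patch): 4-way aggregation relies on the usual
GHASH folding identity. With X0 the running digest, C1..C4 four input
blocks and H the hash key, the four serial steps

    X1 = (X0 ^ C1) . H
    X2 = (X1 ^ C2) . H
    X3 = (X2 ^ C3) . H
    X4 = (X3 ^ C4) . H

expand to

    X4 = (X0 ^ C1) . H^4  ^  C2 . H^3  ^  C3 . H^2  ^  C4 . H

so the four GF(2^128) multiplications are independent once H^2, H^3 and
H^4 are precomputed at setkey time, and only one reduction is needed per
four blocks; this is what the HH/HH3/HH4/HH34 values and the single
__pmull_reduce_p64 in the 4-way loop below correspond to.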
--
2.18.0

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 344811c6a0ca..1b319b716d5e 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -46,6 +46,19 @@
 	ss3		.req	v26
 	ss4		.req	v27

+	XL2		.req	v8
+	XM2		.req	v9
+	XH2		.req	v10
+	XL3		.req	v11
+	XM3		.req	v12
+	XH3		.req	v13
+	TT3		.req	v14
+	TT4		.req	v15
+	HH		.req	v16
+	HH3		.req	v17
+	HH4		.req	v18
+	HH34		.req	v19
+
 	.text
 	.arch		armv8-a+crypto

@@ -134,11 +147,25 @@
 	.endm

 	.macro		__pmull_pre_p64
+	add		x8, x3, #16
+	ld1		{HH.2d-HH4.2d}, [x8]
+
+	trn1		SHASH2.2d, SHASH.2d, HH.2d
+	trn2		T1.2d, SHASH.2d, HH.2d
+	eor		SHASH2.16b, SHASH2.16b, T1.16b
+
+	trn1		HH34.2d, HH3.2d, HH4.2d
+	trn2		T1.2d, HH3.2d, HH4.2d
+	eor		HH34.16b, HH34.16b, T1.16b
+
 	movi		MASK.16b, #0xe1
 	shl		MASK.2d, MASK.2d, #57
 	.endm

 	.macro		__pmull_pre_p8
+	ext		SHASH2.16b, SHASH.16b, SHASH.16b, #8
+	eor		SHASH2.16b, SHASH2.16b, SHASH.16b
+
 	// k00_16 := 0x0000000000000000_000000000000ffff
 	// k32_48 := 0x00000000ffffffff_0000ffffffffffff
 	movi		k32_48.2d, #0xffffffff
@@ -215,8 +242,6 @@
 	.macro		__pmull_ghash, pn
 	ld1		{SHASH.2d}, [x3]
 	ld1		{XL.2d}, [x1]
-	ext		SHASH2.16b, SHASH.16b, SHASH.16b, #8
-	eor		SHASH2.16b, SHASH2.16b, SHASH.16b

 	__pmull_pre_\pn

@@ -224,12 +249,79 @@
 	cbz		x4, 0f
 	ld1		{T1.2d}, [x4]
 	mov		x4, xzr
-	b		1f
+	b		3f
+
+0:	.ifc		\pn, p64
+	tbnz		w0, #0, 2f		// skip until #blocks is a
+	tbnz		w0, #1, 2f		// round multiple of 4
+
+1:	ld1		{XM3.16b-TT4.16b}, [x2], #64
+
+	sub		w0, w0, #4
+
+	rev64		T1.16b, XM3.16b
+	rev64		T2.16b, XH3.16b
+	rev64		TT4.16b, TT4.16b
+	rev64		TT3.16b, TT3.16b
+
+	ext		IN1.16b, TT4.16b, TT4.16b, #8
+	ext		XL3.16b, TT3.16b, TT3.16b, #8
+
+	eor		TT4.16b, TT4.16b, IN1.16b
+	pmull2		XH2.1q, SHASH.2d, IN1.2d	// a1 * b1
+	pmull		XL2.1q, SHASH.1d, IN1.1d	// a0 * b0
+	pmull		XM2.1q, SHASH2.1d, TT4.1d	// (a1 + a0)(b1 + b0)
+
+	eor		TT3.16b, TT3.16b, XL3.16b
+	pmull2		XH3.1q, HH.2d, XL3.2d		// a1 * b1
+	pmull		XL3.1q, HH.1d, XL3.1d		// a0 * b0
+	pmull2		XM3.1q, SHASH2.2d, TT3.2d	// (a1 + a0)(b1 + b0)
+
+	ext		IN1.16b, T2.16b, T2.16b, #8
+	eor		XL2.16b, XL2.16b, XL3.16b
+	eor		XH2.16b, XH2.16b, XH3.16b
+	eor		XM2.16b, XM2.16b, XM3.16b
+
+	eor		T2.16b, T2.16b, IN1.16b
+	pmull2		XH3.1q, HH3.2d, IN1.2d		// a1 * b1
+	pmull		XL3.1q, HH3.1d, IN1.1d		// a0 * b0
+	pmull		XM3.1q, HH34.1d, T2.1d		// (a1 + a0)(b1 + b0)

-0:	ld1		{T1.2d}, [x2], #16
+	eor		XL2.16b, XL2.16b, XL3.16b
+	eor		XH2.16b, XH2.16b, XH3.16b
+	eor		XM2.16b, XM2.16b, XM3.16b
+
+	ext		IN1.16b, T1.16b, T1.16b, #8
+	ext		TT3.16b, XL.16b, XL.16b, #8
+	eor		XL.16b, XL.16b, IN1.16b
+	eor		T1.16b, T1.16b, TT3.16b
+
+	pmull2		XH.1q, HH4.2d, XL.2d		// a1 * b1
+	eor		T1.16b, T1.16b, XL.16b
+	pmull		XL.1q, HH4.1d, XL.1d		// a0 * b0
+	pmull2		XM.1q, HH34.2d, T1.2d		// (a1 + a0)(b1 + b0)
+
+	eor		XL.16b, XL.16b, XL2.16b
+	eor		XH.16b, XH.16b, XH2.16b
+	eor		XM.16b, XM.16b, XM2.16b
+
+	eor		T2.16b, XL.16b, XH.16b
+	ext		T1.16b, XL.16b, XH.16b, #8
+	eor		XM.16b, XM.16b, T2.16b
+
+	__pmull_reduce_p64
+
+	eor		T2.16b, T2.16b, XH.16b
+	eor		XL.16b, XL.16b, T2.16b
+
+	cbz		w0, 5f
+	b		1b
+	.endif
+
+2:	ld1		{T1.2d}, [x2], #16
 	sub		w0, w0, #1

-1:	/* multiply XL by SHASH in GF(2^128) */
+3:	/* multiply XL by SHASH in GF(2^128) */
 CPU_LE(	rev64		T1.16b, T1.16b	)

 	ext		T2.16b, XL.16b, XL.16b, #8
@@ -242,7 +334,7 @@ CPU_LE(	rev64		T1.16b, T1.16b	)
 	__pmull_\pn	XL, XL, SHASH		// a0 * b0
 	__pmull_\pn	XM, T1, SHASH2		// (a1 + a0)(b1 + b0)

-	eor		T2.16b, XL.16b, XH.16b
+4:	eor		T2.16b, XL.16b, XH.16b
 	ext		T1.16b, XL.16b, XH.16b, #8
 	eor		XM.16b, XM.16b, T2.16b

@@ -253,7 +345,7 @@ CPU_LE(	rev64		T1.16b, T1.16b	)

 	cbnz		w0, 0b

-	st1		{XL.2d}, [x1]
+5:	st1		{XL.2d}, [x1]
 	ret
 	.endm

@@ -269,14 +361,10 @@ ENTRY(pmull_ghash_update_p8)
 	__pmull_ghash	p8
ENDPROC(pmull_ghash_update_p8)

-	KS0		.req	v8
-	KS1		.req	v9
-	INP0		.req	v10
-	INP1		.req	v11
-	HH		.req	v12
-	XL2		.req	v13
-	XM2		.req	v14
-	XH2		.req	v15
+	KS0		.req	v12
+	KS1		.req	v13
+	INP0		.req	v14
+	INP1		.req	v15

 	.macro		load_round_keys, rounds, rk
 	cmp		\rounds, #12
@@ -310,8 +398,8 @@ ENDPROC(pmull_ghash_update_p8)
 	.endm

 	.macro		pmull_gcm_do_crypt, enc
-	ld1		{HH.2d}, [x4], #16
-	ld1		{SHASH.2d}, [x4]
+	ld1		{SHASH.2d}, [x4], #16
+	ld1		{HH.2d}, [x4]
 	ld1		{XL.2d}, [x1]
 	ldr		x8, [x5, #8]		// load lower counter

diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 03ce71ea81a2..08b49fd621cb 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -33,9 +33,12 @@ MODULE_ALIAS_CRYPTO("ghash");
 #define GCM_IV_SIZE		12

 struct ghash_key {
-	u64 a;
-	u64 b;
-	be128 k;
+	u64			h[2];
+	u64			h2[2];
+	u64			h3[2];
+	u64			h4[2];
+
+	be128			k;
 };

 struct ghash_desc_ctx {
@@ -46,7 +49,6 @@ struct ghash_desc_ctx {

 struct gcm_aes_ctx {
 	struct crypto_aes_ctx	aes_key;
-	u64			h2[2];
 	struct ghash_key	ghash_key;
 };

@@ -63,11 +65,12 @@ static void (*pmull_ghash_update)(int blocks, u64 dg[], const char *src,
 				  const char *head);

 asmlinkage void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[],
-				  const u8 src[], u64 const *k, u8 ctr[],
-				  u32 const rk[], int rounds, u8 ks[]);
+				  const u8 src[], struct ghash_key const *k,
+				  u8 ctr[], u32 const rk[], int rounds,
+				  u8 ks[]);

 asmlinkage void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[],
-				  const u8 src[], u64 const *k,
+				  const u8 src[], struct ghash_key const *k,
 				  u8 ctr[], u32 const rk[], int rounds);

 asmlinkage void pmull_gcm_encrypt_block(u8 dst[], u8 const src[],
@@ -174,23 +177,36 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
 	return 0;
 }

+static void ghash_reflect(u64 h[], const be128 *k)
+{
+	u64 carry = be64_to_cpu(k->a) & BIT(63) ? 1 : 0;
+
+	h[0] = (be64_to_cpu(k->b) << 1) | carry;
+	h[1] = (be64_to_cpu(k->a) << 1) | (be64_to_cpu(k->b) >> 63);
+
+	if (carry)
+		h[1] ^= 0xc200000000000000UL;
+}
+
 static int __ghash_setkey(struct ghash_key *key, const u8 *inkey,
 			  unsigned int keylen)
 {
-	u64 a, b;
+	be128 h;

 	/* needed for the fallback */
 	memcpy(&key->k, inkey, GHASH_BLOCK_SIZE);

-	/* perform multiplication by 'x' in GF(2^128) */
-	b = get_unaligned_be64(inkey);
-	a = get_unaligned_be64(inkey + 8);
+	ghash_reflect(key->h, &key->k);
+
+	h = key->k;
+	gf128mul_lle(&h, &key->k);
+	ghash_reflect(key->h2, &h);

-	key->a = (a << 1) | (b >> 63);
-	key->b = (b << 1) | (a >> 63);
+	gf128mul_lle(&h, &key->k);
+	ghash_reflect(key->h3, &h);

-	if (b >> 63)
-		key->b ^= 0xc200000000000000UL;
+	gf128mul_lle(&h, &key->k);
+	ghash_reflect(key->h4, &h);

 	return 0;
 }
@@ -241,8 +257,7 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *inkey,
 		      unsigned int keylen)
 {
 	struct gcm_aes_ctx *ctx = crypto_aead_ctx(tfm);
-	be128 h1, h2;
-	u8 *key = (u8 *)&h1;
+	u8 key[GHASH_BLOCK_SIZE];
 	int ret;

 	ret = crypto_aes_expand_key(&ctx->aes_key, inkey, keylen);
@@ -254,19 +269,7 @@ static int gcm_setkey(struct crypto_aead *tfm, const u8 *inkey,
 	__aes_arm64_encrypt(ctx->aes_key.key_enc, key, (u8[AES_BLOCK_SIZE]){},
 			    num_rounds(&ctx->aes_key));

-	__ghash_setkey(&ctx->ghash_key, key, sizeof(be128));
-
-	/* calculate H^2 (used for 2-way aggregation) */
-	h2 = h1;
-	gf128mul_lle(&h2, &h1);
-
-	ctx->h2[0] = (be64_to_cpu(h2.b) << 1) | (be64_to_cpu(h2.a) >> 63);
-	ctx->h2[1] = (be64_to_cpu(h2.a) << 1) | (be64_to_cpu(h2.b) >> 63);
-
-	if (be64_to_cpu(h2.a) >> 63)
-		ctx->h2[1] ^= 0xc200000000000000UL;
-
-	return 0;
+	return __ghash_setkey(&ctx->ghash_key, key, sizeof(be128));
 }

 static int gcm_setauthsize(struct crypto_aead *tfm, unsigned int authsize)
@@ -402,8 +405,8 @@ static int gcm_encrypt(struct aead_request *req)

 			kernel_neon_begin();
 			pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
-					  walk.src.virt.addr, ctx->h2, iv,
-					  rk, nrounds, ks);
+					  walk.src.virt.addr, &ctx->ghash_key,
+					  iv, rk, nrounds, ks);
 			kernel_neon_end();

 			err = skcipher_walk_done(&walk,
@@ -513,8 +516,8 @@ static int gcm_decrypt(struct aead_request *req)

 			kernel_neon_begin();
 			pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
-					  walk.src.virt.addr, ctx->h2, iv,
-					  rk, nrounds);
+					  walk.src.virt.addr, &ctx->ghash_key,
+					  iv, rk, nrounds);

 			/* check if this is the final iteration of the loop */
 			if (rem < (2 * AES_BLOCK_SIZE)) {