From patchwork Tue Sep 10 23:18:59 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ard Biesheuvel <ard.biesheuvel@linaro.org>
X-Patchwork-Id: 173556
Delivered-To: patch@linaro.org
Received: by 2002:a05:6e02:ce:0:0:0:0 with SMTP id r14csp3087ilq;
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
X-Google-Smtp-Source: APXvYqyubkB8O4flyKz/nsDyyRGRGQH1GZtivQmSXKVE/lKpnNxmzlllgeJe80OwO4JKlfzOZVG4
X-Received: by 2002:a17:906:fd0:: with SMTP id
 c16mr26826733ejk.213.1568157576294; 
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1568157576; cv=none;
 d=google.com; s=arc-20160816;
 b=wdjy6Dy9oULd2bC8up8ppPAeoHNMiNRE9YH3p4u0QdRPL50+cTSBj/ak2hzlLxGYtH
 vaJrbod5ilIBCu/UexL5Q3e4FsTDBPAAwdyF5MVytjtzzutxM8ZXR31fjFl7kGKu7vhE
 MuZDv89jR4p7JJ6zl/67jifepCf8bhlVPV8YMmVCoTFt23PilgrRuqUNv3Z/GT/xl98C
 TqPPoyQaPKOjb8DPz8ZcAt4wRgezqc0iXMlLd8YL5lgfXd7+Nj8Ax7sgCttmwrxjY1GA
 7qhty0VpoSzmUziOFa7o/LxhVOGfoxsMqSGYSzWn9Lp6/2ycmafx/nfKHBN8piVeF9kC
 ar8w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816; 
 h=list-id:precedence:sender:references:in-reply-to:message-id:date
 :subject:cc:to:from:dkim-signature;
 bh=VGVIdAO9OEwtBrGMbmlkUR8fmURSnEmtHuSHajss7bg=;
 b=hnIPegKTtADwQdc3dn+x9boRsqhs919QjasPU6/OIzN6J8CUAgC4nzuVNZ4kD7w1I5
 mhlKUfuO9VBMKCfUeinJ/iNzpsxirJzqlvN4NlxgObfUJY/GhwV2O7G0fG7v1Tuctwbx
 UBYpW9j3J/UpsEZqnUaZl4taJN/wf1B/YbKEveG86Hz95/Neufcscnyau/1x1IkGCOls
 ttz/T2mt/cFXXSXFKXBubq3XwUyicYciVFW5wGMQZX6ZFS8fbwY5b6QZgg9iSsYg2WHo
 Kt0+CjLpNVQL+m7BHQ/hyy1jkc4tUfsjz6uFicC1XtdE/BJ+EhLHMCB8LKWp/9HCVP7H
 zPqQ==
ARC-Authentication-Results: i=1; mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=NlfcSpZr;
 spf=pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-crypto-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Return-Path: <linux-crypto-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id
 w2si12341749edc.386.2019.09.10.16.19.36; 
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=NlfcSpZr;
 spf=pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-crypto-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1725965AbfIJXTe (ORCPT <rfc822;sumit.garg@linaro.org>
 + 3 others); Tue, 10 Sep 2019 19:19:34 -0400
Received: from mail-wm1-f65.google.com ([209.85.128.65]:35665 "EHLO
 mail-wm1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1725876AbfIJXTe (ORCPT
 <rfc822;linux-crypto@vger.kernel.org>);
 Tue, 10 Sep 2019 19:19:34 -0400
Received: by mail-wm1-f65.google.com with SMTP id n10so1322555wmj.0
 for <linux-crypto@vger.kernel.org>;
 Tue, 10 Sep 2019 16:19:30 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=from:to:cc:subject:date:message-id:in-reply-to:references;
 bh=VGVIdAO9OEwtBrGMbmlkUR8fmURSnEmtHuSHajss7bg=;
 b=NlfcSpZrFWftuixDbmewS1eJhRHfReZ0Po4F+qUjHPU5odMqE0D2dE6j4XdcX7ftYK
 ykPZ8/Fe1ReQLb4tCNA5FTgWFRj6HNjb23z3DU5qjiI+WK+L9a7qIXCTJ0KO3YgNrF79
 O5XjK+yJoY1hgct2UeobC92E7qtqaM0UHYgnRfUfMnl+rYGj2dD0ql8WaybRUUetkfL4
 e8CYqDRigCh85crpWEjBqvcdoDNeAmuO2U4yD9EC/DVY/2QEHnvEQav/ucm2L7m5thNQ
 qSEKuo9/IvDPrC2ctn6xlvfcgmGCw3c8khVRYwiChPwY//pbhJI1e1XJwsUQsUWWvOw0
 0JSA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references;
 bh=VGVIdAO9OEwtBrGMbmlkUR8fmURSnEmtHuSHajss7bg=;
 b=OyreM+ZT1a66var348vVLAL7tUpVCvagO42g3h6rMt4h4EWU3l/FGhlZOkdimmPDi3
 ebmzRj2NNN8WLamYoZRdW8DL5vxUdvNAI96J/wurQ4eGOvU2I3Qfh1dJwa4qogmEd1nR
 ly8+xh2uCe7HOggJrrirdvlJDqOKK8Wa3OH7ivRjLq2zimj8TZynUTbsTibXkfGP+hOh
 KrVuVjbw9/61JvU65MRl3EWpexxkhRuD2+RsoSUP2aujbg0kx38gLNT1YFggTMIf0qfK
 cHs3kepaA7vmwYym/X+N/cfk6yj1W8sRp3f0T/q9fu7UwbPcTzvt5TKFBgDxrWzdocy+
 qZ/g==
X-Gm-Message-State: APjAAAVQmZoe3MUoBzmzBDdP/v+VTJNgI6dYV8wOmoZb/LDmqR8+S0Ac
 2atynYusVtXZVXpZ3SY062G1W4A6prDl5kP/
X-Received: by 2002:a05:600c:2181:: with SMTP id
 e1mr1403939wme.117.1568157569046; 
 Tue, 10 Sep 2019 16:19:29 -0700 (PDT)
Received: from e111045-lin.nice.arm.com ([148.69.85.38])
 by smtp.gmail.com with ESMTPSA id
 y186sm2137846wmb.41.2019.09.10.16.19.27
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 10 Sep 2019 16:19:28 -0700 (PDT)
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, herbert@gondor.apana.org.au,
 ebiggers@google.com, marc.zyngier@arm.com,
 Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: [PATCH 1/2] crypto: testmgr - add another gcm(aes) testcase
Date: Wed, 11 Sep 2019 00:18:59 +0100
Message-Id: <20190910231900.25445-2-ard.biesheuvel@linaro.org>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190910231900.25445-1-ard.biesheuvel@linaro.org>
References: <20190910231900.25445-1-ard.biesheuvel@linaro.org>
Sender: linux-crypto-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-crypto.vger.kernel.org>
X-Mailing-List: linux-crypto@vger.kernel.org

Add an additional gcm(aes) test case that triggers the code path in
the new arm64 driver that deals with tail blocks whose size is not
a multiple of the block size, and where the size of the preceding
input is a multiple of 64 bytes.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 crypto/testmgr.h | 192 ++++++++++++++++++++
 1 file changed, 192 insertions(+)

-- 
2.17.1

diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index b3296441f9c7..690e68cb3e51 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -17411,6 +17411,198 @@ static const struct aead_testvec aes_gcm_tv_template[] = {
 			  "\x25\x19\x49\x8e\x80\xf1\x47\x8f"
 			  "\x37\xba\x55\xbd\x6d\x27\x61\x8c",
 		.clen	= 76,
+	}, {
+		.key	= "\x62\x35\xf8\x95\xfc\xa5\xeb\xf6"
+			  "\x0e\x92\x12\x04\xd3\xa1\x3f\x2e"
+			  "\x8b\x32\xcf\xe7\x44\xed\x13\x59"
+			  "\x04\x38\x77\xb0\xb9\xad\xb4\x38",
+		.klen	= 32,
+		.iv	= "\x00\xff\xff\xff\xff\x00\x00\xff"
+			  "\xff\xff\x00\xff",
+		.ptext	= "\x42\xc1\xcc\x08\x48\x6f\x41\x3f"
+			  "\x2f\x11\x66\x8b\x2a\x16\xf0\xe0"
+			  "\x58\x83\xf0\xc3\x70\x14\xc0\x5b"
+			  "\x3f\xec\x1d\x25\x3c\x51\xd2\x03"
+			  "\xcf\x59\x74\x1f\xb2\x85\xb4\x07"
+			  "\xc6\x6a\x63\x39\x8a\x5b\xde\xcb"
+			  "\xaf\x08\x44\xbd\x6f\x91\x15\xe1"
+			  "\xf5\x7a\x6e\x18\xbd\xdd\x61\x50"
+			  "\x59\xa9\x97\xab\xbb\x0e\x74\x5c"
+			  "\x00\xa4\x43\x54\x04\x54\x9b\x3b"
+			  "\x77\xec\xfd\x5c\xa6\xe8\x7b\x08"
+			  "\xae\xe6\x10\x3f\x32\x65\xd1\xfc"
+			  "\xa4\x1d\x2c\x31\xfb\x33\x7a\xb3"
+			  "\x35\x23\xf4\x20\x41\xd4\xad\x82"
+			  "\x8b\xa4\xad\x96\x1c\x20\x53\xbe"
+			  "\x0e\xa6\xf4\xdc\x78\x49\x3e\x72"
+			  "\xb1\xa9\xb5\x83\xcb\x08\x54\xb7"
+			  "\xad\x49\x3a\xae\x98\xce\xa6\x66"
+			  "\x10\x30\x90\x8c\x55\x83\xd7\x7c"
+			  "\x8b\xe6\x53\xde\xd2\x6e\x18\x21"
+			  "\x01\x52\xd1\x9f\x9d\xbb\x9c\x73"
+			  "\x57\xcc\x89\x09\x75\x9b\x78\x70"
+			  "\xed\x26\x97\x4d\xb4\xe4\x0c\xa5"
+			  "\xfa\x70\x04\x70\xc6\x96\x1c\x7d"
+			  "\x54\x41\x77\xa8\xe3\xb0\x7e\x96"
+			  "\x82\xd9\xec\xa2\x87\x68\x55\xf9"
+			  "\x8f\x9e\x73\x43\x47\x6a\x08\x36"
+			  "\x93\x67\xa8\x2d\xde\xac\x41\xa9"
+			  "\x5c\x4d\x73\x97\x0f\x70\x68\xfa"
+			  "\x56\x4d\x00\xc2\x3b\x1f\xc8\xb9"
+			  "\x78\x1f\x51\x07\xe3\x9a\x13\x4e"
+			  "\xed\x2b\x2e\xa3\xf7\x44\xb2\xe7"
+			  "\xab\x19\x37\xd9\xba\x76\x5e\xd2"
+			  "\xf2\x53\x15\x17\x4c\x6b\x16\x9f"
+			  "\x02\x66\x49\xca\x7c\x91\x05\xf2"
+			  "\x45\x36\x1e\xf5\x77\xad\x1f\x46"
+			  "\xa8\x13\xfb\x63\xb6\x08\x99\x63"
+			  "\x82\xa2\xed\xb3\xac\xdf\x43\x19"
+			  "\x45\xea\x78\x73\xd9\xb7\x39\x11"
+			  "\xa3\x13\x7c\xf8\x3f\xf7\xad\x81"
+			  "\x48\x2f\xa9\x5c\x5f\xa0\xf0\x79"
+			  "\xa4\x47\x7d\x80\x20\x26\xfd\x63"
+			  "\x0a\xc7\x7e\x6d\x75\x47\xff\x76"
+			  "\x66\x2e\x8a\x6c\x81\x35\xaf\x0b"
+			  "\x2e\x6a\x49\x60\xc1\x10\xe1\xe1"
+			  "\x54\x03\xa4\x09\x0c\x37\x7a\x15"
+			  "\x23\x27\x5b\x8b\x4b\xa5\x64\x97"
+			  "\xae\x4a\x50\x73\x1f\x66\x1c\x5c"
+			  "\x03\x25\x3c\x8d\x48\x58\x71\x34"
+			  "\x0e\xec\x4e\x55\x1a\x03\x6a\xe5"
+			  "\xb6\x19\x2b\x84\x2a\x20\xd1\xea"
+			  "\x80\x6f\x96\x0e\x05\x62\xc7\x78"
+			  "\x87\x79\x60\x38\x46\xb4\x25\x57"
+			  "\x6e\x16\x63\xf8\xad\x6e\xd7\x42"
+			  "\x69\xe1\x88\xef\x6e\xd5\xb4\x9a"
+			  "\x3c\x78\x6c\x3b\xe5\xa0\x1d\x22"
+			  "\x86\x5c\x74\x3a\xeb\x24\x26\xc7"
+			  "\x09\xfc\x91\x96\x47\x87\x4f\x1a"
+			  "\xd6\x6b\x2c\x18\x47\xc0\xb8\x24"
+			  "\xa8\x5a\x4a\x9e\xcb\x03\xe7\x2a"
+			  "\x09\xe6\x4d\x9c\x6d\x86\x60\xf5"
+			  "\x2f\x48\x69\x37\x9f\xf2\xd2\xcb"
+			  "\x0e\x5a\xdd\x6e\x8a\xfb\x6a\xfe"
+			  "\x0b\x63\xde\x87\x42\x79\x8a\x68"
+			  "\x51\x28\x9b\x7a\xeb\xaf\xb8\x2f"
+			  "\x9d\xd1\xc7\x45\x90\x08\xc9\x83"
+			  "\xe9\x83\x84\xcb\x28\x69\x09\x69"
+			  "\xce\x99\x46\x00\x54\xcb\xd8\x38"
+			  "\xf9\x53\x4a\xbf\x31\xce\x57\x15"
+			  "\x33\xfa\x96\x04\x33\x42\xe3\xc0"
+			  "\xb7\x54\x4a\x65\x7a\x7c\x02\xe6"
+			  "\x19\x95\xd0\x0e\x82\x07\x63\xf9"
+			  "\xe1\x2b\x2a\xfc\x55\x92\x52\xc9"
+			  "\xb5\x9f\x23\x28\x60\xe7\x20\x51"
+			  "\x10\xd3\xed\x6d\x9b\xab\xb8\xe2"
+			  "\x5d\x9a\x34\xb3\xbe\x9c\x64\xcb"
+			  "\x78\xc6\x91\x22\x40\x91\x80\xbe"
+			  "\xd7\x78\x5c\x0e\x0a\xdc\x08\xe9"
+			  "\x67\x10\xa4\x83\x98\x79\x23\xe7"
+			  "\x92\xda\xa9\x22\x16\xb1\xe7\x78"
+			  "\xa3\x1c\x6c\x8f\x35\x7c\x4d\x37"
+			  "\x2f\x6e\x0b\x50\x5c\x34\xb9\xf9"
+			  "\xe6\x3d\x91\x0d\x32\x95\xaa\x3d"
+			  "\x48\x11\x06\xbb\x2d\xf2\x63\x88"
+			  "\x3f\x73\x09\xe2\x45\x56\x31\x51"
+			  "\xfa\x5e\x4e\x62\xf7\x90\xf9\xa9"
+			  "\x7d\x7b\x1b\xb1\xc8\x26\x6e\x66"
+			  "\xf6\x90\x9a\x7f\xf2\x57\xcc\x23"
+			  "\x59\xfa\xfa\xaa\x44\x04\x01\xa7"
+			  "\xa4\x78\xdb\x74\x3d\x8b\xb5",
+		.plen	= 719,
+		.ctext	= "\x84\x0b\xdb\xd5\xb7\xa8\xfe\x20"
+			  "\xbb\xb1\x12\x7f\x41\xea\xb3\xc0"
+			  "\xa2\xb4\x37\x19\x11\x58\xb6\x0b"
+			  "\x4c\x1d\x38\x05\x54\xd1\x16\x73"
+			  "\x8e\x1c\x20\x90\xa2\x9a\xb7\x74"
+			  "\x47\xe6\xd8\xfc\x18\x3a\xb4\xea"
+			  "\xd5\x16\x5a\x2c\x53\x01\x46\xb3"
+			  "\x18\x33\x74\x6c\x50\xf2\xe8\xc0"
+			  "\x73\xda\x60\x22\xeb\xe3\xe5\x9b"
+			  "\x20\x93\x6c\x4b\x37\x99\xb8\x23"
+			  "\x3b\x4e\xac\xe8\x5b\xe8\x0f\xb7"
+			  "\xc3\x8f\xfb\x4a\x37\xd9\x39\x95"
+			  "\x34\xf1\xdb\x8f\x71\xd9\xc7\x0b"
+			  "\x02\xf1\x63\xfc\x9b\xfc\xc5\xab"
+			  "\xb9\x14\x13\x21\xdf\xce\xaa\x88"
+			  "\x44\x30\x1e\xce\x26\x01\x92\xf8"
+			  "\x9f\x00\x4b\x0c\x4b\xf7\x5f\xe0"
+			  "\x89\xca\x94\x66\x11\x21\x97\xca"
+			  "\x3e\x83\x74\x2d\xdb\x4d\x11\xeb"
+			  "\x97\xc2\x14\xff\x9e\x1e\xa0\x6b"
+			  "\x08\xb4\x31\x2b\x85\xc6\x85\x6c"
+			  "\x90\xec\x39\xc0\xec\xb3\xb5\x4e"
+			  "\xf3\x9c\xe7\x83\x3a\x77\x0a\xf4"
+			  "\x56\xfe\xce\x18\x33\x6d\x0b\x2d"
+			  "\x33\xda\xc8\x05\x5c\xb4\x09\x2a"
+			  "\xde\x6b\x52\x98\x01\xef\x36\x3d"
+			  "\xbd\xf9\x8f\xa8\x3e\xaa\xcd\xd1"
+			  "\x01\x2d\x42\x49\xc3\xb6\x84\xbb"
+			  "\x48\x96\xe0\x90\x93\x6c\x48\x64"
+			  "\xd4\xfa\x7f\x93\x2c\xa6\x21\xc8"
+			  "\x7a\x23\x7b\xaa\x20\x56\x12\xae"
+			  "\x16\x9d\x94\x0f\x54\xa1\xec\xca"
+			  "\x51\x4e\xf2\x39\xf4\xf8\x5f\x04"
+			  "\x5a\x0d\xbf\xf5\x83\xa1\x15\xe1"
+			  "\xf5\x3c\xd8\x62\xa3\xed\x47\x89"
+			  "\x85\x4c\xe5\xdb\xac\x9e\x17\x1d"
+			  "\x0c\x09\xe3\x3e\x39\x5b\x4d\x74"
+			  "\x0e\xf5\x34\xee\x70\x11\x4c\xfd"
+			  "\xdb\x34\xb1\xb5\x10\x3f\x73\xb7"
+			  "\xf5\xfa\xed\xb0\x1f\xa5\xcd\x3c"
+			  "\x8d\x35\x83\xd4\x11\x44\x6e\x6c"
+			  "\x5b\xe0\x0e\x69\xa5\x39\xe5\xbb"
+			  "\xa9\x57\x24\x37\xe6\x1f\xdd\xcf"
+			  "\x16\x2a\x13\xf9\x6a\x2d\x90\xa0"
+			  "\x03\x60\x7a\xed\x69\xd5\x00\x8b"
+			  "\x7e\x4f\xcb\xb9\xfa\x91\xb9\x37"
+			  "\xc1\x26\xce\x90\x97\x22\x64\x64"
+			  "\xc1\x72\x43\x1b\xf6\xac\xc1\x54"
+			  "\x8a\x10\x9c\xdd\x8d\xd5\x8e\xb2"
+			  "\xe4\x85\xda\xe0\x20\x5f\xf4\xb4"
+			  "\x15\xb5\xa0\x8d\x12\x74\x49\x23"
+			  "\x3a\xdf\x4a\xd3\xf0\x3b\x89\xeb"
+			  "\xf8\xcc\x62\x7b\xfb\x93\x07\x41"
+			  "\x61\x26\x94\x58\x70\xa6\x3c\xe4"
+			  "\xff\x58\xc4\x13\x3d\xcb\x36\x6b"
+			  "\x32\xe5\xb2\x6d\x03\x74\x6f\x76"
+			  "\x93\x77\xde\x48\xc4\xfa\x30\x4a"
+			  "\xda\x49\x80\x77\x0f\x1c\xbe\x11"
+			  "\xc8\x48\xb1\xe5\xbb\xf2\x8a\xe1"
+			  "\x96\x2f\x9f\xd1\x8e\x8a\x5c\xe2"
+			  "\xf7\xd7\xd8\x54\xf3\x3f\xc4\x91"
+			  "\xb8\xfb\x86\xdc\x46\x24\x91\x60"
+			  "\x6c\x2f\xc9\x41\x37\x51\x49\x54"
+			  "\x09\x81\x21\xf3\x03\x9f\x2b\xe3"
+			  "\x1f\x39\x63\xaf\xf4\xd7\x53\x60"
+			  "\xa7\xc7\x54\xf9\xee\xb1\xb1\x7d"
+			  "\x75\x54\x65\x93\xfe\xb1\x68\x6b"
+			  "\x57\x02\xf9\xbb\x0e\xf9\xf8\xbf"
+			  "\x01\x12\x27\xb4\xfe\xe4\x79\x7a"
+			  "\x40\x5b\x51\x4b\xdf\x38\xec\xb1"
+			  "\x6a\x56\xff\x35\x4d\x42\x33\xaa"
+			  "\x6f\x1b\xe4\xdc\xe0\xdb\x85\x35"
+			  "\x62\x10\xd4\xec\xeb\xc5\x7e\x45"
+			  "\x1c\x6f\x17\xca\x3b\x8e\x2d\x66"
+			  "\x4f\x4b\x36\x56\xcd\x1b\x59\xaa"
+			  "\xd2\x9b\x17\xb9\x58\xdf\x7b\x64"
+			  "\x8a\xff\x3b\x9c\xa6\xb5\x48\x9e"
+			  "\xaa\xe2\x5d\x09\x71\x32\x5f\xb6"
+			  "\x29\xbe\xe7\xc7\x52\x7e\x91\x82"
+			  "\x6b\x6d\x33\xe1\x34\x06\x36\x21"
+			  "\x5e\xbe\x1e\x2f\x3e\xc1\xfb\xea"
+			  "\x49\x2c\xb5\xca\xf7\xb0\x37\xea"
+			  "\x1f\xed\x10\x04\xd9\x48\x0d\x1a"
+			  "\x1c\xfb\xe7\x84\x0e\x83\x53\x74"
+			  "\xc7\x65\xe2\x5c\xe5\xba\x73\x4c"
+			  "\x0e\xe1\xb5\x11\x45\x61\x43\x46"
+			  "\xaa\x25\x8f\xbd\x85\x08\xfa\x4c"
+			  "\x15\xc1\xc0\xd8\xf5\xdc\x16\xbb"
+			  "\x7b\x1d\xe3\x87\x57\xa7\x2a\x1d"
+			  "\x38\x58\x9e\x8a\x43\xdc\x57"
+			  "\xd1\x81\x7d\x2b\xe9\xff\x99\x3a"
+			  "\x4b\x24\x52\x58\x55\xe1\x49\x14",
+		.clen	= 735,
 	}
 };
 

From patchwork Tue Sep 10 23:19:00 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ard Biesheuvel <ard.biesheuvel@linaro.org>
X-Patchwork-Id: 173557
Delivered-To: patch@linaro.org
Received: by 2002:a05:6e02:ce:0:0:0:0 with SMTP id r14csp3093ilq;
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
X-Google-Smtp-Source: APXvYqzu6GBRIGoyEWFhafjNTLNhr7rO4KCg68Tcsi2lK1ob2oxoQjbiJrowrxGrVbrwrlXF2kgg
X-Received: by 2002:a17:906:308d:: with SMTP id
 13mr8512158ejv.207.1568157576778; 
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1568157576; cv=none;
 d=google.com; s=arc-20160816;
 b=QMia0z6W0zYnby6PSRRszpxCxp/6raH5EoDlB71lRqQSCluYt5X8SEpru6iybco+LY
 OWZW6yu3lBvr8EneWVkr7eg0gMsd8PtK9/9IOIK7blOl8SZ/qBozt0cUbtfWquryv6ck
 9RV5s0AWzPXWg6eNsN5h1pxOzPhEWC2zHXm+NT1ROGGgKDkvtcNJOvz1jezYy1LzInuR
 b2+m22oWGk/fM7iLT+BnTvItjZqQiHaKKQZ7IeAPWOUVFwVODNOQiwXwtL/jmii/pr1Z
 SxpSXCLrs7PxC/2XjA45rWj8AoMGhg4EevqjuHDOJUVdft8QCZ7ajwV3zwGVjhIAkvU6
 kTFw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
 s=arc-20160816; 
 h=list-id:precedence:sender:references:in-reply-to:message-id:date
 :subject:cc:to:from:dkim-signature;
 bh=NtofCXrViCdXCL+kihHyw58CmFK5QeADjCfLREXSGyU=;
 b=YQC6kxgXI8lkyi6aiqypzAgHp3NRJlsLHDqKrvFAV+rEfYK5fFAdJB/4reRCGfLJcj
 BIWmunKyiI61og9Pj0PKFzrQiG7ABlMx5wXFeQ0xbD9kdd7rKM0aFug2IYVKKzEmDWng
 Ly/couXM++I+7EXPcFoFlE5ospB7A4ymmcdikcy6w0GBFOknv2h4UlvDv9aKIvqFUzcC
 ktzybALX5cLuXIOGDUPIvmKfp1RFCYIVPRRibyrQwKDlks6hiKTJZUXDtbjTheA46ff7
 tCI1xVDip3pyBrQ2R2iGjngf8qkI7qv5MDuFSSKoQjnqDv08xfClaqsCIk0u+DH8kaK9
 cfPQ==
ARC-Authentication-Results: i=1; mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=dLSdvJfX;
 spf=pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-crypto-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Return-Path: <linux-crypto-owner@vger.kernel.org>
Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67])
 by mx.google.com with ESMTP id
 w2si12341749edc.386.2019.09.10.16.19.36; 
 Tue, 10 Sep 2019 16:19:36 -0700 (PDT)
Received-SPF: pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender) client-ip=209.132.180.67; 
Authentication-Results: mx.google.com;
 dkim=pass header.i=@linaro.org header.s=google header.b=dLSdvJfX;
 spf=pass (google.com: best guess record for domain of
 linux-crypto-owner@vger.kernel.org designates 209.132.180.67
 as permitted sender)
 smtp.mailfrom=linux-crypto-owner@vger.kernel.org; 
 dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1725876AbfIJXTf (ORCPT <rfc822;sumit.garg@linaro.org>
 + 3 others); Tue, 10 Sep 2019 19:19:35 -0400
Received: from mail-wr1-f65.google.com ([209.85.221.65]:41576 "EHLO
 mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1725957AbfIJXTf (ORCPT
 <rfc822;linux-crypto@vger.kernel.org>);
 Tue, 10 Sep 2019 19:19:35 -0400
Received: by mail-wr1-f65.google.com with SMTP id h7so21363895wrw.8
 for <linux-crypto@vger.kernel.org>;
 Tue, 10 Sep 2019 16:19:31 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; 
 h=from:to:cc:subject:date:message-id:in-reply-to:references;
 bh=NtofCXrViCdXCL+kihHyw58CmFK5QeADjCfLREXSGyU=;
 b=dLSdvJfX7/voUCauPFCXZ6T8FfQLvYFWW9rxmrQFoso6jKaW1FomwJn/ut6Eu3gRZQ
 mbKgseZ/yxxESVtOq4PDoPSjd1TdJFBmbY1tScwXglW9x52IXnPH+8LD989wCjlT6xno
 EOcxtDDRz2omJFHuGjEosoCa959yfVa1PC3um1iZuFMNjqI0kHdGgBewTN1dI7GuNBy1
 f/8Fho7gRP+jsK8y7LIeorJfZtWaW77HlTlemZKlP4RgCf11V1HlN4OqUXFxTEdEprsj
 NvPGf4bN6ra2HRgI0i4gNNfqR/x7jO+H4ut6f0RWxKpYmxZFrkkreDbBvORE4w0/NHes
 47nw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
 :references;
 bh=NtofCXrViCdXCL+kihHyw58CmFK5QeADjCfLREXSGyU=;
 b=UE/ibR2TMJWOjjqA3LyJGO3T4AYCi8yawO4DkpTld6iPT0tCVj3fNejwEbjkG2obSp
 cPLn0SQ5X6/99AAm/7NfNhNSHfbDzJlR5YmIyXsf5y98vfe7M+3vOw08B8B/Z+nHSeEi
 Dq4sB+lGPX6NxGsiSGOtxIHPWDqNXl30xm4OzOL+J+uM+UcD1CiAtuJ/oi0Md/hOeX69
 sjw1rBYulmFY2ElHhYA+30cSiWcievrxNB0LIXGZIhnUrASnisi8hlXphLcDLXp+YEVa
 laubZAKFxsAJFxZxVd+ISDVBYX354iVeKp844UjwGn++7l5qZG45bA+3/FUPACr2RaJ/
 cIQw==
X-Gm-Message-State: APjAAAWNl7UGCPEbytI1JZCs9z9G1HNqhqy14p8JGJ58i80DTzXWNSof
 iAA5gJnoUU0pLBaPqQl8V45RXLccH9AMmsEh
X-Received: by 2002:adf:a382:: with SMTP id l2mr27223297wrb.194.1568157570369; 
 Tue, 10 Sep 2019 16:19:30 -0700 (PDT)
Received: from e111045-lin.nice.arm.com ([148.69.85.38])
 by smtp.gmail.com with ESMTPSA id
 y186sm2137846wmb.41.2019.09.10.16.19.29
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Tue, 10 Sep 2019 16:19:29 -0700 (PDT)
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, herbert@gondor.apana.org.au,
 ebiggers@google.com, marc.zyngier@arm.com,
 Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: [PATCH 2/2] crypto: arm64/gcm-ce - implement 4 way interleave
Date: Wed, 11 Sep 2019 00:19:00 +0100
Message-Id: <20190910231900.25445-3-ard.biesheuvel@linaro.org>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20190910231900.25445-1-ard.biesheuvel@linaro.org>
References: <20190910231900.25445-1-ard.biesheuvel@linaro.org>
Sender: linux-crypto-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-crypto.vger.kernel.org>
X-Mailing-List: linux-crypto@vger.kernel.org

To improve performance on cores with deep pipelines such as ThunderX2,
reimplement gcm(aes) using a 4-way interleave rather than the 2-way
interleave we use currently.

This comes down to a complete rewrite of the GCM part of the combined
GCM/GHASH driver, and instead of interleaving two invocations of AES
with the GHASH handling at the instruction level, the new version
uses a more coarse grained approach where each chunk of 64 bytes is
encrypted first and then ghashed (or ghashed and then decrypted in
the converse case).

The core NEON routine is now able to consume inputs of any size,
and tail blocks of less than 64 bytes are handled using overlapping
loads and stores, and processed by the same 4-way encryption and
hashing routines. This gets rid of most of the branches, and avoids
having to return to the C code to handle the tail block using a
stack buffer.

The table below compares the performance of the old driver and the new
one on various micro-architectures and running in various modes.

        |     AES-128      |     AES-192      |     AES-256      |
 #bytes | 512 | 1500 |  4k | 512 | 1500 |  4k | 512 | 1500 |  4k |
 -------+-----+------+-----+-----+------+-----+-----+------+-----+
    TX2 | 35% |  23% | 11% | 34% |  20% |  9% | 38% |  25% | 16% |
   EMAG | 11% |   6% |  3% | 12% |   4% |  2% | 11% |   4% |  2% |
    A72 |  8% |   5% | -4% |  9% |   4% | -5% |  7% |   4% | -5% |
    A53 | 11% |   6% | -1% | 10% |   8% | -1% | 10% |   8% | -2% |

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 arch/arm64/crypto/ghash-ce-core.S | 501 ++++++++++++++------
 arch/arm64/crypto/ghash-ce-glue.c | 293 +++++-------
 2 files changed, 467 insertions(+), 327 deletions(-)

-- 
2.17.1

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index 410e8afcf5a7..a791c4adf8e6 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -13,8 +13,8 @@
 	T1		.req	v2
 	T2		.req	v3
 	MASK		.req	v4
-	XL		.req	v5
-	XM		.req	v6
+	XM		.req	v5
+	XL		.req	v6
 	XH		.req	v7
 	IN1		.req	v7
 
@@ -358,20 +358,37 @@ ENTRY(pmull_ghash_update_p8)
 	__pmull_ghash	p8
 ENDPROC(pmull_ghash_update_p8)
 
-	KS0		.req	v12
-	KS1		.req	v13
-	INP0		.req	v14
-	INP1		.req	v15
-
-	.macro		load_round_keys, rounds, rk
-	cmp		\rounds, #12
-	blo		2222f		/* 128 bits */
-	beq		1111f		/* 192 bits */
-	ld1		{v17.4s-v18.4s}, [\rk], #32
-1111:	ld1		{v19.4s-v20.4s}, [\rk], #32
-2222:	ld1		{v21.4s-v24.4s}, [\rk], #64
-	ld1		{v25.4s-v28.4s}, [\rk], #64
-	ld1		{v29.4s-v31.4s}, [\rk]
+	KS0		.req	v8
+	KS1		.req	v9
+	KS2		.req	v10
+	KS3		.req	v11
+
+	INP0		.req	v21
+	INP1		.req	v22
+	INP2		.req	v23
+	INP3		.req	v24
+
+	K0		.req	v25
+	K1		.req	v26
+	K2		.req	v27
+	K3		.req	v28
+	K4		.req	v12
+	K5		.req	v13
+	K6		.req	v4
+	K7		.req	v5
+	K8		.req	v14
+	K9		.req	v15
+	KK		.req	v29
+	KL		.req	v30
+	KM		.req	v31
+
+	.macro		load_round_keys, rounds, rk, tmp
+	add		\tmp, \rk, #64
+	ld1		{K0.4s-K3.4s}, [\rk]
+	ld1		{K4.4s-K5.4s}, [\tmp]
+	add		\tmp, \rk, \rounds, lsl #4
+	sub		\tmp, \tmp, #32
+	ld1		{KK.4s-KM.4s}, [\tmp]
 	.endm
 
 	.macro		enc_round, state, key
@@ -379,197 +396,367 @@ ENDPROC(pmull_ghash_update_p8)
 	aesmc		\state\().16b, \state\().16b
 	.endm
 
-	.macro		enc_block, state, rounds
-	cmp		\rounds, #12
-	b.lo		2222f		/* 128 bits */
-	b.eq		1111f		/* 192 bits */
-	enc_round	\state, v17
-	enc_round	\state, v18
-1111:	enc_round	\state, v19
-	enc_round	\state, v20
-2222:	.irp		key, v21, v22, v23, v24, v25, v26, v27, v28, v29
+	.macro		enc_qround, s0, s1, s2, s3, key
+	enc_round	\s0, \key
+	enc_round	\s1, \key
+	enc_round	\s2, \key
+	enc_round	\s3, \key
+	.endm
+
+	.macro		enc_block, state, rounds, rk, tmp
+	add		\tmp, \rk, #96
+	ld1		{K6.4s-K7.4s}, [\tmp], #32
+	.irp		key, K0, K1, K2, K3, K4 K5
 	enc_round	\state, \key
 	.endr
-	aese		\state\().16b, v30.16b
-	eor		\state\().16b, \state\().16b, v31.16b
+
+	tbnz		\rounds, #2, .Lnot128_\@
+.Lout256_\@:
+	enc_round	\state, K6
+	enc_round	\state, K7
+
+.Lout192_\@:
+	enc_round	\state, KK
+	aese		\state\().16b, KL.16b
+	eor		\state\().16b, \state\().16b, KM.16b
+
+	.subsection	1
+.Lnot128_\@:
+	ld1		{K8.4s-K9.4s}, [\tmp], #32
+	enc_round	\state, K6
+	enc_round	\state, K7
+	ld1		{K6.4s-K7.4s}, [\tmp]
+	enc_round	\state, K8
+	enc_round	\state, K9
+	tbz		\rounds, #1, .Lout192_\@
+	b		.Lout256_\@
+	.previous
 	.endm
 
+	.align		6
 	.macro		pmull_gcm_do_crypt, enc
-	ld1		{SHASH.2d}, [x4], #16
-	ld1		{HH.2d}, [x4]
-	ld1		{XL.2d}, [x1]
-	ldr		x8, [x5, #8]			// load lower counter
+	stp		x29, x30, [sp, #-32]!
+	mov		x29, sp
+	str		x19, [sp, #24]
+
+	load_round_keys	x7, x6, x8
+
+	ld1		{SHASH.2d}, [x3], #16
+	ld1		{HH.2d-HH4.2d}, [x3]
 
-	movi		MASK.16b, #0xe1
 	trn1		SHASH2.2d, SHASH.2d, HH.2d
 	trn2		T1.2d, SHASH.2d, HH.2d
-CPU_LE(	rev		x8, x8		)
-	shl		MASK.2d, MASK.2d, #57
 	eor		SHASH2.16b, SHASH2.16b, T1.16b
 
-	.if		\enc == 1
-	ldr		x10, [sp]
-	ld1		{KS0.16b-KS1.16b}, [x10]
-	.endif
+	trn1		HH34.2d, HH3.2d, HH4.2d
+	trn2		T1.2d, HH3.2d, HH4.2d
+	eor		HH34.16b, HH34.16b, T1.16b
 
-	cbnz		x6, 4f
+	ld1		{XL.2d}, [x4]
 
-0:	ld1		{INP0.16b-INP1.16b}, [x3], #32
+	cbz		x0, 3f				// tag only?
 
-	rev		x9, x8
-	add		x11, x8, #1
-	add		x8, x8, #2
+	ldr		w8, [x5, #12]			// load lower counter
+CPU_LE(	rev		w8, w8		)
 
-	.if		\enc == 1
-	eor		INP0.16b, INP0.16b, KS0.16b	// encrypt input
-	eor		INP1.16b, INP1.16b, KS1.16b
+0:	mov		w9, #4				// max blocks per round
+	add		x10, x0, #0xf
+	lsr		x10, x10, #4			// remaining blocks
+
+	subs		x0, x0, #64
+	csel		w9, w10, w9, mi
+	add		w8, w8, w9
+
+	bmi		1f
+	ld1		{INP0.16b-INP3.16b}, [x2], #64
+	.subsection	1
+	/*
+	 * Populate the four input registers right to left with up to 63 bytes
+	 * of data, using overlapping loads to avoid branches.
+	 *
+	 *                INP0     INP1     INP2     INP3
+	 *  1 byte     |        |        |        |x       |
+	 * 16 bytes    |        |        |        |xxxxxxxx|
+	 * 17 bytes    |        |        |xxxxxxxx|x       |
+	 * 47 bytes    |        |xxxxxxxx|xxxxxxxx|xxxxxxx |
+	 * etc etc
+	 *
+	 * Note that this code may read up to 15 bytes before the start of
+	 * the input. It is up to the calling code to ensure this is safe if
+	 * this happens in the first iteration of the loop (i.e., when the
+	 * input size is < 16 bytes)
+	 */
+1:	mov		x15, #16
+	ands		x19, x0, #0xf
+	csel		x19, x19, x15, ne
+	adr_l		x17, .Lpermute_table + 16
+
+	sub		x11, x15, x19
+	add		x12, x17, x11
+	sub		x17, x17, x11
+	ld1		{T1.16b}, [x12]
+	sub		x10, x1, x11
+	sub		x11, x2, x11
+
+	cmp		x0, #-16
+	csel		x14, x15, xzr, gt
+	cmp		x0, #-32
+	csel		x15, x15, xzr, gt
+	cmp		x0, #-48
+	csel		x16, x19, xzr, gt
+	csel		x1, x1, x10, gt
+	csel		x2, x2, x11, gt
+
+	ld1		{INP0.16b}, [x2], x14
+	ld1		{INP1.16b}, [x2], x15
+	ld1		{INP2.16b}, [x2], x16
+	ld1		{INP3.16b}, [x2]
+	tbl		INP3.16b, {INP3.16b}, T1.16b
+	b		2f
+	.previous
+
+2:	.if		\enc == 0
+	bl		pmull_gcm_ghash_4x
 	.endif
 
-	ld1		{KS0.8b}, [x5]			// load upper counter
-	rev		x11, x11
-	sub		w0, w0, #2
-	mov		KS1.8b, KS0.8b
-	ins		KS0.d[1], x9			// set lower counter
-	ins		KS1.d[1], x11
+	bl		pmull_gcm_enc_4x
 
-	rev64		T1.16b, INP1.16b
+	tbnz		x0, #63, 6f
+	st1		{INP0.16b-INP3.16b}, [x1], #64
+	.if		\enc == 1
+	bl		pmull_gcm_ghash_4x
+	.endif
+	bne		0b
 
-	cmp		w7, #12
-	b.ge		2f				// AES-192/256?
+3:	ldp		x19, x10, [sp, #24]
+	cbz		x10, 5f				// output tag?
 
-1:	enc_round	KS0, v21
-	ext		IN1.16b, T1.16b, T1.16b, #8
+	ld1		{INP3.16b}, [x10]		// load lengths[]
+	mov		w9, #1
+	bl		pmull_gcm_ghash_4x
 
-	enc_round	KS1, v21
-	pmull2		XH2.1q, SHASH.2d, IN1.2d	// a1 * b1
+	mov		w11, #(0x1 << 24)		// BE '1U'
+	ld1		{KS0.16b}, [x5]
+	mov		KS0.s[3], w11
 
-	enc_round	KS0, v22
-	eor		T1.16b, T1.16b, IN1.16b
+	enc_block	KS0, x7, x6, x12
 
-	enc_round	KS1, v22
-	pmull		XL2.1q, SHASH.1d, IN1.1d	// a0 * b0
+	ext		XL.16b, XL.16b, XL.16b, #8
+	rev64		XL.16b, XL.16b
+	eor		XL.16b, XL.16b, KS0.16b
+	st1		{XL.16b}, [x10]			// store tag
 
-	enc_round	KS0, v23
-	pmull		XM2.1q, SHASH2.1d, T1.1d	// (a1 + a0)(b1 + b0)
+4:	ldp		x29, x30, [sp], #32
+	ret
 
-	enc_round	KS1, v23
-	rev64		T1.16b, INP0.16b
-	ext		T2.16b, XL.16b, XL.16b, #8
+5:
+CPU_LE(	rev		w8, w8		)
+	str		w8, [x5, #12]			// store lower counter
+	st1		{XL.2d}, [x4]
+	b		4b
+
+6:	ld1		{T1.16b-T2.16b}, [x17], #32	// permute vectors
+	sub		x17, x17, x19, lsl #1
+
+	cmp		w9, #1
+	beq		7f
+	.subsection	1
+7:	ld1		{INP2.16b}, [x1]
+	tbx		INP2.16b, {INP3.16b}, T1.16b
+	mov		INP3.16b, INP2.16b
+	b		8f
+	.previous
+
+	st1		{INP0.16b}, [x1], x14
+	st1		{INP1.16b}, [x1], x15
+	st1		{INP2.16b}, [x1], x16
+	tbl		INP3.16b, {INP3.16b}, T1.16b
+	tbx		INP3.16b, {INP2.16b}, T2.16b
+8:	st1		{INP3.16b}, [x1]
 
-	enc_round	KS0, v24
-	ext		IN1.16b, T1.16b, T1.16b, #8
-	eor		T1.16b, T1.16b, T2.16b
+	.if		\enc == 1
+	ld1		{T1.16b}, [x17]
+	tbl		INP3.16b, {INP3.16b}, T1.16b	// clear non-data bits
+	bl		pmull_gcm_ghash_4x
+	.endif
+	b		3b
+	.endm
 
-	enc_round	KS1, v24
-	eor		XL.16b, XL.16b, IN1.16b
+	/*
+	 * void pmull_gcm_encrypt(int blocks, u8 dst[], const u8 src[],
+	 *			  struct ghash_key const *k, u64 dg[], u8 ctr[],
+	 *			  int rounds, u8 tag)
+	 */
+ENTRY(pmull_gcm_encrypt)
+	pmull_gcm_do_crypt	1
+ENDPROC(pmull_gcm_encrypt)
 
-	enc_round	KS0, v25
-	eor		T1.16b, T1.16b, XL.16b
+	/*
+	 * void pmull_gcm_decrypt(int blocks, u8 dst[], const u8 src[],
+	 *			  struct ghash_key const *k, u64 dg[], u8 ctr[],
+	 *			  int rounds, u8 tag)
+	 */
+ENTRY(pmull_gcm_decrypt)
+	pmull_gcm_do_crypt	0
+ENDPROC(pmull_gcm_decrypt)
 
-	enc_round	KS1, v25
-	pmull2		XH.1q, HH.2d, XL.2d		// a1 * b1
+pmull_gcm_ghash_4x:
+	movi		MASK.16b, #0xe1
+	shl		MASK.2d, MASK.2d, #57
 
-	enc_round	KS0, v26
-	pmull		XL.1q, HH.1d, XL.1d		// a0 * b0
+	rev64		T1.16b, INP0.16b
+	rev64		T2.16b, INP1.16b
+	rev64		TT3.16b, INP2.16b
+	rev64		TT4.16b, INP3.16b
 
-	enc_round	KS1, v26
-	pmull2		XM.1q, SHASH2.2d, T1.2d		// (a1 + a0)(b1 + b0)
+	ext		XL.16b, XL.16b, XL.16b, #8
 
-	enc_round	KS0, v27
-	eor		XL.16b, XL.16b, XL2.16b
-	eor		XH.16b, XH.16b, XH2.16b
+	tbz		w9, #2, 0f			// <4 blocks?
+	.subsection	1
+0:	movi		XH2.16b, #0
+	movi		XM2.16b, #0
+	movi		XL2.16b, #0
 
-	enc_round	KS1, v27
-	eor		XM.16b, XM.16b, XM2.16b
-	ext		T1.16b, XL.16b, XH.16b, #8
+	tbz		w9, #0, 1f			// 2 blocks?
+	tbz		w9, #1, 2f			// 1 block?
 
-	enc_round	KS0, v28
-	eor		T2.16b, XL.16b, XH.16b
-	eor		XM.16b, XM.16b, T1.16b
+	eor		T2.16b, T2.16b, XL.16b
+	ext		T1.16b, T2.16b, T2.16b, #8
+	b		.Lgh3
 
-	enc_round	KS1, v28
-	eor		XM.16b, XM.16b, T2.16b
+1:	eor		TT3.16b, TT3.16b, XL.16b
+	ext		T2.16b, TT3.16b, TT3.16b, #8
+	b		.Lgh2
 
-	enc_round	KS0, v29
-	pmull		T2.1q, XL.1d, MASK.1d
+2:	eor		TT4.16b, TT4.16b, XL.16b
+	ext		IN1.16b, TT4.16b, TT4.16b, #8
+	b		.Lgh1
+	.previous
 
-	enc_round	KS1, v29
-	mov		XH.d[0], XM.d[1]
-	mov		XM.d[1], XL.d[0]
+	eor		T1.16b, T1.16b, XL.16b
+	ext		IN1.16b, T1.16b, T1.16b, #8
 
-	aese		KS0.16b, v30.16b
-	eor		XL.16b, XM.16b, T2.16b
+	pmull2		XH2.1q, HH4.2d, IN1.2d		// a1 * b1
+	eor		T1.16b, T1.16b, IN1.16b
+	pmull		XL2.1q, HH4.1d, IN1.1d		// a0 * b0
+	pmull2		XM2.1q, HH34.2d, T1.2d		// (a1 + a0)(b1 + b0)
 
-	aese		KS1.16b, v30.16b
-	ext		T2.16b, XL.16b, XL.16b, #8
+	ext		T1.16b, T2.16b, T2.16b, #8
+.Lgh3:	eor		T2.16b, T2.16b, T1.16b
+	pmull2		XH.1q, HH3.2d, T1.2d		// a1 * b1
+	pmull		XL.1q, HH3.1d, T1.1d		// a0 * b0
+	pmull		XM.1q, HH34.1d, T2.1d		// (a1 + a0)(b1 + b0)
 
-	eor		KS0.16b, KS0.16b, v31.16b
-	pmull		XL.1q, XL.1d, MASK.1d
-	eor		T2.16b, T2.16b, XH.16b
+	eor		XH2.16b, XH2.16b, XH.16b
+	eor		XL2.16b, XL2.16b, XL.16b
+	eor		XM2.16b, XM2.16b, XM.16b
 
-	eor		KS1.16b, KS1.16b, v31.16b
-	eor		XL.16b, XL.16b, T2.16b
+	ext		T2.16b, TT3.16b, TT3.16b, #8
+.Lgh2:	eor		TT3.16b, TT3.16b, T2.16b
+	pmull2		XH.1q, HH.2d, T2.2d		// a1 * b1
+	pmull		XL.1q, HH.1d, T2.1d		// a0 * b0
+	pmull2		XM.1q, SHASH2.2d, TT3.2d	// (a1 + a0)(b1 + b0)
 
-	.if		\enc == 0
-	eor		INP0.16b, INP0.16b, KS0.16b
-	eor		INP1.16b, INP1.16b, KS1.16b
-	.endif
+	eor		XH2.16b, XH2.16b, XH.16b
+	eor		XL2.16b, XL2.16b, XL.16b
+	eor		XM2.16b, XM2.16b, XM.16b
 
-	st1		{INP0.16b-INP1.16b}, [x2], #32
+	ext		IN1.16b, TT4.16b, TT4.16b, #8
+.Lgh1:	eor		TT4.16b, TT4.16b, IN1.16b
+	pmull		XL.1q, SHASH.1d, IN1.1d		// a0 * b0
+	pmull2		XH.1q, SHASH.2d, IN1.2d		// a1 * b1
+	pmull		XM.1q, SHASH2.1d, TT4.1d	// (a1 + a0)(b1 + b0)
 
-	cbnz		w0, 0b
+	eor		XH.16b, XH.16b, XH2.16b
+	eor		XL.16b, XL.16b, XL2.16b
+	eor		XM.16b, XM.16b, XM2.16b
 
-CPU_LE(	rev		x8, x8		)
-	st1		{XL.2d}, [x1]
-	str		x8, [x5, #8]			// store lower counter
+	eor		T2.16b, XL.16b, XH.16b
+	ext		T1.16b, XL.16b, XH.16b, #8
+	eor		XM.16b, XM.16b, T2.16b
 
-	.if		\enc == 1
-	st1		{KS0.16b-KS1.16b}, [x10]
-	.endif
+	__pmull_reduce_p64
+
+	eor		T2.16b, T2.16b, XH.16b
+	eor		XL.16b, XL.16b, T2.16b
 
 	ret
+ENDPROC(pmull_gcm_ghash_4x)
+
+pmull_gcm_enc_4x:
+	ld1		{KS0.16b}, [x5]			// load upper counter
+	sub		w10, w8, #4
+	sub		w11, w8, #3
+	sub		w12, w8, #2
+	sub		w13, w8, #1
+	rev		w10, w10
+	rev		w11, w11
+	rev		w12, w12
+	rev		w13, w13
+	mov		KS1.16b, KS0.16b
+	mov		KS2.16b, KS0.16b
+	mov		KS3.16b, KS0.16b
+	ins		KS0.s[3], w10			// set lower counter
+	ins		KS1.s[3], w11
+	ins		KS2.s[3], w12
+	ins		KS3.s[3], w13
+
+	add		x10, x6, #96			// round key pointer
+	ld1		{K6.4s-K7.4s}, [x10], #32
+	.irp		key, K0, K1, K2, K3, K4, K5
+	enc_qround	KS0, KS1, KS2, KS3, \key
+	.endr
 
-2:	b.eq		3f				// AES-192?
-	enc_round	KS0, v17
-	enc_round	KS1, v17
-	enc_round	KS0, v18
-	enc_round	KS1, v18
-3:	enc_round	KS0, v19
-	enc_round	KS1, v19
-	enc_round	KS0, v20
-	enc_round	KS1, v20
-	b		1b
+	tbnz		x7, #2, .Lnot128
+	.subsection	1
+.Lnot128:
+	ld1		{K8.4s-K9.4s}, [x10], #32
+	.irp		key, K6, K7
+	enc_qround	KS0, KS1, KS2, KS3, \key
+	.endr
+	ld1		{K6.4s-K7.4s}, [x10]
+	.irp		key, K8, K9
+	enc_qround	KS0, KS1, KS2, KS3, \key
+	.endr
+	tbz		x7, #1, .Lout192
+	b		.Lout256
+	.previous
 
-4:	load_round_keys	w7, x6
-	b		0b
-	.endm
+.Lout256:
+	.irp		key, K6, K7
+	enc_qround	KS0, KS1, KS2, KS3, \key
+	.endr
 
-	/*
-	 * void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[], const u8 src[],
-	 *			  struct ghash_key const *k, u8 ctr[],
-	 *			  int rounds, u8 ks[])
-	 */
-ENTRY(pmull_gcm_encrypt)
-	pmull_gcm_do_crypt	1
-ENDPROC(pmull_gcm_encrypt)
+.Lout192:
+	enc_qround	KS0, KS1, KS2, KS3, KK
 
-	/*
-	 * void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[], const u8 src[],
-	 *			  struct ghash_key const *k, u8 ctr[],
-	 *			  int rounds)
-	 */
-ENTRY(pmull_gcm_decrypt)
-	pmull_gcm_do_crypt	0
-ENDPROC(pmull_gcm_decrypt)
+	aese		KS0.16b, KL.16b
+	aese		KS1.16b, KL.16b
+	aese		KS2.16b, KL.16b
+	aese		KS3.16b, KL.16b
+
+	eor		KS0.16b, KS0.16b, KM.16b
+	eor		KS1.16b, KS1.16b, KM.16b
+	eor		KS2.16b, KS2.16b, KM.16b
+	eor		KS3.16b, KS3.16b, KM.16b
+
+	eor		INP0.16b, INP0.16b, KS0.16b
+	eor		INP1.16b, INP1.16b, KS1.16b
+	eor		INP2.16b, INP2.16b, KS2.16b
+	eor		INP3.16b, INP3.16b, KS3.16b
 
-	/*
-	 * void pmull_gcm_encrypt_block(u8 dst[], u8 src[], u8 rk[], int rounds)
-	 */
-ENTRY(pmull_gcm_encrypt_block)
-	cbz		x2, 0f
-	load_round_keys	w3, x2
-0:	ld1		{v0.16b}, [x1]
-	enc_block	v0, w3
-	st1		{v0.16b}, [x0]
 	ret
-ENDPROC(pmull_gcm_encrypt_block)
+ENDPROC(pmull_gcm_enc_4x)
+
+	.section	".rodata", "a"
+	.align		6
+.Lpermute_table:
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff
+	.byte		 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7
+	.byte		 0x8,  0x9,  0xa,  0xb,  0xc,  0xd,  0xe,  0xf
+	.previous
diff --git a/arch/arm64/crypto/ghash-ce-glue.c b/arch/arm64/crypto/ghash-ce-glue.c
index 70b1469783f9..522cf004ce65 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -58,17 +58,15 @@ asmlinkage void pmull_ghash_update_p8(int blocks, u64 dg[], const char *src,
 				      struct ghash_key const *k,
 				      const char *head);
 
-asmlinkage void pmull_gcm_encrypt(int blocks, u64 dg[], u8 dst[],
-				  const u8 src[], struct ghash_key const *k,
+asmlinkage void pmull_gcm_encrypt(int bytes, u8 dst[], const u8 src[],
+				  struct ghash_key const *k, u64 dg[],
 				  u8 ctr[], u32 const rk[], int rounds,
-				  u8 ks[]);
+				  u8 tag[]);
 
-asmlinkage void pmull_gcm_decrypt(int blocks, u64 dg[], u8 dst[],
-				  const u8 src[], struct ghash_key const *k,
-				  u8 ctr[], u32 const rk[], int rounds);
-
-asmlinkage void pmull_gcm_encrypt_block(u8 dst[], u8 const src[],
-					u32 const rk[], int rounds);
+asmlinkage void pmull_gcm_decrypt(int bytes, u8 dst[], const u8 src[],
+				  struct ghash_key const *k, u64 dg[],
+				  u8 ctr[], u32 const rk[], int rounds,
+				  u8 tag[]);
 
 static int ghash_init(struct shash_desc *desc)
 {
@@ -85,7 +83,7 @@ static void ghash_do_update(int blocks, u64 dg[], const char *src,
 						struct ghash_key const *k,
 						const char *head))
 {
-	if (likely(crypto_simd_usable())) {
+	if (likely(crypto_simd_usable() && simd_update)) {
 		kernel_neon_begin();
 		simd_update(blocks, dg, src, key, head);
 		kernel_neon_end();
@@ -398,136 +396,112 @@ static void gcm_calculate_auth_mac(struct aead_request *req, u64 dg[])
 	}
 }
 
-static void gcm_final(struct aead_request *req, struct gcm_aes_ctx *ctx,
-		      u64 dg[], u8 tag[], int cryptlen)
-{
-	u8 mac[AES_BLOCK_SIZE];
-	u128 lengths;
-
-	lengths.a = cpu_to_be64(req->assoclen * 8);
-	lengths.b = cpu_to_be64(cryptlen * 8);
-
-	ghash_do_update(1, dg, (void *)&lengths, &ctx->ghash_key, NULL,
-			pmull_ghash_update_p64);
-
-	put_unaligned_be64(dg[1], mac);
-	put_unaligned_be64(dg[0], mac + 8);
-
-	crypto_xor(tag, mac, AES_BLOCK_SIZE);
-}
-
 static int gcm_encrypt(struct aead_request *req)
 {
 	struct crypto_aead *aead = crypto_aead_reqtfm(req);
 	struct gcm_aes_ctx *ctx = crypto_aead_ctx(aead);
+	int nrounds = num_rounds(&ctx->aes_key);
 	struct skcipher_walk walk;
+	u8 buf[AES_BLOCK_SIZE];
 	u8 iv[AES_BLOCK_SIZE];
-	u8 ks[2 * AES_BLOCK_SIZE];
-	u8 tag[AES_BLOCK_SIZE];
 	u64 dg[2] = {};
-	int nrounds = num_rounds(&ctx->aes_key);
+	u128 lengths;
+	u8 *tag;
 	int err;
 
+	lengths.a = cpu_to_be64(req->assoclen * 8);
+	lengths.b = cpu_to_be64(req->cryptlen * 8);
+
 	if (req->assoclen)
 		gcm_calculate_auth_mac(req, dg);
 
 	memcpy(iv, req->iv, GCM_IV_SIZE);
-	put_unaligned_be32(1, iv + GCM_IV_SIZE);
+	put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
 	err = skcipher_walk_aead_encrypt(&walk, req, false);
 
-	if (likely(crypto_simd_usable() && walk.total >= 2 * AES_BLOCK_SIZE)) {
-		u32 const *rk = NULL;
-
-		kernel_neon_begin();
-		pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc, nrounds);
-		put_unaligned_be32(2, iv + GCM_IV_SIZE);
-		pmull_gcm_encrypt_block(ks, iv, NULL, nrounds);
-		put_unaligned_be32(3, iv + GCM_IV_SIZE);
-		pmull_gcm_encrypt_block(ks + AES_BLOCK_SIZE, iv, NULL, nrounds);
-		put_unaligned_be32(4, iv + GCM_IV_SIZE);
-
+	if (likely(crypto_simd_usable())) {
 		do {
-			int blocks = walk.nbytes / (2 * AES_BLOCK_SIZE) * 2;
+			const u8 *src = walk.src.virt.addr;
+			u8 *dst = walk.dst.virt.addr;
+			int nbytes = walk.nbytes;
+
+			tag = (u8 *)&lengths;
 
-			if (rk)
-				kernel_neon_begin();
+			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+				src = dst = memcpy(buf + sizeof(buf) - nbytes,
+						   src, nbytes);
+			} else if (nbytes < walk.total) {
+				nbytes &= ~(AES_BLOCK_SIZE - 1);
+				tag = NULL;
+			}
 
-			pmull_gcm_encrypt(blocks, dg, walk.dst.virt.addr,
-					  walk.src.virt.addr, &ctx->ghash_key,
-					  iv, rk, nrounds, ks);
+			kernel_neon_begin();
+			pmull_gcm_encrypt(nbytes, dst, src, &ctx->ghash_key, dg,
+					  iv, ctx->aes_key.key_enc, nrounds,
+					  tag);
 			kernel_neon_end();
 
-			err = skcipher_walk_done(&walk,
-					walk.nbytes % (2 * AES_BLOCK_SIZE));
+			if (unlikely(!nbytes))
+				break;
 
-			rk = ctx->aes_key.key_enc;
-		} while (walk.nbytes >= 2 * AES_BLOCK_SIZE);
-	} else {
-		aes_encrypt(&ctx->aes_key, tag, iv);
-		put_unaligned_be32(2, iv + GCM_IV_SIZE);
+			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+				memcpy(walk.dst.virt.addr,
+				       buf + sizeof(buf) - nbytes, nbytes);
 
-		while (walk.nbytes >= (2 * AES_BLOCK_SIZE)) {
-			const int blocks =
-				walk.nbytes / (2 * AES_BLOCK_SIZE) * 2;
+			err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+		} while (walk.nbytes);
+	} else {
+		while (walk.nbytes >= AES_BLOCK_SIZE) {
+			int blocks = walk.nbytes / AES_BLOCK_SIZE;
+			const u8 *src = walk.src.virt.addr;
 			u8 *dst = walk.dst.virt.addr;
-			u8 *src = walk.src.virt.addr;
 			int remaining = blocks;
 
 			do {
-				aes_encrypt(&ctx->aes_key, ks, iv);
-				crypto_xor_cpy(dst, src, ks, AES_BLOCK_SIZE);
+				aes_encrypt(&ctx->aes_key, buf, iv);
+				crypto_xor_cpy(dst, src, buf, AES_BLOCK_SIZE);
 				crypto_inc(iv, AES_BLOCK_SIZE);
 
 				dst += AES_BLOCK_SIZE;
 				src += AES_BLOCK_SIZE;
 			} while (--remaining > 0);
 
-			ghash_do_update(blocks, dg,
-					walk.dst.virt.addr, &ctx->ghash_key,
-					NULL, pmull_ghash_update_p64);
+			ghash_do_update(blocks, dg, walk.dst.virt.addr,
+					&ctx->ghash_key, NULL, NULL);
 
 			err = skcipher_walk_done(&walk,
-						 walk.nbytes % (2 * AES_BLOCK_SIZE));
-		}
-		if (walk.nbytes) {
-			aes_encrypt(&ctx->aes_key, ks, iv);
-			if (walk.nbytes > AES_BLOCK_SIZE) {
-				crypto_inc(iv, AES_BLOCK_SIZE);
-				aes_encrypt(&ctx->aes_key, ks + AES_BLOCK_SIZE, iv);
-			}
+						 walk.nbytes % AES_BLOCK_SIZE);
 		}
-	}
 
-	/* handle the tail */
-	if (walk.nbytes) {
-		u8 buf[GHASH_BLOCK_SIZE];
-		unsigned int nbytes = walk.nbytes;
-		u8 *dst = walk.dst.virt.addr;
-		u8 *head = NULL;
+		/* handle the tail */
+		if (walk.nbytes) {
+			aes_encrypt(&ctx->aes_key, buf, iv);
 
-		crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr, ks,
-			       walk.nbytes);
+			crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
+				       buf, walk.nbytes);
 
-		if (walk.nbytes > GHASH_BLOCK_SIZE) {
-			head = dst;
-			dst += GHASH_BLOCK_SIZE;
-			nbytes %= GHASH_BLOCK_SIZE;
+			memcpy(buf, walk.dst.virt.addr, walk.nbytes);
+			memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
 		}
 
-		memcpy(buf, dst, nbytes);
-		memset(buf + nbytes, 0, GHASH_BLOCK_SIZE - nbytes);
-		ghash_do_update(!!nbytes, dg, buf, &ctx->ghash_key, head,
-				pmull_ghash_update_p64);
+		tag = (u8 *)&lengths;
+		ghash_do_update(1, dg, tag, &ctx->ghash_key,
+				walk.nbytes ? buf : NULL, NULL);
 
-		err = skcipher_walk_done(&walk, 0);
+		if (walk.nbytes)
+			err = skcipher_walk_done(&walk, 0);
+
+		put_unaligned_be64(dg[1], tag);
+		put_unaligned_be64(dg[0], tag + 8);
+		put_unaligned_be32(1, iv + GCM_IV_SIZE);
+		aes_encrypt(&ctx->aes_key, iv, iv);
+		crypto_xor(tag, iv, AES_BLOCK_SIZE);
 	}
 
 	if (err)
 		return err;
 
-	gcm_final(req, ctx, dg, tag, req->cryptlen);
-
 	/* copy authtag to end of dst */
 	scatterwalk_map_and_copy(tag, req->dst, req->assoclen + req->cryptlen,
 				 crypto_aead_authsize(aead), 1);
@@ -540,75 +514,65 @@ static int gcm_decrypt(struct aead_request *req)
 	struct crypto_aead *aead = crypto_aead_reqtfm(req);
 	struct gcm_aes_ctx *ctx = crypto_aead_ctx(aead);
 	unsigned int authsize = crypto_aead_authsize(aead);
+	int nrounds = num_rounds(&ctx->aes_key);
 	struct skcipher_walk walk;
-	u8 iv[2 * AES_BLOCK_SIZE];
-	u8 tag[AES_BLOCK_SIZE];
-	u8 buf[2 * GHASH_BLOCK_SIZE];
+	u8 buf[AES_BLOCK_SIZE];
+	u8 iv[AES_BLOCK_SIZE];
 	u64 dg[2] = {};
-	int nrounds = num_rounds(&ctx->aes_key);
+	u128 lengths;
+	u8 *tag;
 	int err;
 
+	lengths.a = cpu_to_be64(req->assoclen * 8);
+	lengths.b = cpu_to_be64((req->cryptlen - authsize) * 8);
+
 	if (req->assoclen)
 		gcm_calculate_auth_mac(req, dg);
 
 	memcpy(iv, req->iv, GCM_IV_SIZE);
-	put_unaligned_be32(1, iv + GCM_IV_SIZE);
+	put_unaligned_be32(2, iv + GCM_IV_SIZE);
 
 	err = skcipher_walk_aead_decrypt(&walk, req, false);
 
-	if (likely(crypto_simd_usable() && walk.total >= 2 * AES_BLOCK_SIZE)) {
-		u32 const *rk = NULL;
-
-		kernel_neon_begin();
-		pmull_gcm_encrypt_block(tag, iv, ctx->aes_key.key_enc, nrounds);
-		put_unaligned_be32(2, iv + GCM_IV_SIZE);
-
+	if (likely(crypto_simd_usable())) {
 		do {
-			int blocks = walk.nbytes / (2 * AES_BLOCK_SIZE) * 2;
-			int rem = walk.total - blocks * AES_BLOCK_SIZE;
-
-			if (rk)
-				kernel_neon_begin();
-
-			pmull_gcm_decrypt(blocks, dg, walk.dst.virt.addr,
-					  walk.src.virt.addr, &ctx->ghash_key,
-					  iv, rk, nrounds);
-
-			/* check if this is the final iteration of the loop */
-			if (rem < (2 * AES_BLOCK_SIZE)) {
-				u8 *iv2 = iv + AES_BLOCK_SIZE;
-
-				if (rem > AES_BLOCK_SIZE) {
-					memcpy(iv2, iv, AES_BLOCK_SIZE);
-					crypto_inc(iv2, AES_BLOCK_SIZE);
-				}
+			const u8 *src = walk.src.virt.addr;
+			u8 *dst = walk.dst.virt.addr;
+			int nbytes = walk.nbytes;
 
-				pmull_gcm_encrypt_block(iv, iv, NULL, nrounds);
+			tag = (u8 *)&lengths;
 
-				if (rem > AES_BLOCK_SIZE)
-					pmull_gcm_encrypt_block(iv2, iv2, NULL,
-								nrounds);
+			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE)) {
+				src = dst = memcpy(buf + sizeof(buf) - nbytes,
+						   src, nbytes);
+			} else if (nbytes < walk.total) {
+				nbytes &= ~(AES_BLOCK_SIZE - 1);
+				tag = NULL;
 			}
 
+			kernel_neon_begin();
+			pmull_gcm_decrypt(nbytes, dst, src, &ctx->ghash_key, dg,
+					  iv, ctx->aes_key.key_enc, nrounds,
+					  tag);
 			kernel_neon_end();
 
-			err = skcipher_walk_done(&walk,
-					walk.nbytes % (2 * AES_BLOCK_SIZE));
+			if (unlikely(!nbytes))
+				break;
 
-			rk = ctx->aes_key.key_enc;
-		} while (walk.nbytes >= 2 * AES_BLOCK_SIZE);
-	} else {
-		aes_encrypt(&ctx->aes_key, tag, iv);
-		put_unaligned_be32(2, iv + GCM_IV_SIZE);
+			if (unlikely(nbytes > 0 && nbytes < AES_BLOCK_SIZE))
+				memcpy(walk.dst.virt.addr,
+				       buf + sizeof(buf) - nbytes, nbytes);
 
-		while (walk.nbytes >= (2 * AES_BLOCK_SIZE)) {
-			int blocks = walk.nbytes / (2 * AES_BLOCK_SIZE) * 2;
+			err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+		} while (walk.nbytes);
+	} else {
+		while (walk.nbytes >= AES_BLOCK_SIZE) {
+			int blocks = walk.nbytes / AES_BLOCK_SIZE;
+			const u8 *src = walk.src.virt.addr;
 			u8 *dst = walk.dst.virt.addr;
-			u8 *src = walk.src.virt.addr;
 
 			ghash_do_update(blocks, dg, walk.src.virt.addr,
-					&ctx->ghash_key, NULL,
-					pmull_ghash_update_p64);
+					&ctx->ghash_key, NULL, NULL);
 
 			do {
 				aes_encrypt(&ctx->aes_key, buf, iv);
@@ -620,49 +584,38 @@ static int gcm_decrypt(struct aead_request *req)
 			} while (--blocks > 0);
 
 			err = skcipher_walk_done(&walk,
-						 walk.nbytes % (2 * AES_BLOCK_SIZE));
+						 walk.nbytes % AES_BLOCK_SIZE);
 		}
-		if (walk.nbytes) {
-			if (walk.nbytes > AES_BLOCK_SIZE) {
-				u8 *iv2 = iv + AES_BLOCK_SIZE;
-
-				memcpy(iv2, iv, AES_BLOCK_SIZE);
-				crypto_inc(iv2, AES_BLOCK_SIZE);
 
-				aes_encrypt(&ctx->aes_key, iv2, iv2);
-			}
-			aes_encrypt(&ctx->aes_key, iv, iv);
+		/* handle the tail */
+		if (walk.nbytes) {
+			memcpy(buf, walk.src.virt.addr, walk.nbytes);
+			memset(buf + walk.nbytes, 0, sizeof(buf) - walk.nbytes);
 		}
-	}
 
-	/* handle the tail */
-	if (walk.nbytes) {
-		const u8 *src = walk.src.virt.addr;
-		const u8 *head = NULL;
-		unsigned int nbytes = walk.nbytes;
+		tag = (u8 *)&lengths;
+		ghash_do_update(1, dg, tag, &ctx->ghash_key,
+				walk.nbytes ? buf : NULL, NULL);
 
-		if (walk.nbytes > GHASH_BLOCK_SIZE) {
-			head = src;
-			src += GHASH_BLOCK_SIZE;
-			nbytes %= GHASH_BLOCK_SIZE;
-		}
+		if (walk.nbytes) {
+			aes_encrypt(&ctx->aes_key, buf, iv);
 
-		memcpy(buf, src, nbytes);
-		memset(buf + nbytes, 0, GHASH_BLOCK_SIZE - nbytes);
-		ghash_do_update(!!nbytes, dg, buf, &ctx->ghash_key, head,
-				pmull_ghash_update_p64);
+			crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr,
+				       buf, walk.nbytes);
 
-		crypto_xor_cpy(walk.dst.virt.addr, walk.src.virt.addr, iv,
-			       walk.nbytes);
+			err = skcipher_walk_done(&walk, 0);
+		}
 
-		err = skcipher_walk_done(&walk, 0);
+		put_unaligned_be64(dg[1], tag);
+		put_unaligned_be64(dg[0], tag + 8);
+		put_unaligned_be32(1, iv + GCM_IV_SIZE);
+		aes_encrypt(&ctx->aes_key, iv, iv);
+		crypto_xor(tag, iv, AES_BLOCK_SIZE);
 	}
 
 	if (err)
 		return err;
 
-	gcm_final(req, ctx, dg, tag, req->cryptlen - authsize);
-
 	/* compare calculated auth tag with the stored one */
 	scatterwalk_map_and_copy(buf, req->src,
 				 req->assoclen + req->cryptlen - authsize,
@@ -675,7 +628,7 @@ static int gcm_decrypt(struct aead_request *req)
 
 static struct aead_alg gcm_aes_alg = {
 	.ivsize			= GCM_IV_SIZE,
-	.chunksize		= 2 * AES_BLOCK_SIZE,
+	.chunksize		= AES_BLOCK_SIZE,
 	.maxauthsize		= AES_BLOCK_SIZE,
 	.setkey			= gcm_setkey,
 	.setauthsize		= gcm_setauthsize,