
[v2,2/2] ARM: permit non-nested kernel mode NEON in softirq context

Message ID 20221207103936.2198407-3-ardb@kernel.org
State Accepted
Commit c79f81631142ee2dc4c743732427f23d18cd2dec
Series ARM: allow kernel mode NEON in softirq context

Commit Message

Ard Biesheuvel Dec. 7, 2022, 10:39 a.m. UTC
We currently only permit kernel mode NEON in process context, to avoid
the need to preserve/restore the NEON register file when taking an
exception while running in the kernel.

Like we did on arm64, we can relax this restriction substantially, by
permitting kernel mode NEON from softirq context, while ensuring that
softirq processing is disabled when the NEON is being used in task
context. This guarantees that only NEON context belonging to user space
needs to be preserved and restored, which is already taken care of.

This is especially relevant for network encryption, where incoming
frames are typically handled in softirq context, and deferring software
decryption to a kernel thread or falling back to C code are both
undesirable from a performance PoV.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm/include/asm/simd.h |  8 ++++++++
 arch/arm/vfp/vfpmodule.c    | 13 ++++++-------
 2 files changed, 14 insertions(+), 7 deletions(-)
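
Editorial sketch (not part of the patch): the guarantee described in the commit message can be modelled in plain userspace C. The snippet below is a toy model under stated assumptions, not kernel code. Every name in it (neon_begin_model, raise_softirq_model, and so on) is hypothetical; a counter stands in for local_bh_disable()/local_bh_enable(). It shows that a softirq raised while task context holds the NEON unit is deferred until the region ends, so kernel mode NEON use never nests.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model. All names are hypothetical stand-ins. */
static int  bh_disable_depth;  /* models the softirq-disable nesting count */
static bool softirq_pending;   /* a softirq was raised but has not run yet */
static bool neon_in_use;       /* a kernel mode NEON region is active */
static int  softirqs_run;      /* softirq handlers that have completed */
static int  nested_neon_uses;  /* cases that would need kernel NEON state saved */

static void run_pending_softirq(void);

/* Models kernel_neon_begin() after the patch: softirqs stay disabled. */
static void neon_begin_model(void)
{
	bh_disable_depth++;
	if (neon_in_use)
		nested_neon_uses++;
	neon_in_use = true;
}

/* Models kernel_neon_end(): re-enabling softirqs runs any pending one. */
static void neon_end_model(void)
{
	neon_in_use = false;
	bh_disable_depth--;
	run_pending_softirq();
}

/* A softirq handler that itself uses NEON, e.g. for packet decryption. */
static void softirq_handler_model(void)
{
	neon_begin_model();
	neon_end_model();
	softirqs_run++;
}

static void run_pending_softirq(void)
{
	if (softirq_pending && bh_disable_depth == 0) {
		softirq_pending = false;
		softirq_handler_model();
	}
}

/* An interrupt raises a softirq; it runs at once only if not disabled. */
static void raise_softirq_model(void)
{
	softirq_pending = true;
	run_pending_softirq();
}
```

Calling neon_begin_model(), then raise_softirq_model(), then neon_end_model() leaves the handler pending until the end call and keeps nested_neon_uses at zero, which is the property the commit message relies on.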

Comments

Linus Walleij Dec. 15, 2022, 10:26 a.m. UTC | #1
On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:

> We currently only permit kernel mode NEON in process context, to avoid
> the need to preserve/restore the NEON register file when taking an
> exception while running in the kernel.
>
> Like we did on arm64, we can relax this restriction substantially, by
> permitting kernel mode NEON from softirq context, while ensuring that
> softirq processing is disabled when the NEON is being used in task
> context. This guarantees that only NEON context belonging to user space
> needs to be preserved and restored, which is already taken care of.
>
> This is especially relevant for network encryption, where incoming
> frames are typically handled in softirq context, and deferring software
> decryption to a kernel thread or falling back to C code are both
> undesirable from a performance PoV.
>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>

So boosting WireGuard as primary SW network encryption user?
This is really neat, BTW:
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>

Yours,
Linus Walleij

Ard Biesheuvel Dec. 15, 2022, 10:43 a.m. UTC | #2
On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
>
> On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > We currently only permit kernel mode NEON in process context, to avoid
> > the need to preserve/restore the NEON register file when taking an
> > exception while running in the kernel.
> >
> > Like we did on arm64, we can relax this restriction substantially, by
> > permitting kernel mode NEON from softirq context, while ensuring that
> > softirq processing is disabled when the NEON is being used in task
> > context. This guarantees that only NEON context belonging to user space
> > needs to be preserved and restored, which is already taken care of.
> >
> > This is especially relevant for network encryption, where incoming
> > frames are typically handled in softirq context, and deferring software
> > decryption to a kernel thread or falling back to C code are both
> > undesirable from a performance PoV.
> >
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>
> So boosting WireGuard as primary SW network encryption user?

Essentially, although the use case that inspired this work is related
to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
~3x faster than WG's chacha20poly1305, which makes the performance
overhead of asynchronous completion even more significant. (Note that
GCM needs the AES and PMULL instructions which are usually only
available when running the 32-bit kernel on a 64-bit core, whereas
chacha20poly1305 uses ordinary NEON instructions.)

But Martin responded with a Tested-by regarding chacha20poly1305 on
IPsec (not WG) where there is also a noticeable speedup, so WG on
ARM32 should definitely benefit from this as well.

> This is really neat, BTW:
> Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
>

Thanks!

Russell King (Oracle) Dec. 15, 2022, 10:51 a.m. UTC | #3
On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> >
> > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > > We currently only permit kernel mode NEON in process context, to avoid
> > > the need to preserve/restore the NEON register file when taking an
> > > exception while running in the kernel.
> > >
> > > Like we did on arm64, we can relax this restriction substantially, by
> > > permitting kernel mode NEON from softirq context, while ensuring that
> > > softirq processing is disabled when the NEON is being used in task
> > > context. This guarantees that only NEON context belonging to user space
> > > needs to be preserved and restored, which is already taken care of.
> > >
> > > This is especially relevant for network encryption, where incoming
> > > frames are typically handled in softirq context, and deferring software
> > > decryption to a kernel thread or falling back to C code are both
> > > undesirable from a performance PoV.
> > >
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> >
> > So boosting WireGuard as primary SW network encryption user?
> 
> Essentially, although the use case that inspired this work is related
> to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> ~3x faster than WG's chacha20poly1305, which makes the performance
> overhead of asynchronous completion even more significant. (Note that
> GCM needs the AES and PMULL instructions which are usually only
> available when running the 32-bit kernel on a 64-bit core, whereas
> chacha20poly1305 uses ordinary NEON instructions.)
> 
> But Martin responded with a Tested-by regarding chacha20poly1305 on
> IPsec (not WG) where there is also a noticeable speedup, so WG on
> ARM32 should definitely benefit from this as well.

It'll be interesting to see whether there is any noticeable difference
with my WG VPN.

Ard Biesheuvel Dec. 15, 2022, 11:48 a.m. UTC | #4
On Thu, 15 Dec 2022 at 11:51, Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Thu, Dec 15, 2022 at 11:43:22AM +0100, Ard Biesheuvel wrote:
> > On Thu, 15 Dec 2022 at 11:27, Linus Walleij <linus.walleij@linaro.org> wrote:
> > >
> > > On Wed, Dec 7, 2022 at 11:39 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > > We currently only permit kernel mode NEON in process context, to avoid
> > > > the need to preserve/restore the NEON register file when taking an
> > > > exception while running in the kernel.
> > > >
> > > > Like we did on arm64, we can relax this restriction substantially, by
> > > > permitting kernel mode NEON from softirq context, while ensuring that
> > > > softirq processing is disabled when the NEON is being used in task
> > > > context. This guarantees that only NEON context belonging to user space
> > > > needs to be preserved and restored, which is already taken care of.
> > > >
> > > > This is especially relevant for network encryption, where incoming
> > > > frames are typically handled in softirq context, and deferring software
> > > > decryption to a kernel thread or falling back to C code are both
> > > > undesirable from a performance PoV.
> > > >
> > > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > >
> > > So boosting WireGuard as primary SW network encryption user?
> >
> > Essentially, although the use case that inspired this work is related
> > to IPsec not WireGuard, and the crypto algorithm in that case (GCM) is
> > ~3x faster than WG's chacha20poly1305, which makes the performance
> > overhead of asynchronous completion even more significant. (Note that
> > GCM needs the AES and PMULL instructions which are usually only
> > available when running the 32-bit kernel on a 64-bit core, whereas
> > chacha20poly1305 uses ordinary NEON instructions.)
> >
> > But Martin responded with a Tested-by regarding chacha20poly1305 on
> > IPsec (not WG) where there is also a noticeable speedup, so WG on
> > ARM32 should definitely benefit from this as well.
>
> It'll be interesting to see whether there is any noticeable difference
> with my WG VPN.
>

Using WireGuard with the same 32-bit KVM guest communicating with its
64-bit host using virtio-net, I get a 44% speedup in the host->guest
direction. The other direction performs exactly the same, which is
unsurprising as it doesn't involve NEON crypto in softirq context at
all.

BEFORE
======

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 40144 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  25.8 MBytes   216 Mbits/sec    0    397 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    397 KBytes
[  5]   2.00-3.00   sec  27.0 MBytes   226 Mbits/sec    0    397 KBytes
[  5]   3.00-4.00   sec  26.5 MBytes   222 Mbits/sec    0    397 KBytes
[  5]   4.00-5.00   sec  26.2 MBytes   220 Mbits/sec    0    397 KBytes
[  5]   5.00-6.00   sec  26.1 MBytes   219 Mbits/sec    0    436 KBytes
[  5]   6.00-7.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   7.00-8.00   sec  26.2 MBytes   220 Mbits/sec    0    458 KBytes
[  5]   8.00-9.00   sec  26.5 MBytes   222 Mbits/sec    0    480 KBytes
[  5]   9.00-10.00  sec  26.9 MBytes   225 Mbits/sec    0    480 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   263 MBytes   221 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec                  receiver


ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 46340 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  47.5 MBytes   398 Mbits/sec    0   1.75 MBytes
[  5]   1.00-2.00   sec  45.0 MBytes   377 Mbits/sec   18   1.35 MBytes
[  5]   2.00-3.00   sec  43.8 MBytes   367 Mbits/sec    0   1.47 MBytes
[  5]   3.00-4.00   sec  45.0 MBytes   377 Mbits/sec    0   1.56 MBytes
[  5]   4.00-5.00   sec  45.0 MBytes   377 Mbits/sec    0   1.63 MBytes
[  5]   5.00-6.00   sec  42.5 MBytes   357 Mbits/sec    0   1.68 MBytes
[  5]   6.00-7.00   sec  43.8 MBytes   367 Mbits/sec    0   1.71 MBytes
[  5]   7.00-8.00   sec  43.8 MBytes   367 Mbits/sec    0   1.73 MBytes
[  5]   8.00-9.00   sec  45.0 MBytes   377 Mbits/sec    0   1.74 MBytes
[  5]   9.00-10.00  sec  43.8 MBytes   367 Mbits/sec    0   1.75 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   445 MBytes   373 Mbits/sec   18             sender
[  5]   0.00-10.04  sec   444 MBytes   371 Mbits/sec                  receiver

iperf Done.


AFTER
=====

ardb@vm32:~$ iperf3 -c 192.168.11.2
Connecting to host 192.168.11.2, port 5201
[  5] local 192.168.11.1 port 44004 connected to 192.168.11.2 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  26.2 MBytes   220 Mbits/sec    0    399 KBytes
[  5]   1.00-2.00   sec  25.9 MBytes   217 Mbits/sec    0    399 KBytes
[  5]   2.00-3.00   sec  26.0 MBytes   218 Mbits/sec    0    444 KBytes
[  5]   3.00-4.00   sec  26.8 MBytes   225 Mbits/sec    0    485 KBytes
[  5]   4.00-5.00   sec  26.4 MBytes   222 Mbits/sec    0    542 KBytes
[  5]   5.00-6.00   sec  26.6 MBytes   223 Mbits/sec    0    568 KBytes
[  5]   6.00-7.00   sec  25.4 MBytes   213 Mbits/sec    0    568 KBytes
[  5]   7.00-8.00   sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
[  5]   8.00-9.00   sec  26.7 MBytes   224 Mbits/sec    0    568 KBytes
[  5]   9.00-10.00  sec  25.9 MBytes   217 Mbits/sec    0    568 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   262 MBytes   220 Mbits/sec    0             sender
[  5]   0.00-9.99   sec   261 MBytes   219 Mbits/sec                  receiver

iperf Done.

ardb@sudo:~$ iperf3 -c 192.168.11.1
Connecting to host 192.168.11.1, port 5201
[  5] local 192.168.11.2 port 49838 connected to 192.168.11.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  61.2 MBytes   514 Mbits/sec    0   1.59 MBytes
[  5]   1.00-2.00   sec  66.2 MBytes   555 Mbits/sec    0   1.67 MBytes
[  5]   2.00-3.00   sec  65.0 MBytes   545 Mbits/sec   79   1.24 MBytes
[  5]   3.00-4.00   sec  63.8 MBytes   535 Mbits/sec    0   1.36 MBytes
[  5]   4.00-5.00   sec  63.8 MBytes   535 Mbits/sec    0   1.46 MBytes
[  5]   5.00-6.00   sec  63.8 MBytes   535 Mbits/sec    0   1.53 MBytes
[  5]   6.00-7.00   sec  62.5 MBytes   524 Mbits/sec    0   1.59 MBytes
[  5]   7.00-8.00   sec  65.0 MBytes   545 Mbits/sec   99   1.18 MBytes
[  5]   8.00-9.00   sec  65.0 MBytes   545 Mbits/sec    0   1.25 MBytes
[  5]   9.00-10.00  sec  65.0 MBytes   545 Mbits/sec    0   1.30 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   641 MBytes   538 Mbits/sec  178             sender
[  5]   0.00-10.02  sec   638 MBytes   535 Mbits/sec                  receiver

iperf Done.
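
Editorial note (not part of the original mail): the 44% figure can be checked against the receiver-side bitrates in the iperf3 summaries above. The helper below is a minimal sketch; speedup_percent and print_speedups are hypothetical names, and the inputs are the rounded Mbits/sec values as printed, so the result is only as precise as those.

```c
#include <assert.h>
#include <stdio.h>

/* Relative throughput change in percent; positive means faster. */
static double speedup_percent(double before_mbps, double after_mbps)
{
	assert(before_mbps > 0.0);
	return (after_mbps - before_mbps) / before_mbps * 100.0;
}

/* Receiver-side bitrates copied from the iperf3 summaries above. */
static void print_speedups(void)
{
	/* Run issued from ardb@sudo: 371 Mbits/sec before, 535 after. */
	printf("ardb@sudo run: %+.1f%%\n", speedup_percent(371.0, 535.0));
	/* Run issued from ardb@vm32: 220 Mbits/sec before, 219 after. */
	printf("ardb@vm32 run: %+.1f%%\n", speedup_percent(220.0, 219.0));
}
```

The first pair works out to (535 - 371) / 371, roughly 44%, while the second pair differs by less than one percent, consistent with the statement that the other direction is unchanged.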

Patch

diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index 0000000000000000..82191dbd7e78a036
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,8 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#include <linux/hardirq.h>
+
+static __must_check inline bool may_use_simd(void)
+{
+	return IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && !in_hardirq();
+}
diff --git a/arch/arm/vfp/vfpmodule.c b/arch/arm/vfp/vfpmodule.c
index 8f5bc672b4aac04a..4e1a786df76df157 100644
--- a/arch/arm/vfp/vfpmodule.c
+++ b/arch/arm/vfp/vfpmodule.c
@@ -723,12 +723,12 @@  void kernel_neon_begin(void)
 	local_bh_disable();
 
 	/*
-	 * Kernel mode NEON is only allowed outside of interrupt context
-	 * with preemption disabled. This will make sure that the kernel
-	 * mode NEON register contents never need to be preserved.
+	 * Kernel mode NEON is only allowed outside of hardirq context with
+	 * preemption and softirq processing disabled. This will make sure that
+	 * the kernel mode NEON register contents never need to be preserved.
 	 */
-	BUG_ON(in_interrupt());
-	cpu = get_cpu();
+	BUG_ON(in_hardirq());
+	cpu = __smp_processor_id();
 
 	fpexc = fmrx(FPEXC) | FPEXC_EN;
 	fmxr(FPEXC, fpexc);
@@ -744,7 +744,6 @@  void kernel_neon_begin(void)
 		vfp_save_state(vfp_current_hw_state[cpu], fpexc);
 #endif
 	vfp_current_hw_state[cpu] = NULL;
-	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_begin);
 
@@ -752,7 +751,7 @@  void kernel_neon_end(void)
 {
 	/* Disable the NEON/VFP unit. */
 	fmxr(FPEXC, fmrx(FPEXC) & ~FPEXC_EN);
-	put_cpu();
+	local_bh_enable();
 }
 EXPORT_SYMBOL(kernel_neon_end);
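
Editorial sketch (not part of the patch): one way a caller might combine the new may_use_simd() with kernel_neon_begin()/kernel_neon_end() is to take the NEON path whenever it is permitted and fall back to scalar code otherwise. The snippet below is a userspace model under stated assumptions. The functions named in_hardirq(), may_use_simd(), kernel_neon_begin() and kernel_neon_end() are stubs that only mimic the behaviour shown in the diff, the CONFIG_KERNEL_MODE_NEON check is omitted, and xor_buf(), xor_neon(), xor_scalar() and the model_* variables are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: these are stubs, not the kernel implementations. */
static bool model_in_hardirq;       /* hypothetical: pretend execution context */
static int  model_bh_disable_depth; /* hypothetical: softirq-disable nesting */
static int  model_neon_calls;
static int  model_scalar_calls;

static bool in_hardirq(void)
{
	return model_in_hardirq;
}

/* Mirrors the patch minus the config check: usable outside hardirq. */
static bool may_use_simd(void)
{
	return !in_hardirq();
}

static void kernel_neon_begin(void)
{
	model_bh_disable_depth++;  /* stands in for local_bh_disable() */
	assert(!in_hardirq());     /* stands in for BUG_ON(in_hardirq()) */
}

static void kernel_neon_end(void)
{
	model_bh_disable_depth--;  /* stands in for local_bh_enable() */
}

/* Scalar fallback, usable in any context. */
static void xor_scalar(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
	model_scalar_calls++;
}

/* Placeholder for a routine that would be written with NEON instructions. */
static void xor_neon(uint8_t *dst, const uint8_t *src, size_t len)
{
	assert(model_bh_disable_depth > 0); /* must run inside a begin/end pair */
	for (size_t i = 0; i < len; i++)
		dst[i] ^= src[i];
	model_neon_calls++;
}

/* Hypothetical caller: NEON when permitted, scalar code otherwise. */
static void xor_buf(uint8_t *dst, const uint8_t *src, size_t len)
{
	if (may_use_simd()) {
		kernel_neon_begin();
		xor_neon(dst, src, len);
		kernel_neon_end();
	} else {
		xor_scalar(dst, src, len);
	}
}
```

With this shape, a call from softirq context takes the NEON branch after the patch, and only a call from hardirq context falls back to xor_scalar().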