Message ID: 20180724171224.17363-1-ard.biesheuvel@linaro.org
Series: crypto/arm64: reduce impact of NEON yield checks
On 2018-07-24 19:12:20 [+0200], Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.

Just in time. I will try to come up with some numbers on RT with the
original patch and with this one. I have it almost working…

Sebastian
On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.
>
> The issue appears to be caused by the fact that Cortex-A53, the core in
> question, has a high end implementation of the Crypto Extensions, and
> has a shallow pipeline, which means even sequential algorithms that may be
> held back by pipeline stalls on high end out of order cores run at maximum
> speed. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run at a
> speed in the order of 2 to 4 cycles per byte, and are currently implemented
> to check the TIF_NEED_RESCHED flag after each iteration, which may process
> as little as 16 bytes (for GHASH).
>
> Obviously, every cycle of overhead hurts in this context, and given that
> the A53's load/store unit is not quite high end, any delays caused by
> memory accesses that occur in the inner loop of the algorithms are going
> to be quite significant, hence the performance regression.
>
> So reduce the frequency at which the NEON yield checks are performed, so
> that they occur roughly once every 1000 cycles, which is hopefully a
> reasonable tradeoff between throughput and worst case scheduling latency.

Is there any way to tune this automatically, or is that likely to be more
trouble than it's worth?

Also, how did you come up with 1000 cycles? At what point does preemption
latency become more/less important than throughput?

Maybe someone already invented a similar framework somewhere else in the
kernel. I seem to remember some automatic selection of memcpy
implementation based on a boot-time benchmark, but I may have
misremembered.

Cheers
---Dave
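For readers following the thread: the yield checks being discussed live in
the arm64 assembly (via the cond_yield_neon macros), but their shape can be
sketched in C roughly as below. This is an illustration only;
sha2_blocks_neon() and the batch size are made up for the example, not
actual kernel symbols or the values chosen by the series.

#include <linux/kernel.h>
#include <linux/thread_info.h>
#include <linux/types.h>
#include <asm/neon.h>

/* sha2_blocks_neon() stands in for the real asm inner loop (illustrative) */
void sha2_blocks_neon(void *state, const u8 *src, int blocks);

static void sha2_update_neon(void *state, const u8 *src, int blocks)
{
	kernel_neon_begin();
	while (blocks) {
		/* batch several blocks per check so the check itself stays
		 * cheap; the series sizes the batch to roughly 1000 cycles */
		int n = min(blocks, 16);

		sha2_blocks_neon(state, src, n);
		src += n * 64;		/* SHA-256 block size */
		blocks -= n;

		if (blocks && test_thread_flag(TIF_NEED_RESCHED)) {
			/* briefly drop the NEON unit; kernel_neon_end()
			 * reenables preemption, so the scheduler can run */
			kernel_neon_end();
			kernel_neon_begin();
		}
	}
	kernel_neon_end();
}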
On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@arm.com> wrote:
> On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
>> Vakul reports a considerable performance hit when running the accelerated
>> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
>> been updated to take the TIF_NEED_RESCHED flag into account.
>>
>> [...]
>>
>> So reduce the frequency at which the NEON yield checks are performed, so
>> that they occur roughly once every 1000 cycles, which is hopefully a
>> reasonable tradeoff between throughput and worst case scheduling latency.
>
> Is there any way to tune this automatically, or is that likely to be more
> trouble than it's worth?

Good question. I think A53 is a reasonable worst case, and these changes
reduce the impact to the same ballpark as the impact of enabling
CONFIG_PREEMPT in the first place.

> Also, how did you come up with 1000 cycles? At what point does preemption
> latency become more/less important than throughput?

Another good question. I was hoping Sebastian or the other -rt folks would
be triggered by this. Given the above, I ended up with a ~1000 cycle
quantum, and hopefully this is considered to be small enough.

> Maybe someone already invented a similar framework somewhere else in the
> kernel. I seem to remember some automatic selection of memcpy
> implementation based on a boot-time benchmark, but I may have
> misremembered.

We have crypto benchmarking code in the kernel, and at some point, I even
did some work on selecting the best algo based on performance.

But to be honest, I think this is a bit overkill. If you need those final
5% of throughput at any cost, you're better off running with
CONFIG_PREEMPT=n anyway.
On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
> On 25 July 2018 at 11:09, Dave Martin <Dave.Martin@arm.com> wrote:
> > On Tue, Jul 24, 2018 at 06:12:20PM +0100, Ard Biesheuvel wrote:
> >> Vakul reports a considerable performance hit when running the accelerated
> >> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> >> been updated to take the TIF_NEED_RESCHED flag into account.
> >>
> >> [...]
> >
> > Is there any way to tune this automatically, or is that likely to be more
> > trouble than it's worth?
>
> Good question. I think A53 is a reasonable worst case, and these changes
> reduce the impact to the same ballpark as the impact of enabling
> CONFIG_PREEMPT in the first place.
>
> > Also, how did you come up with 1000 cycles? At what point does preemption
> > latency become more/less important than throughput?
>
> Another good question. I was hoping Sebastian or the other -rt folks would
> be triggered by this. Given the above, I ended up with a ~1000 cycle
> quantum, and hopefully this is considered to be small enough.
>
> > Maybe someone already invented a similar framework somewhere else in the
> > kernel. [...]
>
> We have crypto benchmarking code in the kernel, and at some point, I even
> did some work on selecting the best algo based on performance.
>
> But to be honest, I think this is a bit overkill. If you need those final
> 5% of throughput at any cost, you're better off running with
> CONFIG_PREEMPT=n anyway.

Can't really argue with any of that -- I was just wondering whether there
was precedent.

Hopefully the ~1000 cycle ballpark will satisfy most people. For the rest,
it's too bad: if somebody is relying on the last 1-2% of performance, they
probably have a broken use case.

Cheers
---Dave
On 25 July 2018 at 11:45, Dave Martin <Dave.Martin@arm.com> wrote:
> On Wed, Jul 25, 2018 at 10:23:00AM +0100, Ard Biesheuvel wrote:
>> [...]
>>
>> We have crypto benchmarking code in the kernel, and at some point, I even
>> did some work on selecting the best algo based on performance.
>>
>> But to be honest, I think this is a bit overkill. If you need those final
>> 5% of throughput at any cost, you're better off running with
>> CONFIG_PREEMPT=n anyway.
>
> Can't really argue with any of that -- I was just wondering whether there
> was precedent.
>
> Hopefully the ~1000 cycle ballpark will satisfy most people. For the rest,
> it's too bad: if somebody is relying on the last 1-2% of performance, they
> probably have a broken use case.

Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a 1000
cycle limit to the quantum of work performed with preemption disabled is
unreasonably low, we can increase the yield block counts and approach the
optimal numbers a bit closer. But with diminishing returns.
Also, if the cost of enabling CONFIG_PREEMPT by itself is significantly reduced, e.g., by moving the per-CPU offset into a GPR, we can always revisit this, of course.
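To put rough numbers on that quantum (an illustrative back-of-the-envelope
only, using the cycles-per-byte figures quoted in this thread; the actual
per-algorithm yield counts are whatever the patches set):

/* Illustrative arithmetic only, not values taken from the patches. */
#define TARGET_CYCLES	1000	/* desired preempt-off quantum */
#define GHASH_BLOCK	16	/* bytes processed per GHASH iteration */
#define WORST_CPB	4	/* worst-case cycles per byte from the thread */

/* 1000 / (16 * 4) = 15, i.e. checking TIF_NEED_RESCHED roughly every
 * 16 GHASH blocks keeps the quantum near the 1000 cycle target. */
#define GHASH_YIELD_BLOCKS	(TARGET_CYCLES / (GHASH_BLOCK * WORST_CPB))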
On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
> Indeed. OTOH, if the -rt people (Sebastian?) turn up and say that a 1000
> cycle limit to the quantum of work performed with preemption disabled is
> unreasonably low, we can increase the yield block counts and approach the
> optimal numbers a bit closer. But with diminishing returns.

So I tested on a SoftIron Overdrive 1000, which has A57 cores. I added this
series and didn't notice any spikes: cyclictest reported a max value of
~20us, which means the crypto code was not noticeable. I played a little
with it and ran the tcrypt tests for aes/sha1, and again saw no huge
spikes. So at this point this looks fantastic. I also set up cryptsetup /
dm-crypt with the usual xts(aes) mode and saw no spikes.

At this point, on this hardware, if you want to raise the block count, I
wouldn't mind.

I remember that on x86 the SIMD accelerated ciphers led to ~1ms+ spikes
once dm-crypt started its jobs.

Sebastian
On 25 July 2018 at 18:50, bigeasy@linutronix.de <bigeasy@linutronix.de> wrote:
> On 2018-07-25 11:54:53 [+0200], Ard Biesheuvel wrote:
>> [...]
>
> So I tested on a SoftIron Overdrive 1000, which has A57 cores. I added this
> series and didn't notice any spikes: cyclictest reported a max value of
> ~20us, which means the crypto code was not noticeable.
>
> [...]
>
> At this point, on this hardware, if you want to raise the block count, I
> wouldn't mind.
>
> I remember that on x86 the SIMD accelerated ciphers led to ~1ms+ spikes
> once dm-crypt started its jobs.

Thanks a lot.

So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am understanding
you correctly, you wouldn't mind the quantum of work to be in the order of
16,000 cycles or even substantially more?

That is good news, but it is also rather interesting, given that these
algorithms run at ~4 cycles per byte, meaning that you'd manage an entire
4 KB page without ever yielding. (GCM is used on network packets, XTS on
disk sectors, which are all smaller than that.)

Do you remember how you found out NEON use is a problem for -rt on arm64 in
the first place? Which algorithm did you test at the time to arrive at this
conclusion?

Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
plain AES at ~20 (on A53), so perhaps it would make sense to distinguish
between algos using crypto instructions and ones using plain SIMD.
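Spelling out the arithmetic behind those figures (all numbers as quoted in
this thread, assuming a 1 GHz A53):

/* Illustrative arithmetic only, using the figures quoted above. */
#define PAGE_BYTES	4096

/* Crypto Extensions, ~4 cycles/byte: a whole page costs
 * 4096 * 4 = 16384 cycles, ~16 us at 1 GHz -- inside the ~20 us budget. */
#define CE_PAGE_CYCLES		(PAGE_BYTES * 4)

/* Plain SIMD AES-GCM, ~29 cycles/byte: the same page costs
 * 4096 * 29 = 118784 cycles, ~119 us at 1 GHz -- which is why plain-SIMD
 * code might deserve more frequent yield checks than the CE code. */
#define SIMD_PAGE_CYCLES	(PAGE_BYTES * 29)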
On 2018-07-26 09:25:40 [+0200], Ard Biesheuvel wrote:
> Thanks a lot.
>
> So 20 us ~= 20,000 cycles on my 1 GHz Cortex-A53, and if I am understanding
> you correctly, you wouldn't mind the quantum of work to be in the order of
> 16,000 cycles or even substantially more?

I currently have that one box and it does not seem to be a problem. On idle
it now reports a max of around 20us. So if we add "only" 20us for NEON /
your preempt-disable section, then we may end up at 20+20 = 40us. At this
point I am not sure how "bad" that is. It works, it does not seem that
much, and you can disable it if you don't want the extra 20us here.

> That is good news, but it is also rather interesting, given that these
> algorithms run at ~4 cycles per byte, meaning that you'd manage an entire
> 4 KB page without ever yielding. (GCM is used on network packets, XTS on
> disk sectors, which are all smaller than that.)
>
> Do you remember how you found out NEON use is a problem for -rt on arm64 in
> the first place? Which algorithm did you test at the time to arrive at this
> conclusion?

I *think* that yield got in there by chance. The main problem back at the
time was that the scatterlist walk happened within the NEON begin/end
section. That walk may invoke kmap() / kmalloc() / kfree(), which is not
allowed on RT within a preempt-disabled section. This was my main concern.

> Note that AES-GCM using ordinary SIMD instructions runs at 29 cpb, and
> plain AES at ~20 (on A53), so perhaps it would make sense to distinguish
> between algos using crypto instructions and ones using plain SIMD.

I was looking at AES-CE and AES-NEON (aes-neon-blk / aes_ce_blk) with

  modprobe tcrypt mode=200 sec=1

and with mode=403 + 404 for the sha1/256 tests.

Sebastian
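The scatterlist-walk hazard Sebastian describes is why the arm64 glue code
steps the walk outside the NEON section. A minimal sketch of that shape,
with encrypt_chunk() standing in for the real asm routine (illustrative,
not the actual kernel source):

#include <linux/types.h>
#include <asm/neon.h>
#include <crypto/internal/skcipher.h>

/* encrypt_chunk() stands in for the actual NEON asm routine (illustrative) */
void encrypt_chunk(const u8 *src, u8 *dst, unsigned int nbytes);

static int ecb_encrypt_sketch(struct skcipher_request *req)
{
	struct skcipher_walk walk;
	int err = skcipher_walk_virt(&walk, req, false);

	while (walk.nbytes) {
		kernel_neon_begin();
		encrypt_chunk(walk.src.virt.addr, walk.dst.virt.addr,
			      walk.nbytes);
		kernel_neon_end();
		/* the walk step may kmap()/kmalloc()/kfree(), so it must run
		 * outside the preempt-disabled NEON section on RT */
		err = skcipher_walk_done(&walk, 0);
	}
	return err;
}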