diff mbox series

[v6,02/10] util/bufferiszero: Remove AVX512 variant

Message ID 20240424225705.929812-3-richard.henderson@linaro.org
State Superseded
Headers show
Series Optimize buffer_is_zero | expand

Commit Message

Richard Henderson April 24, 2024, 10:56 p.m. UTC
From: Alexander Monakov <amonakov@ispras.ru>

Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
routines are invoked much more rarely in normal use when most buffers
are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
frequency and voltage transition periods during which the CPU operates
at reduced performance, as described in
https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

Signed-off-by: Mikhail Romanov <mmromanov@ispras.ru>
Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20240206204809.9859-4-amonakov@ispras.ru>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 util/bufferiszero.c | 38 +++-----------------------------------
 1 file changed, 3 insertions(+), 35 deletions(-)

Comments

Daniel P. Berrangé April 29, 2024, 11:16 a.m. UTC | #1
On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> From: Alexander Monakov <amonakov@ispras.ru>
> 
> Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> routines are invoked much more rarely in normal use when most buffers
> are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> frequency and voltage transition periods during which the CPU operates
> at reduced performance, as described in
> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

This is describing limitations of Intel's AVX512 implementation.

AMD's AVX512 implementation is said to not have the kind of
power / frequency limitations that Intel's does:

  https://www.mersenneforum.org/showthread.php?p=614191

  "Overall, AMD's AVX512 implementation beat my expectations.
   I was expecting something similar to Zen1's "double-pumping"
   of AVX with half the register file and cross-lane instructions
   being super slow. But this is not the case on Zen4. The lack
   of power or thermal issues combined with stellar shuffle support
   makes it completely worthwhile to use from a developer standpoint.
   If your code can vectorize without excessive wasted computation,
   then go all the way to 512-bit. AMD not only made this worthwhile,
   but *incentivizes* it with the power savings. And if in the future
   AMD decides to widen things up, you may get a 2x speedup for free."

IOW, it sounds like we could be sacrificing performance on modern
AMD Genoa generation CPUs by removing the AVX512 impl

With regards,
Daniel
Alexander Monakov April 29, 2024, 11:29 a.m. UTC | #2
On Mon, 29 Apr 2024, Daniel P. Berrangé wrote:

> On Wed, Apr 24, 2024 at 03:56:57PM -0700, Richard Henderson wrote:
> > From: Alexander Monakov <amonakov@ispras.ru>
> > 
> > Thanks to early checks in the inline buffer_is_zero wrapper, the SIMD
> > routines are invoked much more rarely in normal use when most buffers
> > are non-zero. This makes use of AVX512 unprofitable, as it incurs extra
> > frequency and voltage transition periods during which the CPU operates
> > at reduced performance, as described in
> > https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html
> 
> This is describing limitations of Intel's AVX512 implementation.
> 
> AMD's AVX512 implementation is said to not have the kind of
> power / frequency limitations that Intel's does:
> 
>   https://www.mersenneforum.org/showthread.php?p=614191
> 
>   "Overall, AMD's AVX512 implementation beat my expectations.
>    I was expecting something similar to Zen1's "double-pumping"
>    of AVX with half the register file and cross-lane instructions
>    being super slow. But this is not the case on Zen4. The lack
>    of power or thermal issues combined with stellar shuffle support
>    makes it completely worthwhile to use from a developer standpoint.
>    If your code can vectorize without excessive wasted computation,
>    then go all the way to 512-bit. AMD not only made this worthwhile,
>    but *incentivizes* it with the power savings. And if in the future
>    AMD decides to widen things up, you may get a 2x speedup for free."
> 
> IOW, it sounds like we could be sacrificing performance on modern
> AMD Genoa generation CPUs by removing the AVX512 impl

No, the new implementation saturates load ports, and Genoa runs 512-bit
AVX instructions at half throughput compared to their 256-bit counterparts
(so one 512-bit load or two 256-bit loads per cycle), so there's no
obvious reason why this patch would sacrifice performance there.

Maybe it could, indirectly, by lowering the turbo clock limit due to
higher front-end activity, but I don't have access to a Zen 4 machine
to check, and even so it would be a few percent, not 2x.

Alexander
diff mbox series

Patch

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index f5a3634f9a..641d5f9b9e 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -64,7 +64,7 @@  buffer_zero_int(const void *buf, size_t len)
     }
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
+#if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
 /* Note that each of these vectorized functions require len >= 64.  */
@@ -128,41 +128,12 @@  buffer_zero_avx2(const void *buf, size_t len)
 }
 #endif /* CONFIG_AVX2_OPT */
 
-#ifdef CONFIG_AVX512F_OPT
-static bool __attribute__((target("avx512f")))
-buffer_zero_avx512(const void *buf, size_t len)
-{
-    /* Begin with an unaligned head of 64 bytes.  */
-    __m512i t = _mm512_loadu_si512(buf);
-    __m512i *p = (__m512i *)(((uintptr_t)buf + 5 * 64) & -64);
-    __m512i *e = (__m512i *)(((uintptr_t)buf + len) & -64);
-
-    /* Loop over 64-byte aligned blocks of 256.  */
-    while (p <= e) {
-        __builtin_prefetch(p);
-        if (unlikely(_mm512_test_epi64_mask(t, t))) {
-            return false;
-        }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
-
-    t |= _mm512_loadu_si512(buf + len - 4 * 64);
-    t |= _mm512_loadu_si512(buf + len - 3 * 64);
-    t |= _mm512_loadu_si512(buf + len - 2 * 64);
-    t |= _mm512_loadu_si512(buf + len - 1 * 64);
-
-    return !_mm512_test_epi64_mask(t, t);
-
-}
-#endif /* CONFIG_AVX512F_OPT */
-
 /*
  * Make sure that these variables are appropriately initialized when
  * SSE2 is enabled on the compiler command-line, but the compiler is
  * too old to support CONFIG_AVX2_OPT.
  */
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 # define INIT_USED     0
 # define INIT_LENGTH   0
 # define INIT_ACCEL    buffer_zero_int
@@ -188,9 +159,6 @@  select_accel_cpuinfo(unsigned info)
         unsigned len;
         bool (*fn)(const void *, size_t);
     } all[] = {
-#ifdef CONFIG_AVX512F_OPT
-        { CPUINFO_AVX512F, 256, buffer_zero_avx512 },
-#endif
 #ifdef CONFIG_AVX2_OPT
         { CPUINFO_AVX2,    128, buffer_zero_avx2 },
 #endif
@@ -208,7 +176,7 @@  select_accel_cpuinfo(unsigned info)
     return 0;
 }
 
-#if defined(CONFIG_AVX512F_OPT) || defined(CONFIG_AVX2_OPT)
+#if defined(CONFIG_AVX2_OPT)
 static void __attribute__((constructor)) init_accel(void)
 {
     used_accel = select_accel_cpuinfo(cpuinfo_init());