Message ID: 20240215081449.848220-1-richard.henderson@linaro.org
Series: Optimize buffer_is_zero
On Wed, 14 Feb 2024, Richard Henderson wrote:

> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>
> Changes for v4:
>   - Keep separate >= 256 entry point, but only keep constant length
>     check inline.  This allows the indirect function call to be hidden
>     and optimized away when the pointer is constant.

Sorry, I don't understand this. Most of the improvement (at least in our
testing) comes from inlining the byte checks, which often fail and eliminate
call overhead entirely. Moving them out-of-line seems to lose most of the
speedup the patchset was bringing, doesn't it? Is there some concern I am
not seeing?

>   - Split out a >= 256 integer routine.
>   - Simplify acceleration selection for testing.
>   - Add function pointer typedef.
>   - Implement new aarch64 accelerations.
>
> r~
>
> Alexander Monakov (5):
>   util/bufferiszero: Remove SSE4.1 variant
>   util/bufferiszero: Remove AVX512 variant
>   util/bufferiszero: Reorganize for early test for acceleration
>   util/bufferiszero: Remove useless prefetches
>   util/bufferiszero: Optimize SSE2 and AVX2 variants
>
> Richard Henderson (5):
>   util/bufferiszero: Improve scalar variant
>   util/bufferiszero: Introduce biz_accel_fn typedef
>   util/bufferiszero: Simplify test_buffer_is_zero_next_accel
>   util/bufferiszero: Add simd acceleration for aarch64
>   util/bufferiszero: Add sve acceleration for aarch64
>
>  host/include/aarch64/host/cpuinfo.h |   1 +
>  include/qemu/cutils.h               |  15 +-
>  util/bufferiszero.c                 | 500 ++++++++++++++++------------
>  util/cpuinfo-aarch64.c              |   1 +
>  meson.build                         |  13 +
>  5 files changed, 323 insertions(+), 207 deletions(-)
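To make the point of contention concrete, here is a minimal sketch of what "only keep constant length check inline" could look like in the header; this is an illustration rather than the literal v4 patch, and buffer_is_zero_ool is a hypothetical name for the out-of-line entry point (only buffer_is_zero and buffer_is_zero_ge256 actually appear in this thread):

#include <stdbool.h>
#include <stddef.h>

bool buffer_is_zero_ool(const void *buf, size_t len);    /* hypothetical */
bool buffer_is_zero_ge256(const void *buf, size_t len);

static inline bool buffer_is_zero(const void *buf, size_t len)
{
    /*
     * If the compiler can prove at the call site that len >= 256,
     * dispatch straight to the large-buffer entry point; otherwise
     * fall back to a generic out-of-line routine.  Note that the
     * early byte checks are *not* here -- that is exactly what the
     * objection above is about.
     */
    if (__builtin_constant_p(len) && len >= 256) {
        return buffer_is_zero_ge256(buf, len);
    }
    return buffer_is_zero_ool(buf, len);
}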
On 2/14/24 22:57, Alexander Monakov wrote:
>
> On Wed, 14 Feb 2024, Richard Henderson wrote:
>
>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>
>> Changes for v4:
>>   - Keep separate >= 256 entry point, but only keep constant length
>>     check inline.  This allows the indirect function call to be hidden
>>     and optimized away when the pointer is constant.
>
> Sorry, I don't understand this. Most of the improvement (at least in our
> testing) comes from inlining the byte checks, which often fail and eliminate
> call overhead entirely. Moving them out-of-line seems to lose most of the
> speedup the patchset was bringing, doesn't it? Is there some concern I am
> not seeing?

What is your benchmarking method?

It was my guess that most of the improvement came from performing those early
byte checks *at all*, and that the overhead of a function call to a small
out-of-line wrapper would be negligible.

By not exposing the function pointer outside the bufferiszero translation
unit, the compiler can see when the pointer is never modified for a given
host, and then transform the indirect branch into a direct branch.

r~
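A rough sketch of the devirtualization pattern being described, assuming the pointer lives entirely inside util/bufferiszero.c; the biz_accel_fn typedef and buffer_is_zero_ge256 are names from the series, while the other identifiers and the integer loop are invented for the example:

/* util/bufferiszero.c, conceptually */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef bool (*biz_accel_fn)(const void *, size_t);

static bool buffer_is_zero_int_ge256(const void *buf, size_t len)
{
    /* Portable integer loop; assumes len is a multiple of 8 for brevity. */
    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i += 8) {
        uint64_t v;
        memcpy(&v, p + i, sizeof(v));
        if (v) {
            return false;
        }
    }
    return true;
}

/*
 * The accelerator pointer has internal linkage.  On a host where it is
 * never reassigned (no runtime-selected SIMD variant), the compiler can
 * prove it always holds &buffer_is_zero_int_ge256 and turn the indirect
 * call below into a direct call, or inline it outright.
 */
static biz_accel_fn buffer_is_zero_accel = buffer_is_zero_int_ge256;

bool buffer_is_zero_ge256(const void *buf, size_t len)
{
    return buffer_is_zero_accel(buf, len);
}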
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/14/24 22:57, Alexander Monakov wrote:
>>
>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>
>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>
>>> Changes for v4:
>>>   - Keep separate >= 256 entry point, but only keep constant length
>>>     check inline.  This allows the indirect function call to be hidden
>>>     and optimized away when the pointer is constant.
>>
>> Sorry, I don't understand this. Most of the improvement (at least in our
>> testing) comes from inlining the byte checks, which often fail and
>> eliminate call overhead entirely. Moving them out-of-line seems to lose
>> most of the speedup the patchset was bringing, doesn't it? Is there some
>> concern I am not seeing?
>
> What is your benchmarking method?

Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and v2;
are you saying they did not reach your inbox?

https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

> It was my guess that most of the improvement came from performing those
> early byte checks *at all*, and that the overhead of a function call to a
> small out-of-line wrapper would be negligible.

qemu-img invokes buffer_is_zero in a fairly tight loop. Let us know if you
need numbers for how much the out-of-line version loses.

> By not exposing the function pointer outside the bufferiszero translation
> unit, the compiler can see when the pointer is never modified for a given
> host, and then transform the indirect branch into a direct branch.

Okay, but that does not make it necessary to move the byte checks out of line.
I was preparing a rebase that does not expose the function pointer to the
inline wrapper. I was completely unaware that you were taking over the patchset.

Alexander
On 2/15/24 11:36, Alexander Monakov wrote:
>
> On Thu, 15 Feb 2024, Richard Henderson wrote:
>
>> On 2/14/24 22:57, Alexander Monakov wrote:
>>>
>>> On Wed, 14 Feb 2024, Richard Henderson wrote:
>>>
>>>> v3: https://patchew.org/QEMU/20240206204809.9859-1-amonakov@ispras.ru/
>>>>
>>>> Changes for v4:
>>>>   - Keep separate >= 256 entry point, but only keep constant length
>>>>     check inline.  This allows the indirect function call to be hidden
>>>>     and optimized away when the pointer is constant.
>>>
>>> Sorry, I don't understand this. Most of the improvement (at least in our
>>> testing) comes from inlining the byte checks, which often fail and
>>> eliminate call overhead entirely. Moving them out-of-line seems to lose
>>> most of the speedup the patchset was bringing, doesn't it? Is there some
>>> concern I am not seeing?
>>
>> What is your benchmarking method?
>
> Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and
> v2; are you saying they did not reach your inbox?
> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/

I'm saying that this is not a reproducible description of methodology.

With master, so with neither of our changes:

I tried converting an 80G win7 image that I happened to have lying about, and
I see buffer_zero_avx2 with only 3.03% perf overhead. Then I tried truncating
the image to 16G to see if having the entire image in RAM would help -- not
yet, still only 3.4% perf overhead. Finally, I truncated the image to 4G and
saw 2.9% overhead.

So... help me out here. I would like to be able to see results that are at
least vaguely similar.

r~
On Thu, 15 Feb 2024, Richard Henderson wrote:

>> Converting a 4.4 GiB Windows 10 image to qcow2. It was mentioned in v1 and
>> v2; are you saying they did not reach your inbox?
>> https://lore.kernel.org/qemu-devel/20231013155856.21475-1-mmromanov@ispras.ru/
>> https://lore.kernel.org/qemu-devel/20231027143704.7060-1-mmromanov@ispras.ru/
>
> I'm saying that this is not a reproducible description of methodology.
>
> With master, so with neither of our changes:
>
> I tried converting an 80G win7 image that I happened to have lying about,
> and I see buffer_zero_avx2 with only 3.03% perf overhead. Then I tried
> truncating the image to 16G to see if having the entire image in RAM would
> help -- not yet, still only 3.4% perf overhead. Finally, I truncated the
> image to 4G and saw 2.9% overhead.
>
> So... help me out here. I would like to be able to see results that are at
> least vaguely similar.

Ah, I guess you might be running at a low perf_event_paranoid setting that
allows unprivileged sampling of kernel events? In our submissions the
percentage was for perf_event_paranoid=2, i.e. relative to QEMU only,
excluding kernel time spent under syscalls.

Retrieve IE11.Win7.VirtualBox.zip from

  https://archive.org/details/ie11.win7.virtualbox

and use

  unzip -p IE11.Win7.VirtualBox.zip | tar xv

to extract 'IE11 - Win7-disk001.vmdk'. (Mikhail used a different image when
preparing the patch.)

On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running

  qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

The user:kernel time split is about 0.15:2.3, so 70% relative to user time
does roughly correspond to a single-digit percentage relative to
(user+kernel) time.

(Which does tell us that qemu-img is doing I/O inefficiently; it shouldn't
need two seconds to read a fully cached 5-gigabyte file.)

Alexander
On 2/15/24 13:37, Alexander Monakov wrote:
> Ah, I guess you might be running at a low perf_event_paranoid setting that
> allows unprivileged sampling of kernel events? In our submissions the
> percentage was for perf_event_paranoid=2, i.e. relative to QEMU only,
> excluding kernel time spent under syscalls.

Ok. Eliminating kernel samples makes things easier to see, but I still do not
see a 40% reduction in runtime. Just so we're on the same page:

> Retrieve IE11.Win7.VirtualBox.zip from
>
>   https://archive.org/details/ie11.win7.virtualbox
>
> and use
>
>   unzip -p IE11.Win7.VirtualBox.zip | tar xv
>
> to extract 'IE11 - Win7-disk001.vmdk'. (Mikhail used a different image when
> preparing the patch.)
>
> On this image, I get 70% in buffer_zero_sse2 on a Sandy Bridge running
>
>   qemu-img convert 'IE11 - Win7-disk001.vmdk' -O qcow2 /tmp/t.qcow2

With this, I see virtually all of the runtime in libz.so, so I converted the
image to raw first to focus on the issue. For avoidance of doubt:

  $ ls -lsh test.raw && sha256sum test.raw
  12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
  3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw

Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)

master:
  57.48%  qemu-img-m   [.] buffer_zero_avx2
   3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
   2.61%  qemu-img-m   [.] buffer_is_zero
  63.69%  -- total

v3:
  48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
   3.79%  qemu-img-v3  [.] buffer_zero_avx2
  52.65%  -- total
    -17%  -- reduction from master

v4:
  54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
   3.30%  qemu-img-v4  [.] buffer_zero_avx2
   3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
  61.07%  -- total
     -4%  -- reduction from master

v4+:
  46.65%  qemu-img     [.] is_allocated_sectors.part.0
   3.49%  qemu-img     [.] buffer_zero_avx2
   0.05%  qemu-img     [.] buffer_is_zero_ge256
  50.19%  -- total
    -21%  -- reduction from master

The v4+ build puts the 3-byte test back inline, like in your v3.

Importantly, it must be written as 3 short-circuiting tests, where my v4
"simplified" this to (s | m | e) != 0, on the assumption that the reduced
number of branches would help. Diving into perf, it becomes clear why:

  57.36 │       cmpb   $0x0,(%rbx)
   4.02 │     ↓ jne    89
  21.84 │       cmpb   $0x0,0x1ff(%rbx)
   0.64 │     ↓ jne    89
   8.45 │       cmpb   $0x0,0x100(%rbx)
   0.26 │     ↓ jne    89
   0.06 │       mov    $0x200,%esi
        │       mov    %rbx,%rdi
   0.07 │     → call   buffer_is_zero_ge256

The three bytes are on 3 different cache lines. Judging by the relative
percentages, it would seem that the first byte alone eliminates slightly more
than half of all blocks; the last byte eliminates more than half again; the
middle byte eliminates a fair fraction of the rest. With the short-circuit,
the extra cache lines are not touched. This is so important that it should be
spelled out in a comment.

With that settled, I guess we need to talk about how much the out-of-line
implementation matters at all. I'm thinking about writing a
test/bench/bufferiszero benchmark, with all-zero buffers of various sizes and
alignments. With that it would be easier to talk about whether any given
implementation is an improvement for that final 4% not eliminated by the
three bytes.

> (Which does tell us that qemu-img is doing I/O inefficiently; it shouldn't
> need two seconds to read a fully cached 5-gigabyte file.)

Indeed!

r~
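A minimal C sketch of the short-circuiting check that the annotation above corresponds to, assuming 512-byte blocks; the wrapper shown is illustrative rather than the exact series code, with buffer_is_zero_ge256 being the real out-of-line entry point:

#include <stdbool.h>
#include <stddef.h>

bool buffer_is_zero_ge256(const void *buf, size_t len);

static inline bool buffer_is_zero_inline(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    /*
     * Probe the first, last, and middle byte -- three different cache
     * lines for a 512-byte block.  These must remain three separate
     * short-circuiting tests: if the first byte is already non-zero,
     * the other two cache lines are never touched.  Folding them into
     * (p[0] | p[len - 1] | p[len / 2]) != 0 forces all three loads.
     */
    if (p[0] || p[len - 1] || p[len / 2]) {
        return false;
    }
    return buffer_is_zero_ge256(buf, len);
}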
On Thu, 15 Feb 2024, Richard Henderson wrote:

> On 2/15/24 13:37, Alexander Monakov wrote:
>> Ah, I guess you might be running at a low perf_event_paranoid setting that
>> allows unprivileged sampling of kernel events? In our submissions the
>> percentage was for perf_event_paranoid=2, i.e. relative to QEMU only,
>> excluding kernel time spent under syscalls.
>
> Ok. Eliminating kernel samples makes things easier to see, but I still do
> not see a 40% reduction in runtime.

I suspect Mikhail's image was less sparse, so the impact from inlining was
greater.

> With this, I see virtually all of the runtime in libz.so, so I converted
> the image to raw first to focus on the issue.

Ah, apologies for that. I built with --disable-default-features and did not
notice my qemu-img lacked support for vmdk and treated it as a raw image
instead. I was assuming it was similar to what Mikhail used, but obviously
it's not, due to the compression.

> For avoidance of doubt:
>
>   $ ls -lsh test.raw && sha256sum test.raw
>   12G -rw-r--r-- 1 rth rth 40G Feb 15 21:14 test.raw
>   3b056d839952538fed42fa898c6063646f4fda1bf7ea0180fbb5f29d21fe8e80  test.raw
>
> Host: 11th Gen Intel(R) Core(TM) i7-1195G7 @ 2.90GHz
> Compiler: gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
>
> master:
>   57.48%  qemu-img-m   [.] buffer_zero_avx2
>    3.60%  qemu-img-m   [.] is_allocated_sectors.part.0
>    2.61%  qemu-img-m   [.] buffer_is_zero
>   63.69%  -- total
>
> v3:
>   48.86%  qemu-img-v3  [.] is_allocated_sectors.part.0
>    3.79%  qemu-img-v3  [.] buffer_zero_avx2
>   52.65%  -- total
>     -17%  -- reduction from master
>
> v4:
>   54.60%  qemu-img-v4  [.] buffer_is_zero_ge256
>    3.30%  qemu-img-v4  [.] buffer_zero_avx2
>    3.17%  qemu-img-v4  [.] is_allocated_sectors.part.0
>   61.07%  -- total
>      -4%  -- reduction from master
>
> v4+:
>   46.65%  qemu-img     [.] is_allocated_sectors.part.0
>    3.49%  qemu-img     [.] buffer_zero_avx2
>    0.05%  qemu-img     [.] buffer_is_zero_ge256
>   50.19%  -- total
>     -21%  -- reduction from master

Any ideas where the -21% vs v3's -17% difference comes from?

FWIW, in situations like these I always recommend running perf with a fixed
sampling period, i.e. 'perf record -e cycles:P -c 100000' or
'perf record -e cycles/period=100000/P', to make sample counts between runs
of different duration directly comparable (displayed with 'perf report -n').

> The v4+ build puts the 3-byte test back inline, like in your v3.
>
> Importantly, it must be written as 3 short-circuiting tests, where my v4
> "simplified" this to (s | m | e) != 0, on the assumption that the reduced
> number of branches would help.

Yes, we also noticed that when preparing our patch. We also tried mixed
variants like (s | e) != 0 || m != 0, but they did not turn out faster.

> With that settled, I guess we need to talk about how much the out-of-line
> implementation matters at all. I'm thinking about writing a
> test/bench/bufferiszero benchmark, with all-zero buffers of various sizes
> and alignments. With that it would be easier to talk about whether any
> given implementation is an improvement for that final 4% not eliminated by
> the three bytes.

Yeah, initially I suggested this task to Mikhail as a practice exercise
outside of QEMU, and we had a benchmark that measures buffer_is_zero via
perf_event_open. That makes it possible to see exactly how close the
implementation runs to the performance ceiling given by the maximum L1 fetch
rate (two loads per cycle on x86).

Alexander
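A rough, self-contained sketch of the kind of microbenchmark being discussed: all-zero buffers of various sizes and misalignments, timed with a plain wall clock rather than perf_event_open. The harness, sizes, and offsets are illustrative assumptions; buffer_is_zero is whichever implementation gets linked in:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Implementation under test, provided elsewhere. */
bool buffer_is_zero(const void *buf, size_t len);

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    static const size_t sizes[] = { 256, 512, 4096, 65536 };
    static const size_t offsets[] = { 0, 1, 8, 63 };   /* misalignments */
    const size_t iters = 1 << 16;

    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        for (size_t o = 0; o < sizeof(offsets) / sizeof(offsets[0]); o++) {
            /* Zero-filled buffer with room for the misalignment offset. */
            unsigned char *raw = calloc(1, sizes[s] + 64);
            unsigned char *buf = raw + offsets[o];

            double t0 = now_sec();
            for (size_t i = 0; i < iters; i++) {
                if (!buffer_is_zero(buf, sizes[s])) {
                    abort();    /* buffer is all zeroes by construction */
                }
            }
            double dt = now_sec() - t0;

            printf("size %6zu offset %2zu: %.2f GB/s\n",
                   sizes[s], offsets[o],
                   (double)sizes[s] * iters / dt / 1e9);
            free(raw);
        }
    }
    return 0;
}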
On 2/16/24 10:20, Alexander Monakov wrote:
> FWIW, in situations like these I always recommend running perf with a fixed
> sampling period, i.e. 'perf record -e cycles:P -c 100000' or
> 'perf record -e cycles/period=100000/P', to make sample counts between runs
> of different duration directly comparable (displayed with 'perf report -n').

I've re-done the numbers with a fixed period, as suggested, and the
difference between v3 and v4+ is within the sampling noise, differing by
about 0.3%.

r~