[0/4] target/arm: Reduce overhead of cpu_get_tb_cpu_state

Message ID

20190214040652.4811-1-richard.henderson@linaro.org

Headers

Delivered-To: patch@linaro.org
Received-SPF: pass (google.com: domain of
	qemu-devel-bounces+patch=linaro.org@nongnu.org designates
	209.51.188.17 as permitted sender) client-ip=209.51.188.17;
From: Richard Henderson <richard.henderson@linaro.org>
To: qemu-devel@nongnu.org
Date: Wed, 13 Feb 2019 20:06:48 -0800
Message-Id: <20190214040652.4811-1-richard.henderson@linaro.org>
Subject: [Qemu-devel] [PATCH 0/4] target/arm: Reduce overhead of
	cpu_get_tb_cpu_state
Precedence: list
Cc: peter.maydell@linaro.org, cota@braap.org, alex.bennee@linaro.org
Errors-To: qemu-devel-bounces+patch=linaro.org@nongnu.org
Sender: "Qemu-devel" <qemu-devel-bounces+patch=linaro.org@nongnu.org>

Series

target/arm: Reduce overhead of cpu_get_tb_cpu_state

[0/4] target/arm: Reduce overhead of cpu_get_tb_cpu_state
[1/4] target/arm: Split out recompute_hflags et al
[2/4] target/arm: Rebuild hflags at el changes and MSR writes
[3/4] target/arm: Assert hflags is correct in cpu_get_tb_cpu_state
[4/4] target/arm: Rely on hflags correct in cpu_get_tb_cpu_state

Message

Richard Henderson Feb. 14, 2019, 4:06 a.m. UTC

We've talked about this before, caching state to reduce the
amount of computation that happens looking up each TB.

I know that Peter has been concerned that we would not be able to
reliably maintain all of the places that need to be updated to
keep this up-to-date.

Well, modulo dirty tricks within linux-user, it appears that
rebuilding the cached state at exception delivery and return, plus
after every TB-ending write to a system register, is sufficient.
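
A toy model in C may help readers new to the idea. It is a sketch under stated assumptions, not the code in this series: `ToyCPU`, `compute_hflags`, `rebuild_hflags`, `exception_entry`, `write_sysreg` and `get_tb_flags` are hypothetical names, and the two flag bits are invented for illustration. The derived flags are cached in the CPU state, rebuilt only at the points listed above, and read back cheaply on each TB lookup. An assertion checks that the cache has not gone stale.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified CPU state; not QEMU's CPUARMState. */
typedef struct {
    unsigned el;       /* current exception level */
    uint32_t sctlr;    /* stand-in for a single system register */
    uint32_t hflags;   /* cached flags derived from the fields above */
} ToyCPU;

/* The derivation that would otherwise run on every TB lookup. */
uint32_t compute_hflags(const ToyCPU *cpu)
{
    uint32_t flags = cpu->el & 3;      /* bits [1:0]: exception level */
    if (cpu->sctlr & 1) {
        flags |= 1u << 2;              /* bit 2: invented "enabled" flag */
    }
    return flags;
}

/* Refresh the cache; called only where the inputs can change. */
void rebuild_hflags(ToyCPU *cpu)
{
    cpu->hflags = compute_hflags(cpu);
}

/* Exception delivery or return: the exception level changes. */
void exception_entry(ToyCPU *cpu, unsigned target_el)
{
    cpu->el = target_el;
    rebuild_hflags(cpu);
}

/* TB-ending system register write: rebuild before the next lookup. */
void write_sysreg(ToyCPU *cpu, uint32_t value)
{
    cpu->sctlr = value;
    rebuild_hflags(cpu);
}

/* Hot path: return the cached value, asserting that it is still correct. */
uint32_t get_tb_flags(const ToyCPU *cpu)
{
    assert(cpu->hflags == compute_hflags(cpu));
    return cpu->hflags;
}
```

In this model, a write to `el` or `sctlr` that skips `rebuild_hflags` trips the assertion on the next `get_tb_flags` call. Once the assertion has held up under testing, it can be dropped so that the hot path is a plain load.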

There seems to be a noticeable improvement, although wall-time
numbers are harder to come by -- all of my system-level measurements
include user input, and my user-level measurements seem to be
too small to matter.


r~


Richard Henderson (4):
  target/arm: Split out recompute_hflags et al
  target/arm: Rebuild hflags at el changes and MSR writes
  target/arm: Assert hflags is correct in cpu_get_tb_cpu_state
  target/arm: Rely on hflags correct in cpu_get_tb_cpu_state

 target/arm/cpu.h           |  22 ++-
 target/arm/helper.h        |   3 +
 target/arm/internals.h     |   4 +
 linux-user/syscall.c       |   1 +
 target/arm/cpu.c           |   1 +
 target/arm/helper-a64.c    |   3 +
 target/arm/helper.c        | 267 ++++++++++++++++++++++---------------
 target/arm/machine.c       |   1 +
 target/arm/op_helper.c     |   1 +
 target/arm/translate-a64.c |   6 +-
 target/arm/translate.c     |  14 +-
 11 files changed, 204 insertions(+), 119 deletions(-)

-- 
2.17.1

Comments

Laurent Desnogues Feb. 14, 2019, 10:28 a.m. UTC | #1

Hi Richard,

On Thu, Feb 14, 2019 at 5:07 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears that
> rebuilding the cached state at exception delivery and return, plus
> after every TB-ending write to a system register, is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> numbers are harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.


FWIW this patch series made a run of linux-user AArch64 176.gcc 166.i
go from 29.5s down to 24.5s (on an E5-2650 v2).  Though that'd need
more benchmarks, that looks quite good to me.
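
To put those two timings in relative terms, here is a minimal C sketch of the arithmetic. The helper names `speedup`, `time_saved_pct` and `report` are hypothetical; the only data used are the 29.5s and 24.5s figures quoted above.

```c
#include <stdio.h>

/* Ratio of old to new run time; values above 1.0 mean the new run is faster. */
double speedup(double before_s, double after_s)
{
    return before_s / after_s;
}

/* Percentage of the original run time that was saved. */
double time_saved_pct(double before_s, double after_s)
{
    return 100.0 * (before_s - after_s) / before_s;
}

/* Print both figures for the run times quoted above. */
void report(void)
{
    const double before_s = 29.5, after_s = 24.5;

    printf("speedup: %.2fx, time saved: %.1f%%\n",
           speedup(before_s, after_s),
           time_saved_pct(before_s, after_s));
}
```

For this single run that works out to a speedup of about 1.20x, or roughly 17% less wall time.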

Thanks,

Laurent


Alex Bennée Feb. 14, 2019, 11:05 a.m. UTC | #2

Richard Henderson <richard.henderson@linaro.org> writes:

> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears that
> rebuilding the cached state at exception delivery and return, plus
> after every TB-ending write to a system register, is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> numbers are harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.


I'll run some of my benchmarks on it, but I'm almost certain it will
help, as cpu_get_tb_cpu_state is one of the highest hits in "perf top"
when running a loaded guest.


--
Alex Bennée

Emilio Cota Feb. 14, 2019, 5:05 p.m. UTC | #3

On Wed, Feb 13, 2019 at 20:06:48 -0800, Richard Henderson wrote:
> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears that
> rebuilding the cached state at exception delivery and return, plus
> after every TB-ending write to a system register, is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> numbers are harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.


Thanks for this!

Some SPEC06int user-mode numbers (before vs. after)

                   aarch64-linux-user speedup for SPEC06int (test set)
                      Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz

                      2 +-----------------------------------------+
                        |                                         |
                    1.9 |-+.........................a+-+r.......+-|
                        |                            +-+          |
                        |                            * *          |
                    1.8 |-+..........................*.*........+-|
                        |       +-+                  * *          |
                    1.7 |-+.....+-+...............+-+*.*...+-+..+-|
                        |       * *         +-+   * ** *   +-+    |
                    1.6 |-+.....*.*..........|....*.**.*+-+*.*..+-|
                        |       * *         *|*   * ** *+-+* *    |
                    1.5 |-+.....*.*.........*|*...*.**.**.**.*..+-|
                        |       * *         +-+   * ** ** ** *    |
                        |       * *         * *   * ** ** ** *    |
                    1.4 |-+.....*.*.........*.*...*.**.**.**.*+-+-|
                        |       * *   +-+   * *   * ** ** ** ** * |
                    1.3 |-+.....*.*...+-+...*.*...*.**.**.**.**.*-|
                        | +-+   * *   * *   * *   * ** ** ** ** * |
                    1.2 |-+-+...*.*...*.*...*.*...*.**.**.**.**.*-|
                        | * *   * *   * *   * *   * ** ** ** ** * |
                        | * *   * *   * *+-+* *   * ** ** ** ** * |
                    1.1 |-*.*...*.*...*.**.**.*...*.**.**.**.**.*-|
                        | * *+-+* *+-+* ** ** *+-+* ** ** ** ** * |
                      1 +-----------------------------------------+
              400.per401.b40344454462.li464471.483.xalangeomean
 png: https://imgur.com/RjkYYJ5

That is, a 1.4x average speedup.

		Emilio