Message ID | 20190214040652.4811-1-richard.henderson@linaro.org |
---|---|
Series | target/arm: Reduce overhead of cpu_get_tb_cpu_state |
Hi Richard,

On Thu, Feb 14, 2019 at 5:07 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears as if
> exception delivery and return, plus after every TB-ending write
> to a system register is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> is harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.

FWIW this patch series made a run of linux-user AArch64 176.gcc 166.i
go from 29.5s down to 24.5s (on an E5-2650 v2).

Though that'd need more benchmarks, that looks quite good to me.

Thanks,

Laurent

>
> r~
>
>
> Richard Henderson (4):
>   target/arm: Split out recompute_hflags et al
>   target/arm: Rebuild hflags at el changes and MSR writes
>   target/arm: Assert hflags is correct in cpu_get_tb_cpu_state
>   target/arm: Rely on hflags correct in cpu_get_tb_cpu_state
>
>  target/arm/cpu.h           |  22 ++-
>  target/arm/helper.h        |   3 +
>  target/arm/internals.h     |   4 +
>  linux-user/syscall.c       |   1 +
>  target/arm/cpu.c           |   1 +
>  target/arm/helper-a64.c    |   3 +
>  target/arm/helper.c        | 267 ++++++++++++++++++++++-------------------
>  target/arm/machine.c       |   1 +
>  target/arm/op_helper.c     |   1 +
>  target/arm/translate-a64.c |   6 +-
>  target/arm/translate.c     |  14 +-
>  11 files changed, 204 insertions(+), 119 deletions(-)
>
> --
> 2.17.1
>
>
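Laurent's two timings can be turned into a speedup factor and a relative time saving with a little arithmetic. The following is only a minimal sketch in C; the function names `speedup` and `time_saved_fraction` are made up for illustration and are not part of the patch series.

```c
#include <assert.h>

/*
 * Hypothetical helpers (not from the patch series): compare two
 * wall-clock measurements of the same workload, both in seconds.
 */

/* Speedup factor: how many times faster the "after" run is. */
double speedup(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return before_s / after_s;
}

/* Fraction of the original run time that was saved (0.0 .. 1.0). */
double time_saved_fraction(double before_s, double after_s)
{
    assert(before_s > 0.0 && after_s > 0.0);
    return (before_s - after_s) / before_s;
}
```

With the figures quoted above, `speedup(29.5, 24.5)` is roughly 1.20 and `time_saved_fraction(29.5, 24.5)` is roughly 0.17, so that single run takes about 17% less time.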
Richard Henderson <richard.henderson@linaro.org> writes:

> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears as if
> exception delivery and return, plus after every TB-ending write
> to a system register is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> is harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.

I'll run some of my benchmarks on it, but I'm almost certain it will
help, as it is one of the highest hits on "perf top" when running a
loaded guest.

>
>
> r~
>
>
> Richard Henderson (4):
>   target/arm: Split out recompute_hflags et al
>   target/arm: Rebuild hflags at el changes and MSR writes
>   target/arm: Assert hflags is correct in cpu_get_tb_cpu_state
>   target/arm: Rely on hflags correct in cpu_get_tb_cpu_state
>
>  target/arm/cpu.h           |  22 ++-
>  target/arm/helper.h        |   3 +
>  target/arm/internals.h     |   4 +
>  linux-user/syscall.c       |   1 +
>  target/arm/cpu.c           |   1 +
>  target/arm/helper-a64.c    |   3 +
>  target/arm/helper.c        | 267 ++++++++++++++++++++++-------------------
>  target/arm/machine.c       |   1 +
>  target/arm/op_helper.c     |   1 +
>  target/arm/translate-a64.c |   6 +-
>  target/arm/translate.c     |  14 +-
>  11 files changed, 204 insertions(+), 119 deletions(-)

--
Alex Bennée
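The idea in the cover letter (compute the flags once when state changes, then just read them on every TB lookup) can be illustrated with a toy model. This is a sketch under stated assumptions, not the code from the series: `ToyCPU`, its fields, and every `toy_*` function are hypothetical stand-ins, and the real flag layout in QEMU is far richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified CPU state; not QEMU's CPUARMState. */
typedef struct ToyCPU {
    unsigned el;       /* current exception level, 0..3 */
    bool mmu_on;       /* stands in for one system-register control bit */
    bool fp_enabled;   /* stands in for another control bit */
    uint32_t hflags;   /* cached result of toy_compute_hflags() */
} ToyCPU;

/* Slow path: derive the flags from scratch out of the CPU state. */
uint32_t toy_compute_hflags(const ToyCPU *cpu)
{
    uint32_t flags = cpu->el & 3u;
    if (cpu->mmu_on) {
        flags |= 1u << 2;
    }
    if (cpu->fp_enabled) {
        flags |= 1u << 3;
    }
    return flags;
}

/* Refresh the cache; called wherever the inputs can change. */
void toy_rebuild_hflags(ToyCPU *cpu)
{
    cpu->hflags = toy_compute_hflags(cpu);
}

/* Exception delivery or return: the exception level changes. */
void toy_change_el(ToyCPU *cpu, unsigned new_el)
{
    cpu->el = new_el;
    toy_rebuild_hflags(cpu);
}

/* A TB-ending write to a (toy) system register. */
void toy_write_control(ToyCPU *cpu, bool mmu_on, bool fp_enabled)
{
    cpu->mmu_on = mmu_on;
    cpu->fp_enabled = fp_enabled;
    toy_rebuild_hflags(cpu);
}

/* Fast path, run on every TB lookup: read the cached value. */
uint32_t toy_get_tb_flags(const ToyCPU *cpu)
{
    /* Debug check that no update site was missed; compiled out with NDEBUG. */
    assert(cpu->hflags == toy_compute_hflags(cpu));
    return cpu->hflags;
}
```

The assertion in `toy_get_tb_flags` plays the role of a transitional safety net: it recomputes the flags and compares, which catches any state change that forgot to call the rebuild function. Once the update sites are trusted, building with `NDEBUG` leaves only the cached read on the hot path.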
On Wed, Feb 13, 2019 at 20:06:48 -0800, Richard Henderson wrote:
> We've talked about this before, caching state to reduce the
> amount of computation that happens looking up each TB.
>
> I know that Peter has been concerned that we would not be able to
> reliably maintain all of the places that need to be updated to
> keep this up-to-date.
>
> Well, modulo dirty tricks within linux-user, it appears as if
> exception delivery and return, plus after every TB-ending write
> to a system register is sufficient.
>
> There seems to be a noticeable improvement, although wall-time
> is harder to come by -- all of my system-level measurements
> include user input, and my user-level measurements seem to be
> too small to matter.

Thanks for this!

Some SPEC06int user-mode numbers (before vs. after):

  aarch64-linux-user speedup for SPEC06int (test set)
  Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz

  [ASCII bar chart not recoverable from the archive: speedup per
  SPEC06int benchmark plus their geomean, y-axis from 1 to 2.]

  png: https://imgur.com/RjkYYJ5

That is, a 1.4x average speedup.

		Emilio
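The "geomean" bar in Emilio's chart is a geometric mean, the usual way to average speedup ratios across benchmarks. A minimal sketch of that calculation in C follows; `geomean` and `ipow` are hypothetical names, the per-benchmark values are not reproduced here because they cannot be read from the archived chart, and the n-th root is found by bisection only to keep the example free of a libm dependency (the more common form is `exp` of the mean of `log` values).

```c
#include <assert.h>
#include <stddef.h>

/* x raised to the n-th power by repeated multiplication. */
static double ipow(double x, size_t n)
{
    double r = 1.0;
    for (size_t i = 0; i < n; i++) {
        r *= x;
    }
    return r;
}

/*
 * Geometric mean of n positive ratios: the n-th root of their product.
 * Suitable for a handful of ratios near 1; a long list would risk
 * overflowing the running product.
 */
double geomean(const double *ratios, size_t n)
{
    assert(n > 0);

    double product = 1.0;
    double hi = 1.0;               /* upper bound: max(1, largest ratio) */
    for (size_t i = 0; i < n; i++) {
        assert(ratios[i] > 0.0);
        product *= ratios[i];
        if (ratios[i] > hi) {
            hi = ratios[i];
        }
    }

    /* Bisect for g in [0, hi] such that g^n equals the product. */
    double lo = 0.0;
    for (int iter = 0; iter < 100; iter++) {
        double mid = 0.5 * (lo + hi);
        if (ipow(mid, n) < product) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}
```

As a sanity check with made-up inputs, the geometric mean of 1.0 and 4.0 is 2.0, whereas their arithmetic mean would be 2.5; the geometric mean is less swayed by a single large ratio.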