[v2,00/33] accel/tcg + target/arm: pc-relative translation

Message ID 20220816203400.161187-1-richard.henderson@linaro.org

Richard Henderson Aug. 16, 2022, 8:33 p.m. UTC
Supersedes: 20220812180806.2128593-1-richard.henderson@linaro.org
("accel/tcg: minimize tlb lookups during translate + user-only PROT_EXEC fixes")

A few changes to the PROT_EXEC work that I posted last week, and
then continuing to the main event.

My initial goal was to reduce the overhead of TB flushing, which
Alex Bennee identified as a significant issue with respect to
booting AArch64 kernels under avocado.  Our initial guess was that
we need a more efficient data structure for walking TBs associated
with a physical page.

While I was looking at some of those numbers, I noted that we were
seeing up to 16000 TBs attached to a single page, which is far more
than I expected to see, and means that a new data structure isn't
going to help as much as simply reducing the number of translations.

It turns out the retranslation is due to the guest kernel's userland
address space randomization.  Each process gets e.g. libc mapped to
a different virtual address, which causes a new translation.

This, then, introduces some infrastructure for writing "pc-relative"
translation blocks, in which the guest pc is treated as a variable
just like any other guest cpu register.  The hashing for these TBs
is adjusted to compare the physical address.  The target/arm backend
is adjusted to use the new feature.

This does result in a significant reduction in translation.  From the
BootLinuxAarch64.test_virt_tcg_gicv2 test, at the login prompt:

    Before:

    gen code size       160684739/1073736704
    TB count            289808
    TB flush count      1
    TB invalidate count 235143

    After:

    gen code size       277992547/1073736704
    TB count            503882
    TB flush count      0
    TB invalidate count 69282

Before TARGET_TB_PCREL, we generate approximately 1.1GB of TBs
(overflow 1GB, flush, and fill 153MB again).  Afterward, we only
generate 265MB of TBs.

Surprisingly, this does not affect wall-clock times nearly as
much as I would have expected:

                                       before   after   change
 BootLinuxAarch64.test_virt_tcg_gicv2:  97.35    85.11   -12%
 BootLinuxAarch64.test_virt_tcg_gicv3: 102.75    96.87    -5%

Change in profile, top 10 entries before, matched up with after:

  before                                                           after
   9.01%  qemu-system-aar  [.] helper_lookup_tb_ptr                10.67%
   4.92%  qemu-system-aar  [.] qht_lookup_custom                    5.06%
   4.79%  qemu-system-aar  [.] get_phys_addr_lpae                   5.24%
   2.57%  qemu-system-aar  [.] address_space_ldq_le                 2.77%
   2.33%  qemu-system-aar  [.] liveness_pass_1                      0.60%
   2.24%  qemu-system-aar  [.] cpu_get_tb_cpu_state                 2.58%
   1.76%  qemu-system-aar  [.] address_space_translate_internal     1.75%
   1.71%  qemu-system-aar  [.] tb_lookup_cmp                        1.92%
   1.65%  qemu-system-aar  [.] tcg_gen_code                         0.44%
   1.64%  qemu-system-aar  [.] do_tb_phys_invalidate                0.09%


r~


Ilya Leoshkevich (1):
  accel/tcg: Introduce is_same_page()

Richard Henderson (32):
  linux-user/arm: Mark the commpage executable
  linux-user/hppa: Allocate page zero as a commpage
  linux-user/x86_64: Allocate vsyscall page as a commpage
  linux-user: Honor PT_GNU_STACK
  tests/tcg/i386: Move smc_code2 to an executable section
  accel/tcg: Remove PageDesc code_bitmap
  accel/tcg: Use bool for page_find_alloc
  accel/tcg: Make tb_htable_lookup static
  accel/tcg: Move qemu_ram_addr_from_host_nofail to physmem.c
  accel/tcg: Properly implement get_page_addr_code for user-only
  accel/tcg: Use probe_access_internal for softmmu
    get_page_addr_code_hostp
  accel/tcg: Add nofault parameter to get_page_addr_code_hostp
  accel/tcg: Unlock mmap_lock after longjmp
  accel/tcg: Raise PROT_EXEC exception early
  accel/tcg: Remove translator_ldsw
  accel/tcg: Add pc and host_pc params to gen_intermediate_code
  accel/tcg: Add fast path for translator_ld*
  accel/tcg: Use DisasContextBase in plugin_gen_tb_start
  accel/tcg: Do not align tb->page_addr[0]
  include/hw/core: Create struct CPUJumpCache
  accel/tcg: Introduce tb_pc and tb_pc_log
  accel/tcg: Introduce TARGET_TB_PCREL
  accel/tcg: Split log_cpu_exec into inline and slow path
  target/arm: Introduce curr_insn_len
  target/arm: Change gen_goto_tb to work on displacements
  target/arm: Change gen_*set_pc_im to gen_*update_pc
  target/arm: Change gen_exception_insn* to work on displacements
  target/arm: Change gen_exception_internal to work on displacements
  target/arm: Change gen_jmp* to work on displacements
  target/arm: Introduce gen_pc_plus_diff for aarch64
  target/arm: Introduce gen_pc_plus_diff for aarch32
  target/arm: Enable TARGET_TB_PCREL

 include/elf.h                           |   1 +
 include/exec/cpu-common.h               |   1 +
 include/exec/cpu-defs.h                 |   3 +
 include/exec/exec-all.h                 | 138 +++++++-------
 include/exec/plugin-gen.h               |   7 +-
 include/exec/translator.h               |  85 +++++++--
 include/hw/core/cpu.h                   |   9 +-
 linux-user/arm/target_cpu.h             |   4 +-
 linux-user/qemu.h                       |   1 +
 target/arm/cpu-param.h                  |   2 +
 target/arm/translate-a32.h              |   2 +-
 target/arm/translate.h                  |  21 ++-
 accel/tcg/cpu-exec.c                    | 222 +++++++++++++---------
 accel/tcg/cputlb.c                      |  98 +++-------
 accel/tcg/plugin-gen.c                  |  23 +--
 accel/tcg/translate-all.c               | 197 +++++++-------------
 accel/tcg/translator.c                  | 122 +++++++++---
 accel/tcg/user-exec.c                   |  15 ++
 linux-user/elfload.c                    |  81 +++++++-
 softmmu/physmem.c                       |  12 ++
 target/alpha/translate.c                |   5 +-
 target/arm/cpu.c                        |  23 +--
 target/arm/translate-a64.c              | 174 ++++++++++-------
 target/arm/translate-m-nocp.c           |   6 +-
 target/arm/translate-mve.c              |   2 +-
 target/arm/translate-vfp.c              |  10 +-
 target/arm/translate.c                  | 237 +++++++++++++++---------
 target/avr/cpu.c                        |   2 +-
 target/avr/translate.c                  |   5 +-
 target/cris/translate.c                 |   5 +-
 target/hexagon/cpu.c                    |   2 +-
 target/hexagon/translate.c              |   6 +-
 target/hppa/cpu.c                       |   4 +-
 target/hppa/translate.c                 |   5 +-
 target/i386/tcg/tcg-cpu.c               |   2 +-
 target/i386/tcg/translate.c             |   7 +-
 target/loongarch/cpu.c                  |   2 +-
 target/loongarch/translate.c            |   6 +-
 target/m68k/translate.c                 |   5 +-
 target/microblaze/cpu.c                 |   2 +-
 target/microblaze/translate.c           |   5 +-
 target/mips/tcg/exception.c             |   2 +-
 target/mips/tcg/sysemu/special_helper.c |   2 +-
 target/mips/tcg/translate.c             |   5 +-
 target/nios2/translate.c                |   5 +-
 target/openrisc/cpu.c                   |   2 +-
 target/openrisc/translate.c             |   6 +-
 target/ppc/translate.c                  |   5 +-
 target/riscv/cpu.c                      |   4 +-
 target/riscv/translate.c                |   5 +-
 target/rx/cpu.c                         |   2 +-
 target/rx/translate.c                   |   5 +-
 target/s390x/tcg/translate.c            |   5 +-
 target/sh4/cpu.c                        |   4 +-
 target/sh4/translate.c                  |   5 +-
 target/sparc/cpu.c                      |   2 +-
 target/sparc/translate.c                |   5 +-
 target/tricore/cpu.c                    |   2 +-
 target/tricore/translate.c              |   6 +-
 target/xtensa/translate.c               |   6 +-
 tcg/tcg.c                               |   6 +-
 tests/tcg/i386/test-i386.c              |   2 +-
 62 files changed, 979 insertions(+), 666 deletions(-)