mbox series

[v5,00/17] target/arm: Reduce overhead of cpu_get_tb_cpu_state

Message ID 20190820210720.18976-1-richard.henderson@linaro.org
Headers show
Series target/arm: Reduce overhead of cpu_get_tb_cpu_state | expand

Message

Richard Henderson Aug. 20, 2019, 9:07 p.m. UTC
Changes since v4:
  * Split patch 1 into 15 smaller patches.
  * Cache the new DEBUG_TARGET_EL field.
  * Split out m-profile hflags separately from a-profile 32-bit.
  * Move around non-cached tb flags as well, avoiding repetitive
    checks for m-profile or other mutually exclusive conditions.

  I haven't officially re-run the performance test quoted in the
  last patch, but I have eyeballed "perf top", and have dug into
  the compiled code a bit, which resulted in a few of the new
  cleanup patches (e.g. cs_base, arm_mmu_idx_el, and
  arm_cpu_data_is_big_endian).

Changes since v3:
  * Rebase.
  * Do not cache XSCALE_CPAR now that it overlaps VECSTRIDE.
  * Leave the new v7m bits as uncached.  I haven't figured
    out all of the ways fpccr is modified.

Changes since v2:
  * Do not cache VECLEN, VECSTRIDE, VFPEN.
    These variables come from VFP_FPSCR and VFP_FPEXC, not from
    system control registers.
  * Move HANDLER and STACKCHECK to rebuild_hflags_a32,
    instead of building them in rebuild_hflags_common.

Changes since v1:
  * Apparently I had started a last-minute API change, and failed to
    covert all of the users, and also failed to re-test afterward.
  * Retain assertions for --enable-debug-tcg.

Richard Henderson (17):
  target/arm: Split out rebuild_hflags_common
  target/arm: Split out rebuild_hflags_a64
  target/arm: Split out rebuild_hflags_common_32
  target/arm: Split arm_cpu_data_is_big_endian
  target/arm: Split out rebuild_hflags_m32
  target/arm: Reduce tests vs M-profile in cpu_get_tb_cpu_state
  target/arm: Split out rebuild_hflags_a32
  target/arm: Split out rebuild_hflags_aprofile
  target/arm: Hoist XSCALE_CPAR, VECLEN, VECSTRIDE in
    cpu_get_tb_cpu_state
  target/arm: Simplify set of PSTATE_SS in cpu_get_tb_cpu_state
  target/arm: Hoist computation of TBFLAG_A32.VFPEN
  target/arm: Add arm_rebuild_hflags
  target/arm: Split out arm_mmu_idx_el
  target/arm: Hoist store to cs_base in cpu_get_tb_cpu_state
  target/arm: Add HELPER(rebuild_hflags_{a32,a64,m32})
  target/arm: Rebuild hflags at EL changes and MSR writes
  target/arm: Rely on hflags correct in cpu_get_tb_cpu_state

 target/arm/cpu.h           |  84 +++++---
 target/arm/helper.h        |   4 +
 target/arm/internals.h     |   9 +
 linux-user/syscall.c       |   1 +
 target/arm/cpu.c           |   1 +
 target/arm/helper-a64.c    |   3 +
 target/arm/helper.c        | 383 ++++++++++++++++++++++++-------------
 target/arm/machine.c       |   1 +
 target/arm/op_helper.c     |   1 +
 target/arm/translate-a64.c |   6 +-
 target/arm/translate.c     |  18 +-
 11 files changed, 341 insertions(+), 170 deletions(-)

-- 
2.17.1

Comments

Richard Henderson Aug. 20, 2019, 11:54 p.m. UTC | #1
On 8/20/19 2:07 PM, Richard Henderson wrote:
> Changes since v4:

>   * Split patch 1 into 15 smaller patches.

>   * Cache the new DEBUG_TARGET_EL field.

>   * Split out m-profile hflags separately from a-profile 32-bit.

>   * Move around non-cached tb flags as well, avoiding repetitive

>     checks for m-profile or other mutually exclusive conditions.


Just after I posted this, I started rebasing my VHE patch set on top, and I
found that the new DEBUG_TARGET_EL field has used The Last Bit, so that I could
not add the one bit that I need for VHE.

However, while working on this patch set, I noticed that we have a lot of
unnecessary overlap between A- and M- profile in the TBFLAGs.  Thus point 4
above and the completely separate rebuild_hflags_m32().

If we rearrange things like the appended, then we recover 4 bits.

Thoughts?


r~
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 91a54662c3..0c2803baa1 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -3183,38 +3183,50 @@ FIELD(TBFLAG_ANY, BE_DATA, 23, 1)
  */
 FIELD(TBFLAG_ANY, DEBUG_TARGET_EL, 21, 2)
 
-/* Bit usage when in AArch32 state: */
-FIELD(TBFLAG_A32, THUMB, 0, 1)          /* Not cached. */
-FIELD(TBFLAG_A32, VECLEN, 1, 3)         /* Not cached. */
-FIELD(TBFLAG_A32, VECSTRIDE, 4, 2)      /* Not cached. */
+/*
+ * Bit usage when in AArch32 state, both A- and M-profile.
+ */
+FIELD(TBFLAG_AM32, CONDEXEC, 0, 8)      /* Not cached. */
+
+/*
+ * Bit usage when in AArch32 state, for A-profile only.
+ */
+FIELD(TBFLAG_A32, THUMB, 8, 1)          /* Not cached. */
+FIELD(TBFLAG_A32, VECLEN, 9, 3)         /* Not cached. */
+FIELD(TBFLAG_A32, VECSTRIDE, 12, 2)     /* Not cached. */
 /*
  * We store the bottom two bits of the CPAR as TB flags and handle
  * checks on the other bits at runtime. This shares the same bits as
  * VECSTRIDE, which is OK as no XScale CPU has VFP.
  * Not cached, because VECLEN+VECSTRIDE are not cached.
  */
-FIELD(TBFLAG_A32, XSCALE_CPAR, 4, 2)
+FIELD(TBFLAG_A32, XSCALE_CPAR, 12, 2)
+FIELD(TBFLAG_A32, VFPEN, 14, 1)         /* Partially cached, minus FPEXC. */
+FIELD(TBFLAG_A32, SCTLR_B, 15, 1)
 /*
  * Indicates whether cp register reads and writes by guest code should access
  * the secure or nonsecure bank of banked registers; note that this is not
  * the same thing as the current security state of the processor!
  */
-FIELD(TBFLAG_A32, NS, 6, 1)
-FIELD(TBFLAG_A32, VFPEN, 7, 1)          /* Partially cached, minus FPEXC. */
-FIELD(TBFLAG_A32, CONDEXEC, 8, 8)       /* Not cached. */
-FIELD(TBFLAG_A32, SCTLR_B, 16, 1)
-/* For M profile only, set if FPCCR.LSPACT is set */
-FIELD(TBFLAG_A32, LSPACT, 18, 1)        /* Not cached. */
-/* For M profile only, set if we must create a new FP context */
-FIELD(TBFLAG_A32, NEW_FP_CTXT_NEEDED, 19, 1) /* Not cached. */
-/* For M profile only, set if FPCCR.S does not match current security state */
-FIELD(TBFLAG_A32, FPCCR_S_WRONG, 20, 1) /* Not cached. */
-/* For M profile only, Handler (ie not Thread) mode */
-FIELD(TBFLAG_A32, HANDLER, 21, 1)
-/* For M profile only, whether we should generate stack-limit checks */
-FIELD(TBFLAG_A32, STACKCHECK, 22, 1)
+FIELD(TBFLAG_A32, NS, 16, 1)
 
-/* Bit usage when in AArch64 state */
+/*
+ * Bit usage when in AArch32 state, for M-profile only.
+ */
+/* Set if FPCCR.LSPACT is set */
+FIELD(TBFLAG_M32, LSPACT, 8, 1)                  /* Not cached. */
+/* Set if we must create a new FP context */
+FIELD(TBFLAG_M32, NEW_FP_CTXT_NEEDED, 9, 1)      /* Not cached. */
+/* Set if FPCCR.S does not match current security state */
+FIELD(TBFLAG_M32, FPCCR_S_WRONG, 10, 1)          /* Not cached. */
+/* Handler (ie not Thread) mode */
+FIELD(TBFLAG_A32, HANDLER, 11, 1)
+/* Whether we should generate stack-limit checks */
+FIELD(TBFLAG_A32, STACKCHECK, 12, 1)
+
+/*
+ * Bit usage when in AArch64 state
+ */
 FIELD(TBFLAG_A64, TBII, 0, 2)
 FIELD(TBFLAG_A64, SVEEXC_EL, 2, 2)
 FIELD(TBFLAG_A64, ZCR_LEN, 4, 4)
Peter Maydell Sept. 4, 2019, 10:48 a.m. UTC | #2
On Wed, 21 Aug 2019 at 00:54, Richard Henderson
<richard.henderson@linaro.org> wrote:
> However, while working on this patch set, I noticed that we have a lot of

> unnecessary overlap between A- and M- profile in the TBFLAGs.  Thus point 4

> above and the completely separate rebuild_hflags_m32().

>

> If we rearrange things like the appended, then we recover 4 bits.


You can't make the THUMB bit A-profile only: we need it in
M-profile too, so we can correctly generate code that takes
the InvalidState exception for attempts to execute with the
Thumb bit not set.

If you want to make VFPEN be A-profile only you need to
do something so we don't look at it for M-profile: currently
we set it always-1 for M-profile so we don't trip the code
that causes us to take an exception if it's 0.

Otherwise seems reasonable. My overall question is: how bad
is it if we just start using bits in the cs_base word?
If we try to get too tricky with using the same bits for
different purposes it opens the door for accidentally writing
code where we use a bit that isn't actually set correctly
for all the situations where we're using it.

thanks
-- PMM
Richard Henderson Sept. 4, 2019, 5:26 p.m. UTC | #3
On 9/4/19 3:48 AM, Peter Maydell wrote:
> On Wed, 21 Aug 2019 at 00:54, Richard Henderson

> <richard.henderson@linaro.org> wrote:

>> However, while working on this patch set, I noticed that we have a lot of

>> unnecessary overlap between A- and M- profile in the TBFLAGs.  Thus point 4

>> above and the completely separate rebuild_hflags_m32().

>>

>> If we rearrange things like the appended, then we recover 4 bits.

> 

> You can't make the THUMB bit A-profile only: we need it in

> M-profile too, so we can correctly generate code that takes

> the InvalidState exception for attempts to execute with the

> Thumb bit not set.


Yes, IIRC I found that when I went to make this work, rather than merely a
sketch through the header file.

> If you want to make VFPEN be A-profile only you need to

> do something so we don't look at it for M-profile: currently

> we set it always-1 for M-profile so we don't trip the code

> that causes us to take an exception if it's 0.

> 

> Otherwise seems reasonable. My overall question is: how bad

> is it if we just start using bits in the cs_base word?


*shrug* Even then we don't want to waste bits; there are only 32 of them
remaining, and after that there's no room at all for expansion.

> If we try to get too tricky with using the same bits for

> different purposes it opens the door for accidentally writing

> code where we use a bit that isn't actually set correctly

> for all the situations where we're using it.


Thankfully, we don't touch these bits in many places.  We set them in the
generator function, and we read them at the beginning of the translator.


r~