Message ID: 20230503070656.1746170-53-richard.henderson@linaro.org
State: Superseded
Series: tcg: Improve atomicity support
On Wed, 3 May 2023 at 08:18, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Use the fpu to perform 64-bit loads and stores.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

> @@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>              datalo = datahi;
>              datahi = t;
>          }
> -        if (h.base == datalo || h.index == datalo) {
> +        if (h.atom == MO_64) {
> +            /*
> +             * Atomicity requires that we use a single 8-byte load.
> +             * For simplicity and code size, always use the FPU for this.
> +             * Similar insns using SSE/AVX are merely larger.

I'm surprised there's no performance penalty for throwing old-school
FPU insns into what is presumably otherwise code that's only using
modern SSE.

> +             * Load from memory in one go, then store back to the stack,
> +             * from whence we can load into the correct integer regs.
> +             */
> +            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FILD_m64,
> +                                     h.base, h.index, 0, h.ofs);
> +            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FISTP_m64, TCG_REG_ESP, 0);
> +            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
> +            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
> +        } else if (h.base == datalo || h.index == datalo) {
>              tcg_out_modrm_sib_offset(s, OPC_LEA, datahi,
>                                       h.base, h.index, 0, h.ofs);
>              tcg_out_modrm_offset(s, movop + h.seg, datalo, datahi, 0);

I assume the caller has arranged that the top of the stack is
trashable at this point?

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

--
PMM
On 5/5/23 14:27, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:18, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Use the fpu to perform 64-bit loads and stores.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>
>> @@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>>              datalo = datahi;
>>              datahi = t;
>>          }
>> -        if (h.base == datalo || h.index == datalo) {
>> +        if (h.atom == MO_64) {
>> +            /*
>> +             * Atomicity requires that we use a single 8-byte load.
>> +             * For simplicity and code size, always use the FPU for this.
>> +             * Similar insns using SSE/AVX are merely larger.
>
> I'm surprised there's no performance penalty for throwing old-school
> FPU insns into what is presumably otherwise code that's only using
> modern SSE.

I have no idea about performance. We don't require SSE for TCG at the
moment.

> I assume the caller has arranged that the top of the stack is
> trashable at this point?

The entire fpu stack is call-clobbered.

r~
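[For readers following along, here is a minimal standalone sketch of the pattern the patch emits, written as GCC inline asm for a 32-bit x86 build. This is not QEMU code and the helper names are invented for illustration. An fild/fistp m64 pair on an aligned operand performs single 8-byte accesses, which is what provides the MO_64 atomicity; and since, as noted above, the x87 stack is call-clobbered, ST(0) is free to use.]

    #include <stdint.h>

    /* Illustrative only: p must be 8-byte aligned for the
       access to be performed as a single memory transaction. */
    static uint64_t x87_load_u64(const uint64_t *p)
    {
        uint64_t v;
        /* fild: one 8-byte load into ST(0);
           fistp: one 8-byte store to v, popping the x87 stack. */
        __asm__("fildll %1\n\t"
                "fistpll %0"
                : "=m"(v) : "m"(*p) : "st");
        return v;
    }

    static void x87_store_u64(uint64_t *p, uint64_t v)
    {
        /* Reverse direction: assemble v in memory first (the patch
           does this on the stack), then one 8-byte load and one
           8-byte store to the target address. */
        __asm__("fildll %1\n\t"
                "fistpll %0"
                : "=m"(*p) : "m"(v) : "st");
    }

[The round trip is exact for any 64-bit pattern: ST registers are 80-bit with a 64-bit significand, so FILD/FISTP never lose integer bits.]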
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 3e21f067d6..5c6c64c48a 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -468,6 +468,10 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 #define OPC_GRP5        (0xff)
 #define OPC_GRP14       (0x73 | P_EXT | P_DATA16)
 
+#define OPC_ESCDF       (0xdf)
+#define ESCDF_FILD_m64  5
+#define ESCDF_FISTP_m64 7
+
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
 #define ARITH_ADD 0
@@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             datalo = datahi;
             datahi = t;
         }
-        if (h.base == datalo || h.index == datalo) {
+        if (h.atom == MO_64) {
+            /*
+             * Atomicity requires that we use a single 8-byte load.
+             * For simplicity and code size, always use the FPU for this.
+             * Similar insns using SSE/AVX are merely larger.
+             * Load from memory in one go, then store back to the stack,
+             * from whence we can load into the correct integer regs.
+             */
+            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FILD_m64,
+                                     h.base, h.index, 0, h.ofs);
+            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FISTP_m64, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
+        } else if (h.base == datalo || h.index == datalo) {
             tcg_out_modrm_sib_offset(s, OPC_LEA, datahi,
                                      h.base, h.index, 0, h.ofs);
             tcg_out_modrm_offset(s, movop + h.seg, datalo, datahi, 0);
@@ -2161,12 +2178,27 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
         if (TCG_TARGET_REG_BITS == 64) {
             tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
                                      h.base, h.index, 0, h.ofs);
+            break;
+        }
+        if (use_movbe) {
+            TCGReg t = datalo;
+            datalo = datahi;
+            datahi = t;
+        }
+        if (h.atom == MO_64) {
+            /*
+             * Atomicity requires that we use one 8-byte store.
+             * For simplicity and code size, always use the FPU for this.
+             * Similar insns using SSE/AVX are merely larger.
+             * Assemble the 8-byte quantity in required endianness
+             * on the stack, load to coproc unit, and store.
+             */
+            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
+            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FILD_m64, TCG_REG_ESP, 0);
+            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FISTP_m64,
+                                     h.base, h.index, 0, h.ofs);
         } else {
-            if (use_movbe) {
-                TCGReg t = datalo;
-                datalo = datahi;
-                datahi = t;
-            }
             tcg_out_modrm_sib_offset(s, movop + h.seg, datalo,
                                      h.base, h.index, 0, h.ofs);
             tcg_out_modrm_sib_offset(s, movop + h.seg, datahi,
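[A side note on the new defines: for the 0xDF x87 escape opcode, the ModRM "reg" field carries an opcode extension (/5 = FILD m64int, /7 = FISTP m64int) rather than a register number, which is why ESCDF_FILD_m64 and ESCDF_FISTP_m64 are small integers passed where the emitter helpers otherwise take a register. A hand-rolled sketch of the encoding, not QEMU code, assuming a plain [reg] addressing mode with no SIB byte or displacement:]

    #include <stdint.h>
    #include <stdio.h>

    /* Encode "fild/fistp qword ptr [basereg]".  Valid only for
       basereg != 4 (ESP needs a SIB byte) and != 5 (EBP needs a
       displacement); ext is the /5 or /7 opcode extension. */
    static void emit_escdf(uint8_t *buf, int ext, int basereg)
    {
        buf[0] = 0xdf;                      /* x87 escape opcode      */
        buf[1] = (uint8_t)((0 << 6)         /* mod=00: [reg], no disp */
                           | (ext << 3)     /* reg field = extension  */
                           | basereg);      /* r/m = base register    */
    }

    int main(void)
    {
        uint8_t fild[2], fistp[2];
        emit_escdf(fild, 5, 0);   /* fild  qword ptr [eax]: DF 28 */
        emit_escdf(fistp, 7, 0);  /* fistp qword ptr [eax]: DF 38 */
        printf("fild:  %02x %02x\n", fild[0], fild[1]);
        printf("fistp: %02x %02x\n", fistp[0], fistp[1]);
        return 0;
    }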
Use the fpu to perform 64-bit loads and stores.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.c.inc | 44 +++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)