Message ID | 20220911230418.340941-28-pbonzini@redhat.com |
---|---|
State | New |
Series | [01/37] target/i386: Define XMMReg and access macros, align ZMM registers |
On 9/12/22 00:04, Paolo Bonzini wrote:
> +    while (vec_len > 8) {
> +        vec_len -= 8;
> +        tcg_gen_shli_tl(s->T0, s->T0, 8);
> +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
> +        tcg_gen_or_tl(s->T0, s->T0, t);
> +    }

The shl + or is deposit, for those hosts that have it, and will be
re-expanded to shl + or for those that don't:

    tcg_gen_ld8u_tl(t, ...);
    tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);

r~
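Spelled out, the suggestion replaces the shift/or pair in the loop body with a single deposit. The sketch below reuses the variable names and offsets from the patch and is only an illustration of the suggestion, not code from the series:

    while (vec_len > 8) {
        vec_len -= 8;
        tcg_gen_ld8u_tl(t, cpu_env,
                        offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
        /* Place the previous T0 above the newly loaded byte:
           T0 = (T0 << 8) | t, expressed as one deposit. */
        tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);
    }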
On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/12/22 00:04, Paolo Bonzini wrote:
> > +    while (vec_len > 8) {
> > +        vec_len -= 8;
> > +        tcg_gen_shli_tl(s->T0, s->T0, 8);
> > +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
> > +        tcg_gen_or_tl(s->T0, s->T0, t);
> > +    }
>
> The shl + or is deposit, for those hosts that have it,
> and will be re-expanded to shl + or for those that don't:
>
>     tcg_gen_ld8u_tl(t, ...);
>     tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);

What you get from that is an shl(t, 56) followed by extract2 (i.e.
SHRD). Yeah, there are targets with a native deposit (x86 itself could
add PDEP/PEXT support, I guess), but I find it hard to believe that it
outperforms a simple shl + or.

If we want to get clever, I should instead load ZMM_B(vec_len - 1)
directly into the *high* byte of t, using ZMM_L or ZMM_Q, and then
issue the extract2 myself.

Paolo
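For a 64-bit target_long, the alternative Paolo describes could look roughly like the sketch below: load the whole qword whose top byte carries the next mask bits (the per-lane helper in the patch leaves them in the high byte of each qword) and let extract2 form (T0 << 8) | (t >> 56) in one step. The ZMM_Q index and the fixed 56-bit offset are assumptions for the sake of illustration; a 32-bit target_long would need different handling.

    tcg_gen_ld_tl(t, cpu_env,
                  offsetof(CPUX86State, xmm_t0.ZMM_Q(vec_len / 8 - 1)));
    /* extract2 takes 64 bits from the concatenation T0:t starting at
       bit 56, i.e. (t >> 56) | (T0 << 8). */
    tcg_gen_extract2_tl(s->T0, t, s->T0, 56);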
On 9/14/22 23:59, Paolo Bonzini wrote:
> On Tue, Sep 13, 2022 at 10:17 AM Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/12/22 00:04, Paolo Bonzini wrote:
>>> +    while (vec_len > 8) {
>>> +        vec_len -= 8;
>>> +        tcg_gen_shli_tl(s->T0, s->T0, 8);
>>> +        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
>>> +        tcg_gen_or_tl(s->T0, s->T0, t);
>>> +    }
>>
>> The shl + or is deposit, for those hosts that have it,
>> and will be re-expanded to shl + or for those that don't:
>>
>>     tcg_gen_ld8u_tl(t, ...);
>>     tcg_gen_deposit_tl(s->T0, t, s->T0, 8, TARGET_LONG_BITS - 8);
>
> What you get from that is an shl(t, 56) followed by extract2 (i.e.
> SHRD). Yeah there are targets with a native deposit (x86 itself could
> add PDEP/PEXT support I guess) but I find it hard to believe that it
> outperforms a simple shl + or.

Perhaps the shl+shrd (or shrd+rol if the deposit is slightly different)
is over-cleverness on my part in the expansion, and pdep requires a
constant mask. But for other hosts, deposit is the same cost as shift.

r~
diff --git a/target/i386/tcg/emit.c.inc b/target/i386/tcg/emit.c.inc
index dbf2c05e16..52c0a7fbe0 100644
--- a/target/i386/tcg/emit.c.inc
+++ b/target/i386/tcg/emit.c.inc
@@ -1179,14 +1179,69 @@ static void gen_PINSR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
     gen_pinsr(s, env, decode, decode->op[2].ot);
 }
 
+static void gen_pmovmskb_i64(TCGv_i64 d, TCGv_i64 s)
+{
+    TCGv_i64 t = tcg_temp_new_i64();
+
+    tcg_gen_andi_i64(d, s, 0x8080808080808080ull);
+
+    /*
+     * After each shift+or pair:
+     * 0:  a.......b.......c.......d.......e.......f.......g.......h.......
+     * 7:  ab......bc......cd......de......ef......fg......gh......h.......
+     * 14: abcd....bcde....cdef....defg....efgh....fgh.....gh......h.......
+     * 28: abcdefghbcdefgh.cdefgh..defgh...efgh....fgh.....gh......h.......
+     * The result is left in the high bits of the word.
+     */
+    tcg_gen_shli_i64(t, d, 7);
+    tcg_gen_or_i64(d, d, t);
+    tcg_gen_shli_i64(t, d, 14);
+    tcg_gen_or_i64(d, d, t);
+    tcg_gen_shli_i64(t, d, 28);
+    tcg_gen_or_i64(d, d, t);
+}
+
+static void gen_pmovmskb_vec(unsigned vece, TCGv_vec d, TCGv_vec s)
+{
+    TCGv_vec t = tcg_temp_new_vec_matching(d);
+    TCGv_vec m = tcg_constant_vec_matching(d, MO_8, 0x80);
+
+    /* See above */
+    tcg_gen_and_vec(vece, d, s, m);
+    tcg_gen_shli_vec(vece, t, d, 7);
+    tcg_gen_or_vec(vece, d, d, t);
+    tcg_gen_shli_vec(vece, t, d, 14);
+    tcg_gen_or_vec(vece, d, d, t);
+    if (vece == MO_64) {
+        tcg_gen_shli_vec(vece, t, d, 28);
+        tcg_gen_or_vec(vece, d, d, t);
+    }
+}
+
 static void gen_PMOVMSKB(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
 {
-    if (s->prefix & PREFIX_DATA) {
-        gen_helper_pmovmskb_xmm(s->tmp2_i32, cpu_env, s->ptr2);
-    } else {
-        gen_helper_pmovmskb_mmx(s->tmp2_i32, cpu_env, s->ptr2);
+    static const TCGOpcode vecop_list[] = { INDEX_op_shli_vec, 0 };
+    static const GVecGen2 g = {
+        .fni8 = gen_pmovmskb_i64,
+        .fniv = gen_pmovmskb_vec,
+        .opt_opc = vecop_list,
+        .vece = MO_64,
+        .prefer_i64 = TCG_TARGET_REG_BITS == 64
+    };
+    MemOp ot = decode->op[0].ot;
+    int vec_len = sse_vec_len(s, decode);
+    TCGv t = tcg_temp_new();
+
+    tcg_gen_gvec_2(offsetof(CPUX86State, xmm_t0) + xmm_offset(ot), decode->op[2].offset,
+                   vec_len, vec_len, &g);
+    tcg_gen_ld8u_tl(s->T0, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+    while (vec_len > 8) {
+        vec_len -= 8;
+        tcg_gen_shli_tl(s->T0, s->T0, 8);
+        tcg_gen_ld8u_tl(t, cpu_env, offsetof(CPUX86State, xmm_t0.ZMM_B(vec_len - 1)));
+        tcg_gen_or_tl(s->T0, s->T0, t);
     }
-    tcg_gen_extu_i32_tl(s->T0, s->tmp2_i32);
+    tcg_temp_free(t);
 }
 
 static void gen_POR(DisasContext *s, CPUX86State *env, X86DecodedInsn *decode)
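The shift/or ladder in gen_pmovmskb_i64 can be checked with a small scalar model. The program below is not part of the patch and uses an arbitrary sample value; it only mirrors the same mask, shift and or steps per 64-bit lane.

    #include <stdint.h>
    #include <stdio.h>

    /* Gather the sign bit of each of the eight bytes of x into one mask
     * byte, bit 7 = sign of byte 7.  The patch leaves the same value in
     * the high byte of the lane instead of shifting it down. */
    static uint64_t pmovmskb_lane(uint64_t x)
    {
        uint64_t d = x & 0x8080808080808080ull;
        d |= d << 7;
        d |= d << 14;
        d |= d << 28;
        return d >> 56;
    }

    int main(void)
    {
        /* bytes 0, 3 and 7 have their sign bit set -> prints 0x89 */
        printf("%#x\n", (unsigned)pmovmskb_lane(0x8000000080000080ull));
        return 0;
    }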