Message ID | 1513007074-18802-1-git-send-email-james.greenhalgh@arm.com
---|---
State | New
Series | [AArch64] Do not perform a vector splat for vector initialisation if it is not useful
On 12/11/2017 08:44 AM, James Greenhalgh wrote:
> Hi,
>
> In the testcase in this patch we create an SLP vector with only two
> elements. Our current vector initialisation code will first duplicate
> the first element to both lanes, then overwrite the top lane with a new
> value.
>
> This duplication can be clunky and wasteful.
>
> Better would be to simply use the fact that we will always be overwriting
> the remaining bits, and simply move the first element to the correct place
> (implicitly zeroing all other bits).
>
> This reduces the code generation for this case, and can allow more
> efficient addressing modes, and other second order benefits for AArch64
> code which has been vectorized to V2DI mode.
>
> Note that the change is generic enough to catch the case for any vector
> mode, but is expected to be most useful for 2x64-bit vectorization.
>
> Unfortunately, on its own, this would cause failures in
> gcc.target/aarch64/load_v2vec_lanes_1.c and
> gcc.target/aarch64/store_v2vec_lanes.c, which expect to see many more
> vec_merge and vec_duplicate for their simplifications to apply. To fix this,
> add a special case to the AArch64 code if we are loading from two memory
> addresses, and use the load_pair_lanes patterns directly.
>
> We also need a new pattern in simplify-rtx.c:simplify_ternary_operation, to
> catch:
>
>   (vec_merge:OUTER
>      (vec_duplicate:OUTER x:INNER)
>      (subreg:OUTER y:INNER 0)
>      (const_int N))
>
> And simplify it to:
>
>   (vec_concat:OUTER x:INNER y:INNER) or (vec_concat y x)
>
> This is similar to the existing patterns which are tested in this function,
> without requiring the second operand to also be a vec_duplicate.
>
> Bootstrapped and tested on aarch64-none-linux-gnu and tested on
> aarch64-none-elf.
>
> Note that this requires https://gcc.gnu.org/ml/gcc-patches/2017-12/msg00614.html
> if we don't want to ICE creating broken vector zero extends.
>
> Are the non-AArch64 parts OK?
>
> Thanks,
> James
>
> ---
> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>
>	* config/aarch64/aarch64.c (aarch64_expand_vector_init): Modify code
>	generation for cases where splatting a value is not useful.
>	* simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
>	across a vec_duplicate and a paradoxical subreg forming a vector
>	mode to a vec_concat.
>
> 2017-12-11  James Greenhalgh  <james.greenhalgh@arm.com>
>
>	* gcc.target/aarch64/vect-slp-dup.c: New.
>
>
> 0001-patch-AArch64-Do-not-perform-a-vector-splat-for-vect.patch
>
>
> diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
> index 806c309..ed16f70 100644
> --- a/gcc/simplify-rtx.c
> +++ b/gcc/simplify-rtx.c
> @@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
>  	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
>  	    }
>  
> +      /* Replace:
> +
> +	  (vec_merge:outer (vec_duplicate:outer x:inner)
> +			   (subreg:outer y:inner 0)
> +			   (const_int N))
> +
> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
> +	 or (vec_concat:outer y:inner x:inner) if N == 2.
> +
> +	 Implicitly, this means we have a paradoxical subreg, but such
> +	 a check is cheap, so make it anyway.

I'm going to assume that N == 1 and N == 3 are handled elsewhere and do
not show up here in practice.

So is it advisable to handle the case where the VEC_DUPLICATE and SUBREG
show up in the opposite order?  Or is there some canonicalization that
prevents that?

simplify-rtx bits are OK as-is if we're certain we're not going to get
the alternate ordering of the VEC_MERGE operands.  Also OK if you either
generalize this chunk of code or duplicate & twiddle it to handle the
alternate order.

I didn't look at the aarch64 specific bits.

jeff
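For readers following the discussion of the new simplify-rtx.c rule, here is a small standalone C sketch, not GCC code and not part of the patch, that models the two-lane vec_merge selection the rule relies on. It assumes the documented RTL semantics (a set bit i in the selector takes lane i from the first operand) and assumes the lowpart of the paradoxical subreg maps to lane 0, as on a little-endian target; the type and helper names are invented for illustration.

```c
#include <stdio.h>

/* Two 64-bit lanes; lane 0 models the low part of the register on a
   little-endian target (an assumption of this sketch).  */
typedef struct { long long lane[2]; } v2;

/* RTL vec_merge for two lanes: bit i of SEL set -> lane i from A,
   otherwise lane i from B.  */
static v2
vec_merge (v2 a, v2 b, int sel)
{
  v2 r;
  for (int i = 0; i < 2; i++)
    r.lane[i] = ((sel >> i) & 1) ? a.lane[i] : b.lane[i];
  return r;
}

/* (vec_duplicate x): both lanes hold x.  */
static v2
vec_duplicate (long long x)
{
  v2 r = { { x, x } };
  return r;
}

/* (subreg:V2 y 0), a paradoxical subreg: lane 0 holds y, lane 1 is
   undefined (an arbitrary poison value here).  */
static v2
paradoxical_subreg (long long y)
{
  v2 r = { { y, (long long) 0xdeadbeef } };
  return r;
}

int
main (void)
{
  long long x = 11, y = 22;

  /* sel == 2: lane 1 from the duplicate of x, lane 0 from the defined
     low lane of the subreg -> exactly (vec_concat y x) = {y, x}.  */
  v2 m2 = vec_merge (vec_duplicate (x), paradoxical_subreg (y), 2);
  printf ("sel=2 -> { %lld, %lld }\n", m2.lane[0], m2.lane[1]);

  /* sel == 1: lane 0 is x, but lane 1 comes from the undefined upper
     part of the subreg; rewriting to (vec_concat x y) = {x, y} amounts
     to choosing y for those undefined bits.  */
  v2 m1 = vec_merge (vec_duplicate (x), paradoxical_subreg (y), 1);
  printf ("sel=1 -> { %lld, %lld }  (lane 1 undefined before the rewrite)\n",
          m1.lane[0], m1.lane[1]);

  return 0;
}
```

Under those assumptions, sel == 2 only uses the defined low lane of the subreg, while sel == 1 also relies on the freedom to pick a value for the undefined upper lane; note that the lane-to-lowpart mapping used here is an endianness-dependent assumption of the sketch, not something stated in the thread.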
On 19 December 2017 at 00:36, Jeff Law <law@redhat.com> wrote:
> On 12/11/2017 08:44 AM, James Greenhalgh wrote:
>> +      /* Replace:
>> +
>> +	  (vec_merge:outer (vec_duplicate:outer x:inner)
>> +			   (subreg:outer y:inner 0)
>> +			   (const_int N))
>> +
>> +	 with (vec_concat:outer x:inner y:inner) if N == 1,
>> +	 or (vec_concat:outer y:inner x:inner) if N == 2.
>> +
>> +	 Implicitly, this means we have a paradoxical subreg, but such
>> +	 a check is cheap, so make it anyway.
> I'm going to assume that N == 1 and N == 3 are handled elsewhere and do
> not show up here in practice.
>
> So is it advisable to handle the case where the VEC_DUPLICATE and SUBREG
> show up in the opposite order?  Or is there some canonicalization that
> prevents that?
>
> simplify-rtx bits are OK as-is if we're certain we're not going to get
> the alternate ordering of the VEC_MERGE operands.  Also OK if you either
> generalize this chunk of code or duplicate & twiddle it to handle the
> alternate order.
>
> I didn't look at the aarch64 specific bits.

Hi James,

Your patch (r255946) introduced regressions on aarch64_be:
http://people.linaro.org/~christophe.lyon/cross-validation/gcc/trunk/255946/report-build-info.html

I filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83663 to track this.

Thanks,

Christophe

> jeff
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 83d8607..8abb8e4 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -12105,9 +12105,51 @@ aarch64_expand_vector_init (rtx target, rtx vals)
 	      maxv = matches[i][1];
 	    }
 
-      /* Create a duplicate of the most common element.  */
-      rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
-      aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+      /* Create a duplicate of the most common element, unless all elements
+	 are equally useless to us, in which case just immediately set the
+	 vector register using the first element.  */
+
+      if (maxv == 1)
+	{
+	  /* For vectors of two 64-bit elements, we can do even better.  */
+	  if (n_elts == 2
+	      && (inner_mode == E_DImode
+		  || inner_mode == E_DFmode))
+
+	    {
+	      rtx x0 = XVECEXP (vals, 0, 0);
+	      rtx x1 = XVECEXP (vals, 0, 1);
+	      /* Combine can pick up this case, but handling it directly
+		 here leaves clearer RTL.
+
+		 This is load_pair_lanes<mode>, and also gives us a clean-up
+		 for store_pair_lanes<mode>.  */
+	      if (memory_operand (x0, inner_mode)
+		  && memory_operand (x1, inner_mode)
+		  && !STRICT_ALIGNMENT
+		  && rtx_equal_p (XEXP (x1, 0),
+				  plus_constant (Pmode,
+						 XEXP (x0, 0),
+						 GET_MODE_SIZE (inner_mode))))
+		{
+		  rtx t;
+		  if (inner_mode == DFmode)
+		    t = gen_load_pair_lanesdf (target, x0, x1);
+		  else
+		    t = gen_load_pair_lanesdi (target, x0, x1);
+		  emit_insn (t);
+		  return;
+		}
+	    }
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, 0));
+	  aarch64_emit_move (target, lowpart_subreg (mode, x, inner_mode));
+	  maxelement = 0;
+	}
+      else
+	{
+	  rtx x = copy_to_mode_reg (inner_mode, XVECEXP (vals, 0, maxelement));
+	  aarch64_emit_move (target, gen_vec_duplicate (mode, x));
+	}
 
       /* Insert the rest.  */
       for (int i = 0; i < n_elts; i++)
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 806c309..ed16f70 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5785,6 +5785,36 @@ simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
 
+      /* Replace:
+
+	  (vec_merge:outer (vec_duplicate:outer x:inner)
+			   (subreg:outer y:inner 0)
+			   (const_int N))
+
+	 with (vec_concat:outer x:inner y:inner) if N == 1,
+	 or (vec_concat:outer y:inner x:inner) if N == 2.
+
+	 Implicitly, this means we have a paradoxical subreg, but such
+	 a check is cheap, so make it anyway.
+
+	 Only applies for vectors of two elements.  */
+      if (GET_CODE (op0) == VEC_DUPLICATE
+	  && GET_CODE (op1) == SUBREG
+	  && GET_MODE (op1) == GET_MODE (op0)
+	  && GET_MODE (SUBREG_REG (op1)) == GET_MODE (XEXP (op0, 0))
+	  && paradoxical_subreg_p (op1)
+	  && SUBREG_BYTE (op1) == 0
+	  && GET_MODE_NUNITS (GET_MODE (op0)) == 2
+	  && GET_MODE_NUNITS (GET_MODE (op1)) == 2
+	  && IN_RANGE (sel, 1, 2))
+	{
+	  rtx newop0 = XEXP (op0, 0);
+	  rtx newop1 = SUBREG_REG (op1);
+	  if (sel == 2)
+	    std::swap (newop0, newop1);
+	  return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	}
+
       /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
 	 (const_int n))
 	 with (vec_concat x y) or (vec_concat y x) depending on value
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
new file mode 100644
index 0000000..0541e48
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-dup.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+
+/* { dg-options "-O3 -ftree-vectorize -fno-vect-cost-model" } */
+
+void bar (double);
+
+void
+foo (double *restrict in, double *restrict in2,
+     double *restrict out1, double *restrict out2)
+{
+  for (int i = 0; i < 1024; i++)
+    {
+      out1[i] = in[i] + 2.0 * in[i+128];
+      out1[i+1] = in[i+1] + 2.0 * in2[i];
+      bar (in[i]);
+    }
+}
+
+/* { dg-final { scan-assembler-not "dup\tv\[0-9\]+.2d, v\[0-9\]+" } } */
+
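To make the shape of the aarch64_expand_vector_init change easier to follow, here is a minimal sketch, on plain arrays rather than RTL, of the strategy choice the hunk above implements. It is not the real GCC code: the function and names are invented for illustration, the real implementation emits a vec_duplicate or a lowpart-subreg move followed by lane inserts rather than array stores, and the load_pair_lanes<mode> special case for two adjacent memory operands is only noted in a comment.

```c
#include <stdio.h>

#define N_ELTS 4

/* Toy model of the strategy choice in the patch, using plain arrays.  */
static void
expand_vector_init (long target[N_ELTS], const long vals[N_ELTS])
{
  /* Find the element that occurs most often (matches[] in the real code).  */
  int maxelement = 0, maxv = 1;
  for (int i = 0; i < N_ELTS; i++)
    {
      int count = 0;
      for (int j = 0; j < N_ELTS; j++)
	if (vals[j] == vals[i])
	  count++;
      if (count > maxv)
	{
	  maxv = count;
	  maxelement = i;
	}
    }

  if (maxv == 1)
    {
      /* All elements distinct: a splat buys nothing, so just place element 0
	 in lane 0 (the lowpart-subreg move in the patch) and leave the other
	 lanes to be overwritten below.  */
      target[0] = vals[0];
      maxelement = 0;
    }
  else
    {
      /* Splat the most common element, as before the patch.  */
      for (int i = 0; i < N_ELTS; i++)
	target[i] = vals[maxelement];
    }

  /* Insert the remaining elements lane by lane.  */
  for (int i = 0; i < N_ELTS; i++)
    if (i != maxelement)
      target[i] = vals[i];

  /* Not modelled here: the new two-element special case that uses
     load_pair_lanes<mode> when both values are adjacent in memory.  */
}

int
main (void)
{
  const long vals[N_ELTS] = { 1, 2, 3, 4 };
  long v[N_ELTS];
  expand_vector_init (v, vals);
  printf ("{ %ld, %ld, %ld, %ld }\n", v[0], v[1], v[2], v[3]);
  return 0;
}
```

The point of the maxv == 1 branch is that when no element repeats, a splat does no useful work: the first element is placed directly and every remaining lane is overwritten anyway, which is what exposes the vec_merge/paradoxical-subreg shape that the new simplify-rtx.c rule then folds into a vec_concat.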