Message ID | 87y3n5z15k.fsf@linaro.org
---|---
State | New |
Series | SLP reductions with variable-length vectors
Richard Sandiford <richard.sandiford@linaro.org> writes:
> Two things stopped us using SLP reductions with variable-length vectors:
>
> (1) We didn't have a way of constructing the initial vector.
>     This patch does it by creating a vector full of the neutral
>     identity value and then using a shift-and-insert function
>     to insert any non-identity inputs into the low-numbered elements.
>     (The non-identity values are needed for double reductions.)
>     Alternatively, for unchained MIN/MAX reductions that have no neutral
>     value, we instead use the same duplicate-and-interleave approach as
>     for SLP constant and external definitions (added by a previous
>     patch).
>
> (2) The epilogue for constant-length vectors would extract the vector
>     elements associated with each SLP statement and do scalar arithmetic
>     on these individual elements.  For variable-length vectors, the patch
>     instead creates a reduction vector for each SLP statement, replacing
>     the elements for other SLP statements with the identity value.
>     It then uses a hardware reduction instruction on each vector.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.

Here's an updated version that applies on top of the recent
removal of REDUC_*_EXPR.  Tested as before.

Thanks,
Richard


2017-11-22  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_shl_insert_@var{m}): New optab.
	* internal-fn.def (VEC_SHL_INSERT): New internal function.
	* optabs.def (vec_shl_insert_optab): New optab.
	* tree-vectorizer.h (can_duplicate_and_interleave_p): Declare.
	(duplicate_and_interleave): Likewise.
	* tree-vect-loop.c: Include internal-fn.h.
	(neutral_op_for_slp_reduction): New function, split out from
	get_initial_defs_for_reduction.
	(get_initial_def_for_reduction): Handle option 2 for variable-length
	vectors by loading the neutral value into a vector and then shifting
	the initial value into element 0.
	(get_initial_defs_for_reduction): Replace the code argument with
	the neutral value calculated by neutral_op_for_slp_reduction.
	Use gimple_build_vector for constant-length vectors.
	Use IFN_VEC_SHL_INSERT for variable-length vectors if all
	but the first group_size elements have a neutral value.
	Use duplicate_and_interleave otherwise.
	(vect_create_epilog_for_reduction): Take a neutral_op parameter.
	Update call to get_initial_defs_for_reduction.  Handle SLP
	reductions for variable-length vectors by creating one vector
	result for each scalar result, with the elements associated
	with other scalar results stubbed out with the neutral value.
	(vectorizable_reduction): Call neutral_op_for_slp_reduction.
	Require IFN_VEC_SHL_INSERT for double reductions on
	variable-length vectors, or SLP reductions that have
	a neutral value.  Require can_duplicate_and_interleave_p
	support for variable-length unchained SLP reductions if there
	is no neutral value, such as for MIN/MAX reductions.  Also require
	the number of vector elements to be a multiple of the number of
	SLP statements when doing variable-length unchained SLP reductions.
	Update call to vect_create_epilog_for_reduction.
	* tree-vect-slp.c (can_duplicate_and_interleave_p): Make public
	and remove initial values.
	(duplicate_and_interleave): Use IFN_VEC_SHL_INSERT for
	variable-length vectors if all but the first group_size elements
	have a neutral value.
	* config/aarch64/aarch64.md (UNSPEC_INSR): New unspec.
	* config/aarch64/aarch64-sve.md (vec_shl_insert_<mode>): New insn.
gcc/testsuite/
	* gcc.dg/vect/pr37027.c: Remove XFAIL for variable-length vectors.
	* gcc.dg/vect/pr67790.c: Likewise.
	* gcc.dg/vect/slp-reduc-1.c: Likewise.
	* gcc.dg/vect/slp-reduc-2.c: Likewise.
	* gcc.dg/vect/slp-reduc-3.c: Likewise.
	* gcc.dg/vect/slp-reduc-5.c: Likewise.
	* gcc.target/aarch64/sve_slp_5.c: New test.
	* gcc.target/aarch64/sve_slp_5_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_6.c: Likewise.
	* gcc.target/aarch64/sve_slp_6_run.c: Likewise.
	* gcc.target/aarch64/sve_slp_7.c: Likewise.
	* gcc.target/aarch64/sve_slp_7_run.c: Likewise.

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-22 18:05:03.927025542 +0000
+++ gcc/doc/md.texi	2017-11-22 18:05:51.545487816 +0000
@@ -5273,6 +5273,14 @@ operand 1.  Add operand 1 to operand 2 an
 operand 0.  (This is used express accumulation of elements into an
 accumulator of a wider mode.)
 
+@cindex @code{vec_shl_insert_@var{m}} instruction pattern
+@item @samp{vec_shl_insert_@var{m}}
+Shift the elements in vector input operand 1 left one element (i.e.
+away from element 0) and fill the vacated element 0 with the scalar
+in operand 2.  Store the result in vector output operand 0.  Operands
+0 and 1 have mode @var{m} and operand 2 has the mode appropriate for
+one element of @var{m}.
+
 @cindex @code{vec_shr_@var{m}} instruction pattern
 @item @samp{vec_shr_@var{m}}
 Whole vector right shift in bits, i.e. towards element 0.
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-22 18:05:03.927025542 +0000
+++ gcc/internal-fn.def	2017-11-22 18:05:51.545487816 +0000
@@ -126,6 +126,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_ODD,
 		       vec_extract_odd, binary)
 DEF_INTERNAL_OPTAB_FN (VEC_REVERSE, ECF_CONST | ECF_NOTHROW,
 		       vec_reverse, unary)
+DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW,
+		       vec_shl_insert, binary)
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-22 18:05:03.927977844 +0000
+++ gcc/optabs.def	2017-11-22 18:05:51.545487816 +0000
@@ -374,3 +374,4 @@ OPTAB_D (set_thread_pointer_optab, "set_
 
 OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE)
 OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
+OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-22 18:05:03.934643959 +0000
+++ gcc/tree-vectorizer.h	2017-11-22 18:05:51.548344107 +0000
@@ -1348,6 +1348,11 @@ extern void vect_get_slp_defs (vec<tree>
 extern bool vect_slp_bb (basic_block);
 extern gimple *vect_find_last_scalar_stmt_in_slp (slp_tree);
 extern bool is_simple_and_all_uses_invariant (gimple *, loop_vec_info);
+extern bool can_duplicate_and_interleave_p (unsigned int, machine_mode,
+					    unsigned int * = NULL,
+					    tree * = NULL);
+extern void duplicate_and_interleave (gimple_seq *, tree, vec<tree>,
+				      unsigned int, vec<tree> &);
 
 /* In tree-vect-patterns.c.  */
 /* Pattern recognition functions.
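To make the new operation concrete, here is a scalar C model of what
vec_shl_insert_m computes and of how the patch uses repeated
shift-and-inserts to build the initial reduction vector.  The fixed
NELTS, the int element type and both helper names are illustrative
stand-ins only: real SVE vectors have a runtime element count, and the
vectorizer emits gimple calls to IFN_VEC_SHL_INSERT rather than C code.

#include <stddef.h>

#define NELTS 8			/* stand-in for the runtime number of lanes */

typedef struct { int elt[NELTS]; } ivec;

/* Model of vec_shl_insert_m: move every element one lane away from
   element 0 and put SCALAR into the vacated element 0.  */
static ivec
vec_shl_insert (ivec v, int scalar)
{
  ivec r;
  for (size_t i = NELTS - 1; i > 0; --i)
    r.elt[i] = v.elt[i - 1];
  r.elt[0] = scalar;
  return r;
}

/* Model of the initial-vector construction: splat the neutral value,
   then shift-and-insert the per-statement start values, last one first,
   so that INITS[0] ends up in element 0.  For a PLUS reduction over two
   SLP statements this produces { inits[0], inits[1], 0, 0, ... }.  */
static ivec
build_initial_vector (int neutral, const int *inits, size_t group_size)
{
  ivec v;
  for (size_t i = 0; i < NELTS; ++i)
    v.elt[i] = neutral;
  for (size_t k = group_size; k-- > 0;)
    v = vec_shl_insert (v, inits[k]);
  return v;
}

Inserting the values in reverse order is what lets a single
shift-and-insert primitive place an arbitrary group of start values in
the low-numbered lanes while the rest of the vector stays neutral.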
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-22 18:04:55.887690745 +0000
+++ gcc/tree-vect-loop.c	2017-11-22 18:05:51.547392010 +0000
@@ -2449,6 +2449,54 @@ reduction_fn_for_scalar_code (enum tree_
     }
 }
 
+/* If there is a neutral value X such that SLP reduction NODE would not
+   be affected by the introduction of additional X elements, return that X,
+   otherwise return null.  CODE is the code of the reduction.  REDUC_CHAIN
+   is true if the SLP statements perform a single reduction, false if each
+   statement performs an independent reduction.  */
+
+static tree
+neutral_op_for_slp_reduction (slp_tree slp_node, tree_code code,
+			      bool reduc_chain)
+{
+  vec<gimple *> stmts = SLP_TREE_SCALAR_STMTS (slp_node);
+  gimple *stmt = stmts[0];
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
+  tree vector_type = STMT_VINFO_VECTYPE (stmt_vinfo);
+  tree scalar_type = TREE_TYPE (vector_type);
+  struct loop *loop = gimple_bb (stmt)->loop_father;
+  gcc_assert (loop);
+
+  switch (code)
+    {
+    case WIDEN_SUM_EXPR:
+    case DOT_PROD_EXPR:
+    case SAD_EXPR:
+    case PLUS_EXPR:
+    case MINUS_EXPR:
+    case BIT_IOR_EXPR:
+    case BIT_XOR_EXPR:
+      return build_zero_cst (scalar_type);
+
+    case MULT_EXPR:
+      return build_one_cst (scalar_type);
+
+    case BIT_AND_EXPR:
+      return build_all_ones_cst (scalar_type);
+
+    case MAX_EXPR:
+    case MIN_EXPR:
+      /* For MIN/MAX the initial values are neutral.  A reduction chain
+	 has only a single initial value, so that value is neutral for
+	 all statements.  */
+      if (reduc_chain)
+	return PHI_ARG_DEF_FROM_EDGE (stmt, loop_preheader_edge (loop));
+      return NULL_TREE;
+
+    default:
+      return NULL_TREE;
+    }
+}
 
 /* Error reporting helper for vect_is_simple_reduction below.  GIMPLE
    statement STMT is printed with a message MSG. */
@@ -4014,6 +4062,7 @@ get_initial_def_for_reduction (gimple *s
   int int_init_val = 0;
   gimple *def_stmt = NULL;
   gimple_seq stmts = NULL;
+  unsigned HOST_WIDE_INT count;
 
   gcc_assert (vectype);
   nunits = TYPE_VECTOR_SUBPARTS (vectype);
@@ -4086,13 +4135,19 @@ get_initial_def_for_reduction (gimple *s
 	/* Option1: the first element is '0' or '1' as well.  */
 	init_def = gimple_build_vector_from_val (&stmts, vectype,
 						 def_for_init);
+      else if (!nunits.is_constant (&count))
+	{
+	  /* Option2 (variable length): the first element is INIT_VAL.  */
+	  init_def = build_vector_from_val (vectype, def_for_init);
+	  gcall *call = gimple_build_call_internal (IFN_VEC_SHL_INSERT,
+						    2, init_def, init_val);
+	  init_def = make_ssa_name (vectype);
+	  gimple_call_set_lhs (call, init_def);
+	  gimple_seq_add_stmt (&stmts, call);
+	}
       else
 	{
-	  /* Option2: the first element is INIT_VAL.  */
-
-	  /* Enforced by vectorizable_reduction (which disallows double
-	     reductions with variable-length vectors).  */
-	  unsigned int count = nunits.to_constant ();
+	  /* Option2 (constant length): the first element is INIT_VAL.  */
 	  auto_vec<tree, 32> elts (count);
 	  elts.quick_push (init_val);
 	  for (unsigned int i = 1; i < count; ++i)
@@ -4130,34 +4185,32 @@ get_initial_def_for_reduction (gimple *s
 }
 
 /* Get at the initial defs for the reduction PHIs in SLP_NODE.
-   NUMBER_OF_VECTORS is the number of vector defs to create.  */
+   NUMBER_OF_VECTORS is the number of vector defs to create.
+   If NEUTRAL_OP is nonnull, introducing extra elements of that
+   value will not change the result.  */
 
 static void
 get_initial_defs_for_reduction (slp_tree slp_node,
				vec<tree> *vec_oprnds,
				unsigned int number_of_vectors,
-				enum tree_code code, bool reduc_chain)
+				bool reduc_chain, tree neutral_op)
 {
   vec<gimple *> stmts = SLP_TREE_SCALAR_STMTS (slp_node);
   gimple *stmt = stmts[0];
   stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
-  unsigned nunits;
+  unsigned HOST_WIDE_INT nunits;
   unsigned j, number_of_places_left_in_vector;
-  tree vector_type, scalar_type;
+  tree vector_type;
   tree vop;
   int group_size = stmts.length ();
   unsigned int vec_num, i;
   unsigned number_of_copies = 1;
   vec<tree> voprnds;
   voprnds.create (number_of_vectors);
-  tree neutral_op = NULL;
   struct loop *loop;
+  auto_vec<tree, 16> permute_results;
 
   vector_type = STMT_VINFO_VECTYPE (stmt_vinfo);
-  scalar_type = TREE_TYPE (vector_type);
-  /* vectorizable_reduction has already rejected SLP reductions on
-     variable-length vectors.  */
-  nunits = TYPE_VECTOR_SUBPARTS (vector_type).to_constant ();
 
   gcc_assert (STMT_VINFO_DEF_TYPE (stmt_vinfo) == vect_reduction_def);
 
@@ -4165,45 +4218,7 @@ get_initial_defs_for_reduction (slp_tree
   gcc_assert (loop);
   edge pe = loop_preheader_edge (loop);
 
-  /* op is the reduction operand of the first stmt already.  */
-  /* For additional copies (see the explanation of NUMBER_OF_COPIES below)
-     we need either neutral operands or the original operands.  See
-     get_initial_def_for_reduction() for details.  */
-  switch (code)
-    {
-    case WIDEN_SUM_EXPR:
-    case DOT_PROD_EXPR:
-    case SAD_EXPR:
-    case PLUS_EXPR:
-    case MINUS_EXPR:
-    case BIT_IOR_EXPR:
-    case BIT_XOR_EXPR:
-      neutral_op = build_zero_cst (scalar_type);
-      break;
-
-    case MULT_EXPR:
-      neutral_op = build_one_cst (scalar_type);
-      break;
-
-    case BIT_AND_EXPR:
-      neutral_op = build_all_ones_cst (scalar_type);
-      break;
-
-    /* For MIN/MAX we don't have an easy neutral operand but
-       the initial values can be used fine here.  Only for
-       a reduction chain we have to force a neutral element.  */
-    case MAX_EXPR:
-    case MIN_EXPR:
-      if (! reduc_chain)
-	neutral_op = NULL;
-      else
-	neutral_op = PHI_ARG_DEF_FROM_EDGE (stmt, pe);
-      break;
-
-    default:
-      gcc_assert (! reduc_chain);
-      neutral_op = NULL;
-    }
+  gcc_assert (!reduc_chain || neutral_op);
 
   /* NUMBER_OF_COPIES is the number of times we need to use the same values in
      created vectors.  It is greater than 1 if unrolling is performed.
@@ -4221,6 +4236,9 @@ get_initial_defs_for_reduction (slp_tree
      (s1, s2, ..., s8).  We will create two vectors {s1, s2, s3, s4} and
     {s5, s6, s7, s8}.  */
 
+  if (!TYPE_VECTOR_SUBPARTS (vector_type).is_constant (&nunits))
+    nunits = group_size;
+
   number_of_copies = nunits * number_of_vectors / group_size;
 
   number_of_places_left_in_vector = nunits;
@@ -4247,7 +4265,40 @@ get_initial_defs_for_reduction (slp_tree
       if (number_of_places_left_in_vector == 0)
         {
	  gimple_seq ctor_seq = NULL;
-	  tree init = gimple_build_vector (&ctor_seq, vector_type, elts);
+	  tree init;
+	  if (must_eq (TYPE_VECTOR_SUBPARTS (vector_type), nunits))
+	    /* Build the vector directly from ELTS.  */
+	    init = gimple_build_vector (&ctor_seq, vector_type, elts);
+	  else if (neutral_op)
+	    {
+	      /* Build a vector of the neutral value and shift the
+		 other elements into place.  */
+	      init = gimple_build_vector_from_val (&ctor_seq, vector_type,
+						   neutral_op);
+	      int k = nunits;
+	      while (k > 0 && elts[k - 1] == neutral_op)
+		k -= 1;
+	      while (k > 0)
+		{
+		  k -= 1;
+		  gcall *call = gimple_build_call_internal
+		    (IFN_VEC_SHL_INSERT, 2, init, elts[k]);
+		  init = make_ssa_name (vector_type);
+		  gimple_call_set_lhs (call, init);
+		  gimple_seq_add_stmt (&ctor_seq, call);
+		}
+	    }
+	  else
+	    {
+	      /* First time round, duplicate ELTS to fill the
+		 required number of vectors, then cherry pick the
+		 appropriate result for each iteration.  */
+	      if (vec_oprnds->is_empty ())
+		duplicate_and_interleave (&ctor_seq, vector_type, elts,
+					  number_of_vectors,
+					  permute_results);
+	      init = permute_results[number_of_vectors - j - 1];
+	    }
	  if (ctor_seq != NULL)
	    gsi_insert_seq_on_edge_immediate (pe, ctor_seq);
	  voprnds.quick_push (init);
@@ -4317,6 +4368,8 @@ get_initial_defs_for_reduction (slp_tree
    DOUBLE_REDUC is TRUE if double reduction phi nodes should be handled.
    SLP_NODE is an SLP node containing a group of reduction statements.  The
      first one in this group is STMT.
+   NEUTRAL_OP is the value given by neutral_op_for_slp_reduction; it is
+     null if this is not an SLP reduction.
 
    This function:
    1. Creates the reduction def-use cycles: sets the arguments for
@@ -4364,7 +4417,8 @@ vect_create_epilog_for_reduction (vec<tr
				  vec<gimple *> reduction_phis,
                                  bool double_reduc,
				  slp_tree slp_node,
-				  slp_instance slp_node_instance)
+				  slp_instance slp_node_instance,
+				  tree neutral_op)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   stmt_vec_info prev_phi_info;
@@ -4400,6 +4454,7 @@ vect_create_epilog_for_reduction (vec<tr
   auto_vec<tree> vec_initial_defs;
   auto_vec<gimple *> phis;
   bool slp_reduc = false;
+  bool direct_slp_reduc;
   tree new_phi_result;
   gimple *inner_phi = NULL;
   tree induction_index = NULL_TREE;
@@ -4443,8 +4498,9 @@ vect_create_epilog_for_reduction (vec<tr
       unsigned vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       vec_initial_defs.reserve (vec_num);
       get_initial_defs_for_reduction (slp_node_instance->reduc_phis,
-				      &vec_initial_defs, vec_num, code,
-				      GROUP_FIRST_ELEMENT (stmt_info));
+				      &vec_initial_defs, vec_num,
+				      GROUP_FIRST_ELEMENT (stmt_info),
+				      neutral_op);
     }
   else
     {
@@ -4738,6 +4794,12 @@ vect_create_epilog_for_reduction (vec<tr
          b2 = operation (b1)  */
   slp_reduc = (slp_node && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)));
 
+  /* True if we should implement SLP_REDUC using native reduction operations
+     instead of scalar operations.  */
+  direct_slp_reduc = (reduc_fn != IFN_LAST
+		      && slp_reduc
+		      && !TYPE_VECTOR_SUBPARTS (vectype).is_constant ());
+
   /* In case of reduction chain, e.g.,
      # a1 = phi <a3, a0>
      a2 = operation (a1)
@@ -4745,7 +4807,7 @@ vect_create_epilog_for_reduction (vec<tr
      we may end up with more than one vector result.  Here we reduce them to
      one vector.  */
-  if (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+  if (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) || direct_slp_reduc)
     {
       tree first_vect = PHI_RESULT (new_phis[0]);
       gassign *new_vec_stmt = NULL;
@@ -5034,6 +5096,83 @@ vect_create_epilog_for_reduction (vec<tr
 
       scalar_results.safe_push (new_temp);
     }
+  else if (direct_slp_reduc)
+    {
+      /* Here we create one vector for each of the GROUP_SIZE results,
+	 with the elements for other SLP statements replaced with the
+	 neutral value.  We can then do a normal reduction on each vector.  */
+
+      /* Enforced by vectorizable_reduction.  */
+      gcc_assert (new_phis.length () == 1);
+      gcc_assert (pow2p_hwi (group_size));
+
+      slp_tree orig_phis_slp_node = slp_node_instance->reduc_phis;
+      vec<gimple *> orig_phis = SLP_TREE_SCALAR_STMTS (orig_phis_slp_node);
+      gimple_seq seq = NULL;
+
+      /* Build a vector {0, 1, 2, ...}, with the same number of elements
+	 and the same element size as VECTYPE.  */
+      tree index = build_index_vector (vectype, 0, 1);
+      tree index_type = TREE_TYPE (index);
+      tree index_elt_type = TREE_TYPE (index_type);
+      tree mask_type = build_same_sized_truth_vector_type (index_type);
+
+      /* Create a vector that, for each element, identifies which of
+	 the GROUP_SIZE results should use it.  */
+      tree index_mask = build_int_cst (index_elt_type, group_size - 1);
+      index = gimple_build (&seq, BIT_AND_EXPR, index_type, index,
+			    build_vector_from_val (index_type, index_mask));
+
+      /* Get a neutral vector value.  This is simply a splat of the neutral
+	 scalar value if we have one, otherwise the initial scalar value
+	 is itself a neutral value.  */
+      tree vector_identity = NULL_TREE;
+      if (neutral_op)
+	vector_identity = gimple_build_vector_from_val (&seq, vectype,
+							neutral_op);
+      for (unsigned int i = 0; i < group_size; ++i)
+	{
+	  /* If there's no universal neutral value, we can use the
+	     initial scalar value from the original PHI.  This is used
+	     for MIN and MAX reduction, for example.  */
+	  if (!neutral_op)
+	    {
+	      tree scalar_value
+		= PHI_ARG_DEF_FROM_EDGE (orig_phis[i],
+					 loop_preheader_edge (loop));
+	      vector_identity = gimple_build_vector_from_val (&seq, vectype,
+							      scalar_value);
+	    }
+
+	  /* Calculate the equivalent of:
+
+	     sel[j] = (index[j] == i);
+
+	     which selects the elements of NEW_PHI_RESULT that should
+	     be included in the result.  */
+	  tree compare_val = build_int_cst (index_elt_type, i);
+	  compare_val = build_vector_from_val (index_type, compare_val);
+	  tree sel = gimple_build (&seq, EQ_EXPR, mask_type,
+				   index, compare_val);
+
+	  /* Calculate the equivalent of:
+
+	     vec = seq ? new_phi_result : vector_identity;
+
+	     VEC is now suitable for a full vector reduction.  */
+	  tree vec = gimple_build (&seq, VEC_COND_EXPR, vectype,
+				   sel, new_phi_result, vector_identity);
+
+	  /* Do the reduction and convert it to the appropriate type.  */
+	  gcall *call = gimple_build_call_internal (reduc_fn, 1, vec);
+	  tree scalar = make_ssa_name (TREE_TYPE (vectype));
+	  gimple_call_set_lhs (call, scalar);
+	  gimple_seq_add_stmt (&seq, call);
+	  scalar = gimple_convert (&seq, scalar_type, scalar);
+	  scalar_results.safe_push (scalar);
+	}
+      gsi_insert_seq_before (&exit_gsi, seq, GSI_SAME_STMT);
+    }
   else
     {
       bool reduce_with_shift = have_whole_vector_shift (mode);
@@ -6248,25 +6387,64 @@ vectorizable_reduction (gimple *stmt, gi
       return false;
     }
 
-  if (double_reduc && !nunits_out.is_constant ())
+  /* For SLP reductions, see if there is a neutral value we can use.  */
+  tree neutral_op = NULL_TREE;
+  if (slp_node)
+    neutral_op
+      = neutral_op_for_slp_reduction (slp_node_instance->reduc_phis, code,
+				      GROUP_FIRST_ELEMENT (stmt_info) != NULL);
+
+  /* For double reductions, and for SLP reductions with a neutral value,
+     we construct a variable-length initial vector by loading a vector
+     full of the neutral value and then shift-and-inserting the start
+     values into the low-numbered elements.  */
+  if ((double_reduc || neutral_op)
+      && !nunits_out.is_constant ()
+      && !direct_internal_fn_supported_p (IFN_VEC_SHL_INSERT,
+					  vectype_out, OPTIMIZE_FOR_SPEED))
     {
-      /* The current double-reduction code creates the initial value
-	 element-by-element.  */
       if (dump_enabled_p ())
	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "double reduction not supported for variable-length"
-			 " vectors.\n");
+			 "reduction on variable-length vectors requires"
+			 " target support for a vector-shift-and-insert"
+			 " operation.\n");
       return false;
     }
 
-  if (slp_node && !nunits_out.is_constant ())
-    {
-      /* The current SLP code creates the initial value element-by-element.  */
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "SLP reduction not supported for variable-length"
-			 " vectors.\n");
-      return false;
+  /* Check extra constraints for variable-length unchained SLP reductions.  */
+  if (STMT_SLP_TYPE (stmt_info)
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && !nunits_out.is_constant ())
+    {
+      /* We checked above that we could build the initial vector when
+	 there's a neutral element value.  Check here for the case in
+	 which each SLP statement has its own initial value and in which
+	 that value needs to be repeated for every instance of the
+	 statement within the initial vector.  */
+      unsigned int group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_mode elt_mode = SCALAR_TYPE_MODE (TREE_TYPE (vectype_out));
+      if (!neutral_op
+	  && !can_duplicate_and_interleave_p (group_size, elt_mode))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "unsupported form of SLP reduction for"
+			     " variable-length vectors: cannot build"
+			     " initial vector.\n");
+	  return false;
+	}
+      /* The epilogue code relies on the number of elements being a multiple
+	 of the group size.  The duplicate-and-interleave approach to setting
+	 up the initial vector does too.  */
+      if (!multiple_p (nunits_out, group_size))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "unsupported form of SLP reduction for"
+			     " variable-length vectors: the vector size"
+			     " is not a multiple of the number of results.\n");
+	  return false;
+	}
     }
 
   /* In case of widenning multiplication by a constant, we update the type
@@ -6533,7 +6711,8 @@ vectorizable_reduction (gimple *stmt, gi
 
   vect_create_epilog_for_reduction (vect_defs, stmt, reduc_def_stmt,
				    epilog_copies, reduc_fn, phis,
-				    double_reduc, slp_node, slp_node_instance);
+				    double_reduc, slp_node, slp_node_instance,
+				    neutral_op);
 
   return true;
 }
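As a rough scalar model of the direct_slp_reduc epilogue added above:
the fixed lane count, the function name and the choice of a PLUS
reduction are assumptions for illustration only; the real code builds
the selection mask with BIT_AND_EXPR and EQ_EXPR, applies it with a
VEC_COND_EXPR and then calls the target's reduction function.

#define NELTS 8			/* stand-in for the runtime number of lanes */
#define GROUP_SIZE 2		/* must be a power of 2 dividing NELTS */

/* For each of the GROUP_SIZE scalar results, keep the lanes whose index
   modulo GROUP_SIZE belongs to that result, replace all other lanes with
   the identity value, and then reduce the whole vector.  Modeled for a
   PLUS reduction, whose identity is 0, so the stubbed-out lanes do not
   affect the sum.  */
static void
slp_reduc_epilogue (const int phi_result[NELTS], int identity,
		    int results[GROUP_SIZE])
{
  for (int i = 0; i < GROUP_SIZE; ++i)
    {
      int sum = identity;
      for (int j = 0; j < NELTS; ++j)
	{
	  /* sel[j] = ((index[j] & (GROUP_SIZE - 1)) == i) */
	  int lane = ((j & (GROUP_SIZE - 1)) == i) ? phi_result[j] : identity;
	  sum += lane;	/* the full-vector reduction, one lane at a time */
	}
      results[i] = sum;
    }
}

This is why the patch requires the group size to be a power of two and
the element count to be a multiple of it: the lane-to-result mapping is
then a simple bitwise AND of the index vector.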
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2017-11-22 18:05:03.933691657 +0000
+++ gcc/tree-vect-slp.c	2017-11-22 18:05:51.548344107 +0000
@@ -214,10 +214,10 @@ vect_get_place_in_interleaving_chain (gi
    (if nonnull) and the type of each intermediate vector in *VECTOR_TYPE_OUT
    (if nonnull).  */
 
-static bool
+bool
 can_duplicate_and_interleave_p (unsigned int count, machine_mode elt_mode,
-				unsigned int *nvectors_out = NULL,
-				tree *vector_type_out = NULL)
+				unsigned int *nvectors_out,
+				tree *vector_type_out)
 {
   poly_int64 elt_bytes = count * GET_MODE_SIZE (elt_mode);
   poly_int64 nelts;
@@ -3277,7 +3277,7 @@ vect_mask_constant_operand_p (gimple *st
    We try to find the largest IM for which this sequence works, in order
    to cut down on the number of interleaves.  */
 
-static void
+void
 duplicate_and_interleave (gimple_seq *seq, tree vector_type, vec<tree> elts,
			  unsigned int nresults, vec<tree> &results)
 {
@@ -3559,6 +3559,26 @@ vect_get_constant_vectors (tree op, slp_
	      if (must_eq (TYPE_VECTOR_SUBPARTS (vector_type), nunits))
		/* Build the vector directly from ELTS.  */
		vec_cst = gimple_build_vector (&ctor_seq, vector_type, elts);
+	      else if (neutral_op)
+		{
+		  /* Build a vector of the neutral value and shift the
+		     other elements into place.  */
+		  vec_cst = gimple_build_vector_from_val (&ctor_seq,
+							  vector_type,
+							  neutral_op);
+		  int k = nunits;
+		  while (k > 0 && elts[k - 1] == neutral_op)
+		    k -= 1;
+		  while (k > 0)
+		    {
+		      k -= 1;
+		      gcall *call = gimple_build_call_internal
+			(IFN_VEC_SHL_INSERT, 2, vec_cst, elts[k]);
+		      vec_cst = make_ssa_name (vector_type);
+		      gimple_call_set_lhs (call, vec_cst);
+		      gimple_seq_add_stmt (&ctor_seq, call);
+		    }
+		}
	      else
		{
		  if (vec_oprnds->is_empty ())
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-22 18:05:03.925120938 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-22 18:05:51.544535719 +0000
@@ -162,6 +162,7 @@ (define_c_enum "unspec" [
     UNSPEC_WHILE_LO
     UNSPEC_LDN
     UNSPEC_STN
+    UNSPEC_INSR
 ])
 
 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-22 18:05:03.924168636 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-22 18:05:51.543583622 +0000
@@ -2081,3 +2081,16 @@ (define_expand "vec_pack_<su>fix_trunc_v
     operands[5] = gen_reg_rtx (VNx4SImode);
   }
 )
+
+;; Shift an SVE vector left and insert a scalar into element 0.
+(define_insn "vec_shl_insert_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w")
+	(unspec:SVE_ALL
+	  [(match_operand:SVE_ALL 1 "register_operand" "0, 0")
+	   (match_operand:<VEL> 2 "register_operand" "rZ, w")]
+	  UNSPEC_INSR))]
+  "TARGET_SVE"
+  "@
+   insr\t%0.<Vetype>, %<vwcore>2
+   insr\t%0.<Vetype>, %<Vetype>2"
+)
Index: gcc/testsuite/gcc.dg/vect/pr37027.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr37027.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/pr37027.c	2017-11-22 18:05:51.545487816 +0000
@@ -32,5 +32,5 @@ foo (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/pr67790.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr67790.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/pr67790.c	2017-11-22 18:05:51.545487816 +0000
@@ -37,4 +37,4 @@ int main()
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail { vect_variable_length && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-1.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-1.c	2017-11-22 18:05:51.546439913 +0000
@@ -43,5 +43,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-2.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-2.c	2017-11-22 18:05:51.546439913 +0000
@@ -38,5 +38,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-3.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-3.c	2017-11-22 18:05:51.546439913 +0000
@@ -58,7 +58,4 @@ int main (void)
 /* The initialization loop in main also gets vectorized.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 "vect" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target { vect_short_mult && { vect_widen_sum_hi_to_si && vect_unpack } } } } } */
-/* We can't yet create the necessary SLP constant vector for variable-length
-   SVE and so fall back to Advanced SIMD.  This means that we repeat each
-   analysis note.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { { ! vect_unpack } || { aarch64_sve && vect_variable_length } } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { ! vect_unpack } } } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-5.c	2017-11-22 18:05:03.927977844 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-5.c	2017-11-22 18:05:51.546439913 +0000
@@ -43,5 +43,5 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_5.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_5.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 2];						\
+      x1 += a[i * 2 + 1];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* ??? We don't think it's worth using SLP for the 64-bit loops and fall
+   back to the less efficient non-SLP implementation instead.  */
+/* ??? At present we don't treat the int8_t and int16_t loops as
+   reductions.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_5_run.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_5_run.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,35 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_5.c"
+
+#define N (141 * 2)
+
+#define HARNESS(TYPE)					\
+  {							\
+    TYPE a[N], b[2] = { 40, 22 };			\
+    for (unsigned int i = 0; i < N; ++i)		\
+      {							\
+	a[i] = i * 2 + i % 5;				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    vec_slp_##TYPE (a, b, N / 2);			\
+    TYPE x0 = 40;					\
+    TYPE x1 = 22;					\
+    for (unsigned int i = 0; i < N; i += 2)		\
+      {							\
+	x0 += a[i];					\
+	x1 += a[i + 1];					\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    /* _Float16 isn't precise enough for this.  */	\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
+	&& (x0 != b[0] || x1 != b[1]))			\
+      __builtin_abort ();				\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_6.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_6.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,47 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  TYPE x2 = b[2];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 3];						\
+      x1 += a[i * 3 + 1];					\
+      x2 += a[i * 3 + 2];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+  b[2] = x2;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* These loops can't use SLP.  */
+/* { dg-final { scan-assembler-not {\tld1b\t} } } */
+/* { dg-final { scan-assembler-not {\tld1h\t} } } */
+/* { dg-final { scan-assembler-not {\tld1w\t} } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler {\tld3b\t} } } */
+/* { dg-final { scan-assembler {\tld3h\t} } } */
+/* { dg-final { scan-assembler {\tld3w\t} } } */
+/* { dg-final { scan-assembler {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_6_run.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_6_run.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,37 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_6.c"
+
+#define N (77 * 3)
+
+#define HARNESS(TYPE)					\
+  {							\
+    TYPE a[N], b[3] = { 40, 22, 75 };			\
+    for (unsigned int i = 0; i < N; ++i)		\
+      {							\
+	a[i] = i * 2 + i % 5;				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    vec_slp_##TYPE (a, b, N / 3);			\
+    TYPE x0 = 40;					\
+    TYPE x1 = 22;					\
+    TYPE x2 = 75;					\
+    for (unsigned int i = 0; i < N; i += 3)		\
+      {							\
+	x0 += a[i];					\
+	x1 += a[i + 1];					\
+	x2 += a[i + 2];					\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    /* _Float16 isn't precise enough for this.  */	\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000		\
+	&& (x0 != b[0] || x1 != b[1] || x2 != b[2]))	\
+      __builtin_abort ();				\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_7.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_7.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,66 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  TYPE x2 = b[2];						\
+  TYPE x3 = b[3];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 4];						\
+      x1 += a[i * 4 + 1];					\
+      x2 += a[i * 4 + 2];					\
+      x3 += a[i * 4 + 3];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+  b[2] = x2;							\
+  b[3] = x3;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* We can't use SLP for the 64-bit loops, since the number of reduction
+   results might be greater than the number of elements in the vector.
+   Otherwise we have two loads per loop, one for the initial vector
+   and one for the loop body.  */
+/* ??? At present we don't treat the int8_t and int16_t loops as
+   reductions.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_7_run.c
===================================================================
--- /dev/null	2017-11-22 12:44:16.206041922 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_7_run.c	2017-11-22 18:05:51.546439913 +0000
@@ -0,0 +1,39 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_7.c"
+
+#define N (54 * 4)
+
+#define HARNESS(TYPE)							\
+  {									\
+    TYPE a[N], b[4] = { 40, 22, 75, 19 };				\
+    for (unsigned int i = 0; i < N; ++i)				\
+      {									\
+	a[i] = i * 2 + i % 5;						\
+	asm volatile ("" ::: "memory");					\
+      }									\
+    vec_slp_##TYPE (a, b, N / 4);					\
+    TYPE x0 = 40;							\
+    TYPE x1 = 22;							\
+    TYPE x2 = 75;							\
+    TYPE x3 = 19;							\
+    for (unsigned int i = 0; i < N; i += 4)				\
+      {									\
+	x0 += a[i];							\
+	x1 += a[i + 1];							\
+	x2 += a[i + 2];							\
+	x3 += a[i + 3];							\
+	asm volatile ("" ::: "memory");					\
+      }									\
+    /* _Float16 isn't precise enough for this.  */			\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000				\
+	&& (x0 != b[0] || x1 != b[1] || x2 != b[2] || x3 != b[3]))	\
+      __builtin_abort ();						\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
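Tying this to the two-statement vec_slp_##TYPE loops in sve_slp_5.c
above: PLUS has neutral value 0, so for, say, an 8-lane int32_t vector
(the lane count is only illustrative, since SVE leaves it variable) the
generated code behaves roughly as follows, with one UADDV per scalar
result:

/* acc = { b[0], b[1], 0, 0, 0, 0, 0, 0 }    -- two VEC_SHL_INSERTs into a splat of 0
   loop body: acc += <next 8 elements of a>  -- even lanes accumulate x0, odd lanes x1
   b[0] = UADDV (lane & 1 == 0 ? acc : 0)    -- mask out the odd lanes, reduce
   b[1] = UADDV (lane & 1 == 1 ? acc : 0)    -- mask out the even lanes, reduce  */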
On 11/22/2017 11:10 AM, Richard Sandiford wrote:
> Richard Sandiford <richard.sandiford@linaro.org> writes:
>> Two things stopped us using SLP reductions with variable-length vectors:
>> [...]
> Here's an updated version that applies on top of the recent
> removal of REDUC_*_EXPR.  Tested as before.
> [...]

OK
jeff
On Thu, Dec 14, 2017 at 12:43:11AM +0000, Jeff Law wrote:
> On 11/22/2017 11:10 AM, Richard Sandiford wrote:
> > Here's an updated version that applies on top of the recent
> > removal of REDUC_*_EXPR.  Tested as before.
> > [...]
> OK
> jeff

As you explicitly asked on another thread, this is OK from an AArch64
maintainer too.

James
Index: gcc/doc/md.texi =================================================================== --- gcc/doc/md.texi 2017-11-17 09:44:46.094275532 +0000 +++ gcc/doc/md.texi 2017-11-17 09:44:46.386606597 +0000 @@ -5273,6 +5273,14 @@ operand 1. Add operand 1 to operand 2 an operand 0. (This is used express accumulation of elements into an accumulator of a wider mode.) +@cindex @code{vec_shl_insert_@var{m}} instruction pattern +@item @samp{vec_shl_insert_@var{m}} +Shift the elements in vector input operand 1 left one element (i.e. +away from element 0) and fill the vacated element 0 with the scalar +in operand 2. Store the result in vector output operand 0. Operands +0 and 1 have mode @var{m} and operand 2 has the mode appropriate for +one element of @var{m}. + @cindex @code{vec_shr_@var{m}} instruction pattern @item @samp{vec_shr_@var{m}} Whole vector right shift in bits, i.e. towards element 0. Index: gcc/internal-fn.def =================================================================== --- gcc/internal-fn.def 2017-11-17 09:44:46.094275532 +0000 +++ gcc/internal-fn.def 2017-11-17 09:44:46.386606597 +0000 @@ -112,6 +112,8 @@ DEF_INTERNAL_OPTAB_FN (VEC_EXTRACT_ODD, vec_extract_odd, binary) DEF_INTERNAL_OPTAB_FN (VEC_REVERSE, ECF_CONST | ECF_NOTHROW, vec_reverse, unary) +DEF_INTERNAL_OPTAB_FN (VEC_SHL_INSERT, ECF_CONST | ECF_NOTHROW, + vec_shl_insert, binary) DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary) Index: gcc/optabs.def =================================================================== --- gcc/optabs.def 2017-11-17 09:44:46.094275532 +0000 +++ gcc/optabs.def 2017-11-17 09:44:46.386606597 +0000 @@ -374,3 +374,4 @@ OPTAB_D (set_thread_pointer_optab, "set_ OPTAB_DC (vec_duplicate_optab, "vec_duplicate$a", VEC_DUPLICATE) OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES) +OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a") Index: gcc/tree-vectorizer.h =================================================================== --- gcc/tree-vectorizer.h 2017-11-17 09:44:46.094275532 +0000 +++ gcc/tree-vectorizer.h 2017-11-17 09:44:46.389306597 +0000 @@ -1348,6 +1348,11 @@ extern void vect_get_slp_defs (vec<tree> extern bool vect_slp_bb (basic_block); extern gimple *vect_find_last_scalar_stmt_in_slp (slp_tree); extern bool is_simple_and_all_uses_invariant (gimple *, loop_vec_info); +extern bool can_duplicate_and_interleave_p (unsigned int, machine_mode, + unsigned int * = NULL, + tree * = NULL); +extern void duplicate_and_interleave (gimple_seq *, tree, vec<tree>, + unsigned int, vec<tree> &); /* In tree-vect-patterns.c. */ /* Pattern recognition functions. Index: gcc/tree-vect-loop.c =================================================================== --- gcc/tree-vect-loop.c 2017-11-17 09:44:46.094275532 +0000 +++ gcc/tree-vect-loop.c 2017-11-17 09:44:46.389306597 +0000 @@ -50,6 +50,7 @@ Software Foundation; either version 3, o #include "cgraph.h" #include "tree-cfg.h" #include "tree-if-conv.h" +#include "internal-fn.h" /* Loop Vectorization Pass. @@ -2449,6 +2450,54 @@ reduction_code_for_scalar_code (enum tre } } +/* If there is a neutral value X such that SLP reduction NODE would not + be affected by the introduction of additional X elements, return that X, + otherwise return null. CODE is the code of the reduction. REDUC_CHAIN + is true if the SLP statements perform a single reduction, false if each + statement performs an independent reduction. 
*/ + +static tree +neutral_op_for_slp_reduction (slp_tree slp_node, tree_code code, + bool reduc_chain) +{ + vec<gimple *> stmts = SLP_TREE_SCALAR_STMTS (slp_node); + gimple *stmt = stmts[0]; + stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt); + tree vector_type = STMT_VINFO_VECTYPE (stmt_vinfo); + tree scalar_type = TREE_TYPE (vector_type); + struct loop *loop = gimple_bb (stmt)->loop_father; + gcc_assert (loop); + + switch (code) + { + case WIDEN_SUM_EXPR: + case DOT_PROD_EXPR: + case SAD_EXPR: + case PLUS_EXPR: + case MINUS_EXPR: + case BIT_IOR_EXPR: + case BIT_XOR_EXPR: + return build_zero_cst (scalar_type); + + case MULT_EXPR: + return build_one_cst (scalar_type); + + case BIT_AND_EXPR: + return build_all_ones_cst (scalar_type); + + case MAX_EXPR: + case MIN_EXPR: + /* For MIN/MAX the initial values are neutral. A reduction chain + has only a single initial value, so that value is neutral for + all statements. */ + if (reduc_chain) + return PHI_ARG_DEF_FROM_EDGE (stmt, loop_preheader_edge (loop)); + return NULL_TREE; + + default: + return NULL_TREE; + } +} /* Error reporting helper for vect_is_simple_reduction below. GIMPLE statement STMT is printed with a message MSG. */ @@ -4014,6 +4063,7 @@ get_initial_def_for_reduction (gimple *s int int_init_val = 0; gimple *def_stmt = NULL; gimple_seq stmts = NULL; + unsigned HOST_WIDE_INT count; gcc_assert (vectype); nunits = TYPE_VECTOR_SUBPARTS (vectype); @@ -4086,13 +4136,19 @@ get_initial_def_for_reduction (gimple *s /* Option1: the first element is '0' or '1' as well. */ init_def = gimple_build_vector_from_val (&stmts, vectype, def_for_init); + else if (!nunits.is_constant (&count)) + { + /* Option2 (variable length): the first element is INIT_VAL. */ + init_def = build_vector_from_val (vectype, def_for_init); + gcall *call = gimple_build_call_internal (IFN_VEC_SHL_INSERT, + 2, init_def, init_val); + init_def = make_ssa_name (vectype); + gimple_call_set_lhs (call, init_def); + gimple_seq_add_stmt (&stmts, call); + } else { - /* Option2: the first element is INIT_VAL. */ - - /* Enforced by vectorizable_reduction (which disallows double - reductions with variable-length vectors). */ - unsigned int count = nunits.to_constant (); + /* Option2 (constant length): the first element is INIT_VAL. */ auto_vec<tree, 32> elts (count); elts.quick_push (init_val); for (unsigned int i = 1; i < count; ++i) @@ -4130,34 +4186,32 @@ get_initial_def_for_reduction (gimple *s } /* Get at the initial defs for the reduction PHIs in SLP_NODE. - NUMBER_OF_VECTORS is the number of vector defs to create. */ + NUMBER_OF_VECTORS is the number of vector defs to create. + If NEUTRAL_OP is nonnull, introducing extra elements of that + value will not change the result. 
*/ static void get_initial_defs_for_reduction (slp_tree slp_node, vec<tree> *vec_oprnds, unsigned int number_of_vectors, - enum tree_code code, bool reduc_chain) + bool reduc_chain, tree neutral_op) { vec<gimple *> stmts = SLP_TREE_SCALAR_STMTS (slp_node); gimple *stmt = stmts[0]; stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt); - unsigned nunits; + unsigned HOST_WIDE_INT nunits; unsigned j, number_of_places_left_in_vector; - tree vector_type, scalar_type; + tree vector_type; tree vop; int group_size = stmts.length (); unsigned int vec_num, i; unsigned number_of_copies = 1; vec<tree> voprnds; voprnds.create (number_of_vectors); - tree neutral_op = NULL; struct loop *loop; + auto_vec<tree, 16> permute_results; vector_type = STMT_VINFO_VECTYPE (stmt_vinfo); - scalar_type = TREE_TYPE (vector_type); - /* vectorizable_reduction has already rejected SLP reductions on - variable-length vectors. */ - nunits = TYPE_VECTOR_SUBPARTS (vector_type).to_constant (); gcc_assert (STMT_VINFO_DEF_TYPE (stmt_vinfo) == vect_reduction_def); @@ -4165,45 +4219,7 @@ get_initial_defs_for_reduction (slp_tree gcc_assert (loop); edge pe = loop_preheader_edge (loop); - /* op is the reduction operand of the first stmt already. */ - /* For additional copies (see the explanation of NUMBER_OF_COPIES below) - we need either neutral operands or the original operands. See - get_initial_def_for_reduction() for details. */ - switch (code) - { - case WIDEN_SUM_EXPR: - case DOT_PROD_EXPR: - case SAD_EXPR: - case PLUS_EXPR: - case MINUS_EXPR: - case BIT_IOR_EXPR: - case BIT_XOR_EXPR: - neutral_op = build_zero_cst (scalar_type); - break; - - case MULT_EXPR: - neutral_op = build_one_cst (scalar_type); - break; - - case BIT_AND_EXPR: - neutral_op = build_all_ones_cst (scalar_type); - break; - - /* For MIN/MAX we don't have an easy neutral operand but - the initial values can be used fine here. Only for - a reduction chain we have to force a neutral element. */ - case MAX_EXPR: - case MIN_EXPR: - if (! reduc_chain) - neutral_op = NULL; - else - neutral_op = PHI_ARG_DEF_FROM_EDGE (stmt, pe); - break; - - default: - gcc_assert (! reduc_chain); - neutral_op = NULL; - } + gcc_assert (!reduc_chain || neutral_op); /* NUMBER_OF_COPIES is the number of times we need to use the same values in created vectors. It is greater than 1 if unrolling is performed. @@ -4221,6 +4237,9 @@ get_initial_defs_for_reduction (slp_tree (s1, s2, ..., s8). We will create two vectors {s1, s2, s3, s4} and {s5, s6, s7, s8}. */ + if (!TYPE_VECTOR_SUBPARTS (vector_type).is_constant (&nunits)) + nunits = group_size; + number_of_copies = nunits * number_of_vectors / group_size; number_of_places_left_in_vector = nunits; @@ -4247,7 +4266,40 @@ get_initial_defs_for_reduction (slp_tree if (number_of_places_left_in_vector == 0) { gimple_seq ctor_seq = NULL; - tree init = gimple_build_vector (&ctor_seq, vector_type, elts); + tree init; + if (must_eq (TYPE_VECTOR_SUBPARTS (vector_type), nunits)) + /* Build the vector directly from ELTS. */ + init = gimple_build_vector (&ctor_seq, vector_type, elts); + else if (neutral_op) + { + /* Build a vector of the neutral value and shift the + other elements into place. 
+		 */
+	      init = gimple_build_vector_from_val (&ctor_seq, vector_type,
+						   neutral_op);
+	      int k = nunits;
+	      while (k > 0 && elts[k - 1] == neutral_op)
+		k -= 1;
+	      while (k > 0)
+		{
+		  k -= 1;
+		  gcall *call = gimple_build_call_internal
+		    (IFN_VEC_SHL_INSERT, 2, init, elts[k]);
+		  init = make_ssa_name (vector_type);
+		  gimple_call_set_lhs (call, init);
+		  gimple_seq_add_stmt (&ctor_seq, call);
+		}
+	    }
+	  else
+	    {
+	      /* First time round, duplicate ELTS to fill the
+		 required number of vectors, then cherry pick the
+		 appropriate result for each iteration.  */
+	      if (vec_oprnds->is_empty ())
+		duplicate_and_interleave (&ctor_seq, vector_type, elts,
+					  number_of_vectors,
+					  permute_results);
+	      init = permute_results[number_of_vectors - j - 1];
+	    }
	   if (ctor_seq != NULL)
	     gsi_insert_seq_on_edge_immediate (pe, ctor_seq);
	   voprnds.quick_push (init);
@@ -4317,6 +4369,8 @@ get_initial_defs_for_reduction (slp_tree
    DOUBLE_REDUC is TRUE if double reduction phi nodes should be handled.
    SLP_NODE is an SLP node containing a group of reduction statements. The
      first one in this group is STMT.
+   NEUTRAL_OP is the value given by neutral_op_for_slp_reduction; it is
+   null if this is not an SLP reduction.

    This function:
    1. Creates the reduction def-use cycles: sets the arguments for
@@ -4364,7 +4418,8 @@ vect_create_epilog_for_reduction (vec<tr
				   vec<gimple *> reduction_phis,
				   bool double_reduc,
				   slp_tree slp_node,
-				  slp_instance slp_node_instance)
+				  slp_instance slp_node_instance,
+				  tree neutral_op)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
   stmt_vec_info prev_phi_info;
@@ -4400,6 +4455,7 @@ vect_create_epilog_for_reduction (vec<tr
   auto_vec<tree> vec_initial_defs;
   auto_vec<gimple *> phis;
   bool slp_reduc = false;
+  bool direct_slp_reduc;
   tree new_phi_result;
   gimple *inner_phi = NULL;
   tree induction_index = NULL_TREE;
@@ -4443,8 +4499,9 @@ vect_create_epilog_for_reduction (vec<tr
       unsigned vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
       vec_initial_defs.reserve (vec_num);
       get_initial_defs_for_reduction (slp_node_instance->reduc_phis,
-				      &vec_initial_defs, vec_num, code,
-				      GROUP_FIRST_ELEMENT (stmt_info));
+				      &vec_initial_defs, vec_num,
+				      GROUP_FIRST_ELEMENT (stmt_info),
+				      neutral_op);
     }
   else
     {
@@ -4738,6 +4795,12 @@ vect_create_epilog_for_reduction (vec<tr
	 b2 = operation (b1)  */
   slp_reduc = (slp_node && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)));

+  /* True if we should implement SLP_REDUC using native reduction operations
+     instead of scalar operations.  */
+  direct_slp_reduc = (reduc_code != ERROR_MARK
+		      && slp_reduc
+		      && !TYPE_VECTOR_SUBPARTS (vectype).is_constant ());
+
   /* In case of reduction chain, e.g.,
      # a1 = phi <a3, a0>
      a2 = operation (a1)
@@ -4745,7 +4808,7 @@ vect_create_epilog_for_reduction (vec<tr
      we may end up with more than one vector result.  Here we reduce them to
      one vector.  */
-  if (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+  if (GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) || direct_slp_reduc)
     {
       tree first_vect = PHI_RESULT (new_phis[0]);
       gassign *new_vec_stmt = NULL;
@@ -5032,6 +5095,81 @@ vect_create_epilog_for_reduction (vec<tr

       scalar_results.safe_push (new_temp);
     }
+  else if (direct_slp_reduc)
+    {
+      /* Here we create one vector for each of the GROUP_SIZE results,
+	 with the elements for other SLP statements replaced with the
+	 neutral value.  We can then do a normal reduction on each vector.  */
+
+      /* Enforced by vectorizable_reduction.
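+	 See the extra checks for variable-length unchained SLP
+	 reductions added to that function below.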
+	 */
+      gcc_assert (new_phis.length () == 1);
+      gcc_assert (pow2p_hwi (group_size));
+
+      slp_tree orig_phis_slp_node = slp_node_instance->reduc_phis;
+      vec<gimple *> orig_phis = SLP_TREE_SCALAR_STMTS (orig_phis_slp_node);
+      gimple_seq seq = NULL;
+
+      /* Build a vector {0, 1, 2, ...}, with the same number of elements
+	 and the same element size as VECTYPE.  */
+      tree index = build_index_vector (vectype, 0, 1);
+      tree index_type = TREE_TYPE (index);
+      tree index_elt_type = TREE_TYPE (index_type);
+      tree mask_type = build_same_sized_truth_vector_type (index_type);
+
+      /* Create a vector that, for each element, identifies which of
+	 the GROUP_SIZE results should use it.  */
+      tree index_mask = build_int_cst (index_elt_type, group_size - 1);
+      index = gimple_build (&seq, BIT_AND_EXPR, index_type, index,
+			    build_vector_from_val (index_type, index_mask));
+
+      /* Get a neutral vector value.  This is simply a splat of the neutral
+	 scalar value if we have one, otherwise the initial scalar value
+	 is itself a neutral value.  */
+      tree vector_identity = NULL_TREE;
+      if (neutral_op)
+	vector_identity = gimple_build_vector_from_val (&seq, vectype,
+							neutral_op);
+      for (unsigned int i = 0; i < group_size; ++i)
+	{
+	  /* If there's no universal neutral value, we can use the
+	     initial scalar value from the original PHI.  This is used
+	     for MIN and MAX reductions, for example.  */
+	  if (!neutral_op)
+	    {
+	      tree scalar_value
+		= PHI_ARG_DEF_FROM_EDGE (orig_phis[i],
+					 loop_preheader_edge (loop));
+	      vector_identity = gimple_build_vector_from_val (&seq, vectype,
+							      scalar_value);
+	    }
+
+	  /* Calculate the equivalent of:
+
+	     sel[j] = (index[j] == i);
+
+	     which selects the elements of NEW_PHI_RESULT that should
+	     be included in the result.  */
+	  tree compare_val = build_int_cst (index_elt_type, i);
+	  compare_val = build_vector_from_val (index_type, compare_val);
+	  tree sel = gimple_build (&seq, EQ_EXPR, mask_type,
+				   index, compare_val);
+
+	  /* Calculate the equivalent of:
+
+	     vec = sel ? new_phi_result : vector_identity;
+
+	     VEC is now suitable for a full vector reduction.  */
+	  tree vec = gimple_build (&seq, VEC_COND_EXPR, vectype,
+				   sel, new_phi_result, vector_identity);
+
+	  /* Do the reduction and convert it to the appropriate type.  */
+	  tree scalar = gimple_build (&seq, reduc_code,
+				      TREE_TYPE (vectype), vec);
+	  scalar = gimple_convert (&seq, scalar_type, scalar);
+	  scalar_results.safe_push (scalar);
+	}
+      gsi_insert_seq_before (&exit_gsi, seq, GSI_SAME_STMT);
+    }
   else
     {
       bool reduce_with_shift = have_whole_vector_shift (mode);
@@ -6254,25 +6392,64 @@ vectorizable_reduction (gimple *stmt, gi
       return false;
     }

-  if (double_reduc && !nunits_out.is_constant ())
+  /* For SLP reductions, see if there is a neutral value we can use.  */
+  tree neutral_op = NULL_TREE;
+  if (slp_node)
+    neutral_op
+      = neutral_op_for_slp_reduction (slp_node_instance->reduc_phis, code,
+				      GROUP_FIRST_ELEMENT (stmt_info) != NULL);
+
+  /* For double reductions, and for SLP reductions with a neutral value,
+     we construct a variable-length initial vector by loading a vector
+     full of the neutral value and then shift-and-inserting the start
+     values into the low-numbered elements.  */
+  if ((double_reduc || neutral_op)
+      && !nunits_out.is_constant ()
+      && !direct_internal_fn_supported_p (IFN_VEC_SHL_INSERT,
+					  vectype_out, OPTIMIZE_FOR_SPEED))
     {
-      /* The current double-reduction code creates the initial value
-	 element-by-element.  */
       if (dump_enabled_p ())
	 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "double reduction not supported for variable-length"
-			 " vectors.\n");
+			 "reduction on variable-length vectors requires"
+			 " target support for a vector-shift-and-insert"
+			 " operation.\n");
       return false;
     }

-  if (slp_node && !nunits_out.is_constant ())
-    {
-      /* The current SLP code creates the initial value element-by-element.  */
-      if (dump_enabled_p ())
-	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			 "SLP reduction not supported for variable-length"
-			 " vectors.\n");
-      return false;
+  /* Check extra constraints for variable-length unchained SLP reductions.  */
+  if (STMT_SLP_TYPE (stmt_info)
+      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))
+      && !nunits_out.is_constant ())
+    {
+      /* We checked above that we could build the initial vector when
+	 there's a neutral element value.  Check here for the case in
+	 which each SLP statement has its own initial value and in which
+	 that value needs to be repeated for every instance of the
+	 statement within the initial vector.  */
+      unsigned int group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
+      scalar_mode elt_mode = SCALAR_TYPE_MODE (TREE_TYPE (vectype_out));
+      if (!neutral_op
+	  && !can_duplicate_and_interleave_p (group_size, elt_mode))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "unsupported form of SLP reduction for"
+			     " variable-length vectors: cannot build"
+			     " initial vector.\n");
+	  return false;
+	}
+      /* The epilogue code relies on the number of elements being a multiple
+	 of the group size.  The duplicate-and-interleave approach to setting
+	 up the initial vector does too.  */
+      if (!multiple_p (nunits_out, group_size))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+			     "unsupported form of SLP reduction for"
+			     " variable-length vectors: the vector size"
+			     " is not a multiple of the number of results.\n");
+	  return false;
+	}
     }

   /* In case of widenning multiplication by a constant, we update the type
@@ -6540,7 +6717,8 @@ vectorizable_reduction (gimple *stmt, gi

   vect_create_epilog_for_reduction (vect_defs, stmt, reduc_def_stmt,
				     epilog_copies, epilog_reduc_code, phis,
-				    double_reduc, slp_node, slp_node_instance);
+				    double_reduc, slp_node, slp_node_instance,
+				    neutral_op);

   return true;
 }
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/tree-vect-slp.c	2017-11-17 09:44:46.389306597 +0000
@@ -214,10 +214,10 @@ vect_get_place_in_interleaving_chain (gi
    (if nonnull) and the type of each intermediate vector in
    *VECTOR_TYPE_OUT (if nonnull).  */

-static bool
+bool
 can_duplicate_and_interleave_p (unsigned int count, machine_mode elt_mode,
-				unsigned int *nvectors_out = NULL,
-				tree *vector_type_out = NULL)
+				unsigned int *nvectors_out,
+				tree *vector_type_out)
 {
   poly_int64 elt_bytes = count * GET_MODE_SIZE (elt_mode);
   poly_int64 nelts;
@@ -3277,7 +3277,7 @@ vect_mask_constant_operand_p (gimple *st
    We try to find the largest IM for which this sequence works, in order
    to cut down on the number of interleaves.  */

-static void
+void
 duplicate_and_interleave (gimple_seq *seq, tree vector_type, vec<tree> elts,
			   unsigned int nresults, vec<tree> &results)
 {
@@ -3559,6 +3559,26 @@ vect_get_constant_vectors (tree op, slp_
	  if (must_eq (TYPE_VECTOR_SUBPARTS (vector_type), nunits))
	    /* Build the vector directly from ELTS.  */
	    vec_cst = gimple_build_vector (&ctor_seq, vector_type, elts);
+	  else if (neutral_op)
+	    {
+	      /* Build a vector of the neutral value and shift the
+		 other elements into place.  */
+	      vec_cst = gimple_build_vector_from_val (&ctor_seq,
+						      vector_type,
+						      neutral_op);
+	      int k = nunits;
+	      while (k > 0 && elts[k - 1] == neutral_op)
+		k -= 1;
+	      while (k > 0)
+		{
+		  k -= 1;
+		  gcall *call = gimple_build_call_internal
+		    (IFN_VEC_SHL_INSERT, 2, vec_cst, elts[k]);
+		  vec_cst = make_ssa_name (vector_type);
+		  gimple_call_set_lhs (call, vec_cst);
+		  gimple_seq_add_stmt (&ctor_seq, call);
+		}
+	    }
	  else
	    {
	      if (vec_oprnds->is_empty ())
Index: gcc/config/aarch64/aarch64.md
===================================================================
--- gcc/config/aarch64/aarch64.md	2017-11-17 09:44:46.094275532 +0000
+++ gcc/config/aarch64/aarch64.md	2017-11-17 09:44:46.385706597 +0000
@@ -162,6 +162,7 @@ (define_c_enum "unspec" [
     UNSPEC_WHILE_LO
     UNSPEC_LDN
     UNSPEC_STN
+    UNSPEC_INSR
 ])

 (define_c_enum "unspecv" [
Index: gcc/config/aarch64/aarch64-sve.md
===================================================================
--- gcc/config/aarch64/aarch64-sve.md	2017-11-17 09:44:46.094275532 +0000
+++ gcc/config/aarch64/aarch64-sve.md	2017-11-17 09:44:46.385706597 +0000
@@ -2080,3 +2080,16 @@ (define_expand "vec_pack_<su>fix_trunc_v
     operands[5] = gen_reg_rtx (V8SImode);
   }
 )
+
+;; Shift an SVE vector left and insert a scalar into element 0.
+(define_insn "vec_shl_insert_<mode>"
+  [(set (match_operand:SVE_ALL 0 "register_operand" "=w, w")
+	(unspec:SVE_ALL
+	  [(match_operand:SVE_ALL 1 "register_operand" "0, 0")
+	   (match_operand:<VEL> 2 "register_operand" "rZ, w")]
+	  UNSPEC_INSR))]
+  "TARGET_SVE"
+  "@
+   insr\t%0.<Vetype>, %<vwcore>2
+   insr\t%0.<Vetype>, %<Vetype>2"
+)
Index: gcc/testsuite/gcc.dg/vect/pr37027.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr37027.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/pr37027.c	2017-11-17 09:44:46.386606597 +0000
@@ -32,5 +32,5 @@ foo (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/pr67790.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/pr67790.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/pr67790.c	2017-11-17 09:44:46.387506597 +0000
@@ -37,4 +37,4 @@ int main()
   return 0;
 }

-/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { xfail { vect_variable_length && vect_load_lanes } } } } */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-1.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-1.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-1.c	2017-11-17 09:44:46.387506597 +0000
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-2.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-2.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-2.c	2017-11-17 09:44:46.387506597 +0000
@@ -38,5 +38,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail vect_no_int_add } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_add || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_add } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-3.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-3.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-3.c	2017-11-17 09:44:46.387506597 +0000
@@ -58,7 +58,4 @@ int main (void)
 /* The initialization loop in main also gets vectorized.  */
 /* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 "vect" { xfail *-*-* } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target { vect_short_mult && { vect_widen_sum_hi_to_si && vect_unpack } } } } } */
-/* We can't yet create the necessary SLP constant vector for variable-length
-   SVE and so fall back to Advanced SIMD.  This means that we repeat each
-   analysis note.  */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { { ! vect_unpack } || { aarch64_sve && vect_variable_length } } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_widen_sum_hi_to_si_pattern || { ! vect_unpack } } } } } */
Index: gcc/testsuite/gcc.dg/vect/slp-reduc-5.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/slp-reduc-5.c	2017-11-17 09:44:46.094275532 +0000
+++ gcc/testsuite/gcc.dg/vect/slp-reduc-5.c	2017-11-17 09:44:46.387506597 +0000
@@ -43,5 +43,5 @@ int main (void)
 }

 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { xfail vect_no_int_min_max } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail { vect_no_int_min_max || { vect_variable_length && vect_load_lanes } } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { xfail vect_no_int_min_max } } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_5.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_5.c	2017-11-17 09:44:46.387506597 +0000
@@ -0,0 +1,58 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 2];						\
+      x1 += a[i * 2 + 1];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* ??? We don't think it's worth using SLP for the 64-bit loops, so we
+   fall back to the less efficient non-SLP implementation instead.  */
+/* ??? At present we don't treat the int8_t and int16_t loops as
+   reductions.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 2 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 2 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_5_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_5_run.c	2017-11-17 09:44:46.387506597 +0000
@@ -0,0 +1,35 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_5.c"
+
+#define N (141 * 2)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N], b[2] = { 40, 22 };				\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    vec_slp_##TYPE (a, b, N / 2);				\
+    TYPE x0 = 40;						\
+    TYPE x1 = 22;						\
+    for (unsigned int i = 0; i < N; i += 2)			\
+      {								\
+	x0 += a[i];						\
+	x1 += a[i + 1];						\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    /* _Float16 isn't precise enough for this.			\
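+       (0x1000 + 1 rounds back to 0x1000 in half precision,	\
+       so the result comparison is skipped for such types.)	\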
+       */							\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000			\
+	&& (x0 != b[0] || x1 != b[1]))				\
+      __builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_6.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_6.c	2017-11-17 09:44:46.387506597 +0000
@@ -0,0 +1,47 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  TYPE x2 = b[2];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 3];						\
+      x1 += a[i * 3 + 1];					\
+      x2 += a[i * 3 + 2];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+  b[2] = x2;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* These loops can't use SLP.  */
+/* { dg-final { scan-assembler-not {\tld1b\t} } } */
+/* { dg-final { scan-assembler-not {\tld1h\t} } } */
+/* { dg-final { scan-assembler-not {\tld1w\t} } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler {\tld3b\t} } } */
+/* { dg-final { scan-assembler {\tld3h\t} } } */
+/* { dg-final { scan-assembler {\tld3w\t} } } */
+/* { dg-final { scan-assembler {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_6_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_6_run.c	2017-11-17 09:44:46.387506597 +0000
@@ -0,0 +1,37 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_6.c"
+
+#define N (77 * 3)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N], b[3] = { 40, 22, 75 };				\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    vec_slp_##TYPE (a, b, N / 3);				\
+    TYPE x0 = 40;						\
+    TYPE x1 = 22;						\
+    TYPE x2 = 75;						\
+    for (unsigned int i = 0; i < N; i += 3)			\
+      {								\
+	x0 += a[i];						\
+	x1 += a[i + 1];						\
+	x2 += a[i + 2];						\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    /* _Float16 isn't precise enough for this.  */		\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000			\
+	&& (x0 != b[0] || x1 != b[1] || x2 != b[2]))		\
+      __builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_7.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_7.c	2017-11-17 09:44:46.388406597 +0000
@@ -0,0 +1,66 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -ffast-math" } */
+
+#include <stdint.h>
+
+#define VEC_PERM(TYPE)						\
+void __attribute__ ((noinline, noclone))			\
+vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)	\
+{								\
+  TYPE x0 = b[0];						\
+  TYPE x1 = b[1];						\
+  TYPE x2 = b[2];						\
+  TYPE x3 = b[3];						\
+  for (int i = 0; i < n; ++i)					\
+    {								\
+      x0 += a[i * 4];						\
+      x1 += a[i * 4 + 1];					\
+      x2 += a[i * 4 + 2];					\
+      x3 += a[i * 4 + 3];					\
+    }								\
+  b[0] = x0;							\
+  b[1] = x1;							\
+  b[2] = x2;							\
+  b[3] = x3;							\
+}
+
+#define TEST_ALL(T)				\
+  T (int8_t)					\
+  T (uint8_t)					\
+  T (int16_t)					\
+  T (uint16_t)					\
+  T (int32_t)					\
+  T (uint32_t)					\
+  T (int64_t)					\
+  T (uint64_t)					\
+  T (_Float16)					\
+  T (float)					\
+  T (double)
+
+TEST_ALL (VEC_PERM)
+
+/* We can't use SLP for the 64-bit loops, since the number of reduction
+   results might be greater than the number of elements in the vector.
+   Otherwise we have two loads per loop, one for the initial vector
+   and one for the loop body.  */
+/* ??? At present we don't treat the int8_t and int16_t loops as
+   reductions.  */
+/* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tld1b\t} 1 } } */
+/* { dg-final { scan-assembler-times {\tld1h\t} 2 } } */
+/* { dg-final { scan-assembler-times {\tld1w\t} 3 } } */
+/* { dg-final { scan-assembler-times {\tld4d\t} 3 } } */
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld1d\t} } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 8 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 8 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s} 8 } } */
+/* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 8 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\th[0-9]+, p[0-7], z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\ts[0-9]+, p[0-7], z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tfaddv\td[0-9]+, p[0-7], z[0-9]+\.d} 4 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_slp_7_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_slp_7_run.c	2017-11-17 09:44:46.388406597 +0000
@@ -0,0 +1,39 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -ffast-math" } */
+
+#include "sve_slp_7.c"
+
+#define N (54 * 4)
+
+#define HARNESS(TYPE)						\
+  {								\
+    TYPE a[N], b[4] = { 40, 22, 75, 19 };			\
+    for (unsigned int i = 0; i < N; ++i)			\
+      {								\
+	a[i] = i * 2 + i % 5;					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    vec_slp_##TYPE (a, b, N / 4);				\
+    TYPE x0 = 40;						\
+    TYPE x1 = 22;						\
+    TYPE x2 = 75;						\
+    TYPE x3 = 19;						\
+    for (unsigned int i = 0; i < N; i += 4)			\
+      {								\
+	x0 += a[i];						\
+	x1 += a[i + 1];						\
+	x2 += a[i + 2];						\
+	x3 += a[i + 3];						\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    /* _Float16 isn't precise enough for this.  */		\
+    if ((TYPE) 0x1000 + 1 != (TYPE) 0x1000			\
+	&& (x0 != b[0] || x1 != b[1] || x2 != b[2] || x3 != b[3])) \
+      __builtin_abort ();					\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST_ALL (HARNESS)
+}
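For readers who want to see the two techniques in miniature, here is a
stand-alone scalar model of them using ordinary arrays.  It is only an
illustration, not part of the patch: NELTS stands in for the runtime
number of SVE vector elements, and vec_shl_insert models the semantics
of IFN_VEC_SHL_INSERT/INSR as described above.

#include <stdio.h>

#define NELTS 8
#define GROUP_SIZE 2

/* Model of IFN_VEC_SHL_INSERT/INSR: shift VEC up by one element and
   write SCALAR to element 0.  */
static void
vec_shl_insert (int *vec, int scalar)
{
  for (int i = NELTS - 1; i > 0; --i)
    vec[i] = vec[i - 1];
  vec[0] = scalar;
}

int
main (void)
{
  /* (1) Initial vector: splat the PLUS neutral value 0, then
     shift-and-insert the initial values b = 7 and a = 5, giving
     { 5, 7, 0, ..., 0 }.  */
  int init[NELTS] = { 0 };
  vec_shl_insert (init, 7);	/* { 7, 0, 0, ... }  */
  vec_shl_insert (init, 5);	/* { 5, 7, 0, ... }  */

  /* Pretend ACC is the final loop-carried accumulator; even elements
     belong to SLP statement a and odd elements to statement b.  */
  int acc[NELTS];
  for (int i = 0; i < NELTS; ++i)
    acc[i] = init[i] + i;

  /* (2) Epilogue: for each SLP statement I, keep its own elements,
     replace the others with the neutral value and reduce the whole
     vector, as in the new direct_slp_reduc code.  */
  for (int i = 0; i < GROUP_SIZE; ++i)
    {
      int result = 0;
      for (int j = 0; j < NELTS; ++j)
	result += (j & (GROUP_SIZE - 1)) == i ? acc[j] : 0;
      printf ("result[%d] = %d\n", i, result);
    }
  return 0;
}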