
Add support for masked load/store_lanes

Message ID 87efp8wwu2.fsf@linaro.org
State New
Series Add support for masked load/store_lanes

Commit Message

Richard Sandiford Nov. 8, 2017, 4:37 p.m. UTC
This patch adds support for vectorising groups of IFN_MASK_LOADs
and IFN_MASK_STOREs using conditional load/store-lanes instructions.
This requires new internal functions to represent the result
(IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.

The normal IFN_{LOAD,STORE}_LANES functions are const operations
that logically just perform the permute: the load or store is
encoded as a MEM operand to the call statement.  In contrast,
the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of
interface as IFN_MASK_{LOAD,STORE}, since the memory is only
conditionally accessed.
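
As a rough sketch of the kind of loop this targets (the function and
variable names below are made up for illustration, and the GIMPLE in the
comment is only conceptual, not a dump from this patch):

/* Two interleaved loads guarded by the same condition form a group that
   can now be vectorized with MASK_LOAD_LANES instead of two separate
   IFN_MASK_LOADs plus a permute.  */
void
f (int *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < n; ++i)
    if (c[i])
      /* Conceptually this becomes something like:
	   vec_array = MASK_LOAD_LANES (ptr, alias_ptr, mask);
	   lane0 = vec_array[0];
	   lane1 = vec_array[1];
	 where the call takes the base pointer, alias pointer and mask,
	 following the IFN_MASK_LOAD convention.  */
      a[i] = b[i * 2] + b[i * 2 + 1];
}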

The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.

Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu
and powerpc64le-linux-gnu.  OK to install?

Thanks,
Richard


2017-11-08  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* optabs.def (vec_mask_load_lanes_optab): New optab.
	(vec_mask_store_lanes_optab): Likewise.
	* internal-fn.def (MASK_LOAD_LANES): New internal function.
	(MASK_STORE_LANES): Likewise.
	* internal-fn.c (mask_load_lanes_direct): New macro.
	(mask_store_lanes_direct): Likewise.
	(expand_mask_load_optab_fn): Handle masked operations.
	(expand_mask_load_lanes_optab_fn): New macro.
	(expand_mask_store_optab_fn): Handle masked operations.
	(expand_mask_store_lanes_optab_fn): New macro.
	(direct_mask_load_lanes_optab_supported_p): Likewise.
	(direct_mask_store_lanes_optab_supported_p): Likewise.
	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
	parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-data-refs.c (strip_conversion): New function.
	(can_group_stmts_p): Likewise.
	(vect_analyze_data_ref_accesses): Use it instead of checking
	for a pair of assignments.
	(vect_store_lanes_supported): Take a masked_p parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to
	vect_store_lanes_supported and vect_load_lanes_supported.
	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
	* tree-vect-stmts.c (replace_mask_load): New function, split
	out from vectorizable_mask_load_store.  Keep the group information
	up-to-date.
	(get_store_op): New function.
	(get_group_load_store_type): Take a masked_p parameter.  Don't
	allow gaps for masked accesses.  Use get_store_op.  Update calls
	to vect_store_lanes_supported and vect_load_lanes_supported.
	(get_load_store_type): Take a masked_p parameter and update
	call to get_group_load_store_type.
	(init_stored_values, advance_stored_values): New functions,
	split out from vectorizable_store.
	(do_load_lanes, do_store_lanes): New functions.
	(get_masked_group_alias_ptr_type): New function.
	(vectorizable_mask_load_store): Update call to get_load_store_type.
	Handle masked VMAT_LOAD_STORE_LANES.  Update GROUP_STORE_COUNT
	when vectorizing a group of stores and only vectorize when we
	reach the last statement in the group.  Vectorize the first
	statement in a group of loads.  Use an array aggregate type
	rather than a vector type for load/store_lanes.  Use
	init_stored_values, advance_stored_values, do_load_lanes,
	do_store_lanes, get_masked_group_alias_ptr_type and replace_mask_load.
	(vectorizable_store): Update call to get_load_store_type.
	Use init_stored_values, advance_stored_values and do_store_lanes.
	(vectorizable_load): Update call to get_load_store_type.
	Use do_load_lanes.
	(vect_transform_stmt): Set grouped_store for grouped IFN_MASK_STOREs.
	Only set is_store for the last element in the group.

gcc/testsuite/
	* gcc.dg/vect/vect-ooo-group-1.c: New test.
	* gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

Comments

Richard Sandiford Nov. 17, 2017, 9:36 a.m. UTC | #1
Richard Sandiford <richard.sandiford@linaro.org> writes:
> This patch adds support for vectorising groups of IFN_MASK_LOADs
> and IFN_MASK_STOREs using conditional load/store-lanes instructions.
> This requires new internal functions to represent the result
> (IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.
>
> The normal IFN_{LOAD,STORE}_LANES functions are const operations
> that logically just perform the permute: the load or store is
> encoded as a MEM operand to the call statement.  In contrast,
> the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of
> interface as IFN_MASK_{LOAD,STORE}, since the memory is only
> conditionally accessed.
>
> The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.
>
> Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?

Here's an updated (and much simpler) version that applies on top of the
series I just posted to remove vectorizable_mask_load_store.  Tested as
before.

Thanks,
Richard


2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.
	(vec_mask_store_lanes@var{m}@var{n}): Likewise.
	* optabs.def (vec_mask_load_lanes_optab): New optab.
	(vec_mask_store_lanes_optab): Likewise.
	* internal-fn.def (MASK_LOAD_LANES): New internal function.
	(MASK_STORE_LANES): Likewise.
	* internal-fn.c (mask_load_lanes_direct): New macro.
	(mask_store_lanes_direct): Likewise.
	(expand_mask_load_optab_fn): Handle masked operations.
	(expand_mask_load_lanes_optab_fn): New macro.
	(expand_mask_store_optab_fn): Handle masked operations.
	(expand_mask_store_lanes_optab_fn): New macro.
	(direct_mask_load_lanes_optab_supported_p): Likewise.
	(direct_mask_store_lanes_optab_supported_p): Likewise.
	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
	parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-data-refs.c (strip_conversion): New function.
	(can_group_stmts_p): Likewise.
	(vect_analyze_data_ref_accesses): Use it instead of checking
	for a pair of assignments.
	(vect_store_lanes_supported): Take a masked_p parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to
	vect_store_lanes_supported and vect_load_lanes_supported.
	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p
	parameter.  Don't allow gaps for masked accesses.
	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported
	and vect_load_lanes_supported.
	(get_load_store_type): Take a masked_p parameter and update
	call to get_group_load_store_type.
	(vectorizable_store): Update call to get_load_store_type.
	Handle IFN_MASK_STORE_LANES.
	(vectorizable_load): Update call to get_load_store_type.
	Handle IFN_MASK_LOAD_LANES.

gcc/testsuite/
	* gcc.dg/vect/vect-ooo-group-1.c: New test.
	* gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.
	* gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2017-11-17 09:06:19.783260344 +0000
+++ gcc/doc/md.texi	2017-11-17 09:35:23.400133274 +0000
@@ -4855,6 +4855,23 @@ loads for vectors of mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_mask_load_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_mask_load_lanes@var{m}@var{n}}
+Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional
+mask operand (operand 2) that specifies which elements of the destination
+vectors should be loaded.  Other elements of the destination
+vectors are set to zero.  The operation is equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  if (operand2[j])
+    for (i = 0; i < c; i++)
+      operand0[i][j] = operand1[j * c + i];
+@end smallexample
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_store_lanes@var{m}@var{n}} instruction pattern
 @item @samp{vec_store_lanes@var{m}@var{n}}
 Equivalent to @samp{vec_load_lanes@var{m}@var{n}}, with the memory
@@ -4872,6 +4889,22 @@ for a memory operand 0 and register oper
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_mask_store_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_mask_store_lanes@var{m}@var{n}}
+Like @samp{vec_store_lanes@var{m}@var{n}}, but takes an additional
+mask operand (operand 2) that specifies which elements of the source
+vectors should be stored.  The operation is equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  if (operand2[j])
+    for (i = 0; i < c; i++)
+      operand0[j * c + i] = operand1[i][j];
+@end smallexample
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-17 09:35:23.086033247 +0000
+++ gcc/optabs.def	2017-11-17 09:35:23.401033274 +0000
@@ -80,6 +80,8 @@ OPTAB_CD(ssmsub_widen_optab, "ssmsub$b$a
 OPTAB_CD(usmsub_widen_optab, "usmsub$a$b4")
 OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
+OPTAB_CD(vec_mask_load_lanes_optab, "vec_mask_load_lanes$a$b")
+OPTAB_CD(vec_mask_store_lanes_optab, "vec_mask_store_lanes$a$b")
 OPTAB_CD(vcond_optab, "vcond$a$b")
 OPTAB_CD(vcondu_optab, "vcondu$a$b")
 OPTAB_CD(vcondeq_optab, "vcondeq$a$b")
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-17 09:35:23.086033247 +0000
+++ gcc/internal-fn.def	2017-11-17 09:35:23.401033274 +0000
@@ -45,9 +45,11 @@ along with GCC; see the file COPYING3.
 
    - mask_load: currently just maskload
    - load_lanes: currently just vec_load_lanes
+   - mask_load_lanes: currently just vec_mask_load_lanes
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
+   - mask_store_lanes: currently just vec_mask_store_lanes
 
    DEF_INTERNAL_FLT_FN is like DEF_INTERNAL_OPTAB_FN, but in addition,
    the function implements the computational part of a built-in math
@@ -92,9 +94,13 @@ along with GCC; see the file COPYING3.
 
 DEF_INTERNAL_OPTAB_FN (MASK_LOAD, ECF_PURE, maskload, mask_load)
 DEF_INTERNAL_OPTAB_FN (LOAD_LANES, ECF_CONST, vec_load_lanes, load_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_LOAD_LANES, ECF_PURE,
+		       vec_mask_load_lanes, mask_load_lanes)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
+		       vec_mask_store_lanes, mask_store_lanes)
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-17 09:35:23.086033247 +0000
+++ gcc/internal-fn.c	2017-11-17 09:35:23.401033274 +0000
@@ -82,8 +82,10 @@ #define DEF_INTERNAL_FN(CODE, FLAGS, FNS
 #define not_direct { -2, -2, false }
 #define mask_load_direct { -1, 2, false }
 #define load_lanes_direct { -1, -1, false }
+#define mask_load_lanes_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
+#define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 
@@ -2363,7 +2365,7 @@ expand_LOOP_DIST_ALIAS (internal_fn, gca
   gcc_unreachable ();
 }
 
-/* Expand MASK_LOAD call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2372,6 +2374,7 @@ expand_mask_load_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, target, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   lhs = gimple_call_lhs (stmt);
@@ -2384,6 +2387,12 @@ expand_mask_load_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   rhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_load_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2391,12 +2400,12 @@ expand_mask_load_optab_fn (internal_fn,
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
-/* Expand MASK_STORE call STMT using optab OPTAB.  */
+#define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+
+/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2405,6 +2414,7 @@ expand_mask_store_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, reg, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   rhs = gimple_call_arg (stmt, 3);
@@ -2415,6 +2425,12 @@ expand_mask_store_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   lhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_store_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2422,11 +2438,11 @@ expand_mask_store_optab_fn (internal_fn,
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -2818,8 +2834,10 @@ #define direct_unary_optab_supported_p d
 #define direct_binary_optab_supported_p direct_optab_supported_p
 #define direct_mask_load_optab_supported_p direct_optab_supported_p
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 
 /* Return true if FN is supported for the types in TYPES when the
    optimization type is OPT_TYPE.  The types are those associated with
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-17 09:35:23.086033247 +0000
+++ gcc/tree-vectorizer.h	2017-11-17 09:35:23.406433274 +0000
@@ -1292,9 +1292,9 @@ extern tree bump_vector_ptr (tree, gimpl
 			     tree);
 extern tree vect_create_destination_var (tree, tree);
 extern bool vect_grouped_store_supported (tree, unsigned HOST_WIDE_INT);
-extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern bool vect_grouped_load_supported (tree, bool, unsigned HOST_WIDE_INT);
-extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern void vect_permute_store_chain (vec<tree> ,unsigned int, gimple *,
                                     gimple_stmt_iterator *, vec<tree> *);
 extern tree vect_setup_realignment (gimple *, gimple_stmt_iterator *, tree *,
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.085133247 +0000
+++ gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.404633274 +0000
@@ -2791,6 +2791,62 @@ dr_group_sort_cmp (const void *dra_, con
   return cmp;
 }
 
+/* If OP is the result of a conversion, return the unconverted value,
+   otherwise return null.  */
+
+static tree
+strip_conversion (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL_TREE;
+  gimple *stmt = SSA_NAME_DEF_STMT (op);
+  if (!is_gimple_assign (stmt)
+      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
+    return NULL_TREE;
+  return gimple_assign_rhs1 (stmt);
+}
+
+/* Return true if vectorizable_* routines can handle statements STMT1
+   and STMT2 being in a single group.  */
+
+static bool
+can_group_stmts_p (gimple *stmt1, gimple *stmt2)
+{
+  if (gimple_assign_single_p (stmt1))
+    return gimple_assign_single_p (stmt2);
+
+  if (is_gimple_call (stmt1) && gimple_call_internal_p (stmt1))
+    {
+      /* Check for two masked loads or two masked stores.  */
+      if (!is_gimple_call (stmt2) || !gimple_call_internal_p (stmt2))
+	return false;
+      internal_fn ifn = gimple_call_internal_fn (stmt1);
+      if (ifn != IFN_MASK_LOAD && ifn != IFN_MASK_STORE)
+	return false;
+      if (ifn != gimple_call_internal_fn (stmt2))
+	return false;
+
+      /* Check that the masks are the same.  Cope with casts of masks,
+	 like those created by build_mask_conversion.  */
+      tree mask1 = gimple_call_arg (stmt1, 2);
+      tree mask2 = gimple_call_arg (stmt2, 2);
+      if (!operand_equal_p (mask1, mask2, 0))
+	{
+	  mask1 = strip_conversion (mask1);
+	  if (!mask1)
+	    return false;
+	  mask2 = strip_conversion (mask2);
+	  if (!mask2)
+	    return false;
+	  if (!operand_equal_p (mask1, mask2, 0))
+	    return false;
+	}
+      return true;
+    }
+
+  return false;
+}
+
 /* Function vect_analyze_data_ref_accesses.
 
    Analyze the access pattern of all the data references in the loop.
@@ -2857,8 +2913,7 @@ vect_analyze_data_ref_accesses (vec_info
 	      || data_ref_compare_tree (DR_BASE_ADDRESS (dra),
 					DR_BASE_ADDRESS (drb)) != 0
 	      || data_ref_compare_tree (DR_OFFSET (dra), DR_OFFSET (drb)) != 0
-	      || !gimple_assign_single_p (DR_STMT (dra))
-	      || !gimple_assign_single_p (DR_STMT (drb)))
+	      || !can_group_stmts_p (DR_STMT (dra), DR_STMT (drb)))
 	    break;
 
 	  /* Check that the data-refs have the same constant size.  */
@@ -4662,15 +4717,21 @@ vect_grouped_store_supported (tree vecty
 }
 
 
-/* Return TRUE if vec_store_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}store_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			    bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_store_lanes",
-				       vec_store_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_store_lanes",
+					 vec_mask_store_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_store_lanes",
+					 vec_store_lanes_optab,
+					 vectype, count);
 }
 
 
@@ -5238,15 +5299,21 @@ vect_grouped_load_supported (tree vectyp
   return false;
 }
 
-/* Return TRUE if vec_load_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}load_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			   bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_load_lanes",
-				       vec_load_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_load_lanes",
+					 vec_mask_load_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_load_lanes",
+					 vec_load_lanes_optab,
+					 vectype, count);
 }
 
 /* Function vect_permute_load_chain.
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-17 09:35:23.086033247 +0000
+++ gcc/tree-vect-loop.c	2017-11-17 09:35:23.404633274 +0000
@@ -2247,7 +2247,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
       vinfo = vinfo_for_stmt (STMT_VINFO_GROUP_FIRST_ELEMENT (vinfo));
       unsigned int size = STMT_VINFO_GROUP_SIZE (vinfo);
       tree vectype = STMT_VINFO_VECTYPE (vinfo);
-      if (! vect_store_lanes_supported (vectype, size)
+      if (! vect_store_lanes_supported (vectype, size, false)
 	  && ! vect_grouped_store_supported (vectype, size))
 	return false;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), j, node)
@@ -2257,7 +2257,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
 	  bool single_element_p = !STMT_VINFO_GROUP_NEXT_ELEMENT (vinfo);
 	  size = STMT_VINFO_GROUP_SIZE (vinfo);
 	  vectype = STMT_VINFO_VECTYPE (vinfo);
-	  if (! vect_load_lanes_supported (vectype, size)
+	  if (! vect_load_lanes_supported (vectype, size, false)
 	      && ! vect_grouped_load_supported (vectype, single_element_p,
 						size))
 	    return false;
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2017-11-17 09:35:23.086033247 +0000
+++ gcc/tree-vect-slp.c	2017-11-17 09:35:23.405533274 +0000
@@ -2175,7 +2175,7 @@ vect_analyze_slp_instance (vec_info *vin
 	 instructions do not generate this SLP instance.  */
       if (is_a <loop_vec_info> (vinfo)
 	  && loads_permuted
-	  && dr && vect_store_lanes_supported (vectype, group_size))
+	  && dr && vect_store_lanes_supported (vectype, group_size, false))
 	{
 	  slp_tree load_node;
 	  FOR_EACH_VEC_ELT (loads, i, load_node)
@@ -2188,7 +2188,7 @@ vect_analyze_slp_instance (vec_info *vin
 	      if (STMT_VINFO_STRIDED_P (stmt_vinfo)
 		  || ! vect_load_lanes_supported
 			(STMT_VINFO_VECTYPE (stmt_vinfo),
-			 GROUP_SIZE (stmt_vinfo)))
+			 GROUP_SIZE (stmt_vinfo), false))
 		break;
 	    }
 	  if (i == loads.length ())
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-17 09:35:23.086033247 +0000
+++ gcc/tree-vect-stmts.c	2017-11-17 09:35:23.405533274 +0000
@@ -1756,7 +1756,7 @@ vect_get_store_rhs (gimple *stmt)
 
 static bool
 get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
-			   vec_load_store_type vls_type,
+			   bool masked_p, vec_load_store_type vls_type,
 			   vect_memory_access_type *memory_access_type)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
@@ -1777,7 +1777,10 @@ get_group_load_store_type (gimple *stmt,
 
   /* True if we can cope with such overrun by peeling for gaps, so that
      there is at least one final scalar iteration after the vector loop.  */
-  bool can_overrun_p = (vls_type == VLS_LOAD && loop_vinfo && !loop->inner);
+  bool can_overrun_p = (!masked_p
+			&& vls_type == VLS_LOAD
+			&& loop_vinfo
+			&& !loop->inner);
 
   /* There can only be a gap at the end of the group if the stride is
      known at compile time.  */
@@ -1840,6 +1843,7 @@ get_group_load_store_type (gimple *stmt,
 	 and so we are guaranteed to access a non-gap element in the
 	 same B-sized block.  */
       if (would_overrun_p
+	  && !masked_p
 	  && gap < (vect_known_alignment_in_bytes (first_dr)
 		    / vect_get_scalar_dr_size (first_dr)))
 	would_overrun_p = false;
@@ -1850,8 +1854,8 @@ get_group_load_store_type (gimple *stmt,
 	{
 	  /* First try using LOAD/STORE_LANES.  */
 	  if (vls_type == VLS_LOAD
-	      ? vect_load_lanes_supported (vectype, group_size)
-	      : vect_store_lanes_supported (vectype, group_size))
+	      ? vect_load_lanes_supported (vectype, group_size, masked_p)
+	      : vect_store_lanes_supported (vectype, group_size, masked_p))
 	    {
 	      *memory_access_type = VMAT_LOAD_STORE_LANES;
 	      overrun_p = would_overrun_p;
@@ -1877,8 +1881,7 @@ get_group_load_store_type (gimple *stmt,
       gimple *next_stmt = GROUP_NEXT_ELEMENT (stmt_info);
       while (next_stmt)
 	{
-	  gcc_assert (gimple_assign_single_p (next_stmt));
-	  tree op = gimple_assign_rhs1 (next_stmt);
+	  tree op = vect_get_store_rhs (next_stmt);
 	  gimple *def_stmt;
 	  enum vect_def_type dt;
 	  if (!vect_is_simple_use (op, vinfo, &def_stmt, &dt))
@@ -1962,11 +1965,12 @@ get_negative_load_store_type (gimple *st
    or scatters, fill in GS_INFO accordingly.
 
    SLP says whether we're performing SLP rather than loop vectorization.
+   MASKED_P is true if the statement is conditional on a vectorized mask.
    VECTYPE is the vector type that the vectorized statements will use.
    NCOPIES is the number of vector statements that will be needed.  */
 
 static bool
-get_load_store_type (gimple *stmt, tree vectype, bool slp,
+get_load_store_type (gimple *stmt, tree vectype, bool slp, bool masked_p,
 		     vec_load_store_type vls_type, unsigned int ncopies,
 		     vect_memory_access_type *memory_access_type,
 		     gather_scatter_info *gs_info)
@@ -1994,7 +1998,7 @@ get_load_store_type (gimple *stmt, tree
     }
   else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
-      if (!get_group_load_store_type (stmt, vectype, slp, vls_type,
+      if (!get_group_load_store_type (stmt, vectype, slp, masked_p, vls_type,
 				      memory_access_type))
 	return false;
     }
@@ -5733,23 +5737,26 @@ vectorizable_store (gimple *stmt, gimple
     return false;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, mask, vls_type, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
   if (mask)
     {
-      if (memory_access_type != VMAT_CONTIGUOUS)
+      if (memory_access_type == VMAT_CONTIGUOUS)
+	{
+	  if (!VECTOR_MODE_P (vec_mode)
+	      || !can_vec_mask_load_store_p (vec_mode,
+					     TYPE_MODE (mask_vectype), false))
+	    return false;
+	}
+      else if (memory_access_type != VMAT_LOAD_STORE_LANES)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 			     "unsupported access type for masked store.\n");
 	  return false;
 	}
-      if (!VECTOR_MODE_P (vec_mode)
-	  || !can_vec_mask_load_store_p (vec_mode, TYPE_MODE (mask_vectype),
-					 false))
-	return false;
     }
   else
     {
@@ -6389,12 +6396,27 @@ vectorizable_store (gimple *stmt, gimple
 	      write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
 	    }
 
-	  /* Emit:
-	       MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_STORE_LANES, 1,
-						    vec_array);
-	  gimple_call_set_lhs (call, data_ref);
+	  gcall *call;
+	  if (mask)
+	    {
+	      /* Emit:
+		   MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, VEC_MASK,
+				     VEC_ARRAY).  */
+	      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (vectype));
+	      tree alias_ptr = build_int_cst (ref_type, align);
+	      call = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
+						 dataref_ptr, alias_ptr,
+						 vec_mask, vec_array);
+	    }
+	  else
+	    {
+	      /* Emit:
+		   MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
+	      data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
+	      call = gimple_build_call_internal (IFN_STORE_LANES, 1,
+						 vec_array);
+	      gimple_call_set_lhs (call, data_ref);
+	    }
 	  gimple_call_set_nothrow (call, true);
 	  new_stmt = call;
 	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
@@ -6842,7 +6864,7 @@ vectorizable_load (gimple *stmt, gimple_
     }
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, VLS_LOAD, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, mask, VLS_LOAD, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
@@ -6850,8 +6872,9 @@ vectorizable_load (gimple *stmt, gimple_
     {
       if (memory_access_type == VMAT_CONTIGUOUS)
 	{
-	  if (!VECTOR_MODE_P (TYPE_MODE (vectype))
-	      || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
+	  machine_mode vec_mode = TYPE_MODE (vectype);
+	  if (!VECTOR_MODE_P (vec_mode)
+	      || !can_vec_mask_load_store_p (vec_mode,
 					     TYPE_MODE (mask_vectype), true))
 	    return false;
 	}
@@ -6869,7 +6892,7 @@ vectorizable_load (gimple *stmt, gimple_
 	      return false;
 	    }
 	}
-      else
+      else if (memory_access_type != VMAT_LOAD_STORE_LANES)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7419,11 +7442,25 @@ vectorizable_load (gimple *stmt, gimple_
 
 	  vec_array = create_vector_array (vectype, vec_num);
 
-	  /* Emit:
-	       VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_LOAD_LANES, 1,
-						    data_ref);
+	  gcall *call;
+	  if (mask)
+	    {
+	      /* Emit:
+		   VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR,
+		                                VEC_MASK).  */
+	      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (vectype));
+	      tree alias_ptr = build_int_cst (ref_type, align);
+	      call = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
+						 dataref_ptr, alias_ptr,
+						 vec_mask);
+	    }
+	  else
+	    {
+	      /* Emit:
+		   VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
+	      data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
+	      call = gimple_build_call_internal (IFN_LOAD_LANES, 1, data_ref);
+	    }
 	  gimple_call_set_lhs (call, vec_array);
 	  gimple_call_set_nothrow (call, true);
 	  new_stmt = call;
Index: gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c	2017-11-17 09:35:23.401033274 +0000
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c)
+{
+  for (int i = 0; i < 100; ++i)
+    if (c[i])
+      {
+	a[i * 2] = b[i * 5 + 2];
+	a[i * 2 + 1] = b[i * 5];
+      }
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c	2017-11-17 09:35:23.401033274 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2] + src[i * 2 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c	2017-11-17 09:35:23.401033274 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 2];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 2; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_2 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = in[i * 2] + in[i * 2 + 1];	\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 3]					\
+		   + src[i * 3 + 1]				\
+		   + src[i * 3 + 2]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,40 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 3];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 3; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_3 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 3]			\
+			   + in[i * 3 + 1]		\
+			   + in[i * 3 + 2]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 4]					\
+		   + src[i * 4 + 1]				\
+		   + src[i * 4 + 2]				\
+		   + src[i * 4 + 3]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,41 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 4];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 4; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_4 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 4]			\
+			   + in[i * 4 + 1]		\
+			   + in[i * 4 + 2]		\
+			   + in[i * 4 + 3]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 3];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2];					\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld3b\t} } } */
+/* { dg-final { scan-assembler-not {\tld3h\t} } } */
+/* { dg-final { scan-assembler-not {\tld3w\t} } } */
+/* { dg-final { scan-assembler-not {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld4d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c	2017-11-17 09:35:23.402833274 +0000
@@ -0,0 +1,73 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 2] = value;				\
+	    dest[i * 2 + 1] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 2];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 2; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_2 (out, in, mask, 17, N);				\
+    for (int i = 0; i < N * 2; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 2] + 17);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 2] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,74 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 3] = value;				\
+	    dest[i * 3 + 1] = value;				\
+	    dest[i * 3 + 2] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 3];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 3; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_3 (out, in, mask, 11, N);				\
+    for (int i = 0; i < N * 3; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 3] + 11);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 3] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,75 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 4] = value;				\
+	    dest[i * 4 + 1] = value;				\
+	    dest[i * 4 + 2] = value;				\
+	    dest[i * 4 + 3] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 4];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 4; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_4 (out, in, mask, 42, N);				\
+    for (int i = 0; i < N * 4; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 4] + 42);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 4] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c
===================================================================
--- /dev/null	2017-11-14 14:28:07.424493901 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c	2017-11-17 09:35:23.403733274 +0000
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	if (cond[i] < 8)					\
+	  dest[i * 2] = src[i];					\
+	if (cond[i] > 2)					\
+	  dest[i * 2 + 1] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tst2b\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2h\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2w\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2d\t.z[0-9]} } } */
Jeff Law Dec. 12, 2017, 12:59 a.m. UTC | #2
On 11/17/2017 02:36 AM, Richard Sandiford wrote:
> Richard Sandiford <richard.sandiford@linaro.org> writes:

>> This patch adds support for vectorising groups of IFN_MASK_LOADs

>> and IFN_MASK_STOREs using conditional load/store-lanes instructions.

>> This requires new internal functions to represent the result

>> (IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.

>>

>> The normal IFN_{LOAD,STORE}_LANES functions are const operations

>> that logically just perform the permute: the load or store is

>> encoded as a MEM operand to the call statement.  In contrast,

>> the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of

>> interface as IFN_MASK_{LOAD,STORE}, since the memory is only

>> conditionally accessed.

>>

>> The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.

>>

>> Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu

>> and powerpc64le-linux-gnu.  OK to install?

> 

> Here's an updated (and much simpler) version that applies on top of the

> series I just posted to remove vectorizable_mask_load_store.  Tested as

> before.

> 

> Thanks,

> Richard

> 

> 

> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

> 	    Alan Hayward  <alan.hayward@arm.com>

> 	    David Sherwood  <david.sherwood@arm.com>

> 

> gcc/

> 	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.

> 	(vec_mask_store_lanes@var{m}@var{n}): Likewise.

> 	* optabs.def (vec_mask_load_lanes_optab): New optab.

> 	(vec_mask_store_lanes_optab): Likewise.

> 	* internal-fn.def (MASK_LOAD_LANES): New internal function.

> 	(MASK_STORE_LANES): Likewise.

> 	* internal-fn.c (mask_load_lanes_direct): New macro.

> 	(mask_store_lanes_direct): Likewise.

> 	(expand_mask_load_optab_fn): Handle masked operations.

> 	(expand_mask_load_lanes_optab_fn): New macro.

> 	(expand_mask_store_optab_fn): Handle masked operations.

> 	(expand_mask_store_lanes_optab_fn): New macro.

> 	(direct_mask_load_lanes_optab_supported_p): Likewise.

> 	(direct_mask_store_lanes_optab_supported_p): Likewise.

> 	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p

> 	parameter.

> 	(vect_load_lanes_supported): Likewise.

> 	* tree-vect-data-refs.c (strip_conversion): New function.

> 	(can_group_stmts_p): Likewise.

> 	(vect_analyze_data_ref_accesses): Use it instead of checking

> 	for a pair of assignments.

> 	(vect_store_lanes_supported): Take a masked_p parameter.

> 	(vect_load_lanes_supported): Likewise.

> 	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to

> 	vect_store_lanes_supported and vect_load_lanes_supported.

> 	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.

> 	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p

> 	parameter.  Don't allow gaps for masked accesses.

> 	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported

> 	and vect_load_lanes_supported.

> 	(get_load_store_type): Take a masked_p parameter and update

> 	call to get_group_load_store_type.

> 	(vectorizable_store): Update call to get_load_store_type.

> 	Handle IFN_MASK_STORE_LANES.

> 	(vectorizable_load): Update call to get_load_store_type.

> 	Handle IFN_MASK_LOAD_LANES.

> 

> gcc/testsuite/

> 	* gcc.dg/vect/vect-ooo-group-1.c: New test.

> 	* gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.

> 	* gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

> 

> Index: gcc/doc/md.texi

> ===================================================================

> --- gcc/doc/md.texi	2017-11-17 09:06:19.783260344 +0000

> +++ gcc/doc/md.texi	2017-11-17 09:35:23.400133274 +0000

> @@ -4855,6 +4855,23 @@ loads for vectors of mode @var{n}.

>  

>  This pattern is not allowed to @code{FAIL}.

>  

> +@cindex @code{vec_mask_load_lanes@var{m}@var{n}} instruction pattern

> +@item @samp{vec_mask_load_lanes@var{m}@var{n}}

> +Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional

> +mask operand (operand 2) that specifies which elements of the destination

> +vectors should be loaded.  Other elements of the destination

> +vectors are set to zero.  The operation is equivalent to:

> +

> +@smallexample

> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

> +  if (operand2[j])

> +    for (i = 0; i < c; i++)

> +      operand0[i][j] = operand1[j * c + i];

> +@end smallexample

Don't you need to set operand0[i][j] to zero if operand2[j] is zero for
this to be correct?  And if that's the case, don't you need to expose
the set to zero as a side effect?



> +@cindex @code{vec_mask_store_lanes@var{m}@var{n}} instruction pattern

> +@item @samp{vec_mask_store_lanes@var{m}@var{n}}

> +Like @samp{vec_store_lanes@var{m}@var{n}}, but takes an additional

> +mask operand (operand 2) that specifies which elements of the source

> +vectors should be stored.  The operation is equivalent to:

> +

> +@smallexample

> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

> +  if (operand2[j])

> +    for (i = 0; i < c; i++)

> +      operand0[j * c + i] = operand1[i][j];

> +@end smallexample

> +

> +This pattern is not allowed to @code{FAIL}.

Is the asymmetry between loads and stores intentional?  In particular,
for loads the docs say "Other elements of the destination vectors are
set to zero".



> Index: gcc/tree-vect-data-refs.c

> ===================================================================

> --- gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.085133247 +0000

> +++ gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.404633274 +0000

> @@ -2791,6 +2791,62 @@ dr_group_sort_cmp (const void *dra_, con

>    return cmp;

>  }

>  

> +/* If OP is the result of a conversion, return the unconverted value,

> +   otherwise return null.  */

> +

> +static tree

> +strip_conversion (tree op)

> +{

> +  if (TREE_CODE (op) != SSA_NAME)

> +    return NULL_TREE;

> +  gimple *stmt = SSA_NAME_DEF_STMT (op);

> +  if (!is_gimple_assign (stmt)

> +      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))

> +    return NULL_TREE;

> +  return gimple_assign_rhs1 (stmt);

> +}

Do you have any desire to walk back through multiple conversions?
They're only used for masks when comparing whether masks are the same,
so it probably doesn't matter in practice whether we handle multiple
conversions, I guess.

Somehow I know we've got to have an equivalent of this routine lying
around somewhere :-)  Though I don't think it's worth the time to find.

Not an ACK or NAK.  I'm a bit hung up on the doc issue and how it
potentially impacts how we support this capability.

jeff
James Greenhalgh Jan. 7, 2018, 8:51 p.m. UTC | #3
On Tue, Dec 12, 2017 at 12:59:33AM +0000, Jeff Law wrote:
> On 11/17/2017 02:36 AM, Richard Sandiford wrote:

> > Richard Sandiford <richard.sandiford@linaro.org> writes:

> >> This patch adds support for vectorising groups of IFN_MASK_LOADs

> >> and IFN_MASK_STOREs using conditional load/store-lanes instructions.

> >> This requires new internal functions to represent the result

> >> (IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.

> >>

> >> The normal IFN_{LOAD,STORE}_LANES functions are const operations

> >> that logically just perform the permute: the load or store is

> >> encoded as a MEM operand to the call statement.  In contrast,

> >> the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of

> >> interface as IFN_MASK_{LOAD,STORE}, since the memory is only

> >> conditionally accessed.

> >>

> >> The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.

> >>

> >> Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu

> >> and powerpc64le-linux-gnu.  OK to install?

> > 

> > Here's an updated (and much simpler) version that applies on top of the

> > series I just posted to remove vectorizable_mask_load_store.  Tested as

> > before.

> > 

> > Thanks,

> > Richard

> > 

> > 

> > 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

> > 	    Alan Hayward  <alan.hayward@arm.com>

> > 	    David Sherwood  <david.sherwood@arm.com>

> > 

> > gcc/

> > 	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.

> > 	(vec_mask_store_lanes@var{m}@var{n}): Likewise.

> > 	* optabs.def (vec_mask_load_lanes_optab): New optab.

> > 	(vec_mask_store_lanes_optab): Likewise.

> > 	* internal-fn.def (MASK_LOAD_LANES): New internal function.

> > 	(MASK_STORE_LANES): Likewise.

> > 	* internal-fn.c (mask_load_lanes_direct): New macro.

> > 	(mask_store_lanes_direct): Likewise.

> > 	(expand_mask_load_optab_fn): Handle masked operations.

> > 	(expand_mask_load_lanes_optab_fn): New macro.

> > 	(expand_mask_store_optab_fn): Handle masked operations.

> > 	(expand_mask_store_lanes_optab_fn): New macro.

> > 	(direct_mask_load_lanes_optab_supported_p): Likewise.

> > 	(direct_mask_store_lanes_optab_supported_p): Likewise.

> > 	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p

> > 	parameter.

> > 	(vect_load_lanes_supported): Likewise.

> > 	* tree-vect-data-refs.c (strip_conversion): New function.

> > 	(can_group_stmts_p): Likewise.

> > 	(vect_analyze_data_ref_accesses): Use it instead of checking

> > 	for a pair of assignments.

> > 	(vect_store_lanes_supported): Take a masked_p parameter.

> > 	(vect_load_lanes_supported): Likewise.

> > 	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to

> > 	vect_store_lanes_supported and vect_load_lanes_supported.

> > 	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.

> > 	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p

> > 	parameter.  Don't allow gaps for masked accesses.

> > 	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported

> > 	and vect_load_lanes_supported.

> > 	(get_load_store_type): Take a masked_p parameter and update

> > 	call to get_group_load_store_type.

> > 	(vectorizable_store): Update call to get_load_store_type.

> > 	Handle IFN_MASK_STORE_LANES.

> > 	(vectorizable_load): Update call to get_load_store_type.

> > 	Handle IFN_MASK_LOAD_LANES.

> > 

> > gcc/testsuite/

> > 	* gcc.dg/vect/vect-ooo-group-1.c: New test.

> > 	* gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.

> > 	* gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

> > 

> > Index: gcc/doc/md.texi

> > ===================================================================

> > --- gcc/doc/md.texi	2017-11-17 09:06:19.783260344 +0000

> > +++ gcc/doc/md.texi	2017-11-17 09:35:23.400133274 +0000

> > @@ -4855,6 +4855,23 @@ loads for vectors of mode @var{n}.

> >  

> >  This pattern is not allowed to @code{FAIL}.

> >  

> > +@cindex @code{vec_mask_load_lanes@var{m}@var{n}} instruction pattern

> > +@item @samp{vec_mask_load_lanes@var{m}@var{n}}

> > +Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional

> > +mask operand (operand 2) that specifies which elements of the destination

> > +vectors should be loaded.  Other elements of the destination

> > +vectors are set to zero.  The operation is equivalent to:

> > +

> > +@smallexample

> > +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

> > +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

> > +  if (operand2[j])

> > +    for (i = 0; i < c; i++)

> > +      operand0[i][j] = operand1[j * c + i];

> > +@end smallexample

> Don't you need to set operand0[i][j] to zero if operand2[j] is zero for

> this to be correct?  And if that's the case, don't you need to expose

> the set to zero as a side effect?

> 

> 

> 

> > +@cindex @code{vec_mask_store_lanes@var{m}@var{n}} instruction pattern

> > +@item @samp{vec_mask_store_lanes@var{m}@var{n}}

> > +Like @samp{vec_store_lanes@var{m}@var{n}}, but takes an additional

> > +mask operand (operand 2) that specifies which elements of the source

> > +vectors should be stored.  The operation is equivalent to:

> > +

> > +@smallexample

> > +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

> > +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

> > +  if (operand2[j])

> > +    for (i = 0; i < c; i++)

> > +      operand0[j * c + i] = operand1[i][j];

> > +@end smallexample

> > +

> > +This pattern is not allowed to @code{FAIL}.

> Is the asymmetry between loads and stores intentional?  In particular

> for loads "Other elements of the destination vectors are set to zero"

> 

> 

> 

> > Index: gcc/tree-vect-data-refs.c

> > ===================================================================

> > --- gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.085133247 +0000

> > +++ gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.404633274 +0000

> > @@ -2791,6 +2791,62 @@ dr_group_sort_cmp (const void *dra_, con

> >    return cmp;

> >  }

> >  

> > +/* If OP is the result of a conversion, return the unconverted value,

> > +   otherwise return null.  */

> > +

> > +static tree

> > +strip_conversion (tree op)

> > +{

> > +  if (TREE_CODE (op) != SSA_NAME)

> > +    return NULL_TREE;

> > +  gimple *stmt = SSA_NAME_DEF_STMT (op);

> > +  if (!is_gimple_assign (stmt)

> > +      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))

> > +    return NULL_TREE;

> > +  return gimple_assign_rhs1 (stmt);

> > +}

> DO you have any desire to walk back through multiple conversions?

> They're only used for masks when comparing if masks are the same, so it

> probably doesn't matter in practice if we handle multiple conversions I

> guess.

> 

> Somehow I know we've got to have an equivalent of this routine lying

> around somewhere :-)  Though I don't think it's worth the time to find.

> 

> Not an ACK or NAK.  I'm a bit hung up on the doc issue and how it

> potentially impacts how we support this capability.


I have no comment on the midend parts, but when you get the OK for those,
the AArch64 tests in this patch are OK.

Thanks,
James
Richard Sandiford Jan. 12, 2018, 4:28 p.m. UTC | #4
Sorry, just realised this wasn't ACKed.

Jeff Law <law@redhat.com> writes:
> On 11/17/2017 02:36 AM, Richard Sandiford wrote:

>> Richard Sandiford <richard.sandiford@linaro.org> writes:

>>> This patch adds support for vectorising groups of IFN_MASK_LOADs

>>> and IFN_MASK_STOREs using conditional load/store-lanes instructions.

>>> This requires new internal functions to represent the result

>>> (IFN_MASK_{LOAD,STORE}_LANES), as well as associated optabs.

>>>

>>> The normal IFN_{LOAD,STORE}_LANES functions are const operations

>>> that logically just perform the permute: the load or store is

>>> encoded as a MEM operand to the call statement.  In contrast,

>>> the IFN_MASK_{LOAD,STORE}_LANES functions use the same kind of

>>> interface as IFN_MASK_{LOAD,STORE}, since the memory is only

>>> conditionally accessed.

>>>

>>> The AArch64 patterns were added as part of the main LD[234]/ST[234] patch.

>>>

>>> Tested on aarch64-linux-gnu (both with and without SVE), x86_64-linux-gnu

>>> and powerpc64le-linux-gnu.  OK to install?

>> 

>> Here's an updated (and much simpler) version that applies on top of the

>> series I just posted to remove vectorizable_mask_load_store.  Tested as

>> before.

>> 

>> Thanks,

>> Richard

>> 

>> 

>> 2017-11-17  Richard Sandiford  <richard.sandiford@linaro.org>

>> 	    Alan Hayward  <alan.hayward@arm.com>

>> 	    David Sherwood  <david.sherwood@arm.com>

>> 

>> gcc/

>> 	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.

>> 	(vec_mask_store_lanes@var{m}@var{n}): Likewise.

>> 	* optabs.def (vec_mask_load_lanes_optab): New optab.

>> 	(vec_mask_store_lanes_optab): Likewise.

>> 	* internal-fn.def (MASK_LOAD_LANES): New internal function.

>> 	(MASK_STORE_LANES): Likewise.

>> 	* internal-fn.c (mask_load_lanes_direct): New macro.

>> 	(mask_store_lanes_direct): Likewise.

>> 	(expand_mask_load_optab_fn): Handle masked operations.

>> 	(expand_mask_load_lanes_optab_fn): New macro.

>> 	(expand_mask_store_optab_fn): Handle masked operations.

>> 	(expand_mask_store_lanes_optab_fn): New macro.

>> 	(direct_mask_load_lanes_optab_supported_p): Likewise.

>> 	(direct_mask_store_lanes_optab_supported_p): Likewise.

>> 	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p

>> 	parameter.

>> 	(vect_load_lanes_supported): Likewise.

>> 	* tree-vect-data-refs.c (strip_conversion): New function.

>> 	(can_group_stmts_p): Likewise.

>> 	(vect_analyze_data_ref_accesses): Use it instead of checking

>> 	for a pair of assignments.

>> 	(vect_store_lanes_supported): Take a masked_p parameter.

>> 	(vect_load_lanes_supported): Likewise.

>> 	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to

>> 	vect_store_lanes_supported and vect_load_lanes_supported.

>> 	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.

>> 	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p

>> 	parameter.  Don't allow gaps for masked accesses.

>> 	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported

>> 	and vect_load_lanes_supported.

>> 	(get_load_store_type): Take a masked_p parameter and update

>> 	call to get_group_load_store_type.

>> 	(vectorizable_store): Update call to get_load_store_type.

>> 	Handle IFN_MASK_STORE_LANES.

>> 	(vectorizable_load): Update call to get_load_store_type.

>> 	Handle IFN_MASK_LOAD_LANES.

>> 

>> gcc/testsuite/

>> 	* gcc.dg/vect/vect-ooo-group-1.c: New test.

>> 	* gcc.target/aarch64/sve_mask_struct_load_1.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_1_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_2.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_2_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_3.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_3_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_4.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_5.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_6.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_7.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_load_8.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_1.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_1_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_2.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_2_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_3.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_3_run.c: Likewise.

>> 	* gcc.target/aarch64/sve_mask_struct_store_4.c: Likewise.

>> 

>> Index: gcc/doc/md.texi

>> ===================================================================

>> --- gcc/doc/md.texi	2017-11-17 09:06:19.783260344 +0000

>> +++ gcc/doc/md.texi	2017-11-17 09:35:23.400133274 +0000

>> @@ -4855,6 +4855,23 @@ loads for vectors of mode @var{n}.

>>  

>>  This pattern is not allowed to @code{FAIL}.

>>  

>> +@cindex @code{vec_mask_load_lanes@var{m}@var{n}} instruction pattern

>> +@item @samp{vec_mask_load_lanes@var{m}@var{n}}

>> +Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional

>> +mask operand (operand 2) that specifies which elements of the destination

>> +vectors should be loaded.  Other elements of the destination

>> +vectors are set to zero.  The operation is equivalent to:

>> +

>> +@smallexample

>> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

>> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

>> +  if (operand2[j])

>> +    for (i = 0; i < c; i++)

>> +      operand0[i][j] = operand1[j * c + i];

>> +@end smallexample

> Don't you need to set operand0[i][j] to zero if operand2[j] is zero for

> this to be correct?  And if that's the case, don't you need to expose

> the set to zero as a side effect?


Oops, good catch.

>> +@cindex @code{vec_mask_store_lanes@var{m}@var{n}} instruction pattern

>> +@item @samp{vec_mask_store_lanes@var{m}@var{n}}

>> +Like @samp{vec_store_lanes@var{m}@var{n}}, but takes an additional

>> +mask operand (operand 2) that specifies which elements of the source

>> +vectors should be stored.  The operation is equivalent to:

>> +

>> +@smallexample

>> +int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});

>> +for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)

>> +  if (operand2[j])

>> +    for (i = 0; i < c; i++)

>> +      operand0[j * c + i] = operand1[i][j];

>> +@end smallexample

>> +

>> +This pattern is not allowed to @code{FAIL}.

> Is the asymmetry between loads and stores intentional?  In particular

> for loads "Other elements of the destination vectors are set to zero"


Yeah, the stores don't modify memory locations for which operand2[j]
is false.  They help to vectorise stores that are conditionally executed,
rather than unconditional stores that have a conditional rhs.
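
For example, the sort of loop this is aimed at (like the ones in the
mask_struct_store tests) is roughly:

    for (int i = 0; i < n; ++i)
      if (cond[i])
        {
          dest[i * 2] = value;
          dest[i * 2 + 1] = value;
        }

where only the elements selected by cond[i] may be written, as opposed
to something like:

    for (int i = 0; i < n; ++i)
      dest[i] = cond[i] ? src[i] : 0;

where every element of dest is written unconditionally and only the
stored value depends on the condition.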

>> Index: gcc/tree-vect-data-refs.c

>> ===================================================================

>> --- gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.085133247 +0000

>> +++ gcc/tree-vect-data-refs.c	2017-11-17 09:35:23.404633274 +0000

>> @@ -2791,6 +2791,62 @@ dr_group_sort_cmp (const void *dra_, con

>>    return cmp;

>>  }

>>  

>> +/* If OP is the result of a conversion, return the unconverted value,

>> +   otherwise return null.  */

>> +

>> +static tree

>> +strip_conversion (tree op)

>> +{

>> +  if (TREE_CODE (op) != SSA_NAME)

>> +    return NULL_TREE;

>> +  gimple *stmt = SSA_NAME_DEF_STMT (op);

>> +  if (!is_gimple_assign (stmt)

>> +      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))

>> +    return NULL_TREE;

>> +  return gimple_assign_rhs1 (stmt);

>> +}

> DO you have any desire to walk back through multiple conversions?

> They're only used for masks when comparing if masks are the same, so it

> probably doesn't matter in practice if we handle multiple conversions I

> guess.


I think one level is enough, since this is really just there to
cope with conversions introduced by build_mask_conversion in
tree-vect-patterns.c, which always builds a single conversion.
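
Roughly (pseudo-gimple, with made-up names), the case can_group_stmts_p
has to handle is two accesses whose masks are separate casts of the same
value:

    mask_2 = (mask_type_1) mask_1;
    mask_3 = (mask_type_2) mask_1;
    lhs_4 = MASK_LOAD (ptr_5, align, mask_2);
    lhs_6 = MASK_LOAD (ptr_7, align, mask_3);

so stripping a single conversion from each mask is enough to show that
both loads are controlled by the same condition (mask_1).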

> Somehow I know we've got to have an equivalent of this routine lying

> around somewhere :-)  Though I don't think it's worth the time to find.


Yeah, quite possibly :-)

> Not an ACK or NAK.  I'm a bit hung up on the doc issue and how it

> potentially impacts how we support this capability.


Here's the patch with the updated docs.  Does this version look OK?

Thanks,
Richard


2018-01-12  Richard Sandiford  <richard.sandiford@linaro.org>
	    Alan Hayward  <alan.hayward@arm.com>
	    David Sherwood  <david.sherwood@arm.com>

gcc/
	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.
	(vec_mask_store_lanes@var{m}@var{n}): Likewise.
	* optabs.def (vec_mask_load_lanes_optab): New optab.
	(vec_mask_store_lanes_optab): Likewise.
	* internal-fn.def (MASK_LOAD_LANES): New internal function.
	(MASK_STORE_LANES): Likewise.
	* internal-fn.c (mask_load_lanes_direct): New macro.
	(mask_store_lanes_direct): Likewise.
	(expand_mask_load_optab_fn): Handle masked operations.
	(expand_mask_load_lanes_optab_fn): New macro.
	(expand_mask_store_optab_fn): Handle masked operations.
	(expand_mask_store_lanes_optab_fn): New macro.
	(direct_mask_load_lanes_optab_supported_p): Likewise.
	(direct_mask_store_lanes_optab_supported_p): Likewise.
	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
	parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-data-refs.c (strip_conversion): New function.
	(can_group_stmts_p): Likewise.
	(vect_analyze_data_ref_accesses): Use it instead of checking
	for a pair of assignments.
	(vect_store_lanes_supported): Take a masked_p parameter.
	(vect_load_lanes_supported): Likewise.
	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to
	vect_store_lanes_supported and vect_load_lanes_supported.
	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p
	parameter.  Don't allow gaps for masked accesses.
	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported
	and vect_load_lanes_supported.
	(get_load_store_type): Take a masked_p parameter and update
	call to get_group_load_store_type.
	(vectorizable_store): Update call to get_load_store_type.
	Handle IFN_MASK_STORE_LANES.
	(vectorizable_load): Update call to get_load_store_type.
	Handle IFN_MASK_LOAD_LANES.

gcc/testsuite/
	* gcc.dg/vect/vect-ooo-group-1.c: New test.
	* gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_1_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_2_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_3_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_5.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_6.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_7.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_load_8.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_3.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_3_run.c: Likewise.
	* gcc.target/aarch64/sve/mask_struct_store_4.c: Likewise.

Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	2018-01-12 16:05:09.907387289 +0000
+++ gcc/doc/md.texi	2018-01-12 16:19:47.086865940 +0000
@@ -4855,6 +4855,26 @@ loads for vectors of mode @var{n}.
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_mask_load_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_mask_load_lanes@var{m}@var{n}}
+Like @samp{vec_load_lanes@var{m}@var{n}}, but takes an additional
+mask operand (operand 2) that specifies which elements of the destination
+vectors should be loaded.  Other elements of the destination
+vectors are set to zero.  The operation is equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  if (operand2[j])
+    for (i = 0; i < c; i++)
+      operand0[i][j] = operand1[j * c + i];
+  else
+    for (i = 0; i < c; i++)
+      operand0[i][j] = 0;
+@end smallexample
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_store_lanes@var{m}@var{n}} instruction pattern
 @item @samp{vec_store_lanes@var{m}@var{n}}
 Equivalent to @samp{vec_load_lanes@var{m}@var{n}}, with the memory
@@ -4872,6 +4892,22 @@ for a memory operand 0 and register oper
 
 This pattern is not allowed to @code{FAIL}.
 
+@cindex @code{vec_mask_store_lanes@var{m}@var{n}} instruction pattern
+@item @samp{vec_mask_store_lanes@var{m}@var{n}}
+Like @samp{vec_store_lanes@var{m}@var{n}}, but takes an additional
+mask operand (operand 2) that specifies which elements of the source
+vectors should be stored.  The operation is equivalent to:
+
+@smallexample
+int c = GET_MODE_SIZE (@var{m}) / GET_MODE_SIZE (@var{n});
+for (j = 0; j < GET_MODE_NUNITS (@var{n}); j++)
+  if (operand2[j])
+    for (i = 0; i < c; i++)
+      operand0[j * c + i] = operand1[i][j];
+@end smallexample
+
+This pattern is not allowed to @code{FAIL}.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2018-01-09 15:46:34.439449019 +0000
+++ gcc/optabs.def	2018-01-12 16:19:47.086865940 +0000
@@ -80,6 +80,8 @@ OPTAB_CD(ssmsub_widen_optab, "ssmsub$b$a
 OPTAB_CD(usmsub_widen_optab, "usmsub$a$b4")
 OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
+OPTAB_CD(vec_mask_load_lanes_optab, "vec_mask_load_lanes$a$b")
+OPTAB_CD(vec_mask_store_lanes_optab, "vec_mask_store_lanes$a$b")
 OPTAB_CD(vcond_optab, "vcond$a$b")
 OPTAB_CD(vcondu_optab, "vcondu$a$b")
 OPTAB_CD(vcondeq_optab, "vcondeq$a$b")
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2018-01-09 15:46:34.439449019 +0000
+++ gcc/internal-fn.def	2018-01-12 16:19:47.086865940 +0000
@@ -47,9 +47,11 @@ along with GCC; see the file COPYING3.
 
    - mask_load: currently just maskload
    - load_lanes: currently just vec_load_lanes
+   - mask_load_lanes: currently just vec_mask_load_lanes
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
+   - mask_store_lanes: currently just vec_mask_store_lanes
 
    DEF_INTERNAL_SIGNED_OPTAB_FN defines an internal function that
    maps to one of two optabs, depending on the signedness of an input.
@@ -106,9 +108,13 @@ along with GCC; see the file COPYING3.
 
 DEF_INTERNAL_OPTAB_FN (MASK_LOAD, ECF_PURE, maskload, mask_load)
 DEF_INTERNAL_OPTAB_FN (LOAD_LANES, ECF_CONST, vec_load_lanes, load_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_LOAD_LANES, ECF_PURE,
+		       vec_mask_load_lanes, mask_load_lanes)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
+		       vec_mask_store_lanes, mask_store_lanes)
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2018-01-09 15:46:34.439449019 +0000
+++ gcc/internal-fn.c	2018-01-12 16:19:47.086865940 +0000
@@ -82,8 +82,10 @@ #define DEF_INTERNAL_FN(CODE, FLAGS, FNS
 #define not_direct { -2, -2, false }
 #define mask_load_direct { -1, 2, false }
 #define load_lanes_direct { -1, -1, false }
+#define mask_load_lanes_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
+#define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 
@@ -2408,7 +2410,7 @@ expand_LOOP_DIST_ALIAS (internal_fn, gca
   gcc_unreachable ();
 }
 
-/* Expand MASK_LOAD call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2417,6 +2419,7 @@ expand_mask_load_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, target, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   lhs = gimple_call_lhs (stmt);
@@ -2429,6 +2432,12 @@ expand_mask_load_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   rhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_load_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2436,12 +2445,12 @@ expand_mask_load_optab_fn (internal_fn,
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
-/* Expand MASK_STORE call STMT using optab OPTAB.  */
+#define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+
+/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2450,6 +2459,7 @@ expand_mask_store_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, reg, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   rhs = gimple_call_arg (stmt, 3);
@@ -2460,6 +2470,12 @@ expand_mask_store_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   lhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_store_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2467,11 +2483,11 @@ expand_mask_store_optab_fn (internal_fn,
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -2871,8 +2887,10 @@ #define direct_unary_optab_supported_p d
 #define direct_binary_optab_supported_p direct_optab_supported_p
 #define direct_mask_load_optab_supported_p direct_optab_supported_p
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 
 /* Return the optab used by internal function FN.  */
 
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2018-01-12 14:45:51.039434496 +0000
+++ gcc/tree-vectorizer.h	2018-01-12 16:19:47.091865737 +0000
@@ -1293,9 +1293,9 @@ extern tree bump_vector_ptr (tree, gimpl
 			     tree);
 extern tree vect_create_destination_var (tree, tree);
 extern bool vect_grouped_store_supported (tree, unsigned HOST_WIDE_INT);
-extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern bool vect_grouped_load_supported (tree, bool, unsigned HOST_WIDE_INT);
-extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern void vect_permute_store_chain (vec<tree> ,unsigned int, gimple *,
                                     gimple_stmt_iterator *, vec<tree> *);
 extern tree vect_setup_realignment (gimple *, gimple_stmt_iterator *, tree *,
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2018-01-12 16:05:09.983384123 +0000
+++ gcc/tree-vect-data-refs.c	2018-01-12 16:19:47.089865818 +0000
@@ -2780,6 +2780,62 @@ dr_group_sort_cmp (const void *dra_, con
   return cmp;
 }
 
+/* If OP is the result of a conversion, return the unconverted value,
+   otherwise return null.  */
+
+static tree
+strip_conversion (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL_TREE;
+  gimple *stmt = SSA_NAME_DEF_STMT (op);
+  if (!is_gimple_assign (stmt)
+      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
+    return NULL_TREE;
+  return gimple_assign_rhs1 (stmt);
+}
+
+/* Return true if vectorizable_* routines can handle statements STMT1
+   and STMT2 being in a single group.  */
+
+static bool
+can_group_stmts_p (gimple *stmt1, gimple *stmt2)
+{
+  if (gimple_assign_single_p (stmt1))
+    return gimple_assign_single_p (stmt2);
+
+  if (is_gimple_call (stmt1) && gimple_call_internal_p (stmt1))
+    {
+      /* Check for two masked loads or two masked stores.  */
+      if (!is_gimple_call (stmt2) || !gimple_call_internal_p (stmt2))
+	return false;
+      internal_fn ifn = gimple_call_internal_fn (stmt1);
+      if (ifn != IFN_MASK_LOAD && ifn != IFN_MASK_STORE)
+	return false;
+      if (ifn != gimple_call_internal_fn (stmt2))
+	return false;
+
+      /* Check that the masks are the same.  Cope with casts of masks,
+	 like those created by build_mask_conversion.  */
+      tree mask1 = gimple_call_arg (stmt1, 2);
+      tree mask2 = gimple_call_arg (stmt2, 2);
+      if (!operand_equal_p (mask1, mask2, 0))
+	{
+	  mask1 = strip_conversion (mask1);
+	  if (!mask1)
+	    return false;
+	  mask2 = strip_conversion (mask2);
+	  if (!mask2)
+	    return false;
+	  if (!operand_equal_p (mask1, mask2, 0))
+	    return false;
+	}
+      return true;
+    }
+
+  return false;
+}
+
 /* Function vect_analyze_data_ref_accesses.
 
    Analyze the access pattern of all the data references in the loop.
@@ -2846,8 +2902,7 @@ vect_analyze_data_ref_accesses (vec_info
 	      || data_ref_compare_tree (DR_BASE_ADDRESS (dra),
 					DR_BASE_ADDRESS (drb)) != 0
 	      || data_ref_compare_tree (DR_OFFSET (dra), DR_OFFSET (drb)) != 0
-	      || !gimple_assign_single_p (DR_STMT (dra))
-	      || !gimple_assign_single_p (DR_STMT (drb)))
+	      || !can_group_stmts_p (DR_STMT (dra), DR_STMT (drb)))
 	    break;
 
 	  /* Check that the data-refs have the same constant size.  */
@@ -4684,15 +4739,21 @@ vect_grouped_store_supported (tree vecty
 }
 
 
-/* Return TRUE if vec_store_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}store_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			    bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_store_lanes",
-				       vec_store_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_store_lanes",
+					 vec_mask_store_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_store_lanes",
+					 vec_store_lanes_optab,
+					 vectype, count);
 }
 
 
@@ -5283,15 +5344,21 @@ vect_grouped_load_supported (tree vectyp
   return false;
 }
 
-/* Return TRUE if vec_load_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{masked_}load_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			   bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_load_lanes",
-				       vec_load_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_load_lanes",
+					 vec_mask_load_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_load_lanes",
+					 vec_load_lanes_optab,
+					 vectype, count);
 }
 
 /* Function vect_permute_load_chain.
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2018-01-12 14:45:51.039434496 +0000
+++ gcc/tree-vect-loop.c	2018-01-12 16:19:47.089865818 +0000
@@ -2250,7 +2250,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
       vinfo = vinfo_for_stmt (STMT_VINFO_GROUP_FIRST_ELEMENT (vinfo));
       unsigned int size = STMT_VINFO_GROUP_SIZE (vinfo);
       tree vectype = STMT_VINFO_VECTYPE (vinfo);
-      if (! vect_store_lanes_supported (vectype, size)
+      if (! vect_store_lanes_supported (vectype, size, false)
 	  && ! vect_grouped_store_supported (vectype, size))
 	return false;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), j, node)
@@ -2260,7 +2260,7 @@ vect_analyze_loop_2 (loop_vec_info loop_
 	  bool single_element_p = !STMT_VINFO_GROUP_NEXT_ELEMENT (vinfo);
 	  size = STMT_VINFO_GROUP_SIZE (vinfo);
 	  vectype = STMT_VINFO_VECTYPE (vinfo);
-	  if (! vect_load_lanes_supported (vectype, size)
+	  if (! vect_load_lanes_supported (vectype, size, false)
 	      && ! vect_grouped_load_supported (vectype, single_element_p,
 						size))
 	    return false;
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2018-01-09 15:46:34.439449019 +0000
+++ gcc/tree-vect-slp.c	2018-01-12 16:19:47.090865778 +0000
@@ -2189,7 +2189,7 @@ vect_analyze_slp_instance (vec_info *vin
 	 instructions do not generate this SLP instance.  */
       if (is_a <loop_vec_info> (vinfo)
 	  && loads_permuted
-	  && dr && vect_store_lanes_supported (vectype, group_size))
+	  && dr && vect_store_lanes_supported (vectype, group_size, false))
 	{
 	  slp_tree load_node;
 	  FOR_EACH_VEC_ELT (loads, i, load_node)
@@ -2202,7 +2202,7 @@ vect_analyze_slp_instance (vec_info *vin
 	      if (STMT_VINFO_STRIDED_P (stmt_vinfo)
 		  || ! vect_load_lanes_supported
 			(STMT_VINFO_VECTYPE (stmt_vinfo),
-			 GROUP_SIZE (stmt_vinfo)))
+			 GROUP_SIZE (stmt_vinfo), false))
 		break;
 	    }
 	  if (i == loads.length ())
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2018-01-12 14:45:51.040434457 +0000
+++ gcc/tree-vect-stmts.c	2018-01-12 16:19:47.090865778 +0000
@@ -1757,7 +1757,7 @@ vect_get_store_rhs (gimple *stmt)
 
 static bool
 get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
-			   vec_load_store_type vls_type,
+			   bool masked_p, vec_load_store_type vls_type,
 			   vect_memory_access_type *memory_access_type)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
@@ -1778,7 +1778,10 @@ get_group_load_store_type (gimple *stmt,
 
   /* True if we can cope with such overrun by peeling for gaps, so that
      there is at least one final scalar iteration after the vector loop.  */
-  bool can_overrun_p = (vls_type == VLS_LOAD && loop_vinfo && !loop->inner);
+  bool can_overrun_p = (!masked_p
+			&& vls_type == VLS_LOAD
+			&& loop_vinfo
+			&& !loop->inner);
 
   /* There can only be a gap at the end of the group if the stride is
      known at compile time.  */
@@ -1841,6 +1844,7 @@ get_group_load_store_type (gimple *stmt,
 	 and so we are guaranteed to access a non-gap element in the
 	 same B-sized block.  */
       if (would_overrun_p
+	  && !masked_p
 	  && gap < (vect_known_alignment_in_bytes (first_dr)
 		    / vect_get_scalar_dr_size (first_dr)))
 	would_overrun_p = false;
@@ -1857,8 +1861,9 @@ get_group_load_store_type (gimple *stmt,
 	  /* Otherwise try using LOAD/STORE_LANES.  */
 	  if (*memory_access_type == VMAT_ELEMENTWISE
 	      && (vls_type == VLS_LOAD
-		  ? vect_load_lanes_supported (vectype, group_size)
-		  : vect_store_lanes_supported (vectype, group_size)))
+		  ? vect_load_lanes_supported (vectype, group_size, masked_p)
+		  : vect_store_lanes_supported (vectype, group_size,
+						masked_p)))
 	    {
 	      *memory_access_type = VMAT_LOAD_STORE_LANES;
 	      overrun_p = would_overrun_p;
@@ -1884,8 +1889,7 @@ get_group_load_store_type (gimple *stmt,
       gimple *next_stmt = GROUP_NEXT_ELEMENT (stmt_info);
       while (next_stmt)
 	{
-	  gcc_assert (gimple_assign_single_p (next_stmt));
-	  tree op = gimple_assign_rhs1 (next_stmt);
+	  tree op = vect_get_store_rhs (next_stmt);
 	  gimple *def_stmt;
 	  enum vect_def_type dt;
 	  if (!vect_is_simple_use (op, vinfo, &def_stmt, &dt))
@@ -1969,11 +1973,12 @@ get_negative_load_store_type (gimple *st
    or scatters, fill in GS_INFO accordingly.
 
    SLP says whether we're performing SLP rather than loop vectorization.
+   MASKED_P is true if the statement is conditional on a vectorized mask.
    VECTYPE is the vector type that the vectorized statements will use.
    NCOPIES is the number of vector statements that will be needed.  */
 
 static bool
-get_load_store_type (gimple *stmt, tree vectype, bool slp,
+get_load_store_type (gimple *stmt, tree vectype, bool slp, bool masked_p,
 		     vec_load_store_type vls_type, unsigned int ncopies,
 		     vect_memory_access_type *memory_access_type,
 		     gather_scatter_info *gs_info)
@@ -2001,7 +2006,7 @@ get_load_store_type (gimple *stmt, tree
     }
   else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
-      if (!get_group_load_store_type (stmt, vectype, slp, vls_type,
+      if (!get_group_load_store_type (stmt, vectype, slp, masked_p, vls_type,
 				      memory_access_type))
 	return false;
     }
@@ -5762,23 +5767,26 @@ vectorizable_store (gimple *stmt, gimple
     return false;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, mask, vls_type, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
   if (mask)
     {
-      if (memory_access_type != VMAT_CONTIGUOUS)
+      if (memory_access_type == VMAT_CONTIGUOUS)
+	{
+	  if (!VECTOR_MODE_P (vec_mode)
+	      || !can_vec_mask_load_store_p (vec_mode,
+					     TYPE_MODE (mask_vectype), false))
+	    return false;
+	}
+      else if (memory_access_type != VMAT_LOAD_STORE_LANES)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
 			     "unsupported access type for masked store.\n");
 	  return false;
 	}
-      if (!VECTOR_MODE_P (vec_mode)
-	  || !can_vec_mask_load_store_p (vec_mode, TYPE_MODE (mask_vectype),
-					 false))
-	return false;
     }
   else
     {
@@ -6421,12 +6429,27 @@ vectorizable_store (gimple *stmt, gimple
 	      write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
 	    }
 
-	  /* Emit:
-	       MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_STORE_LANES, 1,
-						    vec_array);
-	  gimple_call_set_lhs (call, data_ref);
+	  gcall *call;
+	  if (mask)
+	    {
+	      /* Emit:
+		   MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, VEC_MASK,
+				     VEC_ARRAY).  */
+	      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (vectype));
+	      tree alias_ptr = build_int_cst (ref_type, align);
+	      call = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
+						 dataref_ptr, alias_ptr,
+						 vec_mask, vec_array);
+	    }
+	  else
+	    {
+	      /* Emit:
+		   MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
+	      data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
+	      call = gimple_build_call_internal (IFN_STORE_LANES, 1,
+						 vec_array);
+	      gimple_call_set_lhs (call, data_ref);
+	    }
 	  gimple_call_set_nothrow (call, true);
 	  new_stmt = call;
 	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
@@ -6870,7 +6893,7 @@ vectorizable_load (gimple *stmt, gimple_
     }
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, VLS_LOAD, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, mask, VLS_LOAD, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
@@ -6878,8 +6901,9 @@ vectorizable_load (gimple *stmt, gimple_
     {
       if (memory_access_type == VMAT_CONTIGUOUS)
 	{
-	  if (!VECTOR_MODE_P (TYPE_MODE (vectype))
-	      || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
+	  machine_mode vec_mode = TYPE_MODE (vectype);
+	  if (!VECTOR_MODE_P (vec_mode)
+	      || !can_vec_mask_load_store_p (vec_mode,
 					     TYPE_MODE (mask_vectype), true))
 	    return false;
 	}
@@ -6897,7 +6921,7 @@ vectorizable_load (gimple *stmt, gimple_
 	      return false;
 	    }
 	}
-      else
+      else if (memory_access_type != VMAT_LOAD_STORE_LANES)
 	{
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -7447,11 +7471,25 @@ vectorizable_load (gimple *stmt, gimple_
 
 	  vec_array = create_vector_array (vectype, vec_num);
 
-	  /* Emit:
-	       VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_LOAD_LANES, 1,
-						    data_ref);
+	  gcall *call;
+	  if (mask)
+	    {
+	      /* Emit:
+		   VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR,
+		                                VEC_MASK).  */
+	      unsigned int align = TYPE_ALIGN_UNIT (TREE_TYPE (vectype));
+	      tree alias_ptr = build_int_cst (ref_type, align);
+	      call = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
+						 dataref_ptr, alias_ptr,
+						 vec_mask);
+	    }
+	  else
+	    {
+	      /* Emit:
+		   VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
+	      data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
+	      call = gimple_build_call_internal (IFN_LOAD_LANES, 1, data_ref);
+	    }
 	  gimple_call_set_lhs (call, vec_array);
 	  gimple_call_set_nothrow (call, true);
 	  new_stmt = call;
Index: gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,12 @@
+/* { dg-do compile } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c)
+{
+  for (int i = 0; i < 100; ++i)
+    if (c[i])
+      {
+	a[i * 2] = b[i * 5 + 2];
+	a[i * 2 + 1] = b[i * 5];
+      }
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_1.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_1.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2] + src[i * 2 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_1_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_1_run.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_load_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 2];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 2; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_2 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = in[i * 2] + in[i * 2 + 1];	\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_2.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_2.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,69 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 3]					\
+		   + src[i * 3 + 1]				\
+		   + src[i * 3 + 2]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_2_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_2_run.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,40 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_load_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 3];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 3; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_3 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 3]			\
+			   + in[i * 3 + 1]		\
+			   + in[i * 3 + 2]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_3.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_3.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,70 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 4]					\
+		   + src[i * 4 + 1]				\
+		   + src[i * 4 + 2]				\
+		   + src[i * 4 + 3]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_3_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_3_run.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,41 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_load_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 4];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 4; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_4 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 4]			\
+			   + in[i * 4 + 1]		\
+			   + in[i * 4 + 2]		\
+			   + in[i * 4 + 3]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_4.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_4.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_5.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_5.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,67 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 3];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_6.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_6.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2];					\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_7.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_7.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld3b\t} } } */
+/* { dg-final { scan-assembler-not {\tld3h\t} } } */
+/* { dg-final { scan-assembler-not {\tld3w\t} } } */
+/* { dg-final { scan-assembler-not {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_8.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_load_8.c	2018-01-12 16:19:47.087865899 +0000
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld4d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_1.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_1.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,73 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 2] = value;				\
+	    dest[i * 2 + 1] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_1_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_1_run.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_store_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 2];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 2; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_2 (out, in, mask, 17, N);				\
+    for (int i = 0; i < N * 2; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 2] + 17);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 2] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_2.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_2.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,74 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 3] = value;				\
+	    dest[i * 3 + 1] = value;				\
+	    dest[i * 3 + 2] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_2_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_2_run.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_store_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 3];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 3; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_3 (out, in, mask, 11, N);				\
+    for (int i = 0; i < N * 3; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 3] + 11);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 3] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,75 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, INTYPE bias, int n)	\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	INTYPE value = src[i] + bias;				\
+	if (cond[i])						\
+	  {							\
+	    dest[i * 4] = value;				\
+	    dest[i * 4 + 1] = value;				\
+	    dest[i * 4 + 2] = value;				\
+	    dest[i * 4 + 3] = value;				\
+	  }							\
+      }								\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_3_run.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,38 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#include "mask_struct_store_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 4];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 4; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_4 (out, in, mask, 42, N);				\
+    for (int i = 0; i < N * 4; ++i)				\
+      {								\
+	OUTTYPE if_true = (INTYPE) (in[i / 4] + 42);		\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 4] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_4.c
===================================================================
--- /dev/null	2018-01-12 06:40:27.684409621 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve/mask_struct_store_4.c	2018-01-12 16:19:47.088865859 +0000
@@ -0,0 +1,44 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	if (cond[i] < 8)					\
+	  dest[i * 2] = src[i];					\
+	if (cond[i] > 2)					\
+	  dest[i * 2 + 1] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tst2b\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2h\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2w\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2d\t.z[0-9]} } } */
Jeff Law Jan. 13, 2018, 3:50 p.m. UTC | #5
On 01/12/2018 09:28 AM, Richard Sandiford wrote:
> 
> Here's the patch with the updated docs.  Does this version look OK?
> 
> Thanks,
> Richard
> 
> 
> 2018-01-12  Richard Sandiford  <richard.sandiford@linaro.org>
> 	    Alan Hayward  <alan.hayward@arm.com>
> 	    David Sherwood  <david.sherwood@arm.com>
> 
> gcc/
> 	* doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.
> 	(vec_mask_store_lanes@var{m}@var{n}): Likewise.
> 	* optabs.def (vec_mask_load_lanes_optab): New optab.
> 	(vec_mask_store_lanes_optab): Likewise.
> 	* internal-fn.def (MASK_LOAD_LANES): New internal function.
> 	(MASK_STORE_LANES): Likewise.
> 	* internal-fn.c (mask_load_lanes_direct): New macro.
> 	(mask_store_lanes_direct): Likewise.
> 	(expand_mask_load_optab_fn): Handle masked operations.
> 	(expand_mask_load_lanes_optab_fn): New macro.
> 	(expand_mask_store_optab_fn): Handle masked operations.
> 	(expand_mask_store_lanes_optab_fn): New macro.
> 	(direct_mask_load_lanes_optab_supported_p): Likewise.
> 	(direct_mask_store_lanes_optab_supported_p): Likewise.
> 	* tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
> 	parameter.
> 	(vect_load_lanes_supported): Likewise.
> 	* tree-vect-data-refs.c (strip_conversion): New function.
> 	(can_group_stmts_p): Likewise.
> 	(vect_analyze_data_ref_accesses): Use it instead of checking
> 	for a pair of assignments.
> 	(vect_store_lanes_supported): Take a masked_p parameter.
> 	(vect_load_lanes_supported): Likewise.
> 	* tree-vect-loop.c (vect_analyze_loop_2): Update calls to
> 	vect_store_lanes_supported and vect_load_lanes_supported.
> 	* tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
> 	* tree-vect-stmts.c (get_group_load_store_type): Take a masked_p
> 	parameter.  Don't allow gaps for masked accesses.
> 	Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported
> 	and vect_load_lanes_supported.
> 	(get_load_store_type): Take a masked_p parameter and update
> 	call to get_group_load_store_type.
> 	(vectorizable_store): Update call to get_load_store_type.
> 	Handle IFN_MASK_STORE_LANES.
> 	(vectorizable_load): Update call to get_load_store_type.
> 	Handle IFN_MASK_LOAD_LANES.
> 
> gcc/testsuite/
> 	* gcc.dg/vect/vect-ooo-group-1.c: New test.
> 	* gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_1_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_2_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_3_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_5.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_6.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_7.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_load_8.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_3.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_3_run.c: Likewise.
> 	* gcc.target/aarch64/sve/mask_struct_store_4.c: Likewise.

OK.  I guess in retrospect I should have made the assumption that the
docs were slightly off and reviewed the rest in that light.

Sorry for making this wait.

Jeff

Christophe Lyon Jan. 15, 2018, 9:40 a.m. UTC | #6
On 13 January 2018 at 16:50, Jeff Law <law@redhat.com> wrote:
> On 01/12/2018 09:28 AM, Richard Sandiford wrote:
>>
>> Here's the patch with the updated docs.  Does this version look OK?
>>
>> Thanks,
>> Richard
>>
>>
>> 2018-01-12  Richard Sandiford  <richard.sandiford@linaro.org>
>>           Alan Hayward  <alan.hayward@arm.com>
>>           David Sherwood  <david.sherwood@arm.com>
>>
>> gcc/
>>       * doc/md.texi (vec_mask_load_lanes@var{m}@var{n}): Document.
>>       (vec_mask_store_lanes@var{m}@var{n}): Likewise.
>>       * optabs.def (vec_mask_load_lanes_optab): New optab.
>>       (vec_mask_store_lanes_optab): Likewise.
>>       * internal-fn.def (MASK_LOAD_LANES): New internal function.
>>       (MASK_STORE_LANES): Likewise.
>>       * internal-fn.c (mask_load_lanes_direct): New macro.
>>       (mask_store_lanes_direct): Likewise.
>>       (expand_mask_load_optab_fn): Handle masked operations.
>>       (expand_mask_load_lanes_optab_fn): New macro.
>>       (expand_mask_store_optab_fn): Handle masked operations.
>>       (expand_mask_store_lanes_optab_fn): New macro.
>>       (direct_mask_load_lanes_optab_supported_p): Likewise.
>>       (direct_mask_store_lanes_optab_supported_p): Likewise.
>>       * tree-vectorizer.h (vect_store_lanes_supported): Take a masked_p
>>       parameter.
>>       (vect_load_lanes_supported): Likewise.
>>       * tree-vect-data-refs.c (strip_conversion): New function.
>>       (can_group_stmts_p): Likewise.
>>       (vect_analyze_data_ref_accesses): Use it instead of checking
>>       for a pair of assignments.
>>       (vect_store_lanes_supported): Take a masked_p parameter.
>>       (vect_load_lanes_supported): Likewise.
>>       * tree-vect-loop.c (vect_analyze_loop_2): Update calls to
>>       vect_store_lanes_supported and vect_load_lanes_supported.
>>       * tree-vect-slp.c (vect_analyze_slp_instance): Likewise.
>>       * tree-vect-stmts.c (get_group_load_store_type): Take a masked_p
>>       parameter.  Don't allow gaps for masked accesses.
>>       Use vect_get_store_rhs.  Update calls to vect_store_lanes_supported
>>       and vect_load_lanes_supported.
>>       (get_load_store_type): Take a masked_p parameter and update
>>       call to get_group_load_store_type.
>>       (vectorizable_store): Update call to get_load_store_type.
>>       Handle IFN_MASK_STORE_LANES.
>>       (vectorizable_load): Update call to get_load_store_type.
>>       Handle IFN_MASK_LOAD_LANES.
>>
>> gcc/testsuite/
>>       * gcc.dg/vect/vect-ooo-group-1.c: New test.
>>       * gcc.target/aarch64/sve/mask_struct_load_1.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_1_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_2.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_2_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_3.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_3_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_4.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_5.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_6.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_7.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_load_8.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_1.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_1_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_2.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_2_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_3.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_3_run.c: Likewise.
>>       * gcc.target/aarch64/sve/mask_struct_store_4.c: Likewise.
> OK.  I guess in retrospect I should have made the assumption that the
> docs were slightly off and reviewed the rest in that light.
>
> Sorry for making this wait.
>
>
Hi Richard,

I've noticed that this commit (r256620) causes new failures, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83845

> Jeff
>
>
>

diff mbox series

Patch

Index: gcc/optabs.def
===================================================================
--- gcc/optabs.def	2017-11-08 15:05:55.697852337 +0000
+++ gcc/optabs.def	2017-11-08 16:35:04.763816035 +0000
@@ -80,6 +80,8 @@  OPTAB_CD(ssmsub_widen_optab, "ssmsub$b$a
 OPTAB_CD(usmsub_widen_optab, "usmsub$a$b4")
 OPTAB_CD(vec_load_lanes_optab, "vec_load_lanes$a$b")
 OPTAB_CD(vec_store_lanes_optab, "vec_store_lanes$a$b")
+OPTAB_CD(vec_mask_load_lanes_optab, "vec_mask_load_lanes$a$b")
+OPTAB_CD(vec_mask_store_lanes_optab, "vec_mask_store_lanes$a$b")
 OPTAB_CD(vcond_optab, "vcond$a$b")
 OPTAB_CD(vcondu_optab, "vcondu$a$b")
 OPTAB_CD(vcondeq_optab, "vcondeq$a$b")
Index: gcc/internal-fn.def
===================================================================
--- gcc/internal-fn.def	2017-11-01 08:07:13.340797708 +0000
+++ gcc/internal-fn.def	2017-11-08 16:35:04.763816035 +0000
@@ -45,9 +45,11 @@  along with GCC; see the file COPYING3.
 
    - mask_load: currently just maskload
    - load_lanes: currently just vec_load_lanes
+   - mask_load_lanes: currently just vec_mask_load_lanes
 
    - mask_store: currently just maskstore
    - store_lanes: currently just vec_store_lanes
+   - mask_store_lanes: currently just vec_mask_store_lanes
 
    DEF_INTERNAL_FLT_FN is like DEF_INTERNAL_OPTAB_FN, but in addition,
    the function implements the computational part of a built-in math
@@ -92,9 +94,13 @@  along with GCC; see the file COPYING3.
 
 DEF_INTERNAL_OPTAB_FN (MASK_LOAD, ECF_PURE, maskload, mask_load)
 DEF_INTERNAL_OPTAB_FN (LOAD_LANES, ECF_CONST, vec_load_lanes, load_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_LOAD_LANES, ECF_PURE,
+		       vec_mask_load_lanes, mask_load_lanes)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
+DEF_INTERNAL_OPTAB_FN (MASK_STORE_LANES, 0,
+		       vec_mask_store_lanes, mask_store_lanes)
 
 DEF_INTERNAL_OPTAB_FN (RSQRT, ECF_CONST, rsqrt, unary)
 
Index: gcc/internal-fn.c
===================================================================
--- gcc/internal-fn.c	2017-11-08 15:05:55.618852345 +0000
+++ gcc/internal-fn.c	2017-11-08 16:35:04.763816035 +0000
@@ -79,8 +79,10 @@  #define DEF_INTERNAL_FN(CODE, FLAGS, FNS
 #define not_direct { -2, -2, false }
 #define mask_load_direct { -1, 2, false }
 #define load_lanes_direct { -1, -1, false }
+#define mask_load_lanes_direct { -1, -1, false }
 #define mask_store_direct { 3, 2, false }
 #define store_lanes_direct { 0, 0, false }
+#define mask_store_lanes_direct { 0, 0, false }
 #define unary_direct { 0, 0, true }
 #define binary_direct { 0, 0, true }
 
@@ -2277,7 +2279,7 @@  expand_LOOP_DIST_ALIAS (internal_fn, gca
   gcc_unreachable ();
 }
 
-/* Expand MASK_LOAD call STMT using optab OPTAB.  */
+/* Expand MASK_LOAD{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_load_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2286,6 +2288,7 @@  expand_mask_load_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, target, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   lhs = gimple_call_lhs (stmt);
@@ -2298,6 +2301,12 @@  expand_mask_load_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   rhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_load_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2305,12 +2314,12 @@  expand_mask_load_optab_fn (internal_fn,
   create_output_operand (&ops[0], target, TYPE_MODE (type));
   create_fixed_operand (&ops[1], mem);
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
-/* Expand MASK_STORE call STMT using optab OPTAB.  */
+#define expand_mask_load_lanes_optab_fn expand_mask_load_optab_fn
+
+/* Expand MASK_STORE{,_LANES} call STMT using optab OPTAB.  */
 
 static void
 expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab)
@@ -2319,6 +2328,7 @@  expand_mask_store_optab_fn (internal_fn,
   tree type, lhs, rhs, maskt, ptr;
   rtx mem, reg, mask;
   unsigned align;
+  insn_code icode;
 
   maskt = gimple_call_arg (stmt, 2);
   rhs = gimple_call_arg (stmt, 3);
@@ -2329,6 +2339,12 @@  expand_mask_store_optab_fn (internal_fn,
     type = build_aligned_type (type, align);
   lhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), ptr);
 
+  if (optab == vec_mask_store_lanes_optab)
+    icode = get_multi_vector_move (type, optab);
+  else
+    icode = convert_optab_handler (optab, TYPE_MODE (type),
+				   TYPE_MODE (TREE_TYPE (maskt)));
+
   mem = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
   gcc_assert (MEM_P (mem));
   mask = expand_normal (maskt);
@@ -2336,11 +2352,11 @@  expand_mask_store_optab_fn (internal_fn,
   create_fixed_operand (&ops[0], mem);
   create_input_operand (&ops[1], reg, TYPE_MODE (type));
   create_input_operand (&ops[2], mask, TYPE_MODE (TREE_TYPE (maskt)));
-  expand_insn (convert_optab_handler (optab, TYPE_MODE (type),
-				      TYPE_MODE (TREE_TYPE (maskt))),
-	       3, ops);
+  expand_insn (icode, 3, ops);
 }
 
+#define expand_mask_store_lanes_optab_fn expand_mask_store_optab_fn
+
 static void
 expand_ABNORMAL_DISPATCHER (internal_fn, gcall *)
 {
@@ -2732,8 +2748,10 @@  #define direct_unary_optab_supported_p d
 #define direct_binary_optab_supported_p direct_optab_supported_p
 #define direct_mask_load_optab_supported_p direct_optab_supported_p
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_store_optab_supported_p direct_optab_supported_p
 #define direct_store_lanes_optab_supported_p multi_vector_optab_supported_p
+#define direct_mask_store_lanes_optab_supported_p multi_vector_optab_supported_p
 
 /* Return true if FN is supported for the types in TYPES when the
    optimization type is OPT_TYPE.  The types are those associated with
Index: gcc/tree-vectorizer.h
===================================================================
--- gcc/tree-vectorizer.h	2017-11-08 15:05:33.791822333 +0000
+++ gcc/tree-vectorizer.h	2017-11-08 16:35:04.771159765 +0000
@@ -1284,9 +1284,9 @@  extern tree bump_vector_ptr (tree, gimpl
 			     tree);
 extern tree vect_create_destination_var (tree, tree);
 extern bool vect_grouped_store_supported (tree, unsigned HOST_WIDE_INT);
-extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_store_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern bool vect_grouped_load_supported (tree, bool, unsigned HOST_WIDE_INT);
-extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT);
+extern bool vect_load_lanes_supported (tree, unsigned HOST_WIDE_INT, bool);
 extern void vect_permute_store_chain (vec<tree> ,unsigned int, gimple *,
                                     gimple_stmt_iterator *, vec<tree> *);
 extern tree vect_setup_realignment (gimple *, gimple_stmt_iterator *, tree *,
Index: gcc/tree-vect-data-refs.c
===================================================================
--- gcc/tree-vect-data-refs.c	2017-11-08 15:06:16.087850270 +0000
+++ gcc/tree-vect-data-refs.c	2017-11-08 16:35:04.768405866 +0000
@@ -2791,6 +2791,62 @@  dr_group_sort_cmp (const void *dra_, con
   return cmp;
 }
 
+/* If OP is the result of a conversion, return the unconverted value,
+   otherwise return null.  */
+
+static tree
+strip_conversion (tree op)
+{
+  if (TREE_CODE (op) != SSA_NAME)
+    return NULL_TREE;
+  gimple *stmt = SSA_NAME_DEF_STMT (op);
+  if (!is_gimple_assign (stmt)
+      || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
+    return NULL_TREE;
+  return gimple_assign_rhs1 (stmt);
+}
+
+/* Return true if vectorizable_* routines can handle statements STMT1
+   and STMT2 being in a single group.  */
+
+static bool
+can_group_stmts_p (gimple *stmt1, gimple *stmt2)
+{
+  if (gimple_assign_single_p (stmt1))
+    return gimple_assign_single_p (stmt2);
+
+  if (is_gimple_call (stmt1) && gimple_call_internal_p (stmt1))
+    {
+      /* Check for two masked loads or two masked stores.  */
+      if (!is_gimple_call (stmt2) || !gimple_call_internal_p (stmt2))
+	return false;
+      internal_fn ifn = gimple_call_internal_fn (stmt1);
+      if (ifn != IFN_MASK_LOAD && ifn != IFN_MASK_STORE)
+	return false;
+      if (ifn != gimple_call_internal_fn (stmt2))
+	return false;
+
+      /* Check that the masks are the same.  Cope with casts of masks,
+	 like those created by build_mask_conversion.  */
+      tree mask1 = gimple_call_arg (stmt1, 2);
+      tree mask2 = gimple_call_arg (stmt2, 2);
+      if (!operand_equal_p (mask1, mask2, 0))
+	{
+	  mask1 = strip_conversion (mask1);
+	  if (!mask1)
+	    return false;
+	  mask2 = strip_conversion (mask2);
+	  if (!mask2)
+	    return false;
+	  if (!operand_equal_p (mask1, mask2, 0))
+	    return false;
+	}
+      return true;
+    }
+
+  return false;
+}
+
 /* Function vect_analyze_data_ref_accesses.
 
    Analyze the access pattern of all the data references in the loop.
@@ -2857,8 +2913,7 @@  vect_analyze_data_ref_accesses (vec_info
 	      || data_ref_compare_tree (DR_BASE_ADDRESS (dra),
 					DR_BASE_ADDRESS (drb)) != 0
 	      || data_ref_compare_tree (DR_OFFSET (dra), DR_OFFSET (drb)) != 0
-	      || !gimple_assign_single_p (DR_STMT (dra))
-	      || !gimple_assign_single_p (DR_STMT (drb)))
+	      || !can_group_stmts_p (DR_STMT (dra), DR_STMT (drb)))
 	    break;
 
 	  /* Check that the data-refs have the same constant size.  */
@@ -4662,15 +4717,21 @@  vect_grouped_store_supported (tree vecty
 }
 
 
-/* Return TRUE if vec_store_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}store_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_store_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			    bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_store_lanes",
-				       vec_store_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_store_lanes",
+					 vec_mask_store_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_store_lanes",
+					 vec_store_lanes_optab,
+					 vectype, count);
 }
 
 
@@ -5238,15 +5299,21 @@  vect_grouped_load_supported (tree vectyp
   return false;
 }
 
-/* Return TRUE if vec_load_lanes is available for COUNT vectors of
-   type VECTYPE.  */
+/* Return TRUE if vec_{mask_}load_lanes is available for COUNT vectors of
+   type VECTYPE.  MASKED_P says whether the masked form is needed.  */
 
 bool
-vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count)
+vect_load_lanes_supported (tree vectype, unsigned HOST_WIDE_INT count,
+			   bool masked_p)
 {
-  return vect_lanes_optab_supported_p ("vec_load_lanes",
-				       vec_load_lanes_optab,
-				       vectype, count);
+  if (masked_p)
+    return vect_lanes_optab_supported_p ("vec_mask_load_lanes",
+					 vec_mask_load_lanes_optab,
+					 vectype, count);
+  else
+    return vect_lanes_optab_supported_p ("vec_load_lanes",
+					 vec_load_lanes_optab,
+					 vectype, count);
 }
 
 /* Function vect_permute_load_chain.
Index: gcc/tree-vect-loop.c
===================================================================
--- gcc/tree-vect-loop.c	2017-11-08 15:05:36.349044117 +0000
+++ gcc/tree-vect-loop.c	2017-11-08 16:35:04.770241799 +0000
@@ -2247,7 +2247,7 @@  vect_analyze_loop_2 (loop_vec_info loop_
       vinfo = vinfo_for_stmt (STMT_VINFO_GROUP_FIRST_ELEMENT (vinfo));
       unsigned int size = STMT_VINFO_GROUP_SIZE (vinfo);
       tree vectype = STMT_VINFO_VECTYPE (vinfo);
-      if (! vect_store_lanes_supported (vectype, size)
+      if (! vect_store_lanes_supported (vectype, size, false)
 	  && ! vect_grouped_store_supported (vectype, size))
 	return false;
       FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), j, node)
@@ -2257,7 +2257,7 @@  vect_analyze_loop_2 (loop_vec_info loop_
 	  bool single_element_p = !STMT_VINFO_GROUP_NEXT_ELEMENT (vinfo);
 	  size = STMT_VINFO_GROUP_SIZE (vinfo);
 	  vectype = STMT_VINFO_VECTYPE (vinfo);
-	  if (! vect_load_lanes_supported (vectype, size)
+	  if (! vect_load_lanes_supported (vectype, size, false)
 	      && ! vect_grouped_load_supported (vectype, single_element_p,
 						size))
 	    return false;
Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	2017-11-08 15:05:34.296308263 +0000
+++ gcc/tree-vect-slp.c	2017-11-08 16:35:04.770241799 +0000
@@ -2175,7 +2175,7 @@  vect_analyze_slp_instance (vec_info *vin
 	 instructions do not generate this SLP instance.  */
       if (is_a <loop_vec_info> (vinfo)
 	  && loads_permuted
-	  && dr && vect_store_lanes_supported (vectype, group_size))
+	  && dr && vect_store_lanes_supported (vectype, group_size, false))
 	{
 	  slp_tree load_node;
 	  FOR_EACH_VEC_ELT (loads, i, load_node)
@@ -2188,7 +2188,7 @@  vect_analyze_slp_instance (vec_info *vin
 	      if (STMT_VINFO_STRIDED_P (stmt_vinfo)
 		  || ! vect_load_lanes_supported
 			(STMT_VINFO_VECTYPE (stmt_vinfo),
-			 GROUP_SIZE (stmt_vinfo)))
+			 GROUP_SIZE (stmt_vinfo), false))
 		break;
 	    }
 	  if (i == loads.length ())
Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	2017-11-08 15:05:36.350875282 +0000
+++ gcc/tree-vect-stmts.c	2017-11-08 16:35:04.771159765 +0000
@@ -1700,6 +1700,69 @@  vectorizable_internal_function (combined
 static tree permute_vec_elements (tree, tree, tree, gimple *,
 				  gimple_stmt_iterator *);
 
+/* Replace IFN_MASK_LOAD statement STMT with a dummy assignment, to ensure
+   that it won't be expanded even when there's no following DCE pass.  */
+
+static void
+replace_mask_load (gimple *stmt, gimple_stmt_iterator *gsi)
+{
+  /* If this statement is part of a pattern created by the vectorizer,
+     get the original statement.  */
+  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
+  if (STMT_VINFO_RELATED_STMT (stmt_info))
+    {
+      stmt = STMT_VINFO_RELATED_STMT (stmt_info);
+      stmt_info = vinfo_for_stmt (stmt);
+    }
+
+  gcc_assert (gsi_stmt (*gsi) == stmt);
+  tree lhs = gimple_call_lhs (stmt);
+  tree zero = build_zero_cst (TREE_TYPE (lhs));
+  gimple *new_stmt = gimple_build_assign (lhs, zero);
+  set_vinfo_for_stmt (new_stmt, stmt_info);
+  set_vinfo_for_stmt (stmt, NULL);
+  STMT_VINFO_STMT (stmt_info) = new_stmt;
+
+  /* If STMT was the first statement in a group, redirect all
+     GROUP_FIRST_ELEMENT pointers to the new statement (which has the
+     same stmt_info as the old statement).  */
+  if (GROUP_FIRST_ELEMENT (stmt_info) == stmt)
+    {
+      gimple *group_stmt = new_stmt;
+      do
+	{
+	  GROUP_FIRST_ELEMENT (vinfo_for_stmt (group_stmt)) = new_stmt;
+	  group_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt));
+	}
+      while (group_stmt);
+    }
+  else if (GROUP_FIRST_ELEMENT (stmt_info))
+    {
+      /* Otherwise redirect the GROUP_NEXT_ELEMENT.  It would be more
+	 efficient if these pointers were to the stmt_vec_info rather
+	 than the gimple statements themselves, but this is by no means
+	 the only quadratic loop for groups.  */
+      gimple *group_stmt = GROUP_FIRST_ELEMENT (stmt_info);
+      while (GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt)) != stmt)
+	group_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt));
+      GROUP_NEXT_ELEMENT (vinfo_for_stmt (group_stmt)) = new_stmt;
+    }
+  gsi_replace (gsi, new_stmt, true);
+}
+
+/* STMT is either a masked or unconditional store.  Return the value
+   being stored.  */
+
+static tree
+get_store_op (gimple *stmt)
+{
+  if (gimple_assign_single_p (stmt))
+    return gimple_assign_rhs1 (stmt);
+  if (gimple_call_internal_p (stmt, IFN_MASK_STORE))
+    return gimple_call_arg (stmt, 3);
+  gcc_unreachable ();
+}
+
 /* STMT is a non-strided load or store, meaning that it accesses
    elements with a known constant step.  Return -1 if that step
    is negative, 0 if it is zero, and 1 if it is greater than zero.  */
@@ -1744,7 +1807,7 @@  perm_mask_for_reverse (tree vectype)
 
 static bool
 get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
-			   vec_load_store_type vls_type,
+			   bool masked_p, vec_load_store_type vls_type,
 			   vect_memory_access_type *memory_access_type)
 {
   stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
@@ -1765,7 +1828,10 @@  get_group_load_store_type (gimple *stmt,
 
   /* True if we can cope with such overrun by peeling for gaps, so that
      there is at least one final scalar iteration after the vector loop.  */
-  bool can_overrun_p = (vls_type == VLS_LOAD && loop_vinfo && !loop->inner);
+  bool can_overrun_p = (!masked_p
+			&& vls_type == VLS_LOAD
+			&& loop_vinfo
+			&& !loop->inner);
 
   /* There can only be a gap at the end of the group if the stride is
      known at compile time.  */
@@ -1828,6 +1894,7 @@  get_group_load_store_type (gimple *stmt,
 	 and so we are guaranteed to access a non-gap element in the
 	 same B-sized block.  */
       if (would_overrun_p
+	  && !masked_p
 	  && gap < (vect_known_alignment_in_bytes (first_dr)
 		    / vect_get_scalar_dr_size (first_dr)))
 	would_overrun_p = false;
@@ -1838,8 +1905,8 @@  get_group_load_store_type (gimple *stmt,
 	{
 	  /* First try using LOAD/STORE_LANES.  */
 	  if (vls_type == VLS_LOAD
-	      ? vect_load_lanes_supported (vectype, group_size)
-	      : vect_store_lanes_supported (vectype, group_size))
+	      ? vect_load_lanes_supported (vectype, group_size, masked_p)
+	      : vect_store_lanes_supported (vectype, group_size, masked_p))
 	    {
 	      *memory_access_type = VMAT_LOAD_STORE_LANES;
 	      overrun_p = would_overrun_p;
@@ -1865,8 +1932,7 @@  get_group_load_store_type (gimple *stmt,
       gimple *next_stmt = GROUP_NEXT_ELEMENT (stmt_info);
       while (next_stmt)
 	{
-	  gcc_assert (gimple_assign_single_p (next_stmt));
-	  tree op = gimple_assign_rhs1 (next_stmt);
+	  tree op = get_store_op (next_stmt);
 	  gimple *def_stmt;
 	  enum vect_def_type dt;
 	  if (!vect_is_simple_use (op, vinfo, &def_stmt, &dt))
@@ -1950,11 +2016,12 @@  get_negative_load_store_type (gimple *st
    or scatters, fill in GS_INFO accordingly.
 
    SLP says whether we're performing SLP rather than loop vectorization.
+   MASKED_P is true if the statement is conditional on a vectorized mask.
    VECTYPE is the vector type that the vectorized statements will use.
    NCOPIES is the number of vector statements that will be needed.  */
 
 static bool
-get_load_store_type (gimple *stmt, tree vectype, bool slp,
+get_load_store_type (gimple *stmt, tree vectype, bool slp, bool masked_p,
 		     vec_load_store_type vls_type, unsigned int ncopies,
 		     vect_memory_access_type *memory_access_type,
 		     gather_scatter_info *gs_info)
@@ -1982,7 +2049,7 @@  get_load_store_type (gimple *stmt, tree
     }
   else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
     {
-      if (!get_group_load_store_type (stmt, vectype, slp, vls_type,
+      if (!get_group_load_store_type (stmt, vectype, slp, masked_p, vls_type,
 				      memory_access_type))
 	return false;
     }
@@ -2031,6 +2098,174 @@  get_load_store_type (gimple *stmt, tree
   return true;
 }
 
+/* Set up the stored values for the first copy of a vectorized store.
+   GROUP_SIZE is the number of stores in the group (which is 1 for
+   ungrouped stores).  FIRST_STMT is the first statement in the group.
+
+   On return, initialize OPERANDS to a new vector in which element I
+   is the value that the first copy of group member I should store.
+   The caller should free OPERANDS after use.  */
+
+static void
+init_stored_values (unsigned int group_size, gimple *first_stmt,
+		    vec<tree> *operands)
+{
+  operands->create (group_size);
+  gimple *next_stmt = first_stmt;
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      /* Since gaps are not supported for interleaved stores,
+	 GROUP_SIZE is the exact number of stmts in the chain.
+	 Therefore, NEXT_STMT can't be null.  In case there is
+	 no interleaving, GROUP_SIZE is 1, and only one
+	 iteration of the loop will be executed.  */
+      gcc_assert (next_stmt);
+      tree op = get_store_op (next_stmt);
+      tree vec_op = vect_get_vec_def_for_operand (op, next_stmt);
+      operands->quick_push (vec_op);
+      next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
+    }
+}
+
+/* OPERANDS is a vector set up by init_stored_values.  Update each element
+   for the next copy of each statement.  GROUP_SIZE and FIRST_STMT are
+   as for init_stored_values.  */
+
+static void
+advance_stored_values (unsigned int group_size, gimple *first_stmt,
+		       vec<tree> operands)
+{
+  vec_info *vinfo = vinfo_for_stmt (first_stmt)->vinfo;
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      tree op = operands[i];
+      enum vect_def_type dt;
+      gimple *def_stmt;
+      vect_is_simple_use (op, vinfo, &def_stmt, &dt);
+      operands[i] = vect_get_vec_def_for_stmt_copy (dt, op);
+    }
+}
+
+/* Emit one copy of a vectorized LOAD_LANES for STMT.  GROUP_SIZE is
+   the number of vectors being loaded and VECTYPE is the type of each
+   vector.  AGGR_TYPE is the type that should be used to refer to the
+   memory source (which contains the same number of elements as
+   GROUP_SIZE copies of VECTYPE, but in a different order).
+   DATAREF_PTR points to the first element that should be loaded.
+   ALIAS_PTR_TYPE is the type of the accessed elements for aliasing
+   purposes.  MASK, if nonnull, is a mask in which element I is true
+   if element I of each destination vector should be loaded.  */
+
+static void
+do_load_lanes (gimple *stmt, gimple_stmt_iterator *gsi,
+	       unsigned int group_size, tree vectype, tree aggr_type,
+	       tree dataref_ptr, tree alias_ptr_type, tree mask)
+{
+  tree scalar_dest = gimple_get_lhs (stmt);
+  tree vec_array = create_vector_array (vectype, group_size);
+
+  gcall *new_stmt;
+  if (mask)
+    {
+      /* Emit: VEC_ARRAY = MASK_LOAD_LANES (DATAREF_PTR, ALIAS_PTR, MASK).  */
+      tree alias_ptr = build_int_cst (alias_ptr_type,
+				      TYPE_ALIGN_UNIT (TREE_TYPE (vectype)));
+      new_stmt = gimple_build_call_internal (IFN_MASK_LOAD_LANES, 3,
+					     dataref_ptr, alias_ptr, mask);
+    }
+  else
+    {
+      /* Emit: VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
+      tree data_ref = create_array_ref (aggr_type, dataref_ptr,
+					alias_ptr_type);
+      new_stmt = gimple_build_call_internal (IFN_LOAD_LANES, 1, data_ref);
+    }
+  gimple_call_set_lhs (new_stmt, vec_array);
+  gimple_call_set_nothrow (new_stmt, true);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+
+  /* Extract each vector into an SSA_NAME.  */
+  auto_vec<tree, 16> dr_chain;
+  dr_chain.reserve (group_size);
+  for (unsigned int i = 0; i < group_size; i++)
+    {
+      tree new_temp = read_vector_array (stmt, gsi, scalar_dest, vec_array, i);
+      dr_chain.quick_push (new_temp);
+    }
+
+  /* Record the mapping between SSA_NAMEs and statements.  */
+  vect_record_grouped_load_vectors (stmt, dr_chain);
+}
+
+/* Emit one copy of a vectorized STORE_LANES for STMT.  GROUP_SIZE is
+   the number of vectors being stored and OPERANDS[I] is the value
+   that group member I should store.  AGGR_TYPE is the type that should
+   be used to refer to the memory destination (which contains the same
+   number of elements as the source vectors, but in a different order).
+   DATAREF_PTR points to the first store location.  ALIAS_PTR_TYPE is
+   the type of the accessed elements for aliasing purposes.  MASK,
+   if nonnull, is a mask in which element I is true if element I of
+   each source vector should be stored.  */
+
+static gimple *
+do_store_lanes (gimple *stmt, gimple_stmt_iterator *gsi,
+		unsigned int group_size, tree aggr_type, tree dataref_ptr,
+		tree alias_ptr_type, vec<tree> operands, tree mask)
+{
+  /* Combine all the vectors into an array.  */
+  tree vectype = TREE_TYPE (operands[0]);
+  tree vec_array = create_vector_array (vectype, group_size);
+  for (unsigned int i = 0; i < group_size; i++)
+    write_vector_array (stmt, gsi, operands[i], vec_array, i);
+
+  gcall *new_stmt;
+  if (mask)
+    {
+      /* Emit: MASK_STORE_LANES (DATAREF_PTR, ALIAS_PTR, MASK, VEC_ARRAY).  */
+      tree alias_ptr = build_int_cst (alias_ptr_type,
+				      TYPE_ALIGN_UNIT (TREE_TYPE (vectype)));
+      new_stmt = gimple_build_call_internal (IFN_MASK_STORE_LANES, 4,
+					     dataref_ptr, alias_ptr,
+					     mask, vec_array);
+    }
+  else
+    {
+      /* Emit: MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
+      tree data_ref = create_array_ref (aggr_type, dataref_ptr, alias_ptr_type);
+      new_stmt = gimple_build_call_internal (IFN_STORE_LANES, 1, vec_array);
+      gimple_call_set_lhs (new_stmt, data_ref);
+    }
+  gimple_call_set_nothrow (new_stmt, true);
+  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+  return new_stmt;
+}
+
+/* Return the alias pointer type for the group of masked loads or
+   stores starting at FIRST_STMT.  */
+
+static tree
+get_masked_group_alias_ptr_type (gimple *first_stmt)
+{
+  tree type, next_type;
+  gimple *next_stmt;
+
+  type = TREE_TYPE (gimple_call_arg (first_stmt, 1));
+  next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (first_stmt));
+  while (next_stmt)
+    {
+      next_type = TREE_TYPE (gimple_call_arg (next_stmt, 1));
+      if (get_alias_set (type) != get_alias_set (next_type))
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "conflicting alias set types.\n");
+	  return ptr_type_node;
+	}
+      next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
+    }
+  return type;
+}
+
 /* Function vectorizable_mask_load_store.
 
    Check if STMT performs a conditional load or store that can be vectorized.
@@ -2053,6 +2288,7 @@  vectorizable_mask_load_store (gimple *st
   tree rhs_vectype = NULL_TREE;
   tree mask_vectype;
   tree elem_type;
+  tree aggr_type;
   gimple *new_stmt;
   tree dummy;
   tree dataref_ptr = NULL_TREE;
@@ -2066,6 +2302,8 @@  vectorizable_mask_load_store (gimple *st
   tree mask;
   gimple *def_stmt;
   enum vect_def_type dt;
+  gimple *first_stmt = stmt;
+  unsigned int group_size = 1;
 
   if (slp_node != NULL)
     return false;
@@ -2127,7 +2365,7 @@  vectorizable_mask_load_store (gimple *st
     vls_type = VLS_LOAD;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, false, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, false, true, vls_type, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
@@ -2144,7 +2382,18 @@  vectorizable_mask_load_store (gimple *st
 	  return false;
 	}
     }
-  else if (memory_access_type != VMAT_CONTIGUOUS)
+  else if (rhs_vectype
+	   && !useless_type_conversion_p (vectype, rhs_vectype))
+    return false;
+  else if (memory_access_type == VMAT_CONTIGUOUS)
+    {
+      if (!VECTOR_MODE_P (TYPE_MODE (vectype))
+	  || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
+					 TYPE_MODE (mask_vectype),
+					 vls_type == VLS_LOAD))
+	return false;
+    }
+  else if (memory_access_type != VMAT_LOAD_STORE_LANES)
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -2152,13 +2401,6 @@  vectorizable_mask_load_store (gimple *st
 			 vls_type == VLS_LOAD ? "load" : "store");
       return false;
     }
-  else if (!VECTOR_MODE_P (TYPE_MODE (vectype))
-	   || !can_vec_mask_load_store_p (TYPE_MODE (vectype),
-					  TYPE_MODE (mask_vectype),
-					  vls_type == VLS_LOAD)
-	   || (rhs_vectype
-	       && !useless_type_conversion_p (vectype, rhs_vectype)))
-    return false;
 
   if (!vec_stmt) /* transformation not required.  */
     {
@@ -2176,6 +2418,14 @@  vectorizable_mask_load_store (gimple *st
 
   /* Transform.  */
 
+  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+    {
+      first_stmt = GROUP_FIRST_ELEMENT (stmt_info);
+      group_size = GROUP_SIZE (vinfo_for_stmt (first_stmt));
+      if (vls_type != VLS_LOAD)
+	GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))++;
+    }
+
   if (memory_access_type == VMAT_GATHER_SCATTER)
     {
       tree vec_oprnd0 = NULL_TREE, op;
@@ -2343,23 +2593,28 @@  vectorizable_mask_load_store (gimple *st
 	  prev_stmt_info = vinfo_for_stmt (new_stmt);
 	}
 
-      /* Ensure that even with -fno-tree-dce the scalar MASK_LOAD is removed
-	 from the IL.  */
-      if (STMT_VINFO_RELATED_STMT (stmt_info))
-	{
-	  stmt = STMT_VINFO_RELATED_STMT (stmt_info);
-	  stmt_info = vinfo_for_stmt (stmt);
-	}
-      tree lhs = gimple_call_lhs (stmt);
-      new_stmt = gimple_build_assign (lhs, build_zero_cst (TREE_TYPE (lhs)));
-      set_vinfo_for_stmt (new_stmt, stmt_info);
-      set_vinfo_for_stmt (stmt, NULL);
-      STMT_VINFO_STMT (stmt_info) = new_stmt;
-      gsi_replace (gsi, new_stmt, true);
+      replace_mask_load (stmt, gsi);
       return true;
     }
-  else if (vls_type != VLS_LOAD)
+
+  if (memory_access_type == VMAT_LOAD_STORE_LANES)
+    aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
+  else
+    aggr_type = vectype;
+
+  if (vls_type != VLS_LOAD)
     {
+      /* Vectorize the whole group when we reach the final statement.
+	 Replace all other statements with an empty sequence.  */
+      if (STMT_VINFO_GROUPED_ACCESS (stmt_info)
+	  && (GROUP_STORE_COUNT (vinfo_for_stmt (first_stmt))
+	      < GROUP_SIZE (vinfo_for_stmt (first_stmt))))
+	{
+	  *vec_stmt = NULL;
+	  return true;
+	}
+
+      auto_vec<tree, 16> operands;
       tree vec_rhs = NULL_TREE, vec_mask = NULL_TREE;
       prev_stmt_info = NULL;
       LOOP_VINFO_HAS_MASK_STORE (loop_vinfo) = true;
@@ -2369,48 +2624,62 @@  vectorizable_mask_load_store (gimple *st
 
 	  if (i == 0)
 	    {
-	      tree rhs = gimple_call_arg (stmt, 3);
-	      vec_rhs = vect_get_vec_def_for_operand (rhs, stmt);
+	      init_stored_values (group_size, first_stmt, &operands);
+	      vec_rhs = operands[0];
 	      vec_mask = vect_get_vec_def_for_operand (mask, stmt,
 						       mask_vectype);
-	      /* We should have catched mismatched types earlier.  */
+	      /* We should have caught mismatched types earlier.  */
 	      gcc_assert (useless_type_conversion_p (vectype,
 						     TREE_TYPE (vec_rhs)));
-	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
-						      NULL_TREE, &dummy, gsi,
-						      &ptr_incr, false, &inv_p);
+	      dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type,
+						      NULL, NULL_TREE, &dummy,
+						      gsi, &ptr_incr, false,
+						      &inv_p);
 	      gcc_assert (!inv_p);
 	    }
 	  else
 	    {
-	      vect_is_simple_use (vec_rhs, loop_vinfo, &def_stmt, &dt);
-	      vec_rhs = vect_get_vec_def_for_stmt_copy (dt, vec_rhs);
+	      advance_stored_values (group_size, first_stmt, operands);
+	      vec_rhs = operands[0];
 	      vect_is_simple_use (vec_mask, loop_vinfo, &def_stmt, &dt);
 	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
-	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
-					     TYPE_SIZE_UNIT (vectype));
+	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr,
+					     gsi, first_stmt,
+					     TYPE_SIZE_UNIT (aggr_type));
 	    }
 
-	  align = DR_TARGET_ALIGNMENT (dr);
-	  if (aligned_access_p (dr))
-	    misalign = 0;
-	  else if (DR_MISALIGNMENT (dr) == -1)
+	  if (memory_access_type == VMAT_LOAD_STORE_LANES)
 	    {
-	      align = TYPE_ALIGN_UNIT (elem_type);
-	      misalign = 0;
+	      tree ref_type = get_masked_group_alias_ptr_type (first_stmt);
+	      new_stmt = do_store_lanes (stmt, gsi, group_size, aggr_type,
+					 dataref_ptr, ref_type, operands,
+					 vec_mask);
 	    }
 	  else
-	    misalign = DR_MISALIGNMENT (dr);
-	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
-				  misalign);
-	  tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
-				    misalign ? least_bit_hwi (misalign) : align);
-	  gcall *call
-	    = gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
-					  ptr, vec_mask, vec_rhs);
-	  gimple_call_set_nothrow (call, true);
-	  new_stmt = call;
-	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	    {
+	      align = DR_TARGET_ALIGNMENT (dr);
+	      if (aligned_access_p (dr))
+		misalign = 0;
+	      else if (DR_MISALIGNMENT (dr) == -1)
+		{
+		  align = TYPE_ALIGN_UNIT (elem_type);
+		  misalign = 0;
+		}
+	      else
+		misalign = DR_MISALIGNMENT (dr);
+	      set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+				      misalign);
+	      tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
+					misalign
+					? least_bit_hwi (misalign)
+					: align);
+	      gcall *call
+		= gimple_build_call_internal (IFN_MASK_STORE, 4, dataref_ptr,
+					      ptr, vec_mask, vec_rhs);
+	      gimple_call_set_nothrow (call, true);
+	      new_stmt = call;
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	    }
 	  if (i == 0)
 	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
 	  else
@@ -2420,73 +2689,88 @@  vectorizable_mask_load_store (gimple *st
     }
   else
     {
+      /* Vectorize the whole group when we reach the first statement.
+	 For later statements we just need to return the cached
+	 replacement.  */
+      if (group_size > 1
+	  && STMT_VINFO_VEC_STMT (vinfo_for_stmt (first_stmt)))
+	{
+	  *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
+	  replace_mask_load (stmt, gsi);
+	  return true;
+	}
+
       tree vec_mask = NULL_TREE;
       prev_stmt_info = NULL;
-      vec_dest = vect_create_destination_var (gimple_call_lhs (stmt), vectype);
+      if (memory_access_type == VMAT_LOAD_STORE_LANES)
+	vec_dest = NULL_TREE;
+      else
+	vec_dest = vect_create_destination_var (gimple_call_lhs (stmt),
+						vectype);
       for (i = 0; i < ncopies; i++)
 	{
 	  unsigned align, misalign;
 
 	  if (i == 0)
 	    {
+	      gcc_assert (mask == gimple_call_arg (first_stmt, 2));
 	      vec_mask = vect_get_vec_def_for_operand (mask, stmt,
 						       mask_vectype);
-	      dataref_ptr = vect_create_data_ref_ptr (stmt, vectype, NULL,
-						      NULL_TREE, &dummy, gsi,
-						      &ptr_incr, false, &inv_p);
+	      dataref_ptr = vect_create_data_ref_ptr (first_stmt, aggr_type,
+						      NULL, NULL_TREE, &dummy,
+						      gsi, &ptr_incr, false,
+						      &inv_p);
 	      gcc_assert (!inv_p);
 	    }
 	  else
 	    {
 	      vect_is_simple_use (vec_mask, loop_vinfo, &def_stmt, &dt);
 	      vec_mask = vect_get_vec_def_for_stmt_copy (dt, vec_mask);
-	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr, gsi, stmt,
-					     TYPE_SIZE_UNIT (vectype));
+	      dataref_ptr = bump_vector_ptr (dataref_ptr, ptr_incr,
+					     gsi, first_stmt,
+					     TYPE_SIZE_UNIT (aggr_type));
 	    }
 
-	  align = DR_TARGET_ALIGNMENT (dr);
-	  if (aligned_access_p (dr))
-	    misalign = 0;
-	  else if (DR_MISALIGNMENT (dr) == -1)
+	  if (memory_access_type == VMAT_LOAD_STORE_LANES)
 	    {
-	      align = TYPE_ALIGN_UNIT (elem_type);
-	      misalign = 0;
+	      tree ref_type = get_masked_group_alias_ptr_type (first_stmt);
+	      do_load_lanes (stmt, gsi, group_size, vectype,
+			     aggr_type, dataref_ptr, ref_type, vec_mask);
+	      *vec_stmt = STMT_VINFO_VEC_STMT (stmt_info);
 	    }
 	  else
-	    misalign = DR_MISALIGNMENT (dr);
-	  set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
-				  misalign);
-	  tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
-				    misalign ? least_bit_hwi (misalign) : align);
-	  gcall *call
-	    = gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
-					  ptr, vec_mask);
-	  gimple_call_set_lhs (call, make_ssa_name (vec_dest));
-	  gimple_call_set_nothrow (call, true);
-	  vect_finish_stmt_generation (stmt, call, gsi);
-	  if (i == 0)
-	    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = call;
-	  else
-	    STMT_VINFO_RELATED_STMT (prev_stmt_info) = call;
-	  prev_stmt_info = vinfo_for_stmt (call);
+	    {
+	      align = DR_TARGET_ALIGNMENT (dr);
+	      if (aligned_access_p (dr))
+		misalign = 0;
+	      else if (DR_MISALIGNMENT (dr) == -1)
+		{
+		  align = TYPE_ALIGN_UNIT (elem_type);
+		  misalign = 0;
+		}
+	      else
+		misalign = DR_MISALIGNMENT (dr);
+	      set_ptr_info_alignment (get_ptr_info (dataref_ptr), align,
+				      misalign);
+	      tree ptr = build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)),
+					misalign
+					? least_bit_hwi (misalign)
+					: align);
+	      gcall *call
+		= gimple_build_call_internal (IFN_MASK_LOAD, 3, dataref_ptr,
+					      ptr, vec_mask);
+	      gimple_call_set_lhs (call, make_ssa_name (vec_dest));
+	      gimple_call_set_nothrow (call, true);
+	      vect_finish_stmt_generation (stmt, call, gsi);
+	      if (i == 0)
+		STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = call;
+	      else
+		STMT_VINFO_RELATED_STMT (prev_stmt_info) = call;
+	      prev_stmt_info = vinfo_for_stmt (call);
+	    }
 	}
-    }
 
-  if (vls_type == VLS_LOAD)
-    {
-      /* Ensure that even with -fno-tree-dce the scalar MASK_LOAD is removed
-	 from the IL.  */
-      if (STMT_VINFO_RELATED_STMT (stmt_info))
-	{
-	  stmt = STMT_VINFO_RELATED_STMT (stmt_info);
-	  stmt_info = vinfo_for_stmt (stmt);
-	}
-      tree lhs = gimple_call_lhs (stmt);
-      new_stmt = gimple_build_assign (lhs, build_zero_cst (TREE_TYPE (lhs)));
-      set_vinfo_for_stmt (new_stmt, stmt_info);
-      set_vinfo_for_stmt (stmt, NULL);
-      STMT_VINFO_STMT (stmt_info) = new_stmt;
-      gsi_replace (gsi, new_stmt, true);
+      replace_mask_load (stmt, gsi);
     }
 
   return true;
@@ -5818,7 +6102,7 @@  vectorizable_store (gimple *stmt, gimple
     return false;
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, vls_type, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, false, vls_type, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
@@ -6353,34 +6637,21 @@  vectorizable_store (gimple *stmt, gimple
               vec_oprnd = vec_oprnds[0];
             }
           else
-            {
-	      /* For interleaved stores we collect vectorized defs for all the
-		 stores in the group in DR_CHAIN and OPRNDS. DR_CHAIN is then
-		 used as an input to vect_permute_store_chain(), and OPRNDS as
-		 an input to vect_get_vec_def_for_stmt_copy() for the next copy.
-
-		 If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN and
-		 OPRNDS are of size 1.  */
-	      next_stmt = first_stmt;
-	      for (i = 0; i < group_size; i++)
-		{
-		  /* Since gaps are not supported for interleaved stores,
-		     GROUP_SIZE is the exact number of stmts in the chain.
-		     Therefore, NEXT_STMT can't be NULL_TREE.  In case that
-		     there is no interleaving, GROUP_SIZE is 1, and only one
-		     iteration of the loop will be executed.  */
-		  gcc_assert (next_stmt
-			      && gimple_assign_single_p (next_stmt));
-		  op = gimple_assign_rhs1 (next_stmt);
-
-		  vec_oprnd = vect_get_vec_def_for_operand (op, next_stmt);
-		  dr_chain.quick_push (vec_oprnd);
-		  oprnds.quick_push (vec_oprnd);
-		  next_stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (next_stmt));
-		}
+	    {
+	      /* For interleaved stores we collect vectorized defs
+		 for all the stores in the group in DR_CHAIN and OPRNDS.
+		 DR_CHAIN is then used as an input to
+		 vect_permute_store_chain(), and OPRNDS as an input to
+		 vect_get_vec_def_for_stmt_copy() for the next copy.
+
+		 If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN
+		 and OPRNDS are of size 1.  */
+	      init_stored_values (group_size, first_stmt, &oprnds);
+	      dr_chain.safe_splice (oprnds);
+	      vec_oprnd = oprnds[0];
 	    }
 
-	  /* We should have catched mismatched types earlier.  */
+	  /* We should have caught mismatched types earlier.  */
 	  gcc_assert (useless_type_conversion_p (vectype,
 						 TREE_TYPE (vec_oprnd)));
 	  bool simd_lane_access_p
@@ -6414,14 +6685,10 @@  vectorizable_store (gimple *stmt, gimple
 	     next copy.
 	     If the store is not grouped, GROUP_SIZE is 1, and DR_CHAIN and
 	     OPRNDS are of size 1.  */
-	  for (i = 0; i < group_size; i++)
-	    {
-	      op = oprnds[i];
-	      vect_is_simple_use (op, vinfo, &def_stmt, &dt);
-	      vec_oprnd = vect_get_vec_def_for_stmt_copy (dt, op);
-	      dr_chain[i] = vec_oprnd;
-	      oprnds[i] = vec_oprnd;
-	    }
+	  advance_stored_values (group_size, first_stmt, oprnds);
+	  dr_chain.truncate (0);
+	  dr_chain.splice (oprnds);
+	  vec_oprnd = oprnds[0];
 	  if (dataref_offset)
 	    dataref_offset
 	      = int_const_binop (PLUS_EXPR, dataref_offset,
@@ -6432,27 +6699,8 @@  vectorizable_store (gimple *stmt, gimple
 	}
 
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	{
-	  tree vec_array;
-
-	  /* Combine all the vectors into an array.  */
-	  vec_array = create_vector_array (vectype, vec_num);
-	  for (i = 0; i < vec_num; i++)
-	    {
-	      vec_oprnd = dr_chain[i];
-	      write_vector_array (stmt, gsi, vec_oprnd, vec_array, i);
-	    }
-
-	  /* Emit:
-	       MEM_REF[...all elements...] = STORE_LANES (VEC_ARRAY).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_STORE_LANES, 1,
-						    vec_array);
-	  gimple_call_set_lhs (call, data_ref);
-	  gimple_call_set_nothrow (call, true);
-	  new_stmt = call;
-	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
-	}
+	new_stmt = do_store_lanes (stmt, gsi, vec_num, aggr_type,
+				   dataref_ptr, ref_type, dr_chain, NULL_TREE);
       else
 	{
 	  new_stmt = NULL;
@@ -6859,7 +7107,7 @@  vectorizable_load (gimple *stmt, gimple_
     }
 
   vect_memory_access_type memory_access_type;
-  if (!get_load_store_type (stmt, vectype, slp, VLS_LOAD, ncopies,
+  if (!get_load_store_type (stmt, vectype, slp, false, VLS_LOAD, ncopies,
 			    &memory_access_type, &gs_info))
     return false;
 
@@ -7553,32 +7801,8 @@  vectorizable_load (gimple *stmt, gimple_
 	dr_chain.create (vec_num);
 
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	{
-	  tree vec_array;
-
-	  vec_array = create_vector_array (vectype, vec_num);
-
-	  /* Emit:
-	       VEC_ARRAY = LOAD_LANES (MEM_REF[...all elements...]).  */
-	  data_ref = create_array_ref (aggr_type, dataref_ptr, ref_type);
-	  gcall *call = gimple_build_call_internal (IFN_LOAD_LANES, 1,
-						    data_ref);
-	  gimple_call_set_lhs (call, vec_array);
-	  gimple_call_set_nothrow (call, true);
-	  new_stmt = call;
-	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
-
-	  /* Extract each vector into an SSA_NAME.  */
-	  for (i = 0; i < vec_num; i++)
-	    {
-	      new_temp = read_vector_array (stmt, gsi, scalar_dest,
-					    vec_array, i);
-	      dr_chain.quick_push (new_temp);
-	    }
-
-	  /* Record the mapping between SSA_NAMEs and statements.  */
-	  vect_record_grouped_load_vectors (stmt, dr_chain);
-	}
+	do_load_lanes (stmt, gsi, group_size, vectype, aggr_type,
+		       dataref_ptr, ref_type, NULL_TREE);
       else
 	{
 	  for (i = 0; i < vec_num; i++)
@@ -8907,7 +9131,16 @@  vect_transform_stmt (gimple *stmt, gimpl
       done = vectorizable_call (stmt, gsi, &vec_stmt, slp_node);
       stmt = gsi_stmt (*gsi);
       if (gimple_call_internal_p (stmt, IFN_MASK_STORE))
-	is_store = true;
+	{
+	  gcc_assert (!slp_node);
+	  /* As with normal stores, we vectorize the whole group when
+	     we reach the last call in the group.  The other calls in
+	     the group are left with a null VEC_STMT.  */
+	  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	    *grouped_store = true;
+	  if (STMT_VINFO_VEC_STMT (stmt_info))
+	    is_store = true;
+	}
       break;
 
     case call_simd_clone_vec_info_type:
Index: gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.dg/vect/vect-ooo-group-1.c	2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c)
+{
+  for (int i = 0; i < 100; ++i)
+    if (c[i])
+      {
+	a[i * 2] = b[i * 5 + 2];
+	a[i * 2 + 1] = b[i * 5];
+      }
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1.c	2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,67 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2] + src[i * 2 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_1_run.c	2017-11-08 16:35:04.763816035 +0000
@@ -0,0 +1,38 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 2];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 2; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_2 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = in[i * 2] + in[i * 2 + 1];	\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,69 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 3]					\
+		   + src[i * 3 + 1]				\
+		   + src[i * 3 + 2]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_2_run.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,40 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 3];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 3; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_3 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 3]			\
+			   + in[i * 3 + 1]		\
+			   + in[i * 3 + 2]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,70 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = (src[i * 4]					\
+		   + src[i * 4 + 1]				\
+		   + src[i * 4 + 2]				\
+		   + src[i * 4 + 3]);				\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_3_run.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,41 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_load_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)	\
+  {							\
+    OUTTYPE out[N];					\
+    INTYPE in[N * 4];					\
+    MASKTYPE mask[N];					\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	out[i] = i * 7 / 2;				\
+	mask[i] = i % 5 <= i % 3;			\
+	asm volatile ("" ::: "memory");			\
+      }							\
+    for (int i = 0; i < N * 4; ++i)			\
+      in[i] = i * 9 / 2;				\
+    NAME##_4 (out, in, mask, N);			\
+    for (int i = 0; i < N; ++i)				\
+      {							\
+	OUTTYPE if_true = (in[i * 4]			\
+			   + in[i * 4 + 1]		\
+			   + in[i * 4 + 2]		\
+			   + in[i * 4 + 3]);		\
+	OUTTYPE if_false = i * 7 / 2;			\
+	if (out[i] != (mask[i] ? if_true : if_false))	\
+	  __builtin_abort ();				\
+	asm volatile ("" ::: "memory");			\
+      }							\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_4.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,67 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_5.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,67 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 3];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tld4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    Out  8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tld4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_6.c	2017-11-08 16:35:04.766569934 +0000
@@ -0,0 +1,40 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 2];					\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld2b\t} } } */
+/* { dg-final { scan-assembler-not {\tld2h\t} } } */
+/* { dg-final { scan-assembler-not {\tld2w\t} } } */
+/* { dg-final { scan-assembler-not {\tld2d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_7.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,40 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 3] + src[i * 3 + 1];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld3b\t} } } */
+/* { dg-final { scan-assembler-not {\tld3h\t} } } */
+/* { dg-final { scan-assembler-not {\tld3w\t} } } */
+/* { dg-final { scan-assembler-not {\tld3d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_load_8.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,40 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	dest[i] = src[i * 4] + src[i * 4 + 2];			\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tld4b\t} } } */
+/* { dg-final { scan-assembler-not {\tld4h\t} } } */
+/* { dg-final { scan-assembler-not {\tld4w\t} } } */
+/* { dg-final { scan-assembler-not {\tld4d\t} } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,70 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	{							\
+	  dest[i * 2] = src[i];					\
+	  dest[i * 2 + 1] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst2w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst2d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_1_run.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_1.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 2];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 2; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_2 (out, in, mask, N);				\
+    for (int i = 0; i < N * 2; ++i)				\
+      {								\
+	OUTTYPE if_true = in[i / 2];				\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 2] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,71 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_3 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	{							\
+	  dest[i * 3] = src[i];					\
+	  dest[i * 3 + 1] = src[i];				\
+	  dest[i * 3 + 2] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for _Float16)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst3w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst3d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_2_run.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_2.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 3];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 3; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_3 (out, in, mask, N);				\
+    for (int i = 0; i < N * 3; ++i)				\
+      {								\
+	OUTTYPE if_true = in[i / 3];				\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 3] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,72 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -fno-vect-cost-model -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_4 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      if (cond[i])						\
+	{							\
+	  dest[i * 4] = src[i];					\
+	  dest[i * 4 + 1] = src[i];				\
+	  dest[i * 4 + 2] = src[i];				\
+	  dest[i * 4 + 3] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  1  1  1  1
+        16 |  1  1  1  1
+        32 |  1  1  1  1
+        64 |  1  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4b\t.z[0-9]} 16 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  2  2  2  2
+        16 |  2  1  1  1 x2 (for half float)
+        32 |  2  1  1  1
+        64 |  2  1  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4h\t.z[0-9]} 28 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  4  4  4  4
+        16 |  4  2  2  2
+        32 |  4  2  1  1 x2 (for float)
+        64 |  4  2  1  1.  */
+/* { dg-final { scan-assembler-times {\tst4w\t.z[0-9]} 50 } } */
+
+/*    Mask |  8 16 32 64
+    -------+------------
+    In   8 |  8  8  8  8
+        16 |  8  4  4  4
+        32 |  8  4  2  2
+        64 |  8  4  2  1 x2 (for double).  */
+/* { dg-final { scan-assembler-times {\tst4d\t.z[0-9]} 98 } } */
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_3_run.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,38 @@ 
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-O2 -ftree-vectorize -fno-tree-dce -ffast-math -march=armv8-a+sve" } */
+
+#include "sve_mask_struct_store_3.c"
+
+#define N 100
+
+#undef TEST_LOOP
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  {								\
+    OUTTYPE out[N * 4];						\
+    INTYPE in[N];						\
+    MASKTYPE mask[N];						\
+    for (int i = 0; i < N; ++i)					\
+      {								\
+	in[i] = i * 7 / 2;					\
+	mask[i] = i % 5 <= i % 3;				\
+	asm volatile ("" ::: "memory");				\
+      }								\
+    for (int i = 0; i < N * 4; ++i)				\
+      out[i] = i * 9 / 2;					\
+    NAME##_4 (out, in, mask, N);				\
+    for (int i = 0; i < N * 4; ++i)				\
+      {								\
+	OUTTYPE if_true = in[i / 4];				\
+	OUTTYPE if_false = i * 9 / 2;				\
+	if (out[i] != (mask[i / 4] ? if_true : if_false))	\
+	  __builtin_abort ();					\
+	asm volatile ("" ::: "memory");				\
+      }								\
+  }
+
+int __attribute__ ((optimize (1)))
+main (void)
+{
+  TEST (test);
+  return 0;
+}
Index: gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c
===================================================================
--- /dev/null	2017-11-08 11:04:45.353113300 +0000
+++ gcc/testsuite/gcc.target/aarch64/sve_mask_struct_store_4.c	2017-11-08 16:35:04.767487900 +0000
@@ -0,0 +1,44 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -ftree-vectorize -ffast-math -march=armv8-a+sve" } */
+
+#define TEST_LOOP(NAME, OUTTYPE, INTYPE, MASKTYPE)		\
+  void __attribute__ ((noinline, noclone))			\
+  NAME##_2 (OUTTYPE *__restrict dest, INTYPE *__restrict src,	\
+	    MASKTYPE *__restrict cond, int n)			\
+  {								\
+    for (int i = 0; i < n; ++i)					\
+      {								\
+	if (cond[i] < 8)					\
+	  dest[i * 2] = src[i];					\
+	if (cond[i] > 2)					\
+	  dest[i * 2 + 1] = src[i];				\
+	}							\
+  }
+
+#define TEST2(NAME, OUTTYPE, INTYPE) \
+  TEST_LOOP (NAME##_i8, OUTTYPE, INTYPE, signed char) \
+  TEST_LOOP (NAME##_i16, OUTTYPE, INTYPE, unsigned short) \
+  TEST_LOOP (NAME##_f32, OUTTYPE, INTYPE, float) \
+  TEST_LOOP (NAME##_f64, OUTTYPE, INTYPE, double)
+
+#define TEST1(NAME, OUTTYPE) \
+  TEST2 (NAME##_i8, OUTTYPE, signed char) \
+  TEST2 (NAME##_i16, OUTTYPE, unsigned short) \
+  TEST2 (NAME##_i32, OUTTYPE, int) \
+  TEST2 (NAME##_i64, OUTTYPE, unsigned long)
+
+#define TEST(NAME) \
+  TEST1 (NAME##_i8, signed char) \
+  TEST1 (NAME##_i16, unsigned short) \
+  TEST1 (NAME##_i32, int) \
+  TEST1 (NAME##_i64, unsigned long) \
+  TEST2 (NAME##_f16_f16, _Float16, _Float16) \
+  TEST2 (NAME##_f32_f32, float, float) \
+  TEST2 (NAME##_f64_f64, double, double)
+
+TEST (test)
+
+/* { dg-final { scan-assembler-not {\tst2b\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2h\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2w\t.z[0-9]} } } */
+/* { dg-final { scan-assembler-not {\tst2d\t.z[0-9]} } } */
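Here each field of the structure is stored under a different condition
(cond[i] < 8 vs. cond[i] > 2), so the two stores cannot be treated as a single
masked group and no ST2 instructions should be generated.  Contrast
sve_mask_struct_store_1.c above, where both lanes share one predicate and ST2
is expected:

	if (cond[i])
	  {
	    dest[i * 2] = src[i];
	    dest[i * 2 + 1] = src[i];
	  }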