PR111754

Message ID	CAAgBjMkP2ZTUq9_YN+4_kCzfPBDroFE-YUVSS4h9=NFWxhetwA@mail.gmail.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of gcc-patches-bounces+patch=linaro.org@gcc.gnu.org designates 2620:52:3:1:0:246e:9693:128c as permitted sender) client-ip=2620:52:3:1:0:246e:9693:128c; DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org ECE5A3858D39 MIME-Version: 1.0 From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> Date: Fri, 20 Oct 2023 22:55:44 +0530 Message-ID: <CAAgBjMkP2ZTUq9_YN+4_kCzfPBDroFE-YUVSS4h9=NFWxhetwA@mail.gmail.com> Subject: PR111754 To: gcc Patches <gcc-patches@gcc.gnu.org>, Richard Sandiford <richard.sandiford@arm.com>, Richard Biener <rguenther@suse.de> Content-Type: multipart/mixed; boundary="000000000000be551a0608292be9" Precedence: list Errors-To: gcc-patches-bounces+patch=linaro.org@gcc.gnu.org
Series	PR111754 \| expand PR111754

Prathamesh Kulkarni Oct. 20, 2023, 5:25 p.m. UTC

Hi,
For the following test-case:

typedef float __attribute__((__vector_size__ (16))) F;
F foo (F a, F b)
{
  F v = (F) { 9 };
  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
}

Compiling with -O2 results in following ICE:
foo.c: In function ‘foo’:
foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
    6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
const&)
        ../../gcc/gcc/rtl.h:2314
0x7f3185 wide_int_ref_storage<false,
false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
>(std::pair<rtx_def*, machine_mode> const&)
        ../../gcc/gcc/wide-int.h:1089
0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
>::generic_wide_int<std::pair<rtx_def*, machine_mode>
>(std::pair<rtx_def*, machine_mode> const&)
        ../../gcc/gcc/wide-int.h:847
0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
        ../../gcc/gcc/poly-int.h:467
0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>(std::pair<rtx_def*, machine_mode> const&)
        ../../gcc/gcc/poly-int.h:453
0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
        ../../gcc/gcc/rtl.h:2383
0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
        ../../gcc/gcc/rtx-vector-builder.h:122
0xfd4e1b vector_builder<rtx_def*, machine_mode,
rtx_vector_builder>::elt(unsigned int) const
        ../../gcc/gcc/vector-builder.h:253
0xfd4d11 rtx_vector_builder::build()
        ../../gcc/gcc/rtx-vector-builder.cc:73
0xc21d9c const_vector_from_tree
        ../../gcc/gcc/expr.cc:13487
0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
expand_modifier, rtx_def**, bool)
        ../../gcc/gcc/expr.cc:11059
0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
        ../../gcc/gcc/expr.h:310
0xaee682 expand_return
        ../../gcc/gcc/cfgexpand.cc:3809
0xaee682 expand_gimple_stmt_1
        ../../gcc/gcc/cfgexpand.cc:3918
0xaee682 expand_gimple_stmt
        ../../gcc/gcc/cfgexpand.cc:4044
0xaf28f0 expand_gimple_basic_block
        ../../gcc/gcc/cfgexpand.cc:6100
0xaf4996 execute
        ../../gcc/gcc/cfgexpand.cc:6835

IIUC, the issue is that fold_vec_perm returns a vector having float element
type with res_nelts_per_pattern == 3, and later ICE's when it tries
to derive element v[3], not present in the encoding, while trying to
build rtx vector
in rtx_vector_builder::build():
 for (unsigned int i = 0; i < nelts; ++i)
    RTVEC_ELT (v, i) = elt (i);

The attached patch tries to fix this by returning false from
valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
input vector has non-integral element type, so for VLA vectors, it
will only build result with dup sequence (nelts_per_pattern < 3) for
non-integral element type.

For VLS vectors, this will still work for stepped sequence since it
will then use the "VLS exception" in fold_vec_perm_cst, and set:
res_npattern = res_nelts and
res_nelts_per_pattern = 1

and fold the above case to:
F foo (F a, F b)
{
  <bb 2> [local count: 1073741824]:
  return { 0.0, 9.0e+0, 0.0, 0.0 };
}

But I am not sure if this is entirely correct, since:
tree res = out_elts.build ();
will canonicalize the encoding and may result in a stepped sequence
(vector_builder::finalize() may reduce npatterns at the cost of increasing
nelts_per_pattern)  ?

PS: This issue is now latent after PR111648 fix, since
valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
false because the corresponding pattern in arg0 is not a natural
stepped sequence, and folds correctly using VLS exception. However, I
guess the underlying issue of dealing with non-integral element types
in fold_vec_perm_cst still remains ?

The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
and on x86_64-linux-gnu.

Thanks,
Prathamesh

Richard Sandiford Oct. 24, 2023, 9:28 p.m. UTC | #1

Hi,

Sorry the slow review.  I clearly didn't think this through properly
when doing the review of the original patch, so I wanted to spend
some time working on the code to get a better understanding of
the problem.

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> Hi,
> For the following test-case:
>
> typedef float __attribute__((__vector_size__ (16))) F;
> F foo (F a, F b)
> {
>   F v = (F) { 9 };
>   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> }
>
> Compiling with -O2 results in following ICE:
> foo.c: In function ‘foo’:
> foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
>     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
>       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
>>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> const&)
>         ../../gcc/gcc/rtl.h:2314
> 0x7f3185 wide_int_ref_storage<false,
> false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/wide-int.h:1089
> 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
>>::generic_wide_int<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/wide-int.h:847
> 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/poly-int.h:467
> 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>>(std::pair<rtx_def*, machine_mode> const&)
>         ../../gcc/gcc/poly-int.h:453
> 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
>         ../../gcc/gcc/rtl.h:2383
> 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
>         ../../gcc/gcc/rtx-vector-builder.h:122
> 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> rtx_vector_builder>::elt(unsigned int) const
>         ../../gcc/gcc/vector-builder.h:253
> 0xfd4d11 rtx_vector_builder::build()
>         ../../gcc/gcc/rtx-vector-builder.cc:73
> 0xc21d9c const_vector_from_tree
>         ../../gcc/gcc/expr.cc:13487
> 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> expand_modifier, rtx_def**, bool)
>         ../../gcc/gcc/expr.cc:11059
> 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
>         ../../gcc/gcc/expr.h:310
> 0xaee682 expand_return
>         ../../gcc/gcc/cfgexpand.cc:3809
> 0xaee682 expand_gimple_stmt_1
>         ../../gcc/gcc/cfgexpand.cc:3918
> 0xaee682 expand_gimple_stmt
>         ../../gcc/gcc/cfgexpand.cc:4044
> 0xaf28f0 expand_gimple_basic_block
>         ../../gcc/gcc/cfgexpand.cc:6100
> 0xaf4996 execute
>         ../../gcc/gcc/cfgexpand.cc:6835
>
> IIUC, the issue is that fold_vec_perm returns a vector having float element
> type with res_nelts_per_pattern == 3, and later ICE's when it tries
> to derive element v[3], not present in the encoding, while trying to
> build rtx vector
> in rtx_vector_builder::build():
>  for (unsigned int i = 0; i < nelts; ++i)
>     RTVEC_ELT (v, i) = elt (i);
>
> The attached patch tries to fix this by returning false from
> valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> input vector has non-integral element type, so for VLA vectors, it
> will only build result with dup sequence (nelts_per_pattern < 3) for
> non-integral element type.
>
> For VLS vectors, this will still work for stepped sequence since it
> will then use the "VLS exception" in fold_vec_perm_cst, and set:
> res_npattern = res_nelts and
> res_nelts_per_pattern = 1
>
> and fold the above case to:
> F foo (F a, F b)
> {
>   <bb 2> [local count: 1073741824]:
>   return { 0.0, 9.0e+0, 0.0, 0.0 };
> }
>
> But I am not sure if this is entirely correct, since:
> tree res = out_elts.build ();
> will canonicalize the encoding and may result in a stepped sequence
> (vector_builder::finalize() may reduce npatterns at the cost of increasing
> nelts_per_pattern)  ?
>
> PS: This issue is now latent after PR111648 fix, since
> valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> false because the corresponding pattern in arg0 is not a natural
> stepped sequence, and folds correctly using VLS exception. However, I
> guess the underlying issue of dealing with non-integral element types
> in fold_vec_perm_cst still remains ?
>
> The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> and on x86_64-linux-gnu.

I think the problem is instead in the way that we're calculating
res_npatterns and res_nelts_per_pattern.

If the selector is a duplication of { a1, ..., an }, then the
result will be a duplication of n elements, regardless of the shape
of the other arguments.

Similarly, if the selector is { a1, ...., an } followed by a
duplication of { b1, ..., bn }, the result be n elements followed
by a duplication of n elements, regardless of the shape of the other
arguments.

So for these two cases, res_npatterns and res_nelts_per_pattern
can come directly from the selector's encoding.

If:

(1) the selector is an n-pattern stepped sequence
(2) the stepped part of each pattern selects from the same input pattern
(3) the stepped part of each pattern does not select the first element
    of the input pattern, or the full input pattern is stepped
    (your previous patch)

then the result is stepped only if one of the inputs is stepped.
This is because, if an input pattern has 1 or 2 elements, (3) means
that each element of the stepped sequence will select the same value,
as if the selector step had been 0.

So I think the PR could be solved by something like the attached.
Do you agree?  If so, could you base the patch on this instead?

Only tested against the self-tests.

Thanks,
Richard

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 40767736389..00fce4945a7 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;
 
-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-			     ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-	= std::max (VECTOR_CST_NPATTERNS (arg0),
-		    std::max (VECTOR_CST_NPATTERNS (arg1),
-			      sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+	 result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+	 of N elements, the result is too.
 
-      res_nelts_per_pattern
-	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-			      sel.encoding ().nelts_per_pattern ()));
+     (3) If the selector is N elements followed by an interleaving
+	 of N linear series, the situation is more complex.
 
+	 valid_mask_for_fold_vec_perm_cst_p detects whether we
+	 can handle this case.  If we can, then each of the N linear
+	 series either (a) selects the same element each time or
+	 (b) selects a linear series from one of the input patterns.
+
+	 If (b) holds for one of the linear series, the result
+	 will contain a linear series, and so the result will have
+	 the same shape as the selector.  If (a) holds for all of
+	 the lienar series, the result will be the same as (2) above.
+
+	 (b) can only hold if one of the inputs pattern has a
+	 stepped encoding.  */
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+	res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))

Richard Sandiford Oct. 25, 2023, 8:42 a.m. UTC | #2

Sigh, I knew I should have waited until the morning to proof-read
and send this.

Richard Sandiford <richard.sandiford@arm.com> writes:
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 40767736389..00fce4945a7 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>  
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -			     ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -	= std::max (VECTOR_CST_NPATTERNS (arg0),
> -		    std::max (VECTOR_CST_NPATTERNS (arg1),
> -			      sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +	 result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +	 of N elements, the result is too.
>  
> -      res_nelts_per_pattern
> -	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -			      sel.encoding ().nelts_per_pattern ()));
> +     (3) If the selector is N elements followed by an interleaving
> +	 of N linear series, the situation is more complex.
>  
> +	 valid_mask_for_fold_vec_perm_cst_p detects whether we
> +	 can handle this case.  If we can, then each of the N linear
> +	 series either (a) selects the same element each time or
> +	 (b) selects a linear series from one of the input patterns.
> +
> +	 If (b) holds for one of the linear series, the result
> +	 will contain a linear series, and so the result will have
> +	 the same shape as the selector.  If (a) holds for all of
> +	 the lienar series, the result will be the same as (2) above.

linear

> +
> +	 (b) can only hold if one of the inputs pattern has a

input patterns

Sorry for the typos.

Richard

> +	 stepped encoding.  */
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +	res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))

Prathamesh Kulkarni Oct. 25, 2023, 10:59 a.m. UTC | #3

On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Hi,
>
> Sorry the slow review.  I clearly didn't think this through properly
> when doing the review of the original patch, so I wanted to spend
> some time working on the code to get a better understanding of
> the problem.
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > Hi,
> > For the following test-case:
> >
> > typedef float __attribute__((__vector_size__ (16))) F;
> > F foo (F a, F b)
> > {
> >   F v = (F) { 9 };
> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > }
> >
> > Compiling with -O2 results in following ICE:
> > foo.c: In function ‘foo’:
> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> > const&)
> >         ../../gcc/gcc/rtl.h:2314
> > 0x7f3185 wide_int_ref_storage<false,
> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> >>(std::pair<rtx_def*, machine_mode> const&)
> >         ../../gcc/gcc/wide-int.h:1089
> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> >>(std::pair<rtx_def*, machine_mode> const&)
> >         ../../gcc/gcc/wide-int.h:847
> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> >         ../../gcc/gcc/poly-int.h:467
> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >>(std::pair<rtx_def*, machine_mode> const&)
> >         ../../gcc/gcc/poly-int.h:453
> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> >         ../../gcc/gcc/rtl.h:2383
> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> >         ../../gcc/gcc/rtx-vector-builder.h:122
> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> > rtx_vector_builder>::elt(unsigned int) const
> >         ../../gcc/gcc/vector-builder.h:253
> > 0xfd4d11 rtx_vector_builder::build()
> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> > 0xc21d9c const_vector_from_tree
> >         ../../gcc/gcc/expr.cc:13487
> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > expand_modifier, rtx_def**, bool)
> >         ../../gcc/gcc/expr.cc:11059
> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> >         ../../gcc/gcc/expr.h:310
> > 0xaee682 expand_return
> >         ../../gcc/gcc/cfgexpand.cc:3809
> > 0xaee682 expand_gimple_stmt_1
> >         ../../gcc/gcc/cfgexpand.cc:3918
> > 0xaee682 expand_gimple_stmt
> >         ../../gcc/gcc/cfgexpand.cc:4044
> > 0xaf28f0 expand_gimple_basic_block
> >         ../../gcc/gcc/cfgexpand.cc:6100
> > 0xaf4996 execute
> >         ../../gcc/gcc/cfgexpand.cc:6835
> >
> > IIUC, the issue is that fold_vec_perm returns a vector having float element
> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> > to derive element v[3], not present in the encoding, while trying to
> > build rtx vector
> > in rtx_vector_builder::build():
> >  for (unsigned int i = 0; i < nelts; ++i)
> >     RTVEC_ELT (v, i) = elt (i);
> >
> > The attached patch tries to fix this by returning false from
> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> > input vector has non-integral element type, so for VLA vectors, it
> > will only build result with dup sequence (nelts_per_pattern < 3) for
> > non-integral element type.
> >
> > For VLS vectors, this will still work for stepped sequence since it
> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> > res_npattern = res_nelts and
> > res_nelts_per_pattern = 1
> >
> > and fold the above case to:
> > F foo (F a, F b)
> > {
> >   <bb 2> [local count: 1073741824]:
> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> > }
> >
> > But I am not sure if this is entirely correct, since:
> > tree res = out_elts.build ();
> > will canonicalize the encoding and may result in a stepped sequence
> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
> > nelts_per_pattern)  ?
> >
> > PS: This issue is now latent after PR111648 fix, since
> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> > false because the corresponding pattern in arg0 is not a natural
> > stepped sequence, and folds correctly using VLS exception. However, I
> > guess the underlying issue of dealing with non-integral element types
> > in fold_vec_perm_cst still remains ?
> >
> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> > and on x86_64-linux-gnu.
>
> I think the problem is instead in the way that we're calculating
> res_npatterns and res_nelts_per_pattern.
>
> If the selector is a duplication of { a1, ..., an }, then the
> result will be a duplication of n elements, regardless of the shape
> of the other arguments.
>
> Similarly, if the selector is { a1, ...., an } followed by a
> duplication of { b1, ..., bn }, the result be n elements followed
> by a duplication of n elements, regardless of the shape of the other
> arguments.
>
> So for these two cases, res_npatterns and res_nelts_per_pattern
> can come directly from the selector's encoding.
>
> If:
>
> (1) the selector is an n-pattern stepped sequence
> (2) the stepped part of each pattern selects from the same input pattern
> (3) the stepped part of each pattern does not select the first element
>     of the input pattern, or the full input pattern is stepped
>     (your previous patch)
>
> then the result is stepped only if one of the inputs is stepped.
> This is because, if an input pattern has 1 or 2 elements, (3) means
> that each element of the stepped sequence will select the same value,
> as if the selector step had been 0.
Hi Richard,
Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
case we can set
res_nelts_per_pattern from selector's encoding. However even if we provide
more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
canonicalize it to the correct encoding for result ?
>
> So I think the PR could be solved by something like the attached.
> Do you agree?  If so, could you base the patch on this instead?
>
> Only tested against the self-tests.
>
> Thanks,
> Richard
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 40767736389..00fce4945a7 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -                            ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> -                             sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +        result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +        of N elements, the result is too.
>
> -      res_nelts_per_pattern
> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -                             sel.encoding ().nelts_per_pattern ()));
> +     (3) If the selector is N elements followed by an interleaving
> +        of N linear series, the situation is more complex.
>
> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> +        can handle this case.  If we can, then each of the N linear
> +        series either (a) selects the same element each time or
> +        (b) selects a linear series from one of the input patterns.
> +
> +        If (b) holds for one of the linear series, the result
> +        will contain a linear series, and so the result will have
> +        the same shape as the selector.  If (a) holds for all of
> +        the lienar series, the result will be the same as (2) above.
> +
> +        (b) can only hold if one of the inputs pattern has a
> +        stepped encoding.  */
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +       res_nelts_per_pattern = 2;
Um, in this case, should we set:
res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
if both have nelts_per_pattern == 1 ?

Also I suppose this matters only for non-integral element type, since
for integral element type,
vector_cst_elt will return the correct value even if the element is
not explicitly encoded and input vector is dup ?

Thanks,
Prathamesh
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> --
> 2.25.1
>

Richard Sandiford Oct. 25, 2023, 10:38 p.m. UTC | #4

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Hi,
>>
>> Sorry the slow review.  I clearly didn't think this through properly
>> when doing the review of the original patch, so I wanted to spend
>> some time working on the code to get a better understanding of
>> the problem.
>>
>> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > Hi,
>> > For the following test-case:
>> >
>> > typedef float __attribute__((__vector_size__ (16))) F;
>> > F foo (F a, F b)
>> > {
>> >   F v = (F) { 9 };
>> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
>> > }
>> >
>> > Compiling with -O2 results in following ICE:
>> > foo.c: In function ‘foo’:
>> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
>> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
>> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
>> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
>> > const&)
>> >         ../../gcc/gcc/rtl.h:2314
>> > 0x7f3185 wide_int_ref_storage<false,
>> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
>> >>(std::pair<rtx_def*, machine_mode> const&)
>> >         ../../gcc/gcc/wide-int.h:1089
>> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
>> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
>> >>(std::pair<rtx_def*, machine_mode> const&)
>> >         ../../gcc/gcc/wide-int.h:847
>> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
>> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
>> >         ../../gcc/gcc/poly-int.h:467
>> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
>> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
>> >>(std::pair<rtx_def*, machine_mode> const&)
>> >         ../../gcc/gcc/poly-int.h:453
>> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
>> >         ../../gcc/gcc/rtl.h:2383
>> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
>> >         ../../gcc/gcc/rtx-vector-builder.h:122
>> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
>> > rtx_vector_builder>::elt(unsigned int) const
>> >         ../../gcc/gcc/vector-builder.h:253
>> > 0xfd4d11 rtx_vector_builder::build()
>> >         ../../gcc/gcc/rtx-vector-builder.cc:73
>> > 0xc21d9c const_vector_from_tree
>> >         ../../gcc/gcc/expr.cc:13487
>> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
>> > expand_modifier, rtx_def**, bool)
>> >         ../../gcc/gcc/expr.cc:11059
>> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
>> >         ../../gcc/gcc/expr.h:310
>> > 0xaee682 expand_return
>> >         ../../gcc/gcc/cfgexpand.cc:3809
>> > 0xaee682 expand_gimple_stmt_1
>> >         ../../gcc/gcc/cfgexpand.cc:3918
>> > 0xaee682 expand_gimple_stmt
>> >         ../../gcc/gcc/cfgexpand.cc:4044
>> > 0xaf28f0 expand_gimple_basic_block
>> >         ../../gcc/gcc/cfgexpand.cc:6100
>> > 0xaf4996 execute
>> >         ../../gcc/gcc/cfgexpand.cc:6835
>> >
>> > IIUC, the issue is that fold_vec_perm returns a vector having float element
>> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
>> > to derive element v[3], not present in the encoding, while trying to
>> > build rtx vector
>> > in rtx_vector_builder::build():
>> >  for (unsigned int i = 0; i < nelts; ++i)
>> >     RTVEC_ELT (v, i) = elt (i);
>> >
>> > The attached patch tries to fix this by returning false from
>> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
>> > input vector has non-integral element type, so for VLA vectors, it
>> > will only build result with dup sequence (nelts_per_pattern < 3) for
>> > non-integral element type.
>> >
>> > For VLS vectors, this will still work for stepped sequence since it
>> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
>> > res_npattern = res_nelts and
>> > res_nelts_per_pattern = 1
>> >
>> > and fold the above case to:
>> > F foo (F a, F b)
>> > {
>> >   <bb 2> [local count: 1073741824]:
>> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
>> > }
>> >
>> > But I am not sure if this is entirely correct, since:
>> > tree res = out_elts.build ();
>> > will canonicalize the encoding and may result in a stepped sequence
>> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
>> > nelts_per_pattern)  ?
>> >
>> > PS: This issue is now latent after PR111648 fix, since
>> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
>> > false because the corresponding pattern in arg0 is not a natural
>> > stepped sequence, and folds correctly using VLS exception. However, I
>> > guess the underlying issue of dealing with non-integral element types
>> > in fold_vec_perm_cst still remains ?
>> >
>> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
>> > and on x86_64-linux-gnu.
>>
>> I think the problem is instead in the way that we're calculating
>> res_npatterns and res_nelts_per_pattern.
>>
>> If the selector is a duplication of { a1, ..., an }, then the
>> result will be a duplication of n elements, regardless of the shape
>> of the other arguments.
>>
>> Similarly, if the selector is { a1, ...., an } followed by a
>> duplication of { b1, ..., bn }, the result be n elements followed
>> by a duplication of n elements, regardless of the shape of the other
>> arguments.
>>
>> So for these two cases, res_npatterns and res_nelts_per_pattern
>> can come directly from the selector's encoding.
>>
>> If:
>>
>> (1) the selector is an n-pattern stepped sequence
>> (2) the stepped part of each pattern selects from the same input pattern
>> (3) the stepped part of each pattern does not select the first element
>>     of the input pattern, or the full input pattern is stepped
>>     (your previous patch)
>>
>> then the result is stepped only if one of the inputs is stepped.
>> This is because, if an input pattern has 1 or 2 elements, (3) means
>> that each element of the stepped sequence will select the same value,
>> as if the selector step had been 0.
> Hi Richard,
> Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
> base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> case we can set
> res_nelts_per_pattern from selector's encoding. However even if we provide
> more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
> canonicalize it to the correct encoding for result ?

Right.  The elements before finalize is called have to be correct,
but they don't need to be canonical or minimal.

After the change, we'll build no more elements than we did before.

>> So I think the PR could be solved by something like the attached.
>> Do you agree?  If so, could you base the patch on this instead?
>>
>> Only tested against the self-tests.
>>
>> Thanks,
>> Richard
>>
>> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> index 40767736389..00fce4945a7 100644
>> --- a/gcc/fold-const.cc
>> +++ b/gcc/fold-const.cc
>> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>>    unsigned res_npatterns, res_nelts_per_pattern;
>>    unsigned HOST_WIDE_INT res_nelts;
>>
>> -  /* (1) If SEL is a suitable mask as determined by
>> -     valid_mask_for_fold_vec_perm_cst_p, then:
>> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
>> -     res_nelts_per_pattern = max of nelts_per_pattern between
>> -                            ARG0, ARG1 and SEL.
>> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
>> -     res_npatterns = nelts in result vector.
>> -     res_nelts_per_pattern = 1.
>> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
>> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>> -    {
>> -      res_npatterns
>> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
>> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
>> -                             sel.encoding ().npatterns ()));
>> +  /* First try to implement the fold in a VLA-friendly way.
>> +
>> +     (1) If the selector is simply a duplication of N elements, the
>> +        result is likewise a duplication of N elements.
>> +
>> +     (2) If the selector is N elements followed by a duplication
>> +        of N elements, the result is too.
>>
>> -      res_nelts_per_pattern
>> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
>> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
>> -                             sel.encoding ().nelts_per_pattern ()));
>> +     (3) If the selector is N elements followed by an interleaving
>> +        of N linear series, the situation is more complex.
>>
>> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
>> +        can handle this case.  If we can, then each of the N linear
>> +        series either (a) selects the same element each time or
>> +        (b) selects a linear series from one of the input patterns.
>> +
>> +        If (b) holds for one of the linear series, the result
>> +        will contain a linear series, and so the result will have
>> +        the same shape as the selector.  If (a) holds for all of
>> +        the lienar series, the result will be the same as (2) above.
>> +
>> +        (b) can only hold if one of the inputs pattern has a
>> +        stepped encoding.  */
>> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>> +    {
>> +      res_npatterns = sel.encoding ().npatterns ();
>> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>> +      if (res_nelts_per_pattern == 3
>> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
>> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
>> +       res_nelts_per_pattern = 2;
> Um, in this case, should we set:
> res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> if both have nelts_per_pattern == 1 ?

No, it still needs to be 2 even if arg0 and arg1 are duplicates.
E.g. consider a selector that picks the first element of arg0
followed by a duplicate of the first element of arg1.

> Also I suppose this matters only for non-integral element type, since
> for integral element type,
> vector_cst_elt will return the correct value even if the element is
> not explicitly encoded and input vector is dup ?

Yeah, but it might help even for integers.  If we build fewer
elements explicitly, and so read fewer implicitly-encoded inputs,
there's less risk of running into:

      if (!can_div_trunc_p (sel[i], len, &q, &r))
	{
	  if (reason)
	    *reason = "cannot divide selector element by arg len";
	  return NULL_TREE;
	}

Thanks,
Richard

Prathamesh Kulkarni Oct. 26, 2023, 4:13 a.m. UTC | #5

On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> Sorry the slow review.  I clearly didn't think this through properly
> >> when doing the review of the original patch, so I wanted to spend
> >> some time working on the code to get a better understanding of
> >> the problem.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > Hi,
> >> > For the following test-case:
> >> >
> >> > typedef float __attribute__((__vector_size__ (16))) F;
> >> > F foo (F a, F b)
> >> > {
> >> >   F v = (F) { 9 };
> >> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> >> > }
> >> >
> >> > Compiling with -O2 results in following ICE:
> >> > foo.c: In function ‘foo’:
> >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> >> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> >> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> >> > const&)
> >> >         ../../gcc/gcc/rtl.h:2314
> >> > 0x7f3185 wide_int_ref_storage<false,
> >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/wide-int.h:1089
> >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/wide-int.h:847
> >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/poly-int.h:467
> >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> >> >>(std::pair<rtx_def*, machine_mode> const&)
> >> >         ../../gcc/gcc/poly-int.h:453
> >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> >> >         ../../gcc/gcc/rtl.h:2383
> >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> >> >         ../../gcc/gcc/rtx-vector-builder.h:122
> >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> >> > rtx_vector_builder>::elt(unsigned int) const
> >> >         ../../gcc/gcc/vector-builder.h:253
> >> > 0xfd4d11 rtx_vector_builder::build()
> >> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> >> > 0xc21d9c const_vector_from_tree
> >> >         ../../gcc/gcc/expr.cc:13487
> >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> >> > expand_modifier, rtx_def**, bool)
> >> >         ../../gcc/gcc/expr.cc:11059
> >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> >> >         ../../gcc/gcc/expr.h:310
> >> > 0xaee682 expand_return
> >> >         ../../gcc/gcc/cfgexpand.cc:3809
> >> > 0xaee682 expand_gimple_stmt_1
> >> >         ../../gcc/gcc/cfgexpand.cc:3918
> >> > 0xaee682 expand_gimple_stmt
> >> >         ../../gcc/gcc/cfgexpand.cc:4044
> >> > 0xaf28f0 expand_gimple_basic_block
> >> >         ../../gcc/gcc/cfgexpand.cc:6100
> >> > 0xaf4996 execute
> >> >         ../../gcc/gcc/cfgexpand.cc:6835
> >> >
> >> > IIUC, the issue is that fold_vec_perm returns a vector having float element
> >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> >> > to derive element v[3], not present in the encoding, while trying to
> >> > build rtx vector
> >> > in rtx_vector_builder::build():
> >> >  for (unsigned int i = 0; i < nelts; ++i)
> >> >     RTVEC_ELT (v, i) = elt (i);
> >> >
> >> > The attached patch tries to fix this by returning false from
> >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> >> > input vector has non-integral element type, so for VLA vectors, it
> >> > will only build result with dup sequence (nelts_per_pattern < 3) for
> >> > non-integral element type.
> >> >
> >> > For VLS vectors, this will still work for stepped sequence since it
> >> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> >> > res_npattern = res_nelts and
> >> > res_nelts_per_pattern = 1
> >> >
> >> > and fold the above case to:
> >> > F foo (F a, F b)
> >> > {
> >> >   <bb 2> [local count: 1073741824]:
> >> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> >> > }
> >> >
> >> > But I am not sure if this is entirely correct, since:
> >> > tree res = out_elts.build ();
> >> > will canonicalize the encoding and may result in a stepped sequence
> >> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
> >> > nelts_per_pattern)  ?
> >> >
> >> > PS: This issue is now latent after PR111648 fix, since
> >> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> >> > false because the corresponding pattern in arg0 is not a natural
> >> > stepped sequence, and folds correctly using VLS exception. However, I
> >> > guess the underlying issue of dealing with non-integral element types
> >> > in fold_vec_perm_cst still remains ?
> >> >
> >> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> >> > and on x86_64-linux-gnu.
> >>
> >> I think the problem is instead in the way that we're calculating
> >> res_npatterns and res_nelts_per_pattern.
> >>
> >> If the selector is a duplication of { a1, ..., an }, then the
> >> result will be a duplication of n elements, regardless of the shape
> >> of the other arguments.
> >>
> >> Similarly, if the selector is { a1, ...., an } followed by a
> >> duplication of { b1, ..., bn }, the result be n elements followed
> >> by a duplication of n elements, regardless of the shape of the other
> >> arguments.
> >>
> >> So for these two cases, res_npatterns and res_nelts_per_pattern
> >> can come directly from the selector's encoding.
> >>
> >> If:
> >>
> >> (1) the selector is an n-pattern stepped sequence
> >> (2) the stepped part of each pattern selects from the same input pattern
> >> (3) the stepped part of each pattern does not select the first element
> >>     of the input pattern, or the full input pattern is stepped
> >>     (your previous patch)
> >>
> >> then the result is stepped only if one of the inputs is stepped.
> >> This is because, if an input pattern has 1 or 2 elements, (3) means
> >> that each element of the stepped sequence will select the same value,
> >> as if the selector step had been 0.
> > Hi Richard,
> > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
> > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> > case we can set
> > res_nelts_per_pattern from selector's encoding. However even if we provide
> > more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
> > canonicalize it to the correct encoding for result ?
>
> Right.  The elements before finalize is called have to be correct,
> but they don't need to be canonical or minimal.
>
> After the change, we'll build no more elements than we did before.
>
> >> So I think the PR could be solved by something like the attached.
> >> Do you agree?  If so, could you base the patch on this instead?
> >>
> >> Only tested against the self-tests.
> >>
> >> Thanks,
> >> Richard
> >>
> >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> index 40767736389..00fce4945a7 100644
> >> --- a/gcc/fold-const.cc
> >> +++ b/gcc/fold-const.cc
> >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> >>    unsigned res_npatterns, res_nelts_per_pattern;
> >>    unsigned HOST_WIDE_INT res_nelts;
> >>
> >> -  /* (1) If SEL is a suitable mask as determined by
> >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> >> -                            ARG0, ARG1 and SEL.
> >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> >> -     res_npatterns = nelts in result vector.
> >> -     res_nelts_per_pattern = 1.
> >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> -    {
> >> -      res_npatterns
> >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> >> -                             sel.encoding ().npatterns ()));
> >> +  /* First try to implement the fold in a VLA-friendly way.
> >> +
> >> +     (1) If the selector is simply a duplication of N elements, the
> >> +        result is likewise a duplication of N elements.
> >> +
> >> +     (2) If the selector is N elements followed by a duplication
> >> +        of N elements, the result is too.
> >>
> >> -      res_nelts_per_pattern
> >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> >> -                             sel.encoding ().nelts_per_pattern ()));
> >> +     (3) If the selector is N elements followed by an interleaving
> >> +        of N linear series, the situation is more complex.
> >>
> >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> >> +        can handle this case.  If we can, then each of the N linear
> >> +        series either (a) selects the same element each time or
> >> +        (b) selects a linear series from one of the input patterns.
> >> +
> >> +        If (b) holds for one of the linear series, the result
> >> +        will contain a linear series, and so the result will have
> >> +        the same shape as the selector.  If (a) holds for all of
> >> +        the lienar series, the result will be the same as (2) above.
> >> +
> >> +        (b) can only hold if one of the inputs pattern has a
> >> +        stepped encoding.  */
> >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> +    {
> >> +      res_npatterns = sel.encoding ().npatterns ();
> >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> +      if (res_nelts_per_pattern == 3
> >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> >> +       res_nelts_per_pattern = 2;
> > Um, in this case, should we set:
> > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> > if both have nelts_per_pattern == 1 ?
>
> No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> E.g. consider a selector that picks the first element of arg0
> followed by a duplicate of the first element of arg1.
>
> > Also I suppose this matters only for non-integral element type, since
> > for integral element type,
> > vector_cst_elt will return the correct value even if the element is
> > not explicitly encoded and input vector is dup ?
>
> Yeah, but it might help even for integers.  If we build fewer
> elements explicitly, and so read fewer implicitly-encoded inputs,
> there's less risk of running into:
>
>       if (!can_div_trunc_p (sel[i], len, &q, &r))
>         {
>           if (reason)
>             *reason = "cannot divide selector element by arg len";
>           return NULL_TREE;
>         }
Ah right, thanks for the clarification!
I am currently away on vacation and will return next Thursday, and
will post a follow up patch based on your patch.
Sorry for the delay.

Thanks,
Prathamesh
>
> Thanks,
> Richard

Prathamesh Kulkarni Nov. 8, 2023, 4:27 p.m. UTC | #6

On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Sorry the slow review.  I clearly didn't think this through properly
> > >> when doing the review of the original patch, so I wanted to spend
> > >> some time working on the code to get a better understanding of
> > >> the problem.
> > >>
> > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > >> > Hi,
> > >> > For the following test-case:
> > >> >
> > >> > typedef float __attribute__((__vector_size__ (16))) F;
> > >> > F foo (F a, F b)
> > >> > {
> > >> >   F v = (F) { 9 };
> > >> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > >> > }
> > >> >
> > >> > Compiling with -O2 results in following ICE:
> > >> > foo.c: In function ‘foo’:
> > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> > >> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > >> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> > >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> > >> > const&)
> > >> >         ../../gcc/gcc/rtl.h:2314
> > >> > 0x7f3185 wide_int_ref_storage<false,
> > >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > >> >         ../../gcc/gcc/wide-int.h:1089
> > >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> > >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > >> >         ../../gcc/gcc/wide-int.h:847
> > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > >> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> > >> >         ../../gcc/gcc/poly-int.h:467
> > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > >> >         ../../gcc/gcc/poly-int.h:453
> > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> > >> >         ../../gcc/gcc/rtl.h:2383
> > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> > >> >         ../../gcc/gcc/rtx-vector-builder.h:122
> > >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> > >> > rtx_vector_builder>::elt(unsigned int) const
> > >> >         ../../gcc/gcc/vector-builder.h:253
> > >> > 0xfd4d11 rtx_vector_builder::build()
> > >> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> > >> > 0xc21d9c const_vector_from_tree
> > >> >         ../../gcc/gcc/expr.cc:13487
> > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > >> > expand_modifier, rtx_def**, bool)
> > >> >         ../../gcc/gcc/expr.cc:11059
> > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> > >> >         ../../gcc/gcc/expr.h:310
> > >> > 0xaee682 expand_return
> > >> >         ../../gcc/gcc/cfgexpand.cc:3809
> > >> > 0xaee682 expand_gimple_stmt_1
> > >> >         ../../gcc/gcc/cfgexpand.cc:3918
> > >> > 0xaee682 expand_gimple_stmt
> > >> >         ../../gcc/gcc/cfgexpand.cc:4044
> > >> > 0xaf28f0 expand_gimple_basic_block
> > >> >         ../../gcc/gcc/cfgexpand.cc:6100
> > >> > 0xaf4996 execute
> > >> >         ../../gcc/gcc/cfgexpand.cc:6835
> > >> >
> > >> > IIUC, the issue is that fold_vec_perm returns a vector having float element
> > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> > >> > to derive element v[3], not present in the encoding, while trying to
> > >> > build rtx vector
> > >> > in rtx_vector_builder::build():
> > >> >  for (unsigned int i = 0; i < nelts; ++i)
> > >> >     RTVEC_ELT (v, i) = elt (i);
> > >> >
> > >> > The attached patch tries to fix this by returning false from
> > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> > >> > input vector has non-integral element type, so for VLA vectors, it
> > >> > will only build result with dup sequence (nelts_per_pattern < 3) for
> > >> > non-integral element type.
> > >> >
> > >> > For VLS vectors, this will still work for stepped sequence since it
> > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> > >> > res_npattern = res_nelts and
> > >> > res_nelts_per_pattern = 1
> > >> >
> > >> > and fold the above case to:
> > >> > F foo (F a, F b)
> > >> > {
> > >> >   <bb 2> [local count: 1073741824]:
> > >> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> > >> > }
> > >> >
> > >> > But I am not sure if this is entirely correct, since:
> > >> > tree res = out_elts.build ();
> > >> > will canonicalize the encoding and may result in a stepped sequence
> > >> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
> > >> > nelts_per_pattern)  ?
> > >> >
> > >> > PS: This issue is now latent after PR111648 fix, since
> > >> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> > >> > false because the corresponding pattern in arg0 is not a natural
> > >> > stepped sequence, and folds correctly using VLS exception. However, I
> > >> > guess the underlying issue of dealing with non-integral element types
> > >> > in fold_vec_perm_cst still remains ?
> > >> >
> > >> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> > >> > and on x86_64-linux-gnu.
> > >>
> > >> I think the problem is instead in the way that we're calculating
> > >> res_npatterns and res_nelts_per_pattern.
> > >>
> > >> If the selector is a duplication of { a1, ..., an }, then the
> > >> result will be a duplication of n elements, regardless of the shape
> > >> of the other arguments.
> > >>
> > >> Similarly, if the selector is { a1, ...., an } followed by a
> > >> duplication of { b1, ..., bn }, the result be n elements followed
> > >> by a duplication of n elements, regardless of the shape of the other
> > >> arguments.
> > >>
> > >> So for these two cases, res_npatterns and res_nelts_per_pattern
> > >> can come directly from the selector's encoding.
> > >>
> > >> If:
> > >>
> > >> (1) the selector is an n-pattern stepped sequence
> > >> (2) the stepped part of each pattern selects from the same input pattern
> > >> (3) the stepped part of each pattern does not select the first element
> > >>     of the input pattern, or the full input pattern is stepped
> > >>     (your previous patch)
> > >>
> > >> then the result is stepped only if one of the inputs is stepped.
> > >> This is because, if an input pattern has 1 or 2 elements, (3) means
> > >> that each element of the stepped sequence will select the same value,
> > >> as if the selector step had been 0.
> > > Hi Richard,
> > > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
> > > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> > > case we can set
> > > res_nelts_per_pattern from selector's encoding. However even if we provide
> > > more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
> > > canonicalize it to the correct encoding for result ?
> >
> > Right.  The elements before finalize is called have to be correct,
> > but they don't need to be canonical or minimal.
> >
> > After the change, we'll build no more elements than we did before.
> >
> > >> So I think the PR could be solved by something like the attached.
> > >> Do you agree?  If so, could you base the patch on this instead?
> > >>
> > >> Only tested against the self-tests.
> > >>
> > >> Thanks,
> > >> Richard
> > >>
> > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > >> index 40767736389..00fce4945a7 100644
> > >> --- a/gcc/fold-const.cc
> > >> +++ b/gcc/fold-const.cc
> > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> > >>    unsigned res_npatterns, res_nelts_per_pattern;
> > >>    unsigned HOST_WIDE_INT res_nelts;
> > >>
> > >> -  /* (1) If SEL is a suitable mask as determined by
> > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> > >> -                            ARG0, ARG1 and SEL.
> > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> > >> -     res_npatterns = nelts in result vector.
> > >> -     res_nelts_per_pattern = 1.
> > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > >> -    {
> > >> -      res_npatterns
> > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> > >> -                             sel.encoding ().npatterns ()));
> > >> +  /* First try to implement the fold in a VLA-friendly way.
> > >> +
> > >> +     (1) If the selector is simply a duplication of N elements, the
> > >> +        result is likewise a duplication of N elements.
> > >> +
> > >> +     (2) If the selector is N elements followed by a duplication
> > >> +        of N elements, the result is too.
> > >>
> > >> -      res_nelts_per_pattern
> > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > >> -                             sel.encoding ().nelts_per_pattern ()));
> > >> +     (3) If the selector is N elements followed by an interleaving
> > >> +        of N linear series, the situation is more complex.
> > >>
> > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> > >> +        can handle this case.  If we can, then each of the N linear
> > >> +        series either (a) selects the same element each time or
> > >> +        (b) selects a linear series from one of the input patterns.
> > >> +
> > >> +        If (b) holds for one of the linear series, the result
> > >> +        will contain a linear series, and so the result will have
> > >> +        the same shape as the selector.  If (a) holds for all of
> > >> +        the lienar series, the result will be the same as (2) above.
> > >> +
> > >> +        (b) can only hold if one of the inputs pattern has a
> > >> +        stepped encoding.  */
> > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > >> +    {
> > >> +      res_npatterns = sel.encoding ().npatterns ();
> > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > >> +      if (res_nelts_per_pattern == 3
> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> > >> +       res_nelts_per_pattern = 2;
> > > Um, in this case, should we set:
> > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> > > if both have nelts_per_pattern == 1 ?
> >
> > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> > E.g. consider a selector that picks the first element of arg0
> > followed by a duplicate of the first element of arg1.
> >
> > > Also I suppose this matters only for non-integral element type, since
> > > for integral element type,
> > > vector_cst_elt will return the correct value even if the element is
> > > not explicitly encoded and input vector is dup ?
> >
> > Yeah, but it might help even for integers.  If we build fewer
> > elements explicitly, and so read fewer implicitly-encoded inputs,
> > there's less risk of running into:
> >
> >       if (!can_div_trunc_p (sel[i], len, &q, &r))
> >         {
> >           if (reason)
> >             *reason = "cannot divide selector element by arg len";
> >           return NULL_TREE;
> >         }
> Ah right, thanks for the clarification!
> I am currently away on vacation and will return next Thursday, and
> will post a follow up patch based on your patch.
> Sorry for the delay.
Hi,
Sorry for slow response, I have rebased your patch and added couple of tests.
The attached patch resulted in fallout for aarch64/sve/slp_3.c and
aarch64/sve/slp_4.c.

Specifically for slp_3.c, we didn't fold following case:
arg0, arg1 are dup vectors.
sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
nelts_per_pattern = 3)
because res_nelts_per_pattern was set to 3, and upon encountering 2,
fold_vec_perm_cst returned false.

With patch, we set res_nelts_per_pattern = 2 (since input vectors are
dup), and thus gets folded to:
res = { arg0[0], arg1[0], ... } // (2, 1)

Which results in using ldrqd for loading the result instead of doing
the permutation at runtime with mov and zip1.
I have adjusted the tests for new code-gen.
Does it look OK ?

There's also this strange failure observed on x86_64, as well as on aarch64:
New tests that FAIL (1 tests):
libitm.c++/dropref.C -B
/home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
execution test

Looking at dropref.C:
/* { dg-xfail-run-if "unsupported" { *-*-* } } */
#include <libitm.h>

char *pp;

int main()
{
  __transaction_atomic {
    _ITM_dropReferences (pp, 555);
  }
  return 0;
}

doesn't seem relevant to VEC_PERM_EXPR folding ?
The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
and without SVE, and on x86_64-linux-gnu.

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Richard
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 40767736389..75410869796 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;
 
-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-			     ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-	= std::max (VECTOR_CST_NPATTERNS (arg0),
-		    std::max (VECTOR_CST_NPATTERNS (arg1),
-			      sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+	 result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+	 of N elements, the result is too.
+
+     (3) If the selector is N elements followed by an interleaving
+	 of N linear series, the situation is more complex.
+
+	 valid_mask_for_fold_vec_perm_cst_p detects whether we
+	 can handle this case.  If we can, then each of the N linear
+	 series either (a) selects the same element each time or
+	 (b) selects a linear series from one of the input patterns.
+
+	 If (b) holds for one of the linear series, the result
+	 will contain a linear series, and so the result will have
+	 the same shape as the selector.  If (a) holds for all of
+	 the lienar series, the result will be the same as (2) above.
 
-      res_nelts_per_pattern
-	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-			      sel.encoding ().nelts_per_pattern ()));
+	 (b) can only hold if one of the input patterns has a
+	 stepped encoding.  */
 
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+	res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
@@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
 	tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
 	validate_res (1, 3, res, expected_res);
       }
+
+      /* Case 8: Same as aarch64/sve/slp_3.c:
+	 arg0, arg1 are dup vectors.
+	 sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
+	 So res = { arg0[0], arg1[0], ... } // (2, 1)
+
+	 In this case, since the input vectors are dup, only the first two
+	 elements per pattern in sel are considered significant.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
     }
 }
 
@@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode)
 	ASSERT_TRUE (res == NULL_TREE);
 	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
       }
+
+      /* Case 8: PR111754: When input vector is not a stepped sequence,
+	 check that the result is not a stepped sequence either, even
+	 if sel has a stepped sequence.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 2);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 1, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1) };
+	validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
+      }
+
+      /* Case 9: If sel doesn't contain a stepped sequence,
+	 check that the result has same encoding as sel, irrespective
+	 of shape of input vectors.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 2);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (sel.encoding ().npatterns (),
+		      sel.encoding ().nelts_per_pattern (), res, expected_res);
+      }
     }
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
new file mode 100644
index 00000000000..7c1c16875c7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef float __attribute__((__vector_size__ (16))) F;
+
+F foo (F a, F b)
+{
+  F v = (F) { 9 };
+  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
+/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
index 82dd43a4d98..cb649bc1aa9 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
@@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM)
 
 /* 1 for each 8-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
-/* 1 for each 16-bit type plus 1 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
+/* 1 for each 16-bit type  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
 /* 1 for each 32-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
-/* 3 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
 /* The 64-bit types need:
 
       ZIP1 ZIP1 (2 ZIP2s optimized away)
       ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 
 /* The loop should be fully-masked.  The 64-bit types need two loads
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
index b1fa5e3cf68..ce940a28597 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
@@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
 
 TEST_ALL (VEC_PERM)
 
-/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
+/* 1 for each 8-bit type  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
 /* 1 for each 16-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
-/* 4 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
 /* The 32-bit types need:
 
       ZIP1 ZIP1 (2 ZIP2s optimized away)
@@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM)
       ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
       ZIP1 ZIP2 ZIP1 ZIP2
       ZIP1 ZIP2 ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 
 /* The loop should be fully-masked.  The 32-bit types need two loads

Prathamesh Kulkarni Nov. 15, 2023, 3:14 p.m. UTC | #7

On Wed, 8 Nov 2023 at 21:57, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> > >
> > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> > > > <richard.sandiford@arm.com> wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> Sorry the slow review.  I clearly didn't think this through properly
> > > >> when doing the review of the original patch, so I wanted to spend
> > > >> some time working on the code to get a better understanding of
> > > >> the problem.
> > > >>
> > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > >> > Hi,
> > > >> > For the following test-case:
> > > >> >
> > > >> > typedef float __attribute__((__vector_size__ (16))) F;
> > > >> > F foo (F a, F b)
> > > >> > {
> > > >> >   F v = (F) { 9 };
> > > >> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > > >> > }
> > > >> >
> > > >> > Compiling with -O2 results in following ICE:
> > > >> > foo.c: In function ‘foo’:
> > > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> > > >> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > > >> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> > > >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> > > >> > const&)
> > > >> >         ../../gcc/gcc/rtl.h:2314
> > > >> > 0x7f3185 wide_int_ref_storage<false,
> > > >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > >> >         ../../gcc/gcc/wide-int.h:1089
> > > >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> > > >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > >> >         ../../gcc/gcc/wide-int.h:847
> > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > > >> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> > > >> >         ../../gcc/gcc/poly-int.h:467
> > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > >> >         ../../gcc/gcc/poly-int.h:453
> > > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> > > >> >         ../../gcc/gcc/rtl.h:2383
> > > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> > > >> >         ../../gcc/gcc/rtx-vector-builder.h:122
> > > >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> > > >> > rtx_vector_builder>::elt(unsigned int) const
> > > >> >         ../../gcc/gcc/vector-builder.h:253
> > > >> > 0xfd4d11 rtx_vector_builder::build()
> > > >> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> > > >> > 0xc21d9c const_vector_from_tree
> > > >> >         ../../gcc/gcc/expr.cc:13487
> > > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > > >> > expand_modifier, rtx_def**, bool)
> > > >> >         ../../gcc/gcc/expr.cc:11059
> > > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> > > >> >         ../../gcc/gcc/expr.h:310
> > > >> > 0xaee682 expand_return
> > > >> >         ../../gcc/gcc/cfgexpand.cc:3809
> > > >> > 0xaee682 expand_gimple_stmt_1
> > > >> >         ../../gcc/gcc/cfgexpand.cc:3918
> > > >> > 0xaee682 expand_gimple_stmt
> > > >> >         ../../gcc/gcc/cfgexpand.cc:4044
> > > >> > 0xaf28f0 expand_gimple_basic_block
> > > >> >         ../../gcc/gcc/cfgexpand.cc:6100
> > > >> > 0xaf4996 execute
> > > >> >         ../../gcc/gcc/cfgexpand.cc:6835
> > > >> >
> > > >> > IIUC, the issue is that fold_vec_perm returns a vector having float element
> > > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> > > >> > to derive element v[3], not present in the encoding, while trying to
> > > >> > build rtx vector
> > > >> > in rtx_vector_builder::build():
> > > >> >  for (unsigned int i = 0; i < nelts; ++i)
> > > >> >     RTVEC_ELT (v, i) = elt (i);
> > > >> >
> > > >> > The attached patch tries to fix this by returning false from
> > > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> > > >> > input vector has non-integral element type, so for VLA vectors, it
> > > >> > will only build result with dup sequence (nelts_per_pattern < 3) for
> > > >> > non-integral element type.
> > > >> >
> > > >> > For VLS vectors, this will still work for stepped sequence since it
> > > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> > > >> > res_npattern = res_nelts and
> > > >> > res_nelts_per_pattern = 1
> > > >> >
> > > >> > and fold the above case to:
> > > >> > F foo (F a, F b)
> > > >> > {
> > > >> >   <bb 2> [local count: 1073741824]:
> > > >> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> > > >> > }
> > > >> >
> > > >> > But I am not sure if this is entirely correct, since:
> > > >> > tree res = out_elts.build ();
> > > >> > will canonicalize the encoding and may result in a stepped sequence
> > > >> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
> > > >> > nelts_per_pattern)  ?
> > > >> >
> > > >> > PS: This issue is now latent after PR111648 fix, since
> > > >> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> > > >> > false because the corresponding pattern in arg0 is not a natural
> > > >> > stepped sequence, and folds correctly using VLS exception. However, I
> > > >> > guess the underlying issue of dealing with non-integral element types
> > > >> > in fold_vec_perm_cst still remains ?
> > > >> >
> > > >> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> > > >> > and on x86_64-linux-gnu.
> > > >>
> > > >> I think the problem is instead in the way that we're calculating
> > > >> res_npatterns and res_nelts_per_pattern.
> > > >>
> > > >> If the selector is a duplication of { a1, ..., an }, then the
> > > >> result will be a duplication of n elements, regardless of the shape
> > > >> of the other arguments.
> > > >>
> > > >> Similarly, if the selector is { a1, ...., an } followed by a
> > > >> duplication of { b1, ..., bn }, the result be n elements followed
> > > >> by a duplication of n elements, regardless of the shape of the other
> > > >> arguments.
> > > >>
> > > >> So for these two cases, res_npatterns and res_nelts_per_pattern
> > > >> can come directly from the selector's encoding.
> > > >>
> > > >> If:
> > > >>
> > > >> (1) the selector is an n-pattern stepped sequence
> > > >> (2) the stepped part of each pattern selects from the same input pattern
> > > >> (3) the stepped part of each pattern does not select the first element
> > > >>     of the input pattern, or the full input pattern is stepped
> > > >>     (your previous patch)
> > > >>
> > > >> then the result is stepped only if one of the inputs is stepped.
> > > >> This is because, if an input pattern has 1 or 2 elements, (3) means
> > > >> that each element of the stepped sequence will select the same value,
> > > >> as if the selector step had been 0.
> > > > Hi Richard,
> > > > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
> > > > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> > > > case we can set
> > > > res_nelts_per_pattern from selector's encoding. However even if we provide
> > > > more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
> > > > canonicalize it to the correct encoding for result ?
> > >
> > > Right.  The elements before finalize is called have to be correct,
> > > but they don't need to be canonical or minimal.
> > >
> > > After the change, we'll build no more elements than we did before.
> > >
> > > >> So I think the PR could be solved by something like the attached.
> > > >> Do you agree?  If so, could you base the patch on this instead?
> > > >>
> > > >> Only tested against the self-tests.
> > > >>
> > > >> Thanks,
> > > >> Richard
> > > >>
> > > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > > >> index 40767736389..00fce4945a7 100644
> > > >> --- a/gcc/fold-const.cc
> > > >> +++ b/gcc/fold-const.cc
> > > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> > > >>    unsigned res_npatterns, res_nelts_per_pattern;
> > > >>    unsigned HOST_WIDE_INT res_nelts;
> > > >>
> > > >> -  /* (1) If SEL is a suitable mask as determined by
> > > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> > > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> > > >> -                            ARG0, ARG1 and SEL.
> > > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> > > >> -     res_npatterns = nelts in result vector.
> > > >> -     res_nelts_per_pattern = 1.
> > > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> > > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > > >> -    {
> > > >> -      res_npatterns
> > > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> > > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> > > >> -                             sel.encoding ().npatterns ()));
> > > >> +  /* First try to implement the fold in a VLA-friendly way.
> > > >> +
> > > >> +     (1) If the selector is simply a duplication of N elements, the
> > > >> +        result is likewise a duplication of N elements.
> > > >> +
> > > >> +     (2) If the selector is N elements followed by a duplication
> > > >> +        of N elements, the result is too.
> > > >>
> > > >> -      res_nelts_per_pattern
> > > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > > >> -                             sel.encoding ().nelts_per_pattern ()));
> > > >> +     (3) If the selector is N elements followed by an interleaving
> > > >> +        of N linear series, the situation is more complex.
> > > >>
> > > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> > > >> +        can handle this case.  If we can, then each of the N linear
> > > >> +        series either (a) selects the same element each time or
> > > >> +        (b) selects a linear series from one of the input patterns.
> > > >> +
> > > >> +        If (b) holds for one of the linear series, the result
> > > >> +        will contain a linear series, and so the result will have
> > > >> +        the same shape as the selector.  If (a) holds for all of
> > > >> +        the lienar series, the result will be the same as (2) above.
> > > >> +
> > > >> +        (b) can only hold if one of the inputs pattern has a
> > > >> +        stepped encoding.  */
> > > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > > >> +    {
> > > >> +      res_npatterns = sel.encoding ().npatterns ();
> > > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > >> +      if (res_nelts_per_pattern == 3
> > > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> > > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> > > >> +       res_nelts_per_pattern = 2;
> > > > Um, in this case, should we set:
> > > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> > > > if both have nelts_per_pattern == 1 ?
> > >
> > > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> > > E.g. consider a selector that picks the first element of arg0
> > > followed by a duplicate of the first element of arg1.
> > >
> > > > Also I suppose this matters only for non-integral element type, since
> > > > for integral element type,
> > > > vector_cst_elt will return the correct value even if the element is
> > > > not explicitly encoded and input vector is dup ?
> > >
> > > Yeah, but it might help even for integers.  If we build fewer
> > > elements explicitly, and so read fewer implicitly-encoded inputs,
> > > there's less risk of running into:
> > >
> > >       if (!can_div_trunc_p (sel[i], len, &q, &r))
> > >         {
> > >           if (reason)
> > >             *reason = "cannot divide selector element by arg len";
> > >           return NULL_TREE;
> > >         }
> > Ah right, thanks for the clarification!
> > I am currently away on vacation and will return next Thursday, and
> > will post a follow up patch based on your patch.
> > Sorry for the delay.
> Hi,
> Sorry for slow response, I have rebased your patch and added couple of tests.
> The attached patch resulted in fallout for aarch64/sve/slp_3.c and
> aarch64/sve/slp_4.c.
>
> Specifically for slp_3.c, we didn't fold following case:
> arg0, arg1 are dup vectors.
> sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
> nelts_per_pattern = 3)
> because res_nelts_per_pattern was set to 3, and upon encountering 2,
> fold_vec_perm_cst returned false.
>
> With patch, we set res_nelts_per_pattern = 2 (since input vectors are
> dup), and thus gets folded to:
> res = { arg0[0], arg1[0], ... } // (2, 1)
>
> Which results in using ldrqd for loading the result instead of doing
> the permutation at runtime with mov and zip1.
> I have adjusted the tests for new code-gen.
> Does it look OK ?
>
> There's also this strange failure observed on x86_64, as well as on aarch64:
> New tests that FAIL (1 tests):
> libitm.c++/dropref.C -B
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
> execution test
>
> Looking at dropref.C:
> /* { dg-xfail-run-if "unsupported" { *-*-* } } */
> #include <libitm.h>
>
> char *pp;
>
> int main()
> {
>   __transaction_atomic {
>     _ITM_dropReferences (pp, 555);
>   }
>   return 0;
> }
>
> doesn't seem relevant to VEC_PERM_EXPR folding ?
> The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
> and without SVE, and on x86_64-linux-gnu.
Hi,
ping: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635728.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Richard

Prathamesh Kulkarni Nov. 23, 2023, 11:40 a.m. UTC | #8

On Wed, 15 Nov 2023 at 20:44, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> On Wed, 8 Nov 2023 at 21:57, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
> > > <richard.sandiford@arm.com> wrote:
> > > >
> > > > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> > > > > <richard.sandiford@arm.com> wrote:
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >> Sorry the slow review.  I clearly didn't think this through properly
> > > > >> when doing the review of the original patch, so I wanted to spend
> > > > >> some time working on the code to get a better understanding of
> > > > >> the problem.
> > > > >>
> > > > >> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > > > >> > Hi,
> > > > >> > For the following test-case:
> > > > >> >
> > > > >> > typedef float __attribute__((__vector_size__ (16))) F;
> > > > >> > F foo (F a, F b)
> > > > >> > {
> > > > >> >   F v = (F) { 9 };
> > > > >> >   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > > > >> > }
> > > > >> >
> > > > >> > Compiling with -O2 results in following ICE:
> > > > >> > foo.c: In function ‘foo’:
> > > > >> > foo.c:6:10: internal compiler error: in decompose, at rtl.h:2314
> > > > >> >     6 |   return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > > > >> >       |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > >> > 0x7f3185 wi::int_traits<std::pair<rtx_def*, machine_mode>
> > > > >> >>::decompose(long*, unsigned int, std::pair<rtx_def*, machine_mode>
> > > > >> > const&)
> > > > >> >         ../../gcc/gcc/rtl.h:2314
> > > > >> > 0x7f3185 wide_int_ref_storage<false,
> > > > >> > false>::wide_int_ref_storage<std::pair<rtx_def*, machine_mode>
> > > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > > >> >         ../../gcc/gcc/wide-int.h:1089
> > > > >> > 0x7f3185 generic_wide_int<wide_int_ref_storage<false, false>
> > > > >> >>::generic_wide_int<std::pair<rtx_def*, machine_mode>
> > > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > > >> >         ../../gcc/gcc/wide-int.h:847
> > > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > > > >> >>(poly_int_full, std::pair<rtx_def*, machine_mode> const&)
> > > > >> >         ../../gcc/gcc/poly-int.h:467
> > > > >> > 0x7f3185 poly_int<1u, generic_wide_int<wide_int_ref_storage<false,
> > > > >> > false> > >::poly_int<std::pair<rtx_def*, machine_mode>
> > > > >> >>(std::pair<rtx_def*, machine_mode> const&)
> > > > >> >         ../../gcc/gcc/poly-int.h:453
> > > > >> > 0x7f3185 wi::to_poly_wide(rtx_def const*, machine_mode)
> > > > >> >         ../../gcc/gcc/rtl.h:2383
> > > > >> > 0x7f3185 rtx_vector_builder::step(rtx_def*, rtx_def*) const
> > > > >> >         ../../gcc/gcc/rtx-vector-builder.h:122
> > > > >> > 0xfd4e1b vector_builder<rtx_def*, machine_mode,
> > > > >> > rtx_vector_builder>::elt(unsigned int) const
> > > > >> >         ../../gcc/gcc/vector-builder.h:253
> > > > >> > 0xfd4d11 rtx_vector_builder::build()
> > > > >> >         ../../gcc/gcc/rtx-vector-builder.cc:73
> > > > >> > 0xc21d9c const_vector_from_tree
> > > > >> >         ../../gcc/gcc/expr.cc:13487
> > > > >> > 0xc21d9c expand_expr_real_1(tree_node*, rtx_def*, machine_mode,
> > > > >> > expand_modifier, rtx_def**, bool)
> > > > >> >         ../../gcc/gcc/expr.cc:11059
> > > > >> > 0xaee682 expand_expr(tree_node*, rtx_def*, machine_mode, expand_modifier)
> > > > >> >         ../../gcc/gcc/expr.h:310
> > > > >> > 0xaee682 expand_return
> > > > >> >         ../../gcc/gcc/cfgexpand.cc:3809
> > > > >> > 0xaee682 expand_gimple_stmt_1
> > > > >> >         ../../gcc/gcc/cfgexpand.cc:3918
> > > > >> > 0xaee682 expand_gimple_stmt
> > > > >> >         ../../gcc/gcc/cfgexpand.cc:4044
> > > > >> > 0xaf28f0 expand_gimple_basic_block
> > > > >> >         ../../gcc/gcc/cfgexpand.cc:6100
> > > > >> > 0xaf4996 execute
> > > > >> >         ../../gcc/gcc/cfgexpand.cc:6835
> > > > >> >
> > > > >> > IIUC, the issue is that fold_vec_perm returns a vector having float element
> > > > >> > type with res_nelts_per_pattern == 3, and later ICE's when it tries
> > > > >> > to derive element v[3], not present in the encoding, while trying to
> > > > >> > build rtx vector
> > > > >> > in rtx_vector_builder::build():
> > > > >> >  for (unsigned int i = 0; i < nelts; ++i)
> > > > >> >     RTVEC_ELT (v, i) = elt (i);
> > > > >> >
> > > > >> > The attached patch tries to fix this by returning false from
> > > > >> > valid_mask_for_fold_vec_perm_cst if sel has a stepped sequence and
> > > > >> > input vector has non-integral element type, so for VLA vectors, it
> > > > >> > will only build result with dup sequence (nelts_per_pattern < 3) for
> > > > >> > non-integral element type.
> > > > >> >
> > > > >> > For VLS vectors, this will still work for stepped sequence since it
> > > > >> > will then use the "VLS exception" in fold_vec_perm_cst, and set:
> > > > >> > res_npattern = res_nelts and
> > > > >> > res_nelts_per_pattern = 1
> > > > >> >
> > > > >> > and fold the above case to:
> > > > >> > F foo (F a, F b)
> > > > >> > {
> > > > >> >   <bb 2> [local count: 1073741824]:
> > > > >> >   return { 0.0, 9.0e+0, 0.0, 0.0 };
> > > > >> > }
> > > > >> >
> > > > >> > But I am not sure if this is entirely correct, since:
> > > > >> > tree res = out_elts.build ();
> > > > >> > will canonicalize the encoding and may result in a stepped sequence
> > > > >> > (vector_builder::finalize() may reduce npatterns at the cost of increasing
> > > > >> > nelts_per_pattern)  ?
> > > > >> >
> > > > >> > PS: This issue is now latent after PR111648 fix, since
> > > > >> > valid_mask_for_fold_vec_perm_cst with  sel = {1, 0, 1, ...} returns
> > > > >> > false because the corresponding pattern in arg0 is not a natural
> > > > >> > stepped sequence, and folds correctly using VLS exception. However, I
> > > > >> > guess the underlying issue of dealing with non-integral element types
> > > > >> > in fold_vec_perm_cst still remains ?
> > > > >> >
> > > > >> > The patch passes bootstrap+test with and without SVE on aarch64-linux-gnu,
> > > > >> > and on x86_64-linux-gnu.
> > > > >>
> > > > >> I think the problem is instead in the way that we're calculating
> > > > >> res_npatterns and res_nelts_per_pattern.
> > > > >>
> > > > >> If the selector is a duplication of { a1, ..., an }, then the
> > > > >> result will be a duplication of n elements, regardless of the shape
> > > > >> of the other arguments.
> > > > >>
> > > > >> Similarly, if the selector is { a1, ...., an } followed by a
> > > > >> duplication of { b1, ..., bn }, the result be n elements followed
> > > > >> by a duplication of n elements, regardless of the shape of the other
> > > > >> arguments.
> > > > >>
> > > > >> So for these two cases, res_npatterns and res_nelts_per_pattern
> > > > >> can come directly from the selector's encoding.
> > > > >>
> > > > >> If:
> > > > >>
> > > > >> (1) the selector is an n-pattern stepped sequence
> > > > >> (2) the stepped part of each pattern selects from the same input pattern
> > > > >> (3) the stepped part of each pattern does not select the first element
> > > > >>     of the input pattern, or the full input pattern is stepped
> > > > >>     (your previous patch)
> > > > >>
> > > > >> then the result is stepped only if one of the inputs is stepped.
> > > > >> This is because, if an input pattern has 1 or 2 elements, (3) means
> > > > >> that each element of the stepped sequence will select the same value,
> > > > >> as if the selector step had been 0.
> > > > > Hi Richard,
> > > > > Thanks for the suggestions! I agree when selector is dup of {a1, ... an, ...} or
> > > > > base elements followed up dup {a1, .. an, b1, ... bn, ...} in that
> > > > > case we can set
> > > > > res_nelts_per_pattern from selector's encoding. However even if we provide
> > > > > more nelts_per_pattern that necessary, I guess vector_builder::finalize() will
> > > > > canonicalize it to the correct encoding for result ?
> > > >
> > > > Right.  The elements before finalize is called have to be correct,
> > > > but they don't need to be canonical or minimal.
> > > >
> > > > After the change, we'll build no more elements than we did before.
> > > >
> > > > >> So I think the PR could be solved by something like the attached.
> > > > >> Do you agree?  If so, could you base the patch on this instead?
> > > > >>
> > > > >> Only tested against the self-tests.
> > > > >>
> > > > >> Thanks,
> > > > >> Richard
> > > > >>
> > > > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > > > >> index 40767736389..00fce4945a7 100644
> > > > >> --- a/gcc/fold-const.cc
> > > > >> +++ b/gcc/fold-const.cc
> > > > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> > > > >>    unsigned res_npatterns, res_nelts_per_pattern;
> > > > >>    unsigned HOST_WIDE_INT res_nelts;
> > > > >>
> > > > >> -  /* (1) If SEL is a suitable mask as determined by
> > > > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> > > > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > > > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> > > > >> -                            ARG0, ARG1 and SEL.
> > > > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> > > > >> -     res_npatterns = nelts in result vector.
> > > > >> -     res_nelts_per_pattern = 1.
> > > > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> > > > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > > > >> -    {
> > > > >> -      res_npatterns
> > > > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> > > > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> > > > >> -                             sel.encoding ().npatterns ()));
> > > > >> +  /* First try to implement the fold in a VLA-friendly way.
> > > > >> +
> > > > >> +     (1) If the selector is simply a duplication of N elements, the
> > > > >> +        result is likewise a duplication of N elements.
> > > > >> +
> > > > >> +     (2) If the selector is N elements followed by a duplication
> > > > >> +        of N elements, the result is too.
> > > > >>
> > > > >> -      res_nelts_per_pattern
> > > > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > > > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > > > >> -                             sel.encoding ().nelts_per_pattern ()));
> > > > >> +     (3) If the selector is N elements followed by an interleaving
> > > > >> +        of N linear series, the situation is more complex.
> > > > >>
> > > > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> > > > >> +        can handle this case.  If we can, then each of the N linear
> > > > >> +        series either (a) selects the same element each time or
> > > > >> +        (b) selects a linear series from one of the input patterns.
> > > > >> +
> > > > >> +        If (b) holds for one of the linear series, the result
> > > > >> +        will contain a linear series, and so the result will have
> > > > >> +        the same shape as the selector.  If (a) holds for all of
> > > > >> +        the lienar series, the result will be the same as (2) above.
> > > > >> +
> > > > >> +        (b) can only hold if one of the inputs pattern has a
> > > > >> +        stepped encoding.  */
> > > > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > > > >> +    {
> > > > >> +      res_npatterns = sel.encoding ().npatterns ();
> > > > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > > > >> +      if (res_nelts_per_pattern == 3
> > > > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> > > > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> > > > >> +       res_nelts_per_pattern = 2;
> > > > > Um, in this case, should we set:
> > > > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> > > > > if both have nelts_per_pattern == 1 ?
> > > >
> > > > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> > > > E.g. consider a selector that picks the first element of arg0
> > > > followed by a duplicate of the first element of arg1.
> > > >
> > > > > Also I suppose this matters only for non-integral element type, since
> > > > > for integral element type,
> > > > > vector_cst_elt will return the correct value even if the element is
> > > > > not explicitly encoded and input vector is dup ?
> > > >
> > > > Yeah, but it might help even for integers.  If we build fewer
> > > > elements explicitly, and so read fewer implicitly-encoded inputs,
> > > > there's less risk of running into:
> > > >
> > > >       if (!can_div_trunc_p (sel[i], len, &q, &r))
> > > >         {
> > > >           if (reason)
> > > >             *reason = "cannot divide selector element by arg len";
> > > >           return NULL_TREE;
> > > >         }
> > > Ah right, thanks for the clarification!
> > > I am currently away on vacation and will return next Thursday, and
> > > will post a follow up patch based on your patch.
> > > Sorry for the delay.
> > Hi,
> > Sorry for slow response, I have rebased your patch and added couple of tests.
> > The attached patch resulted in fallout for aarch64/sve/slp_3.c and
> > aarch64/sve/slp_4.c.
> >
> > Specifically for slp_3.c, we didn't fold following case:
> > arg0, arg1 are dup vectors.
> > sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
> > nelts_per_pattern = 3)
> > because res_nelts_per_pattern was set to 3, and upon encountering 2,
> > fold_vec_perm_cst returned false.
> >
> > With patch, we set res_nelts_per_pattern = 2 (since input vectors are
> > dup), and thus gets folded to:
> > res = { arg0[0], arg1[0], ... } // (2, 1)
> >
> > Which results in using ldrqd for loading the result instead of doing
> > the permutation at runtime with mov and zip1.
> > I have adjusted the tests for new code-gen.
> > Does it look OK ?
> >
> > There's also this strange failure observed on x86_64, as well as on aarch64:
> > New tests that FAIL (1 tests):
> > libitm.c++/dropref.C -B
> > /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
> > execution test
> >
> > Looking at dropref.C:
> > /* { dg-xfail-run-if "unsupported" { *-*-* } } */
> > #include <libitm.h>
> >
> > char *pp;
> >
> > int main()
> > {
> >   __transaction_atomic {
> >     _ITM_dropReferences (pp, 555);
> >   }
> >   return 0;
> > }
> >
> > doesn't seem relevant to VEC_PERM_EXPR folding ?
> > The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
> > and without SVE, and on x86_64-linux-gnu.
> Hi,
> ping: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635728.html
Hi Richard,
ping * 2: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/635728.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Richard

Richard Sandiford Nov. 23, 2023, 9:43 p.m. UTC | #9

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
>>
>> On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
>> <richard.sandiford@arm.com> wrote:
>> >
>> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
>> > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
>> > > <richard.sandiford@arm.com> wrote:
>> > >> So I think the PR could be solved by something like the attached.
>> > >> Do you agree?  If so, could you base the patch on this instead?
>> > >>
>> > >> Only tested against the self-tests.
>> > >>
>> > >> Thanks,
>> > >> Richard
>> > >>
>> > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
>> > >> index 40767736389..00fce4945a7 100644
>> > >> --- a/gcc/fold-const.cc
>> > >> +++ b/gcc/fold-const.cc
>> > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>> > >>    unsigned res_npatterns, res_nelts_per_pattern;
>> > >>    unsigned HOST_WIDE_INT res_nelts;
>> > >>
>> > >> -  /* (1) If SEL is a suitable mask as determined by
>> > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
>> > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
>> > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
>> > >> -                            ARG0, ARG1 and SEL.
>> > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
>> > >> -     res_npatterns = nelts in result vector.
>> > >> -     res_nelts_per_pattern = 1.
>> > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
>> > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>> > >> -    {
>> > >> -      res_npatterns
>> > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
>> > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
>> > >> -                             sel.encoding ().npatterns ()));
>> > >> +  /* First try to implement the fold in a VLA-friendly way.
>> > >> +
>> > >> +     (1) If the selector is simply a duplication of N elements, the
>> > >> +        result is likewise a duplication of N elements.
>> > >> +
>> > >> +     (2) If the selector is N elements followed by a duplication
>> > >> +        of N elements, the result is too.
>> > >>
>> > >> -      res_nelts_per_pattern
>> > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
>> > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
>> > >> -                             sel.encoding ().nelts_per_pattern ()));
>> > >> +     (3) If the selector is N elements followed by an interleaving
>> > >> +        of N linear series, the situation is more complex.
>> > >>
>> > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
>> > >> +        can handle this case.  If we can, then each of the N linear
>> > >> +        series either (a) selects the same element each time or
>> > >> +        (b) selects a linear series from one of the input patterns.
>> > >> +
>> > >> +        If (b) holds for one of the linear series, the result
>> > >> +        will contain a linear series, and so the result will have
>> > >> +        the same shape as the selector.  If (a) holds for all of
>> > >> +        the lienar series, the result will be the same as (2) above.
>> > >> +
>> > >> +        (b) can only hold if one of the inputs pattern has a
>> > >> +        stepped encoding.  */
>> > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
>> > >> +    {
>> > >> +      res_npatterns = sel.encoding ().npatterns ();
>> > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
>> > >> +      if (res_nelts_per_pattern == 3
>> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
>> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
>> > >> +       res_nelts_per_pattern = 2;
>> > > Um, in this case, should we set:
>> > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
>> > > if both have nelts_per_pattern == 1 ?
>> >
>> > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
>> > E.g. consider a selector that picks the first element of arg0
>> > followed by a duplicate of the first element of arg1.
>> >
>> > > Also I suppose this matters only for non-integral element type, since
>> > > for integral element type,
>> > > vector_cst_elt will return the correct value even if the element is
>> > > not explicitly encoded and input vector is dup ?
>> >
>> > Yeah, but it might help even for integers.  If we build fewer
>> > elements explicitly, and so read fewer implicitly-encoded inputs,
>> > there's less risk of running into:
>> >
>> >       if (!can_div_trunc_p (sel[i], len, &q, &r))
>> >         {
>> >           if (reason)
>> >             *reason = "cannot divide selector element by arg len";
>> >           return NULL_TREE;
>> >         }
>> Ah right, thanks for the clarification!
>> I am currently away on vacation and will return next Thursday, and
>> will post a follow up patch based on your patch.
>> Sorry for the delay.
> Hi,
> Sorry for slow response, I have rebased your patch and added couple of tests.
> The attached patch resulted in fallout for aarch64/sve/slp_3.c and
> aarch64/sve/slp_4.c.
>
> Specifically for slp_3.c, we didn't fold following case:
> arg0, arg1 are dup vectors.
> sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
> nelts_per_pattern = 3)
> because res_nelts_per_pattern was set to 3, and upon encountering 2,
> fold_vec_perm_cst returned false.
>
> With patch, we set res_nelts_per_pattern = 2 (since input vectors are
> dup), and thus gets folded to:
> res = { arg0[0], arg1[0], ... } // (2, 1)
>
> Which results in using ldrqd for loading the result instead of doing
> the permutation at runtime with mov and zip1.
> I have adjusted the tests for new code-gen.
> Does it look OK ?
>
> There's also this strange failure observed on x86_64, as well as on aarch64:
> New tests that FAIL (1 tests):
> libitm.c++/dropref.C -B
> /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
> execution test
>
> Looking at dropref.C:
> /* { dg-xfail-run-if "unsupported" { *-*-* } } */
> #include <libitm.h>
>
> char *pp;
>
> int main()
> {
>   __transaction_atomic {
>     _ITM_dropReferences (pp, 555);
>   }
>   return 0;
> }
>
> doesn't seem relevant to VEC_PERM_EXPR folding ?
> The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
> and without SVE, and on x86_64-linux-gnu.
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Prathamesh
>> >
>> > Thanks,
>> > Richard
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 40767736389..75410869796 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>  
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -			     ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -	= std::max (VECTOR_CST_NPATTERNS (arg0),
> -		    std::max (VECTOR_CST_NPATTERNS (arg1),
> -			      sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +	 result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +	 of N elements, the result is too.
> +
> +     (3) If the selector is N elements followed by an interleaving
> +	 of N linear series, the situation is more complex.
> +
> +	 valid_mask_for_fold_vec_perm_cst_p detects whether we
> +	 can handle this case.  If we can, then each of the N linear
> +	 series either (a) selects the same element each time or
> +	 (b) selects a linear series from one of the input patterns.
> +
> +	 If (b) holds for one of the linear series, the result
> +	 will contain a linear series, and so the result will have
> +	 the same shape as the selector.  If (a) holds for all of
> +	 the lienar series, the result will be the same as (2) above.

my typo: linear
>  
> -      res_nelts_per_pattern
> -	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -			      sel.encoding ().nelts_per_pattern ()));
> +	 (b) can only hold if one of the input patterns has a
> +	 stepped encoding.  */
>  
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +	res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
>  	tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
>  	validate_res (1, 3, res, expected_res);
>        }
> +
> +      /* Case 8: Same as aarch64/sve/slp_3.c:
> +	 arg0, arg1 are dup vectors.
> +	 sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> +	 So res = { arg0[0], arg1[0], ... } // (2, 1)
> +
> +	 In this case, since the input vectors are dup, only the first two
> +	 elements per pattern in sel are considered significant.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 3);
> +	poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0) };
> +	validate_res (2, 1, res, expected_res);
> +      }
>      }
>  }
>  
> @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode)
>  	ASSERT_TRUE (res == NULL_TREE);
>  	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
>        }
> +
> +      /* Case 8: PR111754: When input vector is not a stepped sequence,
> +	 check that the result is not a stepped sequence either, even
> +	 if sel has a stepped sequence.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 2);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 2);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 3);
> +	poly_uint64 mask_elems[] = { 0, 1, 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(1) };
> +	validate_res (sel.encoding ().npatterns (), 2, res, expected_res);

The test is OK, but I think it's worth noting that the fold_vec_perm_cst
arguments aren't canonical.  Since sel selects only from the first input,
the canonical form would be:

  tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);

So OK with a comment, but also OK with the line above instead (and no arg1).

> +      }
> +
> +      /* Case 9: If sel doesn't contain a stepped sequence,
> +	 check that the result has same encoding as sel, irrespective
> +	 of shape of input vectors.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 2);
> +	poly_uint64 mask_elems[] = { 0, len };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0) };
> +	validate_res (sel.encoding ().npatterns (),
> +		      sel.encoding ().nelts_per_pattern (), res, expected_res);
> +      }
>      }
>  }
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
> new file mode 100644
> index 00000000000..7c1c16875c7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +typedef float __attribute__((__vector_size__ (16))) F;
> +
> +F foo (F a, F b)
> +{
> +  F v = (F) { 9 };
> +  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> index 82dd43a4d98..cb649bc1aa9 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> @@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM)
>  
>  /* 1 for each 8-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
> -/* 1 for each 16-bit type plus 1 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
> +/* 1 for each 16-bit type  */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
>  /* 1 for each 32-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */

Let's replace the deleted lines with:

/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */

> -/* 3 for double.  */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
>  /* The 64-bit types need:
>  
>        ZIP1 ZIP1 (2 ZIP2s optimized away)

This line should be deleted, now that the ZIP1s are gone.

>        ZIP1 ZIP2.  */
> -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  
>  /* The loop should be fully-masked.  The 64-bit types need two loads
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> index b1fa5e3cf68..ce940a28597 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> @@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
>  
>  TEST_ALL (VEC_PERM)
>  
> -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
> +/* 1 for each 8-bit type  */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
>  /* 1 for each 16-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
> -/* 4 for double.  */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */

Similarly here:

/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */

>  /* The 32-bit types need:
>  
>        ZIP1 ZIP1 (2 ZIP2s optimized away)

This line should be deleted.

> @@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM)
>        ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)

Same here.

OK with those changes, and sorry for the slow review.

Thanks,
Richard

>        ZIP1 ZIP2 ZIP1 ZIP2
>        ZIP1 ZIP2 ZIP1 ZIP2.  */
> -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
>  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
>  
>  /* The loop should be fully-masked.  The 32-bit types need two loads

Prathamesh Kulkarni Nov. 27, 2023, 3:13 p.m. UTC | #10

On Fri, 24 Nov 2023 at 03:13, Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> > On Thu, 26 Oct 2023 at 09:43, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> >>
> >> On Thu, 26 Oct 2023 at 04:09, Richard Sandiford
> >> <richard.sandiford@arm.com> wrote:
> >> >
> >> > Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> >> > > On Wed, 25 Oct 2023 at 02:58, Richard Sandiford
> >> > > <richard.sandiford@arm.com> wrote:
> >> > >> So I think the PR could be solved by something like the attached.
> >> > >> Do you agree?  If so, could you base the patch on this instead?
> >> > >>
> >> > >> Only tested against the self-tests.
> >> > >>
> >> > >> Thanks,
> >> > >> Richard
> >> > >>
> >> > >> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > >> index 40767736389..00fce4945a7 100644
> >> > >> --- a/gcc/fold-const.cc
> >> > >> +++ b/gcc/fold-const.cc
> >> > >> @@ -10743,27 +10743,37 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> >> > >>    unsigned res_npatterns, res_nelts_per_pattern;
> >> > >>    unsigned HOST_WIDE_INT res_nelts;
> >> > >>
> >> > >> -  /* (1) If SEL is a suitable mask as determined by
> >> > >> -     valid_mask_for_fold_vec_perm_cst_p, then:
> >> > >> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> >> > >> -     res_nelts_per_pattern = max of nelts_per_pattern between
> >> > >> -                            ARG0, ARG1 and SEL.
> >> > >> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> >> > >> -     res_npatterns = nelts in result vector.
> >> > >> -     res_nelts_per_pattern = 1.
> >> > >> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> >> > >> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> > >> -    {
> >> > >> -      res_npatterns
> >> > >> -       = std::max (VECTOR_CST_NPATTERNS (arg0),
> >> > >> -                   std::max (VECTOR_CST_NPATTERNS (arg1),
> >> > >> -                             sel.encoding ().npatterns ()));
> >> > >> +  /* First try to implement the fold in a VLA-friendly way.
> >> > >> +
> >> > >> +     (1) If the selector is simply a duplication of N elements, the
> >> > >> +        result is likewise a duplication of N elements.
> >> > >> +
> >> > >> +     (2) If the selector is N elements followed by a duplication
> >> > >> +        of N elements, the result is too.
> >> > >>
> >> > >> -      res_nelts_per_pattern
> >> > >> -       = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> >> > >> -                   std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> >> > >> -                             sel.encoding ().nelts_per_pattern ()));
> >> > >> +     (3) If the selector is N elements followed by an interleaving
> >> > >> +        of N linear series, the situation is more complex.
> >> > >>
> >> > >> +        valid_mask_for_fold_vec_perm_cst_p detects whether we
> >> > >> +        can handle this case.  If we can, then each of the N linear
> >> > >> +        series either (a) selects the same element each time or
> >> > >> +        (b) selects a linear series from one of the input patterns.
> >> > >> +
> >> > >> +        If (b) holds for one of the linear series, the result
> >> > >> +        will contain a linear series, and so the result will have
> >> > >> +        the same shape as the selector.  If (a) holds for all of
> >> > >> +        the lienar series, the result will be the same as (2) above.
> >> > >> +
> >> > >> +        (b) can only hold if one of the inputs pattern has a
> >> > >> +        stepped encoding.  */
> >> > >> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> >> > >> +    {
> >> > >> +      res_npatterns = sel.encoding ().npatterns ();
> >> > >> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> >> > >> +      if (res_nelts_per_pattern == 3
> >> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> >> > >> +         && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> >> > >> +       res_nelts_per_pattern = 2;
> >> > > Um, in this case, should we set:
> >> > > res_nelts_per_pattern = max (nelts_per_pattern (arg0), nelts_per_pattern(arg1))
> >> > > if both have nelts_per_pattern == 1 ?
> >> >
> >> > No, it still needs to be 2 even if arg0 and arg1 are duplicates.
> >> > E.g. consider a selector that picks the first element of arg0
> >> > followed by a duplicate of the first element of arg1.
> >> >
> >> > > Also I suppose this matters only for non-integral element type, since
> >> > > for integral element type,
> >> > > vector_cst_elt will return the correct value even if the element is
> >> > > not explicitly encoded and input vector is dup ?
> >> >
> >> > Yeah, but it might help even for integers.  If we build fewer
> >> > elements explicitly, and so read fewer implicitly-encoded inputs,
> >> > there's less risk of running into:
> >> >
> >> >       if (!can_div_trunc_p (sel[i], len, &q, &r))
> >> >         {
> >> >           if (reason)
> >> >             *reason = "cannot divide selector element by arg len";
> >> >           return NULL_TREE;
> >> >         }
> >> Ah right, thanks for the clarification!
> >> I am currently away on vacation and will return next Thursday, and
> >> will post a follow up patch based on your patch.
> >> Sorry for the delay.
> > Hi,
> > Sorry for slow response, I have rebased your patch and added couple of tests.
> > The attached patch resulted in fallout for aarch64/sve/slp_3.c and
> > aarch64/sve/slp_4.c.
> >
> > Specifically for slp_3.c, we didn't fold following case:
> > arg0, arg1 are dup vectors.
> > sel = { 0, len, 1, len + 1, 2, len + 2, ... } // (npatterns = 2,
> > nelts_per_pattern = 3)
> > because res_nelts_per_pattern was set to 3, and upon encountering 2,
> > fold_vec_perm_cst returned false.
> >
> > With patch, we set res_nelts_per_pattern = 2 (since input vectors are
> > dup), and thus gets folded to:
> > res = { arg0[0], arg1[0], ... } // (2, 1)
> >
> > Which results in using ldrqd for loading the result instead of doing
> > the permutation at runtime with mov and zip1.
> > I have adjusted the tests for new code-gen.
> > Does it look OK ?
> >
> > There's also this strange failure observed on x86_64, as well as on aarch64:
> > New tests that FAIL (1 tests):
> > libitm.c++/dropref.C -B
> > /home/prathamesh.kulkarni/gnu-toolchain/gcc/gnu-964-5/bootstrap-build-after/aarch64-unknown-linux-gnu/./libitm/../libstdc++-v3/src/.libs
> > execution test
> >
> > Looking at dropref.C:
> > /* { dg-xfail-run-if "unsupported" { *-*-* } } */
> > #include <libitm.h>
> >
> > char *pp;
> >
> > int main()
> > {
> >   __transaction_atomic {
> >     _ITM_dropReferences (pp, 555);
> >   }
> >   return 0;
> > }
> >
> > doesn't seem relevant to VEC_PERM_EXPR folding ?
> > The patch otherwise passes bootstrap+test on aarch64-linux-gnu with
> > and without SVE, and on x86_64-linux-gnu.
> >
> > Thanks,
> > Prathamesh
> >>
> >> Thanks,
> >> Prathamesh
> >> >
> >> > Thanks,
> >> > Richard
> >
> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> > index 40767736389..75410869796 100644
> > --- a/gcc/fold-const.cc
> > +++ b/gcc/fold-const.cc
> > @@ -10743,27 +10743,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
> >    unsigned res_npatterns, res_nelts_per_pattern;
> >    unsigned HOST_WIDE_INT res_nelts;
> >
> > -  /* (1) If SEL is a suitable mask as determined by
> > -     valid_mask_for_fold_vec_perm_cst_p, then:
> > -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> > -     res_nelts_per_pattern = max of nelts_per_pattern between
> > -                          ARG0, ARG1 and SEL.
> > -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> > -     res_npatterns = nelts in result vector.
> > -     res_nelts_per_pattern = 1.
> > -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> > -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > -    {
> > -      res_npatterns
> > -     = std::max (VECTOR_CST_NPATTERNS (arg0),
> > -                 std::max (VECTOR_CST_NPATTERNS (arg1),
> > -                           sel.encoding ().npatterns ()));
> > +  /* First try to implement the fold in a VLA-friendly way.
> > +
> > +     (1) If the selector is simply a duplication of N elements, the
> > +      result is likewise a duplication of N elements.
> > +
> > +     (2) If the selector is N elements followed by a duplication
> > +      of N elements, the result is too.
> > +
> > +     (3) If the selector is N elements followed by an interleaving
> > +      of N linear series, the situation is more complex.
> > +
> > +      valid_mask_for_fold_vec_perm_cst_p detects whether we
> > +      can handle this case.  If we can, then each of the N linear
> > +      series either (a) selects the same element each time or
> > +      (b) selects a linear series from one of the input patterns.
> > +
> > +      If (b) holds for one of the linear series, the result
> > +      will contain a linear series, and so the result will have
> > +      the same shape as the selector.  If (a) holds for all of
> > +      the lienar series, the result will be the same as (2) above.
>
> my typo: linear
> >
> > -      res_nelts_per_pattern
> > -     = std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> > -                 std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> > -                           sel.encoding ().nelts_per_pattern ()));
> > +      (b) can only hold if one of the input patterns has a
> > +      stepped encoding.  */
> >
> > +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> > +    {
> > +      res_npatterns = sel.encoding ().npatterns ();
> > +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> > +      if (res_nelts_per_pattern == 3
> > +       && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> > +       && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> > +     res_nelts_per_pattern = 2;
> >        res_nelts = res_npatterns * res_nelts_per_pattern;
> >      }
> >    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> > @@ -17562,6 +17573,29 @@ test_nunits_min_2 (machine_mode vmode)
> >       tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
> >       validate_res (1, 3, res, expected_res);
> >        }
> > +
> > +      /* Case 8: Same as aarch64/sve/slp_3.c:
> > +      arg0, arg1 are dup vectors.
> > +      sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> > +      So res = { arg0[0], arg1[0], ... } // (2, 1)
> > +
> > +      In this case, since the input vectors are dup, only the first two
> > +      elements per pattern in sel are considered significant.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 2, 3);
> > +     poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > +     validate_res (2, 1, res, expected_res);
> > +      }
> >      }
> >  }
> >
> > @@ -17730,6 +17764,45 @@ test_nunits_min_4 (machine_mode vmode)
> >       ASSERT_TRUE (res == NULL_TREE);
> >       ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
> >        }
> > +
> > +      /* Case 8: PR111754: When input vector is not a stepped sequence,
> > +      check that the result is not a stepped sequence either, even
> > +      if sel has a stepped sequence.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 2);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 2);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 3);
> > +     poly_uint64 mask_elems[] = { 0, 1, 2 };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG0(1) };
> > +     validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
>
> The test is OK, but I think it's worth noting that the fold_vec_perm_cst
> arguments aren't canonical.  Since sel selects only from the first input,
> the canonical form would be:
>
>   tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);
>
> So OK with a comment, but also OK with the line above instead (and no arg1).
>
> > +      }
> > +
> > +      /* Case 9: If sel doesn't contain a stepped sequence,
> > +      check that the result has same encoding as sel, irrespective
> > +      of shape of input vectors.  */
> > +      {
> > +     tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> > +     poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +     vec_perm_builder builder (len, 1, 2);
> > +     poly_uint64 mask_elems[] = { 0, len };
> > +     builder_push_elems (builder, mask_elems);
> > +
> > +     vec_perm_indices sel (builder, 2, len);
> > +     tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +     tree expected_res[] = { ARG0(0), ARG1(0) };
> > +     validate_res (sel.encoding ().npatterns (),
> > +                   sel.encoding ().nelts_per_pattern (), res, expected_res);
> > +      }
> >      }
> >  }
> >
> > diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
> > new file mode 100644
> > index 00000000000..7c1c16875c7
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
> > @@ -0,0 +1,13 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -fdump-tree-optimized" } */
> > +
> > +typedef float __attribute__((__vector_size__ (16))) F;
> > +
> > +F foo (F a, F b)
> > +{
> > +  F v = (F) { 9 };
> > +  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> > +}
> > +
> > +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> > +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > index 82dd43a4d98..cb649bc1aa9 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> > @@ -33,21 +33,15 @@ TEST_ALL (VEC_PERM)
> >
> >  /* 1 for each 8-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
> > -/* 1 for each 16-bit type plus 1 for double.  */
> > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
> > +/* 1 for each 16-bit type  */
> > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
> >  /* 1 for each 32-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
>
> Let's replace the deleted lines with:
>
> /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */
>
> > -/* 3 for double.  */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
> >  /* The 64-bit types need:
> >
> >        ZIP1 ZIP1 (2 ZIP2s optimized away)
>
> This line should be deleted, now that the ZIP1s are gone.
>
> >        ZIP1 ZIP2.  */
> > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
> > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
> >
> >  /* The loop should be fully-masked.  The 64-bit types need two loads
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > index b1fa5e3cf68..ce940a28597 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> > @@ -35,20 +35,10 @@ vec_slp_##TYPE (TYPE *restrict a, int n)                  \
> >
> >  TEST_ALL (VEC_PERM)
> >
> > -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
> > -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
> > +/* 1 for each 8-bit type  */
> > +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
> >  /* 1 for each 16-bit type.  */
> >  /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
> > -/* 4 for double.  */
> > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
>
> Similarly here:
>
> /* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */
>
> >  /* The 32-bit types need:
> >
> >        ZIP1 ZIP1 (2 ZIP2s optimized away)
>
> This line should be deleted.
>
> > @@ -59,7 +49,7 @@ TEST_ALL (VEC_PERM)
> >        ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
>
> Same here.
>
> OK with those changes, and sorry for the slow review.
Hi Richard,
Thanks for the suggestions, I have done the changes in attached patch.
Bootstrapped+tested with and without SVE on aarch64-linux-gnu and
x86_64-linux-gnu.
Which passes with exception of dropref.C failure above, but I assume
that's spurious since
it's not relevant to VEC_PERM_EXPR folding ?
Is it OK to commit the patch to trunk ?

Thanks,
Prathamesh
>
> Thanks,
> Richard
>
> >        ZIP1 ZIP2 ZIP1 ZIP2
> >        ZIP1 ZIP2 ZIP1 ZIP2.  */
> > -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
> > +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
> >  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
> >
> >  /* The loop should be fully-masked.  The 32-bit types need two loads
PR111754: Rework encoding of result for VEC_PERM_EXPR with constant input vectors.

gcc/ChangeLog:
	PR middle-end/111754
	* fold-const.cc (fold_vec_perm_cst): Set result's encoding to sel's
	encoding, and set res_nelts_per_pattern to 2 if sel contains stepped
	sequence but input vectors do not.
	(test_nunits_min_2): New test Case 8.
	(test_nunits_min_4): New tests Case 8 and Case 9.

gcc/testsuite/ChangeLog:
	PR middle-end/111754
	* gcc.target/aarch64/sve/slp_3.c: Adjust code-gen.
	* gcc.target/aarch64/sve/slp_4.c: Likewise.
	* gcc.dg/vect/pr111754.c: New test.

Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 332bc8aead2..dff09b81f7b 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -10803,27 +10803,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
   unsigned res_npatterns, res_nelts_per_pattern;
   unsigned HOST_WIDE_INT res_nelts;
 
-  /* (1) If SEL is a suitable mask as determined by
-     valid_mask_for_fold_vec_perm_cst_p, then:
-     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
-     res_nelts_per_pattern = max of nelts_per_pattern between
-			     ARG0, ARG1 and SEL.
-     (2) If SEL is not a suitable mask, and TYPE is VLS then:
-     res_npatterns = nelts in result vector.
-     res_nelts_per_pattern = 1.
-     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
-  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
-    {
-      res_npatterns
-	= std::max (VECTOR_CST_NPATTERNS (arg0),
-		    std::max (VECTOR_CST_NPATTERNS (arg1),
-			      sel.encoding ().npatterns ()));
+  /* First try to implement the fold in a VLA-friendly way.
+
+     (1) If the selector is simply a duplication of N elements, the
+	 result is likewise a duplication of N elements.
+
+     (2) If the selector is N elements followed by a duplication
+	 of N elements, the result is too.
+
+     (3) If the selector is N elements followed by an interleaving
+	 of N linear series, the situation is more complex.
+
+	 valid_mask_for_fold_vec_perm_cst_p detects whether we
+	 can handle this case.  If we can, then each of the N linear
+	 series either (a) selects the same element each time or
+	 (b) selects a linear series from one of the input patterns.
 
-      res_nelts_per_pattern
-	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
-		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
-			      sel.encoding ().nelts_per_pattern ()));
+	 If (b) holds for one of the linear series, the result
+	 will contain a linear series, and so the result will have
+	 the same shape as the selector.  If (a) holds for all of
+	 the linear series, the result will be the same as (2) above.
 
+	 (b) can only hold if one of the input patterns has a
+	 stepped encoding.  */
+
+  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
+    {
+      res_npatterns = sel.encoding ().npatterns ();
+      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
+      if (res_nelts_per_pattern == 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
+	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
+	res_nelts_per_pattern = 2;
       res_nelts = res_npatterns * res_nelts_per_pattern;
     }
   else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
@@ -17622,6 +17633,29 @@ test_nunits_min_2 (machine_mode vmode)
 	tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
 	validate_res (1, 3, res, expected_res);
       }
+
+      /* Case 8: Same as aarch64/sve/slp_3.c:
+	 arg0, arg1 are dup vectors.
+	 sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
+	 So res = { arg0[0], arg1[0], ... } // (2, 1)
+
+	 In this case, since the input vectors are dup, only the first two
+	 elements per pattern in sel are considered significant.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 2, 3);
+	poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (2, 1, res, expected_res);
+      }
     }
 }
 
@@ -17790,6 +17824,44 @@ test_nunits_min_4 (machine_mode vmode)
 	ASSERT_TRUE (res == NULL_TREE);
 	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
       }
+
+      /* Case 8: PR111754: When input vector is not a stepped sequence,
+	 check that the result is not a stepped sequence either, even
+	 if sel has a stepped sequence.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 2);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 3);
+	poly_uint64 mask_elems[] = { 0, 1, 2 };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 1, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);
+
+	tree expected_res[] = { ARG0(0), ARG0(1) };
+	validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
+      }
+
+      /* Case 9: If sel doesn't contain a stepped sequence,
+	 check that the result has same encoding as sel, irrespective
+	 of shape of input vectors.  */
+      {
+	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
+	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
+	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
+
+	vec_perm_builder builder (len, 1, 2);
+	poly_uint64 mask_elems[] = { 0, len };
+	builder_push_elems (builder, mask_elems);
+
+	vec_perm_indices sel (builder, 2, len);
+	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
+
+	tree expected_res[] = { ARG0(0), ARG1(0) };
+	validate_res (sel.encoding ().npatterns (),
+		      sel.encoding ().nelts_per_pattern (), res, expected_res);
+      }
     }
 }
 
diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
new file mode 100644
index 00000000000..7c1c16875c7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
@@ -0,0 +1,13 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+typedef float __attribute__((__vector_size__ (16))) F;
+
+F foo (F a, F b)
+{
+  F v = (F) { 9 };
+  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
+}
+
+/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
+/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
index 82dd43a4d98..775c1e1d530 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
@@ -33,21 +33,14 @@ TEST_ALL (VEC_PERM)
 
 /* 1 for each 8-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
-/* 1 for each 16-bit type plus 1 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
+/* 1 for each 16-bit type  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
 /* 1 for each 32-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
-/* 3 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */
 /* The 64-bit types need:
-
-      ZIP1 ZIP1 (2 ZIP2s optimized away)
       ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
 
 /* The loop should be fully-masked.  The 64-bit types need two loads
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
index b1fa5e3cf68..5a9fc8ff750 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
@@ -35,31 +35,20 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
 
 TEST_ALL (VEC_PERM)
 
-/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
-/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
+/* 1 for each 8-bit type  */
+/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
 /* 1 for each 16-bit type.  */
 /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
-/* 4 for double.  */
-/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
+/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */
 /* The 32-bit types need:
 
-      ZIP1 ZIP1 (2 ZIP2s optimized away)
       ZIP1 ZIP2
 
    and the 64-bit types need:
 
-      ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
       ZIP1 ZIP2 ZIP1 ZIP2
       ZIP1 ZIP2 ZIP1 ZIP2.  */
-/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
 
 /* The loop should be fully-masked.  The 32-bit types need two loads

Richard Sandiford Nov. 27, 2023, 4:49 p.m. UTC | #11

Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org> writes:
> PR111754: Rework encoding of result for VEC_PERM_EXPR with constant input vectors.
>
> gcc/ChangeLog:
> 	PR middle-end/111754
> 	* fold-const.cc (fold_vec_perm_cst): Set result's encoding to sel's
> 	encoding, and set res_nelts_per_pattern to 2 if sel contains stepped
> 	sequence but input vectors do not.
> 	(test_nunits_min_2): New test Case 8.
> 	(test_nunits_min_4): New tests Case 8 and Case 9.
>
> gcc/testsuite/ChangeLog:
> 	PR middle-end/111754
> 	* gcc.target/aarch64/sve/slp_3.c: Adjust code-gen.
> 	* gcc.target/aarch64/sve/slp_4.c: Likewise.
> 	* gcc.dg/vect/pr111754.c: New test.

OK, thanks.

Richard

> Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>
>
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 332bc8aead2..dff09b81f7b 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -10803,27 +10803,38 @@ fold_vec_perm_cst (tree type, tree arg0, tree arg1, const vec_perm_indices &sel,
>    unsigned res_npatterns, res_nelts_per_pattern;
>    unsigned HOST_WIDE_INT res_nelts;
>  
> -  /* (1) If SEL is a suitable mask as determined by
> -     valid_mask_for_fold_vec_perm_cst_p, then:
> -     res_npatterns = max of npatterns between ARG0, ARG1, and SEL
> -     res_nelts_per_pattern = max of nelts_per_pattern between
> -			     ARG0, ARG1 and SEL.
> -     (2) If SEL is not a suitable mask, and TYPE is VLS then:
> -     res_npatterns = nelts in result vector.
> -     res_nelts_per_pattern = 1.
> -     This exception is made so that VLS ARG0, ARG1 and SEL work as before.  */
> -  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> -    {
> -      res_npatterns
> -	= std::max (VECTOR_CST_NPATTERNS (arg0),
> -		    std::max (VECTOR_CST_NPATTERNS (arg1),
> -			      sel.encoding ().npatterns ()));
> +  /* First try to implement the fold in a VLA-friendly way.
> +
> +     (1) If the selector is simply a duplication of N elements, the
> +	 result is likewise a duplication of N elements.
> +
> +     (2) If the selector is N elements followed by a duplication
> +	 of N elements, the result is too.
> +
> +     (3) If the selector is N elements followed by an interleaving
> +	 of N linear series, the situation is more complex.
> +
> +	 valid_mask_for_fold_vec_perm_cst_p detects whether we
> +	 can handle this case.  If we can, then each of the N linear
> +	 series either (a) selects the same element each time or
> +	 (b) selects a linear series from one of the input patterns.
>  
> -      res_nelts_per_pattern
> -	= std::max (VECTOR_CST_NELTS_PER_PATTERN (arg0),
> -		    std::max (VECTOR_CST_NELTS_PER_PATTERN (arg1),
> -			      sel.encoding ().nelts_per_pattern ()));
> +	 If (b) holds for one of the linear series, the result
> +	 will contain a linear series, and so the result will have
> +	 the same shape as the selector.  If (a) holds for all of
> +	 the linear series, the result will be the same as (2) above.
>  
> +	 (b) can only hold if one of the input patterns has a
> +	 stepped encoding.  */
> +
> +  if (valid_mask_for_fold_vec_perm_cst_p (arg0, arg1, sel, reason))
> +    {
> +      res_npatterns = sel.encoding ().npatterns ();
> +      res_nelts_per_pattern = sel.encoding ().nelts_per_pattern ();
> +      if (res_nelts_per_pattern == 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg0) < 3
> +	  && VECTOR_CST_NELTS_PER_PATTERN (arg1) < 3)
> +	res_nelts_per_pattern = 2;
>        res_nelts = res_npatterns * res_nelts_per_pattern;
>      }
>    else if (TYPE_VECTOR_SUBPARTS (type).is_constant (&res_nelts))
> @@ -17622,6 +17633,29 @@ test_nunits_min_2 (machine_mode vmode)
>  	tree expected_res[] = { ARG0(0), ARG1(0), ARG1(1) };
>  	validate_res (1, 3, res, expected_res);
>        }
> +
> +      /* Case 8: Same as aarch64/sve/slp_3.c:
> +	 arg0, arg1 are dup vectors.
> +	 sel = { 0, len, 1, len+1, 2, len+2, ... } // (2, 3)
> +	 So res = { arg0[0], arg1[0], ... } // (2, 1)
> +
> +	 In this case, since the input vectors are dup, only the first two
> +	 elements per pattern in sel are considered significant.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 2, 3);
> +	poly_uint64 mask_elems[] = { 0, len, 1, len + 1, 2, len + 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0) };
> +	validate_res (2, 1, res, expected_res);
> +      }
>      }
>  }
>  
> @@ -17790,6 +17824,44 @@ test_nunits_min_4 (machine_mode vmode)
>  	ASSERT_TRUE (res == NULL_TREE);
>  	ASSERT_TRUE (!strcmp (reason, "step is not multiple of npatterns"));
>        }
> +
> +      /* Case 8: PR111754: When input vector is not a stepped sequence,
> +	 check that the result is not a stepped sequence either, even
> +	 if sel has a stepped sequence.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 2);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 3);
> +	poly_uint64 mask_elems[] = { 0, 1, 2 };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 1, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg0, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG0(1) };
> +	validate_res (sel.encoding ().npatterns (), 2, res, expected_res);
> +      }
> +
> +      /* Case 9: If sel doesn't contain a stepped sequence,
> +	 check that the result has same encoding as sel, irrespective
> +	 of shape of input vectors.  */
> +      {
> +	tree arg0 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	tree arg1 = build_vec_cst_rand (vmode, 1, 3, 1);
> +	poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> +
> +	vec_perm_builder builder (len, 1, 2);
> +	poly_uint64 mask_elems[] = { 0, len };
> +	builder_push_elems (builder, mask_elems);
> +
> +	vec_perm_indices sel (builder, 2, len);
> +	tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> +
> +	tree expected_res[] = { ARG0(0), ARG1(0) };
> +	validate_res (sel.encoding ().npatterns (),
> +		      sel.encoding ().nelts_per_pattern (), res, expected_res);
> +      }
>      }
>  }
>  
> diff --git a/gcc/testsuite/gcc.dg/vect/pr111754.c b/gcc/testsuite/gcc.dg/vect/pr111754.c
> new file mode 100644
> index 00000000000..7c1c16875c7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/pr111754.c
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +typedef float __attribute__((__vector_size__ (16))) F;
> +
> +F foo (F a, F b)
> +{
> +  F v = (F) { 9 };
> +  return __builtin_shufflevector (v, v, 1, 0, 1, 2);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "VEC_PERM_EXPR" "optimized" } } */
> +/* { dg-final { scan-tree-dump "return \{ 0.0, 9.0e\\+0, 0.0, 0.0 \}" "optimized" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> index 82dd43a4d98..775c1e1d530 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_3.c
> @@ -33,21 +33,14 @@ TEST_ALL (VEC_PERM)
>  
>  /* 1 for each 8-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 2 } } */
> -/* 1 for each 16-bit type plus 1 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 4 } } */
> +/* 1 for each 16-bit type  */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 3 } } */
>  /* 1 for each 32-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqw\tz[0-9]+\.s, } 3 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #41\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #25\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #31\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #62\n} 2 } } */
> -/* 3 for double.  */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 6 } } */
>  /* The 64-bit types need:
> -
> -      ZIP1 ZIP1 (2 ZIP2s optimized away)
>        ZIP1 ZIP2.  */
> -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 9 } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 3 } } */
>  
>  /* The loop should be fully-masked.  The 64-bit types need two loads
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> index b1fa5e3cf68..5a9fc8ff750 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_4.c
> @@ -35,31 +35,20 @@ vec_slp_##TYPE (TYPE *restrict a, int n)			\
>  
>  TEST_ALL (VEC_PERM)
>  
> -/* 1 for each 8-bit type, 4 for each 32-bit type and 4 for double.  */
> -/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 18 } } */
> +/* 1 for each 8-bit type  */
> +/* { dg-final { scan-assembler-times {\tld1rd\tz[0-9]+\.d, } 2 } } */
>  /* 1 for each 16-bit type.  */
>  /* { dg-final { scan-assembler-times {\tld1rqh\tz[0-9]+\.h, } 3 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #99\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #11\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #17\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #80\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #63\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #37\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #24\n} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, #81\n} 2 } } */
> -/* 4 for double.  */
> -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, x[0-9]+\n} 4 } } */
> +/* { dg-final { scan-assembler-times {\tld1rqd\tz[0-9]+\.d, } 18 } } */
>  /* The 32-bit types need:
>  
> -      ZIP1 ZIP1 (2 ZIP2s optimized away)
>        ZIP1 ZIP2
>  
>     and the 64-bit types need:
>  
> -      ZIP1 ZIP1 ZIP1 ZIP1 (4 ZIP2s optimized away)
>        ZIP1 ZIP2 ZIP1 ZIP2
>        ZIP1 ZIP2 ZIP1 ZIP2.  */
> -/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 33 } } */
> +/* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
>  /* { dg-final { scan-assembler-times {\tzip2\tz[0-9]+\.d, z[0-9]+\.d, z[0-9]+\.d\n} 15 } } */
>  
>  /* The loop should be fully-masked.  The 32-bit types need two loads

Commit Message

Comments

Patch