From patchwork Mon May 23 09:29:52 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
X-Patchwork-Id: 68345
Delivered-To: patch@linaro.org
Received: by 10.140.92.199 with SMTP id b65csp1119831qge;
 Mon, 23 May 2016 02:30:31 -0700 (PDT)
X-Received: by 10.98.58.156 with SMTP id v28mr8838200pfj.137.1463995828718; 
 Mon, 23 May 2016 02:30:28 -0700 (PDT)
Return-Path: <gcc-patches-return-428016-patch=linaro.org@gcc.gnu.org>
Received: from sourceware.org (server1.sourceware.org. [209.132.180.131])
 by mx.google.com with ESMTPS id
 f79si50609281pfj.248.2016.05.23.02.30.28 for <patch@linaro.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Mon, 23 May 2016 02:30:28 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-return-428016-patch=linaro.org@gcc.gnu.org
 designates 209.132.180.131 as permitted sender)
 client-ip=209.132.180.131; 
Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org;
 spf=pass (google.com: domain of
 gcc-patches-return-428016-patch=linaro.org@gcc.gnu.org
 designates 209.132.180.131 as permitted sender)
 smtp.mailfrom=gcc-patches-return-428016-patch=linaro.org@gcc.gnu.org;
 dmarc=fail (p=NONE dis=NONE) header.from=linaro.org
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :mime-version:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type; q=dns; s=default; b=W+Ga9KSPU8XQQTWsXf
 44kEcygpAHg572MqCtlyOEgBJJpzZwCGCW2rI94V3o4nB7rjsO1J4MwKpd2MIpZz
 ZpeHvR628l7ABrSOQGCS72eXyoiV5GzAkhQHBzLRxsCAF4MLBAuyBXzfGnd/mWYx
 Ym4qQcG9HdgRr0X3oCYpl2Uq0=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :mime-version:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type; s=default; bh=oS1XfGpcVeF1LqR2fSUTToMg
 tRk=; b=N0ZSgRJQXMWG6QPDXY5r1htTigSvg4nBSaLfJRuiyPgA2C6QtGW7Nt2C
 DZzZgqjqk7BSiSrWsf0shjtcQLG3aurHbVA4lXlA5pLAAU9aoa/1MDOWIMET2D8b
 FCwhwRpykBNumER5XbSuOhF66+9tom10YsiE/YONpfrS8rEPyWg=
Received: (qmail 20542 invoked by alias); 23 May 2016 09:30:10 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <mailto:gcc-patches-unsubscribe-patch=linaro.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 20511 invoked by uid 89); 23 May 2016 09:30:09 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.5 required=5.0 tests=AWL, BAYES_20,
 RCVD_IN_DNSWL_LOW,
 SPF_PASS autolearn=ham version=3.3.2 spammy=charles, gates,
 prathamesh.kulkarni@linaro.org, sk:pratham
X-HELO: mail-ig0-f182.google.com
Received: from mail-ig0-f182.google.com (HELO mail-ig0-f182.google.com)
 (209.85.213.182) by sourceware.org
 (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256
 encrypted) ESMTPS; Mon, 23 May 2016 09:29:56 +0000
Received: by mail-ig0-f182.google.com with SMTP id bi2so30096278igb.0 for
 <gcc-patches@gcc.gnu.org>; Mon, 23 May 2016 02:29:55 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net;
 s=20130820;
 h=x-gm-message-state:mime-version:in-reply-to:references:date
 :message-id:subject:from:to:cc;
 bh=OVJT4wKy1XDjIs+5005QX7dT/jd54kZkBa/EnVNw7Zo=;
 b=gVYQhkiptZfYsNW2iYZf/O1kFiCVqdYFWxdrmm5rHQCUkg/3Cmg2pwMCmhHnNCE2Ue
 hm0qnAW9CS0SZDUhpOtv5WSA32v0HUZ6Zn/hhzWGg2TuCTWv0F66jU6KcB/OT2WC6iOg
 /D1yhTT26p1q+mvN3EK/MfO85+UW0xUbV2JMVzsJL69PwwSoaf8WMFIielwJ8FnNN9ja
 Z6YCYeShwQRI2/0i0O7JMYtSeAPhSBa3C/sb6vDD1kERT7ckIGkgE5ddRdpwrU79V6Oa
 xR5wuXh141rYOrdlqa30UBBR0l7G0jgfAtrq+PAYUd9EW8U91eqx+OliiVNv/VUEkZzQ
 CNcA==
X-Gm-Message-State: AOPr4FWc4edp7AHChRz5fCiLZi/Z2GS9mnu5OSSfAna/OPoXXzT28M0kDGt7F8rA1Hlbsya5xUKKW3Rol1qDp6pl
MIME-Version: 1.0
X-Received: by 10.50.3.103 with SMTP id b7mr12363781igb.46.1463995792388;
 Mon, 23 May 2016 02:29:52 -0700 (PDT)
Received: by 10.36.236.5 with HTTP; Mon, 23 May 2016 02:29:52 -0700 (PDT)
In-Reply-To: <CAAgBjM=X8rXnDnNV4-sWC_0fCt9hcvuOx2-hJ=vwsAb1oBq22Q@mail.gmail.com>
References: <CAAgBjMk0Hdask2JU8xs4fj_Ai1e0ggxB+h3ayb=NOGQBYJ8ccQ@mail.gmail.com>	<55BB4127.5050202@foss.arm.com>	<CAAgBjMnteaOO=JFifLD1CG7s-JE5r=65wcQm+HBwSMtpwg3ReA@mail.gmail.com>	<CAJA7tRaf8sZHmzHCt0n8rGJKt4bQycfSU+ABmb8rbcKb1VFQ=A@mail.gmail.com>	<CAAgBjM=X8rXnDnNV4-sWC_0fCt9hcvuOx2-hJ=vwsAb1oBq22Q@mail.gmail.com>
Date: Mon, 23 May 2016 14:59:52 +0530
Message-ID: <CAAgBjMkh=0tyzaMjqpSQg_mV8GYH-Tx9WV5xjHs2k0H0=UeqtQ@mail.gmail.com>
Subject: Re: [ARM] implement division using vrecpe/vrecps with
 -funsafe-math-optimizations
From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
To: Ramana Radhakrishnan <ramana.gcc@googlemail.com>
Cc: Ramana Radhakrishnan <ramana.radhakrishnan@foss.arm.com>,
 gcc Patches <gcc-patches@gcc.gnu.org>,
 Charles Baylis <charles.baylis@linaro.org>
X-IsSubscribed: yes

On 5 February 2016 at 18:40, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
> On 4 February 2016 at 16:31, Ramana Radhakrishnan
> <ramana.gcc@googlemail.com> wrote:
>> On Sun, Jan 17, 2016 at 9:06 AM, Prathamesh Kulkarni
>> <prathamesh.kulkarni@linaro.org> wrote:
>>> On 31 July 2015 at 15:04, Ramana Radhakrishnan
>>> <ramana.radhakrishnan@foss.arm.com> wrote:
>>>>
>>>>
>>>> On 29/07/15 11:09, Prathamesh Kulkarni wrote:
>>>>> Hi,
>>>>> This patch tries to implement division with multiplication by
>>>>> reciprocal using vrecpe/vrecps
>>>>> with -funsafe-math-optimizations and -freciprocal-math enabled.
>>>>> Tested on arm-none-linux-gnueabihf using qemu.
>>>>> OK for trunk ?
>>>>>
>>>>> Thank you,
>>>>> Prathamesh
>>>>>
>>>>
>>>> I've tried this in the past and never been convinced that 2 iterations are enough to get to stability with this given that the results are only precise for 8 bits / iteration. Thus I've always believed you need 3 iterations rather than 2 at which point I've never been sure that it's worth it. So the testing that you've done with this currently is not enough for this to go into the tree.
>>>>
>>>> I'd like this to be tested on a couple of different AArch32 implementations with a wider range of inputs to verify that the results are acceptable as well as running something like SPEC2k(6) with atleast one iteration to ensure correctness.
>>> Hi,
>>> I got results of SPEC2k6 fp benchmarks:
>>> a15: +0.64% overall, 481.wrf: +6.46%
>>> a53: +0.21% overall, 416.gamess: -1.39%, 481.wrf: +6.76%
>>> a57: +0.35% overall, 481.wrf: +3.84%
>>> The other benchmarks had (almost) identical results.
>>
>> Thanks for the benchmarking results -  Please repost the patch with
>> the changes that I had requested in my previous review - given it is
>> now stage4 , I would rather queue changes like this for stage1 now.
> Hi,
> Please find the updated patch attached.
> It passes testsuite for arm-none-linux-gnueabi, arm-none-linux-gnueabihf and
> arm-none-eabi.
> However the test-case added in the patch (neon-vect-div-1.c) fails to
> get vectorized at -O2
> for armeb-none-linux-gnueabihf.
> Charles suggested me to try with -O3, which worked.
> It appears the test-case fails to get vectorized with
> -fvect-cost-model=cheap (which is default enabled at -O2)
> and passes for -fno-vect-cost-model / -fvect-cost-model=dynamic
>
> I can't figure out why it fails -fvect-cost-model=cheap.
> From the vect dump (attached):
> neon-vect-div-1.c:12:3: note: Setting misalignment to -1.
> neon-vect-div-1.c:12:3: note: not vectorized: unsupported unaligned load.*_9
Hi,
I think I have some idea why the test-case fails attached with patch
fail to get vectorized on armeb with -O2.

Issue with big endian vectorizer:
The patch does not cause regressions on big endian vectorizer but
fails to vectorize the test-cases attached with the patch, while they
get vectorized on
litttle-endian.
Fails with armeb with the following message in dump:
note: not vectorized: unsupported unaligned load.*_9

The behavior of big and little endian vectorizer seems to be different
in arm_builtin_support_vector_misalignment() which overrides the hook
targetm.vectorize.support_vector_misalignment().

targetm.vectorize.support_vector_misalignment is called by
vect_supportable_dr_alignment () which in turn is called
by verify_data_refs_alignment ().

Execution upto following condition is common between arm and armeb
in vect_supportable_dr_alignment():

if ((TYPE_USER_ALIGN (type) && !is_packed)
      || targetm.vectorize.support_vector_misalignment (mode, type,
                                            DR_MISALIGNMENT (dr), is_packed))
        /* Can't software pipeline the loads, but can at least do them.  */
        return dr_unaligned_supported;

For little endian case:
arm_builtin_support_vector_misalignment() is called with
V2SF mode and misalignment == -1, and the following condition
becomes true:
/* If the misalignment is unknown, we should be able to handle the access
         so long as it is not to a member of a packed data structure.  */
  if (misalignment == -1)
    return true;

Since the hook returned true we enter the condition above in
vect_supportable_dr_alignment() and return dr_unaligned_supported;

For big-endian:
arm_builtin_support_vector_misalignment() is called with V2SF mode.
The following condition that gates the entire function body fails:
 if (TARGET_NEON && !BYTES_BIG_ENDIAN && unaligned_access)
and the default hook gets called with V2SF mode and the default hook
returns false because
movmisalign_optab does not exist for V2SF mode.

So the condition above in vect_supportable_dr_alignment() fails
and we come here:
 /* Unsupported.  */
return dr_unaligned_unsupported;

And hence we get the unaligned load not supported message in the dump
for armeb in verify_data_ref_alignment ():

static bool
verify_data_ref_alignment (data_reference_p dr)
{
  enum dr_alignment_support supportable_dr_alignment
    = vect_supportable_dr_alignment (dr, false);
  if (!supportable_dr_alignment)
    {
      if (dump_enabled_p ())
        {
          if (DR_IS_READ (dr))
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "not vectorized: unsupported unaligned load.");

With -O3, the test-cases vectorize for armeb, because loop peeling for alignment
is turned on.
The above behavior is also reproducible with test-case which is
irrelevant to the patch.
for instance, we get the same unsupported unaligned load for following
test-case (replaced / with +)

void
foo (int len, float * __restrict p, float *__restrict x)
{
  len = len & ~31;
  for (int i = 0; i < len; i++)
    p[i] = p[i] + x[i];
}
Is the patch OK to commit after bootstrap+test ?

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
>>
>> Thanks,
>> Ramana
>>
>>>
>>> Thanks,
>>> Prathamesh
>>>>
>>>>
>>>> moving on to the patches.
>>>>
>>>>> diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
>>>>> index 654d9d5..28c2e2a 100644
>>>>> --- a/gcc/config/arm/neon.md
>>>>> +++ b/gcc/config/arm/neon.md
>>>>> @@ -548,6 +548,32 @@
>>>>>                      (const_string "neon_mul_<V_elem_ch><q>")))]
>>>>>  )
>>>>>
>>>>
>>>> Please add a comment here.
>>>>
>>>>> +(define_expand "div<mode>3"
>>>>> +  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
>>>>> +        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
>>>>> +               (match_operand:VCVTF 2 "s_register_operand" "w")))]
>>>>
>>>> I want to double check that this doesn't collide with Alan's patches for FP16 especially if he reuses the VCVTF iterator for all the vcvt f16 cases.
>>>>
>>>>> +  "TARGET_NEON && flag_unsafe_math_optimizations && flag_reciprocal_math"
>>>>> +  {
>>>>> +    rtx rec = gen_reg_rtx (<MODE>mode);
>>>>> +    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
>>>>> +
>>>>> +    /* Reciprocal estimate */
>>>>> +    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
>>>>> +
>>>>> +    /* Perform 2 iterations of Newton-Raphson method for better accuracy */
>>>>> +    for (int i = 0; i < 2; i++)
>>>>> +      {
>>>>> +     emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
>>>>> +     emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
>>>>> +      }
>>>>> +
>>>>> +    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec */
>>>>> +    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
>>>>> +    DONE;
>>>>> +  }
>>>>> +)
>>>>> +
>>>>> +
>>>>>  (define_insn "mul<mode>3add<mode>_neon"
>>>>>    [(set (match_operand:VDQW 0 "s_register_operand" "=w")
>>>>>          (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-1.c b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>> new file mode 100644
>>>>> index 0000000..e562ef3
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-1.c
>>>>> @@ -0,0 +1,14 @@
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -ftree-vectorize -fdump-tree-vect-all" } */
>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>
>>>> No this is wrong.
>>>>
>>>> What is armv8 specific about this test ? This is just like another test that is for Neon. vrecpe / vrecps are not instructions that were introduced in the v8 version of the architecture. They've existed in the base Neon instruction set. The code generation above in the patterns will be enabled when TARGET_NEON is true which can happen when -mfpu=neon -mfloat-abi={softfp/hard} is true.
>>>>
>>>>> +
>>>>> +void
>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>> +{
>>>>> +  len = len & ~31;
>>>>> +  for (int i = 0; i < len; i++)
>>>>> +    p[i] = p[i] / x[i];
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
>>>>> diff --git a/gcc/testsuite/gcc.target/arm/vect-div-2.c b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>> new file mode 100644
>>>>> index 0000000..8e15d0a
>>>>> --- /dev/null
>>>>> +++ b/gcc/testsuite/gcc.target/arm/vect-div-2.c
>>>>> @@ -0,0 +1,14 @@
>>>>> +/* { dg-do compile } */
>>>>> +/* { dg-require-effective-target arm_v8_neon_ok } */
>>>>
>>>> And likewise.
>>>>
>>>>> +/* { dg-options "-O2 -funsafe-math-optimizations -fno-reciprocal-math -ftree-vectorize -fdump-tree-vect-all" } */
>>>>> +/* { dg-add-options arm_v8_neon } */
>>>>> +
>>>>> +void
>>>>> +foo (int len, float * __restrict p, float *__restrict x)
>>>>> +{
>>>>> +  len = len & ~31;
>>>>> +  for (int i = 0; i < len; i++)
>>>>> +    p[i] = p[i] / x[i];
>>>>> +}
>>>>> +
>>>>> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */
>>>>
>>>>
>>>> regards
>>>> Ramana

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 6b4896d..862e31b 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -578,6 +578,38 @@
                     (const_string "neon_mul_<V_elem_ch><q>")))]
 )
 
+/* Perform division using multiply-by-reciprocal. 
+   Reciprocal is calculated using Newton-Raphson method.
+   Enabled with -funsafe-math-optimizations -freciprocal-math
+   and disabled for -Os since it increases code size .  */
+
+(define_expand "div<mode>3"
+  [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+        (div:VCVTF (match_operand:VCVTF 1 "s_register_operand" "w")
+		  (match_operand:VCVTF 2 "s_register_operand" "w")))]
+  "TARGET_NEON && !optimize_size
+   && flag_unsafe_math_optimizations && flag_reciprocal_math"
+  {
+    rtx rec = gen_reg_rtx (<MODE>mode);
+    rtx vrecps_temp = gen_reg_rtx (<MODE>mode);
+
+    /* Reciprocal estimate.  */
+    emit_insn (gen_neon_vrecpe<mode> (rec, operands[2]));
+
+    /* Perform 2 iterations of newton-raphson method.  */
+    for (int i = 0; i < 2; i++)
+      {
+	emit_insn (gen_neon_vrecps<mode> (vrecps_temp, rec, operands[2]));
+	emit_insn (gen_mul<mode>3 (rec, rec, vrecps_temp));
+      }
+
+    /* We now have reciprocal in rec, perform operands[0] = operands[1] * rec.  */
+    emit_insn (gen_mul<mode>3 (operands[0], operands[1], rec));
+    DONE;
+  }
+)
+
+
 (define_insn "mul<mode>3add<mode>_neon"
   [(set (match_operand:VDQW 0 "s_register_operand" "=w")
         (plus:VDQW (mult:VDQW (match_operand:VDQW 2 "s_register_operand" "w")
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
new file mode 100644
index 0000000..dae724a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-1.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
new file mode 100644
index 0000000..0450b70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/neon-vect-div-2.c
@@ -0,0 +1,16 @@
+/* Test pattern div<mode>3.  */
+
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_neon_ok } */
+/* { dg-options "-O3 -funsafe-math-optimizations -fno-reciprocal-math -fdump-tree-vect-all" } */
+/* { dg-add-options arm_neon } */
+
+void
+foo (int len, float * __restrict p, float *__restrict x)
+{
+  len = len & ~31;
+  for (int i = 0; i < len; i++)
+    p[i] = p[i] / x[i];
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */