From patchwork Mon Nov 21 21:34:00 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Doug Gilmore X-Patchwork-Id: 83303 Delivered-To: patch@linaro.org Received: by 10.140.97.165 with SMTP id m34csp1778756qge; Mon, 21 Nov 2016 13:34:38 -0800 (PST) X-Received: by 10.99.253.85 with SMTP id m21mr21576585pgj.38.1479764078181; Mon, 21 Nov 2016 13:34:38 -0800 (PST) Return-Path: Received: from sourceware.org (server1.sourceware.org. [209.132.180.131]) by mx.google.com with ESMTPS id y26si24539531pfk.69.2016.11.21.13.34.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 21 Nov 2016 13:34:38 -0800 (PST) Received-SPF: pass (google.com: domain of gcc-patches-return-442176-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) client-ip=209.132.180.131; Authentication-Results: mx.google.com; dkim=pass header.i=@gcc.gnu.org; spf=pass (google.com: domain of gcc-patches-return-442176-patch=linaro.org@gcc.gnu.org designates 209.132.180.131 as permitted sender) smtp.mailfrom=gcc-patches-return-442176-patch=linaro.org@gcc.gnu.org DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :subject:to:references:cc:from:message-id:date:mime-version :in-reply-to:content-type; q=dns; s=default; b=hFCcMQrrdNOmqxQQt XSgBxVPD+tcH+VIAZDLuytwpsx/cfnFswqnqd/jkRgWQXqoO901oYpJ244loKybN qeA4LcJ1/HQStBKS79SUPLjVHf+ogZSTdr8qLjwxkzAYzFNlcCDOaqg3BrbA2vPR lK5QzB0zg1Xys6qRWPUIM9VrJI= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :subject:to:references:cc:from:message-id:date:mime-version :in-reply-to:content-type; s=default; bh=oK3m4WD5of6lAYJHfctkoFW uWgM=; b=nKyUZzv8xtB0QrH4PpZwoiJa/mjVUB9wRRqle1DZ5OlWIDtcMI5Low1 8U7xDEtsra3xpMnPX4ZyGGdmkqeAn1s7dvt7Hy2Xd1W99TmYVeSU4aR7n/S1VGpt oGNoVg1MHvc0hYYuOtEzKjU1B4+L/dlHwwD+qCqaBsRAjvdKeVg0= Received: (qmail 93316 invoked by alias); 21 Nov 2016 21:34:17 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 93248 invoked by uid 89); 21 Nov 2016 21:34:16 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-4.9 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_NONE, RP_MATCHES_RCVD, SPF_PASS autolearn=ham version=3.3.2 spammy=appearance, modifications, Gilmore, gilmore X-HELO: mailapp01.imgtec.com Received: from mailapp01.imgtec.com (HELO mailapp01.imgtec.com) (195.59.15.196) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 21 Nov 2016 21:34:05 +0000 Received: from hhmail02.hh.imgtec.org (unknown [10.100.10.20]) by Forcepoint Email with ESMTPS id EA2544C9F6D4F; Mon, 21 Nov 2016 21:33:57 +0000 (GMT) Received: from BAMAIL02.ba.imgtec.org (10.20.40.28) by hhmail02.hh.imgtec.org (10.100.10.20) with Microsoft SMTP Server (TLS) id 14.3.294.0; Mon, 21 Nov 2016 21:34:02 +0000 Received: from [10.20.2.42] (10.20.2.42) by bamail02.ba.imgtec.org (10.20.40.28) with Microsoft SMTP Server id 14.3.266.1; Mon, 21 Nov 2016 13:34:00 -0800 Subject: Re: [PATCH PR68030/PR69710][RFC]Introduce a simple local CSE interface and use it in vectorizer To: Jeff Law , Bin Cheng , "gcc-patches@gcc.gnu.org" References: <6688966c-6744-8af3-1255-961b3ff19ca7@redhat.com> CC: nd From: Doug Gilmore Message-ID: Date: Mon, 21 Nov 2016 13:34:00 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <6688966c-6744-8af3-1255-961b3ff19ca7@redhat.com> I haven't seen any followups to this discussion of Bin's patch to PR68303 and PR69710, the patch submission: http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02000.html Discussion: http://gcc.gnu.org/ml/gcc-patches/2016-07/msg00761.html http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01551.html http://gcc.gnu.org/ml/gcc-patches/2016-06/msg00372.html http://gcc.gnu.org/ml/gcc-patches/2016-06/msg01550.html http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02162.html http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02155.html http://gcc.gnu.org/ml/gcc-patches/2016-05/msg02154.html so I did some investigation to get a better understanding of the issues involved. On 07/13/2016 01:59 PM, Jeff Law wrote: > On 05/25/2016 05:22 AM, Bin Cheng wrote: >> Hi, As analyzed in PR68303 and PR69710, vectorizer generates >> duplicated computations in loop's pre-header basic block when >> creating base address for vector reference to the same memory object. > Not a huge surprise. Loop optimizations generally have a tendency > to create and/or expose CSE opportunities. Unrolling is a common > culprit, there's certainly the possibility for header duplication, > code motions and IV rewriting to also expose/create redundant code. > > ... > > But, 1) It >> doesn't fix all the problem on x86_64. Root cause is computation for >> base address of the first reference is somehow moved outside of >> loop's pre-header, local CSE can't help in this case. > That's a bid odd -- have you investigated why this is outside the loop header? > ... I didn't look at this issue per se, but I did try running DOM between autovectorization and IVS. Just running DOM had little effect, what was crucial was adding the change Bin mentioned in his original message: Besides CSE issue, this patch also re-associates address expressions in vect_create_addr_base_for_vector_ref, specifically, it splits constant offset and adds it back near the expression root in IR. This is necessary because GCC only handles re-association for commutative operators in CSE. I attached a patch for these changes only. These are the important modifications that address the some of the IVS related issues exposed by PR68303. I found that adding the CSE change (or calling DOM between autovectorization and IVOPTS) is not needed, and from what I have seen, actually makes the code worse. Applying only the modifications to vect_create_addr_base_for_vector_ref, additional simplifications will be done when induction variables are found (function find_induction_variables). These simplications are indicated by the appearance of lines: Applying pattern match.pd:1056, generic-match.c:11865 in the IVOPS dump file. Now IVOPTs transforms the code so that constants now appear in the computation of the effective addresses for the memory OPs. However the code generated by IVOPTS still uses a separate base register for each memory reference. Later DOM3 transforms the code to use just one base register, which is the form the code needs to be in for the preliminary phase of IVOPTs where "IV uses" associated with memory OPs are placed into groups. At the time of this grouping, checks are done to ensure that for each member of a group the constant offsets don't overflow the immediate fields in actual machine instructions (more in this see * below). Currently it appears that an IV is generated for each memory reference. Instead of generating a new IV for each memory reference, we could try to detect that value of the new IV is just a constant offset of an existing IV and just generate a new temp reflecting that. I haven't worked through what needs to be done to implement that, but for the issue in PR69710 (saxpy example where the same IV should be used for a load and store) is straightforward to implement so since work has already been done in during data dependence analysis to detect this situation. I attached a patch for PR69710 that was bootstrapped and tested on X86_64 without errors. It does appear that it needs more testing, since I did notice SPEC 2006 h264ref produces different results with the patch applied, which I still need to investigate. Doug * Note that when IV uses are grouped, only positive constant offsets constraints are considered. That negative offsets can be used are reflected in the costs of using a different IV than the IV associated with a particular group. Thus once the optimal IV set is found, a different IV may chosen, which causes negative constant offsets to be used. >From 68d783d5b8e3a6238a7e7b25d76fc6e457bfc484 Mon Sep 17 00:00:00 2001 From: Doug Gilmore Date: Fri, 18 Nov 2016 13:29:47 -0800 Subject: [PATCH] Another prototype fix to PR69710. Would need to be extended to handle PR68030. Tested on r240609 several weeks ago: bootstrapped and no make check regressions on x86_64-unknown-linux-gnu, though a run-time error on SPEC 2006 h264ref at -O3 with --size=test was observed. --- gcc/tree-data-ref.c | 1 + gcc/tree-data-ref.h | 4 ++++ gcc/tree-vect-data-refs.c | 50 ++++++++++++++++++++++++++++++++++++++++++++--- gcc/tree-vect-loop.c | 2 ++ gcc/tree-vectorizer.h | 5 +++++ 5 files changed, 59 insertions(+), 3 deletions(-) diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c index 8152da3..7a4850c 100644 --- a/gcc/tree-data-ref.c +++ b/gcc/tree-data-ref.c @@ -1492,6 +1492,7 @@ initialize_data_dependence_relation (struct data_reference *a, DDR_SUBSCRIPTS (res).create (0); DDR_DIR_VECTS (res).create (0); DDR_DIST_VECTS (res).create (0); + DDR_PREHEADER_IV (res) = NULL_TREE; if (a == NULL || b == NULL) { diff --git a/gcc/tree-data-ref.h b/gcc/tree-data-ref.h index 856dd58..ca67d93 100644 --- a/gcc/tree-data-ref.h +++ b/gcc/tree-data-ref.h @@ -264,6 +264,9 @@ struct data_dependence_relation /* Set to true when the dependence relation is on the same data access. */ bool self_reference_p; + + /* For zero distance dependencies, use the same IV in the preheader. */ + tree preheader_iv; }; typedef struct data_dependence_relation *ddr_p; @@ -294,6 +297,7 @@ typedef struct data_dependence_relation *ddr_p; #define DDR_DIST_VECT(DDR, I) \ DDR_DIST_VECTS (DDR)[I] #define DDR_REVERSED_P(DDR) DDR->reversed_p +#define DDR_PREHEADER_IV(DDR) DDR->preheader_iv bool dr_analyze_innermost (struct data_reference *, struct loop *); diff --git a/gcc/tree-vect-data-refs.c b/gcc/tree-vect-data-refs.c index 5a30314..12cfbec 100644 --- a/gcc/tree-vect-data-refs.c +++ b/gcc/tree-vect-data-refs.c @@ -394,6 +394,7 @@ vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr, } } + LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).safe_push (ddr); continue; } @@ -4492,10 +4493,53 @@ vect_create_data_ref_ptr (gimple *stmt, tree aggr_type, struct loop *at_loop, /* (2) Calculate the initial address of the aggregate-pointer, and set the aggregate-pointer to point to it before the loop. */ - /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader. */ + /* Create: (&(base[init_val+offset]+byte_offset) in the loop preheader, + taking care to use reuse IVs. */ - new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list, - offset, loop, byte_offset); + bool need_new_temp = true; + ddr_p zd_ddr = (ddr_p) NULL; + if (loop_vinfo) + { + int i; + ddr_p ddr; + + + FOR_EACH_VEC_ELT (LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo), i, ddr) + { + if (DDR_A(ddr) == dr || DDR_B(ddr) == dr) + { + zd_ddr = ddr; + break; + } + } + + if (zd_ddr && DDR_PREHEADER_IV (zd_ddr)) + { + new_temp = DDR_PREHEADER_IV (zd_ddr); + need_new_temp = false; + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, "reuse "); + dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp); + dump_printf (MSG_NOTE, "\n"); + } + } + } + + if (need_new_temp) + new_temp = vect_create_addr_base_for_vector_ref (stmt, &new_stmt_list, + offset, loop, byte_offset); + if (zd_ddr) + { + DDR_PREHEADER_IV (zd_ddr) = new_temp; + if (dump_enabled_p ()) + { + dump_printf_loc (MSG_NOTE, vect_location, "save "); + dump_generic_expr (MSG_NOTE, TDF_SLIM, new_temp); + dump_printf (MSG_NOTE, "\n"); + } + } + if (new_stmt_list) { if (pe) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 4150b0d..0bf67e9 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1173,6 +1173,7 @@ new_loop_vec_info (struct loop *loop) LOOP_VINFO_PEELING_FOR_NITER (res) = false; LOOP_VINFO_OPERANDS_SWAPPED (res) = false; LOOP_VINFO_ORIG_LOOP_INFO (res) = NULL; + LOOP_VINFO_ZERO_DIST_DEPS (res) = vNULL; return res; } @@ -1271,6 +1272,7 @@ destroy_loop_vec_info (loop_vec_info loop_vinfo, bool clean_stmts) LOOP_VINFO_GROUPED_STORES (loop_vinfo).release (); LOOP_VINFO_REDUCTIONS (loop_vinfo).release (); LOOP_VINFO_REDUCTION_CHAINS (loop_vinfo).release (); + LOOP_VINFO_ZERO_DIST_DEPS (loop_vinfo).release (); destroy_cost_data (LOOP_VINFO_TARGET_COST_DATA (loop_vinfo)); loop_vinfo->scalar_cost_vec.release (); diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 2a7fa0a..7fcc824 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -339,6 +339,10 @@ typedef struct _loop_vec_info : public vec_info { this points to the original vectorized loop. Otherwise NULL. */ _loop_vec_info *orig_loop_info; + /* Dependency relations that have zero distance, thus the associated + memory references can share the same IV. */ + vec zero_dist_deps; + } *loop_vec_info; /* Access Functions. */ @@ -379,6 +383,7 @@ typedef struct _loop_vec_info : public vec_info { #define LOOP_VINFO_SCALAR_ITERATION_COST(L) (L)->scalar_cost_vec #define LOOP_VINFO_SINGLE_SCALAR_ITERATION_COST(L) (L)->single_scalar_iteration_cost #define LOOP_VINFO_ORIG_LOOP_INFO(L) (L)->orig_loop_info +#define LOOP_VINFO_ZERO_DIST_DEPS(L) (L)->zero_dist_deps #define LOOP_REQUIRES_VERSIONING_FOR_ALIGNMENT(L) \ ((L)->may_misalign_stmts.length () > 0) -- 1.9.1