From patchwork Mon Oct 24 12:09:28 2016
X-Patchwork-Submitter: Martin Liška
X-Patchwork-Id: 78934
Subject: Re: [RFC] Speed-up -fprofile-update=atomic
To: Richard Biener
Cc: Jakub Jelinek, Andi Kleen, Jeff Law, Nathan Sidwell, GCC Patches,
"Hubicha, Jan" From: =?UTF-8?Q?Martin_Li=c5=a1ka?= Message-ID: <0dc8a029-696a-39b0-96cc-ab44488607eb@suse.cz> Date: Mon, 24 Oct 2016 14:09:28 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: X-IsSubscribed: yes On 10/17/2016 02:03 PM, Richard Biener wrote: > On Mon, Oct 17, 2016 at 1:46 PM, Martin Liška wrote: >> On 10/13/2016 11:43 AM, Richard Biener wrote: >>> On Wed, Oct 12, 2016 at 3:52 PM, Martin Liška wrote: >>>> On 10/04/2016 11:45 AM, Richard Biener wrote: >>>>> On Thu, Sep 15, 2016 at 12:00 PM, Martin Liška wrote: >>>>>> On 09/07/2016 02:09 PM, Richard Biener wrote: >>>>>>> On Wed, Sep 7, 2016 at 1:37 PM, Martin Liška wrote: >>>>>>>> On 08/18/2016 06:06 PM, Richard Biener wrote: >>>>>>>>> On August 18, 2016 5:54:49 PM GMT+02:00, Jakub Jelinek wrote: >>>>>>>>>> On Thu, Aug 18, 2016 at 08:51:31AM -0700, Andi Kleen wrote: >>>>>>>>>>>> I'd prefer to make updates atomic in multi-threaded applications. >>>>>>>>>>>> The best proxy we have for that is -pthread. >>>>>>>>>>>> >>>>>>>>>>>> Is it slower, most definitely, but odds are we're giving folks >>>>>>>>>>>> garbage data otherwise, which in many ways is even worse. >>>>>>>>>>> >>>>>>>>>>> It will likely be catastrophically slower in some cases. >>>>>>>>>>> >>>>>>>>>>> Catastrophically as in too slow to be usable. >>>>>>>>>>> >>>>>>>>>>> An atomic instruction is a lot more expensive than a single >>>>>>>>>> increment. Also >>>>>>>>>>> they sometimes are really slow depending on the state of the machine. >>>>>>>>>> >>>>>>>>>> Can't we just have thread-local copies of all the counters (perhaps >>>>>>>>>> using >>>>>>>>>> __thread pointer as base) and just atomically merge at thread >>>>>>>>>> termination? >>>>>>>>> >>>>>>>>> I suggested that as well but of course it'll have its own class of issues (short lived threads, so we need to somehow re-use counters from terminated threads, large number of threads and thus using too much memory for the counters) >>>>>>>>> >>>>>>>>> Richard. >>>>>>>> >>>>>>>> Hello. >>>>>>>> >>>>>>>> I've got written the approach on my TODO list, let's see whether it would be doable in a reasonable amount of time. >>>>>>>> >>>>>>>> I've just finished some measurements to illustrate slow-down of -fprofile-update=atomic approach. >>>>>>>> All numbers are: no profile, -fprofile-generate, -fprofile-generate -fprofile-update=atomic >>>>>>>> c-ray benchmark (utilizing 8 threads, -O3): 1.7, 15.5., 38.1s >>>>>>>> unrar (utilizing 8 threads, -O3): 3.6, 11.6, 38s >>>>>>>> tramp3d (1 thread, -O3): 18.0, 46.6, 168s >>>>>>>> >>>>>>>> So the slow-down is roughly 300% compared to -fprofile-generate. I'm not having much experience with default option >>>>>>>> selection, but these numbers can probably help. >>>>>>>> >>>>>>>> Thoughts? >>>>>>> >>>>>>> Look at the generated code for an instrumented simple loop and see that for >>>>>>> the non-atomic updates we happily apply store-motion to the counter update >>>>>>> and thus we only get one counter update per loop exit rather than one per >>>>>>> loop iteration. Now see what happens for the atomic case (I suspect you >>>>>>> get one per iteration). >>>>>>> >>>>>>> I'll bet this accounts for most of the slowdown. >>>>>>> >>>>>>> Back in time ICC which had atomic counter updates (but using function >>>>>>> calls - ugh!) had a > 1000% overhead with FDO for tramp3d (they also >>>>>>> didn't have early inlining -- removing abstraction helps reducing the number >>>>>>> of counters significantly). 
>>>>>>>
>>>>>>> Richard.
>>>>>>
>>>>>> Hi.
>>>>>>
>>>>>> During Cauldron I discussed with Richi approaches for speeding up ARCS
>>>>>> profile counter updates. My first attempt is to utilize TLS storage,
>>>>>> where every function accumulates its arcs counters. These are
>>>>>> eventually added (using atomic operations) to the global ones at the
>>>>>> very end of a function. Currently I rely on target support for TLS;
>>>>>> it is questionable whether to make that a requirement for
>>>>>> -fprofile-update=atomic, or to add a new option value like
>>>>>> -fprofile-update=atomic-tls?
>>>>>>
>>>>>> Running the patch on tramp3d, compared to the previous numbers, it
>>>>>> takes 88s to finish. Time shrinks to 50% of the current
>>>>>> implementation.
>>>>>>
>>>>>> Thoughts?
>>>>>
>>>>> Hmm, I thought I suggested that you can simply use automatic storage
>>>>> (which effectively is TLS...) for regions that are not forked or
>>>>> abnormally left (which means SESE regions that have no calls that
>>>>> eventually terminate or throw externally).
>>>>>
>>>>> So why did you end up with TLS?
>>>>
>>>> Hi.
>>>>
>>>> Using TLS does not make sense -- stupid mistake ;)
>>>>
>>>> By using SESE regions, do you mean the infrastructure that is utilized
>>>> by the Graphite machinery?
>>>
>>> No, just a "single-entry single-exit region", which means the placement
>>> of the initializations of the internal counters to zero and of the
>>> updates of the actual counters is "obvious".
>>>
>>> Note that this "optimization" isn't one if the SESE region does not
>>> contain cycle(s), unless there is a way to do an atomic update of a
>>> bunch of counters faster than doing them separately. This optimization
>>> will also increase register pressure (or force the internal counters to
>>> the stack). Thus selecting which counters to "optimize" and which ones
>>> to leave in place might be necessary.
>>
>> Ok, I must admit the selection of which counters to optimize is crucial.
>> The current implementation (atomic increments at places where a BB is
>> reached) has the advantage that it does not increase register pressure
>> and it does not update a global arcs counter when the BB is not visited.
>> On the other hand, having local counters which are updated at function
>> exit (my current implementation) possibly updates counters for BBs that
>> were not seen, and it creates a huge memory-locking spot if multiple
>> threads call a function very often. This is a perf report of the c-ray
>> benchmark:
>>
>>          │     if((d = SQ(b) - 4.0 * a * c) < 0.0) return 0;
>>     0.01 │3c0: xor    %ecx,%ecx
>>     0.00 │3c2: mov    0x110(%rsp),%rdx
>>     0.00 │     lock   add %rdx,__gcov0.ray_sphere
>>     7.96 │     mov    0x118(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x8
>>    11.39 │     mov    0x120(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x10
>>    11.09 │     mov    0x128(%rsp),%rdx
>>     0.00 │     lock   add %rdx,__gcov0.ray_sphere+0x18
>>    11.27 │     mov    0x130(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x20
>>    11.02 │     mov    0x138(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x28
>>    11.46 │     mov    0x140(%rsp),%rdx
>>     0.00 │     lock   add %rdx,__gcov0.ray_sphere+0x30
>>    11.84 │     mov    0x148(%rsp),%rdx
>>     0.00 │     lock   add %rdx,__gcov0.ray_sphere+0x38
>>    11.57 │     mov    0x150(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x40
>>     6.86 │     mov    0x158(%rsp),%rdx
>>          │     lock   add %rdx,__gcov0.ray_sphere+0x48
>>
>> My current approach does an atomic increment when maybe_hot_bb_p returns
>> false, and local counters are used otherwise. The question is how to
>> find a better place to initialize and store local counter values?
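A rough C sketch of the scheme measured above (names are illustrative; the
real counters live in the per-function __gcov0.<function> arrays):

  #define N_COUNTERS 10

  static long long global_ctrs[N_COUNTERS]; /* stand-in for __gcov0.ray_sphere */

  int ray_sphere_like (void)
  {
    long long local[N_COUNTERS] = { 0 };
    /* ... the function body bumps local[i] with plain increments ... */
    /* Flush at function exit: with 8 threads, these atomic adds all
       contend on the same cache lines -- the "lock add" hot spot in
       the perf report above.  */
    for (int i = 0; i < N_COUNTERS; i++)
      __atomic_fetch_add (&global_ctrs[i], local[i], __ATOMIC_RELAXED);
    return 0;
  }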
>
> Given that the main reason for using local counters is loops, you can use
> loop information and put counter initializations in the loop preheader
> and stores on the loop exit edges (basically what loop store motion does
> w/o atomic updates). Store motion knows to ignore any aliasing with
> counters and thus is _very_ aggressive with doing this (including all
> loop nests, but of course not across possible (abnormal) function exits).
>
> While profile instrumentation is quite late, it is of course still before
> IPA inlining. Thus loop store motion w/o atomic updates possibly still
> sees more opportunities than this. I wonder if we might want to leave the
> optimization to the regular optimizers by only lowering the actual
> counter kind late, starting with some IFN_UPDATE_COVERAGE_COUNTER which
> passes could handle (and merge).
>
> Anyway, I'd go for the loop trick and simply handle counter motion from
> innermost loops only. The final update place would be the common
> post-dominator of all exits (if you want to handle multiple exits). As
> said, you need to watch out for function termination that is not
> reflected by the CFG (external throws, exits in callees, etc.).
>
> Richard.

Hello Richard.

I've just finished my patch, where I came up with a new internal fn
(UPDATE_COVERAGE_COUNTER). The function is generated by the profile pass
and is handled by the lim pass. I originally tried to support the internal
fn in the lim machinery itself, but doing the specific loop motion
separately turned out to be easier to work with.

With the patch applied, tramp3d runs 1.8x slower with
-fprofile-update=atomic compared to -fprofile-update=single.

The patch can bootstrap on ppc64le-redhat-linux and survives regression
tests.

Thoughts?
Thanks,
Martin

>
>> Ideas welcomed.
>> Thanks,
>> Martin

From a9e76ff1bc8b69e23a50d8cdc9556270cf7b269d Mon Sep 17 00:00:00 2001
From: marxin
Date: Fri, 21 Oct 2016 13:26:19 +0200
Subject: [PATCH] Introduce loop store motion for UPDATE_COVERAGE_COUNTER

gcc/testsuite/ChangeLog:

2016-10-24  Martin Liska

	* gcc.dg/tree-ssa/ssa-lim-11.c: Update scanned pattern.

gcc/ChangeLog:

2016-10-24  Martin Liska

	* Makefile.in: Add new file profile-expand.c.
	* internal-fn.c (expand_UPDATE_COVERAGE_COUNTER): New IFN.
	* internal-fn.def: Likewise.
	* passes.def: Add new pass profile_expand.
	* profile-expand.c: New file.
	* tree-pass.h (make_pass_profile_expand): Declare a new function.
	* tree-profile.c (gimple_gen_edge_profiler): Generate the new
	internal fn.
	* tree-ssa-loop-im.c (loop_suitable_for_sm): Move to header file.
	(move_coverage_counter_update): New function.
	(process_sm_for_coverage_counter): Likewise.
	(pass_lim::execute): Call invariant motion for
	UPDATE_COVERAGE_COUNTER internal functions.
	* tree-ssa-loop.h: Move the function here from tree-ssa-loop-im.c.
	* value-prof.h: Declare expand_coverage_counter_ifns.
---
 gcc/Makefile.in                            |   1 +
 gcc/internal-fn.c                          |   6 ++
 gcc/internal-fn.def                        |   2 +
 gcc/passes.def                             |   1 +
 gcc/profile-expand.c                       | 143 ++++++++++++++++++++++++++
 gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c |   3 +-
 gcc/tree-pass.h                            |   1 +
 gcc/tree-profile.c                         |  37 +------
 gcc/tree-ssa-loop-im.c                     | 157 +++++++++++++++++++++++++----
 gcc/tree-ssa-loop.h                        |  18 ++++
 gcc/value-prof.h                           |   1 +
 11 files changed, 315 insertions(+), 55 deletions(-)
 create mode 100644 gcc/profile-expand.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index c512cd7..9bdb406 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1404,6 +1404,7 @@ OBJS = \
 	print-rtl-function.o \
 	print-tree.o \
 	profile.o \
+	profile-expand.o \
 	real.o \
 	realmpfr.o \
 	recog.o \
diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
index 0b32d5f..557d373 100644
--- a/gcc/internal-fn.c
+++ b/gcc/internal-fn.c
@@ -254,6 +254,12 @@ expand_FALLTHROUGH (internal_fn, gcall *call)
 		   "invalid use of attribute %<fallthrough%>");
 }
 
+static void
+expand_UPDATE_COVERAGE_COUNTER (internal_fn, gcall *)
+{
+  gcc_unreachable ();
+}
+
 /* Helper function for expand_addsub_overflow.  Return 1
    if ARG interpreted as signed in its precision is known to be always
    positive or 2 if ARG is known to be always negative, or 3 if ARG may
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index d4fbdb2..348fc2f 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -198,6 +198,8 @@ DEF_INTERNAL_FN (ATOMIC_COMPARE_EXCHANGE, ECF_LEAF | ECF_NOTHROW, NULL)
 /* To implement [[fallthrough]].  */
 DEF_INTERNAL_FN (FALLTHROUGH, ECF_LEAF | ECF_NOTHROW, NULL)
 
+DEF_INTERNAL_FN (UPDATE_COVERAGE_COUNTER, ECF_LEAF | ECF_NOTHROW, NULL)
+
 #undef DEF_INTERNAL_INT_FN
 #undef DEF_INTERNAL_FLT_FN
 #undef DEF_INTERNAL_OPTAB_FN
diff --git a/gcc/passes.def b/gcc/passes.def
index 1375254..4a22860 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -386,6 +386,7 @@ along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_asan_O0);
   NEXT_PASS (pass_tsan_O0);
   NEXT_PASS (pass_sanopt);
+  NEXT_PASS (pass_profile_expand);
   NEXT_PASS (pass_cleanup_eh);
   NEXT_PASS (pass_lower_resx);
   NEXT_PASS (pass_nrv);
diff --git a/gcc/profile-expand.c b/gcc/profile-expand.c
new file mode 100644
index 0000000..317fe1f
--- /dev/null
+++ b/gcc/profile-expand.c
@@ -0,0 +1,143 @@
+/* Profile expand pass.
+   Copyright (C) 2003-2016 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it
+under the terms of the GNU General Public License as published by the
+Free Software Foundation; either version 3, or (at your option) any
+later version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "memmodel.h"
+#include "backend.h"
+#include "target.h"
+#include "tree.h"
+#include "gimple.h"
+#include "cfghooks.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "coverage.h"
+#include "varasm.h"
+#include "tree-nested.h"
+#include "gimplify.h"
+#include "gimple-iterator.h"
+#include "gimplify-me.h"
+#include "tree-cfg.h"
+#include "tree-into-ssa.h"
+#include "value-prof.h"
+#include "profile.h"
+#include "tree-cfgcleanup.h"
+#include "params.h"
+
+void
+expand_coverage_counter_ifns (void)
+{
+  basic_block bb;
+  tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
+				  ? BUILT_IN_ATOMIC_FETCH_ADD_8:
+				  BUILT_IN_ATOMIC_FETCH_ADD_4);
+
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER))
+	    {
+	      tree addr = gimple_call_arg (stmt, 0);
+	      tree value = gimple_call_arg (stmt, 1);
+	      if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
+		{
+		  gcall *stmt
+		    = gimple_build_call (f, 3, addr, value,
+					 build_int_cst (integer_type_node,
+							MEMMODEL_RELAXED));
+		  gsi_replace (&gsi, stmt, true);
+		}
+	      else
+		{
+		  gcc_assert (TREE_CODE (addr) == ADDR_EXPR);
+		  tree ref = TREE_OPERAND (addr, 0);
+		  tree gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
+		  gcov_type_tmp_var
+		    = make_temp_ssa_name (get_gcov_type (), NULL,
+					  "PROF_edge_counter");
+		  gassign *stmt2
+		    = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
+					   gimple_assign_lhs (stmt1), value);
+		  gassign *stmt3
+		    = gimple_build_assign (unshare_expr (ref),
+					   gimple_assign_lhs (stmt2));
+		  gsi_insert_seq_before (&gsi, stmt1, GSI_SAME_STMT);
+		  gsi_insert_seq_before (&gsi, stmt2, GSI_SAME_STMT);
+		  gsi_replace (&gsi, stmt3, GSI_SAME_STMT);
+		}
+	    }
+	}
+    }
+}
+
+/* Profile expand pass.  */
+
+namespace {
+
+const pass_data pass_data_profile_expand =
+{
+  GIMPLE_PASS, /* type */
+  "profile_expand", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_LIM, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_profile_expand : public gimple_opt_pass
+{
+public:
+  pass_profile_expand (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_profile_expand, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () { return new pass_profile_expand (m_ctxt); }
+  virtual bool gate (function *) { return true; }
+  virtual unsigned int execute (function *);
+
+}; // class pass_profile_expand
+
+unsigned int
+pass_profile_expand::execute (function *)
+{
+  expand_coverage_counter_ifns ();
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_profile_expand (gcc::context *ctxt)
+{
+  return new pass_profile_expand (ctxt);
+}
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
index e4c11aa..4c14e24 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-lim-11.c
@@ -22,4 +22,5 @@ void access_buf(struct thread_param* p)
     }
 }
 
-/* { dg-final { scan-tree-dump-times "Executing store motion of __gcov0.access_buf\\\[\[01\]\\\] from loop 1" 2 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 1" 1 "lim2" } } */
+/* { dg-final { scan-tree-dump-times "Executing store motion of &__gcov0.access_buf\\\[\[01\]\\\] from loop 2" 1 "lim2" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 5903fde..ac919f8 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -365,6 +365,7 @@ extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_no_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_profile_expand (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
diff --git a/gcc/tree-profile.c b/gcc/tree-profile.c
index abeee92..288d38c 100644
--- a/gcc/tree-profile.c
+++ b/gcc/tree-profile.c
@@ -252,40 +252,13 @@ gimple_init_edge_profiler (void)
 void
 gimple_gen_edge_profiler (int edgeno, edge e)
 {
-  tree one;
+  tree one = build_int_cst (gcov_type_node, 1);
 
-  one = build_int_cst (gcov_type_node, 1);
-
-  if (flag_profile_update == PROFILE_UPDATE_ATOMIC)
-    {
-      /* __atomic_fetch_add (&counter, 1, MEMMODEL_RELAXED); */
-      tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
-      tree f = builtin_decl_explicit (LONG_LONG_TYPE_SIZE > 32
-				      ? BUILT_IN_ATOMIC_FETCH_ADD_8:
-				      BUILT_IN_ATOMIC_FETCH_ADD_4);
-      gcall *stmt = gimple_build_call (f, 3, addr, one,
-				       build_int_cst (integer_type_node,
-						      MEMMODEL_RELAXED));
-      gsi_insert_on_edge (e, stmt);
-    }
-  else
-    {
-      tree ref = tree_coverage_counter_ref (GCOV_COUNTER_ARCS, edgeno);
-      tree gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-						   NULL, "PROF_edge_counter");
-      gassign *stmt1 = gimple_build_assign (gcov_type_tmp_var, ref);
-      gcov_type_tmp_var = make_temp_ssa_name (gcov_type_node,
-					      NULL, "PROF_edge_counter");
-      gassign *stmt2 = gimple_build_assign (gcov_type_tmp_var, PLUS_EXPR,
-					    gimple_assign_lhs (stmt1), one);
-      gassign *stmt3 = gimple_build_assign (unshare_expr (ref),
-					    gimple_assign_lhs (stmt2));
-      gsi_insert_on_edge (e, stmt1);
-      gsi_insert_on_edge (e, stmt2);
-      gsi_insert_on_edge (e, stmt3);
-    }
+  tree addr = tree_coverage_counter_addr (GCOV_COUNTER_ARCS, edgeno);
+  gcall *stmt = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					    2, addr, one);
+  gsi_insert_on_edge (e, stmt);
 }
 
-
 /* Emits code to get VALUE to instrument at GSI, and returns the
    variable containing the value.  */
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 463db04..0e04250 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -44,6 +44,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "trans-mem.h"
 #include "gimple-fold.h"
 #include "tree-scalar-evolution.h"
+#include "coverage.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -2281,24 +2282,6 @@ find_refs_for_sm (struct loop *loop, bitmap sm_executed, bitmap refs_to_sm)
     }
 }
 
-/* Checks whether LOOP (with exits stored in EXITS array) is suitable
-   for a store motion optimization (i.e. whether we can insert statement
-   on its exits).  */
-
-static bool
-loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
-		      vec<edge> exits)
-{
-  unsigned i;
-  edge ex;
-
-  FOR_EACH_VEC_ELT (exits, i, ex)
-    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
-      return false;
-
-  return true;
-}
-
 /* Try to perform store motion for all memory references modified inside
    LOOP.  SM_EXECUTED is the bitmap of the memory references for that
    store motion was executed in one of the outer loops.  */
@@ -2556,6 +2539,132 @@ tree_ssa_lim (void)
   return todo;
 }
 
+/* Move coverage counter update internal function, pointed by GSI iterator,
+   out of a loop LOOP.  */
+
+static void
+move_coverage_counter_update (gimple_stmt_iterator *gsi, struct loop *loop)
+{
+  gimple *call = gsi_stmt (*gsi);
+  tree type = get_gcov_type ();
+
+  vec<edge> exits = get_loop_exit_edges (loop);
+  if (!loop_suitable_for_sm (loop, exits))
+    {
+      exits.release ();
+      return;
+    }
+
+  /* Verify that BB of the CALL statement post-dominates all exits.  */
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_POST_DOMINATORS, call->bb, exit->src))
+	{
+	  exits.release ();
+	  return;
+	}
+    }
+
+  if (exits.is_empty ())
+    return;
+
+  edge preheader = loop_preheader_edge (loop);
+  if (!single_succ_p (preheader->src)
+      || preheader->dest != call->bb)
+    return;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Executing store motion of ");
+      print_generic_expr (dump_file, gimple_call_arg (call, 0), 0);
+      fprintf (dump_file, " from loop %d\n", loop->num);
+    }
+
+  tree preheader_var = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  gimple *stmt = gimple_build_assign (preheader_var, build_int_cst (type, 0));
+  gimple_stmt_iterator it = gsi_last_bb (preheader->src);
+  gsi_insert_after (&it, stmt, GSI_NEW_STMT);
+
+  tree loop_var1 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+  tree loop_var2 = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  gphi *phi = create_phi_node (loop_var1, call->bb);
+  add_phi_arg (phi, preheader_var, preheader, UNKNOWN_LOCATION);
+
+  stmt = gimple_build_assign (loop_var2, PLUS_EXPR, loop_var1,
+			      build_int_cst (type, 1));
+  gsi_insert_before (gsi, stmt, GSI_SAME_STMT);
+
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, call->bb->preds)
+    if (e != preheader)
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+  tree updated_value = make_temp_ssa_name (type, NULL, "PROF_edge_counter");
+
+  for (unsigned i = 0; i < exits.length (); i++)
+    {
+      edge exit = exits[i];
+      if (!dominated_by_p (CDI_DOMINATORS, exit->dest, exit->src))
+	{
+	  basic_block new_bb = split_edge (exit);
+	  set_immediate_dominator (CDI_DOMINATORS, new_bb, exit->src);
+	  e = single_pred_edge (new_bb);
+	}
+      else
+	e = exit;
+
+      basic_block bb = e->dest;
+      phi = create_phi_node (updated_value, e->dest);
+      add_phi_arg (phi, loop_var2, e, UNKNOWN_LOCATION);
+
+      tree addr = unshare_expr (gimple_call_arg (call, 0));
+      call = gimple_build_call_internal (IFN_UPDATE_COVERAGE_COUNTER,
+					 2, addr, gimple_phi_result (phi));
+
+      it = gsi_start_bb (bb);
+      gsi_insert_before (&it, call, GSI_NEW_STMT);
+    }
+
+  exits.release ();
+  gsi_remove (gsi, true);
+}
+
+/* Process store motion for coverage counter update internal function.  */
+
+static void
+process_sm_for_coverage_counter (void)
+{
+  bool has_dom_info = dom_info_available_p (CDI_POST_DOMINATORS);
+  if (!has_dom_info)
+    calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      basic_block *body = get_loop_body (loop);
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+	{
+	  gimple_stmt_iterator gsi;
+	  for (gsi = gsi_start_bb (body[i]); !gsi_end_p (gsi);)
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+	      if (gimple_call_internal_p (stmt, IFN_UPDATE_COVERAGE_COUNTER)
+		  && integer_onep (gimple_call_arg (stmt, 1)))
+		move_coverage_counter_update (&gsi, loop);
+
+	      if (!gsi_end_p (gsi))
+		gsi_next (&gsi);
+	    }
+	}
+    }
+
+  if (!has_dom_info)
+    free_dominance_info (CDI_POST_DOMINATORS);
+}
+
 /* Loop invariant motion pass.  */
 
 namespace {
 
@@ -2592,11 +2701,15 @@ pass_lim::execute (function *fun)
 {
   bool in_loop_pipeline = scev_initialized_p ();
   if (!in_loop_pipeline)
-    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS);
+    loop_optimizer_init (LOOPS_NORMAL | LOOPS_HAVE_RECORDED_EXITS
+			 | LOOPS_HAVE_PREHEADERS);
 
-  if (number_of_loops (fun) <= 1)
-    return 0;
-  unsigned int todo = tree_ssa_lim ();
+  unsigned int todo = 0;
+  if (number_of_loops (fun) > 1)
+    todo = tree_ssa_lim ();
+
+  process_sm_for_coverage_counter ();
+  todo |= TODO_update_ssa;
 
   if (!in_loop_pipeline)
     loop_optimizer_finalize ();
diff --git a/gcc/tree-ssa-loop.h b/gcc/tree-ssa-loop.h
index b2f37ab..88bcad7 100644
--- a/gcc/tree-ssa-loop.h
+++ b/gcc/tree-ssa-loop.h
@@ -79,4 +79,22 @@ loop_containing_stmt (gimple *stmt)
   return bb->loop_father;
 }
 
+/* Checks whether LOOP (with exits stored in EXITS array) is suitable
+   for a store motion optimization (i.e. whether we can insert statement
+   on its exits).  */
+
+static inline bool
+loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
+		      vec<edge> exits)
+{
+  unsigned i;
+  edge ex;
+
+  FOR_EACH_VEC_ELT (exits, i, ex)
+    if (ex->flags & (EDGE_ABNORMAL | EDGE_EH))
+      return false;
+
+  return true;
+}
+
 #endif /* GCC_TREE_SSA_LOOP_H */
diff --git a/gcc/value-prof.h b/gcc/value-prof.h
index 07e2b3b..cb3ae1d 100644
--- a/gcc/value-prof.h
+++ b/gcc/value-prof.h
@@ -98,6 +98,7 @@ bool check_ic_target (gcall *, struct cgraph_node *);
 /* In tree-profile.c.  */
 extern void gimple_init_edge_profiler (void);
 extern void gimple_gen_edge_profiler (int, edge);
+extern void expand_coverage_counter_ifns (void);
 extern void gimple_gen_interval_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_pow2_profiler (histogram_value, unsigned, unsigned);
 extern void gimple_gen_one_value_profiler (histogram_value, unsigned, unsigned);
-- 
2.10.1