Message ID: CAELXzTP8rYTGBEk=nA6gD7_FFB-K2Rkd2gSKr5L8-k_DA+PYWw@mail.gmail.com
State: New
Series: [RFC,AARCH64] Machine reorg pass for aarch64/Falkor to handle prefetcher tag collision
Hi Kugan,

On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
> Implements a machine reorg pass for aarch64/Falkor to handle
> prefetcher tag collision. This is strictly not part of the loop
> unroller, but for Falkor, unrolling can make the h/w prefetcher
> perform badly if there are too many tag collisions, based on the
> discussions in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

Could you expand a bit more on what transformation exactly this pass does?
From my understanding, loads that use the same base register and offset
and have the same destination register are considered part of the same
stream by the hardware prefetcher, so for example:
  ldr x0, [x1, 16]  (load1)
  ... (set x1 to something else)
  ldr x0, [x1, 16]  (load2)

will cause the prefetcher to think that both loads are part of the same
stream, so this pass tries to rewrite the sequence into:
  ldr x0, [x1, 16]
  ... (set x1 to something else)
  mov tmp, x1
  ldr x0, [tmp, 16]

where the tag/signature is the combination of destination x0, base x1
and offset 16. Is this a fair description?

I've got some comments on the patch itself.

> gcc/ChangeLog:
>
> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>
> 	* config/aarch64/aarch64.c (iv_p): New.
> 	(strided_load_p): Likewise.
> 	(make_tag): Likewise.
> 	(get_load_info): Likewise.
> 	(aarch64_reorg): Likewise.
> 	(TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.

New functions need function comments describing the arguments at least.
Functions like make_tag, get_load_info etc. can get tricky to maintain
without some documentation on what they are supposed to accept and
return.

I think the pass should be enabled at certain optimisation levels, say
-O2? I don't think it would be desirable at -Os, since it creates extra
moves that increase code size.

That being said, I would recommend you implement this as an
aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
That way you can register it in aarch64-passes.def and have flexibility
as to when exactly the pass gets to run (i.e. you wouldn't be limited
by when machine_reorg gets run).

Also, I suggest you don't use the "if (aarch64_tune != falkor) return;"
way of gating this pass. Do it in a similar way to the FMA steering
pass, that is, define a new flag in aarch64-tuning-flags.def and use it
in the tune_flags field of the falkor tuning struct.

Hope this helps,
Kyrill
Hi Kyrill,

On 13 February 2018 at 20:47, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi Kugan,
>
> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>
>> Implements a machine reorg pass for aarch64/Falkor to handle
>> prefetcher tag collision. This is strictly not part of the loop
>> unroller, but for Falkor, unrolling can make the h/w prefetcher
>> perform badly if there are too many tag collisions, based on the
>> discussions in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>
>
> Could you expand a bit more on what transformation exactly this pass
> does?

This is similar to what LLVM does in https://reviews.llvm.org/D35366.

The Falkor hardware prefetcher works well when the signatures of the
prefetches (or tags, as computed in the patch, similar to LLVM) are
different for different memory streams. If different memory streams
have the same signature, it can result in bad performance. This
machine reorg pass tries to change the signature of memory loads by
changing the base register to a free register.

> From my understanding, loads that use the same base register and
> offset and have the same destination register are considered part of
> the same stream by the hardware prefetcher, so for example:
>   ldr x0, [x1, 16]  (load1)
>   ... (set x1 to something else)
>   ldr x0, [x1, 16]  (load2)
>
> will cause the prefetcher to think that both loads are part of the
> same stream, so this pass tries to rewrite the sequence into:
>   ldr x0, [x1, 16]
>   ... (set x1 to something else)
>   mov tmp, x1
>   ldr x0, [tmp, 16]
>
> where the tag/signature is the combination of destination x0, base x1
> and offset 16. Is this a fair description?

This is precisely what is happening.

> I've got some comments on the patch itself.
>
>> gcc/ChangeLog:
>>
>> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>
>> 	* config/aarch64/aarch64.c (iv_p): New.
>> 	(strided_load_p): Likewise.
>> 	(make_tag): Likewise.
>> 	(get_load_info): Likewise.
>> 	(aarch64_reorg): Likewise.
>> 	(TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>
> New functions need function comments describing the arguments at
> least. Functions like make_tag, get_load_info etc. can get tricky to
> maintain without some documentation on what they are supposed to
> accept and return.

I will add the comments.

> I think the pass should be enabled at certain optimisation levels,
> say -O2? I don't think it would be desirable at -Os, since it creates
> extra moves that increase code size.

Ok, I will change this.

> That being said, I would recommend you implement this as an
> aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
> That way you can register it in aarch64-passes.def and have
> flexibility as to when exactly the pass gets to run (i.e. you
> wouldn't be limited by when machine_reorg gets run).
>
> Also, I suggest you don't use the "if (aarch64_tune != falkor)
> return;" way of gating this pass. Do it in a similar way to the FMA
> steering pass, that is, define a new flag in aarch64-tuning-flags.def
> and use it in the tune_flags field of the falkor tuning struct.

Ok, I will revise the patch.

Thanks,
Kugan

> Hope this helps,
> Kyrill
Hi,

On 14 February 2018 at 09:47, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Kyrill,
>
> On 13 February 2018 at 20:47, Kyrill Tkachov
> <kyrylo.tkachov@foss.arm.com> wrote:
>> Hi Kugan,
>>
>> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>>
>>> Implements a machine reorg pass for aarch64/Falkor to handle
>>> prefetcher tag collision. This is strictly not part of the loop
>>> unroller, but for Falkor, unrolling can make the h/w prefetcher
>>> perform badly if there are too many tag collisions, based on the
>>> discussions in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>>
>>
>> Could you expand a bit more on what transformation exactly this pass
>> does?
>
> This is similar to what LLVM does in https://reviews.llvm.org/D35366.
>
> The Falkor hardware prefetcher works well when the signatures of the
> prefetches (or tags, as computed in the patch, similar to LLVM) are
> different for different memory streams. If different memory streams
> have the same signature, it can result in bad performance. This
> machine reorg pass tries to change the signature of memory loads by
> changing the base register to a free register.
>
>> From my understanding, loads that use the same base register and
>> offset and have the same destination register are considered part of
>> the same stream by the hardware prefetcher, so for example:
>>   ldr x0, [x1, 16]  (load1)
>>   ... (set x1 to something else)
>>   ldr x0, [x1, 16]  (load2)
>>
>> will cause the prefetcher to think that both loads are part of the
>> same stream, so this pass tries to rewrite the sequence into:
>>   ldr x0, [x1, 16]
>>   ... (set x1 to something else)
>>   mov tmp, x1
>>   ldr x0, [tmp, 16]
>>
>> where the tag/signature is the combination of destination x0, base
>> x1 and offset 16. Is this a fair description?
>
> This is precisely what is happening.
>
>> I've got some comments on the patch itself.
>>
>>> gcc/ChangeLog:
>>>
>>> 2018-02-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>>
>>> 	* config/aarch64/aarch64.c (iv_p): New.
>>> 	(strided_load_p): Likewise.
>>> 	(make_tag): Likewise.
>>> 	(get_load_info): Likewise.
>>> 	(aarch64_reorg): Likewise.
>>> 	(TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>>
>> New functions need function comments describing the arguments at
>> least. Functions like make_tag, get_load_info etc. can get tricky to
>> maintain without some documentation on what they are supposed to
>> accept and return.
>
> I will add the comments.
>
>> I think the pass should be enabled at certain optimisation levels,
>> say -O2? I don't think it would be desirable at -Os, since it
>> creates extra moves that increase code size.
>
> Ok, I will change this.
>
>> That being said, I would recommend you implement this as an
>> aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
>> That way you can register it in aarch64-passes.def and have
>> flexibility as to when exactly the pass gets to run (i.e. you
>> wouldn't be limited by when machine_reorg gets run).
>>
>> Also, I suggest you don't use the "if (aarch64_tune != falkor)
>> return;" way of gating this pass. Do it in a similar way to the FMA
>> steering pass, that is, define a new flag in
>> aarch64-tuning-flags.def and use it in the tune_flags field of the
>> falkor tuning struct.
>
> Ok, I will revise the patch.

Here is the revised patch.

Thanks,
Kugan

gcc/ChangeLog:

2018-02-15  Kugan Vivekanandarajah  <kuganv@linaro.org>

	* config.gcc: Add falkor-tag-collision-avoidance.o to extra_objs
	for aarch64-*-*.
	* config/aarch64/aarch64-protos.h
	(make_pass_tag_collision_avoidance): Declare.
	* config/aarch64/aarch64-passes.def: Insert tag collision
	avoidance pass.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION): Define.
* config/aarch64/aarch64.c (qdf24xx_tunings): Add AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION. * config/aarch64/falkor-tag-collision-avoidance.c: New file. * config/aarch64/t-aarch64: Add falkor-tag-collision-avoidance.o. > > > Thanks, > Kugan > >> >> Hope this helps, >> Kyrill diff --git a/gcc/config.gcc b/gcc/config.gcc index eca156a..c3f3e1a 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -303,7 +303,7 @@ aarch64*-*-*) extra_headers="arm_fp16.h arm_neon.h arm_acle.h" c_target_objs="aarch64-c.o" cxx_target_objs="aarch64-c.o" - extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o" + extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o falkor-tag-collision-avoidance.o" target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c" target_has_targetm_common=yes ;; diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def index 87747b4..d4b6a43 100644 --- a/gcc/config/aarch64/aarch64-passes.def +++ b/gcc/config/aarch64/aarch64-passes.def @@ -19,3 +19,4 @@ <http://www.gnu.org/licenses/>. 
*/ INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering); +INSERT_PASS_AFTER (pass_fast_rtl_dce, 1, pass_tag_collision_avoidance); diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index 2d705d2..d8f6964 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -544,6 +544,7 @@ std::string aarch64_get_extension_string_for_isa_flags (unsigned long, unsigned long); rtl_opt_pass *make_pass_fma_steering (gcc::context *ctxt); +rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *ctxt); poly_uint64 aarch64_regmode_natural_size (machine_mode); diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def index ea9ead2..c0dd178 100644 --- a/gcc/config/aarch64/aarch64-tuning-flags.def +++ b/gcc/config/aarch64/aarch64-tuning-flags.def @@ -41,4 +41,6 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW) are not considered cheap. */ AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND) +AARCH64_EXTRA_TUNING_OPTION ("avoid_prefetch_tag_collision", AVOID_PREFETCH_TAG_COLLISION) + #undef AARCH64_EXTRA_TUNING_OPTION diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 2e70f3a..b075325 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings = 2, /* min_div_recip_mul_df. */ 0, /* max_case_values. */ tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ - (AARCH64_EXTRA_TUNE_NONE), /* tune_flags. */ + (AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION), /* tune_flags. 
*/ &qdf24xx_prefetch_tune }; diff --git a/gcc/config/aarch64/falkor-tag-collision-avoidance.c b/gcc/config/aarch64/falkor-tag-collision-avoidance.c index e69de29..1fe320f 100644 --- a/gcc/config/aarch64/falkor-tag-collision-avoidance.c +++ b/gcc/config/aarch64/falkor-tag-collision-avoidance.c @@ -0,0 +1,468 @@ +/* Tag Collision Avoidance pass for Falkor. + Copyright (C) 2018 Free Software Foundation, Inc. + + This file is part of GCC. + + GCC is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + GCC is distributed in the hope that it will be useful, but + WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received a copy of the GNU General Public License + along with GCC; see the file COPYING3. If not see + <http://www.gnu.org/licenses/>. */ + +#define IN_TARGET_CODE 1 + +#include "config.h" +#define INCLUDE_LIST +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "target.h" +#include "rtl.h" +#include "tree.h" +#include "tree-pass.h" +#include "aarch64-protos.h" +#include "hash-map.h" +#include "cfgloop.h" +#include "cfgrtl.h" +#include "rtl-iter.h" +#include "df.h" +#include "memmodel.h" +#include "optabs.h" +#include "regs.h" +#include "recog.h" + +/* + Falkor hardware prefetcher works well when signature of the prefetches + (or tags as computed in the patch) are different for different memory + streams. If different memory streams have the same signature, it can + result in bad performance. This pass tries to change the signature of + memory loads by changing the base register with a free register. + + Signature (TAG) is based on SRC, DST and Offset. 
If the signature is + is same, it will be considered part of the same stream by the hardware + prefetcher, for example: + ldr x0, [x1, 16] (load stream 1) + x1 is resused for a different stream + ldr x0, [x1, 16] (load stream 2) + + will cause the prefetcher to think that both loads are part of the same + stream, so this pass tries to rewrite the sequence into: + ldr x0, [x1, 16] + mov tmp, x1 + ldr x0, [tmp, 16] + Such that the signatures are different. */ + + +/* Return true if the REG is an IV in the LOOP, false otherwise. + This is an approximate check and does not rely on the functionality + provided likes of biv () as the loop form might not be suitable for + such analysis. */ + +static bool +iv_p (rtx reg, struct loop *loop) +{ + df_ref adef; + unsigned regno = REGNO (reg); + bool def_in_loop = false; + bool def_out_loop = false; + + if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT) + return false; + + for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef)) + { + if (!DF_REF_INSN_INFO (adef) + || !NONDEBUG_INSN_P (DF_REF_INSN (adef))) + continue; + + basic_block bb = DF_REF_BB (adef); + if (dominated_by_p (CDI_DOMINATORS, bb, loop->header) + && bb->loop_father == loop) + { + rtx_insn *insn = DF_REF_INSN (adef); + recog_memoized (insn); + rtx pat = PATTERN (insn); + if (GET_CODE (pat) != SET) + continue; + rtx x = SET_SRC (pat); + if (GET_CODE (x) == ZERO_EXTRACT + || GET_CODE (x) == ZERO_EXTEND + || GET_CODE (x) == SIGN_EXTEND) + x = XEXP (x, 0); + if (MEM_P (x)) + continue; + if (GET_CODE (x) == POST_INC + || GET_CODE (x) == POST_DEC + || GET_CODE (x) == PRE_INC + || GET_CODE (x) == PRE_DEC) + def_in_loop = true; + else if (BINARY_P (x)) + def_in_loop = true; + } + if (dominated_by_p (CDI_DOMINATORS, loop->header, bb)) + def_out_loop = true; + if (def_in_loop && def_out_loop) + return true; + } + return false; +} + +/* Return true if X is a strided load in the LOOP, false otherwise. + If it is a strided load, set the BASE and OFFSET. 
Also, if this is + a pre/post increment load, set PRE_POST to true. */ + +static bool +strided_load_p (rtx x, + struct loop *loop, + bool *pre_post, + rtx *base, + rtx *offset) +{ + /* Loadded value is extended, get src. */ + if (GET_CODE (x) == ZERO_EXTRACT + || GET_CODE (x) == ZERO_EXTEND + || GET_CODE (x) == SIGN_EXTEND) + x = XEXP (x, 0); + + /* If it is not MEM_P, it is not lodade from mem. */ + if (!MEM_P (x)) + return false; + + /* Get the src of MEM_P. */ + x = XEXP (x, 0); + + /* If it is a post/pre increment, get the src. */ + if (GET_CODE (x) == POST_INC + || GET_CODE (x) == POST_DEC + || GET_CODE (x) == PRE_INC + || GET_CODE (x) == PRE_DEC) + { + x = XEXP (x, 0); + *pre_post = true; + } + + /* get base and offset depending on the type. */ + if (REG_P (x) + || UNARY_P (x)) + { + if (!REG_P (x)) + x = XEXP (x, 0); + if (REG_P (x) + && iv_p (x, loop)) + { + *base = x; + return true; + } + } + else if (BINARY_P (x)) + { + rtx reg1, reg2; + reg1 = XEXP (x, 0); + + if (REG_P (reg1) + && REGNO (reg1) == SP_REGNUM) + return false; + reg2 = XEXP (x, 1); + + if (REG_P (reg1) + && iv_p (reg1, loop)) + { + + *base = reg1; + *offset = reg2; + return true; + } + + if (REG_P (reg1) + && REG_P (reg2) + && iv_p (reg2, loop)) + { + *base = reg1; + *offset = reg2; + return true; + } + } + return false; +} + +/* Compute the TAG (or signature) based on BASE, DEST and + OFFSET of the load. */ + +static unsigned +make_tag (unsigned dest, unsigned base, unsigned offset) +{ + return (dest & 0xf) + | ((base & 0xf) << 4) + | ((offset & 0x3f) << 8); +} + + +/* Return true if INSN is a strided load in LOOP. + If it is a strided load, set the DEST, BASE and OFFSET. + Also, if this is a pre/post increment load, set PRE_POST + to true. 
*/ + +static bool +get_load_info (rtx_insn *insn, + struct loop *loop, + bool *pre_post, + rtx *base, + rtx *dest, + rtx *offset) +{ + subrtx_var_iterator::array_type array; + if (!INSN_P (insn) || recog_memoized (insn) < 0) + return false; + rtx pat = PATTERN (insn); + switch (GET_CODE (pat)) + { + case PARALLEL: + { + for (int j = 0; j < XVECLEN (pat, 0); ++j) + { + rtx ex = XVECEXP (pat, 0, j); + FOR_EACH_SUBRTX_VAR (iter, array, ex, NONCONST) + { + const_rtx x = *iter; + if (GET_CODE (x) == SET + && strided_load_p (SET_SRC (x), loop, pre_post, + base, offset)) + { + *dest = SET_DEST (x); + return true; + } + } + } + } + break; + + case SET: + FOR_EACH_SUBRTX_VAR (iter, array, SET_SRC (pat), NONCONST) + { + rtx x = *iter; + if (strided_load_p (x, loop, pre_post, + base, offset)) + { + *dest = SET_DEST (pat); + return true; + } + } + + default: + break; + } + return false; +} + +/* Tag collision avoidance pass for Falkor. */ + +void +execute_tag_collision_avoidance () +{ + basic_block *body, bb; + struct loop *loop; + rtx_insn *insn; + + compute_bb_for_insn (); + /* Compute live regs. */ + df_compute_regs_ever_live (true); + df_analyze (); + + /* Find the loops. */ + loop_optimizer_init (AVOID_CFG_MODIFICATIONS); + calculate_dominance_info (CDI_DOMINATORS); + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) + { + hash_map <rtx, auto_vec<rtx_insn *> > tag_map (512); + body = get_loop_body (loop); + auto_vec <rtx> tags; + + /* Record all the memory tags. 
*/ + for (unsigned i = 0; i < loop->num_nodes; i++) + { + bb = body[i]; + FOR_BB_INSNS (bb, insn) + { + unsigned tag; + rtx base = NULL_RTX; + rtx dest = NULL_RTX; + rtx offset = NULL_RTX; + bool pre_or_post = false; + + if (!INSN_P (insn) + || DEBUG_INSN_P (insn)) + continue; + + if (get_load_info (insn, loop, &pre_or_post, + &base, &dest, &offset) + && REG_P (dest)) + { + int int_offset = 0; + if (offset && REG_P (offset)) + int_offset = (1 << 5) | REGNO (offset); + else if (offset && CONST_INT_P (offset)) + { + int_offset = INTVAL (offset); + int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant (); + if (!pre_or_post) + int_offset >>= 2; + } + tag = make_tag (REGNO (dest), REGNO (base), int_offset); + rtx t = GEN_INT (tag); + if (!tag_map.get (t)) + tags.safe_push (t); + tag_map.get_or_insert (t).safe_push (insn); + } + } + } + + for (unsigned i = 0; i < tags.length (); ++i) + { + rtx t = tags[i]; + auto_vec<rtx_insn *> *v = tag_map.get (t); + + for (int j = v->length () - 1; j > 0; --j) + { + /* Get the insns that has tags colliding. */ + rtx_insn *insn = (*v)[j]; + rtx pat; + bool changed = false; + int int_offset = 0; + rtx base = NULL_RTX; + rtx dest = NULL_RTX; + rtx offset = NULL_RTX; + bool pre_or_post = false; + + if (!get_load_info (insn, loop, &pre_or_post, + &base, &dest, &offset)) + gcc_assert (false); + + if (offset && REG_P (offset)) + int_offset = (1 << 5) | REGNO (offset); + else if (offset && CONST_INT_P (offset)) + { + int_offset = INTVAL (offset); + int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant (); + if (!pre_or_post) + int_offset >>= 2; + } + + /* Go over temporary registers and find a free register, if + available. */ + for (int k = R9_REGNUM; !changed && (k <= R15_REGNUM); k++) + if (!df_hard_reg_used_p (k)) + { + unsigned tag; + rtx t; + + tag = make_tag (REGNO (dest), k, int_offset); + t = GEN_INT (tag); + /* Check to see if the new tag also collides with an + existing load. 
*/ + if (tag_map.get (t)) + continue; + + machine_mode mode = GET_MODE (base); + rtx new_reg = gen_rtx_REG (mode, k); + t = GEN_INT (make_tag (REGNO (dest), REGNO (new_reg), + int_offset)); + vec <rtx_insn *> *v2 = tag_map.get (t); + if (v2 && (v2->length () > 0)) + continue; + + /* Change the insn: dest = load (base, offset) + into tmp = base; dest = load (tmp, offset). */ + extract_insn (insn); + for (int l = 0; + (!changed) && (l < recog_data.n_operands); l++) + { + subrtx_ptr_iterator::array_type array; + rtx *op = recog_data.operand_loc[l]; + + if (recog_data.operand_type[l] == OP_OUT) + continue; + + FOR_EACH_SUBRTX_PTR (iter, array, op, NONCONST) + { + rtx *loc = *iter; + rtx x = *loc; + + if (!changed && (base == x)) + { + pat = gen_rtx_SET (new_reg, base); + if (validate_change (insn, loc, new_reg, false)) + { + emit_insn_before (pat, insn); + if (pre_or_post) + { + rtx pat2 = gen_rtx_SET (base, new_reg); + emit_insn_after (pat2, insn); + } + } + v->pop (); + tag_map.get_or_insert (t).safe_push (insn); + changed = true; + break; + } + } + } + } + } + } + } + + loop_optimizer_finalize (); +} + + +const pass_data pass_data_tag_collision_avoidance = +{ + RTL_PASS, /* type */ + "tag_collision_avoidance", /* name */ + OPTGROUP_NONE, /* optinfo_flags */ + TV_NONE, /* tv_id */ + 0, /* properties_required */ + 0, /* properties_provided */ + 0, /* properties_destroyed */ + 0, /* todo_flags_start */ + TODO_df_finish, /* todo_flags_finish */ +}; + +class pass_tag_collision_avoidance : public rtl_opt_pass +{ +public: + pass_tag_collision_avoidance (gcc::context *ctxt) + : rtl_opt_pass (pass_data_tag_collision_avoidance, ctxt) + {} + + /* opt_pass methods: */ + virtual bool gate (function *) + { + return (aarch64_tune_params.extra_tuning_flags + & AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION) + && optimize >= 2; + } + + virtual unsigned int execute (function *) + { + execute_tag_collision_avoidance (); + return 0; + } + +}; // class pass_tag_collision_avoidance + 
+/* Create a new pass instance. */ + +rtl_opt_pass * +make_pass_tag_collision_avoidance (gcc::context *ctxt) +{ + return new pass_tag_collision_avoidance (ctxt); +} diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64 index 0be1f0d..f185b40 100644 --- a/gcc/config/aarch64/t-aarch64 +++ b/gcc/config/aarch64/t-aarch64 @@ -67,6 +67,15 @@ cortex-a57-fma-steering.o: $(srcdir)/config/aarch64/cortex-a57-fma-steering.c \ $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ $(srcdir)/config/aarch64/cortex-a57-fma-steering.c +falkor-tag-collision-avoidance.o: $(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c \ + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \ + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \ + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \ + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \ + $(srcdir)/config/aarch64/aarch64-protos.h + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ + $(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c + comma=, MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG)))) MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
From 0cd4f5acb2117c739ba81bb4b8b71af499107812 Mon Sep 17 00:00:00 2001 From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org> Date: Mon, 12 Feb 2018 10:44:53 +1100 Subject: [PATCH 4/4] reorg-for-tag-collision Change-Id: Ic6e42d54268c9112ec1c25de577ca92c1808eeff --- gcc/config/aarch64/aarch64.c | 353 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 353 insertions(+) diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 1ce2a0c..48e7c54 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -71,6 +71,7 @@ #include "selftest.h" #include "selftest-rtl.h" #include "rtx-vector-builder.h" +#include "cfgrtl.h" /* This file should be included last. */ #include "target-def.h" @@ -17203,6 +17204,355 @@ aarch64_select_early_remat_modes (sbitmap modes) } } +static bool +iv_p (rtx reg, struct loop *loop) +{ + df_ref adef; + unsigned regno = REGNO (reg); + bool def_in_loop = false; + bool def_out_loop = false; + + if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT) + return false; + + for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef)) + { + if (!DF_REF_INSN_INFO (adef) + || !NONDEBUG_INSN_P (DF_REF_INSN (adef))) + continue; + + basic_block bb = DF_REF_BB (adef); + if (dominated_by_p (CDI_DOMINATORS, bb, loop->header) + && bb->loop_father == loop) + { + rtx_insn *insn = DF_REF_INSN (adef); + recog_memoized (insn); + rtx pat = PATTERN (insn); + if (GET_CODE (pat) != SET) + continue; + rtx x = SET_SRC (pat); + if (GET_CODE (x) == ZERO_EXTRACT + || GET_CODE (x) == ZERO_EXTEND + || GET_CODE (x) == SIGN_EXTEND) + x = XEXP (x, 0); + if (MEM_P (x)) + continue; + if (GET_CODE (x) == POST_INC + || GET_CODE (x) == POST_DEC + || GET_CODE (x) == PRE_INC + || GET_CODE (x) == PRE_DEC) + def_in_loop = true; + else if (BINARY_P (x)) + def_in_loop = true; + } + if (dominated_by_p (CDI_DOMINATORS, loop->header, bb)) + def_out_loop = true; + if (def_in_loop && def_out_loop) + return true; + } + return false; 
+} + +/* Return true if X is a strided load. */ + +static bool +strided_load_p (rtx x, + struct loop *loop, + bool *pre_post, + rtx *base, + rtx *offset) +{ + /* Loadded value is extended, get src. */ + if (GET_CODE (x) == ZERO_EXTRACT + || GET_CODE (x) == ZERO_EXTEND + || GET_CODE (x) == SIGN_EXTEND) + x = XEXP (x, 0); + + /* If it is not MEM_P, it is not lodade from mem. */ + if (!MEM_P (x)) + return false; + + /* Get the src of MEM_P. */ + x = XEXP (x, 0); + + /* If it is a post/pre increment, get the src. */ + if (GET_CODE (x) == POST_INC + || GET_CODE (x) == POST_DEC + || GET_CODE (x) == PRE_INC + || GET_CODE (x) == PRE_DEC) + { + x = XEXP (x, 0); + *pre_post = true; + } + + /* get base and offset depending on the type. */ + if (REG_P (x) + || UNARY_P (x)) + { + if (!REG_P (x)) + x = XEXP (x, 0); + if (REG_P (x) + && iv_p (x, loop)) + { + *base = x; + return true; + } + } + else if (BINARY_P (x)) + { + rtx reg1, reg2; + reg1 = XEXP (x, 0); + + if (REG_P (reg1) + && REGNO (reg1) == SP_REGNUM) + return false; + reg2 = XEXP (x, 1); + + if (REG_P (reg1) + && iv_p (reg1, loop)) + { + + *base = reg1; + *offset = reg2; + return true; + } + + if (REG_P (reg1) + && REG_P (reg2) + && iv_p (reg2, loop)) + { + *base = reg1; + *offset = reg2; + return true; + } + } + return false; +} + +static unsigned +make_tag (unsigned dest, unsigned base, unsigned offset) +{ + return (dest & 0xf) + | ((base & 0xf) << 4) + | ((offset & 0x3f) << 8); +} + + +/* Return true if X INSN is a strided load. 
*/ + +static bool +get_load_info (rtx_insn *insn, + struct loop *loop, + bool *pre_post, + rtx *base, + rtx *dest, + rtx *offset) +{ + subrtx_var_iterator::array_type array; + if (!INSN_P (insn) || recog_memoized (insn) < 0) + return false; + rtx pat = PATTERN (insn); + switch (GET_CODE (pat)) + { + case PARALLEL: + { + for (int j = 0; j < XVECLEN (pat, 0); ++j) + { + rtx ex = XVECEXP (pat, 0, j); + FOR_EACH_SUBRTX_VAR (iter, array, ex, NONCONST) + { + const_rtx x = *iter; + if (GET_CODE (x) == SET + && strided_load_p (SET_SRC (x), loop, pre_post, + base, offset)) + { + *dest = SET_DEST (x); + return true; + } + } + } + } + break; + + case SET: + FOR_EACH_SUBRTX_VAR (iter, array, SET_SRC (pat), NONCONST) + { + rtx x = *iter; + if (strided_load_p (x, loop, pre_post, + base, offset)) + { + *dest = SET_DEST (pat); + return true; + } + } + + default: + break; + } + return false; +} + +static void +aarch64_reorg (void) +{ + basic_block *body, bb; + struct loop *loop; + rtx_insn *insn; + + if (aarch64_tune != falkor) + return; + + compute_bb_for_insn (); + /* Compute live regs. */ + df_compute_regs_ever_live (true); + df_analyze (); + + /* Find the loops. */ + loop_optimizer_init (AVOID_CFG_MODIFICATIONS); + calculate_dominance_info (CDI_DOMINATORS); + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) + { + hash_map <rtx, auto_vec<rtx_insn *> > tag_map (512); + body = get_loop_body (loop); + auto_vec <rtx> tags; + + /* Record all the memory tags. 
*/ + for (unsigned i = 0; i < loop->num_nodes; i++) + { + bb = body[i]; + FOR_BB_INSNS (bb, insn) + { + unsigned tag; + rtx base = NULL_RTX; + rtx dest = NULL_RTX; + rtx offset = NULL_RTX; + bool pre_or_post = false; + + if (!INSN_P (insn) + || DEBUG_INSN_P (insn)) + continue; + + if (get_load_info (insn, loop, &pre_or_post, + &base, &dest, &offset)) + { + int int_offset = 0; + if (offset && REG_P (offset)) + int_offset = (1 << 5) | REGNO (offset); + else if (offset && CONST_INT_P (offset)) + { + int_offset = INTVAL (offset); + int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant (); + if (!pre_or_post) + int_offset >>= 2; + } + tag = make_tag (REGNO (dest), REGNO (base), int_offset); + rtx t = GEN_INT (tag); + if (!tag_map.get (t)) + tags.safe_push (t); + tag_map.get_or_insert (t).safe_push (insn); + } + } + } + + for (unsigned i = 0; i < tags.length (); ++i) + { + rtx t = tags[i]; + auto_vec<rtx_insn *> *v = tag_map.get (t); + + for (int j = v->length () - 1; j > 0; --j) + { + /* Get the insns that has tags colliding. */ + rtx_insn *insn = (*v)[j]; + rtx pat; + bool changed = false; + int int_offset = 0; + rtx base = NULL_RTX; + rtx dest = NULL_RTX; + rtx offset = NULL_RTX; + bool pre_or_post = false; + + if (!get_load_info (insn, loop, &pre_or_post, + &base, &dest, &offset)) + gcc_assert (false); + + if (offset && REG_P (offset)) + int_offset = (1 << 5) | REGNO (offset); + else if (offset && CONST_INT_P (offset)) + { + int_offset = INTVAL (offset); + int_offset /= GET_MODE_SIZE (GET_MODE (dest)).to_constant (); + if (!pre_or_post) + int_offset >>= 2; + } + + /* Go over temporary registers and find a free register, if + available. */ + for (int k = R9_REGNUM; !changed && (k <= R15_REGNUM); k++) + if (!df_hard_reg_used_p (k)) + { + unsigned tag; + rtx t; + + tag = make_tag (REGNO (dest), k, int_offset); + t = GEN_INT (tag); + /* Check to see if the new tag also collides with an + existing load. 
*/ + if (tag_map.get (t)) + continue; + + machine_mode mode = GET_MODE (base); + rtx new_reg = gen_rtx_REG (mode, k); + t = GEN_INT (make_tag (REGNO (dest), REGNO (new_reg), + int_offset)); + vec <rtx_insn *> *v2 = tag_map.get (t); + if (v2 && (v2->length () > 0)) + continue; + + /* Change the insn: dest = load (base, offset) + into tmp = base; dest = load (tmp, offset). */ + extract_insn (insn); + for (int l = 0; + (!changed) && (l < recog_data.n_operands); l++) + { + subrtx_ptr_iterator::array_type array; + rtx *op = recog_data.operand_loc[l]; + + if (recog_data.operand_type[l] == OP_OUT) + continue; + + FOR_EACH_SUBRTX_PTR (iter, array, op, NONCONST) + { + rtx *loc = *iter; + rtx x = *loc; + + if (!changed && (base == x)) + { + pat = gen_rtx_SET (new_reg, base); + if (validate_change (insn, loc, new_reg, false)) + { + emit_insn_before (pat, insn); + if (pre_or_post) + { + rtx pat2 = gen_rtx_SET (base, new_reg); + emit_insn_after (pat2, insn); + } + } + v->pop (); + tag_map.get_or_insert (t).safe_push (insn); + changed = true; + break; + } + } + } + } + } + } + } + + loop_optimizer_finalize (); + df_finish_pass (true); +} + /* Target-specific selftests. */ #if CHECKING_P @@ -17675,6 +18025,9 @@ aarch64_libgcc_floating_mode_supported_p #undef TARGET_HW_MAX_MEM_READ_STREAMS #define TARGET_HW_MAX_MEM_READ_STREAMS aarch64_hw_max_mem_read_streams +#undef TARGET_MACHINE_DEPENDENT_REORG +#define TARGET_MACHINE_DEPENDENT_REORG aarch64_reorg + #if CHECKING_P #undef TARGET_RUN_TARGET_SELFTESTS #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests -- 2.7.4