Message ID | CAELXzTMazB7YRiTR73bzOqCMOPk6ubF8=4LYEzzK0imf+FVS8w@mail.gmail.com
---|---
State | New
Series | Loop unrolling and memory load streams
On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
> in rtl based on strided-loads in loop.

Can you expand on this some more?  Like give an example of where this
helps?  I am trying to better understand your counting scheme, since it
seems like the count is based on the number of loads and not on cache
lines.

What do you mean by a strided load?

Doesn't this function overcount when you have:

for (int i = 1; i < 1024; i++)
  {
    t += a[i-1] * a[i];
  }

if it is counting based on cache lines rather than based on load addresses?

It also seems to do some weird counting when you have:

for (int i = 1; i < 1024; i++)
  {
    t += a[(i-1)*N + i] * a[(i)*N + i];
  }

that is, addresses of the form:

(PLUS (REG) (REG))

It also seems to overcount when loading from the same pointer twice.

In my micro-arch, the number of prefetch slots is based on cache-line
misses, so this would be overcounting by a factor of 2.

Thanks,
Andrew

>
> Thanks,
> Kugan
>
> gcc/ChangeLog:
>
> 2017-09-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>
> 	* cfgloop.h (iv_analyze_biv): export.
> 	* loop-iv.c: Likewise.
> 	* config/aarch64/aarch64.c (strided_load_p): New.
> 	(insn_has_strided_load): New.
> 	(count_strided_load_rtl): New.
> 	(aarch64_loop_unroll_adjust): New.
On Fri, Sep 15, 2017 at 2:33 AM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
> in rtl based on strided-loads in loop.
>
> Thanks,
> Kugan
>
> gcc/ChangeLog:
>
> 2017-09-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>
> 	* cfgloop.h (iv_analyze_biv): export.
> 	* loop-iv.c: Likewise.
> 	* config/aarch64/aarch64.c (strided_load_p): New.
> 	(insn_has_strided_load): New.
> 	(count_strided_load_rtl): New.
> 	(aarch64_loop_unroll_adjust): New.

This implementation assumes a particular kind of prefetcher and a
particular pattern of collisions in that hardware prefetcher.  Are you
sure this helps every single micro-architecture out there (or, rather,
doesn't harm any)?  Furthermore, how has this patchset been
benchmarked: on what micro-architecture, with what benchmarks, what is
the performance impact, and why should this be considered for generic?

regards
Ramana
Hi Ramana,

On 15 September 2017 at 18:40, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
> On Fri, Sep 15, 2017 at 2:33 AM, Kugan Vivekanandarajah
> <kugan.vivekanandarajah@linaro.org> wrote:
>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>> in rtl based on strided-loads in loop.
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2017-09-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>
>> 	* cfgloop.h (iv_analyze_biv): export.
>> 	* loop-iv.c: Likewise.
>> 	* config/aarch64/aarch64.c (strided_load_p): New.
>> 	(insn_has_strided_load): New.
>> 	(count_strided_load_rtl): New.
>> 	(aarch64_loop_unroll_adjust): New.
>
> This implementation assumes a particular kind of prefetcher and a
> particular pattern of collisions in that hardware prefetcher.  Are you
> sure this helps every single micro-architecture out there (or, rather,
> doesn't harm any)?  Furthermore, how has this patchset been
> benchmarked: on what micro-architecture, with what benchmarks, what is
> the performance impact, and why should this be considered for generic?

I tested on -mcpu=falkor, and at the moment this does not have any
effect on other CPUs.  It is not enabled for generic.

Thanks,
Kugan
Hi Andrew,

On 15 September 2017 at 13:36, Andrew Pinski <pinskia@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
> <kugan.vivekanandarajah@linaro.org> wrote:
>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>> in rtl based on strided-loads in loop.
>
> Can you expand on this some more?  Like give an example of where this
> helps?  I am trying to better understand your counting scheme, since it
> seems like the count is based on the number of loads and not on cache
> lines.

This is a simplified model, and I am assuming here that the prefetcher
tunes itself based on the memory accesses it sees.  I don't have access
to the internals of how this is implemented in different
micro-architectures, but I am assuming (in a simplified sense) that the
hardware detects memory-access patterns and uses them to prefetch cache
lines.  If there are memory accesses like the ones you have shown that
fall within the same cache line, they may be combined, but the hardware
still needs to detect and tune for them, and detecting that at compile
time is not always easy.  So this is a simplified model.

> What do you mean by a strided load?
>
> Doesn't this function overcount when you have:
>
> for (int i = 1; i < 1024; i++)
>   {
>     t += a[i-1] * a[i];
>   }
>
> if it is counting based on cache lines rather than based on load addresses?

Sorry for my terminology: what I mean by a strided access is any memory
access of the form memory[iv].  I am counting memory[iv] and
memory[iv+1] as two different streams; they may or may not fall into
the same cache line.

> It also seems to do some weird counting when you have:
>
> for (int i = 1; i < 1024; i++)
>   {
>     t += a[(i-1)*N + i] * a[(i)*N + i];
>   }
>
> that is, addresses of the form:
>
> (PLUS (REG) (REG))
>
> It also seems to overcount when loading from the same pointer twice.

If you prefer to count on a cache-line basis, then yes, I am counting
these twice intentionally.

> In my micro-arch, the number of prefetch slots is based on cache-line
> misses, so this would be overcounting by a factor of 2.

I am not entirely sure this will be useful for all cores.  It has been
shown to be beneficial for Falkor, based on what is done in LLVM.

Thanks,
Kugan

>
> Thanks,
> Andrew
>
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2017-09-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>
>> 	* cfgloop.h (iv_analyze_biv): export.
>> 	* loop-iv.c: Likewise.
>> 	* config/aarch64/aarch64.c (strided_load_p): New.
>> 	(insn_has_strided_load): New.
>> 	(count_strided_load_rtl): New.
>> 	(aarch64_loop_unroll_adjust): New.
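To make the counting model described above concrete, here is a minimal
sketch (not taken from the patch or the thread): both loads in the loop
below have the memory[iv] address form that strided_load_p looks for,
so the patch counts two streams even though a[i-1] and a[i] will
usually sit in the same cache line.

```c
/* Sketch only: illustrates how the patch's model counts load streams.
   Both loads below have addresses of the form memory[iv], so
   count_strided_load_rtl would report two strided loads, even though
   a[i-1] and a[i] almost always share a cache line.  */
double
dot_adjacent (const double *a, int n)
{
  double t = 0.0;
  for (int i = 1; i < n; i++)
    t += a[i - 1] * a[i];   /* two load streams under this model  */
  return t;
}
```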
On Sun, Sep 17, 2017 at 4:41 PM, Kugan Vivekanandarajah
<kugan.vivekanandarajah@linaro.org> wrote:
> Hi Andrew,
>
> On 15 September 2017 at 13:36, Andrew Pinski <pinskia@gmail.com> wrote:
>> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
>> <kugan.vivekanandarajah@linaro.org> wrote:
>>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>>> in rtl based on strided-loads in loop.
>>
>> Can you expand on this some more?  Like give an example of where this
>> helps?  I am trying to better understand your counting scheme, since it
>> seems like the count is based on the number of loads and not on cache
>> lines.
>
> This is a simplified model, and I am assuming here that the prefetcher
> tunes itself based on the memory accesses it sees.  I don't have access
> to the internals of how this is implemented in different
> micro-architectures, but I am assuming (in a simplified sense) that the
> hardware detects memory-access patterns and uses them to prefetch cache
> lines.  If there are memory accesses like the ones you have shown that
> fall within the same cache line, they may be combined, but the hardware
> still needs to detect and tune for them, and detecting that at compile
> time is not always easy.  So this is a simplified model.
>
>> What do you mean by a strided load?
>>
>> Doesn't this function overcount when you have:
>>
>> for (int i = 1; i < 1024; i++)
>>   {
>>     t += a[i-1] * a[i];
>>   }
>>
>> if it is counting based on cache lines rather than based on load addresses?
>
> Sorry for my terminology: what I mean by a strided access is any memory
> access of the form memory[iv].  I am counting memory[iv] and
> memory[iv+1] as two different streams; they may or may not fall into
> the same cache line.
>
>> It also seems to do some weird counting when you have:
>>
>> for (int i = 1; i < 1024; i++)
>>   {
>>     t += a[(i-1)*N + i] * a[(i)*N + i];
>>   }
>>
>> that is, addresses of the form:
>>
>> (PLUS (REG) (REG))
>>
>> It also seems to overcount when loading from the same pointer twice.
>
> If you prefer to count on a cache-line basis, then yes, I am counting
> these twice intentionally.
>
>> In my micro-arch, the number of prefetch slots is based on cache-line
>> misses, so this would be overcounting by a factor of 2.
>
> I am not entirely sure this will be useful for all cores.  It has been
> shown to be beneficial for Falkor, based on what is done in LLVM.

Can you share at least one benchmark or microbenchmark that shows the
benefit?  I can't tell how the Falkor core manages its hardware
prefetcher, so it is hard to see whether this is beneficial even there.

Thanks,
Andrew

>
> Thanks,
> Kugan
>>
>> Thanks,
>> Andrew
>>
>>>
>>> Thanks,
>>> Kugan
>>>
>>> gcc/ChangeLog:
>>>
>>> 2017-09-12  Kugan Vivekanandarajah  <kuganv@linaro.org>
>>>
>>> 	* cfgloop.h (iv_analyze_biv): export.
>>> 	* loop-iv.c: Likewise.
>>> 	* config/aarch64/aarch64.c (strided_load_p): New.
>>> 	(insn_has_strided_load): New.
>>> 	(count_strided_load_rtl): New.
>>> 	(aarch64_loop_unroll_adjust): New.
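A hypothetical sketch of the kind of microbenchmark Andrew asks about
(not something posted in this thread): a kernel with several
independent load streams, timed at different unroll factors on the
target CPU, would show whether capping the unroll factor to the
prefetcher's stream capacity helps on a given micro-architecture.

```c
/* Hypothetical microbenchmark sketch, not from this thread: four
   independent load streams per iteration.  Comparing runtimes with
   and without the unroll-factor cap on the target CPU would indicate
   whether the prefetcher's stream capacity is the limiting factor.  */
double
many_streams (const double *a, const double *b,
              const double *c, const double *d, long n)
{
  double t = 0.0;
  for (long i = 0; i < n; i++)
    t += a[i] * b[i] + c[i] * d[i];   /* four load streams  */
  return t;
}
```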
From 10e02b026784798fff6a3513dc11b1cffb1cf78a Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah <kugan.vivekanandarajah@linaro.org>
Date: Wed, 23 Aug 2017 12:35:14 +1000
Subject: [PATCH 5/5] add aarch64_loop_unroll_adjust

---
 gcc/cfgloop.h                |   1 +
 gcc/config/aarch64/aarch64.c | 136 +++++++++++++++++++++++++++++++++++++++++++
 gcc/loop-iv.c                |   2 +-
 3 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h
index 2308e7a..a3876a2 100644
--- a/gcc/cfgloop.h
+++ b/gcc/cfgloop.h
@@ -479,6 +479,7 @@ extern bool iv_analyze_expr (rtx_insn *, rtx, machine_mode,
 extern rtx get_iv_value (struct rtx_iv *, rtx);
 extern bool biv_p (rtx_insn *, rtx);
 extern void find_simple_exit (struct loop *, struct niter_desc *);
+extern bool iv_analyze_biv (rtx def, struct rtx_iv *iv);
 extern void iv_analysis_done (void);
 extern struct niter_desc *get_simple_loop_desc (struct loop *loop);
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index e88bb6c..624a996 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -15189,6 +15189,139 @@ aarch64_ok_to_unroll (struct loop *loop, unsigned HOST_WIDE_INT nunroll)
   return true;
 }
 
+/* Return true if X is a strided load.  */
+
+static bool
+strided_load_p (const_rtx x)
+{
+  struct rtx_iv iv;
+  rtx reg;
+
+  if (!MEM_P (x))
+    return false;
+
+  reg = XEXP (x, 0);
+  if (REG_P (reg)
+      || UNARY_P (reg))
+    {
+      if (!REG_P (reg))
+	reg = XEXP (reg, 0);
+      if (REG_P (reg)
+	  && iv_analyze_biv (reg, &iv))
+	return true;
+    }
+  else if (BINARY_P (reg))
+    {
+      rtx reg1, reg2;
+      reg1 = XEXP (reg, 0);
+      reg2 = XEXP (reg, 1);
+      if (REG_P (reg1)
+	  && iv_analyze_biv (reg1, &iv))
+	return true;
+      if (REG_P (reg2)
+	  && iv_analyze_biv (reg2, &iv))
+	return true;
+    }
+  return false;
+}
+
+
+/* Return true if INSN has a strided load.  */
+
+static bool
+insn_has_strided_load (rtx_insn *insn)
+{
+  subrtx_iterator::array_type array;
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+    return false;
+  rtx pat = PATTERN (insn);
+
+  switch (GET_CODE (pat))
+    {
+    case PARALLEL:
+      {
+	for (int j = 0; j < XVECLEN (pat, 0); ++j)
+	  {
+	    rtx ex = XVECEXP (pat, 0, j);
+	    FOR_EACH_SUBRTX (iter, array, ex, NONCONST)
+	      {
+		const_rtx x = *iter;
+		if (GET_CODE (x) == SET
+		    && strided_load_p (SET_SRC (x)))
+		  return true;
+	      }
+	  }
+      }
+      break;
+
+    case SET:
+      FOR_EACH_SUBRTX (iter, array, SET_SRC (pat), NONCONST)
+	{
+	  const_rtx x = *iter;
+	  if (strided_load_p (x))
+	    return true;
+	}
+
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Count the strided loads in the LOOP.  If the number of strided loads
+   is large (compared to MAX_STRIDED_LOADS), we don't need to compute
+   all of them.  This is used to limit the partial unrolling factor to
+   avoid prefetcher collision.  */
+
+static unsigned
+count_strided_load_rtl (struct loop *loop, unsigned max_strided_loads)
+{
+  basic_block *bbs;
+  unsigned count = 0;
+  rtx_insn *insn;
+  iv_analysis_loop_init (loop);
+  bbs = get_loop_body (loop);
+
+  for (unsigned i = 0; i < loop->num_nodes; ++i)
+    {
+      FOR_BB_INSNS (bbs[i], insn)
+	{
+	  if (insn_has_strided_load (insn))
+	    count++;
+
+	  if (count > (max_strided_loads / 2))
+	    {
+	      free (bbs);
+	      iv_analysis_done ();
+	      return count;
+	    }
+	}
+    }
+  free (bbs);
+  iv_analysis_done ();
+  return count;
+}
+
+/* Target hook loop_unroll_adjust that limits the partial loop unrolling
+   factor, if this would make the outer loop's prefetch streams more
+   than the hardware can handle.  */
+
+static unsigned
+aarch64_loop_unroll_adjust (unsigned n_unroll, struct loop *loop)
+{
+  int max_strided_loads;
+  max_strided_loads = aarch64_tune_params.prefetch->hw_prefetchers_avail;
+
+  if (max_strided_loads == -1)
+    return n_unroll;
+
+  unsigned count = count_strided_load_rtl (loop, max_strided_loads);
+  if (count > 0)
+    n_unroll = 1 << (floor_log2 (max_strided_loads / count));
+
+  return n_unroll;
+}
+
 /* Target-specific selftests.  */
 
 #if CHECKING_P
@@ -15620,6 +15753,9 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_OK_TO_UNROLL
 #define TARGET_OK_TO_UNROLL aarch64_ok_to_unroll
 
+#undef TARGET_LOOP_UNROLL_ADJUST
+#define TARGET_LOOP_UNROLL_ADJUST aarch64_loop_unroll_adjust
+
 #if CHECKING_P
 #undef TARGET_RUN_TARGET_SELFTESTS
 #define TARGET_RUN_TARGET_SELFTESTS selftest::aarch64_run_selftests
diff --git a/gcc/loop-iv.c b/gcc/loop-iv.c
index 745b613..3a8c54e 100644
--- a/gcc/loop-iv.c
+++ b/gcc/loop-iv.c
@@ -852,7 +852,7 @@ record_biv (rtx def, struct rtx_iv *iv)
 /* Determines whether DEF is a biv and if so, stores its description
    to *IV.  */
 
-static bool
+bool
 iv_analyze_biv (rtx def, struct rtx_iv *iv)
 {
   rtx inner_step, outer_step;
-- 
2.7.4
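The core of aarch64_loop_unroll_adjust above is the capping expression
n_unroll = 1 << floor_log2 (max_strided_loads / count).  Below is a
standalone sketch of that calculation with hypothetical numbers (it is
not part of the patch; floor_log2 is reimplemented so the example
builds outside of GCC).

```c
/* Standalone sketch of the unroll-factor capping done by
   aarch64_loop_unroll_adjust in the patch above, using hypothetical
   numbers.  */
#include <stdio.h>

/* Simple replacement for GCC's floor_log2 for positive inputs.  */
static unsigned
floor_log2_u (unsigned x)
{
  unsigned r = 0;
  while (x >>= 1)
    r++;
  return r;
}

int
main (void)
{
  unsigned max_strided_loads = 8; /* hypothetical hw_prefetchers_avail  */
  unsigned count = 3;             /* strided loads counted in the loop  */
  unsigned n_unroll = 16;         /* factor proposed by the RTL unroller  */

  if (count > 0)
    n_unroll = 1u << floor_log2_u (max_strided_loads / count);

  /* 8 / 3 == 2 and floor_log2 (2) == 1, so the unroll factor is capped
     at 2: two unrolled copies times three streams is six streams, which
     stays within the assumed eight prefetcher slots.  */
  printf ("capped unroll factor: %u\n", n_unroll);
  return 0;
}
```

This also shows why count_strided_load_rtl stops counting once count
exceeds max_strided_loads / 2: at that point max_strided_loads / count
is already 1, the computed unroll factor is 1, and counting further
cannot change the result.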