Message ID | 1481801639-14286-1-git-send-email-james.greenhalgh@arm.com |
---|---|
State | New |
Headers | show |
On Thu, Dec 15, 2016 at 12:33 PM, James Greenhalgh <james.greenhalgh@arm.com> wrote: > > Hi, > > As mentioned in PR77445, the improvements to the jump threading cost model > this year have caused substantial regressions in the amount of jump threading > we do and the performance of workloads which rely on that threading. > > This patch represents the low-bar in fixing the performance issues reported > in PR77445 - by weakening the cost model enough that we thread in a way much > closer to GCC 6. I don't think this patch is likely to be acceptable for > trunk, but I'm posting it for consideration regardless. > > Under the new cost model, if the edge doesn't pass optimize_edge_for_speed_p, > then we don't thread. The problem in late threading is bad edge profile > data makes the edge look cold, and thus it fails optimize_edge_for_speed_p > and is no longer considered a candidate for threading. As an aside, I think > this is the wrong cost model for jump threading, where you get the most > impact if you can resolve unpredictable switch statements - which by their > nature may have multiple cold edges in need of threading. > > Early threading should avoid these issues, as there is no edge profile > info yet. optimize_edge_for_speed_p is therefore more likely to hold, but > the condition for threading is: > > if (speed_p && optimize_edge_for_speed_p (taken_edge)) > { > if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS)) > { > [...reject threading...] > } > } > else if (n_insns > 1) > { > [...reject threading...] > } > > With speed_p is hardwired to false for the early threader > ( pass_early_thread_jumps::execute ): > > find_jump_threads_backwards (bb, false); > > So we always fall to the n_insns > 1 case and thus only rarely get to > thread. > > In this patch I change that call in pass_early_thread_jumps::execute to > instead look at optimize_bb_for_speed_p (bb) . That allows the speed_p > check to pass in the main threading cost model, and then the > optimize_edge_for_speed_p can also pass. That gets the first stage of > jump-threading back working in a proprietary benchmark which is sensitive > to this optimisation. But all early optimizations are supposed to never increase code size -- they exist to get more realistic sizes to the IPA inliner heuristic code. Iff at all then add some --param early-jump-threading-insns and use that in place of the '1' and then eventually bump that to 2 or 3. But using the 100 from the late threading is too much. Note that using optimize_bb_for_speed_p at this point in the pipeline is nonsense and you can as well use ! optimize_size ... because even a guessed profile is only created at the end of the early optimization pipeline. > To get the rest of the required jump threading, I also have to weaken the > cost model - and this is obviously a hack! The easy hack is to special case > when the taken edge has frequency zero, and permit jump threading there. > > I know this patch is likely not the preferred way to fix this. For me that > would be a change to the cost model, which as I mentioned above I think > misses the point about which edges we want to thread. By far the best > fix would be to the junk edge profiling data we create during threading. > > However, this patch does fix the performance issues identified in PR77445, > and does highlight a fundamental issue with the early threader (which > doesn't seem to me like it will be effective while it sets speed_p to > false), so I'd like it to be considered for trunk if no better fix appears > before stage 4. But the PR has nothing to do with early threading but with late threading fed a bogus profile. > Bootstrapped on x86_64 with no issues. The testsuite changes just reshuffle > which passes spot the threading opportunities. > > OK? I don't think so. Richard. > Thanks, > James > > --- > gcc/ > > 2016-12-15 James Greenhalgh <james.greenhalgh@arm.com> > > PR tree-optimization/77445 > * tree-ssa-threadbackward.c (profitable_jump_thread_path) Work > around sometimes corrupt edge frequency data. > (pass_early_thread_jumps::execute): Pass > optimize_bb_for_speed_p as the speed_p parameter to > find_jump_threads_backwards to enable threading in more cases. > > gcc/testsuite/ > > 2016-12-15 James Greenhalgh <james.greenhalgh@arm.com> > > PR tree-optimization/77445 > * gcc.dg/tree-ssa/ssa-dom-thread-7.c: Adjust options and dump passes. > * gcc.dg/tree-ssa/pr66752-3.c: Likewise. >
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c index 896c8bf..39ec3d6 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr66752-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-details -fdump-tree-dce2" } */ +/* { dg-options "-O2 -fdump-tree-ethread-details -fdump-tree-dce2" } */ extern int status, pt; extern int count; @@ -34,7 +34,7 @@ foo (int N, int c, int b, int *a) /* There are 4 FSM jump threading opportunities, all of which will be realized, which will eliminate testing of FLAG, completely. */ -/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "thread1"} } */ +/* { dg-final { scan-tree-dump-times "Registering FSM" 4 "ethread"} } */ /* There should be no assignments or references to FLAG, verify they're eliminated as early as possible. */ diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c index 9a9d1cb..5b087fb 100644 --- a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c @@ -1,8 +1,9 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 16" "thread1" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 9" "thread2" } } */ -/* { dg-final { scan-tree-dump "Jumps threaded: 3" "thread3" } } */ +/* { dg-options "-O2 -fdump-tree-ethread-stats -fdump-tree-thread1-stats -fdump-tree-thread2-stats -fdump-tree-dom2-stats -fdump-tree-thread3-stats -fdump-tree-dom3-stats -fdump-tree-vrp2-stats -fno-guess-branch-probability" } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 16" "ethread" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 6" "thread1" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 4" "thread2" } } */ +/* { dg-final { scan-tree-dump "Jumps threaded: 2" "thread3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom2" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "dom3" } } */ /* { dg-final { scan-tree-dump-not "Jumps threaded" "vrp2" } } */ diff --git a/gcc/tree-ssa-threadbackward.c b/gcc/tree-ssa-threadbackward.c index 203e20e..0d29ab5 100644 --- a/gcc/tree-ssa-threadbackward.c +++ b/gcc/tree-ssa-threadbackward.c @@ -311,7 +311,20 @@ profitable_jump_thread_path (vec<basic_block, va_gc> *&path, return NULL; } - if (speed_p && optimize_edge_for_speed_p (taken_edge)) + + /* FIXME: Edge frequency can get badly out shape as a result of + the jump threading passes. In those cases, + EDGE_FREQUENCY (taken_edge) == 0 , and so trivially fails the + test for optimize_edge_for_speed_p. The correct fix would + be to ensure that profiling information coming out of jump threading + is meaningful, but in lieu of that add a hack check to this cost model + which permits jump threading in the case EDGE_FREQUENCY has been + corrupted. Only do this if the profile info is present and corrupt, + not if it is absent. */ + if (speed_p + && (optimize_edge_for_speed_p (taken_edge) + || (profile_status_for_fn (cfun) != PROFILE_ABSENT + && EDGE_FREQUENCY (taken_edge) == 0))) { if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS)) { @@ -870,7 +883,7 @@ pass_early_thread_jumps::execute (function *fun) FOR_EACH_BB_FN (bb, fun) { if (EDGE_COUNT (bb->succs) > 1) - find_jump_threads_backwards (bb, false); + find_jump_threads_backwards (bb, optimize_bb_for_speed_p (bb)); } thread_through_all_blocks (true); return 0;