Message ID: CABXYE2VXjy7=5Y=c1TCxLE8KuwLtwBYBhTB24xrWDvWAeiBwbQ@mail.gmail.com
State: New
On Tue, Jul 7, 2015 at 8:07 AM, Jeff Law <law@redhat.com> wrote:
> On 06/29/2015 07:15 PM, Jim Wilson wrote:
> So if these "copies" require a conversion, then isn't it fundamentally
> wrong to have a PHI node which copies between them?  That would seem to
> implicate the eipa_sra pass as needing to be aware of these promotions and
> avoid having these objects with different representations appearing on the
> lhs/rhs of a PHI node.

My years at Cisco didn't give me a chance to work on the SSA passes, so I
don't know much about how they work.  But looking at this, I see that PHI
nodes are eventually handled by emit_partition_copy in tree-outof-ssa.c,
which calls convert_to_mode, so it appears that conversions between
different (closely related?) types are OK in a PHI node.  The problem in
this case is that we have the exact same type for the src and dest.  The
only difference is that ARM forces sign-extension for signed sub-word
parameters and zero-extension for signed sub-word locals.  Thus, to detect
the need for a conversion, you have to have the decls, and we don't have
them here.

There is also the problem that all of the SUBREG_PROMOTED_* handling is in
expand_expr and friends, which aren't used by the cfglayout/tree-outof-ssa.c
code for PHI nodes.  So we either need to copy some of the SUBREG_PROMOTED_*
handling into cfglayout/tree-outof-ssa.c, or modify them to call expand_expr
for PHI nodes, and I haven't had any luck getting that to work yet.  I still
need to learn more about the code to figure out whether this is possible.

I also think that the ARM handling of PROMOTE_MODE is wrong.  Treating a
signed sub-word as unsigned can lead to inefficient code.  This part of the
problem is much easier for me to fix.  It may be hard to convince the ARM
maintainers that it should be changed, though.  I need more time to work on
this too.

I haven't looked at trying to forbid the optimizer from creating PHI nodes
connecting parameters to locals.  That just sounds like a strange thing to
forbid, and seems likely to result in worse code by disabling too many
optimizations.  But maybe it is the right solution if neither of the other
two options works.  This would only need to be done when PROMOTE_MODE
disagrees with TARGET_PROMOTE_FUNCTION_MODE, forcing the copy to require a
conversion.

Jim
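To make the parameter/local mismatch concrete: under the rules above, a
signed short parameter arrives sign-extended to SImode, while the old ARM
PROMOTE_MODE keeps a signed short local zero-extended.  The small standalone
program below (written for illustration here, not taken from GCC or from the
thread) computes the two register images of the value -1:

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  int16_t v = -1;
  /* TARGET_PROMOTE_FUNCTION_MODE: sub-word parameters are sign-extended.  */
  uint32_t as_param = (uint32_t) (int32_t) v;   /* 0xffffffff */
  /* Old ARM PROMOTE_MODE: sub-word locals are zero-extended (UNSIGNEDP = 1).  */
  uint32_t as_local = (uint32_t) (uint16_t) v;  /* 0x0000ffff */
  printf ("param image: %#010x, local image: %#010x\n",
          (unsigned) as_param, (unsigned) as_local);
  return 0;
}

A plain register-to-register move between the two images is not
value-preserving, which is exactly why the PHI copy needs a conversion.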
On Thu, Jul 2, 2015 at 2:07 AM, Richard Earnshaw <Richard.Earnshaw@foss.arm.com> wrote:
> Not quite, ARM state still has more flexible addressing modes for
> unsigned byte loads than for signed byte loads.  It's even worse with
> thumb1 where some signed loads have no single-register addressing mode
> (ie you have to copy zero into another register to use as an index
> before doing the load).

I wasn't aware of the load address problem.  That was something I hadn't
considered, and I will have to look at it.  A load is just one instruction,
though.  For most other instructions, a zero-extend results in less
efficient code, because it then forces a sign-extend before a signed
operation.  The fact that parameters and locals are handled differently,
requiring conversions when copying between them, results in even more
inefficient code.  And changing TARGET_PROMOTE_FUNCTION_MODE is an ABI
change, and hence would be unwise, so changing PROMOTE_MODE is the safer
option.

Consider this testcase:

extern signed short gs;

short
sub (void)
{
  signed short s = gs;
  int i;

  for (i = 0; i < 10; i++)
    {
      s += 1;
      if (s > 10)
	break;
    }

  return s;
}

The inner loop ends up as

.L3:
	adds	r3, r3, #1
	mov	r0, r1
	uxth	r3, r3
	sxth	r2, r3
	cmp	r2, #10
	bgt	.L8
	cmp	r2, r1
	bne	.L3
	bx	lr

We need the sign-extension for the compare.  We need the zero-extension for
the loop-carried dependency.  So we have two extensions in every loop
iteration, plus some extra register usage and register movement.  We get
better code for this example if we aren't forcing signed shorts to be
zero-extended via PROMOTE_MODE.

The lack of a reg+immediate address mode for ldrs[bh] in thumb1 does look
like a problem, though.  It means the difference between generating

	movs	r2, #0
	ldrsh	r3, [r3, r2]

with my patch, or

	ldrh	r3, [r3]
	lsls	r2, r3, #16
	asrs	r2, r2, #16

without my patch.  It isn't clear which sequence is better.  The
sign-extends in the second sequence can sometimes be optimized away, and
sometimes they can't.  Similarly, in the first sequence, loading zero into a
reg can sometimes be optimized away, and sometimes it can't.  There is also
no guarantee that you get the first sequence with the patch or the second
sequence without the patch.  There is a splitter for ldrsh, so you can
sometimes get the second sequence with the patch.  Similarly, it might be
possible to get the first sequence without the patch in some cases, though I
don't have an example at the moment.

Jim
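The uxth/sxth pair in the loop above can be read back as C.  Here is one
iteration written out with explicit conversions (an illustrative rewrite of
the assembly, not compiler output; `step' and `s_reg' are made-up names):

int
step (unsigned int s_reg)
{
  /* adds + uxth: PROMOTE_MODE keeps the local s zero-extended.  */
  s_reg = (unsigned short) (s_reg + 1);
  /* sxth: the signed compare needs a sign-extended value.  */
  int s_val = (short) s_reg;
  /* cmp/bgt  */
  return s_val > 10;
}

Without the forced zero-extension, s could live sign-extended in its
register and the uxth would disappear, leaving one extension per iteration
instead of two.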
On Wed, Jul 8, 2015 at 3:54 PM, Jeff Law <law@redhat.com> wrote:
> On 07/07/2015 10:29 AM, Jim Wilson wrote:
> This is critically important as various parts of the compiler will take a
> degenerate PHI node and propagate the RHS of the PHI into the uses of the
> LHS of the PHI -- without doing any conversions.

I think this is OK, because tree-outof-ssa does send code in basic blocks
through expand_expr, which will emit conversions if necessary.  It is only
the conversion of PHI nodes to RTL that is the problem, as it doesn't use
expand_expr, and hence doesn't get the SUBREG_PROMOTED_P conversions.

Jim
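For reference, a minimal sketch of the kind of source that can give
out-of-ssa such a PHI copy (a hypothetical reduction written for this
discussion; the actual trigger in PR 65932 is a linux kernel file and
involves eipa_sra creating a fake parameter):

short
f (short p, int cond)
{
  short l = 0;
  if (cond)
    /* At the join point a PHI merges the local l (zero-extended per
       PROMOTE_MODE) with the parameter p (sign-extended per
       TARGET_PROMOTE_FUNCTION_MODE).  */
    l = p;
  return l + 1;
}

If the PHI is expanded as a simple move, the two in-register representations
are conflated without the needed conversion, which is the wrong-code failure
mode described above.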
On 15 February 2016 at 12:32, Kyrill Tkachov <kyrylo.tkachov@foss.arm.com> wrote:
>
> On 04/02/16 08:58, Ramana Radhakrishnan wrote:
>>
>> On Tue, Jun 30, 2015 at 2:15 AM, Jim Wilson <jim.wilson@linaro.org> wrote:
>>>
>>> This is my suggested fix for PR 65932, which is a linux kernel
>>> miscompile with gcc-5.1.
>>>
>>> The problem here is caused by a chain of events.  The first is that
>>> the relatively new eipa_sra pass creates fake parameters that behave
>>> slightly differently than normal parameters.  The second is that the
>>> optimizer creates phi nodes that copy local variables to fake
>>> parameters and/or vice versa.  The third is that the out-of-ssa pass
>>> assumes that it can emit simple move instructions for these phi nodes.
>>> And the fourth is that the ARM port has a PROMOTE_MODE macro that
>>> forces QImode and HImode to unsigned, but a
>>> TARGET_PROMOTE_FUNCTION_MODE hook that does not.  So signed char and
>>> short parameters have different in-register representations than local
>>> variables, and require a conversion when copying between them, a
>>> conversion that the out-of-ssa pass can't easily emit.
>>>
>>> Ultimately, I think this is a problem in the arm backend.  It should
>>> not have a PROMOTE_MODE macro that is changing the sign of char and
>>> short local variables.  I also think that we should merge the
>>> PROMOTE_MODE macro with the TARGET_PROMOTE_FUNCTION_MODE hook to
>>> prevent this from happening again.
>>>
>>> I see four general problems with the current ARM PROMOTE_MODE definition:
>>> 1) Unsigned char is only faster for armv5 and earlier, before the sxtb
>>> instruction was added.  It is a loss for armv6 and later.
>>> 2) Unsigned short was only faster for targets that don't support
>>> unaligned accesses.  Support for these targets was removed a while
>>> ago, and this PROMOTE_MODE hunk should have been removed at the same
>>> time.  It was accidentally left behind.
>>> 3) TARGET_PROMOTE_FUNCTION_MODE used to be a boolean hook; when it was
>>> converted to a function, the PROMOTE_MODE code was copied without the
>>> UNSIGNEDP changes.  Thus it is only an accident that
>>> TARGET_PROMOTE_FUNCTION_MODE and PROMOTE_MODE disagree.  Changing
>>> TARGET_PROMOTE_FUNCTION_MODE is an ABI change, so only PROMOTE_MODE
>>> changes to resolve the difference are safe.
>>> 4) There is a general principle that you should only change signedness
>>> in PROMOTE_MODE if the hardware forces it, as otherwise this results
>>> in extra conversion instructions that make code slower.  The mips64
>>> hardware for instance requires that 32-bit values be sign-extended
>>> regardless of type, and instructions may trap if this is not true.
>>> However, it has a set of 32-bit instructions that operate on these
>>> values, and hence no conversions are required.  There is no similar
>>> case on ARM.  Thus the conversions are unnecessary and unwise.  This
>>> can be seen in the testcases where gcc emits both a zero-extend and a
>>> sign-extend inside a loop, as the sign-extend is required for a
>>> compare, and the zero-extend is required by PROMOTE_MODE.
>>
>> Given Kyrill's testing with the patch and the reasonably detailed
>> check of the effects of the code generation changes, the arm.h hunk is
>> OK.  I do think we should make it explicit in the documentation that
>> PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE should agree, and
>> better still maybe put in a checking assert for the same in the
>> mid-end, but that could be the subject of a follow-up patch.
>>
>> Ok to apply just the arm.h hunk, as I think Kyrill has taken care of
>> the testsuite fallout separately.
>
> Hi all,
>
> I'd like to backport the arm.h hunk from this commit (r233130) to the
> GCC 5 branch.  As the CSE patch from my series had some fallout on
> x86_64 due to a deficiency in the AVX patterns that is too invasive to
> fix at this stage (and presumably backport), I'd like to just backport
> this arm.h fix and adjust the tests to XFAIL the fallout that comes
> with not applying the CSE patch.  The attached patch does that.
>
> The code quality fallout on code outside the testsuite is not that
> great.  The SPEC benchmarks are not affected by not applying the CSE
> change, and only a single sequence in a popular embedded benchmark
> shows some degradation for -mtune=cortex-a9, in the same way as the
> wmul-1.c and wmul-2.c tests.
>
> I think that's a fair tradeoff for fixing the wrong-code bug on that
> branch.
>
> Ok to backport r233130 and the attached testsuite patch to the GCC 5
> branch?
>
> Thanks,
> Kyrill
>
> 2016-02-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR target/65932
>     * gcc.target/arm/wmul-1.c: Add -mtune=cortex-a9 to dg-options.
>     XFAIL the scan-assembler test.
>     * gcc.target/arm/wmul-2.c: Likewise.
>     * gcc.target/arm/wmul-3.c: Simplify test to generate a single smulbb.

Hi Kyrill,

I've noticed that wmul-3 still fails on the gcc-5 branch when forcing the
GCC configuration to:
--with-cpu cortex-a5 --with-fpu vfpv3-d16-fp16
(target arm-none-linux-gnueabihf)

The generated code is:
	sxth	r0, r0
	sxth	r1, r1
	mul	r0, r1, r0
instead of
	smulbb	r0, r1, r0
on trunk.  I guess we don't worry?

>>
>> regards
>> Ramana
>>
>>> My change was tested with an arm bootstrap, make check, and a SPEC
>>> CPU2000 run.  The original poster verified that this gives a linux
>>> kernel that boots correctly.
>>>
>>> The PROMOTE_MODE change causes 3 testsuite testcases to fail.  These
>>> are tests that verify that smulbb and/or smlabb are generated.
>>> Eliminating the unnecessary sign conversions causes us to get better
>>> code that doesn't include the smulbb and smlabb instructions.  I had
>>> to modify the testcases to get them to emit the desired instructions.
>>> With the testcase changes there are no additional testsuite failures,
>>> though I'm concerned that the modified testcases may be fragile, and
>>> future changes may break them again.
>>>
>>> If there are ARM parts where smulbb/smlabb are faster than mul/mla,
>>> then maybe we should try to add new patterns to get the instructions
>>> emitted again for the unmodified testcases.
>>>
>>> Jim
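Ramana's suggested mid-end checking assert could look roughly like the
sketch below.  This is an assumption about its shape rather than an actual
patch: it leans on promote_mode and promote_function_mode, the existing
explow.c wrappers around the macro and the hook, and would only compile
inside the GCC tree.

/* Sketch: verify that PROMOTE_MODE and TARGET_PROMOTE_FUNCTION_MODE agree
   on signedness whenever they promote TYPE to the same mode.  */
static void
check_promotion_consistency (tree type)
{
  int unsignedp_mode = TYPE_UNSIGNED (type);
  int unsignedp_func = TYPE_UNSIGNED (type);
  machine_mode pmode = promote_mode (type, TYPE_MODE (type), &unsignedp_mode);
  machine_mode fmode = promote_function_mode (type, TYPE_MODE (type),
					      &unsignedp_func, NULL_TREE, 1);
  gcc_checking_assert (pmode != fmode || unsignedp_mode == unsignedp_func);
}

On the pre-patch ARM definitions, a signed short would trip this assert:
both hooks promote it to SImode, but with opposite UNSIGNEDP.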
gcc/
2015-06-29  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	* config/arm/arm.h (PROMOTE_MODE): Don't set UNSIGNEDP for QImode
	and HImode.

gcc/testsuite/
2015-06-29  Jim Wilson  <jim.wilson@linaro.org>

	PR target/65932
	* gcc.target/arm/wmul-1.c (mac): Change a and b to int pointers.
	Cast multiply operands to short.
	* gcc.target/arm/wmul-2.c (vec_mpy): Cast multiply result to short.
	* gcc.target/arm/wmul-3.c (mac): Cast multiply results to short.

Index: config/arm/arm.h
===================================================================
--- config/arm/arm.h	(revision 224672)
+++ config/arm/arm.h	(working copy)
@@ -523,16 +523,10 @@ extern int arm_arch_crc;
    type, but kept valid in the wider mode.  The signedness of the
    extension may differ from that of the type.  */
 
-/* It is far faster to zero extend chars than to sign extend them */
-
 #define PROMOTE_MODE(MODE, UNSIGNEDP, TYPE)	\
   if (GET_MODE_CLASS (MODE) == MODE_INT	\
       && GET_MODE_SIZE (MODE) < 4)	\
     {	\
-      if (MODE == QImode)	\
-	UNSIGNEDP = 1;	\
-      else if (MODE == HImode)	\
-	UNSIGNEDP = 1;	\
       (MODE) = SImode;	\
     }
 
Index: testsuite/gcc.target/arm/wmul-1.c
===================================================================
--- testsuite/gcc.target/arm/wmul-1.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-1.c	(working copy)
@@ -2,14 +2,14 @@
 /* { dg-require-effective-target arm_dsp } */
 /* { dg-options "-O1 -fexpensive-optimizations" } */
 
-int mac(const short *a, const short *b, int sqr, int *sum)
+int mac(const int *a, const int *b, int sqr, int *sum)
 {
   int i;
   int dotp = *sum;
 
   for (i = 0; i < 150; i++)
     {
-      dotp += b[i] * a[i];
-      sqr += b[i] * b[i];
+      dotp += (short)b[i] * (short)a[i];
+      sqr += (short)b[i] * (short)b[i];
     }
   *sum = dotp;
Index: testsuite/gcc.target/arm/wmul-2.c
===================================================================
--- testsuite/gcc.target/arm/wmul-2.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-2.c	(working copy)
@@ -7,7 +7,7 @@ void vec_mpy(int y[], const short x[], s
   int i;
 
   for (i = 0; i < 150; i++)
-    y[i] += ((scaler * x[i]) >> 31);
+    y[i] += ((short)(scaler * x[i]) >> 31);
 }
 
 /* { dg-final { scan-assembler-times "smulbb" 1 } } */
Index: testsuite/gcc.target/arm/wmul-3.c
===================================================================
--- testsuite/gcc.target/arm/wmul-3.c	(revision 224672)
+++ testsuite/gcc.target/arm/wmul-3.c	(working copy)
@@ -8,8 +8,8 @@ int mac(const short *a, const short *b,
   int dotp = *sum;
 
   for (i = 0; i < 150; i++)
     {
-      dotp -= b[i] * a[i];
-      sqr -= b[i] * b[i];
+      dotp -= (short)(b[i] * a[i]);
+      sqr -= (short)(b[i] * b[i]);
     }
   *sum = dotp;
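The testsuite changes in the patch work because smulbb and smlabb are
16x16->32 multiplies: combine only matches those patterns when both multiply
operands are known to be sign-extended 16-bit values, which the new casts
re-establish.  A minimal function shape that should still be a candidate
(an illustration assuming an arm_dsp target, not one of the testcases):

int
mac16 (short a, short b, int acc)
{
  /* a and b promote to int by sign-extension, so this is a 16x16->32
     multiply-accumulate, the shape the smlabb pattern matches.  */
  return acc + a * b;
}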