Message ID | CADnVucDOc85gQyAR+G45suuCRNHjznQAzcxrg9YB1yjF+QP-Ww@mail.gmail.com |
---|---|
State | New |
Headers | show |
Ping? Ramana mentioned at Linaro Connect that this should be tested on more platforms. I've now checked this on qemu with no regressions on trunk for: arm-unknown-linux-gnueabihf v7-A: ARM and Thumb-2 arm-unknown-linux-gnueabi v4t, v5t, v6: ARM OK for trunk? Archive link: http://gcc.gnu.org/ml/gcc-patches/2014-02/msg01611.html On 27 February 2014 16:38, Charles Baylis <charles.baylis@linaro.org> wrote: > [resending as text/plain] > > Hi > > These patches optimise 64 bit division by removing the use of the > __gnu_[u]ldivmod_helper functions and hence avoiding the redundant > calculation of the remainder in those functions. > > Bootstrapped, tested and checked for arm-unknown-linux-gnueabihf. > > Benchmarked on Chromebook and Raspberry Pi using attached divbench3.c. > Loop1 varies the divisor and loop2 varies the dividend. > > Chromebook: > > before: > loop1 unsigned: 3.474419 > loop2 unsigned: 6.564871 > loop1 signed: 4.127967 > loop2 signed: 6.071490 > > after: > loop1 unsigned: 2.781364 > loop2 unsigned: 6.166478 > loop1 signed: 2.800974 > loop2 signed: 6.129588 > > Raspberry pi: > before > loop1 unsigned: 28.881753 > loop2 unsigned: 19.876385 > loop1 signed: 32.074941 > loop2 signed: 20.594860 > > after: > loop1 unsigned: 24.893846 > loop2 unsigned: 19.537562 > loop1 signed: 25.334509 > loop2 signed: 19.615088 > > Any comments? OK for stage 1? > > > Patch 1: > > 2014-02-27 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call > to __udivmoddi4. > > > Patch 2: > > 2014-02-27 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division via > call to __udivmoddi4 and fixing up for negative operands.
On 27/02/14 16:38, Charles Baylis wrote: > [resending as text/plain] > > Hi > > These patches optimise 64 bit division by removing the use of the > __gnu_[u]ldivmod_helper functions and hence avoiding the redundant > calculation of the remainder in those functions. > > Bootstrapped, tested and checked for arm-unknown-linux-gnueabihf. > > Benchmarked on Chromebook and Raspberry Pi using attached divbench3.c. > Loop1 varies the divisor and loop2 varies the dividend. > > Chromebook: > > before: > loop1 unsigned: 3.474419 > loop2 unsigned: 6.564871 > loop1 signed: 4.127967 > loop2 signed: 6.071490 > > after: > loop1 unsigned: 2.781364 > loop2 unsigned: 6.166478 > loop1 signed: 2.800974 > loop2 signed: 6.129588 > > Raspberry pi: > before > loop1 unsigned: 28.881753 > loop2 unsigned: 19.876385 > loop1 signed: 32.074941 > loop2 signed: 20.594860 > > after: > loop1 unsigned: 24.893846 > loop2 unsigned: 19.537562 > loop1 signed: 25.334509 > loop2 signed: 19.615088 > > Any comments? OK for stage 1? > > > Patch 1: > > 2014-02-27 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call > to __udivmoddi4. > > > Patch 2: > > 2014-02-27 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division via > call to __udivmoddi4 and fixing up for negative operands. > > Sorry for the delay reviewing this... I think really, you've got three independent changes here: 1) Optimize the prologue/epilogue sequences when ldrd is available. 2) Replace the call to __gnu_ldivmod_helper with __udivmoddi4 3) Optimize the code to __aeabi_ldivmod. Ideally, therefore, this is a three patch series, but it's then missing a few bits. 4) Step 2 can also be trivially applied to bpabi-v6m.S as well, since it's a direct swap of one function for another (unless I've misread the changes, I think the ABI of the two helper functions are the same). 5) Step 4 then makes __gnu_ldivmod_helper in bpabi.c a dead function which can be deleted. This is good because currently pulling in either 64-bit division function causes both these helper functions to be pulled in and thus the whole of the 64-bit div-mod code for both signed and unsigned values. That's particularly unfortunate for ARMv6m class devices as that's potentially a lot of redundant code. Finally, I know this was the original code, but the complete lack of comments in this code made reviewing even the trivial parts a complete nightmare -- it took me half an hour before I remembered that __udivmoddi4 took three parameters, the third of which was on the stack: thus the messing around with sp/ip in the prologue wasn't just trivial padding but a necessary part of the function. Please could you add, at least some short comments clarifying the register disposition on input and what that prologue code is up to... Finally, how was this code tested? Anyway, some additional comments below: > 0001-Optimise-__aeabi_uldivmod.patch > > > From 35254b813303e7fb40eb8aa0bb749216fd8f96fc Mon Sep 17 00:00:00 2001 > From: Charles Baylis <charles.baylis@linaro.org> > Date: Tue, 25 Feb 2014 18:34:38 +0000 > Subject: [PATCH 1/2] Optimise __aeabi_uldivmod > > 2014-02-25 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_uldivmod): Perform division using call > to __udivmoddi4. > * config/arm/bpabi.S (__aeabi_uldivmod): Optimise stack pointer > manipulation. Don't repeat the function name for multiple tweaks to the same function; as mentioned above, if these are really separate changes they should be in separate submissions. Mixing unrelated changes just makes the reviewing step that much harder. > --- > libgcc/config/arm/bpabi.S | 25 ++++++++++++++++++++----- > 1 file changed, 20 insertions(+), 5 deletions(-) > > diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S > index 7772301..e020af5 100644 > --- a/libgcc/config/arm/bpabi.S > +++ b/libgcc/config/arm/bpabi.S > @@ -120,6 +120,16 @@ ARM_FUNC_START aeabi_ulcmp > #endif > .endm > > +/* we can use STRD/LDRD on v5TE and later, and any Thumb-2 architecture. */ > +#if (defined(__ARM_EABI__) \ > + && (defined(__thumb2__) \ > + || (__ARM_ARCH >= 5 && defined(__TARGET_FEATURE_DSP)))) > +#define CAN_USE_LDRD 1 > +#else > +#define CAN_USE_LDRD 0 > +#endif > + > + > #ifdef L_aeabi_ldivmod > > ARM_FUNC_START aeabi_ldivmod > @@ -149,18 +159,23 @@ ARM_FUNC_START aeabi_uldivmod > cfi_start __aeabi_uldivmod, LSYM(Lend_aeabi_uldivmod) > test_div_by_zero unsigned > > - sub sp, sp, #8 > -#if defined(__thumb2__) > - mov ip, sp > - push {ip, lr} > +#if defined(__thumb2__) && CAN_USE_LDRD > + sub ip, sp, #8 > + strd ip,lr, [sp, #-16]! Space after comma. Also, since you've essentially rewritten the entire function, can you please also reformat them to follow the coding style of the rest of the file: namely "<tab>OP<tab>operands". > #else > + sub sp, sp, #8 > do_push {sp, lr} > #endif Please add a comment that the value at *sp is the address of the the slot for the remainder. > 98: cfi_push 98b - __aeabi_uldivmod, 0xe, -0xc, 0x10 > - bl SYM(__gnu_uldivmod_helper) __PLT__ > + bl SYM(__udivmoddi4) __PLT__ > ldr lr, [sp, #4] > +#if CAN_USE_LDRD > + ldrd r2, r3, [sp, #8] > + add sp, sp, #16 > +#else > add sp, sp, #8 > do_pop {r2, r3} > +#endif > RET > cfi_end LSYM(Lend_aeabi_uldivmod) > > > > 0002-Optimise-__aeabi_ldivmod.patch > > > From 975d9c624e77ee00476e6866250b0e2e31461fca Mon Sep 17 00:00:00 2001 > From: Charles Baylis <charles.baylis@linaro.org> > Date: Tue, 25 Feb 2014 16:27:59 +0000 > Subject: [PATCH 2/2] Optimise __aeabi_ldivmod > > 2014-02-25 Charles Baylis <charles.baylis@linaro.org> > > * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division using > unsigned division via call to __udivmoddi4 and additional logic. > --- > libgcc/config/arm/bpabi.S | 74 +++++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 69 insertions(+), 5 deletions(-) > > diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S > index e020af5..8b75a28 100644 > --- a/libgcc/config/arm/bpabi.S > +++ b/libgcc/config/arm/bpabi.S > @@ -136,20 +136,84 @@ ARM_FUNC_START aeabi_ldivmod > cfi_start __aeabi_ldivmod, LSYM(Lend_aeabi_ldivmod) > test_div_by_zero signed > > - sub sp, sp, #8 > -#if defined(__thumb2__) > - mov ip, sp > - push {ip, lr} > +#if defined(__thumb2__) && CAN_USE_LDRD > + sub ip, sp, #8 > + strd ip,lr, [sp, #-16]! Space after comma. > #else > + sub sp, sp, #8 > do_push {sp, lr} > #endif > + cmp xxh, #0 > + blt 1f > + cmp yyh, #0 > + blt 2f > + > +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 The CFI push should really precede your conditional tests, it relates to the do_push expression. > + bl SYM(__udivmoddi4) __PLT__ > + ldr lr, [sp, #4] > +#if CAN_USE_LDRD > + ldrd r2, r3, [sp, #8] > + add sp, sp, #16 > +#else > + add sp, sp, #8 > + do_pop {r2, r3} > +#endif You're missing a CFI pop, which is needed when the values on the stack go out of scope. > + RET > +1: /* xxh:xxl is negative */ > + rsbs xxl, xxl, #0 We're using unified syntax, so NEGS is preferable. > + sbc xxh, xxh, xxh, lsl #1 Worthy of a comment, Thumb2 has no RSC instruction, so use X - 2X. > + cmp yyh, #0 > + blt 3f > +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 This CFI push looks wrong. You've already pushed things earlier. On the other hand, you should save the state before the CFI pop above, so that you can restore the state again for the next (ie this) block of code. > + bl SYM(__udivmoddi4) __PLT__ > + ldr lr, [sp, #4] > +#if CAN_USE_LDRD > + ldrd r2, r3, [sp, #8] > + add sp, sp, #16 > +#else > + add sp, sp, #8 > + do_pop {r2, r3} > +#endif > + rsbs xxl, xxl, #0 > + sbc xxh, xxh, xxh, lsl #1 > + rsbs yyl, yyl, #0 > + sbc yyh, yyh, yyh, lsl #1 > + RET > + > +2: /* only yyh:yyl is negative */ > + rsbs yyl, yyl, #0 > + sbc yyh, yyh, yyh, lsl #1 > 98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 See above... and more below. > - bl SYM(__gnu_ldivmod_helper) __PLT__ > + bl SYM(__udivmoddi4) __PLT__ > ldr lr, [sp, #4] > +#if CAN_USE_LDRD > + ldrd r2, r3, [sp, #8] > + add sp, sp, #16 > +#else > add sp, sp, #8 > do_pop {r2, r3} > +#endif > + rsbs xxl, xxl, #0 > + sbc xxh, xxh, xxh, lsl #1 > RET > + > +3: /* both xxh:xxl and yyh:yyl are negative */ > + rsbs yyl, yyl, #0 > + sbc yyh, yyh, yyh, lsl #1 > cfi_end LSYM(Lend_aeabi_ldivmod) > +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 > + bl SYM(__udivmoddi4) __PLT__ > + ldr lr, [sp, #4] > +#if CAN_USE_LDRD > + ldrd r2, r3, [sp, #8] > + add sp, sp, #16 > +#else > + add sp, sp, #8 > + do_pop {r2, r3} > +#endif > + rsbs yyl, yyl, #0 > + sbc yyh, yyh, yyh, lsl #1 > + RET > > #endif /* L_aeabi_ldivmod */ > You use the LDRD vs do_pop sequence identically several times. To avoid a lot of ifdefs, it might be worth considering a macro for this idiom to reduce the overall amount of conditionalized code. > > > divbench3.c > > > > #include <stdint.h> > #include <stdio.h> > #include <unistd.h> > #include <sys/time.h> > > double tv_to_s(struct timeval tv) > { > return tv.tv_sec + ((double)tv.tv_usec)/1.0e6; > } > > #define STEP (0x7fffffffffff0000/100000000) > #define END (0x7fffffffffff0001-STEP) > #define START1 (37ll) > #define START2 (3ll) > > uint64_t __aeabi_uldivmod(uint64_t,uint64_t); > int64_t __aeabi_ldivmod(int64_t,int64_t); > > int main(int argc, char **argv) > { > double time1, time2, time3, time4; > struct timeval start, end; > > > volatile uint64_t dummy; > uint64_t i; > > volatile int64_t sdummy; > int64_t si; > > gettimeofday (&start, NULL); > for (i = START2; i < END; i += STEP) > { > dummy = __aeabi_uldivmod(END, i); > } > gettimeofday (&end, NULL); > time1 = tv_to_s (end) - tv_to_s (start); > > gettimeofday (&start, NULL); > for (i = START1; i < END; i += STEP * 5) > { > dummy = __aeabi_uldivmod(i, 373459); > } > gettimeofday (&end, NULL); > time2 = tv_to_s (end) - tv_to_s (start); > > gettimeofday (&start, NULL); > for (si = START2; si < END; si += STEP) > { > sdummy = __aeabi_ldivmod(END, si); > } > gettimeofday (&end, NULL); > time3 = tv_to_s (end) - tv_to_s (start); > > gettimeofday (&start, NULL); > for (si = START1; si < END; si += STEP * 5) > { > sdummy = __aeabi_ldivmod(si, 373459); > } > gettimeofday (&end, NULL); > time4 = tv_to_s (end) - tv_to_s (start); > > printf ("loop1 unsigned: %12f\n" > "loop2 unsigned: %12f\n" > "loop1 signed: %12f\n" > "loop2 signed: %12f\n", > time1, time2, time3, time4); > > return 0; > } >
From 975d9c624e77ee00476e6866250b0e2e31461fca Mon Sep 17 00:00:00 2001 From: Charles Baylis <charles.baylis@linaro.org> Date: Tue, 25 Feb 2014 16:27:59 +0000 Subject: [PATCH 2/2] Optimise __aeabi_ldivmod 2014-02-25 Charles Baylis <charles.baylis@linaro.org> * config/arm/bpabi.S (__aeabi_ldivmod): Perform signed division using unsigned division via call to __udivmoddi4 and additional logic. --- libgcc/config/arm/bpabi.S | 74 +++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 69 insertions(+), 5 deletions(-) diff --git a/libgcc/config/arm/bpabi.S b/libgcc/config/arm/bpabi.S index e020af5..8b75a28 100644 --- a/libgcc/config/arm/bpabi.S +++ b/libgcc/config/arm/bpabi.S @@ -136,20 +136,84 @@ ARM_FUNC_START aeabi_ldivmod cfi_start __aeabi_ldivmod, LSYM(Lend_aeabi_ldivmod) test_div_by_zero signed - sub sp, sp, #8 -#if defined(__thumb2__) - mov ip, sp - push {ip, lr} +#if defined(__thumb2__) && CAN_USE_LDRD + sub ip, sp, #8 + strd ip,lr, [sp, #-16]! #else + sub sp, sp, #8 do_push {sp, lr} #endif + cmp xxh, #0 + blt 1f + cmp yyh, #0 + blt 2f + +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 + bl SYM(__udivmoddi4) __PLT__ + ldr lr, [sp, #4] +#if CAN_USE_LDRD + ldrd r2, r3, [sp, #8] + add sp, sp, #16 +#else + add sp, sp, #8 + do_pop {r2, r3} +#endif + RET +1: /* xxh:xxl is negative */ + rsbs xxl, xxl, #0 + sbc xxh, xxh, xxh, lsl #1 + cmp yyh, #0 + blt 3f +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 + bl SYM(__udivmoddi4) __PLT__ + ldr lr, [sp, #4] +#if CAN_USE_LDRD + ldrd r2, r3, [sp, #8] + add sp, sp, #16 +#else + add sp, sp, #8 + do_pop {r2, r3} +#endif + rsbs xxl, xxl, #0 + sbc xxh, xxh, xxh, lsl #1 + rsbs yyl, yyl, #0 + sbc yyh, yyh, yyh, lsl #1 + RET + +2: /* only yyh:yyl is negative */ + rsbs yyl, yyl, #0 + sbc yyh, yyh, yyh, lsl #1 98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 - bl SYM(__gnu_ldivmod_helper) __PLT__ + bl SYM(__udivmoddi4) __PLT__ ldr lr, [sp, #4] +#if CAN_USE_LDRD + ldrd r2, r3, [sp, #8] + add sp, sp, #16 +#else add sp, sp, #8 do_pop {r2, r3} +#endif + rsbs xxl, xxl, #0 + sbc xxh, xxh, xxh, lsl #1 RET + +3: /* both xxh:xxl and yyh:yyl are negative */ + rsbs yyl, yyl, #0 + sbc yyh, yyh, yyh, lsl #1 cfi_end LSYM(Lend_aeabi_ldivmod) +98: cfi_push 98b - __aeabi_ldivmod, 0xe, -0xc, 0x10 + bl SYM(__udivmoddi4) __PLT__ + ldr lr, [sp, #4] +#if CAN_USE_LDRD + ldrd r2, r3, [sp, #8] + add sp, sp, #16 +#else + add sp, sp, #8 + do_pop {r2, r3} +#endif + rsbs yyl, yyl, #0 + sbc yyh, yyh, yyh, lsl #1 + RET #endif /* L_aeabi_ldivmod */ -- 1.8.3.2