Message ID | 20111219172122.GA10120@davesworkthinkpad |
---|---|
State | New |
Headers | show |
On 12/19/2011 09:21 AM, Dr. David Alan Gilbert wrote: > + @ r1 currently points to the 2nd byte of the word containing the 0 > + tst r2, # CHARTSTMASK(0) @ 1st character > + bne 10f > + adds r1,r1,#1 > + tst r2, # CHARTSTMASK(1) @ 2nd character > + ittt eq > + addeq r1,r1,#1 > + tsteq r2, # (3<<15) @ 2nd & 3rd character > + @ If not the 3rd must be the last one > + addeq r1,r1,#1 No need to search -- clz can do that for you. #ifdef __ARMEL__ rbit r2, r2 #endif clz r2, r2 lsrs r2, r2, #3 adds r1, r1, r2 r~
On Mon, 19 Dec 2011, Dr. David Alan Gilbert wrote: > +@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than > +@ the current version in eglibc. > +@ While I have a version that does 8 bytes/loop and is a lot faster on long > +@ strings, it is slower on short strings, and short strings seem more common > +@ in strchr usage. That sounds like a possible case for a hybrid function, with an unrolled initial part testing some number of characters to cover short strings (it might be possible to get things aligned at the same time) and the more complicated version for longer strings. What's the actual size distribution you see in strchr use?
On 20 December 2011 23:07, Joseph S. Myers <joseph@codesourcery.com> wrote: > On Mon, 19 Dec 2011, Dr. David Alan Gilbert wrote: > >> +@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than >> +@ the current version in eglibc. >> +@ While I have a version that does 8 bytes/loop and is a lot faster on long >> +@ strings, it is slower on short strings, and short strings seem more common >> +@ in strchr usage. > > That sounds like a possible case for a hybrid function, with an unrolled > initial part testing some number of characters to cover short strings (it > might be possible to get things aligned at the same time) and the more > complicated version for longer strings. Of course the difficulty with strchr (compared with memchr) is that you have no length parameter to hint at how much is left, and thus whether it's worth making that switch; so it has to be a heuristic and it's going to cost you something in the small case. > What's the actual size distribution you see in strchr use? It varies heavily by program. I only found one of the SPEC benchmarks using it to a measurable amount (i.e. showed up in profile), and that was mostly in 24byte strings with the match at random positions within the string. From some embedded benchmarks I found that almost all the uses of strchr were calls with less than 8 byte strings (mostly unexpected fallout from within libc rather than the meat of the benchmark). One of the things people commonly seem to do with strchr is see whether a character is in a set, and run that strchr along a string - e.g. something like char *mystr=.... char tmp; for(tmp=*mystr;tmp;mystr++) { tmp=*(mystr++); if (strchr(" \t\n\r", tmp)!=NULL) { do something } } With the string being searched depressingly small. There are some places where you might hope the compiler could be smart and do something with that, although often the string being searched is actually a variable (think $IFS in shell). (I fancy something like a strchr_short with the compiler calling that when it knows the string is less than some length - but how do you define that length between the compiler and library?) The other thing I found in some examples was splitting env strings; so searching for the '=' in NAME=VALUE, the NAME parts tend not to be particularly long. The worst case tends to be where you're using strchr on a long string and you don't actually find the match. A ltrace of gcc's cc1 shows a few of the cases - e.g. Mostly short identifiers: strchr("nothrow", ' ') = NULL strchr("final", ' ') = NULL strchr("dfinish", ' ') = NULL This is a case that can get long - finding '.' in filenames; this can be one where you get longer strings without a match. strchr("/usr/local/include", '.') = NULL strchr("/usr/local/include", '.') = NULL strchr("/usr/lib/arm-linux-gnueabi/gcc/arm-linux-gnueabi/4.5.2/include", '.') = ".5.2/include" I suspect something like parsing a format string: strchr("-+ #0", 's') = NULL Splitting assignments: strchr("__UINT16_C(c)=c", '=') = "=c" strchr("__UINT_LEAST32_MAX__=4294967295U", '=') = "=4294967295U" Here's a profile graph of different strlen's on an ARM: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr?action=AttachFile&do=get&target=panda-01-strchr-git44154ec-strchr-abs.png That 'simple' one is showing the benefit at the short lengths, the 'smarter' one I have is doing 8 bytes/loop and is nice on the long strings - but as you can see worse at the short ones. Dave
On 20 December 2011 20:29, Richard Henderson <rth@twiddle.net> wrote: > On 12/19/2011 09:21 AM, Dr. David Alan Gilbert wrote: >> + @ r1 currently points to the 2nd byte of the word containing the 0 >> + tst r2, # CHARTSTMASK(0) @ 1st character >> + bne 10f >> + adds r1,r1,#1 >> + tst r2, # CHARTSTMASK(1) @ 2nd character >> + ittt eq >> + addeq r1,r1,#1 >> + tsteq r2, # (3<<15) @ 2nd & 3rd character >> + @ If not the 3rd must be the last one >> + addeq r1,r1,#1 > > No need to search -- clz can do that for you. > > #ifdef __ARMEL__ > rbit r2, r2 > #endif > clz r2, r2 > lsrs r2, r2, #3 > adds r1, r1, r2 Hi Richard, Thanks for the suggestion - I'd seen that form in some code from another library and decided because of that not to roll it into my code; however since there have now been 3 separate people who've suggested it to me independently I don't see why not! Thanks, Dave
On 12/21/2011 02:55 AM, David Gilbert wrote: > That 'simple' one is showing the benefit at the short lengths, > the 'smarter' one I have is doing 8 bytes/loop and is nice on the long > strings - but as you can see worse at the short ones. Having not seen your "smarter" strchr, it's hard to suggest anything concrete. I'd have thought that there's enough slack in load delay that one or two arithmetic operations could be done without penalty... Something like performing a simple compare loop looking for "alignment plus": ... bic r3, r0, #7 and r1, r1, #255 adds r3, r3, #32 1: ldrb r2, [r0],#1 cmp r2, r1 cbz r2, .Lfound_zero it ne cmpne r0, r3 bne 1b cmp r2, r1 beq .Lfound @ Here, r0 is aligned. Do something word-based. ... or even just and r3, r0, #7 and r1, r1, #255 rsb r3, r3, #32 1: ldrb r2, [r0],#1 cmp r2, r1 beq .Lfound subs r3, r3, #1 cbz r2, .Lfound_zero bne 1b @ Here, r0 is aligned. Do something word-based. r~
diff -urN ports/sysdeps/arm/eabi/armv6t2/strchr.S src/ports/sysdeps/arm/eabi/armv6t2/strchr.S --- ports/sysdeps/arm/eabi/armv6t2/strchr.S 1970-01-01 01:00:00.000000000 +0100 +++ ports/sysdeps/arm/eabi/armv6t2/strchr.S 2011-12-16 13:43:56.704694919 +0000 @@ -0,0 +1,71 @@ +/* Copyright (C) 2011 Free Software Foundation, Inc. + This file is part of the GNU C Library. + Code contributed by Dave Gilbert <david.gilbert@linaro.org> + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA + 02111-1307 USA. */ + +#include <sysdep.h> + +@ A very simple strchr routine, from benchmarks on A9 it's a bit faster than +@ the current version in eglibc. +@ While I have a version that does 8 bytes/loop and is a lot faster on long +@ strings, it is slower on short strings, and short strings seem more common +@ in strchr usage. +@ Note: The use of cbz/cbnz means it's Thumb only + +@ 2011-02-07 david.gilbert@linaro.org +@ Extracted from local git a5b438d861 +@ 2011-12-16 david.gilbert@linaro.org +@ Copy from Cortex strings rev 65 and change license + + .syntax unified + + .text + .thumb + +@ --------------------------------------------------------------------------- + + .thumb_func + .global strchr + .type strchr,%function +ENTRY(strchr) + @ r0 = start of string + @ r1 = character to match + @ returns NULL for no match, or a pointer to the match + and r1,r1, #255 + +1: + ldrb r2,[r0],#1 + cmp r2,r1 + cbz r2,10f + bne 1b + + @ We're here if it matched +5: + subs r0,r0,#1 + DO_RET(lr) + +10: + @ We're here if we ran off the end + cmp r1, #0 @ Corner case - you can search for the nil and get a pointer to it + beq 5b @ messy, if common we should branch at the start to a special loop + mov r0,#0 + DO_RET(lr) + +END(strchr) + +weak_alias (strchr, index) +libc_hidden_builtin_def(strchr) diff -urN ports/sysdeps/arm/eabi/armv6t2/strlen.S src/ports/sysdeps/arm/eabi/armv6t2/strlen.S --- ports/sysdeps/arm/eabi/armv6t2/strlen.S 1970-01-01 01:00:00.000000000 +0100 +++ ports/sysdeps/arm/eabi/armv6t2/strlen.S 2011-12-16 13:43:01.991130183 +0000 @@ -0,0 +1,118 @@ +/* Copyright (C) 2011 Free Software Foundation, Inc. + This file is part of the GNU C Library. + Code contributed by Dave Gilbert <david.gilbert@linaro.org> + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, write to the Free + Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA + 02111-1307 USA. */ + +#include <sysdep.h> + +@ This strlen routine is optimised on a Cortex-A9 and should work on +@ all ARMv7 processors. This routine is reasonably fast for short +@ strings, but is probably slower than a simple implementation if all +@ your strings are very short +@ Note: The use of cbz/cbnz means it's Thumb only + +@ 2011-02-08 david.gilbert@linaro.org +@ Extracted from local git 6848613a +@ 2011-12-16 david.gilbert@linaro.org +@ Copy from Cortex strings rev 65 and change license +@ Add cfi magic, switch to ldrd + + +@ this lets us check a flag in a 00/ff byte easily in either endianness +#ifdef __ARMEB__ +#define CHARTSTMASK(c) 1<<(31-(c*8)) +#else +#define CHARTSTMASK(c) 1<<(c*8) +#endif + +@----------------------------------------------------------------------------- + .syntax unified + + .text + .thumb + + .thumb_func + .global strlen + .type strlen,%function +ENTRY(strlen) + @ r0 = string + @ returns count of bytes in string not including terminator + mov r1, r0 + push { r4,r6 } + cfi_adjust_cfa_offset (8) + cfi_rel_offset (r4, 0) + cfi_rel_offset (r6, 4) + + cfi_remember_state + + mvns r6, #0 @ all F + movs r4, #0 + tst r0, #7 + beq 2f + +1: + ldrb r2, [r1], #1 + tst r1, #7 @ Hit alignment yet? + cbz r2, 10f @ Exit if we found the 0 + bne 1b + + @ So we're now aligned +2: + ldrd r2,r3,[r1],#8 + uadd8 r2, r2, r6 @ Parallel add 0xff - sets the GE bits for anything that wasn't 0 + sel r2, r4, r6 @ bytes are 00 for none-00 bytes, or ff for 00 bytes - NOTE INVERSION + uadd8 r3, r3, r6 @ Parallel add 0xff - sets the GE bits for anything that wasn't 0 + sel r3, r2, r6 @ bytes are 00 for none-00 bytes, or ff for 00 bytes - NOTE INVERSION + cmp r3, #0 + beq 2b + +strlenendtmp: + @ One (or more) of the bytes we loaded was 0 - but which one? + @ r2 has the mask corresponding to the first loaded word + @ r3 has a combined mask of the two words - but if r2 was all-non 0 + @ then it's just the 2nd words + cmp r2, #0 + itte eq + moveq r2, r3 @ the end is in the 2nd word + subeq r1,r1,#3 + subne r1,r1,#7 + + @ r1 currently points to the 2nd byte of the word containing the 0 + tst r2, # CHARTSTMASK(0) @ 1st character + bne 10f + adds r1,r1,#1 + tst r2, # CHARTSTMASK(1) @ 2nd character + ittt eq + addeq r1,r1,#1 + tsteq r2, # (3<<15) @ 2nd & 3rd character + @ If not the 3rd must be the last one + addeq r1,r1,#1 + +10: + @ r0 is still at the beginning, r1 is pointing 1 byte after terminator + sub r0, r1, r0 + subs r0, r0, #1 + pop { r4, r6 } + + cfi_adjust_cfa_offset (-8) + cfi_restore (r4) + cfi_restore (r6) + + DO_RET(lr) + +END(strlen) +libc_hidden_builtin_def (strlen)