uboot optimize memmove

Message ID	20130726065323.27333.82421.stgit@localhost.localdomain
State	New
Headers	show Return-Path: <patchwork-forward+bncBDV7F4FB6UKRB2NZZCHQKGQEVLHM54Q@linaro.org> Received-SPF: neutral (google.com: 209.85.212.46 is neither permitted nor denied by best guess record for domain of patch+caf_=patchwork-forward=linaro.org@linaro.org) client-ip=209.85.212.46; Received-SPF: neutral (google.com: 114.25.201.216 is neither permitted nor denied by best guess record for domain of andy.green@linaro.org) client-ip=114.25.201.216; Subject: [PATCH] uboot optimize memmove To: u-boot@lists.denx.de From: Andy Green <andy.green@linaro.org> Cc: linaro-dev@lists.linaro.org, patches@linaro.org Date: Fri, 26 Jul 2013 14:53:23 +0800 Message-ID: <20130726065323.27333.82421.stgit@localhost.localdomain> User-Agent: StGit/0.16 MIME-Version: 1.0 Precedence: list Mailing-list: list patchwork-forward@linaro.org; contact patchwork-forward+owners@linaro.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit

warmcat July 26, 2013, 6:53 a.m. UTC

While studying the reason why kernel copy from NOR was so slow on our platform,
I realized U-Boot is pulling it from 32-bit NOR in 8-bit chunks needlessly.

bootm uses memmove() and that just takes the approach by default to move u8s
around.

This optimization prefers memcpy() implementation (done mostly in 32-bit reads
and writes) if there's no overlap in source and dest, resulting in a huge
speedup on our platform (480ms copy from 32-bit NOR ---> 140ms)

Signed-off-by: Andy Green <andy.green@linaro.org>
---
 lib/string.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Wolfgang Denk July 26, 2013, 12:58 p.m. UTC | #1

Dear Andy Green,

In message <20130726065323.27333.82421.stgit@localhost.localdomain> you wrote:
> While studying the reason why kernel copy from NOR was so slow on our platform,
> I realized U-Boot is pulling it from 32-bit NOR in 8-bit chunks needlessly.
> 
> bootm uses memmove() and that just takes the approach by default to move u8s
> around.
> 
> This optimization prefers memcpy() implementation (done mostly in 32-bit reads
> and writes) if there's no overlap in source and dest, resulting in a huge
> speedup on our platform (480ms copy from 32-bit NOR ---> 140ms)

Sorry, but I dislike your patch.

Instead of making assumptions on the performance of memcpy() and
adding the overhead of an additional function call (which can be
expensive especially for short copy operations) it would make more
sense to pull the "copy a word at a time" code from memcpy() into
memmove(), too.

On the other hand - if you really care about performance, then why do
you not make sure that you provide optimized implementations for such
functions and consequently #define __HAVE_ARCH_MEMMOVE (and
__HAVE_ARCH_MEMCPY) ?

Best regards,

Wolfgang Denk

warmcat July 26, 2013, 1:42 p.m. UTC | #2

On 26 July 2013 20:58, Wolfgang Denk <wd@denx.de> wrote:
> Dear Andy Green,
>
> In message <20130726065323.27333.82421.stgit@localhost.localdomain> you wrote:
>> While studying the reason why kernel copy from NOR was so slow on our platform,
>> I realized U-Boot is pulling it from 32-bit NOR in 8-bit chunks needlessly.
>>
>> bootm uses memmove() and that just takes the approach by default to move u8s
>> around.
>>
>> This optimization prefers memcpy() implementation (done mostly in 32-bit reads
>> and writes) if there's no overlap in source and dest, resulting in a huge
>> speedup on our platform (480ms copy from 32-bit NOR ---> 140ms)
>
> Sorry, but I dislike your patch.

I see.

> Instead of making assumptions on the performance of memcpy() and

As I wrote, I measured the performance and got a very big gain, it's >
3x faster on my setup to use memcpy() then default memmove().

By calling that an "assumption" you're saying that there exist
platforms where 32-bit linear memmove() is slower than doing it with
8-bit actions?

> adding the overhead of an additional function call (which can be
> expensive especially for short copy operations) it would make more

I am not sure U-Boot is really in the business of doing small
memmoves, but okay...

> sense to pull the "copy a word at a time" code from memcpy() into
> memmove(), too.
>
> On the other hand - if you really care about performance, then why do

I spent several hours figuring out why our NOR boot performance was
terrible.... it's because this default memmove code is gloriously
inefficient for all cases.

If you like it like that, no worries.

> you not make sure that you provide optimized implementations for such
> functions and consequently #define __HAVE_ARCH_MEMMOVE (and
> __HAVE_ARCH_MEMCPY) ?

Yes I found these afterwards... performance is slightly better than
memcpy() thanks to Nicolas Pitre it seems.  The U-Boot config for the
platform we have didn't know about them, it's much better with them.

After I wrote this patch it was also pointed out by Will Newton at
Linaro that we have Neon accelerated memcpy lying around with BSD
license....

https://launchpad.net/cortex-strings

however for my purposes NOR boot is working good enough with the ARCH versions.

-Andy

> Best regards,
>
> Wolfgang Denk
>
> --
> DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
> It is dangerous to be sincere unless you are also stupid.
>                                                 - George Bernard Shaw

Wolfgang Denk July 26, 2013, 7:06 p.m. UTC | #3

Dear Andy Green,

In message <CAAfg0W7p6rvvaMqsKsC09yQnfPod08YDTbeh_MQrz6n+B4iyYA@mail.gmail.com> you wrote:
>
> > Instead of making assumptions on the performance of memcpy() and
> 
> As I wrote, I measured the performance and got a very big gain, it's 
> 3x faster on my setup to use memcpy() then default memmove().

Yes, in your single test case of copying a Linux kernel image,
i. e. a multi-megabyte file.

> By calling that an "assumption" you're saying that there exist
> platforms where 32-bit linear memmove() is slower than doing it with
> 8-bit actions?

No.  I said you should not assume that memcpy() is always faster than
memmove(); a system may use optiomized versions of either.

> > adding the overhead of an additional function call (which can be
> > expensive especially for short copy operations) it would make more
> 
> I am not sure U-Boot is really in the business of doing small
> memmoves, but okay...

It's easy to avoid this overhead, and also get rid of the
restrictions you built into it (otimizong only the non-overlapping
case), so if we touch that code, we should do it right.

> > sense to pull the "copy a word at a time" code from memcpy() into
> > memmove(), too.
> >
> > On the other hand - if you really care about performance, then why do
> 
> I spent several hours figuring out why our NOR boot performance was
> terrible.... it's because this default memmove code is gloriously
> inefficient for all cases.
> 
> If you like it like that, no worries.

Don't twist my words.  I asked for a different, better implementation,
that's all.

Best regards,

Wolfgang Denk

warmcat July 28, 2013, 11:39 p.m. UTC | #4

Dear Wolfgang Denk,

On 27 July 2013 03:06, Wolfgang Denk <wd@denx.de> wrote:
> Dear Andy Green,
>
> In message <CAAfg0W7p6rvvaMqsKsC09yQnfPod08YDTbeh_MQrz6n+B4iyYA@mail.gmail.com> you wrote:
>>
>> > Instead of making assumptions on the performance of memcpy() and
>>
>> As I wrote, I measured the performance and got a very big gain, it's
>> 3x faster on my setup to use memcpy() then default memmove().
>
> Yes, in your single test case of copying a Linux kernel image,
> i. e. a multi-megabyte file.

That "single test case" of copying a "multi-megabyte file" is the
reason d'etre of a bootloader...if it doesn't perform well doing that
then it's a problem.

However I agree the alternative arch-specific implementation for ARM
is pretty good, so this is kind of moot.

Since I traced my problem to here I "fixed" it there, but that actual
problem was were using the default implementations at all (the config
was inherited)..

>> By calling that an "assumption" you're saying that there exist
>> platforms where 32-bit linear memmove() is slower than doing it with
>> 8-bit actions?
>
> No.  I said you should not assume that memcpy() is always faster than
> memmove(); a system may use optiomized versions of either.

I did not assume that, I looked at your code for both and saw, and
proved, that using 32-bit operations for mass move actions is going to
perform better than using 8-bit operations.

That's not something you can write off as an "assumption", it's a fact.

>> > adding the overhead of an additional function call (which can be
>> > expensive especially for short copy operations) it would make more
>>
>> I am not sure U-Boot is really in the business of doing small
>> memmoves, but okay...
>
> It's easy to avoid this overhead, and also get rid of the
> restrictions you built into it (otimizong only the non-overlapping
> case), so if we touch that code, we should do it right.

Given that code should perferably never be used, maybe it should print
a warning like "Using default memory ops" and leave it like it is.
The problem is not correctness just inefficiency.

>> > sense to pull the "copy a word at a time" code from memcpy() into
>> > memmove(), too.
>> >
>> > On the other hand - if you really care about performance, then why do
>>
>> I spent several hours figuring out why our NOR boot performance was
>> terrible.... it's because this default memmove code is gloriously
>> inefficient for all cases.
>>
>> If you like it like that, no worries.
>
> Don't twist my words.  I asked for a different, better implementation,
> that's all.

Unfortunately I'm only looking at it because it made trouble, and we
now no longer use that code.

For the use-case I'm studying (very fast overall boot) it still may
make sense to implement the NEON stuff in which case I'll offer a
patch for that.

-Andy

>
> Best regards,
>
> Wolfgang Denk
>
> --
> DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
> Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
> The only time the world beats a path to your door is when you are  in
> the bathroom.

Wolfgang Denk July 29, 2013, 8:03 a.m. UTC | #5

Dear Andy Green,

In message <CAAfg0W4wXdRafRcHvrNVNnkCGd3EvJfNujuzUsjQXgW=WmCU+w@mail.gmail.com> you wrote:
>
> Given that code should perferably never be used, maybe it should print
> a warning like "Using default memory ops" and leave it like it is.
> The problem is not correctness just inefficiency.

Agreed. You found a problem, and a way to improve it.  All this is
perfectly fine.  I just didn't like your implementation, and rather
suggest to copy the 32-bit-copy loop as used in the current memcpy()
code and use it also in memmove().

I'm looking forward to seeing your patch v2 with such an improved
implementation.  Thanks in advance.

Best regards,

Wolfgang Denk

Dirk Behme July 29, 2013, 8:28 a.m. UTC | #6

On 26.07.2013 15:42, Andy Green wrote:
> On 26 July 2013 20:58, Wolfgang Denk <wd@denx.de> wrote:
...
>> you not make sure that you provide optimized implementations for such
>> functions and consequently #define __HAVE_ARCH_MEMMOVE (and
>> __HAVE_ARCH_MEMCPY) ?
>
> Yes I found these afterwards... performance is slightly better than
> memcpy() thanks to Nicolas Pitre it seems.  The U-Boot config for the
> platform we have didn't know about them, it's much better with them.
>
> After I wrote this patch it was also pointed out by Will Newton at
> Linaro that we have Neon accelerated memcpy lying around with BSD
> license....
>
> https://launchpad.net/cortex-strings
>
> however for my purposes NOR boot is working good enough with the ARCH versions.

I've had a look to

https://launchpad.net/cortex-strings

and there downloaded

https://launchpad.net/cortex-strings/trunk/2013.01/+download/cortex-strings-1.0-2013.01.tar.bz2

(from the green download button on the right side).

The README mentions

src/neon  contains NEON based routines for AArch32.

but the cortex-strings-1.0-2013.01.tar.bz2 seems to contain an empty 
src/neon directory. Is this intended?

Best regards

Dirk

warmcat July 29, 2013, 8:44 a.m. UTC | #7

Looping Will...

On 29 July 2013 16:28, Dirk Behme <dirk.behme@de.bosch.com> wrote:
> On 26.07.2013 15:42, Andy Green wrote:
>>
>> On 26 July 2013 20:58, Wolfgang Denk <wd@denx.de> wrote:
>
> ...
>>>
>>> you not make sure that you provide optimized implementations for such
>>> functions and consequently #define __HAVE_ARCH_MEMMOVE (and
>>> __HAVE_ARCH_MEMCPY) ?
>>
>>
>> Yes I found these afterwards... performance is slightly better than
>> memcpy() thanks to Nicolas Pitre it seems.  The U-Boot config for the
>> platform we have didn't know about them, it's much better with them.
>>
>> After I wrote this patch it was also pointed out by Will Newton at
>> Linaro that we have Neon accelerated memcpy lying around with BSD
>> license....
>>
>> https://launchpad.net/cortex-strings
>>
>> however for my purposes NOR boot is working good enough with the ARCH
>> versions.
>
>
> I've had a look to
>
> https://launchpad.net/cortex-strings
>
> and there downloaded
>
> https://launchpad.net/cortex-strings/trunk/2013.01/+download/cortex-strings-1.0-2013.01.tar.bz2
>
> (from the green download button on the right side).
>
> The README mentions
>
> src/neon  contains NEON based routines for AArch32.
>
> but the cortex-strings-1.0-2013.01.tar.bz2 seems to contain an empty
> src/neon directory. Is this intended?

I don't know... I agree that dir is empty... Will, do you know the
story?  I found nice-looking ones in reference/newlib but I don't
think they're NEON.

-Andy

> Best regards
>
> Dirk
>
>

Will Newton July 29, 2013, 10:34 a.m. UTC | #8

On 29 July 2013 09:44, Andy Green <andy.green@linaro.org> wrote:
> Looping Will...
>
> On 29 July 2013 16:28, Dirk Behme <dirk.behme@de.bosch.com> wrote:
>> On 26.07.2013 15:42, Andy Green wrote:
>>>
>>> On 26 July 2013 20:58, Wolfgang Denk <wd@denx.de> wrote:
>>
>> ...
>>>>
>>>> you not make sure that you provide optimized implementations for such
>>>> functions and consequently #define __HAVE_ARCH_MEMMOVE (and
>>>> __HAVE_ARCH_MEMCPY) ?
>>>
>>>
>>> Yes I found these afterwards... performance is slightly better than
>>> memcpy() thanks to Nicolas Pitre it seems.  The U-Boot config for the
>>> platform we have didn't know about them, it's much better with them.
>>>
>>> After I wrote this patch it was also pointed out by Will Newton at
>>> Linaro that we have Neon accelerated memcpy lying around with BSD
>>> license....
>>>
>>> https://launchpad.net/cortex-strings
>>>
>>> however for my purposes NOR boot is working good enough with the ARCH
>>> versions.
>>
>>
>> I've had a look to
>>
>> https://launchpad.net/cortex-strings
>>
>> and there downloaded
>>
>> https://launchpad.net/cortex-strings/trunk/2013.01/+download/cortex-strings-1.0-2013.01.tar.bz2
>>
>> (from the green download button on the right side).
>>
>> The README mentions
>>
>> src/neon  contains NEON based routines for AArch32.
>>
>> but the cortex-strings-1.0-2013.01.tar.bz2 seems to contain an empty
>> src/neon directory. Is this intended?
>
> I don't know... I agree that dir is empty... Will, do you know the
> story?  I found nice-looking ones in reference/newlib but I don't
> think they're NEON.

The fast memcpy code is in src/linaro-a9/memcpy.S. The directory
structure pre-dates my involvement with the project so I cannot
explain it completely. ;-)

--
Will Newton
Toolchain Working Group, Linaro

uboot optimize memmove

Commit Message

Comments

Patch