[3/5] perf core: Prepare writing into ring buffer from end

Message ID	56F52E83.70409@huawei.com
State	New
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Subject: Re: [PATCH 3/5] perf core: Prepare writing into ring buffer from end To: Peter Zijlstra <peterz@infradead.org> References: <1457949585-191064-1-git-send-email-wangnan0@huawei.com> <1457949585-191064-4-git-send-email-wangnan0@huawei.com> <20160323095007.GW6344@twins.programming.kicks-ass.net> CC: <mingo@redhat.com>, <linux-kernel@vger.kernel.org>, He Kuang <hekuang@huawei.com>, Alexei Starovoitov <ast@kernel.org>, "Arnaldo Carvalho de Melo" <acme@redhat.com>, Brendan Gregg <brendan.d.gregg@gmail.com>, "Jiri Olsa" <jolsa@kernel.org>, Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>, Namhyung Kim <namhyung@kernel.org>, Zefan Li <lizefan@huawei.com>, <pi3orama@163.com> From: "Wangnan (F)" <wangnan0@huawei.com> Message-ID: <56F52E83.70409@huawei.com> Date: Fri, 25 Mar 2016 20:26:43 +0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 MIME-Version: 1.0 In-Reply-To: <20160323095007.GW6344@twins.programming.kicks-ass.net> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk

Wang Nan March 25, 2016, 12:26 p.m. UTC

On 2016/3/23 17:50, Peter Zijlstra wrote:
> On Mon, Mar 14, 2016 at 09:59:43AM +0000, Wang Nan wrote:

>> Convert perf_output_begin to __perf_output_begin and make the later

>> function able to write records from the end of the ring buffer.

>> Following commits will utilize the 'backward' flag.

>>

>> This patch doesn't introduce any extra performance overhead since we

>> use always_inline.

> So while I agree that with __always_inline and constant propagation we

> _should_ end up with the same code, we have:

>

> $ size defconfig-build/kernel/events/ring_buffer.o.{pre,post}

>     text    data     bss     dec     hex filename

>     3785       2       0    3787     ecb defconfig-build/kernel/events/ring_buffer.o.pre

>     3673       2       0    3675     e5b defconfig-build/kernel/events/ring_buffer.o.post

>

> The patch actually makes the file shrink.

>

> So I think we still want to have some actual performance numbers.


In my environment the two objects are nearly idential:


$ objdump -d kernel/events/ring_buffer.o.new  > ./out.new.S
$ objdump -d kernel/events/ring_buffer.o.old  > ./out.old.S

   4a2:    e8 00 00 00 00           callq  4a7 <perf_output_begin+0x1a7>
@@ -874,7 +874,7 @@
   c39:    eb e7                    jmp    c22 <perf_aux_output_begin+0x172>
   c3b:    80 3d 00 00 00 00 00     cmpb   $0x0,0x0(%rip)        # c42 
<perf_aux_output_begin+0x192>
   c42:    75 93                    jne    bd7 <perf_aux_output_begin+0x127>
- c44:    be 2b 01 00 00           mov    $0x12b,%esi
+ c44:    be 49 01 00 00           mov    $0x149,%esi
   c49:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi
   c50:    e8 00 00 00 00           callq  c55 <perf_aux_output_begin+0x1a5>
   c55:    c6 05 00 00 00 00 01     movb   $0x1,0x0(%rip)        # c5c 
<perf_aux_output_begin+0x1ac>


I think you enabled some unusual config options?

Thank you.

Wang Nan March 25, 2016, 12:36 p.m. UTC | #1

On 2016/3/25 20:26, Wangnan (F) wrote:
>

>

> On 2016/3/23 17:50, Peter Zijlstra wrote:

>> On Mon, Mar 14, 2016 at 09:59:43AM +0000, Wang Nan wrote:

>>> Convert perf_output_begin to __perf_output_begin and make the later

>>> function able to write records from the end of the ring buffer.

>>> Following commits will utilize the 'backward' flag.

>>>

>>> This patch doesn't introduce any extra performance overhead since we

>>> use always_inline.

>> So while I agree that with __always_inline and constant propagation we

>> _should_ end up with the same code, we have:

>>

>> $ size defconfig-build/kernel/events/ring_buffer.o.{pre,post}

>>     text    data     bss     dec     hex filename

>>     3785       2       0    3787     ecb 

>> defconfig-build/kernel/events/ring_buffer.o.pre

>>     3673       2       0    3675     e5b 

>> defconfig-build/kernel/events/ring_buffer.o.post

>>

>> The patch actually makes the file shrink.

>>

>> So I think we still want to have some actual performance numbers.

>

> In my environment the two objects are nearly idential:

>

>

> $ objdump -d kernel/events/ring_buffer.o.new  > ./out.new.S

> $ objdump -d kernel/events/ring_buffer.o.old  > ./out.old.S

>

> --- ./out.old.S    2016-03-25 12:18:52.060656423 +0000

> +++ ./out.new.S    2016-03-25 12:18:45.376630269 +0000

> @@ -1,5 +1,5 @@

>

> -kernel/events/ring_buffer.o.old:     file format elf64-x86-64

> +kernel/events/ring_buffer.o.new:     file format elf64-x86-64

>

>

>  Disassembly of section .text:

> @@ -320,7 +320,7 @@

>   402:    4d 8d 04 0f              lea    (%r15,%rcx,1),%r8

>   406:    48 89 c8                 mov    %rcx,%rax

>   409:    4c 0f b1 43 40           cmpxchg %r8,0x40(%rbx)

> - 40e:    48 39 c8                 cmp    %rcx,%rax

> + 40e:    48 39 c1                 cmp    %rax,%rcx

>   411:    75 b4                    jne    3c7 <perf_output_begin+0xc7>

>   413:    48 8b 73 58              mov    0x58(%rbx),%rsi

>   417:    48 8b 43 68              mov    0x68(%rbx),%rax

> @@ -357,7 +357,7 @@

>   480:    85 c0                    test   %eax,%eax

>   482:    0f 85 02 ff ff ff        jne    38a <perf_output_begin+0x8a>

>   488:    48 c7 c2 00 00 00 00     mov    $0x0,%rdx

> - 48f:    be 7c 00 00 00           mov    $0x7c,%esi

> + 48f:    be 89 00 00 00           mov    $0x89,%esi

>   494:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi

>   49b:    c6 05 00 00 00 00 01     movb   $0x1,0x0(%rip)        # 4a2 

> <perf_output_begin+0x1a2>

>   4a2:    e8 00 00 00 00           callq  4a7 <perf_output_begin+0x1a7>

> @@ -874,7 +874,7 @@

>   c39:    eb e7                    jmp    c22 

> <perf_aux_output_begin+0x172>

>   c3b:    80 3d 00 00 00 00 00     cmpb   $0x0,0x0(%rip)        # c42 

> <perf_aux_output_begin+0x192>

>   c42:    75 93                    jne    bd7 

> <perf_aux_output_begin+0x127>

> - c44:    be 2b 01 00 00           mov    $0x12b,%esi

> + c44:    be 49 01 00 00           mov    $0x149,%esi

>   c49:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi

>   c50:    e8 00 00 00 00           callq  c55 

> <perf_aux_output_begin+0x1a5>

>   c55:    c6 05 00 00 00 00 01     movb   $0x1,0x0(%rip)        # c5c 

> <perf_aux_output_begin+0x1ac>

>

>

> I think you enabled some unusual config options?

>


You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result:

$ size kernel/events/ring_buffer.o*
    text       data        bss        dec        hex    filename
    4545          4          8       4557       11cd 
kernel/events/ring_buffer.o.new
    4641          4          8       4653       122d 
kernel/events/ring_buffer.o.old

Thank you.

Wang Nan March 25, 2016, 2:14 p.m. UTC | #2

On 2016/3/25 20:36, Wangnan (F) wrote:
>

>

> On 2016/3/25 20:26, Wangnan (F) wrote:

>>

>>

>> On 2016/3/23 17:50, Peter Zijlstra wrote:

>>> On Mon, Mar 14, 2016 at 09:59:43AM +0000, Wang Nan wrote:

>>>> Convert perf_output_begin to __perf_output_begin and make the later

>>>> function able to write records from the end of the ring buffer.

>>>> Following commits will utilize the 'backward' flag.

>>>>

>>>> This patch doesn't introduce any extra performance overhead since we

>>>> use always_inline.

>>> So while I agree that with __always_inline and constant propagation we

>>> _should_ end up with the same code, we have:

>>>

>>> $ size defconfig-build/kernel/events/ring_buffer.o.{pre,post}

>>>     text    data     bss     dec     hex filename

>>>     3785       2       0    3787     ecb 

>>> defconfig-build/kernel/events/ring_buffer.o.pre

>>>     3673       2       0    3675     e5b 

>>> defconfig-build/kernel/events/ring_buffer.o.post

>>>

>>> The patch actually makes the file shrink.

>>>

>>> So I think we still want to have some actual performance numbers.

>>

>> In my environment the two objects are nearly idential:

>>

>>

>> $ objdump -d kernel/events/ring_buffer.o.new  > ./out.new.S

>> $ objdump -d kernel/events/ring_buffer.o.old  > ./out.old.S

>>

>> --- ./out.old.S    2016-03-25 12:18:52.060656423 +0000

>> +++ ./out.new.S    2016-03-25 12:18:45.376630269 +0000

>> @@ -1,5 +1,5 @@

>>

>> -kernel/events/ring_buffer.o.old:     file format elf64-x86-64

>> +kernel/events/ring_buffer.o.new:     file format elf64-x86-64

>>

>>

>>  Disassembly of section .text:

>> @@ -320,7 +320,7 @@

>>   402:    4d 8d 04 0f              lea    (%r15,%rcx,1),%r8

>>   406:    48 89 c8                 mov    %rcx,%rax

>>   409:    4c 0f b1 43 40           cmpxchg %r8,0x40(%rbx)

>> - 40e:    48 39 c8                 cmp    %rcx,%rax

>> + 40e:    48 39 c1                 cmp    %rax,%rcx

>>   411:    75 b4                    jne    3c7 <perf_output_begin+0xc7>

>>   413:    48 8b 73 58              mov    0x58(%rbx),%rsi

>>   417:    48 8b 43 68              mov    0x68(%rbx),%rax

>> @@ -357,7 +357,7 @@

>>   480:    85 c0                    test   %eax,%eax

>>   482:    0f 85 02 ff ff ff        jne    38a <perf_output_begin+0x8a>

>>   488:    48 c7 c2 00 00 00 00     mov    $0x0,%rdx

>> - 48f:    be 7c 00 00 00           mov    $0x7c,%esi

>> + 48f:    be 89 00 00 00           mov    $0x89,%esi

>>   494:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi

>>   49b:    c6 05 00 00 00 00 01     movb   $0x1,0x0(%rip) # 4a2 

>> <perf_output_begin+0x1a2>

>>   4a2:    e8 00 00 00 00           callq  4a7 <perf_output_begin+0x1a7>

>> @@ -874,7 +874,7 @@

>>   c39:    eb e7                    jmp    c22 

>> <perf_aux_output_begin+0x172>

>>   c3b:    80 3d 00 00 00 00 00     cmpb   $0x0,0x0(%rip) # c42 

>> <perf_aux_output_begin+0x192>

>>   c42:    75 93                    jne    bd7 

>> <perf_aux_output_begin+0x127>

>> - c44:    be 2b 01 00 00           mov    $0x12b,%esi

>> + c44:    be 49 01 00 00           mov    $0x149,%esi

>>   c49:    48 c7 c7 00 00 00 00     mov    $0x0,%rdi

>>   c50:    e8 00 00 00 00           callq  c55 

>> <perf_aux_output_begin+0x1a5>

>>   c55:    c6 05 00 00 00 00 01     movb   $0x1,0x0(%rip) # c5c 

>> <perf_aux_output_begin+0x1ac>

>>

>>

>> I think you enabled some unusual config options?

>>

>

> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result:

>

> $ size kernel/events/ring_buffer.o*

>    text       data        bss        dec        hex    filename

>    4545          4          8       4557       11cd 

> kernel/events/ring_buffer.o.new

>    4641          4          8       4653       122d 

> kernel/events/ring_buffer.o.old

>

> Thank you.

>


After enabling CONFIG_OPTIMIZE_INLINING:

Test its performance by calling 'close(-1)' for 3000000 times and
use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to
capture system calls:

                   MEAN         STDVAR
BASE            800077.1     23448.13
RAWPERF.PRE    2465858.0    603473.70
RAWPERF.POST   2471925.0    609437.60

Considering the high stdvar, after applying this patch the performance
is not change.

I'll put this number in my commit message.

Thank you.

Wang Nan March 28, 2016, 1:07 a.m. UTC | #3

On 2016/3/27 23:30, pi3orama wrote:
>

> 发自我的 iPhone

>

>> 在 2016年3月27日，下午11:20，Peter Zijlstra <peterz@infradead.org> 写道：

>>

>> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote:

>>>>> I think you enabled some unusual config options?

>> x86_64-defconfig

>>

>>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result:

>> It has that indeed.

>>

>>> After enabling CONFIG_OPTIMIZE_INLINING:

>>>

>>> Test its performance by calling 'close(-1)' for 3000000 times and

>>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to

>>> capture system calls:

>>>

>>>                   MEAN         STDVAR

>>> BASE            800077.1     23448.13

>>> RAWPERF.PRE    2465858.0    603473.70

>>> RAWPERF.POST   2471925.0    609437.60

>>>

>>> Considering the high stdvar, after applying this patch the performance

>>> is not change.

>> Why is your variance so immense? And doesn't that render the

>> measurements pointless?

>>

> For some unknown reason, about

> 10% of these results raises 2 times of normal

> results. Say, "normal results" are about

> 2200000, but those "outliers" are about

> 4400000 (I can't access raw data now).

> Variance becomes much smaller if I remove

> those outliers.


After manually removing outliners (remove 10 outliners from 100 raw
data points in each data set):


                  MEAN         STDVAR
BASE            800077.1     23448.13
RAWPERF.PRE    2265741.0     10421.35
RAWPERF.POST   2269826.0     10507.45

Thank you.

> I guess the outliers is caused by some type

> of lock stepping? No clue about it.

>

> Thank you.

>

Wang Nan March 28, 2016, 1:58 a.m. UTC | #4

On 2016/3/28 9:07, Wangnan (F) wrote:
>

>

> On 2016/3/27 23:30, pi3orama wrote:

>>

>> 发自我的 iPhone

>>

>>> 在 2016年3月27日，下午11:20，Peter Zijlstra <peterz@infradead.org> 

>>> 写道：

>>>

>>> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote:

>>>>>> I think you enabled some unusual config options?

>>> x86_64-defconfig

>>>

>>>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result:

>>> It has that indeed.

>>>

>>>> After enabling CONFIG_OPTIMIZE_INLINING:

>>>>

>>>> Test its performance by calling 'close(-1)' for 3000000 times and

>>>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to

>>>> capture system calls:

>>>>

>>>>                   MEAN         STDVAR

>>>> BASE            800077.1     23448.13

>>>> RAWPERF.PRE    2465858.0    603473.70

>>>> RAWPERF.POST   2471925.0    609437.60

>>>>

>>>> Considering the high stdvar, after applying this patch the performance

>>>> is not change.

>>> Why is your variance so immense? And doesn't that render the

>>> measurements pointless?

>>>

>> For some unknown reason, about

>> 10% of these results raises 2 times of normal

>> results. Say, "normal results" are about

>> 2200000, but those "outliers" are about

>> 4400000 (I can't access raw data now).

>> Variance becomes much smaller if I remove

>> those outliers.

>


Find the reason of these outliners.

If perf and 'test-ring-buffer' are scheduled on different processors,
the performance is bad. I think cache is the main reason.

I will redo the test, bind them to cores on same CPU.

Thank you.

Wang Nan March 28, 2016, 2:58 a.m. UTC | #5

On 2016/3/28 9:58, Wangnan (F) wrote:
>

>

> On 2016/3/28 9:07, Wangnan (F) wrote:

>>

>>

>> On 2016/3/27 23:30, pi3orama wrote:

>>>

>>> 发自我的 iPhone

>>>

>>>> 在 2016年3月27日，下午11:20，Peter Zijlstra <peterz@infradead.org> 

>>>> 写道：

>>>>

>>>> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote:

>>>>>>> I think you enabled some unusual config options?

>>>> x86_64-defconfig

>>>>

>>>>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result:

>>>> It has that indeed.

>>>>

>>>>> After enabling CONFIG_OPTIMIZE_INLINING:

>>>>>

>>>>> Test its performance by calling 'close(-1)' for 3000000 times and

>>>>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to

>>>>> capture system calls:

>>>>>

>>>>>                   MEAN         STDVAR

>>>>> BASE            800077.1     23448.13

>>>>> RAWPERF.PRE    2465858.0    603473.70

>>>>> RAWPERF.POST   2471925.0    609437.60

>>>>>

>>>>> Considering the high stdvar, after applying this patch the 

>>>>> performance

>>>>> is not change.

>>>> Why is your variance so immense? And doesn't that render the

>>>> measurements pointless?

>>>>

>>> For some unknown reason, about

>>> 10% of these results raises 2 times of normal

>>> results. Say, "normal results" are about

>>> 2200000, but those "outliers" are about

>>> 4400000 (I can't access raw data now).

>>> Variance becomes much smaller if I remove

>>> those outliers.

>>

>

> Find the reason of these outliners.

>

> If perf and 'test-ring-buffer' are scheduled on different processors,

> the performance is bad. I think cache is the main reason.

>

> I will redo the test, bind them to cores on same CPU.

>

> Thank you.


Test method improvements:

1. Set CPU freq:

  # for f in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor  ; 
do echo performance > $f ; done

2. Bind core:
Add following code into head of test-ring-buffer:

         CPU_ZERO(&mask);
         CPU_SET(6, &mask);
         pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
         pthread_yield();

3. Bind core (perf):

Use following command to start perf:

  # taskset -c 7 ./perf record -o /dev/null --no-buildid-cache -e 
raw_syscalls:* test-ring-buffer

New result of 100 test data in both cases:

                   MEAN         STDVAR
BASE            800214.950    2853.083
RAWPERF.PRE    2253846.700    9997.014
RAWPERF.POST   2257495.540    8516.293

Thank you.

[3/5] perf core: Prepare writing into ring buffer from end

Commit Message

Comments

Patch