Message ID | 56F52E83.70409@huawei.com |
---|---|
State | New |
Headers | show |
On 2016/3/25 20:26, Wangnan (F) wrote: > > > On 2016/3/23 17:50, Peter Zijlstra wrote: >> On Mon, Mar 14, 2016 at 09:59:43AM +0000, Wang Nan wrote: >>> Convert perf_output_begin to __perf_output_begin and make the later >>> function able to write records from the end of the ring buffer. >>> Following commits will utilize the 'backward' flag. >>> >>> This patch doesn't introduce any extra performance overhead since we >>> use always_inline. >> So while I agree that with __always_inline and constant propagation we >> _should_ end up with the same code, we have: >> >> $ size defconfig-build/kernel/events/ring_buffer.o.{pre,post} >> text data bss dec hex filename >> 3785 2 0 3787 ecb >> defconfig-build/kernel/events/ring_buffer.o.pre >> 3673 2 0 3675 e5b >> defconfig-build/kernel/events/ring_buffer.o.post >> >> The patch actually makes the file shrink. >> >> So I think we still want to have some actual performance numbers. > > In my environment the two objects are nearly idential: > > > $ objdump -d kernel/events/ring_buffer.o.new > ./out.new.S > $ objdump -d kernel/events/ring_buffer.o.old > ./out.old.S > > --- ./out.old.S 2016-03-25 12:18:52.060656423 +0000 > +++ ./out.new.S 2016-03-25 12:18:45.376630269 +0000 > @@ -1,5 +1,5 @@ > > -kernel/events/ring_buffer.o.old: file format elf64-x86-64 > +kernel/events/ring_buffer.o.new: file format elf64-x86-64 > > > Disassembly of section .text: > @@ -320,7 +320,7 @@ > 402: 4d 8d 04 0f lea (%r15,%rcx,1),%r8 > 406: 48 89 c8 mov %rcx,%rax > 409: 4c 0f b1 43 40 cmpxchg %r8,0x40(%rbx) > - 40e: 48 39 c8 cmp %rcx,%rax > + 40e: 48 39 c1 cmp %rax,%rcx > 411: 75 b4 jne 3c7 <perf_output_begin+0xc7> > 413: 48 8b 73 58 mov 0x58(%rbx),%rsi > 417: 48 8b 43 68 mov 0x68(%rbx),%rax > @@ -357,7 +357,7 @@ > 480: 85 c0 test %eax,%eax > 482: 0f 85 02 ff ff ff jne 38a <perf_output_begin+0x8a> > 488: 48 c7 c2 00 00 00 00 mov $0x0,%rdx > - 48f: be 7c 00 00 00 mov $0x7c,%esi > + 48f: be 89 00 00 00 mov $0x89,%esi > 494: 48 c7 c7 00 00 00 00 mov $0x0,%rdi > 49b: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4a2 > <perf_output_begin+0x1a2> > 4a2: e8 00 00 00 00 callq 4a7 <perf_output_begin+0x1a7> > @@ -874,7 +874,7 @@ > c39: eb e7 jmp c22 > <perf_aux_output_begin+0x172> > c3b: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # c42 > <perf_aux_output_begin+0x192> > c42: 75 93 jne bd7 > <perf_aux_output_begin+0x127> > - c44: be 2b 01 00 00 mov $0x12b,%esi > + c44: be 49 01 00 00 mov $0x149,%esi > c49: 48 c7 c7 00 00 00 00 mov $0x0,%rdi > c50: e8 00 00 00 00 callq c55 > <perf_aux_output_begin+0x1a5> > c55: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # c5c > <perf_aux_output_begin+0x1ac> > > > I think you enabled some unusual config options? > You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result: $ size kernel/events/ring_buffer.o* text data bss dec hex filename 4545 4 8 4557 11cd kernel/events/ring_buffer.o.new 4641 4 8 4653 122d kernel/events/ring_buffer.o.old Thank you.
On 2016/3/25 20:36, Wangnan (F) wrote: > > > On 2016/3/25 20:26, Wangnan (F) wrote: >> >> >> On 2016/3/23 17:50, Peter Zijlstra wrote: >>> On Mon, Mar 14, 2016 at 09:59:43AM +0000, Wang Nan wrote: >>>> Convert perf_output_begin to __perf_output_begin and make the later >>>> function able to write records from the end of the ring buffer. >>>> Following commits will utilize the 'backward' flag. >>>> >>>> This patch doesn't introduce any extra performance overhead since we >>>> use always_inline. >>> So while I agree that with __always_inline and constant propagation we >>> _should_ end up with the same code, we have: >>> >>> $ size defconfig-build/kernel/events/ring_buffer.o.{pre,post} >>> text data bss dec hex filename >>> 3785 2 0 3787 ecb >>> defconfig-build/kernel/events/ring_buffer.o.pre >>> 3673 2 0 3675 e5b >>> defconfig-build/kernel/events/ring_buffer.o.post >>> >>> The patch actually makes the file shrink. >>> >>> So I think we still want to have some actual performance numbers. >> >> In my environment the two objects are nearly idential: >> >> >> $ objdump -d kernel/events/ring_buffer.o.new > ./out.new.S >> $ objdump -d kernel/events/ring_buffer.o.old > ./out.old.S >> >> --- ./out.old.S 2016-03-25 12:18:52.060656423 +0000 >> +++ ./out.new.S 2016-03-25 12:18:45.376630269 +0000 >> @@ -1,5 +1,5 @@ >> >> -kernel/events/ring_buffer.o.old: file format elf64-x86-64 >> +kernel/events/ring_buffer.o.new: file format elf64-x86-64 >> >> >> Disassembly of section .text: >> @@ -320,7 +320,7 @@ >> 402: 4d 8d 04 0f lea (%r15,%rcx,1),%r8 >> 406: 48 89 c8 mov %rcx,%rax >> 409: 4c 0f b1 43 40 cmpxchg %r8,0x40(%rbx) >> - 40e: 48 39 c8 cmp %rcx,%rax >> + 40e: 48 39 c1 cmp %rax,%rcx >> 411: 75 b4 jne 3c7 <perf_output_begin+0xc7> >> 413: 48 8b 73 58 mov 0x58(%rbx),%rsi >> 417: 48 8b 43 68 mov 0x68(%rbx),%rax >> @@ -357,7 +357,7 @@ >> 480: 85 c0 test %eax,%eax >> 482: 0f 85 02 ff ff ff jne 38a <perf_output_begin+0x8a> >> 488: 48 c7 c2 00 00 00 00 mov $0x0,%rdx >> - 48f: be 7c 00 00 00 mov $0x7c,%esi >> + 48f: be 89 00 00 00 mov $0x89,%esi >> 494: 48 c7 c7 00 00 00 00 mov $0x0,%rdi >> 49b: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4a2 >> <perf_output_begin+0x1a2> >> 4a2: e8 00 00 00 00 callq 4a7 <perf_output_begin+0x1a7> >> @@ -874,7 +874,7 @@ >> c39: eb e7 jmp c22 >> <perf_aux_output_begin+0x172> >> c3b: 80 3d 00 00 00 00 00 cmpb $0x0,0x0(%rip) # c42 >> <perf_aux_output_begin+0x192> >> c42: 75 93 jne bd7 >> <perf_aux_output_begin+0x127> >> - c44: be 2b 01 00 00 mov $0x12b,%esi >> + c44: be 49 01 00 00 mov $0x149,%esi >> c49: 48 c7 c7 00 00 00 00 mov $0x0,%rdi >> c50: e8 00 00 00 00 callq c55 >> <perf_aux_output_begin+0x1a5> >> c55: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # c5c >> <perf_aux_output_begin+0x1ac> >> >> >> I think you enabled some unusual config options? >> > > You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result: > > $ size kernel/events/ring_buffer.o* > text data bss dec hex filename > 4545 4 8 4557 11cd > kernel/events/ring_buffer.o.new > 4641 4 8 4653 122d > kernel/events/ring_buffer.o.old > > Thank you. > After enabling CONFIG_OPTIMIZE_INLINING: Test its performance by calling 'close(-1)' for 3000000 times and use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to capture system calls: MEAN STDVAR BASE 800077.1 23448.13 RAWPERF.PRE 2465858.0 603473.70 RAWPERF.POST 2471925.0 609437.60 Considering the high stdvar, after applying this patch the performance is not change. I'll put this number in my commit message. Thank you.
On 2016/3/27 23:30, pi3orama wrote: > > 发自我的 iPhone > >> 在 2016年3月27日,下午11:20,Peter Zijlstra <peterz@infradead.org> 写道: >> >> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote: >>>>> I think you enabled some unusual config options? >> x86_64-defconfig >> >>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result: >> It has that indeed. >> >>> After enabling CONFIG_OPTIMIZE_INLINING: >>> >>> Test its performance by calling 'close(-1)' for 3000000 times and >>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to >>> capture system calls: >>> >>> MEAN STDVAR >>> BASE 800077.1 23448.13 >>> RAWPERF.PRE 2465858.0 603473.70 >>> RAWPERF.POST 2471925.0 609437.60 >>> >>> Considering the high stdvar, after applying this patch the performance >>> is not change. >> Why is your variance so immense? And doesn't that render the >> measurements pointless? >> > For some unknown reason, about > 10% of these results raises 2 times of normal > results. Say, "normal results" are about > 2200000, but those "outliers" are about > 4400000 (I can't access raw data now). > Variance becomes much smaller if I remove > those outliers. After manually removing outliners (remove 10 outliners from 100 raw data points in each data set): MEAN STDVAR BASE 800077.1 23448.13 RAWPERF.PRE 2265741.0 10421.35 RAWPERF.POST 2269826.0 10507.45 Thank you. > I guess the outliers is caused by some type > of lock stepping? No clue about it. > > Thank you. >
On 2016/3/28 9:07, Wangnan (F) wrote: > > > On 2016/3/27 23:30, pi3orama wrote: >> >> 发自我的 iPhone >> >>> 在 2016年3月27日,下午11:20,Peter Zijlstra <peterz@infradead.org> >>> 写道: >>> >>> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote: >>>>>> I think you enabled some unusual config options? >>> x86_64-defconfig >>> >>>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result: >>> It has that indeed. >>> >>>> After enabling CONFIG_OPTIMIZE_INLINING: >>>> >>>> Test its performance by calling 'close(-1)' for 3000000 times and >>>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to >>>> capture system calls: >>>> >>>> MEAN STDVAR >>>> BASE 800077.1 23448.13 >>>> RAWPERF.PRE 2465858.0 603473.70 >>>> RAWPERF.POST 2471925.0 609437.60 >>>> >>>> Considering the high stdvar, after applying this patch the performance >>>> is not change. >>> Why is your variance so immense? And doesn't that render the >>> measurements pointless? >>> >> For some unknown reason, about >> 10% of these results raises 2 times of normal >> results. Say, "normal results" are about >> 2200000, but those "outliers" are about >> 4400000 (I can't access raw data now). >> Variance becomes much smaller if I remove >> those outliers. > Find the reason of these outliners. If perf and 'test-ring-buffer' are scheduled on different processors, the performance is bad. I think cache is the main reason. I will redo the test, bind them to cores on same CPU. Thank you.
On 2016/3/28 9:58, Wangnan (F) wrote: > > > On 2016/3/28 9:07, Wangnan (F) wrote: >> >> >> On 2016/3/27 23:30, pi3orama wrote: >>> >>> 发自我的 iPhone >>> >>>> 在 2016年3月27日,下午11:20,Peter Zijlstra <peterz@infradead.org> >>>> 写道: >>>> >>>> On Fri, Mar 25, 2016 at 10:14:36PM +0800, Wangnan (F) wrote: >>>>>>> I think you enabled some unusual config options? >>>> x86_64-defconfig >>>> >>>>>> You must enabled CONFIG_OPTIMIZE_INLINING. Now I get similar result: >>>> It has that indeed. >>>> >>>>> After enabling CONFIG_OPTIMIZE_INLINING: >>>>> >>>>> Test its performance by calling 'close(-1)' for 3000000 times and >>>>> use 'perf record -o /dev/null -e raw_syscalls:* test-ring-buffer' to >>>>> capture system calls: >>>>> >>>>> MEAN STDVAR >>>>> BASE 800077.1 23448.13 >>>>> RAWPERF.PRE 2465858.0 603473.70 >>>>> RAWPERF.POST 2471925.0 609437.60 >>>>> >>>>> Considering the high stdvar, after applying this patch the >>>>> performance >>>>> is not change. >>>> Why is your variance so immense? And doesn't that render the >>>> measurements pointless? >>>> >>> For some unknown reason, about >>> 10% of these results raises 2 times of normal >>> results. Say, "normal results" are about >>> 2200000, but those "outliers" are about >>> 4400000 (I can't access raw data now). >>> Variance becomes much smaller if I remove >>> those outliers. >> > > Find the reason of these outliners. > > If perf and 'test-ring-buffer' are scheduled on different processors, > the performance is bad. I think cache is the main reason. > > I will redo the test, bind them to cores on same CPU. > > Thank you. Test method improvements: 1. Set CPU freq: # for f in /sys/devices/system/cpu/cpufreq/policy*/scaling_governor ; do echo performance > $f ; done 2. Bind core: Add following code into head of test-ring-buffer: CPU_ZERO(&mask); CPU_SET(6, &mask); pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask); pthread_yield(); 3. Bind core (perf): Use following command to start perf: # taskset -c 7 ./perf record -o /dev/null --no-buildid-cache -e raw_syscalls:* test-ring-buffer New result of 100 test data in both cases: MEAN STDVAR BASE 800214.950 2853.083 RAWPERF.PRE 2253846.700 9997.014 RAWPERF.POST 2257495.540 8516.293 Thank you.
--- ./out.old.S 2016-03-25 12:18:52.060656423 +0000 +++ ./out.new.S 2016-03-25 12:18:45.376630269 +0000 @@ -1,5 +1,5 @@ -kernel/events/ring_buffer.o.old: file format elf64-x86-64 +kernel/events/ring_buffer.o.new: file format elf64-x86-64 Disassembly of section .text: @@ -320,7 +320,7 @@ 402: 4d 8d 04 0f lea (%r15,%rcx,1),%r8 406: 48 89 c8 mov %rcx,%rax 409: 4c 0f b1 43 40 cmpxchg %r8,0x40(%rbx) - 40e: 48 39 c8 cmp %rcx,%rax + 40e: 48 39 c1 cmp %rax,%rcx 411: 75 b4 jne 3c7 <perf_output_begin+0xc7> 413: 48 8b 73 58 mov 0x58(%rbx),%rsi 417: 48 8b 43 68 mov 0x68(%rbx),%rax @@ -357,7 +357,7 @@ 480: 85 c0 test %eax,%eax 482: 0f 85 02 ff ff ff jne 38a <perf_output_begin+0x8a> 488: 48 c7 c2 00 00 00 00 mov $0x0,%rdx - 48f: be 7c 00 00 00 mov $0x7c,%esi + 48f: be 89 00 00 00 mov $0x89,%esi 494: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 49b: c6 05 00 00 00 00 01 movb $0x1,0x0(%rip) # 4a2 <perf_output_begin+0x1a2>