| Message ID | cover.1556182964.git.viresh.kumar@linaro.org |
|---|---|
| Series | sched/fair: Fallback to sched-idle CPU for better performance |
On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
>
> Hi,
>
> Here is another attempt to get some benefit out of the sched-idle policy.
> The previous version [1] focused on getting better power numbers, and this
> version tries to get better performance, or lower response time, for the
> tasks.
>
> The first patch is unchanged from v1 and accumulates information about
> sched-idle tasks per CPU.
>
> The second patch changes the way the target CPU is selected in the fast
> path. Currently, we look for an idle CPU in select_idle_sibling() to run
> the next task, but if we don't find one it is better, for performance
> reasons, to pick a CPU which will run the task the soonest. A CPU which
> isn't idle but has only SCHED_IDLE activity queued on it should be a good
> target based on this criterion, as any normal fair task will most likely
> preempt the currently running SCHED_IDLE task immediately. In fact,
> choosing a SCHED_IDLE CPU should give better results, as it can run the
> task sooner than an idle CPU (which first has to be woken up from an idle
> state).
>
> Basic testing is done with the help of rt-app currently to make sure the
> task is getting placed correctly.

I ran some tests with this set on our (Facebook's) web service (main job)
and ffmpeg (side job). The results look promising.

For all the tests below, I ran the web service at the same load level, and
the side job under SCHED_IDLE. I present the scheduling latency distribution
of the main job. The latency distribution is measured with the runqlat tool:
https://github.com/iovisor/bpftrace/blob/master/tools/runqlat.bt

I modified the tool to only track the wakeup latency of the main job.

4 cases are compared here:

1. w/o this set, w/o side job;
2. w/ this set, w/o side job;
3. w/o this set, w/ side job;
4. w/ this set, w/ side job.

Case #1. w/o this set, w/o side job
@usecs:
[1]                 1705 |                                                    |
[2, 4)           1102086 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            329160 |@@@@@@@@@@@@@@@                                     |
[8, 16)            34135 |@                                                   |
[16, 32)           37466 |@                                                   |
[32, 64)           15700 |                                                    |
[64, 128)           8759 |                                                    |
[128, 256)          5714 |                                                    |
[256, 512)          3718 |                                                    |
[512, 1K)           2152 |                                                    |
[1K, 2K)             882 |                                                    |
[2K, 4K)             256 |                                                    |
[4K, 8K)              48 |                                                    |
[8K, 16K)              2 |                                                    |

Case #2. w/ this set, w/o side job
@usecs:
[1]                 2121 |                                                    |
[2, 4)           1251877 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)            401517 |@@@@@@@@@@@@@@@@                                    |
[8, 16)            64325 |@@                                                  |
[16, 32)           74352 |@@@                                                 |
[32, 64)           40583 |@                                                   |
[64, 128)          26129 |@                                                   |
[128, 256)         18612 |                                                    |
[256, 512)         12863 |                                                    |
[512, 1K)           8304 |                                                    |
[1K, 2K)            4072 |                                                    |
[2K, 4K)            1569 |                                                    |
[4K, 8K)             411 |                                                    |
[8K, 16K)             70 |                                                    |
[16K, 32K)             1 |                                                    |

From #1 and #2, we see this set probably adds a little scheduling overhead
when there is no side job, but the overhead is clearly very small.

Case #3. w/o this set, w/ side job
@usecs:
[1]                 1282 |                                                    |
[2, 4)            260977 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[4, 8)            446120 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)           136927 |@@@@@@@@@@@@@@@                                     |
[16, 32)          148758 |@@@@@@@@@@@@@@@@@                                   |
[32, 64)          160291 |@@@@@@@@@@@@@@@@@@                                  |
[64, 128)         177292 |@@@@@@@@@@@@@@@@@@@@                                |
[128, 256)        184573 |@@@@@@@@@@@@@@@@@@@@@                               |
[256, 512)        173405 |@@@@@@@@@@@@@@@@@@@@                                |
[512, 1K)         149662 |@@@@@@@@@@@@@@@@@                                   |
[1K, 2K)          120217 |@@@@@@@@@@@@@@                                      |
[2K, 4K)           80275 |@@@@@@@@@                                           |
[4K, 8K)           36108 |@@@@                                                |
[8K, 16K)          11121 |@                                                   |
[16K, 32K)           736 |                                                    |
[32K, 64K)            19 |                                                    |

Case #4. w/ this set, w/ side job
@usecs:
[1]                  407 |                                                    |
[2, 4)            535378 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[4, 8)            795526 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8, 16)            93032 |@@@@@@                                              |
[16, 32)           89890 |@@@@@                                               |
[32, 64)           82775 |@@@@@                                               |
[64, 128)          84413 |@@@@@                                               |
[128, 256)         84413 |@@@@@                                               |
[256, 512)         77202 |@@@@@                                               |
[512, 1K)          66043 |@@@@                                                |
[1K, 2K)           49276 |@@@                                                 |
[2K, 4K)           30114 |@                                                   |
[4K, 8K)           11145 |                                                    |
[8K, 16K)           2328 |                                                    |
[16K, 32K)            88 |                                                    |

#3 and #4 clearly show the benefit of this set: with it, we see
significantly fewer latency samples in the 8 usec and above ranges.

Thanks,
Song

>
> --
> viresh
>
> Viresh Kumar (2):
>   sched: Start tracking SCHED_IDLE tasks count in cfs_rq
>   sched/fair: Fallback to sched-idle CPU if idle CPU isn't found
>
>  kernel/sched/fair.c  | 42 +++++++++++++++++++++++++++++++++---------
>  kernel/sched/sched.h |  2 ++
>  2 files changed, 35 insertions(+), 9 deletions(-)
>
> --
> 2.21.0.rc0.269.g1a574e7a288b
>
> [1] https://lore.kernel.org/lkml/cover.1543229820.git.viresh.kumar@linaro.org/
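The thread does not show how the ffmpeg side job was actually launched, but for readers who want to reproduce this kind of setup, a side job can be placed under SCHED_IDLE either with `chrt --idle 0 <command>` or with a small launcher along the lines of the sketch below (illustrative only; the command and arguments are whatever side job you want to run):

```c
/*
 * Minimal SCHED_IDLE launcher (not part of the patch set): switch the
 * current process to SCHED_IDLE and exec the side job, which inherits
 * the policy.  Roughly equivalent to "chrt --idle 0 <command>".
 */
#define _GNU_SOURCE             /* for SCHED_IDLE in <sched.h> */
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* SCHED_IDLE only accepts a static priority of 0 */
	struct sched_param sp = { .sched_priority = 0 };

	if (argc < 2) {
		fprintf(stderr, "usage: %s <command> [args...]\n", argv[0]);
		return 1;
	}

	if (sched_setscheduler(0, SCHED_IDLE, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}

	/* exec the real side job with the inherited SCHED_IDLE policy */
	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}
```

Running the side job through such a wrapper (or through chrt) is all that is needed for the scheduler to treat it as SCHED_IDLE work, which is what the fallback in this series keys on.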
Hi Song,

On Thu, 9 May 2019 at 23:54, Song Liu <liu.song.a23@gmail.com> wrote:
>
> On Thu, Apr 25, 2019 at 5:38 AM Viresh Kumar <viresh.kumar@linaro.org> wrote:
> >
> > Here is another attempt to get some benefit out of the sched-idle policy.
> > The previous version [1] focused on getting better power numbers, and this
> > version tries to get better performance, or lower response time, for the
> > tasks.

[...]

> I ran some tests with this set on our (Facebook's) web service (main job)
> and ffmpeg (side job). The results look promising.

[...]

> #3 and #4 clearly show the benefit of this set: with it, we see
> significantly fewer latency samples in the 8 usec and above ranges.

Thanks for running tests with this patchset; your results look good, with a
significant decrease in long wakeup latencies.

Vincent
On 25-04-19, 15:07, Viresh Kumar wrote:
> Hi,
>
> Here is another attempt to get some benefit out of the sched-idle policy.

[...]

> Basic testing is done with the help of rt-app currently to make sure the
> task is getting placed correctly.

More results here:

- Tested on the octa-core Hikey platform (all CPUs change frequency
  together).
- The rt-app json is attached here. It creates a few tasks, and we monitor
  their scheduling latency by looking at the "wu_lat" field (usec).
- The histograms are created using https://github.com/adkein/textogram:
  textogram -a 0 -z 1000 -n 10
- The stats are accumulated using https://github.com/nferraz/st
- NOTE: The % values shown don't add up; just look at the total numbers
  instead.

Test 1: Create 8 CFS tasks (no SCHED_IDLE tasks) without this patchset:

   0 - 100  : ################################################## 72% (3688)
 100 - 200  : ################ 24% (1253)
 200 - 300  : ## 2% (149)
 300 - 400  : 0% (22)
 400 - 500  : 0% (1)
 500 - 600  : 0% (3)
 600 - 700  : 0% (1)
 700 - 800  : 0% (1)
 800 - 900  :
 900 - 1000 : 0% (1)
     >1000  : 0% (17)

     N     min  max   sum     mean     stddev
     5136  0    2452  535985  104.358  104.585

Test 2: Create 8 CFS tasks and 5 SCHED_IDLE tasks:

A. Without sched-idle patchset:

   0 - 100  : ################################################## 88% (3102)
 100 - 200  : ## 4% (148)
 200 - 300  : 1% (41)
 300 - 400  : 0% (27)
 400 - 500  : 0% (33)
 500 - 600  : 0% (32)
 600 - 700  : 1% (36)
 700 - 800  : 0% (27)
 800 - 900  : 0% (19)
 900 - 1000 : 0% (26)
     >1000  : 34% (1218)

     N     min  max    sum          mean     stddev
     4710  0    67664  5.25956e+06  1116.68  2315.09

B. With sched-idle patchset:

   0 - 100  : ################################################## 99% (5042)
 100 - 200  : 0% (8)
 200 - 300  :
 300 - 400  :
 400 - 500  : 0% (2)
 500 - 600  : 0% (1)
 600 - 700  :
 700 - 800  : 0% (1)
 800 - 900  : 0% (1)
 900 - 1000 :
     >1000  : 0% (40)

     N     min  max   sum     mean     stddev
     5095  0    7773  523170  102.683  475.482

The mean latency dropped to around 10% of its previous value, and the stddev
to around 25%, with this patchset. I have tried more combinations of CFS and
SCHED_IDLE tasks and see the expected improvement in scheduling latency for
all of them.

--
viresh
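For readers who have not opened the patches themselves, the idea described in the cover letter can be summarised with a rough, kernel-style sketch. This is not the actual patch: the helper and the per-cfs_rq counter name below are hypothetical stand-ins for what the series adds, and the surrounding fast-path plumbing is omitted.

```c
/*
 * Illustrative sketch only, not the actual patch.  The series keeps a count
 * of runnable SCHED_IDLE tasks per cfs_rq; "sched_idle_count" below is a
 * hypothetical stand-in for that field.  The wakeup fast path then prefers
 * a truly idle CPU but falls back to a CPU running only SCHED_IDLE tasks,
 * since a normal CFS task placed there will preempt them almost immediately.
 */
static int cpu_runs_only_sched_idle(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	/* every runnable task on this CPU is SCHED_IDLE */
	return rq->nr_running > 0 &&
	       rq->nr_running == rq->cfs.sched_idle_count;
}

static int pick_idle_or_sched_idle(const struct cpumask *cpus, int target)
{
	int cpu, si_cpu = -1;

	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu))
			return cpu;		/* best case: a truly idle CPU */
		if (si_cpu == -1 && cpu_runs_only_sched_idle(cpu))
			si_cpu = cpu;		/* remember the first sched-idle CPU */
	}

	return si_cpu;				/* -1: no idle or sched-idle CPU found */
}
```

As the cover letter points out, the sched-idle fallback can even beat a genuinely idle CPU on response time, because the chosen CPU does not have to be woken out of an idle state before it can run the new task.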