Message ID: 1464001138-25063-10-git-send-email-morten.rasmussen@arm.com
State: New
On Tue, May 24, 2016 at 08:04:24AM +0800, Yuyang Du wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> >
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 564215d..ce44fa7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >  	lw->weight += inc;
> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >  	return (util >= capacity) ? capacity : util;
> >  }
> >
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +	return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long delta;
> > +	long prev_cap = capacity_of(prev_cpu);
> > +
> > +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +	if (delta < prev_cap >> 3)
> > +		return 0;
>
> delta can be negative? still return 0?

I could add an abs() around delta. Do you have a specific scenario in
mind? Under normal circumstances, I don't think it can be negative?
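For illustration only, a minimal user-space sketch of the abs() variant
mentioned above; the helper name, the stand-alone form and the plain-number
arguments are made up for the example, while the two thresholds follow the
patch:

#include <stdlib.h>	/* labs() */

#define CAPACITY_MARGIN	1280	/* util * 1024 < capacity * margin, ~20% headroom */

/* Sketch: same two checks as wake_cap(), with an abs() guard so a
 * (theoretically) negative delta is treated like a small positive one. */
static int wake_cap_abs_sketch(long max_cap, long prev_cap, long task_util)
{
	long delta = max_cap - prev_cap;

	/* prev_cpu is fairly close to the biggest capacity: keep wake_affine */
	if (labs(delta) < prev_cap >> 3)
		return 0;

	/* otherwise only abort wake_affine if the task does not fit on prev_cpu */
	return prev_cap * 1024 < task_util * CAPACITY_MARGIN;
}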
On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> >
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 564215d..ce44fa7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >  	lw->weight += inc;
> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >  	return (util >= capacity) ? capacity : util;
> >  }
> >
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +	return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long delta;
> > +	long prev_cap = capacity_of(prev_cpu);
> > +
> > +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> > +
> > +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> > +	if (delta < prev_cap >> 3)
> > +		return 0;
> > +
> > +	return prev_cap * 1024 < task_util(p) * capacity_margin;
> > +}
>
> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
> SMT enabled, then each HT has capacity 589, wake_cap() will result in
> always not wake affine, right?

The idea is that SMT systems would bail out already at the previous
condition. We should have max_cpu_capacity == prev_cap == 589, delta
should then be zero and make the first condition true and make
wake_cap() always return 0 for any system with symmetric capacities
regardless of their actual capacity values.

Note that this isn't entirely true as I used capacity_of() for prev_cap,
if I change that to capacity_orig_of() it should be true.

By making the !wake_cap() condition always true for want_affine, we
should preserve existing behaviour for SMT/SMP. The only overhead is the
capacity delta computation and comparison, which should be cheap.

Does that make sense?

Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
temporarily, it should decay to util_avg <= capacity_orig_of(task_cpu(p))
over time. That doesn't affect your question though as the second
condition would still evaluate true if util_avg ==
capacity_orig_of(task_cpu(p)), but as said above the first condition
should bail out before we get here.

Morten

> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > -		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > +			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >  	}
> >
> >  	rcu_read_lock();
> > --
> > 1.9.1
> >
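To make the symmetric-capacity bail-out concrete, here is a small
stand-alone sketch using the 589 figure from the question; the program and
its variable names are illustrative only:

#include <stdio.h>

int main(void)
{
	long prev_cap = 589;	/* HT sibling capacity mentioned above */
	long max_cap = 589;	/* symmetric system: same capacity everywhere */
	long delta = max_cap - prev_cap;	/* 0 */

	/* 0 < (589 >> 3) == 73, so wake_cap() would return 0 here and
	 * wake_affine behaviour is unchanged regardless of task_util(p). */
	printf("delta=%ld threshold=%ld bail out=%s\n",
	       delta, prev_cap >> 3, delta < (prev_cap >> 3) ? "yes" : "no");

	return 0;
}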
On Wed, May 25, 2016 at 06:29:33PM +0800, Wanpeng Li wrote:
> 2016-05-25 17:49 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> > On Wed, May 25, 2016 at 02:57:00PM +0800, Wanpeng Li wrote:
> >> 2016-05-23 18:58 GMT+08:00 Morten Rasmussen <morten.rasmussen@arm.com>:
> >> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> >> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> >> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> >> > compute demand (utilization) is suitable for the cpu capacities
> >> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> >> > balancing take over (find_idlest_{group, cpu}()).
> >> >
> >> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> >> > containing cpus with different capacities. This is enforced by a
> >> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >> >
> >> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> >> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> >> > traversing them.
> >> >
> >> > cc: Ingo Molnar <mingo@redhat.com>
> >> > cc: Peter Zijlstra <peterz@infradead.org>
> >> >
> >> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> >> > ---
> >> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 564215d..ce44fa7 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >> >  #endif
> >> >
> >> > +/*
> >> > + * The margin used when comparing utilization with cpu capacity:
> >> > + * util * 1024 < capacity * margin
> >> > + */
> >> > +unsigned int capacity_margin = 1280; /* ~20% */
> >> > +
> >> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >> >  {
> >> >  	lw->weight += inc;
> >> > @@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
> >> >  	return (util >= capacity) ? capacity : util;
> >> >  }
> >> >
> >> > +static inline int task_util(struct task_struct *p)
> >> > +{
> >> > +	return p->se.avg.util_avg;
> >> > +}
> >> > +
> >> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >> > +{
> >> > +	long delta;
> >> > +	long prev_cap = capacity_of(prev_cpu);
> >> > +
> >> > +	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
> >> > +
> >> > +	/* prev_cpu is fairly close to max, no need to abort wake_affine */
> >> > +	if (delta < prev_cap >> 3)
> >> > +		return 0;
> >> > +
> >> > +	return prev_cap * 1024 < task_util(p) * capacity_margin;
> >> > +}
> >>
> >> If one task util_avg is SCHED_CAPACITY_SCALE and running on x86 box w/
> >> SMT enabled, then each HT has capacity 589, wake_cap() will result in
> >> always not wake affine, right?
> >
> > The idea is that SMT systems would bail out already at the previous
> > condition. We should have max_cpu_capacity == prev_cap == 589, delta
> > should then be zero and make the first condition true and make
> > wake_cap() always return 0 for any system with symmetric capacities
> > regardless of their actual capacity values.
> >
> > Note that this isn't entirely true as I used capacity_of() for prev_cap,
> > if I change that to capacity_orig_of() it should be true.
> >
> > By making the !wake_cap() condition always true for want_affine, we
> > should preserve existing behaviour for SMT/SMP. The only overhead is the
> > capacity delta computation and comparison, which should be cheap.
> >
> > Does that make sense?
>
> Fair enough, thanks for your explanation.
>
> > Btw, task util_avg == SCHED_CAPACITY_SCALE should only be possible
> > temporarily, it should decay to util_avg <=
> > capacity_orig_of(task_cpu(p)) over time. That doesn't affect your
>
> Sorry, I didn't find it will decay to capacity_orig in
> __update_load_avg(), could you elaborate?

I should have checked the code before writing that :-( I thought the
scaling by arch_scale_cpu_capacity() in __update_load_avg() would do
that, but it turns out that the default implementation of
arch_scale_cpu_capacity() doesn't do that when we pass a NULL pointer
for the sched_domain, it would have returned smt_gain/span_weight ==
capacity_orig_of(cpu) otherwise.

Sorry for the confusion, though I'm not sure if it is right to return
SCHED_CAPACITY_SCALE for SMT systems.
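For reference, the default helper being discussed looks roughly like the
sketch below. This is a user-space reconstruction based on the description
above, not a copy of the kernel code; the flag value, struct layout and
smt_gain value are simplified placeholders:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define SD_SHARE_CPUCAPACITY	0x0001	/* placeholder flag value */

struct sched_domain {
	unsigned long flags;
	unsigned int span_weight;
	unsigned long smt_gain;
};

/* With a sched_domain the result is scaled by smt_gain/span_weight; with a
 * NULL pointer (as in the utilization path) it is not, which is why util_avg
 * does not decay towards capacity_orig_of(cpu) on SMT. */
static unsigned long scale_cpu_capacity_sketch(struct sched_domain *sd, int cpu)
{
	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && sd->span_weight > 1)
		return sd->smt_gain / sd->span_weight;	/* e.g. 1178 / 2 = 589 */

	return SCHED_CAPACITY_SCALE;
}

int main(void)
{
	struct sched_domain smt = { SD_SHARE_CPUCAPACITY, 2, 1178 };

	printf("with sd: %lu, with NULL sd: %lu\n",
	       scale_cpu_capacity_sketch(&smt, 0),
	       scale_cpu_capacity_sketch(NULL, 0));

	return 0;
}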
On Thu, Jun 02, 2016 at 04:21:05PM +0200, Peter Zijlstra wrote:
> On Mon, May 23, 2016 at 11:58:51AM +0100, Morten Rasmussen wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> >
> > The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
> > containing cpus with different capacities. This is enforced by a
> > previous patch based on the SD_ASYM_CPUCAPACITY flag.
> >
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
>
> I'm a bit confused...
>
> Lets assume a 2+2 big.little thing with shared LLC:
>
>   ---------- SD2 ----------
>
>   -- SD1 --      -- SD1 --
>
>   0       1      2       3
>
> SD1: WAKE_AFFINE, BALANCE_WAKE
> SD2: ASYM_CAPACITY, BALANCE_WAKE
>
> t0 used to run on cpu1, t1 used to run on cpu2
>
> cpu0 wakes t0:
>
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>     cpumask_test_cpu(prev_cpu, sd_mask) == true
>       affine_sd = SD1
>       break;
>
>   affine_sd != NULL -> affine-wakeup
>
> cpu0 wakes t1:
>
>   want_affine = 1
>   SD1:
>     WAKE_AFFINE
>     cpumask_test_cpu(prev_cpu, sd_mask) == false
>   SD2:
>     BALANCE_WAKE
>       sd = SD2
>
>   affine_sd == NULL, sd == SD2 -> find_idlest_*()
>
> All without this patch...
>
> So what is this thing doing?

Not very much in those cases, but it makes one important difference in
one case.

We could do fine without the patch if we could assume that all tasks are
already in the right SD according to their PELT utilization, and if not
they will be woken up by a cpu in the right SD (so we do
find_idlest_*()). But we can't :-(

Let's take your example above and add that t0 should really be running
on cpu2/3 due to its utilization, assuming SD1[01] are little cpus and
SD1[23] are big cpus. In that case we would still do affine-wakeup and
stick the task on cpu0 despite it being a little cpu.

To avoid that, this patch sets want_affine = 0 in that case so we go
find_idlest_*() to give the task a chance of being put on cpu2/3.

The patch is also setting want_affine = 0 for other cases which are
already taking the find_idlest_*() route due to the cpumask test, as
illustrated by your example above. We can have the following scenarios:

b = big cpu capacity/task util
l = little cpu capacity/task util
x = don't care

case  task util  prev_cpu  this_cpu  wakeup
-------------------------------------------------
1     b          b         b         affine (b)
2     b          b         l         slow (b)
3     b          l         b         slow (b)
4     b          l         l         slow (b)
5     l          b         b         affine (x)
6     l          b         l         slow (x)
7     l          l         b         slow (x)
8     l          l         l         affine (x)

Without the patch we would do affine-wakeup on little in case 4, where
we want to wake up on a big cpu. We only do affine-wakeup when both
this_cpu and prev_cpu have the same capacity and that capacity is
sufficient.

Vincent pointed out that this is overly restrictive, as it is perfectly
safe to do affine-wakeup in cases 6 and 7, where the waker and the
previous cpu have sufficient capacity but they are not the same. If we
made wake_affine() consider cpu capacity, it should be possible to do
affine-wakeup even for cases 2 and 3, leaving us with only case 4
requiring the find_idlest_*() route.

There are more cases for taking the slow wakeup path if you have more
than two cpu capacities to deal with, but I'm going to spare you the
full detailed table ;-)
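To see how the table's cases fall out of wake_cap(), here is a small
stand-alone sketch; only the two checks follow the patch, while the
capacities (1024 for big, 430 for little) and utilizations (800 for a 'b'
task, 200 for an 'l' task) are made-up numbers:

#include <stdio.h>

#define CAPACITY_MARGIN	1280

/* Same two checks as wake_cap() in the patch, fed with plain numbers. */
static int wake_cap_sketch(long max_cap, long prev_cap, long task_util)
{
	long delta = max_cap - prev_cap;

	/* capacities symmetric enough: keep wake_affine */
	if (delta < prev_cap >> 3)
		return 0;

	return prev_cap * 1024 < task_util * CAPACITY_MARGIN;
}

int main(void)
{
	long big = 1024, little = 430;

	/* case 1: 'b' task, prev_cpu big    -> 0, affine wakeup allowed */
	printf("case 1: %d\n", wake_cap_sketch(big, big, 800));

	/* case 4: 'b' task, prev_cpu little -> 1, go find_idlest_*() */
	printf("case 4: %d\n", wake_cap_sketch(big, little, 800));

	/* case 8: 'l' task, prev_cpu little -> 0, affine wakeup allowed */
	printf("case 8: %d\n", wake_cap_sketch(big, little, 200));

	return 0;
}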
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 564215d..ce44fa7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5293,6 +5299,25 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long delta;
+	long prev_cap = capacity_of(prev_cpu);
+
+	delta = cpu_rq(cpu)->rd->max_cpu_capacity - prev_cap;
+
+	/* prev_cpu is fairly close to max, no need to abort wake_affine */
+	if (delta < prev_cap >> 3)
+		return 0;
+
+	return prev_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5316,7 +5341,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();
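As a side note on the capacity_margin constant in the hunk above, a quick
stand-alone calculation (illustrative only) shows what the value 1280 means
in percentage terms:

#include <stdio.h>

int main(void)
{
	unsigned int capacity_margin = 1280;	/* same value as in the patch */

	/* "prev_cap * 1024 < task_util * margin" means the task only "fits"
	 * while its utilization stays below 1024/1280 = 80% of the capacity,
	 * i.e. the ~20% headroom mentioned in the comment. */
	printf("allowed utilization: %.0f%% of capacity\n",
	       100.0 * 1024 / capacity_margin);

	return 0;
}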
Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for the cpu capacities
available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
balancing take over (find_idlest_{group, cpu}()).

The assumption is that SD_WAKE_AFFINE is never set for a sched_domain
containing cpus with different capacities. This is enforced by a
previous patch based on the SD_ASYM_CPUCAPACITY flag.

Ideally, we shouldn't set 'want_affine' in the first place, but we don't
know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
traversing them.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

-- 
1.9.1