Message ID | 1400860385-14555-3-git-send-email-vincent.guittot@linaro.org |
---|---|
State | New |
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> I have tried to understand the meaning of the condition :
>  (this_load <= load &&
>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> but i failed to find a use case that can take advantage of it and i haven't
> found description of it in the previous commits' log.

commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48

    int try_to_wake_up():

    in this function the value SCHED_LOAD_BALANCE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on.  While this would be a valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero it will lead to anomalous load balancing if the
    distribution is skewed in either direction.  To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
    or by the average load_weight per task for the queue in question (as
    appropriate).

                        if ((tl <= load &&
-                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+                               tl + target_load(cpu, idx) <= tl_per_task) ||
+                               100*(tl + p->load_weight) <= imbalance*load) {

commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f

+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

So back when the code got introduced, it read:

	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE

So while the first line makes some sense, the second line is still
somewhat challenging.

I read the second line something like: if there's less than one full
task running on the combined cpus.

Now for idx==0 this is hard, because even when sync=1 you can only make
it true if both cpus are completely idle, in which case you really want
to move to the waking cpu I suppose.

One task running will have it == SCHED_LOAD_SCALE.

But for idx>0 this can trigger in all kinds of situations of light load.
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>
> If sync is set, it's not as straight forward as above (especially if cgroup
> are involved)

avg load with cgroups is 'interesting' alright.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9587ed1..30240ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  {
>  	s64 this_load, load;
>  	int idx, this_cpu, prev_cpu;
> -	unsigned long tl_per_task;
>  	struct task_group *tg;
>  	unsigned long weight;
>  	int balanced;
> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		balanced = this_eff_load <= prev_eff_load;
>  	} else
>  		balanced = true;
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>
> +	if (!balanced)
> +		return 0;
>  	/*
>  	 * If the currently running task will sleep within
>  	 * a reasonable amount of time then attract this newly
>  	 * woken task:
>  	 */
> +	if (sync)
>  		return 1;
>
> +	schedstat_inc(sd, ttwu_move_affine);
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine);
>
> +	return 1;
>  }

So I'm not usually one for schedstat nitpicking, but should we fix it in
the sync case? That is, for sync we return 1 but do no inc
nr_wakeups_affine, even though its going to be an affine wakeup.
On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> I have tried to understand the meaning of the condition :
>>  (this_load <= load &&
>>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> but i failed to find a use case that can take advantage of it and i haven't
>> found description of it in the previous commits' log.
>
> commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>
>     int try_to_wake_up():
>
>     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>     contribution of a single task in various calculations in the code that
>     decides which CPU to put the waking task on.  While this would be a valid
>     on a system where the nice values for the runnable tasks were distributed
>     evenly around zero it will lead to anomalous load balancing if the
>     distribution is skewed in either direction.  To overcome this problem
>     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>     or by the average load_weight per task for the queue in question (as
>     appropriate).
>
>                         if ((tl <= load &&
> -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> +                               100*(tl + p->load_weight) <= imbalance*load) {

The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
where task_hot had been replaced by
+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

but as explained, i haven't found a clear explanation of this condition

>
>
> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>
>
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>
>
> So back when the code got introduced, it read:
>
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>
> So while the first line makes some sense, the second line is still
> somewhat challenging.
>
> I read the second line something like: if there's less than one full
> task running on the combined cpus.

ok. your explanation makes sense

>
> Now for idx==0 this is hard, because even when sync=1 you can only make
> it true if both cpus are completely idle, in which case you really want
> to move to the waking cpu I suppose.

This use case is already taken into account by

if (this_load > 0)
..
else
   balance = true

>
> One task running will have it == SCHED_LOAD_SCALE.
>
> But for idx>0 this can trigger in all kinds of situations of light load.

target_load is the max between load for idx == 0 and load for the
selected idx so we have even less chance to match the condition : both
cpu are completely idle
On 27 May 2014 15:45, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9587ed1..30240ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  {
>>  	s64 this_load, load;
>>  	int idx, this_cpu, prev_cpu;
>> -	unsigned long tl_per_task;
>>  	struct task_group *tg;
>>  	unsigned long weight;
>>  	int balanced;
>> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  		balanced = this_eff_load <= prev_eff_load;
>>  	} else
>>  		balanced = true;
>> +	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>>
>> +	if (!balanced)
>> +		return 0;
>>  	/*
>>  	 * If the currently running task will sleep within
>>  	 * a reasonable amount of time then attract this newly
>>  	 * woken task:
>>  	 */
>> +	if (sync)
>>  		return 1;
>>
>> +	schedstat_inc(sd, ttwu_move_affine);
>> +	schedstat_inc(p, se.statistics.nr_wakeups_affine);
>>
>> +	return 1;
>>  }
>
> So I'm not usually one for schedstat nitpicking, but should we fix it in
> the sync case? That is, for sync we return 1 but do no inc
> nr_wakeups_affine, even though its going to be an affine wakeup.

ok, i'm going to move schedstat update at the right place
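To make the accounting question concrete, here is a minimal user-space sketch of the tail of wake_affine(). It is not the follow-up patch: `struct affine_stats`, `tail_as_posted()` and `tail_counting_sync()` are hypothetical names, and the two counters only stand in for the schedstat fields named in the diff. The first function mirrors the control flow as posted, the second shows one possible arrangement in which every wakeup that returns 1 is counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the schedstat counters touched by wake_affine(). */
struct affine_stats {
	unsigned long attempts;	/* models nr_wakeups_affine_attempts */
	unsigned long affine;	/* models nr_wakeups_affine and ttwu_move_affine */
};

/* Control flow as posted: the sync shortcut returns before the counters. */
static int tail_as_posted(bool balanced, bool sync, struct affine_stats *st)
{
	assert(st != NULL);
	st->attempts++;
	if (!balanced)
		return 0;
	if (sync)
		return 1;	/* affine wakeup, but st->affine is not incremented */
	st->affine++;
	return 1;
}

/* One possible rearrangement (an assumption): count every affine wakeup. */
static int tail_counting_sync(bool balanced, bool sync, struct affine_stats *st)
{
	assert(st != NULL);
	(void)sync;		/* sync no longer changes the outcome at this point */
	st->attempts++;
	if (!balanced)
		return 0;
	st->affine++;
	return 1;
}
```

With balanced and sync both set, the first variant returns 1 and leaves `affine` untouched, which is the case Peter points at; the second variant increments it.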
On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> >> I have tried to understand the meaning of the condition :
> >>  (this_load <= load &&
> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> >> but i failed to find a use case that can take advantage of it and i haven't
> >> found description of it in the previous commits' log.
> >
> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
> >
> >     int try_to_wake_up():
> >
> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
> >     contribution of a single task in various calculations in the code that
> >     decides which CPU to put the waking task on.  While this would be a valid
> >     on a system where the nice values for the runnable tasks were distributed
> >     evenly around zero it will lead to anomalous load balancing if the
> >     distribution is skewed in either direction.  To overcome this problem
> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
> >     or by the average load_weight per task for the queue in question (as
> >     appropriate).
> >
> >                         if ((tl <= load &&
> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>
> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
> where task_hot had been replaced by
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>
> but as explained, i haven't found a clear explanation of this condition

Yeah, that's the commit I had below; but I suppose we could ask Nick if
we really want, I've heard he still replies to email, even though he's
locked up in a basement somewhere :-)

> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
> >
> >
> > +                       if ((tl <= load &&
> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> >
> >
> > So back when the code got introduced, it read:
> >
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
> >
> > So while the first line makes some sense, the second line is still
> > somewhat challenging.
> >
> > I read the second line something like: if there's less than one full
> > task running on the combined cpus.
>
> ok. your explanation makes sense

Maybe, its still slightly weird :-)

> >
> > Now for idx==0 this is hard, because even when sync=1 you can only make
> > it true if both cpus are completely idle, in which case you really want
> > to move to the waking cpu I suppose.
>
> This use case is already taken into account by
>
> if (this_load > 0)
> ..
> else
>    balance = true

Agreed.

> > One task running will have it == SCHED_LOAD_SCALE.
> >
> > But for idx>0 this can trigger in all kinds of situations of light load.
>
> target_load is the max between load for idx == 0 and load for the
> selected idx so we have even less chance to match the condition : both
> cpu are completely idle

Ah, yes, I forgot to look at the target_load() thing and missed the max,
yes that all makes it entirely less likely.
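The max/min behaviour discussed above can be sketched in user space as follows. This is a simplified model, not the kernel code: `struct cpu_load_model`, `MODEL_LOAD_IDX_MAX`, `model_source_load()` and `model_target_load()` are hypothetical names, and the model assumes that idx == 0 means the instantaneous load while idx > 0 selects slot idx-1 of a decayed history. The low estimate takes the min of the two values and the high estimate takes the max.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_LOAD_IDX_MAX 5	/* hypothetical number of history slots */

/* Simplified per-CPU load state (an assumption, not struct rq). */
struct cpu_load_model {
	unsigned long instantaneous;			/* load seen right now (idx == 0) */
	unsigned long history[MODEL_LOAD_IDX_MAX];	/* decayed loads, slot idx-1 for idx > 0 */
};

/* Low estimate: min of the instantaneous load and the selected history slot. */
static unsigned long model_source_load(const struct cpu_load_model *c, int idx)
{
	unsigned long h;

	assert(c != NULL && idx >= 0 && idx <= MODEL_LOAD_IDX_MAX);
	if (idx == 0)
		return c->instantaneous;
	h = c->history[idx - 1];
	return h < c->instantaneous ? h : c->instantaneous;
}

/* High estimate: max of the instantaneous load and the selected history slot. */
static unsigned long model_target_load(const struct cpu_load_model *c, int idx)
{
	unsigned long h;

	assert(c != NULL && idx >= 0 && idx <= MODEL_LOAD_IDX_MAX);
	if (idx == 0)
		return c->instantaneous;
	h = c->history[idx - 1];
	return h > c->instantaneous ? h : c->instantaneous;
}
```

In this model a CPU that is idle right now but still carries history reports a non-zero target load for idx > 0, so a sum of two target loads can only reach zero when both the instantaneous value and the selected history slot are zero on both CPUs.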
On 27 May 2014 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
>> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> >> I have tried to understand the meaning of the condition :
>> >>  (this_load <= load &&
>> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> >> but i failed to find a use case that can take advantage of it and i haven't
>> >> found description of it in the previous commits' log.
>> >
>> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>> >
>> >     int try_to_wake_up():
>> >
>> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>> >     contribution of a single task in various calculations in the code that
>> >     decides which CPU to put the waking task on.  While this would be a valid
>> >     on a system where the nice values for the runnable tasks were distributed
>> >     evenly around zero it will lead to anomalous load balancing if the
>> >     distribution is skewed in either direction.  To overcome this problem
>> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>> >     or by the average load_weight per task for the queue in question (as
>> >     appropriate).
>> >
>> >                         if ((tl <= load &&
>> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
>> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>>
>> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
>> where task_hot had been replaced by
>> +                       if ((tl <= load &&
>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> but as explained, i haven't found a clear explanation of this condition
>
> Yeah, that's the commit I had below; but I suppose we could ask Nick if
> we really want, I've heard he still replies to email, even though he's
> locked up in a basement somewhere :-)

ok, I have added him in the list

Nick,

While doing some rework on the wake affine part of the scheduler, i
failed to catch the use case that takes advantage of a condition that
you added some while ago with the commit
a3f21bce1fefdf92a4d1705e888d390b10f3ac6f

Could you help us to clarify the 2 first lines of the test that you added ?
+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

Regards,
Vincent

>
>> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>> >
>> >
>> > +                       if ((tl <= load &&
>> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> >
>> >
>> > So back when the code got introduced, it read:
>> >
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>> >
>> > So while the first line makes some sense, the second line is still
>> > somewhat challenging.
>> >
>> > I read the second line something like: if there's less than one full
>> > task running on the combined cpus.
>>
>> ok. your explanation makes sense
>
> Maybe, its still slightly weird :-)
>
>> >
>> > Now for idx==0 this is hard, because even when sync=1 you can only make
>> > it true if both cpus are completely idle, in which case you really want
>> > to move to the waking cpu I suppose.
>>
>> This use case is already taken into account by
>>
>> if (this_load > 0)
>> ..
>> else
>>    balance = true
>
> Agreed.
>
>> > One task running will have it == SCHED_LOAD_SCALE.
>> >
>> > But for idx>0 this can trigger in all kinds of situations of light load.
>>
>> target_load is the max between load for idx == 0 and load for the
>> selected idx so we have even less chance to match the condition : both
>> cpu are completely idle
>
> Ah, yes, I forgot to look at the target_load() thing and missed the max,
> yes that all makes it entirely less likely.
Using another email address for Nick

On 27 May 2014 18:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
[...]
Hi Vincent & Peter,

On 28/05/14 07:49, Vincent Guittot wrote:
[...]
>
> Nick,
>
> While doing some rework on the wake affine part of the scheduler, i
> failed to catch the use case that takes advantage of a condition that
> you added some while ago with the commit
> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>
> Could you help us to clarify the 2 first lines of the test that you added ?
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <=
> SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>
> Regards,
> Vincent
>>
>>>
>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>
>>>>>
>>>>> +                       if ((tl <= load &&
>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>
>>>>>
>>>>> So back when the code got introduced, it read:
>>>>>
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>

Shouldn't this be

target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

"[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
the load contribution of a single task. So I read the second part as if
the sum of the load of this_cpu and prev_cpu is smaller or equal to the
(maximal) load contribution (maximal possible effect) of a single task.

There is even a comment in "[PATCH] sched: tweak affine wakeups"
(a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
SCHED_LOAD_SCALE gets subtracted from tl = this_load =
target_load(this_cpu, idx):

+                        * If sync wakeup then subtract the (maximum possible)
+                        * effect of the currently running task from the load
+                        * of the current CPU:

"[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/

+static inline unsigned long cpu_avg_load_per_task(int cpu)
+{
+       runqueue_t *rq = cpu_rq(cpu);
+       unsigned long n = rq->nr_running;
+
+       return n ? rq->raw_weighted_load / n : SCHED_LOAD_SCALE;

-- Dietmar

>>>>> So while the first line makes some sense, the second line is still
>>>>> somewhat challenging.
>>>>>
>>>>> I read the second line something like: if there's less than one full
>>>>> task running on the combined cpus.
>>>>
>>>> ok. your explanation makes sense
>>>
>>> Maybe, its still slightly weird :-)
>>>
>>>>>
[...]
On 28 May 2014 17:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent & Peter,
>
> On 28/05/14 07:49, Vincent Guittot wrote:
> [...]
>>
>> Nick,
>>
>> While doing some rework on the wake affine part of the scheduler, i
>> failed to catch the use case that takes advantage of a condition that
>> you added some while ago with the commit
>> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>
>> Could you help us to clarify the 2 first lines of the test that you added ?
>> +                       if ((tl <= load &&
>> +                               tl + target_load(cpu, idx) <=
>> SCHED_LOAD_SCALE) ||
>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> Regards,
>> Vincent
>>>
>>>>
>>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>>
>>>>>>
>>>>>> +                       if ((tl <= load &&
>>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>>
>>>>>>
>>>>>> So back when the code got introduced, it read:
>>>>>>
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>>
>
> Shouldn't this be
>
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

yes, there was a typo mistake in Peter's explanation

>
> "[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
> mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
> the load contribution of a single task. So I read the second part as if
> the sum of the load of this_cpu and prev_cpu is smaller or equal to the
> (maximal) load contribution (maximal possible effect) of a single task.
>
> There is even a comment in "[PATCH] sched: tweak affine wakeups"
> (a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
> SCHED_LOAD_SCALE gets subtracted from tl = this_load =
> target_load(this_cpu, idx):
>
> +                        * If sync wakeup then subtract the (maximum possible)
> +                        * effect of the currently running task from the load
> +                        * of the current CPU:
>
> "[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/
>
> +static inline unsigned long cpu_avg_load_per_task(int cpu)
> +{
> +       runqueue_t *rq = cpu_rq(cpu);
> +       unsigned long n = rq->nr_running;
> +
> +       return n ? rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
>
> -- Dietmar
>
>>>>>> So while the first line makes some sense, the second line is still
>>>>>> somewhat challenging.
>>>>>>
>>>>>> I read the second line something like: if there's less than one full
>>>>>> task running on the combined cpus.
>>>>>
>>>>> ok. your explanation makes sense
>>>>
>>>> Maybe, its still slightly weird :-)
>>>>
>>>>>>
> [...]
>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
+	if (!balanced)
+		return 0;
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && balanced)
+	if (sync)
 		return 1;
 
-	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
 
-		return 1;
-	}
-	return 0;
+	return 1;
 }
 
 /*
I have tried to understand the meaning of the condition :
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but i failed to find a use case that can take advantage of it and i haven't
found description of it in the previous commits' log.
Furthermore, the comment of the condition refers to the task_hot function that
was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below
 this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have :
 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])

It implies that runnable_load_avg' == 0 and nr_running <= 1 in order to match
the condition. This implies that runnable_load_avg == 0 too because of the
condition: this_load <= load
but if this_load is null, balanced is already set and the test is redundant.

If sync is set, it's not as straight forward as above (especially if cgroup
are involved) but the policy should be similar as we have removed a task that's
going to sleep in order to get a more accurate load and this_load values.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)
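The sync-clear argument in the commit message can be checked by brute force on a small user-space model. This is only a sketch under stated assumptions: `struct cpu_model` and all function names are hypothetical, the loads are small integers, `tl_per_task` is taken to be 0 when `nr_running` is 0, `this_load` and `target_load` use the max of the two load values as written in the message, and `load` is assumed to be the min of the two values for prev_cpu. The cgroup and sync cases are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified per-CPU state, following the symbols of the commit message. */
struct cpu_model {
	unsigned long runnable_load_avg;	/* instantaneous load */
	unsigned long cpuload_idx;		/* cpuload[idx], the decayed load */
	unsigned int nr_running;
};

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Average load per task; assumed to be 0 for an empty runqueue. */
static unsigned long model_tl_per_task(const struct cpu_model *c)
{
	return c->nr_running ? c->runnable_load_avg / c->nr_running : 0;
}

/* The removed test, with sync clear. */
static int removed_condition(const struct cpu_model *this_cpu,
			     const struct cpu_model *prev_cpu)
{
	unsigned long this_load, load, prev_target;

	assert(this_cpu != NULL && prev_cpu != NULL);
	this_load = max_ul(this_cpu->runnable_load_avg, this_cpu->cpuload_idx);
	load = min_ul(prev_cpu->runnable_load_avg, prev_cpu->cpuload_idx);
	prev_target = max_ul(prev_cpu->runnable_load_avg, prev_cpu->cpuload_idx);

	return this_load <= load &&
	       this_load + prev_target <= model_tl_per_task(this_cpu);
}

/* Count states where the removed test passes although this_load != 0. */
static unsigned long count_counterexamples(unsigned long max_load,
					   unsigned int max_nr)
{
	struct cpu_model t, p;
	unsigned long found = 0;

	p.nr_running = 0;	/* not used by the condition */
	for (t.runnable_load_avg = 0; t.runnable_load_avg <= max_load; t.runnable_load_avg++)
	for (t.cpuload_idx = 0; t.cpuload_idx <= max_load; t.cpuload_idx++)
	for (t.nr_running = 0; t.nr_running <= max_nr; t.nr_running++)
	for (p.runnable_load_avg = 0; p.runnable_load_avg <= max_load; p.runnable_load_avg++)
	for (p.cpuload_idx = 0; p.cpuload_idx <= max_load; p.cpuload_idx++)
		if (removed_condition(&t, &p) &&
		    max_ul(t.runnable_load_avg, t.cpuload_idx) != 0)
			found++;

	return found;
}
```

Calling `count_counterexamples()` over a small range is expected to find no state in which the removed test passes with a non-zero `this_load`. In this model the test can therefore only pass when `this_load` is 0, which is the case where wake_affine() has already set balanced to true.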