Message ID: 1400860385-14555-9-git-send-email-vincent.guittot@linaro.org
State: New
On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote: > Monitor the activity level of each group of each sched_domain level. The > activity is the amount of cpu_power that is currently used on a CPU or group > of CPUs. We use the runnable_avg_sum and _period to evaluate this activity > level. In the special use case where the CPU is fully loaded by more than 1 > task, the activity level is set above the cpu_power in order to reflect the > overload of the CPU > > +static int get_cpu_activity(int cpu) > +{ > + struct rq *rq = cpu_rq(cpu); > + u32 sum = rq->avg.runnable_avg_sum; > + u32 period = rq->avg.runnable_avg_period; > + > + if (sum >= period) > + return power_orig_of(cpu) + rq->nr_running - 1; > + > + return (sum * power_orig_of(cpu)) / period; > +} While I appreciate the need to signify the overload situation, I don't think adding nr_running makes sense. The amount of tasks has no bearing on the amount of overload. Also, and note I've not yet seen the use, it strikes me as odd to use the orig power here. I would've thought the current capacity (not the max capacity) is relevant to balancing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
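Peter's objection is easier to see with the two regimes of the proposed helper side by side. Below is a standalone user-space model of it (a sketch only: the rq fields become plain arguments, and 1024 stands in for the cpu_power of one fully-available CPU):

```c
#include <assert.h>

/*
 * User-space model of the proposed get_cpu_activity() (the rq fields
 * become plain arguments).  Below full utilization the activity scales
 * linearly with the runnable_avg_sum/period ratio; at or above it, the
 * result is power plus nr_running - 1.  The jump by task count is
 * exactly what Peter objects to: the number of runnable tasks says
 * nothing about how *much* the CPU is overloaded.
 */
static unsigned long cpu_activity(unsigned int sum, unsigned int period,
				  unsigned long power_orig,
				  unsigned int nr_running)
{
	if (sum >= period)
		return power_orig + nr_running - 1;

	return ((unsigned long)sum * power_orig) / period;
}
```

For example, a CPU that is runnable half the time reports 512 out of 1024; once saturated, two runnable tasks report 1025 regardless of whether the second task needs 1% or 100% of a CPU.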
On 27 May 2014 19:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>
>> +static int get_cpu_activity(int cpu)
>> +{
>> +	struct rq *rq = cpu_rq(cpu);
>> +	u32 sum = rq->avg.runnable_avg_sum;
>> +	u32 period = rq->avg.runnable_avg_period;
>> +
>> +	if (sum >= period)
>> +		return power_orig_of(cpu) + rq->nr_running - 1;
>> +
>> +	return (sum * power_orig_of(cpu)) / period;
>> +}
>
> While I appreciate the need to signify the overload situation, I don't
> think adding nr_running makes sense. The amount of tasks has no bearing
> on the amount of overload.

I agree that it's not the best way to evaluate overload, but it's the only simple one that is available without additional computation. I'm going to try to find a better metric, or come back to the solution which only adds +1 and compute the overload somewhere else.

> Also, and note I've not yet seen the use, it strikes me as odd to use
> the orig power here. I would've thought the current capacity (not the
> max capacity) is relevant to balancing.

The activity also measures the non-cfs tasks' activity, whereas the current capacity removes the capacity used by non-cfs tasks.
So as long as arch_scale_cpu_freq is not used (which is the case for now), original capacity is ok, but we might need a 3rd metric so we would have:
- original capacity (max capacity that a CPU can provide)
- usable capacity (new value that reflects the current capacity available for any kind of processing: rt tasks, irq, cfs tasks)
- current capacity (capacity available for cfs tasks)

Vincent
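The three capacity levels Vincent lists can be sketched as simple relations. A hypothetical model (the helper names and the two "loss" inputs are invented for illustration, not from the patch set), in cpu_power units where 1024 is one fully-available CPU:

```c
#include <assert.h>

/*
 * Sketch of the three proposed capacity levels:
 *  - original capacity: the max capacity the CPU can provide
 *  - usable capacity:   what is left for any kind of processing (rt
 *                       tasks, irq, cfs tasks) once frequency scaling
 *                       is taken into account
 *  - current capacity:  what is left for cfs tasks once rt tasks and
 *                       irq time are also removed
 * Both "loss" arguments are hypothetical inputs for the example.
 */
static unsigned long usable_capacity(unsigned long orig,
				     unsigned long freq_loss)
{
	return orig - freq_loss;
}

static unsigned long current_capacity(unsigned long orig,
				      unsigned long freq_loss,
				      unsigned long rt_irq_loss)
{
	return usable_capacity(orig, freq_loss) - rt_irq_loss;
}
```

For instance, a CPU capped at 3/4 of its max frequency with rt/irq consuming another eighth would have an original capacity of 1024, a usable capacity of 768, and a current capacity of 640.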
On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7c51be..c01d8b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
>  	return cpu_rq(cpu)->cpu_power;
>  }
>
> +static unsigned long power_orig_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_power_orig;
> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -4438,6 +4443,18 @@ done:
>  	return target;
>  }
>
> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	if (sum >= period)
> +		return power_orig_of(cpu) + rq->nr_running - 1;
> +
> +	return (sum * power_orig_of(cpu)) / period;
> +}

The rq runnable_avg_{sum, period} give a very long term view of the cpu utilization (I will use the term utilization instead of activity as I think that is what we are talking about here). IMHO, it is too slow to be used as basis for load balancing decisions. I think that was also agreed upon in the last discussion related to this topic [1].

The basic problem is that worst case: sum starting from 0 and period already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods (ms) for sum to reach 47742.
In other words, the cpu might have been fully utilized for 345 ms before it is considered fully utilized. Periodic load-balancing happens much more frequently than that.

Also, if load-balancing actually moves tasks around it may take quite a while before runnable_avg_sum actually reflects this change. The next periodic load-balance is likely to happen before runnable_avg_sum has reflected the result of the previous periodic load-balance.

To avoid these problems, we need to base utilization on a metric which is updated instantaneously when we add/remove tasks to a cpu (or at least fast enough that we don't see the above problems). In the previous discussion [1] it was suggested to use a sum of unweighted task runnable_avg_{sum,period} ratios instead. That is, an unweighted equivalent to weighted_cpuload(). That isn't a perfect solution either. It is fine as long as the cpus are not fully utilized, but when they are we need to use weighted_cpuload() to preserve smp_nice. What to do around the tipping point needs more thought, but I think that is currently the best proposal for a solution for task and cpu utilization.

rq runnable_avg_sum is useful for decisions where we need a longer term view of the cpu utilization, but I don't see how we can use it as a cpu utilization metric for load-balancing decisions at wakeup or periodically.

Morten

[1] https://lkml.org/lkml/2014/1/8/251
On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote: >> Monitor the activity level of each group of each sched_domain level. The >> activity is the amount of cpu_power that is currently used on a CPU or group >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity >> level. In the special use case where the CPU is fully loaded by more than 1 >> task, the activity level is set above the cpu_power in order to reflect the >> overload of the CPU >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >> --- >> kernel/sched/fair.c | 22 ++++++++++++++++++++++ >> 1 file changed, 22 insertions(+) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index b7c51be..c01d8b6 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu) >> return cpu_rq(cpu)->cpu_power; >> } >> >> +static unsigned long power_orig_of(int cpu) >> +{ >> + return cpu_rq(cpu)->cpu_power_orig; >> +} >> + >> static unsigned long cpu_avg_load_per_task(int cpu) >> { >> struct rq *rq = cpu_rq(cpu); >> @@ -4438,6 +4443,18 @@ done: >> return target; >> } >> >> +static int get_cpu_activity(int cpu) >> +{ >> + struct rq *rq = cpu_rq(cpu); >> + u32 sum = rq->avg.runnable_avg_sum; >> + u32 period = rq->avg.runnable_avg_period; >> + >> + if (sum >= period) >> + return power_orig_of(cpu) + rq->nr_running - 1; >> + >> + return (sum * power_orig_of(cpu)) / period; >> +} > > The rq runnable_avg_{sum, period} give a very long term view of the cpu > utilization (I will use the term utilization instead of activity as I > think that is what we are talking about here). IMHO, it is too slow to > be used as basis for load balancing decisions. I think that was also > agreed upon in the last discussion related to this topic [1]. 
>
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

I agree that it's not really responsive, but several statistics of the scheduler use the same kind of metrics and have the same kind of responsiveness. I agree that it's not enough, and that's why I'm not using only this metric, but it gives information that the unweighted load_avg_contrib (that you are speaking about below) can't give. So I would be less negative than you and would say that we probably need additional metrics.

>
> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.

runnable_avg_sum uses a 1ms unit step, so I tend to disagree with your point above.

>
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or at least
> fast enough that we don't see the above problems). In the previous
> discussion [1] it was suggested to use a sum of unweighted task
> runnable_avg_{sum,period} ratios instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.

Regarding the unweighted load_avg_contrib, you will have a similar issue because of the slowness in the variation of each sched_entity load that will be added/removed in the unweighted load_avg_contrib.

The update of the runnable_avg_{sum,period} of a sched_entity is quite similar to cpu utilization.
This value is linked to the CPU on which it has run previously because of the time sharing with other tasks, so the unweighted load of a freshly migrated task will reflect its load on the previous CPU (with the time sharing with other tasks on prev CPU).

I'm not saying that such a metric is useless, but it's not perfect either.

Vincent

> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.
>
> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use it as a cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.
>
> Morten
>
> [1] https://lkml.org/lkml/2014/1/8/251
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742.

But it takes only 4 half-life periods to get ~94% of the way there. Our half-time is 32 * 1024 µs ≈ 32 ms; now look at \Sum_{i=1}^{4} 1/2^i = 0.9375, so 4 such periods (4 * 32 * 1024 µs ≈ 131 ms) get us ~94% of the way there. Now, granted, 4 periods of 32-ish ms is still very long, but nowhere near as horrid as the full convergence.
On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote: > On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote: > >> Monitor the activity level of each group of each sched_domain level. The > >> activity is the amount of cpu_power that is currently used on a CPU or group > >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity > >> level. In the special use case where the CPU is fully loaded by more than 1 > >> task, the activity level is set above the cpu_power in order to reflect the > >> overload of the CPU > >> > >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > >> --- > >> kernel/sched/fair.c | 22 ++++++++++++++++++++++ > >> 1 file changed, 22 insertions(+) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index b7c51be..c01d8b6 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu) > >> return cpu_rq(cpu)->cpu_power; > >> } > >> > >> +static unsigned long power_orig_of(int cpu) > >> +{ > >> + return cpu_rq(cpu)->cpu_power_orig; > >> +} > >> + > >> static unsigned long cpu_avg_load_per_task(int cpu) > >> { > >> struct rq *rq = cpu_rq(cpu); > >> @@ -4438,6 +4443,18 @@ done: > >> return target; > >> } > >> > >> +static int get_cpu_activity(int cpu) > >> +{ > >> + struct rq *rq = cpu_rq(cpu); > >> + u32 sum = rq->avg.runnable_avg_sum; > >> + u32 period = rq->avg.runnable_avg_period; > >> + > >> + if (sum >= period) > >> + return power_orig_of(cpu) + rq->nr_running - 1; > >> + > >> + return (sum * power_orig_of(cpu)) / period; > >> +} > > > > The rq runnable_avg_{sum, period} give a very long term view of the cpu > > utilization (I will use the term utilization instead of activity as I > > think that is what we are talking about here). IMHO, it is too slow to > > be used as basis for load balancing decisions. 
> > I think that was also
> > agreed upon in the last discussion related to this topic [1].
> >
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
>
> I agree that it's not really responsive but several statistics of the
> scheduler use the same kind of metrics and have the same kind of
> responsiveness.

I might be wrong, but I don't think we use anything similar to this to estimate cpu load/utilization for load-balancing purposes, except for {source, target}_load() where it is used to bias the decision whether or not to balance if the difference is small. That is what the discussion was about last time.

> I agree that it's not enough and that's why I'm not using only this
> metric but it gives information that the unweighted load_avg_contrib
> (that you are speaking about below) can't give. So I would be less
> negative than you and would say that we probably need additional
> metrics

I'm not saying that we shouldn't use this metric at all, I'm just saying that I don't think it is suitable for estimating the short-term cpu utilization, which is what you need to make load-balancing decisions. We can't observe the effect of recent load-balancing decisions if the metric is too slow to react.

I realize that what I mean by 'slow' might be unclear. Load tracking (both task and rq) takes a certain amount of history into account in runnable_avg_{sum, period}. This amount is determined by the 'y'-weight, which has been chosen such that we consider the load in the past 345 time units, where the time unit is ~1 ms. The contribution is smaller the further you go back due to y^n, which diminishes to 0 for n > 345.
So, if a task or cpu goes from having been idle for >345 ms to being constantly busy, it will take 345 ms until the entire history that we care about reflects this change. Only then will runnable_avg_sum reach 47742. The rate of change is faster to begin with, since the weight of the most recent history is higher: runnable_avg_sum will get to 47742/2 in just 32 ms.

Since we may do periodic load-balance every 10 ms or so, we will perform a number of load-balances where runnable_avg_sum will mostly be reflecting the state of the world before a change (new task queued or moved a task to a different cpu). If you have had two tasks continuously on one cpu and your other cpu is idle, and you move one of the tasks to the other cpu, runnable_avg_sum will remain unchanged, 47742, on the first cpu while it starts from 0 on the other one. 10 ms later it will have increased a bit, 32 ms later it will be 47742/2, and 345 ms later it reaches 47742. In the meantime the cpu doesn't appear fully utilized and we might decide to put more tasks on it, because we don't know if runnable_avg_sum represents a partially utilized cpu (for example a 50% task) or if it will continue to rise and eventually get to 47742. IMO, we need cpu utilization to clearly represent the current utilization of the cpu such that any changes since the last wakeup or load-balance are clearly visible.

> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
>
> runnable_avg_sum uses a 1ms unit step so I tend to disagree with your
> point above

See explanation above. The important thing is how much history we take into account. That is 345x 1 ms time units. The rate at which the sum is updated doesn't change anything. 1 ms after a change (wakeup, load-balance, ...)
runnable_avg_sum can only change by 1024. The remaining ~98% of your weighted history still reflects the world before the change.

> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or at least
> > fast enough that we don't see the above problems). In the previous
> > discussion [1] it was suggested to use a sum of unweighted task
> > runnable_avg_{sum,period} ratios instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
>
> Regarding the unweighted load_avg_contrib, you will have a similar issue
> because of the slowness in the variation of each sched_entity load
> that will be added/removed in the unweighted load_avg_contrib.
>
> The update of the runnable_avg_{sum,period} of a sched_entity is
> quite similar to cpu utilization.

Yes, runnable_avg_{sum, period} for tasks and rqs are exactly the same. No difference there :)

> This value is linked to the CPU on
> which it has run previously because of the time sharing with other
> tasks, so the unweighted load of a freshly migrated task will reflect
> its load on the previous CPU (with the time sharing with other tasks
> on prev CPU).

I agree that the task runnable_avg_sum is always affected by the circumstances on the cpu where it is running, and that it takes this history with it. However, I think cfs.runnable_load_avg leads to fewer problems than using the rq runnable_avg_sum. It would work nicely for the two tasks on two cpus example I mentioned earlier. We don't need to add something on top when the cpu is fully utilized by more than one task. It comes more naturally with cfs.runnable_load_avg. If it is much larger than 47742, it should be fairly safe to assume that you shouldn't stick more tasks on that cpu.

> I'm not saying that such a metric is useless but it's not perfect either.

It comes with its own set of problems, agreed.
Based on my current understanding (or lack thereof) they just seem smaller :)

Morten
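Morten's two-cpu migration example can be reproduced numerically. A sketch in user-space C (decay y chosen so y^32 = 0.5 and per-period contribution 1024, as in the PELT code of that era; note the floating-point limit, ~47788, differs slightly from the kernel's integer LOAD_AVG_MAX = 47742):

```c
#include <assert.h>
#include <math.h>

/*
 * One PELT step for a rq: decay the sum by y (y^32 = 0.5) and, if the
 * cpu was busy during the ~1 ms period, add the full contribution of
 * 1024.  A toy floating-point model of the kernel's update.
 */
static double pelt_step(double sum, int busy)
{
	double y = pow(0.5, 1.0 / 32.0);

	return sum * y + (busy ? 1024.0 : 0.0);
}

/*
 * The example from the thread: cpu0 has run two tasks forever (its sum
 * is saturated), cpu1 has been idle.  At t=0 one task migrates to
 * cpu1, so from then on both cpus are busy every period.  cpu0's sum
 * never moves; this returns cpu1's sum after n periods, starting at 0.
 */
static double cpu1_sum_after(int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum = pelt_step(sum, 1);
	return sum;
}
```

After 10 periods cpu1 still looks mostly idle, after 32 it is only at about half of LOAD_AVG_MAX, and cpu0 shows no change at all — exactly the window in which a ~10 ms periodic balance sees a stale picture.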
On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
>> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:

[snip]

>> This value is linked to the CPU on
>> which it has run previously because of the time sharing with other
>> tasks, so the unweighted load of a freshly migrated task will reflect
>> its load on the previous CPU (with the time sharing with other tasks
>> on prev CPU).
>
> I agree that the task runnable_avg_sum is always affected by the
> circumstances on the cpu where it is running, and that it takes this
> history with it. However, I think cfs.runnable_load_avg leads to fewer
> problems than using the rq runnable_avg_sum. It would work nicely for
> the two tasks on two cpus example I mentioned earlier. We don't need to add

I would say that nr_running is an even better metric for such a situation, as the load doesn't give any additional information. Just to point out that we can spend a lot of time listing which use cases are better covered by which metrics :-)

> something on top when the cpu is fully utilized by more than one task.
> It comes more naturally with cfs.runnable_load_avg. If it is much larger
> than 47742, it should be fairly safe to assume that you shouldn't stick
> more tasks on that cpu.
>
>> I'm not saying that such a metric is useless but it's not perfect either.
>
> It comes with its own set of problems, agreed. Based on my current
> understanding (or lack thereof) they just seem smaller :)

I think it's worth using the cpu utilization for some cases, because it has got some information that is not available elsewhere. And the replacement of the current capacity computation is one example.
As explained previously, I'm not against adding other metrics, and I'm not sure I understand why you oppose these 2 metrics when they could be complementary.

Vincent

>
> Morten
On 23/05/14 16:53, Vincent Guittot wrote: > Monitor the activity level of each group of each sched_domain level. The > activity is the amount of cpu_power that is currently used on a CPU or group > of CPUs. We use the runnable_avg_sum and _period to evaluate this activity > level. In the special use case where the CPU is fully loaded by more than 1 > task, the activity level is set above the cpu_power in order to reflect the > overload of the CPU > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [...] > /* > * select_task_rq_fair: Select target runqueue for the waking task in domains > * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, > @@ -5518,6 +5535,7 @@ struct sg_lb_stats { > unsigned long sum_weighted_load; /* Weighted load of group's tasks */ > unsigned long load_per_task; > unsigned long group_power; > + unsigned long group_activity; /* Total activity of the group */ > unsigned int sum_nr_running; /* Nr tasks running in the group */ > unsigned int group_capacity; > unsigned int idle_cpus; > @@ -5538,6 +5556,7 @@ struct sd_lb_stats { > struct sched_group *busiest; /* Busiest group in this sd */ > struct sched_group *local; /* Local group in this sd */ > unsigned long total_load; /* Total load of all groups in sd */ > + unsigned long total_activity; /* Total activity of all groups in sd */ > unsigned long total_pwr; /* Total power of all groups in sd */ > unsigned long avg_load; /* Average load across all groups in sd */ > > @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) > .busiest = NULL, > .local = NULL, > .total_load = 0UL, > + .total_activity = 0UL, AFAICS, total_activity is not used right now. Do you intend to use it to calculate something like avg_activity later (like total_load/avg_load)? > .total_pwr = 0UL, > .busiest_stat = { > .avg_load = 0UL, [...] 
On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 23/05/14 16:53, Vincent Guittot wrote: >> Monitor the activity level of each group of each sched_domain level. The >> activity is the amount of cpu_power that is currently used on a CPU or group >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity >> level. In the special use case where the CPU is fully loaded by more than 1 >> task, the activity level is set above the cpu_power in order to reflect the >> overload of the CPU >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > [...] > >> /* >> * select_task_rq_fair: Select target runqueue for the waking task in domains >> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, >> @@ -5518,6 +5535,7 @@ struct sg_lb_stats { >> unsigned long sum_weighted_load; /* Weighted load of group's tasks */ >> unsigned long load_per_task; >> unsigned long group_power; >> + unsigned long group_activity; /* Total activity of the group */ >> unsigned int sum_nr_running; /* Nr tasks running in the group */ >> unsigned int group_capacity; >> unsigned int idle_cpus; >> @@ -5538,6 +5556,7 @@ struct sd_lb_stats { >> struct sched_group *busiest; /* Busiest group in this sd */ >> struct sched_group *local; /* Local group in this sd */ >> unsigned long total_load; /* Total load of all groups in sd */ >> + unsigned long total_activity; /* Total activity of all groups in sd */ >> unsigned long total_pwr; /* Total power of all groups in sd */ >> unsigned long avg_load; /* Average load across all groups in sd */ >> >> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) >> .busiest = NULL, >> .local = NULL, >> .total_load = 0UL, >> + .total_activity = 0UL, > > AFAICS, total_activity is not used right now. Do you intend to use it to > calculate something like avg_activity later (like total_load/avg_load)? 
This patch only creates group_activity and total_activity. The use of these statistics appears in the next patches. I have not planned to compute such an avg_activity for the moment, mainly because I haven't needed it yet.

Do you have in mind any use of such an avg_activity?

Vincent

>
>> .total_pwr = 0UL,
>> .busiest_stat = {
>> .avg_load = 0UL,
>
> [...]
>
On 30/05/14 20:20, Vincent Guittot wrote: > On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >> On 23/05/14 16:53, Vincent Guittot wrote: >>> Monitor the activity level of each group of each sched_domain level. The >>> activity is the amount of cpu_power that is currently used on a CPU or group >>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity >>> level. In the special use case where the CPU is fully loaded by more than 1 >>> task, the activity level is set above the cpu_power in order to reflect the >>> overload of the CPU >>> >>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >> >> [...] >> >>> /* >>> * select_task_rq_fair: Select target runqueue for the waking task in domains >>> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE, >>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats { >>> unsigned long sum_weighted_load; /* Weighted load of group's tasks */ >>> unsigned long load_per_task; >>> unsigned long group_power; >>> + unsigned long group_activity; /* Total activity of the group */ >>> unsigned int sum_nr_running; /* Nr tasks running in the group */ >>> unsigned int group_capacity; >>> unsigned int idle_cpus; >>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats { >>> struct sched_group *busiest; /* Busiest group in this sd */ >>> struct sched_group *local; /* Local group in this sd */ >>> unsigned long total_load; /* Total load of all groups in sd */ >>> + unsigned long total_activity; /* Total activity of all groups in sd */ >>> unsigned long total_pwr; /* Total power of all groups in sd */ >>> unsigned long avg_load; /* Average load across all groups in sd */ >>> >>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds) >>> .busiest = NULL, >>> .local = NULL, >>> .total_load = 0UL, >>> + .total_activity = 0UL, >> >> AFAICS, total_activity is not used right now. 
>> Do you intend to use it to
>> calculate something like avg_activity later (like total_load/avg_load)?
>
> this patch only creates group_activity and total_activity. The use of
> these statistics appears in the next patches. I have not planned to
> compute such avg_activity for the moment mainly because I haven't needed
> it yet.
>
> Do you have in mind any use of such avg_activity ?

No, not really, but I still don't see (taking all your patches) where sds->total_activity is used.

$ find . -name "*.[ch]" | xargs grep total_activity
./kernel/sched/fair.c: unsigned long total_activity; /* Total activity of all groups in sd */
./kernel/sched/fair.c: .total_activity = 0UL,
./kernel/sched/fair.c: sds->total_activity += sgs->group_activity;

OTOH, sgs->group_activity is used to set sgs->group_capacity, which is then used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and find_busiest_group().

[...]
On 1 June 2014 13:33, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 30/05/14 20:20, Vincent Guittot wrote: >> >> On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: >>> >>> On 23/05/14 16:53, Vincent Guittot wrote: >>>> >>>> Monitor the activity level of each group of each sched_domain level. The >>>> activity is the amount of cpu_power that is currently used on a CPU or >>>> group >>>> of CPUs. We use the runnable_avg_sum and _period to evaluate this >>>> activity >>>> level. In the special use case where the CPU is fully loaded by more >>>> than 1 >>>> task, the activity level is set above the cpu_power in order to reflect >>>> the >>>> overload of the CPU >>>> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> >>> >>> >>> [...] >>> >>>> /* >>>> * select_task_rq_fair: Select target runqueue for the waking task in >>>> domains >>>> * that have the 'sd_flag' flag set. In practice, this is >>>> SD_BALANCE_WAKE, >>>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats { >>>> unsigned long sum_weighted_load; /* Weighted load of group's >>>> tasks */ >>>> unsigned long load_per_task; >>>> unsigned long group_power; >>>> + unsigned long group_activity; /* Total activity of the group */ >>>> unsigned int sum_nr_running; /* Nr tasks running in the group */ >>>> unsigned int group_capacity; >>>> unsigned int idle_cpus; >>>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats { >>>> struct sched_group *busiest; /* Busiest group in this sd */ >>>> struct sched_group *local; /* Local group in this sd */ >>>> unsigned long total_load; /* Total load of all groups in sd >>>> */ >>>> + unsigned long total_activity; /* Total activity of all groups in >>>> sd */ >>>> unsigned long total_pwr; /* Total power of all groups in >>>> sd */ >>>> unsigned long avg_load; /* Average load across all groups in sd >>>> */ >>>> >>>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct >>>> sd_lb_stats *sds) >>>> .busiest = NULL, >>>> .local = NULL, >>>> 
.total_load = 0UL, >>>> + .total_activity = 0UL, >>> >>> AFAICS, total_activity is not used right now. Do you intend to use it to >>> calculate something like avg_activity later (like total_load/avg_load)? >> >> this patch only creates group_activity and total_activity. The use of >> these statistics appears in the next patches. I have not planned to >> compute such avg_activity for the moment, mainly because I haven't needed it >> yet. >> >> Do you have in mind any use of such avg_activity ? > > No, not really but I still don't see (taking all your patches) where > sds->total_activity is used. > > $ find . -name "*.[ch]" | xargs grep total_activity > ./kernel/sched/fair.c: unsigned long total_activity; /* Total activity of > all groups in sd */ > ./kernel/sched/fair.c: .total_activity = 0UL, > ./kernel/sched/fair.c: sds->total_activity += sgs->group_activity; > > OTOH, sgs->group_activity is used to set sgs->group_capacity, which is then > used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and > find_busiest_group(). Sorry, I misread your 1st email and hadn't understood that you were specifically speaking about total_activity. When I created the activity statistic, I added both sg and sd fields but I finally only use the sg one. If I don't need the sd field after updating my patches, I will remove it. Thanks, Vincent > > [...]
On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote: > On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote: > >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote: > > [snip] > > > > >> This value is linked to the CPU on > >> which it has run previously because of the time sharing with other > >> tasks, so the unweighted load of a freshly migrated task will reflect > >> its load on the previous CPU (with the time sharing with other tasks > >> on prev CPU). > > > > I agree that the task runnable_avg_sum is always affected by the > > circumstances on the cpu where it is running, and that it takes this > > history with it. However, I think cfs.runnable_load_avg leads to fewer > > problems than using the rq runnable_avg_sum. It would work nicely for > > the two tasks on two cpus example I mentioned earlier. We don't need add > > I would say that nr_running is an even better metric for such > situation as the load doesn't give any additional information. I fail to understand how nr_running can be used. nr_running doesn't tell you anything about the utilization of the cpu, just the number of tasks that happen to be runnable at a point in time on a specific cpu. It might be two small tasks that just happened to be running while you read nr_running. An unweighted version of cfs.runnable_load_avg gives you a metric that captures cpu utilization to some extent, but not the number of tasks. And it reflects task migrations immediately, unlike the rq runnable_avg_sum. > Just to point out that we can spend a lot of time listing which use cases > are better covered by which metrics :-) Agreed, but I think it is quite important to discuss what we understand by cpu utilization. It seems to be different depending on what you want to use it for.
I think it is also clear that none of the metrics that have been proposed are perfect. We therefore have to be careful to only use metrics in scenarios where they make sense. IMHO, both rq runnable_avg_sum and unweighted cfs.runnable_load_avg capture cpu utilization, but in different ways. We have done experiments internally with rq runnable_avg_sum for load-balancing decisions in the past and found it unsuitable due to its slow response to task migrations. That is why I brought it up here. AFAICT, you use rq runnable_avg_sum more like a flag than a quantitative measure of cpu utilization. Viewing things from an energy-awareness point of view I'm more interested in the latter for estimating the implications of moving tasks around. I don't have any problems with using rq runnable_avg_sum for other things as long we are fully aware of how this metric works. > > something on top when the cpu is fully utilized by more than one task. > > It comes more naturally with cfs.runnable_load_avg. If it is much larger > > than 47742, it should be fairly safe to assume that you shouldn't stick > > more tasks on that cpu. > > > >> > >> I'm not saying that such metric is useless but it's not perfect as well. > > > > It comes with its own set of problems, agreed. Based on my current > > understanding (or lack thereof) they just seem smaller :) > > I think it's worth using the cpu utilization for some cases because it > has got some information that are not available elsewhere. And the > replacement of the current capacity computation is one example. > As explained previously, I'm not against adding other metrics and i'm > not sure to understand why you oppose these 2 metrics whereas they > could be complementary I think we more or less agree :) I'm fine with both metrics and I agree that they complement each other. My concern is using the right metric for the right job. If you choose to use rq runnable_avg_sum you have to keep its slow reaction time in mind. 
I think that might be difficult/not possible for some load-balancing decisions. That is basically my point :) Morten
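Morten's point that a nr_running snapshot says nothing about utilization can be put in numbers with a toy timeline (plain Python, illustrative only; the tick counts are made up): two tasks that each need 10% of a cpu can make nr_running read 2 at the sample instant while the time-averaged utilization is only 20%.

```python
TICKS = 100  # one observation window, in arbitrary ticks

# Two tasks, each needing 10 ticks of cpu per window, both wake at tick 0
# and serialize on one cpu: A runs ticks [0, 10), B waits, then runs [10, 20).
cpu_busy = [1 if t < 20 else 0 for t in range(TICKS)]
nr_runnable = [2 if t < 10 else (1 if t < 20 else 0) for t in range(TICKS)]

nr_running_sample = nr_runnable[5]      # snapshot while both are queued -> 2
utilization = sum(cpu_busy) / TICKS     # time-averaged busy fraction -> 0.2
```

A balancer reading nr_running at tick 5 sees the same value it would for two fully cpu-bound tasks, even though this cpu is idle 80% of the window.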
On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote: > The rq runnable_avg_{sum, period} give a very long term view of the cpu > utilization (I will use the term utilization instead of activity as I > think that is what we are talking about here). IMHO, it is too slow to > be used as basis for load balancing decisions. I think that was also > agreed upon in the last discussion related to this topic [1]. > > The basic problem is that worst case: sum starting from 0 and period > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods > (ms) for sum to reach 47742. In other words, the cpu might have been > fully utilized for 345 ms before it is considered fully utilized. > Periodic load-balancing happens much more frequently than that. Like said earlier the 94% mark is actually hit much sooner, but yes, likely still too slow. 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc.. > Also, if load-balancing actually moves tasks around it may take quite a > while before runnable_avg_sum actually reflects this change. The next > periodic load-balance is likely to happen before runnable_avg_sum has > reflected the result of the previous periodic load-balance. > > To avoid these problems, we need to base utilization on a metric which > is updated instantaneously when we add/remove tasks to a cpu (or a least > fast enough that we don't see the above problems). So the per-task-load-tracking stuff already does that. It updates the per-cpu load metrics on migration. See {de,en}queue_entity_load_avg(). And keeping an unweighted per-cpu variant isn't that much more work. > In the previous > discussion [1] it was suggested that a sum of unweighted task > runnable_avg_{sum,period} ratio instead. That is, an unweighted > equivalent to weighted_cpuload(). That isn't a perfect solution either. > It is fine as long as the cpus are not fully utilized, but when they are > we need to use weighted_cpuload() to preserve smp_nice. 
What to do > around the tipping point needs more thought, but I think that is > currently the best proposal for a solution for task and cpu utilization. I'm not too worried about the tipping point, per task runnable figures of an overloaded cpu are higher, so migration between an overloaded cpu and an underloaded cpu is going to be tricky no matter what we do. > rq runnable_avg_sum is useful for decisions where we need a longer term > view of the cpu utilization, but I don't see how we can use it as a cpu > utilization metric for load-balancing decisions at wakeup or > periodically. So keeping one with a faster decay would add extra per-task storage. But would be possible..
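The ramp-up figures quoted above follow directly from the decay factor used by per-entity load tracking, y = 0.5^(1/32) per 1 ms period (32 periods halve the history). A quick sketch in plain Python (not the kernel's fixed-point code) reproduces them:

```python
Y = 0.5 ** (1.0 / 32.0)  # per-period decay; Y**32 == 0.5 by construction

def ramp_fraction(n_periods):
    """Fraction of the saturated value (LOAD_AVG_MAX) that runnable_avg_sum
    reaches after n fully-busy 1 ms periods, starting from 0 with the
    period counter already saturated."""
    return 1.0 - Y ** n_periods

for ms in (32, 64, 96, 345):
    print(f"{ms:3d} ms -> {100.0 * ramp_fraction(ms):.1f}%")
```

This prints 50.0% at 32 ms, 75.0% at 64 ms and 87.5% at 96 ms, matching Peter's figures, and only approaches 100% around the 345 ms mark that Morten mentions.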
On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > Since we may do periodic load-balance every 10 ms or so, we will perform > a number of load-balances where runnable_avg_sum will mostly be > reflecting the state of the world before a change (new task queued or > moved a task to a different cpu). If you had have two tasks continuously > on one cpu and your other cpu is idle, and you move one of the tasks to > the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > first cpu while it starts from 0 on the other one. 10 ms later it will > have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > it reaches 47742. In the mean time the cpu doesn't appear fully utilized > and we might decide to put more tasks on it because we don't know if > runnable_avg_sum represents a partially utilized cpu (for example a 50% > task) or if it will continue to rise and eventually get to 47742. Ah, no, since we track per task, and update the per-cpu ones when we migrate tasks, the per-cpu values should be instantly updated. If we were to increase per task storage, we might as well also track running_avg not only runnable_avg.
On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote: > On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote: > > On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote: > > >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote: > > > I agree that the task runnable_avg_sum is always affected by the > > > circumstances on the cpu where it is running, and that it takes this > > > history with it. However, I think cfs.runnable_load_avg leads to less > > > problems than using the rq runnable_avg_sum. It would work nicely for > > > the two tasks on two cpus example I mentioned earlier. We don't need add > > > > i would say that nr_running is an even better metrics for such > > situation as the load doesn't give any additional information. > > I fail to understand how nr_running can be used. nr_running doesn't tell > you anything about the utilization of the cpu, just the number tasks > that happen to be runnable at a point in time on a specific cpu. It > might be two small tasks that just happened to be running while you read > nr_running. Agreed, I'm not at all seeing how nr_running is useful here. > An unweighted version of cfs.runnable_load_avg gives you a metric that > captures cpu utilization to some extend, but not the number of tasks. > And it reflects task migrations immediately unlike the rq > runnable_avg_sum. So runnable_avg would be equal to the utilization as long as there's idle time, as soon as we're over-loaded the metric shows how much extra cpu is required. That is, runnable_avg - running_avg >= 0 and the amount is the exact amount of extra cpu required to make all tasks run but not have idle time. > Agreed, but I think it is quite important to discuss what we understand > by cpu utilization. 
It seems to be different depending on what you want > to use it for. I understand utilization to be however much cpu is actually used, so I would, per the existing naming, call running_avg to be the avg utilization of a task/group/cpu whatever. > We have done experiments internally with rq runnable_avg_sum for > load-balancing decisions in the past and found it unsuitable due to its > slow response to task migrations. That is why I brought it up here. So I'm not entirely seeing that from the code (I've not traced this), afaict we actually update the per-cpu values on migration based on the task values. old_rq->sum -= p->val; new_rq->sum += p->val; like,.. except of course totally obscured.
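The migration-time update Peter sketches (the {de,en}queue_entity_load_avg() path) boils down to moving the task's tracked contribution between per-cpu sums. A stripped-down model, with hypothetical field names rather than the kernel's structures:

```python
class Rq:
    """Toy runqueue carrying only the per-cpu runnable load sum."""
    def __init__(self):
        self.runnable_load_avg = 0

class Task:
    """Toy task carrying its tracked load contribution."""
    def __init__(self, load_avg_contrib):
        self.load_avg_contrib = load_avg_contrib

def migrate(p, old_rq, new_rq):
    # old_rq->sum -= p->val; new_rq->sum += p->val;
    # Both per-cpu sums reflect the migration immediately, with no decay lag.
    old_rq.runnable_load_avg -= p.load_avg_contrib
    new_rq.runnable_load_avg += p.load_avg_contrib
```

For example, migrating a task with contribution 1024 from a runqueue holding 2048 leaves 1024 behind and puts 1024 on the destination straight away, which is exactly the property the decaying rq sum lacks.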
On Tue, Jun 03, 2014 at 04:40:58PM +0100, Peter Zijlstra wrote: > On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote: > > The rq runnable_avg_{sum, period} give a very long term view of the cpu > > utilization (I will use the term utilization instead of activity as I > > think that is what we are talking about here). IMHO, it is too slow to > > be used as basis for load balancing decisions. I think that was also > > agreed upon in the last discussion related to this topic [1]. > > > > The basic problem is that worst case: sum starting from 0 and period > > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods > > (ms) for sum to reach 47742. In other words, the cpu might have been > > fully utilized for 345 ms before it is considered fully utilized. > > Periodic load-balancing happens much more frequently than that. > > Like said earlier the 94% mark is actually hit much sooner, but yes, > likely still too slow. > > 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc.. Agreed. > > Also, if load-balancing actually moves tasks around it may take quite a > > while before runnable_avg_sum actually reflects this change. The next > > periodic load-balance is likely to happen before runnable_avg_sum has > > reflected the result of the previous periodic load-balance. > > > > To avoid these problems, we need to base utilization on a metric which > > is updated instantaneously when we add/remove tasks to a cpu (or at least > > fast enough that we don't see the above problems). > > So the per-task-load-tracking stuff already does that. It updates the > per-cpu load metrics on migration. See {de,en}queue_entity_load_avg(). I think there is some confusion here. There are two per-cpu load metrics that track differently. The cfs.runnable_load_avg is basically the sum of the load contributions of the tasks on the cfs rq. The sum gets updated whenever tasks are {en,de}queued by adding/subtracting the load contribution of the task being added/removed.
That is the one you are referring to. The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is tracking whether the cpu has something to do or not. It doesn't matter how many tasks are runnable or what their load is. It is updated in update_rq_runnable_avg(). It increases when rq->nr_running > 0 and decays if not. It also takes time spent running rt tasks into account in idle_{enter, exit}_fair(). So if you remove tasks from the rq, this metric will start decaying and eventually get to 0, unlike the cfs.runnable_load_avg where the task load contribution is subtracted every time a task is removed. The rq runnable_avg_sum is the one being used in this patch set. Ben, pjt, please correct me if I'm wrong. > And keeping an unweighted per-cpu variant isn't that much more work. Agreed. > > > In the previous > > discussion [1] it was suggested that a sum of unweighted task > > runnable_avg_{sum,period} ratio instead. That is, an unweighted > > equivalent to weighted_cpuload(). That isn't a perfect solution either. > > It is fine as long as the cpus are not fully utilized, but when they are > > we need to use weighted_cpuload() to preserve smp_nice. What to do > > around the tipping point needs more thought, but I think that is > > currently the best proposal for a solution for task and cpu utilization. > > I'm not too worried about the tipping point, per task runnable figures > of an overloaded cpu are higher, so migration between an overloaded cpu > and an underloaded cpu is going to be tricky no matter what we do. Yes, agreed. I just got the impression that you were concerned about smp_nice last time we discussed this. > > rq runnable_avg_sum is useful for decisions where we need a longer term > > view of the cpu utilization, but I don't see how we can use it as a cpu > > utilization metric for load-balancing decisions at wakeup or > > periodically. > > So keeping one with a faster decay would add extra per-task storage. But > would be possible..
I have had that thought when we discussed potential replacements for cpu_load[]. It will require some messing around with the nicely optimized load tracking maths if we want to have load tracking with a different y-coefficient.
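The difference between the two metrics described above can be quantified with a rough model (plain Python; 47742 is LOAD_AVG_MAX and the decay is idealized, so this is a sketch rather than the kernel's arithmetic): move one of two always-busy tasks from cpu0 to cpu1 and compare what each metric reports.

```python
Y = 0.5 ** (1.0 / 32.0)   # per 1 ms period decay, Y**32 == 0.5
LOAD_AVG_MAX = 47742

# cfs.runnable_load_avg analogue: the migrated task's contribution is
# subtracted from cpu0 and added to cpu1 at migration time.
task_contrib = 1024
cpu0_load, cpu1_load = 2 * task_contrib, 0
cpu0_load -= task_contrib
cpu1_load += task_contrib          # both cpus are correct immediately

# rq->avg.runnable_avg_sum analogue: cpu0 keeps one busy task so its sum
# stays saturated; cpu1 only ramps up over time as 1 - Y**n.
def rq_sum_after_busy_ms(ms, start=0):
    return LOAD_AVG_MAX - (LOAD_AVG_MAX - start) * Y ** ms

cpu1_sum = rq_sum_after_busy_ms(32)   # only about halfway after 32 ms
```

So for a whole string of 10 ms periodic balance intervals after the migration, the rq sum on cpu1 still under-reports its actual utilization, while the runnable_load_avg analogue was correct from the first tick.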
On Tue, Jun 03, 2014 at 04:50:07PM +0100, Peter Zijlstra wrote: > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > > Since we may do periodic load-balance every 10 ms or so, we will perform > > a number of load-balances where runnable_avg_sum will mostly be > > reflecting the state of the world before a change (new task queued or > > moved a task to a different cpu). If you had two tasks continuously > > on one cpu and your other cpu is idle, and you move one of the tasks to > > the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > > first cpu while it starts from 0 on the other one. 10 ms later it will > > have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > > it reaches 47742. In the mean time the cpu doesn't appear fully utilized > > and we might decide to put more tasks on it because we don't know if > > runnable_avg_sum represents a partially utilized cpu (for example a 50% > > task) or if it will continue to rise and eventually get to 47742. > > Ah, no, since we track per task, and update the per-cpu ones when we > migrate tasks, the per-cpu values should be instantly updated. No, not for this per-cpu tracking metric :) For cfs.runnable_load_avg you are right, but this patch set is using rq->avg.runnable_avg_sum which is different. See my other reply. > If we were to increase per task storage, we might as well also track > running_avg not only runnable_avg. That could probably make sense. We had that in pjt's first proposal.
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote: > > I'm not too worried about the tipping point, per task runnable figures > > of an overloaded cpu are higher, so migration between an overloaded cpu > > and an underloaded cpu is going to be tricky no matter what we do. > > Yes, agreed. I just got the impression that you were concerned about > smp_nice last time we discussed this. Well, yes, we need to keep that working, but the exact details around the tipping point are near impossible to get right, so I'm not too bothered there. > > > rq runnable_avg_sum is useful for decisions where we need a longer term > > > view of the cpu utilization, but I don't see how we can use it as a cpu > > > utilization metric for load-balancing decisions at wakeup or > > > periodically. > > > > So keeping one with a faster decay would add extra per-task storage. But > > would be possible.. > > I have had that thought when we discussed potential replacements for > cpu_load[]. It will require some messing around with the nicely > optimized load tracking maths if we want to have load tracking with a > different y-coefficient. My initial thought was a y=0.5, which is really >>=1. But yes, if we want something else that'll get messy real fast methinks.
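For scale, Peter's y = 0.5 per period ("really >>= 1") against the standard 32 ms half-life (a hedged sketch, assuming 1 ms periods as above): reaching 93.75% of the saturated value takes 4 periods instead of 128.

```python
import math

def periods_to_reach(frac, half_life_periods):
    """Periods n such that 1 - y**n == frac, with y = 0.5**(1/half_life)."""
    y = 0.5 ** (1.0 / half_life_periods)
    return math.log(1.0 - frac) / math.log(y)

standard = periods_to_reach(0.9375, 32)  # current y: 128 periods (ms)
fast = periods_to_reach(0.9375, 1)       # y = 0.5 per period: 4 periods
```

The y = 0.5 variant is attractive precisely because the decay is a plain right-shift per period; anything in between needs the full fixed-point lookup-table machinery Morten alludes to.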
On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote: > > So the per-task-load-tracking stuff already does that. It updates the > > per-cpu load metrics on migration. See {de,en}queue_entity_load_avg(). > > I think there is some confusion here. There are two per-cpu load metrics > that track differently. > > The cfs.runnable_load_avg is basically the sum of the load contributions > of the tasks on the cfs rq. The sum gets updated whenever tasks are > {en,de}queued by adding/subtracting the load contribution of the task > being added/removed. That is the one you are referring to. > > The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is > tracking whether the cpu has something to do or not. It doesn't matter > how many tasks are runnable or what their load is. It is updated in > update_rq_runnable_avg(). It increases when rq->nr_running > 0 and > decays if not. It also takes time spent running rt tasks into account in > idle_{enter, exit}_fair(). So if you remove tasks from the rq, this > metric will start decaying and eventually get to 0, unlike the > cfs.runnable_load_avg where the task load contribution is subtracted every > time a task is removed. The rq runnable_avg_sum is the one being used in > this patch set. > > Ben, pjt, please correct me if I'm wrong. Argh, ok I completely missed that. I think the cfs.runnable_load_avg is the sane number; I'm not entirely sure what rq->avg.runnable_avg is good for, it seems a weird metric on first consideration. Will have to ponder that a bit more.
On Tue, Jun 03, 2014 at 04:59:39PM +0100, Peter Zijlstra wrote: > On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote: > > An unweighted version of cfs.runnable_load_avg gives you a metric that > > captures cpu utilization to some extent, but not the number of tasks. > > And it reflects task migrations immediately unlike the rq > > runnable_avg_sum. > > So runnable_avg would be equal to the utilization as long as > there's idle time, as soon as we're over-loaded the metric shows how > much extra cpu is required. > > That is, runnable_avg - running_avg >= 0 and the amount is the > exact amount of extra cpu required to make all tasks run but not have > idle time. Yes, roughly. runnable_avg goes up quite steeply if you have many tasks on a fully utilized cpu, so the actual amount of extra cpu required might be somewhat lower. I can't come up with something better, so I agree. > > > Agreed, but I think it is quite important to discuss what we understand > > by cpu utilization. It seems to be different depending on what you want > > to use it for. > > I understand utilization to be however much cpu is actually used, so I > would, per the existing naming, call running_avg to be the avg > utilization of a task/group/cpu whatever. I see your point, but for load balancing purposes we are more interested in the runnable_avg as it tells us about the cpu capacity requirements. I don't like to throw more terms into the mix, but you could call runnable_avg the potential task/group/cpu utilization. This is an estimate of how much utilization a task would cause if we moved it to an idle cpu. That might be quite different from running_avg on an over-utilized cpu.
> > So I'm not entirely seeing that from the code (I've not traced this), > afaict we actually update the per-cpu values on migration based on the > task values. > > old_rq->sum -= p->val; > new_rq->sum += p->val; > > like,.. except of course totally obscured. Yes, that's true for cfs.runnable_load_avg, but rq->avg.runnable_avg_sum is different. See the other reply.
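Peter's runnable - running reading (qualified later in the thread, since queueing inflates runnable) is easiest to see in the degenerate case. A toy calculation, not kernel code: two always-runnable, cpu-bound tasks pinned to one cpu.

```python
# Two cpu-bound tasks share one cpu equally. Over any long window each
# task runs half the time but is runnable (running or queued) all the time.
n_tasks = 2
running_per_task = 1.0 / n_tasks   # 0.5: cpu time each task actually gets
runnable_per_task = 1.0            # always runnable

# Summed over the runqueue: the extra cpu that would remove all queueing.
extra_cpu = n_tasks * (runnable_per_task - running_per_task)
```

Here extra_cpu comes out at exactly one cpu's worth, matching the intuition: a second cpu would let both tasks run without waiting. With partially idle tasks, queueing effects make runnable overshoot, so the difference is an upper bound rather than an exact requirement.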
> > The basic problem is that worst case: sum starting from 0 and period > > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods > > (ms) for sum to reach 47742. In other words, the cpu might have been > > fully utilized for 345 ms before it is considered fully utilized. > > Periodic load-balancing happens much more frequently than that. > > Like said earlier the 94% mark is actually hit much sooner, but yes, > likely still too slow. > > 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc.. > > > In the previous > > discussion [1] it was suggested that a sum of unweighted task > > runnable_avg_{sum,period} ratio instead. That is, an unweighted > > equivalent to weighted_cpuload(). That isn't a perfect solution either. > > It is fine as long as the cpus are not fully utilized, but when they are > > we need to use weighted_cpuload() to preserve smp_nice. What to do > > around the tipping point needs more thought, but I think that is > > currently the best proposal for a solution for task and cpu utilization. > > I'm not too worried about the tipping point, per task runnable figures > of an overloaded cpu are higher, so migration between an overloaded cpu > and an underloaded cpu is going to be tricky no matter what we do. > Hi, Can I join this discussion late? As I understand, you are talking about a metric for cpu activity. The issues with runnable_avg_sum are its sluggishness in reflecting the latest changes, and the need for unweighted load averages. You might be aware of my recent proposal for CPU ConCurrency (CC). It is 1) an average of nr_running, or 2) nr_running-weighted CPU utilization. So it is a combination of CPU utilization and run queue length (both factored in naturally). It meets the needs you talked about, I think. You can take it as a candidate, or at least we can talk about it?
Thanks, Yuyang
On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: >> Since we may do periodic load-balance every 10 ms or so, we will perform >> a number of load-balances where runnable_avg_sum will mostly be >> reflecting the state of the world before a change (new task queued or >> moved a task to a different cpu). If you had two tasks continuously >> on one cpu and your other cpu is idle, and you move one of the tasks to >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the >> first cpu while it starts from 0 on the other one. 10 ms later it will >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized >> and we might decide to put more tasks on it because we don't know if >> runnable_avg_sum represents a partially utilized cpu (for example a 50% >> task) or if it will continue to rise and eventually get to 47742. > > Ah, no, since we track per task, and update the per-cpu ones when we > migrate tasks, the per-cpu values should be instantly updated. > > If we were to increase per task storage, we might as well also track > running_avg not only runnable_avg. I agree that the removed running_avg should give more useful information about the load of a CPU. The main issue with running_avg is that it's disturbed by other tasks (as pointed out previously). As a typical example, if we have 2 tasks with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be in the range of [100% - 50%] depending on the parallelism of the runtime of the tasks, whereas the reality is 50%, and the use of running_avg will return this value. Vincent
On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > >> Since we may do periodic load-balance every 10 ms or so, we will perform > >> a number of load-balances where runnable_avg_sum will mostly be > >> reflecting the state of the world before a change (new task queued or > >> moved a task to a different cpu). If you had have two tasks continuously > >> on one cpu and your other cpu is idle, and you move one of the tasks to > >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > >> first cpu while it starts from 0 on the other one. 10 ms later it will > >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > >> and we might decide to put more tasks on it because we don't know if > >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > >> task) or if it will continue to rise and eventually get to 47742. > > > > Ah, no, since we track per task, and update the per-cpu ones when we > > migrate tasks, the per-cpu values should be instantly updated. > > > > If we were to increase per task storage, we might as well also track > > running_avg not only runnable_avg. > > I agree that the removed running_avg should give more useful > information about the the load of a CPU. > > The main issue with running_avg is that it's disturbed by other tasks > (as point out previously). As a typical example, if we have 2 tasks > with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be > in the range of [100% - 50%] depending of the parallelism of the > runtime of the tasks whereas the reality is 50% and the use of > running_avg will return this value I'm not sure I see how 100% is possible, but yes I agree that runnable can indeed be inflated due to this queueing effect.
On Wed, Jun 04, 2014 at 09:08:09AM +0100, Peter Zijlstra wrote: > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > > On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > > >> Since we may do periodic load-balance every 10 ms or so, we will perform > > >> a number of load-balances where runnable_avg_sum will mostly be > > >> reflecting the state of the world before a change (new task queued or > > >> moved a task to a different cpu). If you had have two tasks continuously > > >> on one cpu and your other cpu is idle, and you move one of the tasks to > > >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > > >> first cpu while it starts from 0 on the other one. 10 ms later it will > > >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > > >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > > >> and we might decide to put more tasks on it because we don't know if > > >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > > >> task) or if it will continue to rise and eventually get to 47742. > > > > > > Ah, no, since we track per task, and update the per-cpu ones when we > > > migrate tasks, the per-cpu values should be instantly updated. > > > > > > If we were to increase per task storage, we might as well also track > > > running_avg not only runnable_avg. > > > > I agree that the removed running_avg should give more useful > > information about the the load of a CPU. > > > > The main issue with running_avg is that it's disturbed by other tasks > > (as point out previously). 
As a typical example, if we have 2 tasks > > with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be > > in the range of [100% - 50%] depending of the parallelism of the > > runtime of the tasks whereas the reality is 50% and the use of > > running_avg will return this value Both running_avg and runnable_avg are affected by other tasks on the same cpus, but in different ways. They are equal if you only have one task on a cpu. If you have more, running_avg will give you the true requirement of the tasks until the cpu is fully utilized. At which point the task running_avg will drop if you add more tasks (the unweighted sum of task running_avgs remains constant). runnable_avg on the other hand, might be affected as soon as you have two tasks running on the same cpu if they are runnable at the same time. That isn't necessarily a bad thing for load-balancing purposes, because tasks that are runnable at the same time are likely to be run more efficiently by placing them on different cpus. You might view it as a sort of built-in concurrency factor, somewhat similar to what Yuyang is proposing. runnable_avg increases rapidly when the cpu is over-utilized. > I'm not sure I see how 100% is possible, but yes I agree that runnable > can indeed be inflated due to this queueing effect. You should only be able to get to 75% worst case for runnable_avg for that example. The total running_avg is 50% no matter if the tasks overlap or not. If you had five tasks on one cpu that each have a 25% requirement you can get individual task runnable_avgs of up to 100% (cpu unweighted runnable_load_avg can get up to 500%, I think), but the task running_avgs would be 20% each (total of 100%). -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
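[Editorial note: the five-task asymmetry described above can be restated as simple arithmetic, assuming ideal fair sharing on one over-utilized cpu; this is a sketch of the claim, not scheduler code:]

```python
# Five tasks, each demanding 25% of a single cpu.
n_tasks, demand, capacity = 5, 0.25, 1.0

total_demand = n_tasks * demand      # 1.25: more than the cpu can supply
running_each = capacity / n_tasks    # fair share of actual cpu time: 20%
runnable_each = 1.0                  # each task is always running or queued

running_sum = n_tasks * running_each    # capped at 100% of the cpu
runnable_sum = n_tasks * runnable_each  # the "up to 500%" figure
```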
On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote: > Both running_avg and runnable_avg are affected by other tasks on the > same cpus, but in different ways. They are equal if you only have one > task on a cpu. If you have more, running_avg will give you the true > requirement of the tasks until the cpu is fully utilized. At which point > the task running_avg will drop if you add more tasks (the unweighted sum > of task running_avgs remains constant). > > runnable_avg on the other hand, might be affected as soon as you have > two task running on the same cpu if they are runnable at the same time. > That isn't necessarily a bad thing for load-balancing purposes, because > tasks that are runnable at the same time are likely to be run more > efficiently by placing them on different cpus. You might view as at sort > of built in concurrency factor, somewhat similar to what Yuyang is > proposing. runnable_avg increases rapidly when the cpu is over-utilized. Agreed. > > I'm not sure I see how 100% is possible, but yes I agree that runnable > > can indeed be inflated due to this queueing effect. > > You should only be able to get to 75% worst case for runnable_avg for > that example. The total running_avg is 50% no matter if the tasks > overlaps or not. Yes, 75% is what I ended up with. > f you had five tasks on one cpu that each have a 25% requirement you can > get individual task runnable_avgs of up to 100% (cpu unweighted > runnable_load_avg can get up 500%, I think), but the task running_avgs > would be 20% each (total of 100%). Yeah, more or less so indeed. I had not considered the queueing effects on runnable_avg yesterday, so good that that got raised. That does indeed invalidate my: runnable - running := extra cpu required thing. It ends up being the extra cpu required for 0 latency but gobs of idle time, which is something else entirely.
On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: >> >> Since we may do periodic load-balance every 10 ms or so, we will perform >> >> a number of load-balances where runnable_avg_sum will mostly be >> >> reflecting the state of the world before a change (new task queued or >> >> moved a task to a different cpu). If you had have two tasks continuously >> >> on one cpu and your other cpu is idle, and you move one of the tasks to >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the >> >> first cpu while it starts from 0 on the other one. 10 ms later it will >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized >> >> and we might decide to put more tasks on it because we don't know if >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% >> >> task) or if it will continue to rise and eventually get to 47742. >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we >> > migrate tasks, the per-cpu values should be instantly updated. >> > >> > If we were to increase per task storage, we might as well also track >> > running_avg not only runnable_avg. >> >> I agree that the removed running_avg should give more useful >> information about the the load of a CPU. >> >> The main issue with running_avg is that it's disturbed by other tasks >> (as point out previously). 
As a typical example, if we have 2 tasks >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be >> in the range of [100% - 50%] depending of the parallelism of the >> runtime of the tasks whereas the reality is 50% and the use of >> running_avg will return this value > > I'm not sure I see how 100% is possible, but yes I agree that runnable > can indeed be inflated due to this queueing effect. In fact, it can be even worse than that because I forgot to take into account the geometric series effect, which implies that it depends on the runtime (idle time) of the task.

Take 3 examples:

2 tasks that need to run 10ms simultaneously each 40ms. If they share the same CPU, they will be on the runqueue 20ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 33% each, so the unweighted runnable_load_avg of the CPU will be 66%.

2 tasks that need to run 25ms simultaneously each 100ms. If they share the same CPU, they will be on the runqueue 50ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 74% each, so the unweighted runnable_load_avg of the CPU will be 148%.

2 tasks that need to run 50ms simultaneously each 200ms. If they share the same CPU, they will be on the runqueue 100ms (in fact a bit less for one of them). Their load (runnable_avg_sum/runnable_avg_period) will be 89% each, so the unweighted runnable_load_avg of the CPU will be 180%.

Vincent
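[Editorial note: the pattern-length effect above can be sanity-checked numerically. This is my own simulation, not kernel code: it iterates the decayed sums per 1 ms period and samples the ratio at the end of each runnable stretch. It lands close to the 74% and 89% figures for the second and third examples; the number for the first example depends on where in the cycle you sample (the ratio swings between roughly 39% and 61% there).]

```python
Y = 0.5 ** (1.0 / 32)  # per-1ms decay used by the kernel's load tracking

def steady_ratio(runnable_ms, cycle_ms, cycles=50):
    """Steady-state runnable_avg_sum / runnable_avg_period for a task that
    is runnable for runnable_ms out of every cycle_ms, sampled at the end
    of the runnable stretch (where the ratio peaks)."""
    s = p = 0.0
    ratio = 0.0
    for _ in range(cycles):
        for t in range(cycle_ms):
            s = s * Y + (1024 if t < runnable_ms else 0)  # runnable time
            p = p * Y + 1024                              # wall-clock period
            if t == runnable_ms - 1:
                ratio = s / p
    return ratio

# Two 25%-ish tasks sharing one CPU: each is runnable for twice its runtime.
r_25_of_100 = steady_ratio(50, 100)    # 25 ms runtime each 100 ms: ~74%
r_50_of_200 = steady_ratio(100, 200)   # 50 ms runtime each 200 ms: ~90%
```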
On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote: >> Both running_avg and runnable_avg are affected by other tasks on the >> same cpus, but in different ways. They are equal if you only have one >> task on a cpu. If you have more, running_avg will give you the true >> requirement of the tasks until the cpu is fully utilized. At which point >> the task running_avg will drop if you add more tasks (the unweighted sum >> of task running_avgs remains constant). >> >> runnable_avg on the other hand, might be affected as soon as you have >> two task running on the same cpu if they are runnable at the same time. >> That isn't necessarily a bad thing for load-balancing purposes, because >> tasks that are runnable at the same time are likely to be run more >> efficiently by placing them on different cpus. You might view as at sort >> of built in concurrency factor, somewhat similar to what Yuyang is >> proposing. runnable_avg increases rapidly when the cpu is over-utilized. > > Agreed. > >> > I'm not sure I see how 100% is possible, but yes I agree that runnable >> > can indeed be inflated due to this queueing effect. >> >> You should only be able to get to 75% worst case for runnable_avg for >> that example. The total running_avg is 50% no matter if the tasks >> overlaps or not. > > Yes, 75% is what I ended up with. Can you explain how you reach 75%? It depends on the runtime: a runtime longer than 345ms will end up at 100% load whatever the idle time was previously. > >> f you had five tasks on one cpu that each have a 25% requirement you can >> get individual task runnable_avgs of up to 100% (cpu unweighted >> runnable_load_avg can get up 500%, I think), but the task running_avgs >> would be 20% each (total of 100%). > > Yeah, more or less so indeed. I had not considered the queueing effects > on runnable_avg yesterday, so good that that got raised.
> > That does indeed invalidate my: runnable - running := extra cpu required > thing. It ends up being the extra cpu required for 0 latency but gobs of > idle time, which is something else entirely.
On Wed, Jun 04, 2014 at 10:23:13AM +0100, Peter Zijlstra wrote: > > f you had five tasks on one cpu that each have a 25% requirement you can > > get individual task runnable_avgs of up to 100% (cpu unweighted > > runnable_load_avg can get up 500%, I think), but the task running_avgs > > would be 20% each (total of 100%). > > Yeah, more or less so indeed. I had not considered the queueing effects > on runnable_avg yesterday, so good that that got raised. > > That does indeed invalidate my: runnable - running := extra cpu required > thing. It ends up being the extra cpu required for 0 latency but gobs of > idle time, which is something else entirely. Agreed, but I think it is still a useful estimate of the required compute capacity. If there is a significant difference between runnable and running on a cpu, the current mix of tasks is not good for latency. However, we need to treat it as a worst case estimate and not necessarily try to move exactly runnable-running worth of tasks to another cpu. So far I haven't been able to come up with something better.
On Wed, Jun 04, 2014 at 10:32:10AM +0100, Vincent Guittot wrote: > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform > >> >> a number of load-balances where runnable_avg_sum will mostly be > >> >> reflecting the state of the world before a change (new task queued or > >> >> moved a task to a different cpu). If you had have two tasks continuously > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > >> >> and we might decide to put more tasks on it because we don't know if > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > >> >> task) or if it will continue to rise and eventually get to 47742. > >> > > >> > Ah, no, since we track per task, and update the per-cpu ones when we > >> > migrate tasks, the per-cpu values should be instantly updated. > >> > > >> > If we were to increase per task storage, we might as well also track > >> > running_avg not only runnable_avg. > >> > >> I agree that the removed running_avg should give more useful > >> information about the the load of a CPU. > >> > >> The main issue with running_avg is that it's disturbed by other tasks > >> (as point out previously). 
As a typical example, if we have 2 tasks > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be > >> in the range of [100% - 50%] depending of the parallelism of the > >> runtime of the tasks whereas the reality is 50% and the use of > >> running_avg will return this value > > > > I'm not sure I see how 100% is possible, but yes I agree that runnable > > can indeed be inflated due to this queueing effect. > > In fact, it can be even worse than that because i forgot to take into > account the geometric series effect which implies that it depends of > the runtime (idletime) of the task > > Take 3 examples: > > 2 tasks that need to run 10ms simultaneously each 40ms. If they share > the same CPU, they will be on the runqueue 20ms (in fact a bit less > for one of them), Their load (runnable_avg_sum/runnable_avg_period) > will be 33% each so the unweighted runnable_load_avg of the CPU will > be 66% Right, it actually depends on how often you switch between the tasks. If your sched_tick happens every 10 ms then in this example one task will run for 10 ms and be done, while the other one waits for 10 ms and then runs to completion in 10 ms. The result is that one task is runnable for 10 ms and the other one is runnable for 20 ms. That gives you 25% and 50% for a total of 75%. > > 2 tasks that need to run 25ms simultaneously each 100ms. If they share > the same CPU, they will be on the runqueue 50ms (in fact a bit less > for one of them), Their load (runnable_avg_sum/runnable_avg_period) > will be 74% each so the unweighted runnable_load_avg of the CPU will > be 148% > > 2 tasks that need to run 50ms simultaneously each 200ms.
If they > share the same CPU, they will be on the runqueue 100ms (in fact a bit > less for one of them), Their load > (runnable_avg_sum/runnable_avg_period) will be 89% each so the > unweighted runnable_load_avg of the CPU will be 180% In this case you switch between the tasks before they complete, so both tasks are waiting and runnable will get much higher than 75%. You are right. I was thinking about tasks where the busy period is short enough that they run to completion when they are scheduled.
On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote: > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform > >> >> a number of load-balances where runnable_avg_sum will mostly be > >> >> reflecting the state of the world before a change (new task queued or > >> >> moved a task to a different cpu). If you had have two tasks continuously > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > >> >> and we might decide to put more tasks on it because we don't know if > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > >> >> task) or if it will continue to rise and eventually get to 47742. > >> > > >> > Ah, no, since we track per task, and update the per-cpu ones when we > >> > migrate tasks, the per-cpu values should be instantly updated. > >> > > >> > If we were to increase per task storage, we might as well also track > >> > running_avg not only runnable_avg. > >> > >> I agree that the removed running_avg should give more useful > >> information about the the load of a CPU. > >> > >> The main issue with running_avg is that it's disturbed by other tasks > >> (as point out previously). 
As a typical example, if we have 2 tasks > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be > >> in the range of [100% - 50%] depending of the parallelism of the > >> runtime of the tasks whereas the reality is 50% and the use of > >> running_avg will return this value > > > > I'm not sure I see how 100% is possible, but yes I agree that runnable > > can indeed be inflated due to this queueing effect. Let me explain the 75%, take any one of the above scenarios. Lets call the two tasks A and B, and let for a moment assume A always wins and runs first, and then B. So A will be runnable for 25%, B otoh will be runnable the entire time A is actually running plus its own running time, giving 50%. Together that makes 75%. If you release the assumption that A runs first, but instead assume they equally win the first execution, you get them averaging at 37.5% each, which combined will still give 75%. > In fact, it can be even worse than that because i forgot to take into > account the geometric series effect which implies that it depends of > the runtime (idletime) of the task > > Take 3 examples: > > 2 tasks that need to run 10ms simultaneously each 40ms. If they share > the same CPU, they will be on the runqueue 20ms (in fact a bit less > for one of them), Their load (runnable_avg_sum/runnable_avg_period) > will be 33% each so the unweighted runnable_load_avg of the CPU will > be 66% > > 2 tasks that need to run 25ms simultaneously each 100ms. If they share > the same CPU, they will be on the runqueue 50ms (in fact a bit less > for one of them), Their load (runnable_avg_sum/runnable_avg_period) > will be 74% each so the unweighted runnable_load_avg of the CPU will > be 148% > > 2 tasks that need to run 50ms simultaneously each 200ms. 
If they > share the same CPU, they will be on the runqueue 100ms (in fact a bit > less for one of them), Their load > (runnable_avg_sum/runnable_avg_period) will be 89% each so the > unweighted runnable_load_avg of the CPU will be 180% And this is because the running time is 'large' compared to the decay and we get hit by the weight of the recent state? Yes, I can see that, the avg will fluctuate due to the nature of this thing.
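[Editorial note: the 75% figure explained above is plain arithmetic once the ordering is fixed. A quick sketch of the run-to-completion case: two tasks each needing 10 ms of CPU every 40 ms, with A always running first.]

```python
period_ms = 40
busy_ms = 10

# A runs immediately and completes: runnable (= running) for 10 ms.
a_runnable = busy_ms
# B waits the whole time A runs, then runs itself: 10 + 10 = 20 ms.
b_runnable = busy_ms + busy_ms

a_pct = a_runnable / period_ms   # 0.25
b_pct = b_runnable / period_ms   # 0.50
combined = a_pct + b_pct         # 0.75, the 75% worst case
```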
On Wed, Jun 04, 2014 at 10:35:15AM +0100, Vincent Guittot wrote: > On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote: > >> Both running_avg and runnable_avg are affected by other tasks on the > >> same cpus, but in different ways. They are equal if you only have one > >> task on a cpu. If you have more, running_avg will give you the true > >> requirement of the tasks until the cpu is fully utilized. At which point > >> the task running_avg will drop if you add more tasks (the unweighted sum > >> of task running_avgs remains constant). > >> > >> runnable_avg on the other hand, might be affected as soon as you have > >> two task running on the same cpu if they are runnable at the same time. > >> That isn't necessarily a bad thing for load-balancing purposes, because > >> tasks that are runnable at the same time are likely to be run more > >> efficiently by placing them on different cpus. You might view as at sort > >> of built in concurrency factor, somewhat similar to what Yuyang is > >> proposing. runnable_avg increases rapidly when the cpu is over-utilized. > > > > Agreed. > > > >> > I'm not sure I see how 100% is possible, but yes I agree that runnable > >> > can indeed be inflated due to this queueing effect. > >> > >> You should only be able to get to 75% worst case for runnable_avg for > >> that example. The total running_avg is 50% no matter if the tasks > >> overlaps or not. > > > > Yes, 75% is what I ended up with. > > Can you explain how you reach 75% as it depends on the runtime and a > runtime longer than 345ms will end to a 100% load whatever the > idletime was previously ? If the busy period of each task is long enough that the first one to run runs to completion before the other task is scheduled, you get 25% and 50%.
But as I said in my other reply, you can get higher task runnable if the tasks' busy period is long enough that you switch between them before the first one completes.
On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote: > On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote: > > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: > > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform > > >> >> a number of load-balances where runnable_avg_sum will mostly be > > >> >> reflecting the state of the world before a change (new task queued or > > >> >> moved a task to a different cpu). If you had have two tasks continuously > > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to > > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will > > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > > >> >> and we might decide to put more tasks on it because we don't know if > > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > > >> >> task) or if it will continue to rise and eventually get to 47742. > > >> > > > >> > Ah, no, since we track per task, and update the per-cpu ones when we > > >> > migrate tasks, the per-cpu values should be instantly updated. > > >> > > > >> > If we were to increase per task storage, we might as well also track > > >> > running_avg not only runnable_avg. > > >> > > >> I agree that the removed running_avg should give more useful > > >> information about the the load of a CPU. > > >> > > >> The main issue with running_avg is that it's disturbed by other tasks > > >> (as point out previously). 
As a typical example, if we have 2 tasks > > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be > > >> in the range of [100% - 50%] depending of the parallelism of the > > >> runtime of the tasks whereas the reality is 50% and the use of > > >> running_avg will return this value > > > > > > I'm not sure I see how 100% is possible, but yes I agree that runnable > > > can indeed be inflated due to this queueing effect. > > Let me explain the 75%, take any one of the above scenarios. Lets call > the two tasks A and B, and let for a moment assume A always wins and > runs first, and then B. > > So A will be runnable for 25%, B otoh will be runnable the entire time A > is actually running plus its own running time, giving 50%. Together that > makes 75%. > > If you release the assumption that A runs first, but instead assume they > equally win the first execution, you get them averaging at 37.5% each, > which combined will still give 75%. But that is assuming that the first task gets to run to completion of its busy period. If it uses up its sched_slice and we switch to the other task, they both get to wait. For example, if the sched_slice is 5 ms and the busy period is 10 ms, the execution pattern would be: A, B, A, B, idle, ... In that case A is runnable for 15 ms and B is for 20 ms. Assuming that the overall period is 40 ms, the A runnable is 37.5% and B is 50%.
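[Editorial note: the 5 ms slice example above can be replayed with a tiny round-robin simulation. This is an illustrative sketch, not the actual CFS pick logic; the task names A/B and the fixed slice come from the example.]

```python
def simulate(busy_ms=10, slice_ms=5, period_ms=40):
    """Round-robin two tasks that each need busy_ms of CPU per period_ms.
    Returns the ms each task spends runnable (waiting or running)."""
    remaining = {"A": busy_ms, "B": busy_ms}
    runnable = {"A": 0, "B": 0}
    order = ["A", "B"]
    while any(remaining.values()):
        cur = next(t for t in order if remaining[t])
        run = min(slice_ms, remaining[cur])
        for t in order:                  # every unfinished task waits or runs
            if remaining[t]:
                runnable[t] += run
        remaining[cur] -= run
        order = order[1:] + order[:1]    # rotate for round-robin
    return runnable

r = simulate()
# r == {"A": 15, "B": 20}: 15/40 = 37.5% and 20/40 = 50%, as in the example.
```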
On Wed, Jun 04, 2014 at 11:36:19AM +0100, Morten Rasmussen wrote: > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote: > > Let me explain the 75%, take any one of the above scenarios. Lets call > > the two tasks A and B, and let for a moment assume A always wins and > > runs first, and then B. > > > > So A will be runnable for 25%, B otoh will be runnable the entire time A > > is actually running plus its own running time, giving 50%. Together that > > makes 75%. > > > > If you release the assumption that A runs first, but instead assume they > > equally win the first execution, you get them averaging at 37.5% each, > > which combined will still give 75%. > > But that is assuming that the first task gets to run to completion of it > busy period. If it uses up its sched_slice and we switch to the other > tasks, they both get to wait. > > For example, if the sched_slice is 5 ms and the busy period is 10 ms, > the execution pattern would be: A, B, A, B, idle, ... In that case A is > runnable for 15 ms and B is for 20 ms. Assuming that the overall period > is 40 ms, the A runnable is 37.5% and B is 50%. Indeed, with preemption added you can pull this out further. You can then indeed get infinitesimally close to 100% with this.
On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote: >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote: >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform >> > >> >> a number of load-balances where runnable_avg_sum will mostly be >> > >> >> reflecting the state of the world before a change (new task queued or >> > >> >> moved a task to a different cpu). If you had have two tasks continuously >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized >> > >> >> and we might decide to put more tasks on it because we don't know if >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% >> > >> >> task) or if it will continue to rise and eventually get to 47742. >> > >> > >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we >> > >> > migrate tasks, the per-cpu values should be instantly updated. >> > >> > >> > >> > If we were to increase per task storage, we might as well also track >> > >> > running_avg not only runnable_avg. >> > >> >> > >> I agree that the removed running_avg should give more useful >> > >> information about the the load of a CPU. 
>> > >> >> > >> The main issue with running_avg is that it's disturbed by other tasks >> > >> (as point out previously). As a typical example, if we have 2 tasks >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be >> > >> in the range of [100% - 50%] depending of the parallelism of the >> > >> runtime of the tasks whereas the reality is 50% and the use of >> > >> running_avg will return this value >> > > >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable >> > > can indeed be inflated due to this queueing effect. >> >> Let me explain the 75%, take any one of the above scenarios. Lets call >> the two tasks A and B, and let for a moment assume A always wins and >> runs first, and then B. >> >> So A will be runnable for 25%, B otoh will be runnable the entire time A >> is actually running plus its own running time, giving 50%. Together that >> makes 75%. >> >> If you release the assumption that A runs first, but instead assume they >> equally win the first execution, you get them averaging at 37.5% each, >> which combined will still give 75%. > > But that is assuming that the first task gets to run to completion of it > busy period. If it uses up its sched_slice and we switch to the other > tasks, they both get to wait. > > For example, if the sched_slice is 5 ms and the busy period is 10 ms, > the execution pattern would be: A, B, A, B, idle, ... In that case A is > runnable for 15 ms and B is for 20 ms. Assuming that the overall period > is 40 ms, the A runnable is 37.5% and B is 50%. The exact value for your scheduling example above is: A runnable will be 47% and B runnable will be 60% (unless I made a mistake in my computation) and CPU runnable will be 60% too Vincent
On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote: > On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote: > >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote: > >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote: > >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote: > >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote: > >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote: > >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform > >> > >> >> a number of load-balances where runnable_avg_sum will mostly be > >> > >> >> reflecting the state of the world before a change (new task queued or > >> > >> >> moved a task to a different cpu). If you had have two tasks continuously > >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to > >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the > >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will > >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later > >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized > >> > >> >> and we might decide to put more tasks on it because we don't know if > >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50% > >> > >> >> task) or if it will continue to rise and eventually get to 47742. > >> > >> > > >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we > >> > >> > migrate tasks, the per-cpu values should be instantly updated. > >> > >> > > >> > >> > If we were to increase per task storage, we might as well also track > >> > >> > running_avg not only runnable_avg. 
> >> > >>
> >> > >> I agree that the removed running_avg should give more useful
> >> > >> information about the load of a CPU.
> >> > >>
> >> > >> The main issue with running_avg is that it's disturbed by other tasks
> >> > >> (as pointed out previously). As a typical example, if we have 2 tasks
> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> > >> in the range of [100% - 50%] depending on the parallelism of the
> >> > >> runtime of the tasks whereas the reality is 50% and the use of
> >> > >> running_avg will return this value
> >> > >
> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
> >> > > can indeed be inflated due to this queueing effect.
> >>
> >> Let me explain the 75%. Take any one of the above scenarios. Let's call
> >> the two tasks A and B, and let's for a moment assume A always wins and
> >> runs first, and then B.
> >>
> >> So A will be runnable for 25%; B otoh will be runnable the entire time A
> >> is actually running plus its own running time, giving 50%. Together that
> >> makes 75%.
> >>
> >> If you release the assumption that A runs first, but instead assume they
> >> equally win the first execution, you get them averaging at 37.5% each,
> >> which combined will still give 75%.
> >
> > But that is assuming that the first task gets to run to completion of its
> > busy period. If it uses up its sched_slice and we switch to the other
> > task, they both get to wait.
> >
> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> > is 40 ms, the A runnable is 37.5% and B is 50%.
>
> The exact value for your scheduling example above is:
> A runnable will be 47% and B runnable will be 60% (unless I make a
> mistake in my computation)

I get:

A: 15/40 ms = 37.5%
B: 20/40 ms = 50%

Schedule:

| 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A: run    rq     run   ----------- sleeping -------------   run
B: rq     run    rq    run   ---- sleeping -------------    rq

> and CPU runnable will be 60% too

rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
20 ms every 40 ms.

Right?

Morten
On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
>> On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
>> >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
>> >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> >> > >> >> a number of load-balances where runnable_avg_sum will mostly be
>> >> > >> >> reflecting the state of the world before a change (new task queued or
>> >> > >> >> moved a task to a different cpu). If you have two tasks continuously
>> >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> >> > >> >> it reaches 47742. In the meantime the cpu doesn't appear fully utilized
>> >> > >> >> and we might decide to put more tasks on it because we don't know if
>> >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> >> > >> >> task) or if it will continue to rise and eventually get to 47742.
>> >> > >> >
>> >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
>> >> > >> > migrate tasks, the per-cpu values should be instantly updated.
>> >> > >> >
>> >> > >> > If we were to increase per-task storage, we might as well also track
>> >> > >> > running_avg, not only runnable_avg.
>> >> > >>
>> >> > >> I agree that the removed running_avg should give more useful
>> >> > >> information about the load of a CPU.
>> >> > >>
>> >> > >> The main issue with running_avg is that it's disturbed by other tasks
>> >> > >> (as pointed out previously). As a typical example, if we have 2 tasks
>> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> >> > >> in the range of [100% - 50%] depending on the parallelism of the
>> >> > >> runtime of the tasks whereas the reality is 50% and the use of
>> >> > >> running_avg will return this value
>> >> > >
>> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> >> > > can indeed be inflated due to this queueing effect.
>> >>
>> >> Let me explain the 75%. Take any one of the above scenarios. Let's call
>> >> the two tasks A and B, and let's for a moment assume A always wins and
>> >> runs first, and then B.
>> >>
>> >> So A will be runnable for 25%; B otoh will be runnable the entire time A
>> >> is actually running plus its own running time, giving 50%. Together that
>> >> makes 75%.
>> >>
>> >> If you release the assumption that A runs first, but instead assume they
>> >> equally win the first execution, you get them averaging at 37.5% each,
>> >> which combined will still give 75%.
>> >
>> > But that is assuming that the first task gets to run to completion of its
>> > busy period. If it uses up its sched_slice and we switch to the other
>> > task, they both get to wait.
>> >
>> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
>> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
>> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
>> > is 40 ms, the A runnable is 37.5% and B is 50%.
>>
>> The exact value for your scheduling example above is:
>> A runnable will be 47% and B runnable will be 60% (unless I make a
>> mistake in my computation)
>
> I get:
>
> A: 15/40 ms = 37.5%
> B: 20/40 ms = 50%
>
> Schedule:
>
> | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A: run    rq     run   ----------- sleeping -------------   run
> B: rq     run    rq    run   ---- sleeping -------------    rq
>
>> and CPU runnable will be 60% too
>
> rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> 20 ms every 40 ms.
>
> Right?

OK, I see the misunderstanding: it depends on what we mean by runnable.
You take the % of time, whereas I take the runnable_avg_sum/period.

So A is on_rq 15/40 ms = 37.5% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 47%.
B is on_rq 20/40 ms = 50% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 60%.
And the CPU has a task on its rq 20/40 ms = 50% of the time, which gives
a runnable_avg_sum/runnable_avg_period of 60%.

Vincent

>
> Morten
On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > I get:
> >
> > A: 15/40 ms = 37.5%
> > B: 20/40 ms = 50%
> >
> > Schedule:
> >
> > | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > A: run    rq     run   ----------- sleeping -------------   run
> > B: rq     run    rq    run   ---- sleeping -------------    rq
> >
> >> and CPU runnable will be 60% too
> >
> > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > 20 ms every 40 ms.
> >
> > Right?
>
> OK, I see the misunderstanding: it depends on what we mean by runnable.
> You take the % of time, whereas I take the runnable_avg_sum/period.

Right. There is a difference.

>
> So A is on_rq 15/40 ms = 37.5% of the time, which gives a
> runnable_avg_sum/runnable_avg_period of 47%.
> B is on_rq 20/40 ms = 50% of the time, which gives a
> runnable_avg_sum/runnable_avg_period of 60%.
> And the CPU has a task on its rq 20/40 ms = 50% of the time, which gives
> a runnable_avg_sum/runnable_avg_period of 60%.

Yes, that seems about right.
On Wed, Jun 04, 2014 at 02:09:18PM +0100, Morten Rasmussen wrote:
> On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> > On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > I get:
> > >
> > > A: 15/40 ms = 37.5%
> > > B: 20/40 ms = 50%
> > >
> > > Schedule:
> > >
> > > | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > > A: run    rq     run   ----------- sleeping -------------   run
> > > B: rq     run    rq    run   ---- sleeping -------------    rq
> > >
> > >> and CPU runnable will be 60% too
> > >
> > > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > > 20 ms every 40 ms.
> > >
> > > Right?
> >
> > OK, I see the misunderstanding: it depends on what we mean by runnable.
> > You take the % of time, whereas I take the runnable_avg_sum/period.
>
> Right. There is a difference.
>
> >
> > So A is on_rq 15/40 ms = 37.5% of the time, which gives a
> > runnable_avg_sum/runnable_avg_period of 47%.
> > B is on_rq 20/40 ms = 50% of the time, which gives a
> > runnable_avg_sum/runnable_avg_period of 60%.
> > And the CPU has a task on its rq 20/40 ms = 50% of the time, which gives
> > a runnable_avg_sum/runnable_avg_period of 60%.
>
> Yes, that seems about right.

If my calculations are right, cfs.runnable_load_avg would peak 15 ms
into the period at about 104%. After 20 ms A has decayed more than B has
gained, so the sum is slightly lower.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;	/* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -5876,6 +5896,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -6034,6 +6055,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;
 
 		sg = sg->next;
Monitor the activity level of each group of each sched_domain level. The
activity is the amount of cpu_power that is currently used on a CPU or
group of CPUs. We use the runnable_avg_sum and _period to evaluate this
activity level. In the special case where the CPU is fully loaded by
more than one task, the activity level is set above the cpu_power in
order to reflect the overload of the CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)