Message ID | 1400860385-14555-5-git-send-email-vincent.guittot@linaro.org |
---|---|
State | New |
Headers | show |
On 23/05/14 16:52, Vincent Guittot wrote: > power_orig is only changed for system with a SMT sched_domain level in order to > reflect the lower capacity of CPUs. Heterogenous system also have to reflect an > original capacity that is different from the default value. > > Create a more generic function arch_scale_cpu_power that can be also used by > non SMT platform to set power_orig. > > The weak behavior of arch_scale_cpu_power is the previous SMT one in order to > keep backward compatibility in the use of power_orig. > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> As you know, besides uarch scaled cpu power for HMP, freq scaled cpu power is important for energy-aware scheduling to achieve freq scale invariance for task load. I know that your patch-set is not about introducing freq scaled cpu power, but we were discussing how this can be achieved w/ your patch-set in place, so maybe you can share your opinion regarding the easiest way to achieve freq scale invariance with us? (1) We assume that the current way (update_cpu_power() calls arch_scale_freq_power() to get the avg power(freq) over the time period since the last call to arch_scale_freq_power()) is suitable for us. Do you have another opinion here? (2) Is the current layout of update_cpu_power() adequate for this, where we scale power_orig related to freq and then related to rt/(irq): power_orig = scale_cpu(SCHED_POWER_SCALE) power = scale_rt(scale_freq(power_orig)) or do we need an extra power_freq data member on the rq and do: power_orig = scale_cpu(SCHED_POWER_SCALE) power_freq = scale_freq(power_orig)) power = scale_rt(power_orig)) In other words, do we consider rt/(irq) pressure when calculating freq scale invariant task load or not? Thanks, -- Dietmar [...] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, May 30, 2014 at 03:04:32PM +0100, Dietmar Eggemann wrote: > On 23/05/14 16:52, Vincent Guittot wrote: > > power_orig is only changed for system with a SMT sched_domain level in order to > > reflect the lower capacity of CPUs. Heterogenous system also have to reflect an > > original capacity that is different from the default value. > > > > Create a more generic function arch_scale_cpu_power that can be also used by > > non SMT platform to set power_orig. > > > > The weak behavior of arch_scale_cpu_power is the previous SMT one in order to > > keep backward compatibility in the use of power_orig. > > > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > As you know, besides uarch scaled cpu power for HMP, freq scaled cpu > power is important for energy-aware scheduling to achieve freq scale > invariance for task load. > > I know that your patch-set is not about introducing freq scaled cpu > power, but we were discussing how this can be achieved w/ your patch-set > in place, so maybe you can share your opinion regarding the easiest way > to achieve freq scale invariance with us? > > (1) We assume that the current way (update_cpu_power() calls > arch_scale_freq_power() to get the avg power(freq) over the time period > since the last call to arch_scale_freq_power()) is suitable > for us. Do you have another opinion here? > > (2) Is the current layout of update_cpu_power() adequate for this, where > we scale power_orig related to freq and then related to rt/(irq): > > power_orig = scale_cpu(SCHED_POWER_SCALE) > power = scale_rt(scale_freq(power_orig)) > > or do we need an extra power_freq data member on the rq and do: > > power_orig = scale_cpu(SCHED_POWER_SCALE) > power_freq = scale_freq(power_orig)) > power = scale_rt(power_orig)) > > In other words, do we consider rt/(irq) pressure when calculating freq > scale invariant task load or not? I don't think you should. The work done depends on the frequency, not on other tasks present on the cpu. The same is true for an over-utilized cpu, a task will run less than the desired amount of time, this is no different from a RT/irq preempting the task and taking its time.
On 30 May 2014 16:04, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > On 23/05/14 16:52, Vincent Guittot wrote: >> power_orig is only changed for system with a SMT sched_domain level in order to >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an >> original capacity that is different from the default value. >> >> Create a more generic function arch_scale_cpu_power that can be also used by >> non SMT platform to set power_orig. >> >> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to >> keep backward compatibility in the use of power_orig. >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> > > As you know, besides uarch scaled cpu power for HMP, freq scaled cpu > power is important for energy-aware scheduling to achieve freq scale > invariance for task load. > > I know that your patch-set is not about introducing freq scaled cpu > power, but we were discussing how this can be achieved w/ your patch-set > in place, so maybe you can share your opinion regarding the easiest way > to achieve freq scale invariance with us? > > (1) We assume that the current way (update_cpu_power() calls > arch_scale_freq_power() to get the avg power(freq) over the time period > since the last call to arch_scale_freq_power()) is suitable > for us. Do you have another opinion here? Using power (or power_freq as you mentioned below) is probably the easiest and more straight forward solution. You can use it to scale each element when updating entity runnable. Nevertheless, I see to 2 potential issues: - is power updated often enough to correctly follow the frequency scaling ? we need to compare power update frequency with runnable_avg_sum variation speed and the rate at which we will change the CPU's frequency. - the max value of runnable_avg_sum will be also scaled so a task running on a CPU with less capacity could be seen as a "low" load even if it's an always running tasks. So we need to find a way to reach the max value for such situation > > (2) Is the current layout of update_cpu_power() adequate for this, where > we scale power_orig related to freq and then related to rt/(irq): > > power_orig = scale_cpu(SCHED_POWER_SCALE) > power = scale_rt(scale_freq(power_orig)) > > or do we need an extra power_freq data member on the rq and do: > > power_orig = scale_cpu(SCHED_POWER_SCALE) > power_freq = scale_freq(power_orig)) > power = scale_rt(power_orig)) do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ? > > In other words, do we consider rt/(irq) pressure when calculating freq > scale invariant task load or not? we should take power_freq which implies a new field Thanks, Vincent > > Thanks, > > -- Dietmar > [...] > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote: > power_orig is only changed for system with a SMT sched_domain level in order to > reflect the lower capacity of CPUs. Heterogenous system also have to reflect an > original capacity that is different from the default value. > > Create a more generic function arch_scale_cpu_power that can be also used by > non SMT platform to set power_orig. I did a quick test of the patch set with adjusting cpu_power on big.LITTLE (ARM TC2) to reflect the different compute capacities of the A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and without the patches applied, but with non-default cpu_powers. I didn't see any difference in the load-balance. Three tasks ended up on the two A15s and two tasks ended up on two of the three A7s leaving one unused in both cases. Using default cpu_power I get one task on each of the five cpus (best throughput). Unless I messed something up, it seems that setting cpu_power doesn't give me the best throughput with these patches applied. Have you done any tests on big.LITTLE? Morten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote: >> power_orig is only changed for system with a SMT sched_domain level in order to >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an >> original capacity that is different from the default value. >> >> Create a more generic function arch_scale_cpu_power that can be also used by >> non SMT platform to set power_orig. > > I did a quick test of the patch set with adjusting cpu_power on > big.LITTLE (ARM TC2) to reflect the different compute capacities of the > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and > without the patches applied, but with non-default cpu_powers. > > I didn't see any difference in the load-balance. Three tasks ended up on > the two A15s and two tasks ended up on two of the three A7s leaving one > unused in both cases. > > Using default cpu_power I get one task on each of the five cpus (best > throughput). Unless I messed something up, it seems that setting > cpu_power doesn't give me the best throughput with these patches > applied. That's normal this patchset is necessary but not enough to solve the issue you mention. We also need to fix the way the imbalance is calculated for such situation. I have planned to push that in another patchset in order to not mix too much thing together Vincent > > Have you done any tests on big.LITTLE? > > Morten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[...] >> (1) We assume that the current way (update_cpu_power() calls >> arch_scale_freq_power() to get the avg power(freq) over the time period >> since the last call to arch_scale_freq_power()) is suitable >> for us. Do you have another opinion here? > > Using power (or power_freq as you mentioned below) is probably the > easiest and more straight forward solution. You can use it to scale > each element when updating entity runnable. > Nevertheless, I see to 2 potential issues: > - is power updated often enough to correctly follow the frequency > scaling ? we need to compare power update frequency with > runnable_avg_sum variation speed and the rate at which we will change > the CPU's frequency. > - the max value of runnable_avg_sum will be also scaled so a task > running on a CPU with less capacity could be seen as a "low" load even > if it's an always running tasks. So we need to find a way to reach the > max value for such situation I think I mixed two problems together here: Firstly, we need to scale cpu power in update_cpu_power() regarding uArch, frequency and rt/irq pressure. Here the freq related value we get back from arch_scale_freq_power(..., cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)). Secondly, to be able to scale the runnable avg sum of a sched entity (se->avg->runnable_avg_sum), we preferable have a coefficient representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most powerful cpu in the system) and another coefficient (avg freq over 'now - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to be retrieved from the arch in __update_entity_runnable_avg(). >> (2) Is the current layout of update_cpu_power() adequate for this, where >> we scale power_orig related to freq and then related to rt/(irq): >> >> power_orig = scale_cpu(SCHED_POWER_SCALE) >> power = scale_rt(scale_freq(power_orig)) >> >> or do we need an extra power_freq data member on the rq and do: >> >> power_orig = scale_cpu(SCHED_POWER_SCALE) >> power_freq = scale_freq(power_orig)) >> power = scale_rt(power_orig)) > > do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ? No, I also think that power=scale_rt(power_freq) is correct. >> In other words, do we consider rt/(irq) pressure when calculating freq >> scale invariant task load or not? > > we should take power_freq which implies a new field [...] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 4 June 2014 11:42, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > [...] >>> (1) We assume that the current way (update_cpu_power() calls >>> arch_scale_freq_power() to get the avg power(freq) over the time period >>> since the last call to arch_scale_freq_power()) is suitable >>> for us. Do you have another opinion here? >> >> Using power (or power_freq as you mentioned below) is probably the >> easiest and more straight forward solution. You can use it to scale >> each element when updating entity runnable. >> Nevertheless, I see to 2 potential issues: >> - is power updated often enough to correctly follow the frequency >> scaling ? we need to compare power update frequency with >> runnable_avg_sum variation speed and the rate at which we will change >> the CPU's frequency. >> - the max value of runnable_avg_sum will be also scaled so a task >> running on a CPU with less capacity could be seen as a "low" load even >> if it's an always running tasks. So we need to find a way to reach the >> max value for such situation > > I think I mixed two problems together here: > > Firstly, we need to scale cpu power in update_cpu_power() regarding > uArch, frequency and rt/irq pressure. > Here the freq related value we get back from arch_scale_freq_power(..., > cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)). > > Secondly, to be able to scale the runnable avg sum of a sched entity > (se->avg->runnable_avg_sum), we preferable have a coefficient > representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most > powerful cpu in the system) and another coefficient (avg freq over 'now AFAICT, the coefficient representing uArch diffs is already taken into account into power_freq thanks to scale_cpu, isn't it ? > - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to > be retrieved from the arch in __update_entity_runnable_avg(). > >>> (2) Is the current layout of update_cpu_power() adequate for this, where >>> we scale power_orig related to freq and then related to rt/(irq): >>> >>> power_orig = scale_cpu(SCHED_POWER_SCALE) >>> power = scale_rt(scale_freq(power_orig)) >>> >>> or do we need an extra power_freq data member on the rq and do: >>> >>> power_orig = scale_cpu(SCHED_POWER_SCALE) >>> power_freq = scale_freq(power_orig)) >>> power = scale_rt(power_orig)) >> >> do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ? > > No, I also think that power=scale_rt(power_freq) is correct. > >>> In other words, do we consider rt/(irq) pressure when calculating freq >>> scale invariant task load or not? >> >> we should take power_freq which implies a new field > [...] > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote: > On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote: > >> power_orig is only changed for system with a SMT sched_domain level in order to > >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an > >> original capacity that is different from the default value. > >> > >> Create a more generic function arch_scale_cpu_power that can be also used by > >> non SMT platform to set power_orig. > > > > I did a quick test of the patch set with adjusting cpu_power on > > big.LITTLE (ARM TC2) to reflect the different compute capacities of the > > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and > > without the patches applied, but with non-default cpu_powers. > > > > I didn't see any difference in the load-balance. Three tasks ended up on > > the two A15s and two tasks ended up on two of the three A7s leaving one > > unused in both cases. > > > > Using default cpu_power I get one task on each of the five cpus (best > > throughput). Unless I messed something up, it seems that setting > > cpu_power doesn't give me the best throughput with these patches > > applied. > > That's normal this patchset is necessary but not enough to solve the > issue you mention. We also need to fix the way the imbalance is > calculated for such situation. I have planned to push that in another > patchset in order to not mix too much thing together Based on the commit messages I was just lead to believe that this was a self-contained patch set that also addressed issues related to handling heterogeneous systems. Maybe it would be worth mentioning that this set is only part of the solution somewhere? It is a bit unclear to me how these changes, which appear to mainly improve factoring rt and irq time into cpu_power, will solve the cpu_power issues related to heterogeneous systems. Can you share your plans for the follow up patch set? I think it would be better to review the solution as a whole. I absolutely agree that the imbalance calculation needs to fixed, but I don't think the current rq runnable_avg_sum is the right choice for that purpose for the reasons I pointed out the in other thread. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[...] >> Firstly, we need to scale cpu power in update_cpu_power() regarding >> uArch, frequency and rt/irq pressure. >> Here the freq related value we get back from arch_scale_freq_power(..., >> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)). >> >> Secondly, to be able to scale the runnable avg sum of a sched entity >> (se->avg->runnable_avg_sum), we preferable have a coefficient >> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most >> powerful cpu in the system) and another coefficient (avg freq over 'now > > AFAICT, the coefficient representing uArch diffs is already taken into > account into power_freq thanks to scale_cpu, isn't it ? True, but I can't see how the current signature of arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch and freq invariant updating of se->avg->runnable_avg_sum business. >> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to >> be retrieved from the arch in __update_entity_runnable_avg(). >> [...] -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 4 June 2014 13:17, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote: >> On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote: >> > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote: >> >> power_orig is only changed for system with a SMT sched_domain level in order to >> >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an >> >> original capacity that is different from the default value. >> >> >> >> Create a more generic function arch_scale_cpu_power that can be also used by >> >> non SMT platform to set power_orig. >> > >> > I did a quick test of the patch set with adjusting cpu_power on >> > big.LITTLE (ARM TC2) to reflect the different compute capacities of the >> > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and >> > without the patches applied, but with non-default cpu_powers. >> > >> > I didn't see any difference in the load-balance. Three tasks ended up on >> > the two A15s and two tasks ended up on two of the three A7s leaving one >> > unused in both cases. >> > >> > Using default cpu_power I get one task on each of the five cpus (best >> > throughput). Unless I messed something up, it seems that setting >> > cpu_power doesn't give me the best throughput with these patches >> > applied. >> >> That's normal this patchset is necessary but not enough to solve the >> issue you mention. We also need to fix the way the imbalance is >> calculated for such situation. I have planned to push that in another >> patchset in order to not mix too much thing together > > Based on the commit messages I was just lead to believe that this was a > self-contained patch set that also addressed issues related to handling > heterogeneous systems. Maybe it would be worth mentioning that this set > is only part of the solution somewhere? So this patch addresses the issue of describing the CPU capacity of an HMP system but doesn't solve the one task per CPU issue alone. > > It is a bit unclear to me how these changes, which appear to mainly > improve factoring rt and irq time into cpu_power, will solve the > cpu_power issues related to heterogeneous systems. Can you share your > plans for the follow up patch set? I think it would be better to review > the solution as a whole. I have planned to use load_per_task to calculate the imbalance to pull > > I absolutely agree that the imbalance calculation needs to fixed, but I > don't think the current rq runnable_avg_sum is the right choice for that > purpose for the reasons I pointed out the in other thread. > > Morten -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 5 June 2014 10:59, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote: > [...] >>> Firstly, we need to scale cpu power in update_cpu_power() regarding >>> uArch, frequency and rt/irq pressure. >>> Here the freq related value we get back from arch_scale_freq_power(..., >>> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)). >>> >>> Secondly, to be able to scale the runnable avg sum of a sched entity >>> (se->avg->runnable_avg_sum), we preferable have a coefficient >>> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most >>> powerful cpu in the system) and another coefficient (avg freq over 'now >> >> AFAICT, the coefficient representing uArch diffs is already taken into >> account into power_freq thanks to scale_cpu, isn't it ? > > True, but I can't see how the current signature of > arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch > and freq invariant updating of se->avg->runnable_avg_sum business. My 1st assumption is that the update of rq->cpu_power (or rq->cpu_power_freq as we discussed earlier) is fast enough compare to frequency scaling so that it can be used to scale the load without major error. In this case, you can use the cpu_power_orig, cpu_power fields. The update period of cpu_power is equal balance_interval which is initialized with the weight of the sched_domain (less or equal to 4ms for most of ARM platform). Most of ARM platforms use a 100 HZ jiffies so the update period will be aligned to 10ms (1 tick). If this update is not fast enough, the second possibility could be to directly access to current frequency (or something that represent it). This might be possible once the cpufreq will have been consolidated into the scheduler similarly to what is going with cpuidle Vincent > >>> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to >>> be retrieved from the arch in __update_entity_runnable_avg(). >>> > [...] > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6a84114..6dc05f7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5618,6 +5618,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu) return default_scale_smt_power(sd, cpu); } +unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu) +{ + unsigned long weight = sd->span_weight; + + if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) { + if (sched_feat(ARCH_POWER)) + return arch_scale_smt_power(sd, cpu); + else + return default_scale_smt_power(sd, cpu); + } + + return SCHED_POWER_SCALE; +} + static unsigned long scale_rt_power(int cpu) { struct rq *rq = cpu_rq(cpu); @@ -5654,18 +5668,12 @@ static unsigned long scale_rt_power(int cpu) static void update_cpu_power(struct sched_domain *sd, int cpu) { - unsigned long weight = sd->span_weight; unsigned long power = SCHED_POWER_SCALE; struct sched_group *sdg = sd->groups; - if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) { - if (sched_feat(ARCH_POWER)) - power *= arch_scale_smt_power(sd, cpu); - else - power *= default_scale_smt_power(sd, cpu); + power *= arch_scale_cpu_power(sd, cpu); - power >>= SCHED_POWER_SHIFT; - } + power >>= SCHED_POWER_SHIFT; sdg->sgp->power_orig = power;
power_orig is only changed for system with a SMT sched_domain level in order to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an original capacity that is different from the default value. Create a more generic function arch_scale_cpu_power that can be also used by non SMT platform to set power_orig. The weak behavior of arch_scale_cpu_power is the previous SMT one in order to keep backward compatibility in the use of power_orig. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> --- kernel/sched/fair.c | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-)