Message ID | 1522223215-23524-1-git-send-email-vincent.guittot@linaro.org |
---|---|
State | New |
Series | sched: support dynamiQ cluster |
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
> Arm DynamiQ system can integrate cores with different micro architecture
> or max OPP under the same DSU so we can have cores with different compute
> capacity at the LLC (which was not the case with legacy big/LITTLE
> architecture). Such configuration is similar in some way to ITMT on intel
> platform which allows some cores to be boosted to higher turbo frequency
> than others and which uses SD_ASYM_PACKING feature to ensure that CPUs with
> highest capacity will always be used in priority in order to provide
> maximum throughput.
>
> Add arch_asym_cpu_priority() for arm64 as this function is used to
> differentiate CPUs in the scheduler. The CPU's capacity is used to order
> CPUs in the same DSU.
>
> Create sched domain topology level for arm64 so we can set SD_ASYM_PACKING
> at MC level.
>
> Some tests have been done on a hikey960 platform (quad cortex-A53,
> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
> has been modified so the 8 heterogeneous cores are described as being part
> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.
>
> Results below show the time in seconds to run sysbench --test=cpu with an
> increasing number of threads. The sysbench test was run 32 times
>
>              without patch     with patch      diff
> 1 threads    11.04(+/- 30%)    8.86(+/- 0%)    -19%
> 2 threads     5.59(+/- 14%)    4.43(+/- 0%)    -20%
> 3 threads     3.80(+/- 13%)    2.95(+/- 0%)    -22%
> 4 threads     3.10(+/- 12%)    2.22(+/- 0%)    -28%
> 5 threads     2.47(+/-  5%)    1.95(+/- 0%)    -21%
> 6 threads     2.09(+/-  0%)    1.73(+/- 0%)    -17%
> 7 threads     1.64(+/-  0%)    1.56(+/- 0%)    - 7%
> 8 threads     1.42(+/-  0%)    1.42(+/- 0%)      0%
>
> Results show better and more stable results across iterations with the patch
> compared to mainline because we are always using big cores in priority whereas
> with mainline, the scheduler randomly chooses a big or a little core when
> there are more cores than number of threads.
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> ---
>
> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
> to enable this dynamically at boot time by detecting the system topology.
>
>  arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 2186853..cb6705e5 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>  	}
>  }
>
> +#ifdef CONFIG_SCHED_MC
> +unsigned int __read_mostly arm64_sched_asym_enabled;
> +
> +int arch_asym_cpu_priority(int cpu)
> +{
> +	return topology_get_cpu_scale(NULL, cpu);
> +}
> +
> +static inline int arm64_sched_dynamiq(void)
> +{
> +	return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
> +}
> +
> +static int arm64_core_flags(void)
> +{
> +	return cpu_core_flags() | arm64_sched_dynamiq();
> +}
> +#endif
> +
> +static struct sched_domain_topology_level arm64_topology[] = {
> +#ifdef CONFIG_SCHED_MC
> +	{ cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },

Maybe stick this in a macro to avoid the double #ifdef?

Will
On 28 March 2018 at 11:12, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote:
>>
>> The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
>> to enable this dynamically at boot time by detecting the system topology.
>>
>>  arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
>>  1 file changed, 30 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
>> index 2186853..cb6705e5 100644
>> --- a/arch/arm64/kernel/topology.c
>> +++ b/arch/arm64/kernel/topology.c
>> @@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
>>  	}
>>  }
>>
>> +#ifdef CONFIG_SCHED_MC
>> +unsigned int __read_mostly arm64_sched_asym_enabled;
>> +
>> +int arch_asym_cpu_priority(int cpu)
>> +{
>> +	return topology_get_cpu_scale(NULL, cpu);
>> +}
>> +
>> +static inline int arm64_sched_dynamiq(void)
>> +{
>> +	return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
>> +}
>> +
>> +static int arm64_core_flags(void)
>> +{
>> +	return cpu_core_flags() | arm64_sched_dynamiq();
>> +}
>> +#endif
>> +
>> +static struct sched_domain_topology_level arm64_topology[] = {
>> +#ifdef CONFIG_SCHED_MC
>> +	{ cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
>
> Maybe stick this in a macro to avoid the double #ifdef?

ok, I will do that in next version

Vincent

>
> Will
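For illustration only, the macro suggested above could look roughly like the sketch below. It is not taken from any later version of the patch. The struct is a simplified stand-in for struct sched_domain_topology_level, the flag values are arbitrary placeholders, and ARM64_MC_LEVEL, the "DIE" entry and first_level_flags() are hypothetical names added so the example builds outside the kernel tree.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder definitions so the sketch builds outside the kernel tree.
 * The flag values are arbitrary, not the kernel's. */
#define CONFIG_SCHED_MC		1
#define SD_SHARE_PKG_RESOURCES	0x1
#define SD_ASYM_PACKING		0x2

/* Simplified stand-in for struct sched_domain_topology_level. */
struct topo_level {
	int (*sd_flags)(void);	/* may be NULL: level adds no flags */
	const char *name;	/* NULL terminates the table */
};

static unsigned int arm64_sched_asym_enabled;

static int cpu_core_flags(void)
{
	return SD_SHARE_PKG_RESOURCES;
}

#ifdef CONFIG_SCHED_MC
static int arm64_core_flags(void)
{
	return cpu_core_flags() |
	       (arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0);
}

/* Hypothetical helper macro: expands to the MC entry, comma included. */
#define ARM64_MC_LEVEL	{ arm64_core_flags, "MC" },
#else
/* Without SCHED_MC the entry simply disappears from the table. */
#define ARM64_MC_LEVEL
#endif

/* The table itself no longer needs an #ifdef. */
static struct topo_level arm64_topology[] = {
	ARM64_MC_LEVEL
	{ NULL, "DIE" },
	{ NULL, NULL },
};

/* Flags contributed by the first level of the table. */
static int first_level_flags(void)
{
	assert(arm64_topology[0].name != NULL);
	return arm64_topology[0].sd_flags ? arm64_topology[0].sd_flags() : 0;
}
```

With this shape the only #ifdef block is the one that defines the flags helper and the macro, and toggling arm64_sched_asym_enabled changes whether the MC level reports SD_ASYM_PACKING.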
On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote: > Arm DynamiQ system can integrate cores with different micro architecture > or max OPP under the same DSU so we can have cores with different compute > capacity at the LLC (which was not the case with legacy big/LITTLE > architecture). Such configuration is similar in some way to ITMT on intel > platform which allows some cores to be boosted to higher turbo frequency > than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with > highest capacity, will always be used in priortiy in order to provide > maximum throughput. > > Add arch_asym_cpu_priority() for arm64 as this function is used to > differentiate CPUs in the scheduler. The CPU's capacity is used to order > CPUs in the same DSU. > > Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING > at MC level. > > Some tests have been done on a hikey960 platform (quad cortex-A53, > quad cortex-A73). For the test purpose, the CPUs topology of the hikey960 > has been modified so the 8 heterogeneous cores are described as being part > of the same cluster and sharing resources (MC level) like with a DynamiQ DSU. > > Results below show the time in seconds to run sysbench --test=cpu with an > increasing number of threads. The sysbench test run 32 times > > without patch with patch diff > 1 threads 11.04(+/- 30%) 8.86(+/- 0%) -19% > 2 threads 5.59(+/- 14%) 4.43(+/- 0%) -20% > 3 threads 3.80(+/- 13%) 2.95(+/- 0%) -22% > 4 threads 3.10(+/- 12%) 2.22(+/- 0%) -28% > 5 threads 2.47(+/- 5%) 1.95(+/- 0%) -21% > 6 threads 2.09(+/- 0%) 1.73(+/- 0%) -17% > 7 threads 1.64(+/- 0%) 1.56(+/- 0%) - 7% > 8 threads 1.42(+/- 0%) 1.42(+/- 0%) 0% > > Results show a better and stable results across iteration with the patch > compared to mainline because we are always using big cores in priority whereas > with mainline, the scheduler randomly choose a big or a little cores when > there are more cores than number of threads. 
> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
> mainline whereas it stays in the range [8.85..8.87] with the patch

Using ASYM_PACKING is essentially an easier but somewhat less accurate
way to achieve the same behaviour for big.LITTLE systems as with the
"misfit task" series that has been under review here for the last couple of
months.

As I see it, the main difference is that ASYM_PACKING attempts to pack
all tasks regardless of task utilization on the higher capacity cpus
whereas the "misfit task" series carefully picks cpus with tasks they
can't handle so we don't risk migrating tasks which are perfectly
suitable for a little cpu to a big cpu unnecessarily. Also it is
based directly on utilization and cpu capacity like the capacity
awareness we already have to deal with big.LITTLE in the wake-up path.
Furthermore, it should work for all big.LITTLE systems regardless of the
topology, where I think ASYM_PACKING might not work well for systems
with separate big and little sched_domains.

Have you tried taking the misfit patches for a spin on your setup? I
expect them to give you the same behaviour as you report above.

Morten
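As a rough illustration of the utilization-based check described above, the sketch below tests whether a task "fits" a CPU. It is a simplified model, not code from the misfit series: the function names are hypothetical, and the fixed headroom of about 20% (CAPACITY_MARGIN = 1280 against a scale of 1024) is an assumption chosen to match the 80% threshold mentioned later in the thread.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE	1024	/* capacity of the biggest CPU */
#define CAPACITY_MARGIN		1280	/* assumed: task fits below ~80% of capacity */

/* A task fits when its utilization plus headroom stays below the CPU capacity. */
static bool task_fits_capacity(unsigned long util, unsigned long capacity)
{
	assert(capacity <= SCHED_CAPACITY_SCALE);
	return capacity * SCHED_CAPACITY_SCALE > util * CAPACITY_MARGIN;
}

/* Misfit view: only a task that no longer fits its current CPU is a
 * candidate for migration to a higher-capacity CPU. */
static bool task_is_misfit(unsigned long util, unsigned long cur_capacity)
{
	return !task_fits_capacity(util, cur_capacity);
}
```

For example, with an illustrative little-CPU capacity of 512, a task with utilization 300 stays where it is, while a task with utilization 450 is flagged as misfit. ASYM_PACKING, by contrast, takes no utilization input at all.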
Hi Morten, On 29 March 2018 at 14:53, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Wed, Mar 28, 2018 at 09:46:55AM +0200, Vincent Guittot wrote: >> Arm DynamiQ system can integrate cores with different micro architecture >> or max OPP under the same DSU so we can have cores with different compute >> capacity at the LLC (which was not the case with legacy big/LITTLE >> architecture). Such configuration is similar in some way to ITMT on intel >> platform which allows some cores to be boosted to higher turbo frequency >> than others and which uses SD_ASYM_PACKING feature to ensures that CPUs with >> highest capacity, will always be used in priortiy in order to provide >> maximum throughput. >> >> Add arch_asym_cpu_priority() for arm64 as this function is used to >> differentiate CPUs in the scheduler. The CPU's capacity is used to order >> CPUs in the same DSU. >> >> Create sched domain topolgy level for arm64 so we can set SD_ASYM_PACKING >> at MC level. >> >> Some tests have been done on a hikey960 platform (quad cortex-A53, >> quad cortex-A73). For the test purpose, the CPUs topology of the hikey960 >> has been modified so the 8 heterogeneous cores are described as being part >> of the same cluster and sharing resources (MC level) like with a DynamiQ DSU. >> >> Results below show the time in seconds to run sysbench --test=cpu with an >> increasing number of threads. 
The sysbench test run 32 times >> >> without patch with patch diff >> 1 threads 11.04(+/- 30%) 8.86(+/- 0%) -19% >> 2 threads 5.59(+/- 14%) 4.43(+/- 0%) -20% >> 3 threads 3.80(+/- 13%) 2.95(+/- 0%) -22% >> 4 threads 3.10(+/- 12%) 2.22(+/- 0%) -28% >> 5 threads 2.47(+/- 5%) 1.95(+/- 0%) -21% >> 6 threads 2.09(+/- 0%) 1.73(+/- 0%) -17% >> 7 threads 1.64(+/- 0%) 1.56(+/- 0%) - 7% >> 8 threads 1.42(+/- 0%) 1.42(+/- 0%) 0% >> >> Results show a better and stable results across iteration with the patch >> compared to mainline because we are always using big cores in priority whereas >> with mainline, the scheduler randomly choose a big or a little cores when >> there are more cores than number of threads. >> With 1 thread, the test duration varies in the range [8.85 .. 15.86] for >> mainline whereas it stays in the range [8.85..8.87] with the patch > > Using ASYM_PACKING is essentially an easier but somewhat less accurate > way to achieve the same behaviour for big.LITTLE system as with the > "misfit task" series that been under review here for the last couple of > months. I think that it's not exactly the same goal although if it's probably close but ASYM_PACKING ensures that the maximum compute capacity is used. > > As I see it, the main differences is that ASYM_PACKING attempts to pack > all tasks regardless of task utilization on the higher capacity cpus > whereas the "misfit task" series carefully picks cpus with tasks they > can't handle so we don't risk migrating tasks which are perfectly That's one main difference because misfit task will let middle range load task on little CPUs which will not provide maximum performance. I have put an example below > suitable to for a little cpu to a big cpu unnecessarily. Also it is > based directly on utilization and cpu capacity like the capacity > awareness we already have to deal with big.LITTLE in the wake-up path. 
> Furthermore, it should work for all big.LITTLE systems regardless of the
> topology, where I think ASYM_PACKING might not work well for systems
> with separate big and little sched_domains.

I haven't looked in detail at whether ASYM_PACKING can work correctly on
legacy big/little as I was mainly focused on the dynamiQ config, but I guess
that it might also work.

>
> Have you tried taking the misfit patches for a spin on your setup? I
> expect them to give you the same behaviour as you report above.

So I have tried both your tests and mine on both patchsets and they
provide the same results, which is somewhat expected as the benches are run
for several seconds.
In order to highlight the main difference between misfit task and
ASYM_PACKING, I have reused your test and reduced the number of
max-request for sysbench so that the test duration was in the range of
hundreds of ms.

Hikey960 (emulate dynamiq topology)
          min        avg(stdev)         max
misfit    0.097500   0.114911(+- 10%)   0.138500
asym      0.092500   0.106072(+-  6%)   0.122900

In this case, we can see that ASYM_PACKING is doing better (8%)
because it migrates sysbench threads to the big cores as soon as they are
available, whereas misfit task has to wait for the utilization to
increase above 80%, which takes around 70ms when starting with a
utilization of zero.

Regards,
Vincent

>
> Morten
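The "around 70ms" figure can be sanity-checked with a small model of the load-tracking ramp. The sketch below assumes a geometric signal with a 32 ms half-life and 1 ms steps, which is a simplification of the kernel's PELT accounting (it uses 1024 us periods); ms_to_reach() and the decay constant are illustrative names, not kernel symbols.

```c
#include <assert.h>

/* Per-millisecond decay factor y with y^32 = 0.5 (32 ms half-life, assumed). */
#define PELT_DECAY_PER_MS 0.97857206

/* Milliseconds of continuous running until utilization, starting from
 * zero, reaches `fraction` of its maximum value. */
static int ms_to_reach(double fraction)
{
	double remaining = 1.0;	/* 1 - util/max, shrinks by y every ms */
	int ms = 0;

	assert(fraction > 0.0 && fraction < 1.0);
	while (1.0 - remaining < fraction) {
		remaining *= PELT_DECAY_PER_MS;
		ms++;
	}
	return ms;
}
```

Under these assumptions ms_to_reach(0.8) evaluates to 75, since 32 * log2(5) is about 74.3, which is in line with the roughly 70ms quoted above for a task starting with zero utilization.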
Hi, On 30/03/18 13:34, Vincent Guittot wrote: > Hi Morten, > [..] >> >> As I see it, the main differences is that ASYM_PACKING attempts to pack >> all tasks regardless of task utilization on the higher capacity cpus >> whereas the "misfit task" series carefully picks cpus with tasks they >> can't handle so we don't risk migrating tasks which are perfectly > > That's one main difference because misfit task will let middle range > load task on little CPUs which will not provide maximum performance. > I have put an example below > >> suitable to for a little cpu to a big cpu unnecessarily. Also it is >> based directly on utilization and cpu capacity like the capacity >> awareness we already have to deal with big.LITTLE in the wake-up path. I think that bit is quite important. AFAICT, ASYM_PACKING disregards task utilization, it only makes sure that (with your patch) tasks will be migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or later on during nohz balance). I didn't see anything related to ASYM_PACKING in the wake path. >> Have to tried taking the misfit patches for a spin on your setup? I >> expect them give you the same behaviour as you report above. > > So I have tried both your tests and mine on both patchset and they > provide same results which is somewhat expected as the benches are run > for several seconds. > In other to highlight the main difference between misfit task and > ASYM_PACKING, I have reused your test and reduced the number of > max-request for sysbench so that the test duration was in the range of > hundreds ms. 
>
> Hikey960 (emulate dynamiq topology)
>           min        avg(stdev)         max
> misfit    0.097500   0.114911(+- 10%)   0.138500
> asym      0.092500   0.106072(+-  6%)   0.122900
>
> In this case, we can see that ASYM_PACKING is doing better( 8%)
> because it migrates sysbench threads on big core as soon as they are
> available whereas misfit task has to wait for the utilization to
> increase above the 80% which takes around 70ms when starting with an
> utilization that is null
>

I believe ASYM_PACKING behaves better here because the workload is only
sysbench threads. As stated above, since task utilization is disregarded, I
think we could have a scenario where the big CPUs are filled with "small"
tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly
matters here is the order in which the tasks spawn, not their utilization -
which is potentially broken.

There's that bit in *update_sd_pick_busiest()*:

	/* No ASYM_PACKING if target CPU is already busy */
	if (env->idle == CPU_NOT_IDLE)
		return true;

So I'm not entirely sure how realistic that scenario is, but I suppose it
could still happen. Food for thought in any case.

Regards,
Valentin
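To make the behaviour under discussion concrete, here is a heavily simplified model of the ASYM_PACKING pull decision. It is a sketch, not the logic of update_sd_pick_busiest(): asym_packing_pulls() is a hypothetical helper, the capacity table is illustrative rather than measured, and the priority function only mirrors the idea in the patch that a CPU's capacity is its priority.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Illustrative layout: CPUs 0-3 little, CPUs 4-7 big. Values are not measured. */
static const int cpu_capacity[NR_CPUS] = {
	512, 512, 512, 512, 1024, 1024, 1024, 1024
};

/* As in the patch: the CPU's capacity is used as its priority. */
static int arch_asym_cpu_priority(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu_capacity[cpu];
}

static bool sched_asym_prefer(int a, int b)
{
	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
}

/* Would dst_cpu pull work from src_cpu under ASYM_PACKING?
 * Task utilization is deliberately not an input. */
static bool asym_packing_pulls(int dst_cpu, bool dst_idle,
			       int src_cpu, int src_nr_running)
{
	if (!dst_idle)		/* no ASYM_PACKING if target CPU is already busy */
		return false;
	return src_nr_running > 0 && sched_asym_prefer(dst_cpu, src_cpu);
}
```

In this model an idle big CPU pulls from any little CPU that has something running, whatever the size of the task, while a busy big CPU never does. That is the property behind the "order of spawning" concern raised above.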
Hi Valentin, On 3 April 2018 at 00:27, Valentin Schneider <valentin.schneider@arm.com> wrote: > Hi, > > On 30/03/18 13:34, Vincent Guittot wrote: >> Hi Morten, >> > [..] >>> >>> As I see it, the main differences is that ASYM_PACKING attempts to pack >>> all tasks regardless of task utilization on the higher capacity cpus >>> whereas the "misfit task" series carefully picks cpus with tasks they >>> can't handle so we don't risk migrating tasks which are perfectly >> >> That's one main difference because misfit task will let middle range >> load task on little CPUs which will not provide maximum performance. >> I have put an example below >> >>> suitable to for a little cpu to a big cpu unnecessarily. Also it is >>> based directly on utilization and cpu capacity like the capacity >>> awareness we already have to deal with big.LITTLE in the wake-up path. > > I think that bit is quite important. AFAICT, ASYM_PACKING disregards > task utilization, it only makes sure that (with your patch) tasks will be > migrated to big CPUS if those ever go idle (pulls at NEWLY_IDLE balance or > later on during nohz balance). I didn't see anything related to ASYM_PACKING > in the wake path. > >>> Have to tried taking the misfit patches for a spin on your setup? I >>> expect them give you the same behaviour as you report above. >> >> So I have tried both your tests and mine on both patchset and they >> provide same results which is somewhat expected as the benches are run >> for several seconds. >> In other to highlight the main difference between misfit task and >> ASYM_PACKING, I have reused your test and reduced the number of >> max-request for sysbench so that the test duration was in the range of >> hundreds ms. 
>> >> Hikey960 (emulate dynamiq topology) >> min avg(stdev) max >> misfit 0.097500 0.114911(+- 10%) 0.138500 >> asym 0.092500 0.106072(+- 6%) 0.122900 >> >> In this case, we can see that ASYM_PACKING is doing better( 8%) >> because it migrates sysbench threads on big core as soon as they are >> available whereas misfit task has to wait for the utilization to >> increase above the 80% which takes around 70ms when starting with an >> utilization that is null >> > > I believe ASYM_PACKING behaves better here because the workload is only > sysbench threads. As stated above, since task utilization is disregarded, I It behaves better because it doesn't wait for the task's utilization to reach a level before assuming the task needs high compute capacity. The utilization gives an idea of the running time of the task not the performance level that is needed > think we could have a scenario where the big CPUs are filled with "small" > tasks and the LITTLE CPUs hold a few "big" tasks - because what mostly > matters here is the order in which the tasks spawn, not their utilization - > which is potentially broken. > > There's that bit in *update_sd_pick_busiest()*: > > /* No ASYM_PACKING if target CPU is already busy */ > if (env->idle == CPU_NOT_IDLE) > return true; > > So I'm not entirely sure how realistic that scenario is, but I suppose it > could still happen. Food for thought in any case. > > Regards, > Valentin
Hi,

On 03/04/18 13:17, Vincent Guittot wrote:
> Hi Valentin,
>
[...]
>>
>> I believe ASYM_PACKING behaves better here because the workload is only
>> sysbench threads. As stated above, since task utilization is disregarded, I
>
> It behaves better because it doesn't wait for the task's utilization
> to reach a level before assuming the task needs high compute capacity.
> The utilization gives an idea of the running time of the task not the
> performance level that is needed
>

That's my point actually. ASYM_PACKING disregards utilization and moves those
threads to the big cores ASAP, which is good here because it's just sysbench
threads.

What I meant was that if the task composition changes, IOW we mix "small"
tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
sysbench threads), we shouldn't assume all of those need to run on a big
CPU. The thing is, ASYM_PACKING can't tell the difference between those, so
it'll all come down to which task spawned first.

Furthermore, ASYM_PACKING will forcefully move tasks via active balance
regardless of the imbalance as long as a big CPU is idle.

So we could have a scenario where loads of "small" tasks spawn, and they all
get moved to a big CPU until they're all full (because they're periodic tasks
so the big CPUs will eventually be idle and will pull another task as long as
they get some idle time).

Then, before the load tracking signals of those tasks ramp up high enough
that the load balancer would try to move those to LITTLE CPUs, some "big"
tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
balanced so nothing will be done.

I acknowledge this all sounds convoluted but I hope it highlights what I
think could go wrong with ASYM_PACKING on asymmetric systems.

Regards,
Valentin
On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: > Hi, > > On 03/04/18 13:17, Vincent Guittot wrote: >> Hi Valentin, >> > [...] >>> >>> I believe ASYM_PACKING behaves better here because the workload is only >>> sysbench threads. As stated above, since task utilization is disregarded, I >> >> It behaves better because it doesn't wait for the task's utilization >> to reach a level before assuming the task needs high compute capacity. >> The utilization gives an idea of the running time of the task not the >> performance level that is needed >> > > That's my point actually. ASYM_PACKING disregards utilization and moves those > threads to the big cores ASAP, which is good here because it's just sysbench > threads. > > What I meant was that if the task composition changes, IOW we mix "small" > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like > sysbench threads), we shouldn't assume all of those require to run on a big > CPU. The thing is, ASYM_PACKING can't make the difference between those, so That's the 1st point where I tend to disagree: why big cores are only for long running task and periodic stuff can't need to run on big cores to get max compute capacity ? You make the assumption that only long running tasks need high compute capacity. This patch wants to always provide max compute capacity to the system and not only long running task > it'll all come down to which task spawned first. > > Furthermore, ASYM_PACKING will forcefully move tasks via active balance > regardless of the imbalance as long as a big CPU is idle. > > So we could have a scenario where loads of "small" tasks spawn, and they all > get moved to a big CPU until they're all full (because they're periodic tasks > so the big CPUs will eventually be idle and will pull another task as long as > they get some idle time). 
>
> Then, before the load tracking signals of those tasks ramp up high enough
> that the load balancer would try to move those to LITTLE CPUs, some "big"
> tasks spawn. They get scheduled on LITTLE CPUs, and now the system will look
> balanced so nothing will be done.

As explained above, as long as the big CPUs are always used, I don't
think it's a problem.
What is a problem is a task staying on a little CPU while a big CPU
is idle, because in that case we could provide more throughput.

>
>
> I acknowledge this all sounds convoluted but I hope it highlights what I
> think could go wrong with ASYM_PACKING on asymmetric systems.
>
> Regards,
> Valentin
On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: > On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: > > Hi, > > > > On 03/04/18 13:17, Vincent Guittot wrote: > >> Hi Valentin, > >> > > [...] > >>> > >>> I believe ASYM_PACKING behaves better here because the workload is only > >>> sysbench threads. As stated above, since task utilization is disregarded, I > >> > >> It behaves better because it doesn't wait for the task's utilization > >> to reach a level before assuming the task needs high compute capacity. > >> The utilization gives an idea of the running time of the task not the > >> performance level that is needed > >> > > > > That's my point actually. ASYM_PACKING disregards utilization and moves those > > threads to the big cores ASAP, which is good here because it's just sysbench > > threads. > > > > What I meant was that if the task composition changes, IOW we mix "small" > > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like > > sysbench threads), we shouldn't assume all of those require to run on a big > > CPU. The thing is, ASYM_PACKING can't make the difference between those, so > > That's the 1st point where I tend to disagree: why big cores are only > for long running task and periodic stuff can't need to run on big > cores to get max compute capacity ? > You make the assumption that only long running tasks need high compute > capacity. This patch wants to always provide max compute capacity to > the system and not only long running task There is no way we can tell if a periodic or short-running tasks requires the compute capacity of a big core or not based on utilization alone. The utilization can only tell us if a task could potentially use more compute capacity, i.e. the utilization approaches the compute capacity of its current cpu. How we handle low utilization tasks comes down to how we define "performance" and if we care about the cost of "performance" (e.g. 
energy consumption).

Placing a low utilization task on a little cpu should always be fine
from a _throughput_ point of view. As long as the cpu has spare cycles it
means that work isn't piling up faster than it can be processed.
However, from a _latency_ (completion time) point of view it might be a
problem, and for latency sensitive tasks I can agree that going for max
capacity might be a better choice.

The misfit patches place tasks based on utilization to ensure that
tasks get the _throughput_ they need if possible. This is in line with
the placement policy we have in select_task_rq_fair() already.

We shouldn't forget that what we are discussing here is the default
behaviour when we don't have sufficient knowledge about the tasks in the
scheduler. So we are looking for a reasonable middle-of-the-road policy that
doesn't kill your performance or the battery. If user-space has its own
opinion about performance requirements it is free to use task affinity
to control which cpu the task ends up on and ensure that the task gets
max capacity always. On top of that we have had interfaces in Android
for years to specify performance requirements for task (groups) to allow
small tasks to be placed on big cpus and big tasks to be placed on little
cpus depending on their requirements. It is even tied into cpufreq as
well. A lot of effort has gone into Android to get this balance right.
Patrick is working hard on upstreaming some of those features.

In the bigger picture always going for max capacity is not desirable for
a well-configured big.LITTLE system. You would never exploit the advantage
of the little cpus as you always use big first and only use little when
the bigs are overloaded, at which point having little cpus at all makes
little sense. Vendors build big.LITTLE systems because they want a
better performance/energy trade-off; if they wanted max capacity always,
they would just build big-only systems.

If we were that concerned about latency, DVFS would be a problem too
and we would use nothing but the performance governor. So seen in the
bigger picture I have to disagree that blindly going for max capacity is
the right default policy for big.LITTLE. As soon as we involve an energy
model in the task placement decisions, it definitely isn't.

Morten
Hi Morten, On 5 April 2018 at 17:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: >> On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: >> > Hi, >> > >> > On 03/04/18 13:17, Vincent Guittot wrote: >> >> Hi Valentin, >> >> >> > [...] >> >>> >> >>> I believe ASYM_PACKING behaves better here because the workload is only >> >>> sysbench threads. As stated above, since task utilization is disregarded, I >> >> >> >> It behaves better because it doesn't wait for the task's utilization >> >> to reach a level before assuming the task needs high compute capacity. >> >> The utilization gives an idea of the running time of the task not the >> >> performance level that is needed >> >> >> > >> > That's my point actually. ASYM_PACKING disregards utilization and moves those >> > threads to the big cores ASAP, which is good here because it's just sysbench >> > threads. >> > >> > What I meant was that if the task composition changes, IOW we mix "small" >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like >> > sysbench threads), we shouldn't assume all of those require to run on a big >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so >> >> That's the 1st point where I tend to disagree: why big cores are only >> for long running task and periodic stuff can't need to run on big >> cores to get max compute capacity ? >> You make the assumption that only long running tasks need high compute >> capacity. This patch wants to always provide max compute capacity to >> the system and not only long running task > > There is no way we can tell if a periodic or short-running tasks > requires the compute capacity of a big core or not based on utilization > alone. The utilization can only tell us if a task could potentially use > more compute capacity, i.e. the utilization approaches the compute > capacity of its current cpu. 
> > How we handle low utilization tasks comes down to how we define > "performance" and if we care about the cost of "performance" (e.g. > energy consumption). > > Placing a low utilization task on a little cpu should always be fine > from _throughput_ point of view. As long as the cpu has spare cycles it I disagree, throughput is not only a matter of spare cycle it's also a matter of how fast you compute the work like with IO activity as an example > means that work isn't piling up faster than it can be processed. > However, from a _latency_ (completion time) point of view it might be a > problem, and for latency sensitive tasks I can agree that going for max > capacity might be better choice. > > The misfit patches places tasks based on utilization to ensure that > tasks get the _throughput_ they need if possible. This is in line with > the placement policy we have in select_task_rq_fair() already. > > We shouldn't forget that what we are discussing here is the default > behaviour when we don't have sufficient knowledge about the tasks in the > scheduler. So we are looking a reasonable middle-of-the-road policy that > doesn't kill your performance or the battery. If user-space has its own But misfit task kills performance and might also kills your battery as it doesn't prevent small task to run on big cores The default behavior of the scheduler is to provide max _throughput_ not middle performance and then side activity can mitigate the power impact like frequency scaling or like EAS which tries to optimize the usage of energy when system is not overloaded. With misfit task, you make the assumption that short task on little core is the best placement to do even for a performance PoV. It seems that you make some power/performance assumption without using an energy model which can make such decision. This is all the interest of EAS. 
> opinion about performance requirements it is free to use task affinity > to control which cpu the task end up on and ensure that the task gets > max capacity always. On top of that we have had interfaces in Android > for years to specify performance requirements for task (groups) to allow > small tasks to be placed on big cpus and big task to be placed on little > cpus depending on their requirements. It is even tied into cpufreq as > well. A lot of effort has gone into Android to get this balance right. > Patrick is working hard on upstreaming some of those features. > > In the bigger picture always going for max capacity is not desirable for > well-configured big.LITTLE system. You would never exploit the advantage > of the little cpus as you always use big first and only use little when > the bigs are overloaded at which point having little cpus at all makes If i'm not wrong misfit task patchset doesn't prevent little task to run on big core > little sense. Vendors build big.LITTLE systems because they want a > better performance/energy trade-off, if they wanted max capacity always, > they would just built big-only systems. And that's all the purpose of the EAS patchset. EAS patchset is there to put some energy awareness in the scheduler decision. There is 2 running mode for EAS: one when there is spare cycles so tasks can be placed to optimize energy consumption. And one when the system or part of the system is overloaded and it goes back to default performance mode because there is no interest for energy efficiency and we just want to provide max performance. So the asym packing fits with this latter mode as it provide the max compute capacity to the default mode and doesn't break EAS as it uses the load balance which is disable by EAS in not overloaded mode Vincent > > If we would be that concerned about latency, DVFS would be a problem too > and we would use nothing but the performance governor. 
So seen in the > bigger picture I have to disagree that blindly going for max capacity is > the right default policy for big.LITTLE. As soon as we involve a energy > model in the task placement decisions, it definitely isn't. > > Morten
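The utilization-based placement argued over in this message can be pictured with a small user-space model. This is only a minimal sketch and not code from the misfit patch set: the names task_fits_cpu(), task_is_misfit() and MARGIN_PCT, and the 80% threshold, are assumptions made for illustration. Capacity and utilization are taken to share the scheduler's usual 0..1024 scale.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical threshold: a task "fits" a CPU while its utilization
 * stays below 80% of that CPU's capacity. The value is an assumption. */
#define MARGIN_PCT 80UL

/* Both arguments use the same scale (1024 == biggest CPU at max OPP). */
static bool task_fits_cpu(unsigned long task_util, unsigned long cpu_capacity)
{
	assert(cpu_capacity > 0);
	return task_util * 100UL < cpu_capacity * MARGIN_PCT;
}

/* A task is treated as a misfit when it no longer fits its current CPU
 * and a CPU with more capacity exists in the system. */
static bool task_is_misfit(unsigned long task_util,
			   unsigned long cur_capacity,
			   unsigned long max_capacity)
{
	return !task_fits_cpu(task_util, cur_capacity) &&
	       cur_capacity < max_capacity;
}
```

Note that a low-utilization task is never flagged by such a check, whichever CPU it runs on, which is the behaviour both sides of the thread describe: small tasks are neither pulled to big cpus nor kept off them.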
On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote: > Hi Morten, > > On 5 April 2018 at 17:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: > >> On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: > >> > Hi, > >> > > >> > On 03/04/18 13:17, Vincent Guittot wrote: > >> >> Hi Valentin, > >> >> > >> > [...] > >> >>> > >> >>> I believe ASYM_PACKING behaves better here because the workload is only > >> >>> sysbench threads. As stated above, since task utilization is disregarded, I > >> >> > >> >> It behaves better because it doesn't wait for the task's utilization > >> >> to reach a level before assuming the task needs high compute capacity. > >> >> The utilization gives an idea of the running time of the task not the > >> >> performance level that is needed > >> >> > >> > > >> > That's my point actually. ASYM_PACKING disregards utilization and moves those > >> > threads to the big cores ASAP, which is good here because it's just sysbench > >> > threads. > >> > > >> > What I meant was that if the task composition changes, IOW we mix "small" > >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like > >> > sysbench threads), we shouldn't assume all of those require to run on a big > >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so > >> > >> That's the 1st point where I tend to disagree: why big cores are only > >> for long running task and periodic stuff can't need to run on big > >> cores to get max compute capacity ? > >> You make the assumption that only long running tasks need high compute > >> capacity. This patch wants to always provide max compute capacity to > >> the system and not only long running task > > > > There is no way we can tell if a periodic or short-running tasks > > requires the compute capacity of a big core or not based on utilization > > alone. 
The utilization can only tell us if a task could potentially use > > more compute capacity, i.e. the utilization approaches the compute > > capacity of its current cpu. > > > > How we handle low utilization tasks comes down to how we define > > "performance" and if we care about the cost of "performance" (e.g. > > energy consumption). > > > > Placing a low utilization task on a little cpu should always be fine > > from _throughput_ point of view. As long as the cpu has spare cycles it > > I disagree, throughput is not only a matter of spare cycle it's also a > matter of how fast you compute the work like with IO activity as an > example From a cpu centric point of view it is, but I agree that from a application/user point of view completion time might impact throughput too. For example of if your throughput depends on how fast you can offload work to some peripheral device (GPU for example). However, as I said in the beginning we don't know what the task does. > > means that work isn't piling up faster than it can be processed. > > However, from a _latency_ (completion time) point of view it might be a > > problem, and for latency sensitive tasks I can agree that going for max > > capacity might be better choice. > > > > The misfit patches places tasks based on utilization to ensure that > > tasks get the _throughput_ they need if possible. This is in line with > > the placement policy we have in select_task_rq_fair() already. > > > > We shouldn't forget that what we are discussing here is the default > > behaviour when we don't have sufficient knowledge about the tasks in the > > scheduler. So we are looking a reasonable middle-of-the-road policy that > > doesn't kill your performance or the battery. If user-space has its own > > But misfit task kills performance and might also kills your battery as > it doesn't prevent small task to run on big cores As I said it is not perfect for all use-cases, it is middle-of-the-road approach. 
But I strongly disagree that it is always a bad choice for both energy and performance as you suggest. ASYM_PACKING doesn't guarantee max "throughput" (by your definition) either as you may fill up your big cores with smaller tasks leaving the big tasks behind on little cpus. > The default behavior of the scheduler is to provide max _throughput_ > not middle performance and then side activity can mitigate the power > impact like frequency scaling or like EAS which tries to optimize the > usage of energy when system is not overloaded. That view doesn't fit very well with all activities around integrating cpufreq and the scheduler. Frequency scaling is an important factor in optimizing the throughput. > With misfit task, you > make the assumption that short task on little core is the best > placement to do even for a performance PoV. I never said it was the best placement, I said it was a reasonable default policy for big.LITTLE systems. > It seems that you make > some power/performance assumption without using an energy model which > can make such decision. This is all the interest of EAS. I'm trying to see the bigger picture where you seem not to. The ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric view and the default policy I'm suggesting doesn't violate that view. Your own code in group_is_overloaded() follows this view as it is utilization based and happily accepts partially utilized groups as being fine without need to be offloaded despite you could have multiple tasks waiting to execute. CFS doesn't not provide any latency guarantees, but we of course do the best we can within reason to minimize it. Seen in the bigger picture I would consider going for max capacity for big.LITTLE systems more aggressive than using the performance cpufreq govenor. Nobody does the latter for battery powered devices, hence I don't see why anyone would to go big-always for big.LITTLE systems. 
> > > opinion about performance requirements it is free to use task affinity > > to control which cpu the task end up on and ensure that the task gets > > max capacity always. On top of that we have had interfaces in Android > > for years to specify performance requirements for task (groups) to allow > > small tasks to be placed on big cpus and big task to be placed on little > > cpus depending on their requirements. It is even tied into cpufreq as > > well. A lot of effort has gone into Android to get this balance right. > > Patrick is working hard on upstreaming some of those features. > > > > In the bigger picture always going for max capacity is not desirable for > > well-configured big.LITTLE system. You would never exploit the advantage > > of the little cpus as you always use big first and only use little when > > the bigs are overloaded at which point having little cpus at all makes > > If i'm not wrong misfit task patchset doesn't prevent little task to > run on big core It does not, in fact it doesn't touch small tasks at all, that is not the point of the patch set. The point is to make sure that big tasks don't get stuck on little cpus. IOW, a selective little to big migration based on task utilization. > > > little sense. Vendors build big.LITTLE systems because they want a > > better performance/energy trade-off, if they wanted max capacity always, > > they would just built big-only systems. > > And that's all the purpose of the EAS patchset. EAS patchset is there > to put some energy awareness in the scheduler decision. There is 2 > running mode for EAS: one when there is spare cycles so tasks can be > placed to optimize energy consumption. And one when the system or part > of the system is overloaded and it goes back to default performance > mode because there is no interest for energy efficiency and we just > want to provide max performance. 
So the asym packing fits with this > latter mode as it provide the max compute capacity to the default mode > and doesn't break EAS as it uses the load balance which is disable by > EAS in not overloaded mode We still care about energy even when we are overutilized. We really don't want a vastly different placement policy depending on whether we are overutilized or not if we can avoid it as the situation changes frequently in many real world scenarios. With ASYM_PACKING everything could suddenly shift to big cpus if a little cpu is suddenly overutilized. With the misfit patches, we would detect exactly which little cpu that needs help, migrate the misfit task and everything will return to non-overutilized. That is why I said that ASYM_PACKING is incompatible with energy-aware scheduling and we would need the misfit patches anyway. Morten
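The remark about group_is_overloaded() being utilization based can be illustrated with a simplified user-space model. This is a sketch under stated assumptions, loosely following the shape of that check and not the kernel's definition: struct group_stats, its field names, and group_overloaded() are hypothetical, and the imbalance percentage is passed in as a plain parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group summary; field names are illustrative only. */
struct group_stats {
	unsigned int nr_running;   /* runnable tasks in the group */
	unsigned int nr_cpus;      /* CPUs in the group */
	unsigned long capacity;    /* summed CPU capacity */
	unsigned long util;        /* summed utilization */
};

/* The group counts as overloaded only if it has more tasks than CPUs
 * AND its utilization, scaled by imbalance_pct, exceeds its capacity. */
static bool group_overloaded(const struct group_stats *g,
			     unsigned int imbalance_pct)
{
	assert(g != NULL);
	if (g->nr_running <= g->nr_cpus)
		return false;
	return g->capacity * 100UL < g->util * imbalance_pct;
}
```

With such a test, a group holding more tasks than cpus but little total utilization is still reported as fine, even though some of those tasks may be waiting to run.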
Hi Morten, On 6 April 2018 at 14:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote: >> Hi Morten, >> >> On 5 April 2018 at 17:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote: >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: >> >> On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: [snip] >> >> > What I meant was that if the task composition changes, IOW we mix "small" >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like >> >> > sysbench threads), we shouldn't assume all of those require to run on a big >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so >> >> >> >> That's the 1st point where I tend to disagree: why big cores are only >> >> for long running task and periodic stuff can't need to run on big >> >> cores to get max compute capacity ? >> >> You make the assumption that only long running tasks need high compute >> >> capacity. This patch wants to always provide max compute capacity to >> >> the system and not only long running task >> > >> > There is no way we can tell if a periodic or short-running tasks >> > requires the compute capacity of a big core or not based on utilization >> > alone. The utilization can only tell us if a task could potentially use >> > more compute capacity, i.e. the utilization approaches the compute >> > capacity of its current cpu. >> > >> > How we handle low utilization tasks comes down to how we define >> > "performance" and if we care about the cost of "performance" (e.g. >> > energy consumption). >> > >> > Placing a low utilization task on a little cpu should always be fine >> > from _throughput_ point of view. 
As long as the cpu has spare cycles it >> >> I disagree, throughput is not only a matter of spare cycle it's also a >> matter of how fast you compute the work like with IO activity as an >> example > > From a cpu centric point of view it is, but I agree that from a > application/user point of view completion time might impact throughput > too. For example of if your throughput depends on how fast you can > offload work to some peripheral device (GPU for example). > > However, as I said in the beginning we don't know what the task does. I agree but that's not what you do with misfit as you assume long running task has higher priority but not shorter running tasks > >> > means that work isn't piling up faster than it can be processed. >> > However, from a _latency_ (completion time) point of view it might be a >> > problem, and for latency sensitive tasks I can agree that going for max >> > capacity might be better choice. >> > >> > The misfit patches places tasks based on utilization to ensure that >> > tasks get the _throughput_ they need if possible. This is in line with >> > the placement policy we have in select_task_rq_fair() already. >> > >> > We shouldn't forget that what we are discussing here is the default >> > behaviour when we don't have sufficient knowledge about the tasks in the >> > scheduler. So we are looking a reasonable middle-of-the-road policy that >> > doesn't kill your performance or the battery. If user-space has its own >> >> But misfit task kills performance and might also kills your battery as >> it doesn't prevent small task to run on big cores > > As I said it is not perfect for all use-cases, it is middle-of-the-road > approach. But I strongly disagree that it is always a bad choice for mmh ... I never said that it's always a bad choice; I said that it can also easily make bad choice and kills performance and / or battery. 
In fact, we can't really predict the behavior of the system as short running tasks can be randomly put on big or little cores and random behavior are impossible to predict and mitigate. > both energy and performance as you suggest. ASYM_PACKING doesn't > guarantee max "throughput" (by your definition) either as you may fill > up your big cores with smaller tasks leaving the big tasks behind on > little cpus. You didn't understand the point here. Asym ensures the max throughput to the system because it will provide the max compute capacity per seconds to the whole system and not only to some specific tasks. You assume that long running tasks must run on big cores and not short running tasks. But why filling a big core with long running task and filling a little core with short running tasks is the best choice ? Why the opposite should not be better as long as the big core is fully used ? The goal is to keep big CPU used whatever the type of tasks. then, there are other mechanism like cgroup to help sorting groups of tasks. You try to partially do 2 things at the same time > >> The default behavior of the scheduler is to provide max _throughput_ >> not middle performance and then side activity can mitigate the power >> impact like frequency scaling or like EAS which tries to optimize the >> usage of energy when system is not overloaded. > > That view doesn't fit very well with all activities around integrating > cpufreq and the scheduler. Frequency scaling is an important factor in > optimizing the throughput. > Here you didn't catch my point too. Pleas don't give me intention that I don't have. By side activity, I'm not saying that it should not consolidate the cpufreq and other framework decisions. Scheduler is the best place to consolidate CPU related decision. I'm just saying that it's an additional action taken to optimize energy. 
The scheduler doesn't use current frequency in task placement and load balancing as it assumes that max throughput is available if needed and adjust frequency to current needs > >> With misfit task, you >> make the assumption that short task on little core is the best >> placement to do even for a performance PoV. > > I never said it was the best placement, I said it was a reasonable > default policy for big.LITTLE systems. But "The primary job for the task scheduler is to deliver the highest possible throughput with minimal latency." > >> It seems that you make >> some power/performance assumption without using an energy model which >> can make such decision. This is all the interest of EAS. > > I'm trying to see the bigger picture where you seem not to. The Thanks for helping me to get the bigger picture ;-) > ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric > view and the default policy I'm suggesting doesn't violate that view. Sorry I don't catch the sentences above > Your own code in group_is_overloaded() follows this view as it is > utilization based and happily accepts partially utilized groups as being But this is done for SMP system where all cores have same capacity and to detect when tasks can get more throughput on another CPU. ASYM_PACKING is there to add capacity awareness in the load balance when CPUs have different capacity > fine without need to be offloaded despite you could have multiple tasks > waiting to execute. > CFS doesn't not provide any latency guarantees, but > we of course do the best we can within reason to minimize it. > > Seen in the bigger picture I would consider going for max capacity for > big.LITTLE systems more aggressive than using the performance cpufreq > govenor. Nobody does the latter for battery powered devices, hence I > don't see why anyone would to go big-always for big.LITTLE systems. 
And that's why EAS exists: to make battery friendly decision > >> >> > opinion about performance requirements it is free to use task affinity >> > to control which cpu the task end up on and ensure that the task gets >> > max capacity always. On top of that we have had interfaces in Android >> > for years to specify performance requirements for task (groups) to allow >> > small tasks to be placed on big cpus and big task to be placed on little >> > cpus depending on their requirements. It is even tied into cpufreq as >> > well. A lot of effort has gone into Android to get this balance right. >> > Patrick is working hard on upstreaming some of those features. >> > >> > In the bigger picture always going for max capacity is not desirable for >> > well-configured big.LITTLE system. You would never exploit the advantage >> > of the little cpus as you always use big first and only use little when >> > the bigs are overloaded at which point having little cpus at all makes >> >> If i'm not wrong misfit task patchset doesn't prevent little task to >> run on big core > > It does not, in fact it doesn't touch small tasks at all, that is not > the point of the patch set. The point is to make sure that big tasks > don't get stuck on little cpus. IOW, a selective little to big > migration based on task utilization. > >> >> > little sense. Vendors build big.LITTLE systems because they want a >> > better performance/energy trade-off, if they wanted max capacity always, >> > they would just built big-only systems. >> >> And that's all the purpose of the EAS patchset. EAS patchset is there >> to put some energy awareness in the scheduler decision. There is 2 >> running mode for EAS: one when there is spare cycles so tasks can be >> placed to optimize energy consumption. And one when the system or part >> of the system is overloaded and it goes back to default performance >> mode because there is no interest for energy efficiency and we just >> want to provide max performance. 
So the asym packing fits with this >> latter mode as it provide the max compute capacity to the default mode >> and doesn't break EAS as it uses the load balance which is disable by >> EAS in not overloaded mode > > We still care about energy even when we are overutilized. We really > don't want a vastly different placement policy depending on whether we > are overutilized or not if we can avoid it as the situation changes > frequently in many real world scenarios. With ASYM_PACKING everything > could suddenly shift to big cpus if a little cpu is suddenly > overutilized. With the misfit patches, we would detect exactly which Not everything. The same happens with ASYM_PACKING. It doesn't blindly put everything on "big" cores and do use parallelism too. Regards, Vincent > little cpu that needs help, migrate the misfit task and everything will > return to non-overutilized. That is why I said that ASYM_PACKING is > incompatible with energy-aware scheduling and we would need the misfit > patches anyway. > > Morten
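The "use big cores in priority" behaviour defended in this message can be sketched as choosing the idle cpu with the highest priority, with priority equal to capacity as in the patch description. This is a minimal user-space model and not the load balancer itself: struct cpu_info and pick_asym_cpu() are hypothetical names, and the capacity values used with it are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU record; capacity doubles as packing priority. */
struct cpu_info {
	unsigned long capacity;
	bool idle;
};

/* Return the index of the idle CPU with the highest capacity, or -1 if
 * no CPU is idle. Ties go to the lowest index. Task utilization is
 * deliberately not an input. */
static int pick_asym_cpu(const struct cpu_info *cpus, size_t n)
{
	int best = -1;

	assert(cpus != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!cpus[i].idle)
			continue;
		if (best < 0 || cpus[i].capacity > cpus[best].capacity)
			best = (int)i;
	}
	return best;
}
```

Because the selection ignores task size, whichever task is placed first gets the big cpu, and later tasks fall back to little cpus once the big ones are busy.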
On Mon, Apr 09, 2018 at 09:34:00AM +0200, Vincent Guittot wrote: > Hi Morten, > > On 6 April 2018 at 14:58, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote: > >> Hi Morten, > >> > >> On 5 April 2018 at 17:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote: > >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: > >> >> On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote: > > [snip] > > >> >> > What I meant was that if the task composition changes, IOW we mix "small" > >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like > >> >> > sysbench threads), we shouldn't assume all of those require to run on a big > >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so > >> >> > >> >> That's the 1st point where I tend to disagree: why big cores are only > >> >> for long running task and periodic stuff can't need to run on big > >> >> cores to get max compute capacity ? > >> >> You make the assumption that only long running tasks need high compute > >> >> capacity. This patch wants to always provide max compute capacity to > >> >> the system and not only long running task > >> > > >> > There is no way we can tell if a periodic or short-running tasks > >> > requires the compute capacity of a big core or not based on utilization > >> > alone. The utilization can only tell us if a task could potentially use > >> > more compute capacity, i.e. the utilization approaches the compute > >> > capacity of its current cpu. > >> > > >> > How we handle low utilization tasks comes down to how we define > >> > "performance" and if we care about the cost of "performance" (e.g. > >> > energy consumption). > >> > > >> > Placing a low utilization task on a little cpu should always be fine > >> > from _throughput_ point of view. 
As long as the cpu has spare cycles it > >> > >> I disagree, throughput is not only a matter of spare cycle it's also a > >> matter of how fast you compute the work like with IO activity as an > >> example > > > > From a cpu centric point of view it is, but I agree that from a > > application/user point of view completion time might impact throughput > > too. For example of if your throughput depends on how fast you can > > offload work to some peripheral device (GPU for example). > > > > However, as I said in the beginning we don't know what the task does. > > I agree but that's not what you do with misfit as you assume long > running task has higher priority but not shorter running tasks Not really, as I said in the previous replies it comes down to what you see as the goal of the CFS scheduler. With the misfit patches I'm just trying to make sure that no task is overutilizing a cpu unnecessarily, as this is in line with what load-balancing does for SMP systems. Compute capacity is distributed as evenly as possible based on utilization, just like it is for load-balancing when task priorities are the same. From that point of view the misfit patches don't give long running tasks preferential treatment. However, I do agree that from a completion time point of view, low utilization tasks could suffer unnecessarily in some scenarios. I don't see optimizing for completion time of low utilization tasks as a primary goal of CFS. Wake-up balancing does try to minimize wake-up latency, but that is about it. Fork and exec balancing and the load-balancing code are all based on load and utilization. Even if we wanted to optimize for completion time, it is more tricky for asymmetric cpu capacity systems than it is for SMP. Just keeping the big cpus busy all the time isn't going to do it for many scenarios. Firstly, migrating running tasks is quite expensive, so force-migrating a short-running task could end up taking longer than letting it complete on a little cpu.
Secondly, by keeping big cpus busy at all costs you risk that longer running tasks will either end up queueing on the big cpus if you choose to enqueue them there anyway, or they could end up running on a little cpu if you go for the first available cpu, in which case you end up harming the completion time of that task instead. I'm not sure how you balance which task's completion time is more important differently than we do today based on load or utilization. The misfit patches use the latter. We could let it use load instead, although I think we have agreed in the past that comparing load to capacity isn't a great idea. Finally, keeping big cpus busy will increase the number of active migrations a lot. As said above, I see your point that completion time might suffer in some cases for low utilization tasks, but I don't see how you can fix that automagically. ASYM_PACKING has a lot of problematic side-effects. If user-space knows that completion time is important for a task, there are already ways to improve that somewhat in mainline (task priority and pinning), and more powerful solutions in the Android kernel which Patrick is currently pushing upstream. > > > > >> > means that work isn't piling up faster than it can be processed. > >> > However, from a _latency_ (completion time) point of view it might be a > >> > problem, and for latency sensitive tasks I can agree that going for max > >> > capacity might be better choice. > >> > > >> > The misfit patches places tasks based on utilization to ensure that > >> > tasks get the _throughput_ they need if possible. This is in line with > >> > the placement policy we have in select_task_rq_fair() already. > >> > > >> > We shouldn't forget that what we are discussing here is the default > >> > behaviour when we don't have sufficient knowledge about the tasks in the > >> > scheduler. So we are looking a reasonable middle-of-the-road policy that > >> > doesn't kill your performance or the battery.
If user-space has its own > >> > >> But misfit task kills performance and might also kills your battery as > >> it doesn't prevent small task to run on big cores > > > > As I said it is not perfect for all use-cases, it is middle-of-the-road > > approach. But I strongly disagree that it is always a bad choice for > > mmh ... I never said that it's always a bad choice; I said that it can > also easily make bad choice and kills performance and / or battery. You did say "But misfit task kills performance and might...", but never mind, thanks for clarifying your statement. > In > fact, we can't really predict the behavior of the system as short > running tasks can be randomly put on big or little cores and random > behavior are impossible to predict and mitigate. You can't predict the behaviour of the system either if you use ASYM_PACKING. The short running tasks may or may not be lucky to wake up when there is a big cpu idle. Performance is a best-effort thing on most modern systems. ASYM_PACKING might increase the probability that a short running task ends up on a big cpu, but at the same time it might harm predictability of completion time of long running tasks. > > both energy and performance as you suggest. ASYM_PACKING doesn't > > guarantee max "throughput" (by your definition) either as you may fill > > up your big cores with smaller tasks leaving the big tasks behind on > > little cpus. > > You didn't understand the point here. Asym ensures the max throughput > to the system because it will provide the max compute capacity per > seconds to the whole system and not only to some specific tasks. You > assume that long running tasks must run on big cores and not short > running tasks. But why filling a big core with long running task and > filling a little core with short running tasks is the best choice ? I'm fairly sure I understand your point. 
From a theoretical point of view, if migrations were free and we had no caches, always keeping the big cpus busy before using the little cpus would get us most throughput. I don't disagree with that. The issue here is that migrations aren't free, we do have caches, the CFS scheduler isn't designed to work that way, and for many real world use-cases on big.LITTLE systems people don't want to maximize global throughput, they want to maximize throughput of the important tasks at the expense of everyone else running slower even if they don't care about energy. I'm not saying that scheduling short running tasks on little cpus is always the best choice, but it seems to be a good compromise and it is in line with the existing load-balancing policy. So I see it as the least invasive solution to improve things for asymmetric cpu capacity systems. > Why the opposite should not be better as long as the big core is fully > used ? The goal is to keep big CPU used whatever the type of tasks. > then, there are other mechanism like cgroup to help sorting groups of > tasks. Because of all the side-effects I mentioned further up. If your goal is to keep the big cpus always busy, why not change the wake-up code to always prefer them instead of trying to catch them later? That seems a much more reasonable approach since you would migrate short running tasks at wake-up which is much cheaper and would only require simple tweaks to the existing capacity-aware wake-up code. Short running tasks will always be handled there, so we only need to worry about long running tasks that would be handled by the misfit patches. My worry with doing that is that big tasks might suffer from additional migrations and that the policy is too aggressive for users that care about energy, so it would have to be disabled as soon as an energy model is in use. 
> You try to partially do 2 things at the same time I'm trying to make all the effort in scheduling and OSPM come together while looking at what users need. > > > > >> The default behavior of the scheduler is to provide max _throughput_ > >> not middle performance and then side activity can mitigate the power > >> impact like frequency scaling or like EAS which tries to optimize the > >> usage of energy when system is not overloaded. > > > > That view doesn't fit very well with all activities around integrating > > cpufreq and the scheduler. Frequency scaling is an important factor in > > optimizing the throughput. > > > > Here you didn't catch my point too. Pleas don't give me intention that > I don't have. > By side activity, I'm not saying that it should not consolidate the > cpufreq and other framework decisions. Scheduler is the best place to > consolidate CPU related decision. I'm just saying that it's an > additional action taken to optimize energy. > The scheduler doesn't use current frequency in task placement and load > balancing as it assumes that max throughput is available if needed and > adjust frequency to current needs That is the whole problem with mainline scheduling and OSPM that we have been working on addressing for several years now. Energy-aware scheduling does exactly that: it considers current frequency as part of task placement, and we actively ask for a suitable frequency based on a mix of PELT utilization and user-space hints. All this goodness has already been in the Android kernel for years. Hence my point above was to say that viewing frequency selection as a "side activity" doesn't fit with what is being proposed for energy-aware scheduling. > > > > >> With misfit task, you > >> make the assumption that short task on little core is the best > >> placement to do even for a performance PoV. > > > > I never said it was the best placement, I said it was a reasonable > > default policy for big.LITTLE systems.
> > But "The primary job for the task scheduler is to deliver the highest > possible throughput with minimal latency." I'm not sure where that quote is coming from, but I think I have already covered to a great extent above why optimizing aggressively for keeping the big cpus busy on asymmetric cpu capacity systems isn't necessarily the best choice. At least, if this is what we truly want, ASYM_PACKING is not a good implementation of this policy. > > > > >> It seems that you make > >> some power/performance assumption without using an energy model which > >> can make such decision. This is all the interest of EAS. > > > > I'm trying to see the bigger picture where you seem not to. The > > Thanks for helping me to get the bigger picture ;-) > > > ASYM_PACKING solution is incompatible with EAS. CFS has a cpu centric > > view and the default policy I'm suggesting doesn't violate that view. > > Sorry I don't catch the sentences above My point is that ASYM_PACKING conflicts with EAS while the misfit patches work well with EAS, and the resulting behaviour is in line with load-balancing as I already covered above. > > > Your own code in group_is_overloaded() follows this view as it is > > utilization based and happily accepts partially utilized groups as being > > But this is done for SMP system where all cores have same capacity and > to detect when tasks can get more throughput on another CPU. But you don't detect scenarios where you could improve completion time. This is where this discussion started :-) > ASYM_PACKING is there to add capacity awareness in the load balance > when CPUs have different capacity Well, one fundamental difference between asymmetric cpu capacity systems (big.LITTLE) and the existing users of ASYM_PACKING is that the existing users of ASYM_PACKING don't have any downsides from using that feature. As in, the n+1th task to be packed doesn't get punished in terms of performance just because it woke up later than the other tasks.
It is just placing tasks to improve the chances of an opportunistic
performance boost. This is not the case for asymmetric cpu capacity
systems. Using ASYM_PACKING here would mean that late wakers get
punished while early risers get treated with better throughput until
they choose to stop or get preempted because there are more tasks than
cpus.

Is it fair to favor the first tasks to wake? I think providing true
fairness, particularly on asymmetric cpu capacity systems, can only be
achieved by using a rotating scheduler, where each task takes turns
running on the fastest cpu ;-)

>
> > fine without need to be offloaded despite you could have multiple tasks
> > waiting to execute.
> > CFS doesn't provide any latency guarantees, but
> > we of course do the best we can within reason to minimize it.
> >
> > Seen in the bigger picture I would consider going for max capacity for
> > big.LITTLE systems more aggressive than using the performance cpufreq
> > governor. Nobody does the latter for battery powered devices, hence I
> > don't see why anyone would go big-always for big.LITTLE systems.
>
> And that's why EAS exists: to make battery friendly decision

True, I'm just wondering if we should spend effort supporting a use-case
which might only be of theoretical interest instead of focusing on the
problems that a lot of users care about.

> >> > opinion about performance requirements it is free to use task affinity
> >> > to control which cpu the task end up on and ensure that the task gets
> >> > max capacity always. On top of that we have had interfaces in Android
> >> > for years to specify performance requirements for task (groups) to allow
> >> > small tasks to be placed on big cpus and big task to be placed on little
> >> > cpus depending on their requirements. It is even tied into cpufreq as
> >> > well. A lot of effort has gone into Android to get this balance right.
> >> > Patrick is working hard on upstreaming some of those features.
> >> >
> >> > In the bigger picture always going for max capacity is not desirable for
> >> > well-configured big.LITTLE system. You would never exploit the advantage
> >> > of the little cpus as you always use big first and only use little when
> >> > the bigs are overloaded at which point having little cpus at all makes
> >>
> >> If i'm not wrong misfit task patchset doesn't prevent little task to
> >> run on big core
> >
> > It does not, in fact it doesn't touch small tasks at all, that is not
> > the point of the patch set. The point is to make sure that big tasks
> > don't get stuck on little cpus. IOW, a selective little to big
> > migration based on task utilization.
> >
> >>
> >> > little sense. Vendors build big.LITTLE systems because they want a
> >> > better performance/energy trade-off, if they wanted max capacity always,
> >> > they would just build big-only systems.
> >>
> >> And that's all the purpose of the EAS patchset. EAS patchset is there
> >> to put some energy awareness in the scheduler decision. There is 2
> >> running mode for EAS: one when there is spare cycles so tasks can be
> >> placed to optimize energy consumption. And one when the system or part
> >> of the system is overloaded and it goes back to default performance
> >> mode because there is no interest for energy efficiency and we just
> >> want to provide max performance. So the asym packing fits with this
> >> latter mode as it provide the max compute capacity to the default mode
> >> and doesn't break EAS as it uses the load balance which is disabled by
> >> EAS in not overloaded mode
> >
> > We still care about energy even when we are overutilized. We really
> > don't want a vastly different placement policy depending on whether we
> > are overutilized or not if we can avoid it as the situation changes
> > frequently in many real world scenarios. With ASYM_PACKING everything
> > could suddenly shift to big cpus if a little cpu is suddenly
> > overutilized. With the misfit patches, we would detect exactly which
>
> Not everything. The same happens with ASYM_PACKING. It doesn't blindly
> put everything on "big" cores and do use parallelism too.

I fail to understand your point here. ASYM_PACKING doesn't put multiple
tasks on the same cpu, but it does fill all the big cpus even if all we
really need is to migrate a single big task.

Morten
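The "selective little to big migration based on task utilization"
described above can be pictured with a small userspace sketch. This is
not the code from the misfit patches: struct task, task_fits(),
count_misfit(), the 1280/1024 headroom margin and the capacity values
used below are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY_SCALE 1024UL
/* Assumed headroom: a task "fits" if it uses less than ~80% of its cpu. */
#define FIT_MARGIN     1280UL

/* Hypothetical task descriptor, utilization on the 0..1024 scale. */
struct task {
	unsigned long util;
	unsigned long cpu_capacity;	/* capacity of the cpu it runs on */
};

/* Non-zero if the task's utilization leaves headroom on its current cpu. */
static int task_fits(const struct task *t)
{
	assert(t != NULL);
	return t->cpu_capacity * CAPACITY_SCALE > t->util * FIT_MARGIN;
}

/* Count the tasks that would be candidates for a little-to-big migration. */
static size_t count_misfit(const struct task *tasks, size_t n)
{
	size_t i, misfit = 0;

	for (i = 0; i < n; i++)
		if (!task_fits(&tasks[i]))
			misfit++;
	return misfit;
}
```

With tasks of utilization 100 and 400 on a cpu of capacity 460, and one
of utilization 400 on a cpu of capacity 1024, only the second task is
flagged. The small task is left alone, which is the contrast with a
policy that fills every big cpu first.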
On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> As said above, I see your point about completion time might suffer in
> some cases for low utilization tasks, but I don't see how you can fix
> that automagically. ASYM_PACKING has a lot of problematic side-effects.
> If user-space knows that completion time is important for a task, there
> are already ways to improve that somewhat in mainline (task priority and
> pinning), and more powerful solutions in the Android kernel which
> Patrick is currently pushing upstream.

So I tend to side with Morten on this one. I don't particularly like
ASYM_PACKING much, but we already had it for PPC and it works for the
small difference in performance ITMT has.

At the time Morten already objected to using it for ITMT, and I just
haven't had time to look into his proposal for using capacity.

But I don't see it working right for big.little/dynamiq, simply because
it is a very strong always big preference, which is against the whole
design premise of big.little (as Morten has been trying to argue).
On Thu, Apr 12, 2018 at 08:22:11PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
> > As said above, I see your point about completion time might suffer in
> > some cases for low utilization tasks, but I don't see how you can fix
> > that automagically. ASYM_PACKING has a lot of problematic side-effects.
> > If user-space knows that completion time is important for a task, there
> > are already ways to improve that somewhat in mainline (task priority and
> > pinning), and more powerful solutions in the Android kernel which
> > Patrick is currently pushing upstream.
>
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMT has.
>
> At the time Morten already objected to using it for ITMT, and I just
> haven't had time to look into his proposal for using capacity.
>
> But I don't see it working right for big.little/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premise of big.little (as Morten has been trying to argue).

In Vincent's defence, vendors do sometimes make design decisions that I
don't quite understand. So there could be users that really want a
non-energy-aware big-first policy, but as I said earlier in this thread,
that could be implemented better with a small tweak to wake_cap() and
using the misfit patches.

We would have to disable big-first policy and go with the current
migrate-big-task-to-big-cpus policy as soon as we care about energy. I'm
happy to give that a try and come up with a patch.
On 12 April 2018 at 20:22, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Apr 10, 2018 at 02:19:50PM +0100, Morten Rasmussen wrote:
>> As said above, I see your point about completion time might suffer in
>> some cases for low utilization tasks, but I don't see how you can fix
>> that automagically. ASYM_PACKING has a lot of problematic side-effects.
>> If user-space knows that completion time is important for a task, there
>> are already ways to improve that somewhat in mainline (task priority and
>> pinning), and more powerful solutions in the Android kernel which
>> Patrick is currently pushing upstream.
>
> So I tend to side with Morten on this one. I don't particularly like
> ASYM_PACKING much, but we already had it for PPC and it works for the
> small difference in performance ITMT has.
>
> At the time Morten already objected to using it for ITMT, and I just
> haven't had time to look into his proposal for using capacity.
>
> But I don't see it working right for big.little/dynamiq, simply because
> it is a very strong always big preference, which is against the whole
> design premise of big.little (as Morten has been trying to argue).

In fact, Little not only gives some better power efficiency but it also
handles some things far better, interrupt handling for example.

Nevertheless, whatever the solution, it will never fit with
big.Little/dynamiQ system without some EAS as soon as the power
efficiency is involved in the equation.

I plan to test more deeply how ASYM_PACKING works with EAS once I have
finished other ongoing activities.

>
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider <valentin.schneider@arm.com> wrote:
>> >> > Hi,
>> >> >
>> >> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> >> Hi Valentin,
>> >> >>
>> >> > [...]
>> >> >>>
>> >> >>> I believe ASYM_PACKING behaves better here because the workload is only
>> >> >>> sysbench threads. As stated above, since task utilization is disregarded, I
>> >> >>
>> >> >> It behaves better because it doesn't wait for the task's utilization
>> >> >> to reach a level before assuming the task needs high compute capacity.
>> >> >> The utilization gives an idea of the running time of the task not the
>> >> >> performance level that is needed
>> >> >>
>> >> >
>> >> > [
>> >> > That's my point actually. ASYM_PACKING disregards utilization and moves those
>> >> > threads to the big cores ASAP, which is good here because it's just sysbench
>> >> > threads.
>> >> >
>> >> > What I meant was that if the task composition changes, IOW we mix "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between those, so
>> > [Morten]
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running task
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> [Vincent]
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> [Morten]
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

[Joel]
Just wanted to say about Vincent's point on IO load throughput,
remembering from when I was playing with the iowait boost stuff: say you
have a little task that does some IO and blocks, and does so
periodically. In that scenario the task will run for little time and is
a little task by way of looking at utilization. However, if we were to
run it on the BIG CPUs, the overall throughput of the I/O activity would
be higher.

For this case, it seems it's impossible to specify the "default"
behavior correctly. Like, do we care about performance or energy more?
This seems more like a policy decision from userspace and not something
the scheduler should necessarily have to decide. Like if I/O activity is
background and not affecting the user experience.

thanks,

- Joel
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 2186853..cb6705e5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -296,6 +296,33 @@ static void __init reset_cpu_topology(void)
 	}
 }
 
+#ifdef CONFIG_SCHED_MC
+unsigned int __read_mostly arm64_sched_asym_enabled;
+
+int arch_asym_cpu_priority(int cpu)
+{
+	return topology_get_cpu_scale(NULL, cpu);
+}
+
+static inline int arm64_sched_dynamiq(void)
+{
+	return arm64_sched_asym_enabled ? SD_ASYM_PACKING : 0;
+}
+
+static int arm64_core_flags(void)
+{
+	return cpu_core_flags() | arm64_sched_dynamiq();
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, arm64_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init init_cpu_topology(void)
 {
 	reset_cpu_topology();
@@ -306,4 +333,7 @@ void __init init_cpu_topology(void)
 	 */
 	if (of_have_populated_dt() && parse_dt_topology())
 		reset_cpu_topology();
+
+	/* Set scheduler topology descriptor */
+	set_sched_topology(arm64_topology);
 }
Arm DynamiQ system can integrate cores with different micro architecture
or max OPP under the same DSU so we can have cores with different compute
capacity at the LLC (which was not the case with legacy big/LITTLE
architecture). Such configuration is similar in some way to ITMT on intel
platform which allows some cores to be boosted to higher turbo frequency
than others and which uses SD_ASYM_PACKING feature to ensure that CPUs with
highest capacity will always be used in priority in order to provide
maximum throughput.

Add arch_asym_cpu_priority() for arm64 as this function is used to
differentiate CPUs in the scheduler. The CPU's capacity is used to order
CPUs in the same DSU.

Create sched domain topology level for arm64 so we can set SD_ASYM_PACKING
at MC level.

Some tests have been done on a hikey960 platform (quad cortex-A53,
quad cortex-A73). For the test purpose, the CPUs topology of the hikey960
has been modified so the 8 heterogeneous cores are described as being part
of the same cluster and sharing resources (MC level) like with a DynamiQ DSU.

Results below show the time in seconds to run sysbench --test=cpu with an
increasing number of threads. The sysbench test is run 32 times

             without patch     with patch     diff
1 threads    11.04(+/- 30%)    8.86(+/- 0%)   -19%
2 threads     5.59(+/- 14%)    4.43(+/- 0%)   -20%
3 threads     3.80(+/- 13%)    2.95(+/- 0%)   -22%
4 threads     3.10(+/- 12%)    2.22(+/- 0%)   -28%
5 threads     2.47(+/-  5%)    1.95(+/- 0%)   -21%
6 threads     2.09(+/-  0%)    1.73(+/- 0%)   -17%
7 threads     1.64(+/-  0%)    1.56(+/- 0%)   - 7%
8 threads     1.42(+/-  0%)    1.42(+/- 0%)     0%

Results show better and stable results across iterations with the patch
compared to mainline because we are always using big cores in priority whereas
with mainline, the scheduler randomly chooses a big or a little core when
there are more cores than number of threads.
With 1 thread, the test duration varies in the range [8.85 .. 15.86] for
mainline whereas it stays in the range [8.85..8.87] with the patch

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

The SD_ASYM_PACKING flag is disabled by default and I'm preparing another patch
to enable this dynamically at boot time by detecting the system topology.

 arch/arm64/kernel/topology.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

--
2.7.4
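
To illustrate how a per-cpu priority derived from capacity orders the
CPUs of one DSU, here is a minimal userspace sketch. It is only a model:
the real SD_ASYM_PACKING logic lives in the scheduler's load balancer
and is more involved. The cpu_capacity[] values, asym_cpu_priority()
(a stand-in for arch_asym_cpu_priority()) and pick_preferred_idle_cpu()
are assumptions made for the example, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-cpu capacities: cpus 0-3 little, cpus 4-7 big. */
static const unsigned long cpu_capacity[] = {
	460, 460, 460, 460, 1024, 1024, 1024, 1024,
};
#define NR_CPUS (sizeof(cpu_capacity) / sizeof(cpu_capacity[0]))

/* Stand-in for arch_asym_cpu_priority(): priority is the cpu's capacity. */
static int asym_cpu_priority(size_t cpu)
{
	assert(cpu < NR_CPUS);
	return (int)cpu_capacity[cpu];
}

/*
 * Return the idle cpu with the highest priority, or -1 if none is idle.
 * Bit n of idle_mask is set when cpu n is idle. Ties go to the lowest id.
 */
static int pick_preferred_idle_cpu(unsigned int idle_mask)
{
	int best = -1;
	size_t cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(idle_mask & (1u << cpu)))
			continue;
		if (best < 0 ||
		    asym_cpu_priority(cpu) > asym_cpu_priority((size_t)best))
			best = (int)cpu;
	}
	return best;
}
```

With every cpu idle (mask 0xff) the sketch picks cpu 4, the first big
cpu. With only the little cpus idle (mask 0x0f) it falls back to cpu 0.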