Message ID | ad25c731a6ca1bd1269555245952f05c856a9759.1352196505.git.viresh.kumar@linaro.org
---|---
State | Accepted
Hello, Viresh.

On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
> Workqueues queue work on the current cpu, if the caller hasn't passed a
> preferred cpu. This may wake up an idle CPU, which is actually not required.
>
> This work can be processed by any CPU and so we must select a non-idle CPU
> here. This patch adds support in the workqueue framework to get preferred
> CPU details from the scheduler, instead of using the current CPU.
>
> Most of the time when a work is queued, the current cpu isn't idle and so we
> will choose it anyway. There are cases when a cpu is idle when it queues some
> work. For example, consider the following scenario:
> - A cpu has programmed a timer and is IDLE now.
> - The CPU gets into the interrupt handler due to the timer and queues a work.
>   As the CPU is currently IDLE, we can queue this work to some other CPU.

I'm pretty skeptical about this. queue_work() w/o explicit CPU assignment has always guaranteed that the work item will be executed on the same CPU. I don't think there are too many users depending on that but am not sure at all that there are none. I asked you last time that you would at the very least need to audit most users but it doesn't seem like there has been any effort there. Given the seemingly low rate of migration and the subtlety of such bugs, things could get nasty to debug.

That said, if the obtained benefit is big enough, sure, we can definitely change the behavior, which isn't all that great to begin with, and try to shake out the bugs quickly by e.g. forcing all work items to execute on different CPUs. But you're presenting a few percent of work items being migrated to a different CPU from an already active CPU, which doesn't seem like such a big benefit to me even if the migration target CPU is somewhat more efficient. How much powersaving are we talking about?

Thanks.
Hi Tejun,

On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
>
> I'm pretty skeptical about this. queue_work() w/o explicit CPU
> assignment has always guaranteed that the work item will be executed
> on the same CPU. I don't think there are too many users depending on
> that but am not sure at all that there are none. I asked you last
> time that you would at the very least need to audit most users but it
> doesn't seem like there has been any effort there.

My bad. I completely missed/forgot that comment from your earlier mails. Will do it.

> That said, if the obtained benefit is big enough, sure, we can
> definitely change the behavior, which isn't all that great to begin
> with, and try to shake out the bugs quickly by e.g. forcing all work
> items to execute on different CPUs, but you're presenting a few
> percent of work items being migrated to a different CPU from an
> already active CPU, which doesn't seem like such a big benefit to me
> even if the migration target CPU is somewhat more efficient. How much
> powersaving are we talking about?

Hmm.. I actually implemented the proposal discussed here (I know you have seen this earlier :)):

http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf

Specifically slides 12 & 19.

I haven't done many power calculations with it and have tested it more from a functionality point of view.

@Vincent: Can you add some comments here?

--
viresh
On 27 November 2012 06:19, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> Hi Tejun,
>
> On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
>> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
>>
>> I'm pretty skeptical about this. queue_work() w/o explicit CPU
>> assignment has always guaranteed that the work item will be executed
>> on the same CPU. I don't think there are too many users depending on
>> that but am not sure at all that there are none. I asked you last
>> time that you would at the very least need to audit most users but it
>> doesn't seem like there has been any effort there.
>
> My bad. I completely missed/forgot that comment from your earlier mails.
> Will do it.
>
>> That said, if the obtained benefit is big enough, sure, we can
>> definitely change the behavior, which isn't all that great to begin
>> with, and try to shake out the bugs quickly by e.g. forcing all work
>> items to execute on different CPUs, but you're presenting a few
>> percent of work items being migrated to a different CPU from an
>> already active CPU, which doesn't seem like such a big benefit to me
>> even if the migration target CPU is somewhat more efficient. How much
>> powersaving are we talking about?
>
> Hmm.. I actually implemented the proposal discussed here (I know you have
> seen this earlier :)):
>
> http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/08/lpc2012-sched-timer-workqueue.pdf
>
> Specifically slides 12 & 19.
>
> I haven't done many power calculations with it and have tested it more from
> a functionality point of view.
>
> @Vincent: Can you add some comments here?

Sorry for this late reply.

We have faced some situations on TC2 (as an example) where the tasks are running on the LITTLE cluster whereas some periodic works stay on the big cluster, so we can have one cluster that wakes up for tasks and another one that wakes up for works. We would like to consolidate the behaviour of the works with the tasks' behaviour.

Sorry, I don't have relevant figures as the patches are used with other ones which also impact the power consumption.

This series introduces the possibility to run a work on another CPU, which is necessary if we want a better correlation of task and work scheduling on the system. Most of the time queue_work() is used when a driver doesn't mind which CPU the work runs on, whereas it looks like it should be used only if you want to run locally. We would like to solve this point with the new interface that is proposed by Viresh.

Vincent

>
> --
> viresh
On Tue, 2012-11-06 at 16:08 +0530, Viresh Kumar wrote:
> Workqueues queue work on the current cpu, if the caller hasn't passed a
> preferred cpu. This may wake up an idle CPU, which is actually not required.
>
> This work can be processed by any CPU and so we must select a non-idle CPU
> here. This patch adds support in the workqueue framework to get preferred
> CPU details from the scheduler, instead of using the current CPU.
>
> Most of the time when a work is queued, the current cpu isn't idle and so we
> will choose it anyway. There are cases when a cpu is idle when it queues some
> work. For example, consider the following scenario:
> - A cpu has programmed a timer and is IDLE now.
> - The CPU gets into the interrupt handler due to the timer and queues a work.
>   As the CPU is currently IDLE, we can queue this work to some other CPU.
>
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/workqueue.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 042d221..d32efa2 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
>  	struct global_cwq *last_gcwq;
>
>  	if (cpu == WORK_CPU_UNBOUND)
> -		cpu = raw_smp_processor_id();
> +		cpu = sched_select_cpu(0);

A couple of things. The sched_select_cpu() is not cheap. It has a double loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs, and we are CPU 1023 and all other CPUs happen to be idle, we could be searching 1023 CPUs before we come up with our own. __queue_work() should be fast, as there is a reason that it is delaying the work and not running it itself.

Also, I really don't like this as a default behavior. It seems that this solution is for a very special case, and this can become very intrusive for the normal case.

To be honest, I'm uncomfortable with this approach. It seems to be fighting a symptom and not the disease. I'd rather find a way to keep work from being queued on the wrong CPU. If it is a timer, find a way to move the timer. If it is something else, let's work to fix that. Doing searches of possibly all CPUs (unlikely, but it is there) just seems wrong to me.

-- Steve

>
>  	/*
>  	 * It's multi cpu. If @work was previously on a different
> @@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
>  		if (gcwq)
>  			lcpu = gcwq->cpu;
>  		if (lcpu == WORK_CPU_UNBOUND)
> -			lcpu = raw_smp_processor_id();
> +			lcpu = sched_select_cpu(0);
>  	} else {
>  		lcpu = WORK_CPU_UNBOUND;
>  	}
On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> A couple of things. The sched_select_cpu() is not cheap. It has a double
> loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs,
> and we are CPU 1023 and all other CPUs happen to be idle, we could be
> searching 1023 CPUs before we come up with our own.

Not sure if you missed the first check in sched_select_cpu():

+int sched_select_cpu(unsigned int sd_flags)
+{
+	/* If current cpu isn't idle, don't migrate anything */
+	if (!idle_cpu(cpu))
+		return cpu;

We aren't going to search if we aren't idle.

> Also, I really don't like this as a default behavior. It seems that this
> solution is for a very special case, and this can become very intrusive
> for the normal case.

We tried with a KCONFIG option for it, which Tejun rejected.

> To be honest, I'm uncomfortable with this approach. It seems to be
> fighting a symptom and not the disease. I'd rather find a way to keep
> work from being queued on the wrong CPU. If it is a timer, find a way to
> move the timer. If it is something else, let's work to fix that. Doing
> searches of possibly all CPUs (unlikely, but it is there) just seems
> wrong to me.

As Vincent pointed out, on big LITTLE systems we just don't want to serve works on big cores. That would be wasting too much power, specially if we are going to wake up big cores.

It would be difficult to control the source driver (which queues the work) to little cores. We thought, if somebody wanted to queue work on the current cpu then they must use queue_work_on().

-- viresh
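For reference, a minimal sketch of what the full sched_select_cpu() being discussed might look like, based only on the behaviour described in this thread: an early return when the local CPU is busy, otherwise a walk of the CPU's sched domains matching sd_flags, looking for a non-idle CPU. This is an illustration, not the exact code from the posted series:

int sched_select_cpu(unsigned int sd_flags)
{
	struct sched_domain *sd;
	int cpu = smp_processor_id();
	int new_cpu = cpu;
	int i;

	/* If the current cpu isn't idle, don't migrate anything */
	if (!idle_cpu(cpu))
		return cpu;

	rcu_read_lock();
	/* Walk this cpu's domains, innermost first, for a non-idle cpu */
	for_each_domain(cpu, sd) {
		/* Skip domains that don't carry the requested flags */
		if ((sd->flags & sd_flags) != sd_flags)
			continue;

		for_each_cpu(i, sched_domain_span(sd)) {
			if (i != cpu && !idle_cpu(i)) {
				new_cpu = i;
				goto unlock;
			}
		}
	}
unlock:
	rcu_read_unlock();

	/* Everybody else is idle too; stay where we are */
	return new_cpu;
}

This also makes Steve's cost concern concrete: on a mostly idle machine the inner loop can visit a large share of the online CPUs before giving up.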
On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> > A couple of things. The sched_select_cpu() is not cheap. It has a double
> > loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs,
> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
> > searching 1023 CPUs before we come up with our own.
>
> Not sure if you missed the first check in sched_select_cpu():
>
> +int sched_select_cpu(unsigned int sd_flags)
> +{
> +	/* If current cpu isn't idle, don't migrate anything */
> +	if (!idle_cpu(cpu))
> +		return cpu;
>
> We aren't going to search if we aren't idle.

OK, we are idle, but CPU 1022 isn't. We still need a large search. But, heh, we are idle, we can spin. But then why go through this in the first place ;-)

> > Also, I really don't like this as a default behavior. It seems that this
> > solution is for a very special case, and this can become very intrusive
> > for the normal case.
>
> We tried with a KCONFIG option for it, which Tejun rejected.

Yeah, I saw that. I don't like adding KCONFIG options either. Best is to get something working that doesn't add any regressions. If you can get this to work without making *any* regressions in the normal case then I'm totally fine with that. But if this adds any issues with the normal case, then it's a show stopper.

> > To be honest, I'm uncomfortable with this approach. It seems to be
> > fighting a symptom and not the disease. I'd rather find a way to keep
> > work from being queued on the wrong CPU. If it is a timer, find a way to
> > move the timer. If it is something else, let's work to fix that. Doing
> > searches of possibly all CPUs (unlikely, but it is there) just seems
> > wrong to me.
>
> As Vincent pointed out, on big LITTLE systems we just don't want to
> serve works on big cores. That would be wasting too much power,
> specially if we are going to wake up big cores.
>
> It would be difficult to control the source driver (which queues the work)
> to little cores. We thought, if somebody wanted to queue work on the current
> cpu then they must use queue_work_on().

As Tejun has mentioned earlier, are there any assumptions anywhere that expect an unbound workqueue to not migrate? Where per-cpu variables might be used. Tejun had a good idea of forcing this to migrate the work *every* time. To not let a work item run on the same CPU that it was queued on. If it can survive that, then it is probably OK. Maybe add a config option that forces this? That way, anyone can test that this isn't an issue.

-- Steve
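To make the debug idea that Tejun and Steve keep coming back to a bit more concrete, something like the sketch below is roughly what "force a different CPU on every queue_work()" could look like. The config symbol and helper name are invented here purely for illustration; nothing like this was part of the posted series:

/*
 * Hypothetical debug aid: when no CPU is specified, always pick a CPU
 * other than the local one, so per-cpu assumptions in work handlers
 * surface quickly instead of hiding behind a low migration rate.
 */
#ifdef CONFIG_DEBUG_WORKQUEUE_FORCE_REMOTE
static int wq_debug_pick_cpu(void)
{
	int cpu = raw_smp_processor_id();
	int target = cpumask_next(cpu, cpu_online_mask);

	/* Wrap around to the first online CPU if we were the last one */
	if (target >= nr_cpu_ids)
		target = cpumask_first(cpu_online_mask);
	return target;
}
#else
static int wq_debug_pick_cpu(void)
{
	return raw_smp_processor_id();
}
#endif

__queue_work() would then call wq_debug_pick_cpu() instead of raw_smp_processor_id() when cpu == WORK_CPU_UNBOUND, and any user silently depending on same-CPU execution should break loudly with the debug option enabled.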
On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
>> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > A couple of things. The sched_select_cpu() is not cheap. It has a double
>> > loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs,
>> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
>> > searching 1023 CPUs before we come up with our own.
>>
>> Not sure if you missed the first check in sched_select_cpu():
>>
>> +int sched_select_cpu(unsigned int sd_flags)
>> +{
>> +	/* If current cpu isn't idle, don't migrate anything */
>> +	if (!idle_cpu(cpu))
>> +		return cpu;
>>
>> We aren't going to search if we aren't idle.
>
> OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
> heh, we are idle, we can spin. But then why go through this in the first
> place ;-)

By migrating it now, it will create its activity and wake up on the right CPU next time.

If migrating on any CPU seems a bit risky, we could restrict the migration to a CPU on the same node. We can pass such constraints to sched_select_cpu().

>
>> > Also, I really don't like this as a default behavior. It seems that this
>> > solution is for a very special case, and this can become very intrusive
>> > for the normal case.
>>
>> We tried with a KCONFIG option for it, which Tejun rejected.
>
> Yeah, I saw that. I don't like adding KCONFIG options either. Best is to
> get something working that doesn't add any regressions. If you can get
> this to work without making *any* regressions in the normal case then
> I'm totally fine with that. But if this adds any issues with the normal
> case, then it's a show stopper.
>
>> > To be honest, I'm uncomfortable with this approach. It seems to be
>> > fighting a symptom and not the disease. I'd rather find a way to keep
>> > work from being queued on the wrong CPU. If it is a timer, find a way to
>> > move the timer. If it is something else, let's work to fix that. Doing
>> > searches of possibly all CPUs (unlikely, but it is there) just seems
>> > wrong to me.
>>
>> As Vincent pointed out, on big LITTLE systems we just don't want to
>> serve works on big cores. That would be wasting too much power,
>> specially if we are going to wake up big cores.
>>
>> It would be difficult to control the source driver (which queues the work)
>> to little cores. We thought, if somebody wanted to queue work on the current
>> cpu then they must use queue_work_on().
>
> As Tejun has mentioned earlier, are there any assumptions anywhere that
> expect an unbound workqueue to not migrate? Where per-cpu variables
> might be used. Tejun had a good idea of forcing this to migrate the work
> *every* time. To not let a work item run on the same CPU that it was
> queued on. If it can survive that, then it is probably OK. Maybe add a
> config option that forces this? That way, anyone can test that this
> isn't an issue.
>
> -- Steve
On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
> On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> > On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
> >> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
> >> > A couple of things. The sched_select_cpu() is not cheap. It has a double
> >> > loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs,
> >> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
> >> > searching 1023 CPUs before we come up with our own.
> >>
> >> Not sure if you missed the first check in sched_select_cpu():
> >>
> >> +int sched_select_cpu(unsigned int sd_flags)
> >> +{
> >> +	/* If current cpu isn't idle, don't migrate anything */
> >> +	if (!idle_cpu(cpu))
> >> +		return cpu;
> >>
> >> We aren't going to search if we aren't idle.
> >
> > OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
> > heh, we are idle, we can spin. But then why go through this in the first
> > place ;-)
>
> By migrating it now, it will create its activity and wake up on the
> right CPU next time.
>
> If migrating on any CPU seems a bit risky, we could restrict the
> migration to a CPU on the same node. We can pass such constraints to
> sched_select_cpu().
>

That's assuming that the CPUs stay idle. Now if we move the work to another CPU and it goes idle, then it may move that again. It could end up being a ping-pong approach.

I don't think idle is a strong enough heuristic for the general case. If interrupts are constantly going off on a CPU that happens to be idle most of the time, it will constantly be moving work onto CPUs that are currently doing real work, and by doing so, it will be slowing those CPUs down.

-- Steve
On 27 November 2012 16:04, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2012-11-27 at 15:55 +0100, Vincent Guittot wrote:
>> On 27 November 2012 14:59, Steven Rostedt <rostedt@goodmis.org> wrote:
>> > On Tue, 2012-11-27 at 19:18 +0530, Viresh Kumar wrote:
>> >> On 27 November 2012 18:56, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >> > A couple of things. The sched_select_cpu() is not cheap. It has a double
>> >> > loop of domains/cpus looking for a non-idle cpu. If we have 1024 CPUs,
>> >> > and we are CPU 1023 and all other CPUs happen to be idle, we could be
>> >> > searching 1023 CPUs before we come up with our own.
>> >>
>> >> Not sure if you missed the first check in sched_select_cpu():
>> >>
>> >> +int sched_select_cpu(unsigned int sd_flags)
>> >> +{
>> >> +	/* If current cpu isn't idle, don't migrate anything */
>> >> +	if (!idle_cpu(cpu))
>> >> +		return cpu;
>> >>
>> >> We aren't going to search if we aren't idle.
>> >
>> > OK, we are idle, but CPU 1022 isn't. We still need a large search. But,
>> > heh, we are idle, we can spin. But then why go through this in the first
>> > place ;-)
>>
>> By migrating it now, it will create its activity and wake up on the
>> right CPU next time.
>>
>> If migrating on any CPU seems a bit risky, we could restrict the
>> migration to a CPU on the same node. We can pass such constraints to
>> sched_select_cpu().
>>
>
> That's assuming that the CPUs stay idle. Now if we move the work to
> another CPU and it goes idle, then it may move that again. It could end
> up being a ping-pong approach.
>
> I don't think idle is a strong enough heuristic for the general case. If
> interrupts are constantly going off on a CPU that happens to be idle
> most of the time, it will constantly be moving work onto CPUs that are
> currently doing real work, and by doing so, it will be slowing those
> CPUs down.
>

I agree that idle is probably not enough, but it's the heuristic that is currently used for selecting a CPU for a timer, and the timer also uses sched_select_cpu() in this series. So, in order to go step by step, a common interface has been introduced for selecting a CPU, and this function uses the same algorithm that the timers already use. Once we have agreed on an interface, the heuristic can be updated.

> -- Steve
>
>
Hi Tejun,

On 27 November 2012 10:49, Viresh Kumar <viresh.kumar@linaro.org> wrote:
> On 26 November 2012 22:45, Tejun Heo <tj@kernel.org> wrote:
>> On Tue, Nov 06, 2012 at 04:08:45PM +0530, Viresh Kumar wrote:
>
>> I'm pretty skeptical about this. queue_work() w/o explicit CPU
>> assignment has always guaranteed that the work item will be executed
>> on the same CPU. I don't think there are too many users depending on
>> that but am not sure at all that there are none. I asked you last
>> time that you would at the very least need to audit most users but it
>> doesn't seem like there has been any effort there.
>
> My bad. I completely missed/forgot that comment from your earlier mails.
> Will do it.

And this is the first thing I did this year :)

There are almost 1200 files which are using: queue_work, queue_delayed_work, schedule_work, schedule_delayed_work or schedule_on_each_cpu.

Obviously I can't try to understand all of them :), and so I tried to search for two strings in them: "cpu" or "processor_id". This would catch every one of these 1200-odd files which uses some variable/comment/code with cpu, smp_processor_id, raw_smp_processor_id, etc. I got around 650 files with these two searches. Then I went into these files to see if there is anything noticeable which may break due to this patch.

I got a list of files where cpu/processor_id strings are found, which may break with this patch (still can't guarantee, as I don't have knowledge of these drivers):

- arch/powerpc/mm/numa.c
- arch/s390/appldata/appldata_base.c
- arch/s390/kernel/time.c
- arch/s390/kernel/topology.c
- arch/s390/oprofile/hwsampler.c
- arch/x86/kernel/cpu/mcheck/mce.c
- arch/x86/kernel/tsc.c
- arch/x86/kvm/x86.c
- block/blk-core.c
- block/blk-throttle.c
- block/blk-genhd.c
- crypto/cryptd.c
- drivers/base/power/domain.c
- drivers/cpufreq/cpufreq.c
- drivers/hv/vmbus_drv.c
- drivers/infiniband/hw/qib/qib_iba7322.c and drivers/infiniband/hw/qib/qib_init.c
- drivers/md/dm-crypt.c
- drivers/md/md.c
- drivers/net/ethernet/sfc/efx.c
- drivers/net/ethernet/tile/tilepro.c
- drivers/net/team/team_mode_loadbalance.c
- drivers/oprofile/cpu_buffer.c
- drivers/s390/kvm/kvm_virtio.c
- drivers/scsi/fcoe/fcoe.c
- drivers/tty/sysrq.c
- drivers/xen/pcpu.c
- fs/file.c, fs/file_table.c, fs/fscache/object.c
- include/linux/stop_machine.h, kernel/stop_machine.c
- mm/percpu.c
- mm/slab.c
- mm/vmstat.c
- net/core/flow.c
- net/iucv/iucv.c

I am not sure what to do now :), can you assist?
Hello, Viresh.

On Fri, Jan 04, 2013 at 04:41:47PM +0530, Viresh Kumar wrote:
> I got a list of files where cpu/processor_id strings are found, which
> may break with this patch (still can't guarantee, as I don't have
> knowledge of these drivers):
...
> I am not sure what to do now :), can you assist?

I don't know either. Changing behavior subtly like this is hard. I usually try to spot some problem cases and try to identify patterns there. Once you identify a few of them, understanding and detecting other problem cases gets a lot easier. In this case, maybe there are too many places to audit and the problems are too subtle, and, if we *have* to do it, the only thing we can do is implement a debug option to make such problems more visible - say, always schedule to a different cpu on queue_work().

That said, at this point, the patchset doesn't seem all that convincing to me and, if I'm comprehending responses from others correctly, that seems to be the consensus. It might be a better approach to identify the specific offending workqueue users and make them handle the situation more intelligently than trying to impose the behavior on all workqueue users. At any rate, we need way more data showing this actually helps and if so why.

Thanks.
Hi Tejun,

On 4 January 2013 20:39, Tejun Heo <tj@kernel.org> wrote:
> I don't know either. Changing behavior subtly like this is hard. I
> usually try to spot some problem cases and try to identify patterns
> there. Once you identify a few of them, understanding and detecting
> other problem cases gets a lot easier. In this case, maybe there are
> too many places to audit and the problems are too subtle, and, if we
> *have* to do it, the only thing we can do is implement a debug
> option to make such problems more visible - say, always schedule to a
> different cpu on queue_work().
>
> That said, at this point, the patchset doesn't seem all that
> convincing to me and, if I'm comprehending responses from others
> correctly, that seems to be the consensus. It might be a better
> approach to identify the specific offending workqueue users and make
> them handle the situation more intelligently than trying to impose the
> behavior on all workqueue users. At any rate, we need way more data
> showing this actually helps and if so why.

I understand your concerns and, believe me, even I feel the same :) I had another idea that I wanted to share.

Firstly, the root cause of this patchset:

Myself and some others in Linaro are working on ARM's future cores: big.LITTLE systems. Here we have a few very powerful, high-power-consuming cores (big, currently A15s) and a few very efficient ones (LITTLE, currently A7s).

The ultimate goal is to save as much power as possible without compromising much on performance. For that, we want most of the stuff to run on LITTLE cores and some performance-demanding stuff on big cores. There are multiple things going on in this direction. Now, we thought A15s or big cores shouldn't be used for running small tasks like timers/workqueues, and hence this patch is an attempt towards reaching that goal.

Over that, we can do some load balancing of works within the multiple alive cpus, so that they can get done quickly. Also, we shouldn't start using an idle cpu just for processing work :)

I have another idea that we can try:

queue_work_on_any_cpu().

With this we would not break any existing code and can try to migrate old users to this new infrastructure (at least the ones which are rearming works from their work handlers). What do you say?

To take care of the cache locality issue, we can pass an argument to this routine that can provide:
- the mask of cpus to schedule this work on
OR
- the sched level (SD_LEVEL) of cpus to run it on.

Waiting for your view on it :)

--
viresh
On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
> Hi Tejun,
>
> On 4 January 2013 20:39, Tejun Heo <tj@kernel.org> wrote:
> > I don't know either. Changing behavior subtly like this is hard. I
> > usually try to spot some problem cases and try to identify patterns
> > there. Once you identify a few of them, understanding and detecting
> > other problem cases gets a lot easier. In this case, maybe there are
> > too many places to audit and the problems are too subtle, and, if we
> > *have* to do it, the only thing we can do is implement a debug
> > option to make such problems more visible - say, always schedule to a
> > different cpu on queue_work().
> >
> > That said, at this point, the patchset doesn't seem all that
> > convincing to me and, if I'm comprehending responses from others
> > correctly, that seems to be the consensus. It might be a better
> > approach to identify the specific offending workqueue users and make
> > them handle the situation more intelligently than trying to impose the
> > behavior on all workqueue users. At any rate, we need way more data
> > showing this actually helps and if so why.
>
> I understand your concerns and, believe me, even I feel the same :)
> I had another idea that I wanted to share.
>
> Firstly, the root cause of this patchset:
>
> Myself and some others in Linaro are working on ARM's future cores:
> big.LITTLE systems. Here we have a few very powerful, high-power-consuming
> cores (big, currently A15s) and a few very efficient ones (LITTLE,
> currently A7s).
>
> The ultimate goal is to save as much power as possible without compromising
> much on performance. For that, we want most of the stuff to run on LITTLE
> cores and some performance-demanding stuff on big cores. There are multiple
> things going on in this direction. Now, we thought A15s or big cores
> shouldn't be used for running small tasks like timers/workqueues, and hence
> this patch is an attempt towards reaching that goal.
>
> Over that, we can do some load balancing of works within the multiple alive
> cpus, so that they can get done quickly. Also, we shouldn't start using an
> idle cpu just for processing work :)
>
> I have another idea that we can try:
>
> queue_work_on_any_cpu().

I think this is a good idea.

> With this we would not break any existing code and can try to migrate
> old users to this new infrastructure (at least the ones which are rearming
> works from their work handlers). What do you say?
>
> To take care of the cache locality issue, we can pass an argument to
> this routine that can provide:
> - the mask of cpus to schedule this work on
> OR
> - the sched level (SD_LEVEL) of cpus to run it on.

I wouldn't give a mask. If one is needed, we could have a queue_work_on_mask_cpus(), or something. I think the "any" in the name should be good enough to let developers know that this will not be on the CPU that is called. By default, I would suggest for cache locality that we try to keep it on the same CPU. But if there's a better CPU to run on, it runs there. Also, we could still add a debug option that makes it always run on other CPUs to slap developers that don't read.

-- Steve

> Waiting for your view on it :)
>
> --
> viresh
Hello, Viresh.

On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
> Firstly, the root cause of this patchset:
>
> Myself and some others in Linaro are working on ARM's future cores:
> big.LITTLE systems. Here we have a few very powerful, high-power-consuming
> cores (big, currently A15s) and a few very efficient ones (LITTLE,
> currently A7s).
>
> The ultimate goal is to save as much power as possible without compromising
> much on performance. For that, we want most of the stuff to run on LITTLE
> cores and some performance-demanding stuff on big cores. There are multiple
> things going on in this direction. Now, we thought A15s or big cores
> shouldn't be used for running small tasks like timers/workqueues, and hence
> this patch is an attempt towards reaching that goal.

I see. My understanding of big.little is very superficial so there are very good chances that I'm utterly wrong, but to me it seems like more of a stop-gap solution than something which will have staying power, in the sense that the complexity doesn't seem to have any inherent reason other than the failure to make the big ones power-efficient enough while idle or under minor load.

Maybe this isn't specific to big.little and heterogeneous cores will be the way of the future, with big.little being just a forefront of the trend. And even if that's not the case, it definitely doesn't mean that we can't accommodate big.little, but, at this point, it does make me a bit more reluctant to make wholesale changes specifically for big.little.

> Over that, we can do some load balancing of works within the multiple alive
> cpus, so that they can get done quickly. Also, we shouldn't start using an
> idle cpu just for processing work :)

The latter part, "not using an idle cpu just for processing work", does apply to homogeneous systems too, but as I wrote earlier, work items don't spontaneously happen on an idle core. Something, including a timer, needs to kick it. So, by definition, a CPU already isn't idle when a work item starts execution on it. What am I missing here?

> I have another idea that we can try:
>
> queue_work_on_any_cpu().
>
> With this we would not break any existing code and can try to migrate
> old users to this new infrastructure (at least the ones which are rearming
> works from their work handlers). What do you say?

Yeah, this could be a better solution, I think. Plus, it's not like finding the optimal cpu is free.

> To take care of the cache locality issue, we can pass an argument to
> this routine that can provide:
> - the mask of cpus to schedule this work on
> OR
> - the sched level (SD_LEVEL) of cpus to run it on.

Let's start simple for now. If we really need it, we can always add more later.

Thanks.
On Mon, Jan 7, 2013 at 8:34 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Viresh.
>
> On Mon, Jan 07, 2013 at 03:28:33PM +0530, Viresh Kumar wrote:
>> Firstly, the root cause of this patchset:
>>
>> Myself and some others in Linaro are working on ARM's future cores:
>> big.LITTLE systems. Here we have a few very powerful, high-power-consuming
>> cores (big, currently A15s) and a few very efficient ones (LITTLE,
>> currently A7s).
>>
>> The ultimate goal is to save as much power as possible without compromising
>> much on performance. For that, we want most of the stuff to run on LITTLE
>> cores and some performance-demanding stuff on big cores. There are multiple
>> things going on in this direction. Now, we thought A15s or big cores
>> shouldn't be used for running small tasks like timers/workqueues, and hence
>> this patch is an attempt towards reaching that goal.
>
> I see. My understanding of big.little is very superficial so there
> are very good chances that I'm utterly wrong, but to me it seems like
> more of a stop-gap solution than something which will have staying
> power, in the sense that the complexity doesn't seem to have any
> inherent reason other than the failure to make the big ones
> power-efficient enough while idle or under minor load.

The two processors use different manufacturing processes - one optimised for performance, the other for really low power. So the reason is physics at this point. Other architectures are known to be playing with similar schemes. ARM's big.LITTLE is just the first one to the market.

> Maybe this isn't specific to big.little and heterogeneous cores will
> be the way of the future, with big.little being just a forefront of the
> trend. And even if that's not the case, it definitely doesn't mean
> that we can't accommodate big.little, but, at this point, it does make
> me a bit more reluctant to make wholesale changes specifically for
> big.little.

The patches aren't targeted to benefit only b.L systems, though that was definitely the trigger for our investigations. Our hope is that after presenting more analysis results from our side we'll be able to convince the community of the general usefulness of this feature.

Here are a few short introductions to big.LITTLE in case you're interested. [1][2]

[1] http://www.arm.com/files/downloads/big.LITTLE_Final.pdf
[2] http://lwn.net/Articles/481055/
On 7 January 2013 18:58, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-01-07 at 15:28 +0530, Viresh Kumar wrote:
>> I have another idea that we can try:
>>
>> queue_work_on_any_cpu().
>
> I think this is a good idea.

:) :)

>> - the mask of cpus to schedule this work on
>> OR
>> - the sched level (SD_LEVEL) of cpus to run it on.
>
> I wouldn't give a mask. If one is needed, we could have a
> queue_work_on_mask_cpus(), or something. I think the "any" in the name
> should be good enough to let developers know that this will not be on
> the CPU that is called.

Fair enough.

> By default, I would suggest for cache locality
> that we try to keep it on the same CPU. But if there's a better CPU to
> run on, it runs there.

That would break our intention behind this routine. We should run it on a cpu which we think is the best one for it power/performance-wise.

So, I would like to call the sched_select_cpu() routine from this routine to get the suggested cpu. This routine would, however, see more changes later to optimize it further.

> Also, we could still add a debug option that
> makes it always run on other CPUs to slap developers that don't read.

I don't think we need it :( It would be a new API, with zero existing users. And so, whosoever uses it should know what exactly he is doing :)

Thanks for your quick feedback.
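For the record, the wrapper being discussed here can be very thin if it simply defers the CPU choice to sched_select_cpu(). The sketch below only illustrates that shape; no such helper existed in mainline at the time of this thread and the final naming and placement were still under discussion:

/*
 * Illustrative only: queue @work on whichever CPU the scheduler
 * suggests, instead of implicitly targeting the local CPU.
 */
bool queue_work_on_any_cpu(struct workqueue_struct *wq,
			   struct work_struct *work)
{
	/*
	 * sched_select_cpu() returns the local CPU when it is busy and
	 * a nearby non-idle CPU when the local one is idle.
	 */
	return queue_work_on(sched_select_cpu(0), wq, work);
}

Callers that explicitly want local execution would keep using queue_work()/queue_work_on(), which is exactly the split Viresh is arguing for.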
[Removed Suresh and Venki from the discussion, they probably switched companies]

On 7 January 2013 20:34, Tejun Heo <tj@kernel.org> wrote:
> The latter part, "not using an idle cpu just for processing work", does
> apply to homogeneous systems too, but as I wrote earlier, work items
> don't spontaneously happen on an idle core. Something, including a
> timer, needs to kick it. So, by definition, a CPU already isn't idle
> when a work item starts execution on it. What am I missing here?

We are talking about a core being idle from the scheduler's perspective :)

>> I have another idea that we can try:
>>
>> queue_work_on_any_cpu().
>>
>> With this we would not break any existing code and can try to migrate
>> old users to this new infrastructure (at least the ones which are rearming
>> works from their work handlers). What do you say?
>
> Yeah, this could be a better solution, I think. Plus, it's not like
> finding the optimal cpu is free.

Thanks for the first part. (When I shared this idea with Vincent and Amit, I wasn't sure at all about the feedback I would get from you and others, but I am very happy now :).)

I couldn't understand the second part. We still need to search for a free cpu for this new routine, and the implementation would be almost the same as the implementation of queue_work() in my initial patch.

>> To take care of the cache locality issue, we can pass an argument to
>> this routine that can provide:
>> - the mask of cpus to schedule this work on
>> OR
>> - the sched level (SD_LEVEL) of cpus to run it on.
>
> Let's start simple for now. If we really need it, we can always add
> more later.

:) Agreed. But I liked the idea from Steven; we can have two routines: queue_work_on_any_cpu() and queue_work_on_cpus().
On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:
> > By default, I would suggest for cache locality
> > that we try to keep it on the same CPU. But if there's a better CPU to
> > run on, it runs there.
>
> That would break our intention behind this routine. We should run
> it on a cpu which we think is the best one for it power/performance-wise.

If you are running on a big.LITTLE box, sure. But for normal (read x86 ;) systems, it should probably stay on the current CPU.

> So, I would like to call the sched_select_cpu() routine from this routine to
> get the suggested cpu.

Does the caller need to know? Or if you have a big.LITTLE setup, it should just work automagically?

> This routine would, however, see more changes later to optimize it
> further.
>
> > Also, we could still add a debug option that
> > makes it always run on other CPUs to slap developers that don't read.
>
> I don't think we need it :(
> It would be a new API, with zero existing users. And so, whosoever uses
> it should know what exactly he is doing :)

Heh, you don't know kernel developers well, do you? ;-) Once a new API is added to the kernel tree, it quickly becomes (mis)used.

-- Steve
Hi Steven,

On 8 January 2013 03:59, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-01-07 at 23:29 +0530, Viresh Kumar wrote:
>
>> > By default, I would suggest for cache locality
>> > that we try to keep it on the same CPU. But if there's a better CPU to
>> > run on, it runs there.
>>
>> That would break our intention behind this routine. We should run
>> it on a cpu which we think is the best one for it power/performance-wise.
>
> If you are running on a big.LITTLE box, sure. But for normal (read x86 ;)
> systems, it should probably stay on the current CPU.

But that's not the purpose of this new call. If the caller wants it to be on the local cpu, he must not use this call. It is up to the core routine (sched_select_cpu() in our case) to decide where to launch it. If we need something special for x86, we can hack this routine. Even for non-big.LITTLE systems, we may want to run a work on a non-idle cpu, and so we can't guarantee the local cpu here.

>> So, I would like to call the sched_select_cpu() routine from this routine to
>> get the suggested cpu.
>
> Does the caller need to know? Or if you have a big.LITTLE setup, it
> should just work automagically?

The caller wouldn't know; he just needs to trust sched_select_cpu() to give the most optimum cpu. Again, it is not only for big.LITTLE, but for any SMP system where we don't want an idle cpu to do this work.

>> I don't think we need it :(
>> It would be a new API, with zero existing users. And so, whosoever uses
>> it should know what exactly he is doing :)
>
> Heh, you don't know kernel developers well, do you? ;-)

I agree with you, but we don't need to care for foolish new code here.

> Once a new API is added to the kernel tree, it quickly becomes
> (mis)used.

It's true for all pieces of code in the kernel, and we really don't need to take care of such users here :)
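To make the calling convention being argued over here concrete, the typical conversion target Viresh keeps mentioning is a driver that re-arms its own work from the handler. A sketch of what such a caller might look like follows; the workqueue, work item and do_housekeeping() helper are all invented for illustration and assume the proposed queue_work_on_any_cpu() from earlier in the thread:

/* Hypothetical driver: periodic housekeeping that doesn't care where it runs */
static struct workqueue_struct *poll_wq;

static void poll_work_fn(struct work_struct *work);
static DECLARE_WORK(poll_work, poll_work_fn);

static void poll_work_fn(struct work_struct *work)
{
	do_housekeeping();	/* made-up driver-specific helper */

	/* Re-arm without implicitly pinning to the (possibly idle) local CPU */
	queue_work_on_any_cpu(poll_wq, &poll_work);
}

/* Previously this would have been: queue_work(poll_wq, &poll_work); */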
Hello,

On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
> On 7 January 2013 20:34, Tejun Heo <tj@kernel.org> wrote:
> > The latter part, "not using an idle cpu just for processing work", does
> > apply to homogeneous systems too, but as I wrote earlier, work items
> > don't spontaneously happen on an idle core. Something, including a
> > timer, needs to kick it. So, by definition, a CPU already isn't idle
> > when a work item starts execution on it. What am I missing here?
>
> We are talking about a core being idle from the scheduler's perspective :)

But it's not like the cpu doesn't consume power if the scheduler considers it idle, right? Can you please explain in detail how this contributes to saving power? Is it primarily about routing work items to lower-power CPUs? And please don't point to presentation slides. They don't seem to explain much better, and patches and the code should be able to describe themselves. Here, more so, as the desired behavior change and the resulting powersave are rather subtle.

> > Yeah, this could be a better solution, I think. Plus, it's not like
> > finding the optimal cpu is free.
>
> Thanks for the first part. (When I shared this idea with Vincent and Amit, I
> wasn't sure at all about the feedback I would get from you and others, but I
> am very happy now :).)
>
> I couldn't understand the second part. We still need to search for a free cpu
> for this new routine, and the implementation would be almost the same as the
> implementation of queue_work() in my initial patch.

I meant that enforcing a lookup for the "optimal" cpu on queue_work() by default would add overhead to the operation, which in many cases (on most homogeneous configs) would lead to overhead without any realized benefit.

> > Let's start simple for now. If we really need it, we can always add
> > more later.
>
> :) Agreed. But I liked the idea from Steven; we can have two routines:
> queue_work_on_any_cpu() and queue_work_on_cpus().

We can talk about specifics later, but please try to *justify* each new interface. Please note that "putting work items to cpus which the scheduler considers idle" can't really be justification in itself.

Thanks.
On 10 January 2013 00:19, Tejun Heo <tj@kernel.org> wrote:
> On Mon, Jan 07, 2013 at 11:37:22PM +0530, Viresh Kumar wrote:
>> We are talking about a core being idle from the scheduler's perspective :)
>
> But it's not like the cpu doesn't consume power if the scheduler considers
> it idle, right? Can you please explain in detail how this contributes to
> saving power? Is it primarily about routing work items to lower-power
> CPUs? And please don't point to presentation slides. They don't seem
> to explain much better, and patches and the code should be able to
> describe themselves. Here, more so, as the desired behavior change
> and the resulting powersave are rather subtle.

I got your concerns.

Firstly, when a cpu is idle from the scheduler's perspective, it can still consume a lot of power. queue_work_on_any_cpu() would queue the work on another cpu only when the current cpu is idle from the scheduler's perspective, and this can only happen when the cpu was actually idle (in a low-power state), woke up due to some interrupt/timer, and is asked to queue a work.

The idea is to choose another, non-idle cpu at this point, so that the current cpu can immediately go back into a deeper idle state. With this, cpus can stay longer in deeper idle states, rather than running works. And in cases where works are rearmed from the handler, this can cause significant power loss, which could easily be saved by pushing this work to non-idle cpus.

The same approach is taken for deferred timers too; they already use such a routine.
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 042d221..d32efa2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1238,7 +1238,7 @@ static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
 	struct global_cwq *last_gcwq;
 
 	if (cpu == WORK_CPU_UNBOUND)
-		cpu = raw_smp_processor_id();
+		cpu = sched_select_cpu(0);
 
 	/*
 	 * It's multi cpu. If @work was previously on a different
@@ -1383,7 +1383,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
 		if (gcwq)
 			lcpu = gcwq->cpu;
 		if (lcpu == WORK_CPU_UNBOUND)
-			lcpu = raw_smp_processor_id();
+			lcpu = sched_select_cpu(0);
 	} else {
 		lcpu = WORK_CPU_UNBOUND;
 	}
Workqueues queue work on the current cpu, if the caller hasn't passed a preferred cpu. This may wake up an idle CPU, which is actually not required.

This work can be processed by any CPU and so we must select a non-idle CPU here. This patch adds support in the workqueue framework to get preferred CPU details from the scheduler, instead of using the current CPU.

Most of the time when a work is queued, the current cpu isn't idle and so we will choose it anyway. There are cases when a cpu is idle when it queues some work. For example, consider the following scenario:
- A cpu has programmed a timer and is IDLE now.
- The CPU gets into the interrupt handler due to the timer and queues a work. As the CPU is currently IDLE, we can queue this work to some other CPU.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/workqueue.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)