Message ID | CAKohpomMZ0TAN2e6N76_g4ZRzxd5vZ1XfuZfxrP7GMxfTNiLVw@mail.gmail.com |
---|---|
State | New |
Headers | show |
On 23 January 2014 19:05, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Thu, Jan 23, 2014 at 11:22:32AM +0530, Viresh Kumar wrote: >> I think below diff might get this fixed for you, though I am not sure if it >> breaks something else. Probably Thomas/Frederic can answer here. >> If this looks fine I will send it formally again: >> >> diff --git a/kernel/timer.c b/kernel/timer.c >> index accfd24..3a2c7fa 100644 >> --- a/kernel/timer.c >> +++ b/kernel/timer.c >> @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu) >> * makes sure that a CPU on the way to stop its tick can not >> * evaluate the timer wheel. >> */ >> - wake_up_nohz_cpu(cpu); >> + if (!tbase_get_deferrable(timer->base)) >> + wake_up_nohz_cpu(cpu); Wait, I got the wrong code here. That's wasn't my initial intention. I actually wanted to write something like this: - wake_up_nohz_cpu(cpu); + if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu)) + wake_up_nohz_cpu(cpu); Will that work? > So you simply rely on the next tick to see the new timer. This should work with > CONFIG_NO_HZ_IDLE but not with CONFIG_NO_HZ_FULL since the target may be running > without the tick. > > Basically, in the case of a deferrable timer you need to manage to call > wake_up_full_nohz_cpu() but not wake_up_idle_cpu(). > > It should be even possible to spare the IPI in a full dynticks CPU if it is > running idle. But that's an optional bonus because it require some deep > care on complicated races against the call to tick_nohz_idle_exit(). > > I also realize than when we enqueue a timer on a full nohz CPU, we should set_need_resched() > the target before sending the IPI if it is idle like does wake_up_idle_cpu(). Otherwise the > IPI will be ignored without exiting the idle loop nor reevaluating the tick on irq exit. > If you can fix that along the way, that will be much appreciated. I haven't thought much about this currently as I have limited knowledge of these routines. Though the problem we were facing wasn't related to NO_HZ_FULL. It was just about waking up an idle cpu. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sorry was away for short vacation. On 28 January 2014 19:20, Frederic Weisbecker <fweisbec@gmail.com> wrote: > On Thu, Jan 23, 2014 at 07:50:40PM +0530, Viresh Kumar wrote: >> Wait, I got the wrong code here. That's wasn't my initial intention. >> I actually wanted to write something like this: >> >> - wake_up_nohz_cpu(cpu); >> + if (!tbase_get_deferrable(timer->base) || idle_cpu(cpu)) >> + wake_up_nohz_cpu(cpu); >> >> Will that work? Something is seriously wrong with me, again wrote rubbish code. Let me phrase what I wanted to write :) "don't send IPI to a idle CPU for a deferrable timer." Probably I code it correctly this time atleast. - wake_up_nohz_cpu(cpu); + if (!(tbase_get_deferrable(timer->base) && idle_cpu(cpu))) + wake_up_nohz_cpu(cpu); > Well, this is going to wake up the target from its idle state, which is > what we want to avoid if the timer is deferrable, right? Yeah, sorry for doing it for second time :( > The simplest thing we want is: > > if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(cpu)) > wake_up_nohz_cpu(cpu); > > This spares the IPI for the common case where the timer is deferrable and we run > in periodic or dynticks-idle mode (which should be 99.99% of the existing workloads). I wasn't looking at this problem with NO_HZ_FULL in mind. As I thought its only about if the CPU is idle or not. And so the solution I was talking about was: "don't send IPI to a idle CPU for a deferrable timer." But I see that still failing with the code you wrote. For normal cases where we don't enable NO_HZ_FULL, we will still end up waking up idle CPUs which is what Lei Wen reported initially. Also if a CPU is marked for NO_HZ_FULL and is not idle currently then we wouldn't send a IPI for a deferrable timer. But we actually need that, so that we can reevaluate the timers order again? > Then we can later optimize that and spare the IPI on full dynticks CPUs when they run > idle, but that require some special care about subtle races which can't be dealt > with a simple test on "idle_cpu(target)". And power consumption in full dynticks > is already very suboptimized anyway. > > So I suggest we start simple with the above test, and a big fat comment which explains > what we are doing and what needs to be done in the future. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 29 January 2014 10:57, Preeti Murthy <preeti.lkml@gmail.com> wrote: > How about simplifying this design by doing the below? > > 1. Since anyway cpufreq governors monitor load on the cpu once every > 5ms, *tie it with tick_sched_timer*, which also gets deferred when the cpu > enters nohz_idle. Its configurable. We can change sampling time to whatever we want. Some might be selecting it as 30 ms. > 2. To overcome the problem of running this job of monitoring the load > on every cpu, have the *time keeping* cpu do it for you. > > The time keeping cpu has the property that if it has to go to idle, it will do > so and let the next cpu that runs the periodic timer become the time keeper. > Hence no cpu is prevented from entering nohz_idle and the cpu that is busy > and first executes periodic timer will take over as the time keeper. > > The result would be: > > 1. One cpu at any point in time will be monitoring cpu load, at every sched tick > as long as its busy. If it goes to sleep, then it gives up this duty > and enters idle. > The next cpu that runs the periodic timer becomes the cpu to monitor the load > and will continue to do so as long as its busy. Hence we do not miss monitoring > the cpu load. > > 2. This will avoid an additional timer for cpufreq. > > 3. It avoids sending IPIs each time this timer gets modified since there is just > one CPU doing the monitoring. > > 4. The downside to this could be that we are stretching the functions of the > periodic timer into the power management domain which does not seem like > the right thing to do. Looks good, but AFAIK timerkeeper is still to be implemented? Also the best solution is to get rid of this timer completely by getting inputs from scheduler. Probably some ARM/Linaro folks are working on it. -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12 February 2014 20:36, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> The change I'm applying is strongly inspired from the above. Can I use your Signed-off-by?
Sure :)
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/kernel/timer.c b/kernel/timer.c index accfd24..3a2c7fa 100644 --- a/kernel/timer.c +++ b/kernel/timer.c @@ -940,7 +940,8 @@ void add_timer_on(struct timer_list *timer, int cpu) * makes sure that a CPU on the way to stop its tick can not * evaluate the timer wheel. */ - wake_up_nohz_cpu(cpu); + if (!tbase_get_deferrable(timer->base)) + wake_up_nohz_cpu(cpu); spin_unlock_irqrestore(&base->lock, flags); } EXPORT_SYMBOL_GPL(add_timer_on);