Message ID | 1393223377-5744-7-git-send-email-tuukka.tikkanen@linaro.org |
---|---|
State | New |
Headers | show |
On Mon, 24 Feb 2014, Tuukka Tikkanen wrote: > Sometimes (fairly often) when the cpuidle menu governor is making a decision > about idle state to enter the next timer for the cpu appears to expire in > the past. The menu governor expects the expiry to always be in the future > and in fact stores the time delta in an unsigned variable. However, when > the expiry is in the past, the value returned by tick_nohz_get_sleep_length > can be negative. This patch prevents using negative values, instead making > the governor return immediately similar to having latency requirement set > to 0. > > Note: As with latency == 0, the return value is 0 with no check to see if > the state 0 has been disabled or not. In your cover letter you mention some occurrences of the negative result being observed on x86. That information is worth capturing in the commit log as well to distinguish between theoretical problems from actual observations. > > Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> > --- > drivers/cpuidle/governors/menu.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c > index 71b5232..c414468 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > if (unlikely(latency_req == 0)) > return 0; > > - /* determine the expected residency time, round up */ > + /* > + * Determine the expected residency time. If the time is negative, > + * a timer interrupt has probably just expired after disabling > + * interrupts. Return as quickly as possible in the most shallow > + * state possible. tv_nsec is always positive, so only check the > + * seconds. > + */ > t = ktime_to_timespec(tick_nohz_get_sleep_length()); > + if (t.tv_sec < 0) > + return 0; > data->next_timer_us = > t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; > > -- > 1.7.9.5 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Mon, Feb 24, 2014 at 1:29 AM, Tuukka Tikkanen <tuukka.tikkanen@linaro.org> wrote: > Sometimes (fairly often) when the cpuidle menu governor is making a decision > about idle state to enter the next timer for the cpu appears to expire in > the past. The menu governor expects the expiry to always be in the future > and in fact stores the time delta in an unsigned variable. However, when > the expiry is in the past, the value returned by tick_nohz_get_sleep_length > can be negative. This patch prevents using negative values, instead making > the governor return immediately similar to having latency requirement set > to 0. > > Note: As with latency == 0, the return value is 0 with no check to see if > the state 0 has been disabled or not. > > Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> > --- > drivers/cpuidle/governors/menu.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c > index 71b5232..c414468 100644 > --- a/drivers/cpuidle/governors/menu.c > +++ b/drivers/cpuidle/governors/menu.c > @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > if (unlikely(latency_req == 0)) > return 0; > > - /* determine the expected residency time, round up */ > + /* > + * Determine the expected residency time. If the time is negative, > + * a timer interrupt has probably just expired after disabling > + * interrupts. Return as quickly as possible in the most shallow > + * state possible. tv_nsec is always positive, so only check the > + * seconds. > + */ > t = ktime_to_timespec(tick_nohz_get_sleep_length()); > + if (t.tv_sec < 0) > + return 0; > data->next_timer_us = > t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; > Are there special conditions that are necessary to provoke a negative return value? I've traced this code on several systems, and never seen a negative return value. However... I do see values up to 300.2 seconds, and those large values seem to decay at the rate of real-time so that after 5 minutes they are small, and then jump back up to 300 seconds. Some folks at Oracle debugged it down to use of NEXT_TIMER_MAX_DELTA when there is _no_ timer currently pending on that CPU. It seems this is easier to observe, the more CPUs a system has -- though I've been able to reproduce it on a system as small as a single-package 8-cpu systems. One proposed way to address this is to cap large values at 1 second. However, that will not recognize that for the period when the large value decays to under 1 second, all of those are fiction. Also, if we could identify the case where there is no future timer, it seems that re-using dev->last_residency would probably be a more useful guess than pretending we'll have a timer expire in 1 second. thanks, Len Brown, Intel Open Source Technology Cente -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
NAK tick_nohz_get_sleep_length() has exactly 1 consumer -- cpuidle. Rather than have cpuidle second guess what is returned, tick_nohz_get_sleep_length() should be updated to return a value that makes sense to cpuidle. thanks, Len Brown, Intel Open Source Technology Cent -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hi, On 6 March 2014 09:41, Len Brown <lenb@kernel.org> wrote: > > On Mon, Feb 24, 2014 at 1:29 AM, Tuukka Tikkanen > <tuukka.tikkanen@linaro.org> wrote: > > Sometimes (fairly often) when the cpuidle menu governor is making a decision > > about idle state to enter the next timer for the cpu appears to expire in > > the past. The menu governor expects the expiry to always be in the future > > and in fact stores the time delta in an unsigned variable. However, when > > the expiry is in the past, the value returned by tick_nohz_get_sleep_length > > can be negative. This patch prevents using negative values, instead making > > the governor return immediately similar to having latency requirement set > > to 0. > > > > Note: As with latency == 0, the return value is 0 with no check to see if > > the state 0 has been disabled or not. > > > > Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> > > --- > > drivers/cpuidle/governors/menu.c | 10 +++++++++- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c > > index 71b5232..c414468 100644 > > --- a/drivers/cpuidle/governors/menu.c > > +++ b/drivers/cpuidle/governors/menu.c > > @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) > > if (unlikely(latency_req == 0)) > > return 0; > > > > - /* determine the expected residency time, round up */ > > + /* > > + * Determine the expected residency time. If the time is negative, > > + * a timer interrupt has probably just expired after disabling > > + * interrupts. Return as quickly as possible in the most shallow > > + * state possible. tv_nsec is always positive, so only check the > > + * seconds. > > + */ > > t = ktime_to_timespec(tick_nohz_get_sleep_length()); > > + if (t.tv_sec < 0) > > + return 0; > > data->next_timer_us = > > t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC; > > > > Are there special conditions that are necessary to provoke a negative > return value? > I've traced this code on several systems, and never seen a negative > return value. > By changing the if statement to if (WARN_ONCE(t.tv_sec < 0, "Next timer in the past. tv_sec: %ld, tv_usec: %lu", t.tv_sec, t.tv_usec)) return 0; I eventually got [ 3092.355144] WARNING: CPU: 4 PID: 0 at drivers/cpuidle/governors/menu.c:315 menu_select+0x410/0x420() [ 3092.355145] Next timer in the past. tv_sec: -1, tv_usec: 999999645 ... [ 3092.355169] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 3.14.0-rc3+ #3 so the problem is still lurking there. However, as you can see from the timestamp, either something has changed recently and made it much much less frequent or I did not use the use case / hardware that triggers it more often. It has been quite a long time since I discovered and fixed this in my private trees, so I have forgotten the best way to reproduce this, sorry. In any case, existence is still confirmed. This was on a i7-3770K with 64 bit kernel, running make clean && make -j8 && make clean && make for kernel source tree. If I happen to remember the use case + hardware combination that triggers this rapidly, I'll let you know. That said, I agree dropping this patch and fixing the called function is a better idea. Tuukka > > However... > I do see values up to 300.2 seconds, and those large values seem to decay > at the rate of real-time so that after 5 minutes they are small, and then > jump back up to 300 seconds. > > Some folks at Oracle debugged it down to use of NEXT_TIMER_MAX_DELTA > when there is _no_ timer currently pending on that CPU. It seems this is easier > to observe, the more CPUs a system has -- though I've been able to reproduce > it on a system as small as a single-package 8-cpu systems. > > One proposed way to address this is to cap large values at 1 second. > However, that will not recognize that for the period when the large > value decays to under 1 second, all of those are fiction. > > Also, if we could identify the case where there is no future timer, > it seems that re-using dev->last_residency would probably be > a more useful guess than pretending we'll have a timer expire in 1 second. > > thanks, > Len Brown, Intel Open Source Technology Cente -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Tuukka, I've reproduced this negative on a 48 thread 2-socket Xeon during boot (seen it only once, so far). expected_us gets calculated to be -1, which is truthful, since the next timer return value was about 500ns in the past and our math truncates. This, in turn, confuses the heck out of menu's prediction of the next event to be over 500 sec in the future. I suspect more breakage is lurking, so I'll poke some more... Len Brown, Intel Open Source Technology Center -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c index 71b5232..c414468 100644 --- a/drivers/cpuidle/governors/menu.c +++ b/drivers/cpuidle/governors/menu.c @@ -302,8 +302,16 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev) if (unlikely(latency_req == 0)) return 0; - /* determine the expected residency time, round up */ + /* + * Determine the expected residency time. If the time is negative, + * a timer interrupt has probably just expired after disabling + * interrupts. Return as quickly as possible in the most shallow + * state possible. tv_nsec is always positive, so only check the + * seconds. + */ t = ktime_to_timespec(tick_nohz_get_sleep_length()); + if (t.tv_sec < 0) + return 0; data->next_timer_us = t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
Sometimes (fairly often) when the cpuidle menu governor is making a decision about idle state to enter the next timer for the cpu appears to expire in the past. The menu governor expects the expiry to always be in the future and in fact stores the time delta in an unsigned variable. However, when the expiry is in the past, the value returned by tick_nohz_get_sleep_length can be negative. This patch prevents using negative values, instead making the governor return immediately similar to having latency requirement set to 0. Note: As with latency == 0, the return value is 0 with no check to see if the state 0 has been disabled or not. Signed-off-by: Tuukka Tikkanen <tuukka.tikkanen@linaro.org> --- drivers/cpuidle/governors/menu.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)