Message ID | 1389067023-13541-1-git-send-email-john.stultz@linaro.org |
---|---|
State | Superseded |
Headers | show |
On 01/13/2014 09:51 AM, Richard Cochran wrote: > On Mon, Jan 06, 2014 at 07:57:03PM -0800, John Stultz wrote: >> I still think this is probably 3.15 or later material, but I'd be >> very interested in feedback, thoughts, and testing. > Over the weekend I did a short test of this new approach. I compiled > two kernels, one plain v3.12.7 and one with the proposed fix. The > test was run on three different kernel variants, the current nohz > kernel, the same with nohz=off, and the same but with John's patch. > > I used a simple test script (see below). Using a PCIe express card as > a PHC, the Intel Corporation 82574L, I simply let the this clock run > free (not sync'ed to gps or anything), and then synchronized the Linux > system clock to the PHC using the phc2sys program with a sample rate > of once every 32 seconds. Each of the tests ran for at least three > hours on a system without any other load. > > - Linux 3.12.7-nohz-plain-20140106 nohz-plain.log > - Linux 3.12.7-nohz-plain-20140106 NOHZ=OFF periodic-plain.log > - Linux 3.12.7-nohz-fix-20140106-00001-gd753140 nohz-fix.log > > The performance in the log files as reflected in the clock offset is > summarized in this table. The values are in nanoseconds. > > | | periodic-plain | nohz-fix | nohz-plain | > |---------+----------------+---------------+---------------| > | minimum | -1.599000e+03 | -1.051000e+03 | -5.373700e+04 | > | maximum | +1.311000e+03 | +1.048000e+03 | +6.389500e+04 | > | mean | +9.880240e-02 | -7.747305e+01 | +1.597904e+01 | > | stddev | +4.610021e+02 | +3.960978e+02 | +1.491263e+04 | > > I also made two graphs. > > http://linuxptp.sourceforge.net/nohz-fix/current_nohz.png > http://linuxptp.sourceforge.net/nohz-fix/periodic_vs_fix.png > > (The log files and scripts are also in that directory.) > > So in this test, it looks like the new approach performed at least as > well as using regular, periodic ticks. That's great to hear! Thanks so much, I really appreciate the testing! And this is with HZ=? If you do get a chance to look again, I'd also be interested if running with nohz=off w/ the fix doesn't show any regression compared to the unmodified nohz=off case. I've been trying to validate the ntpd case to ensure there aren't regressions when there's more erratic steering, but unfortunately got side-tracked with what seems to be an ntpd/virtualization issue. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Tue, Jan 28, 2014 at 9:58 AM, Richard Cochran <richardcochran@gmail.com> wrote: > I tested for a regression using the patched kernel with the nohz=off > command line option... > > On Mon, Jan 13, 2014 at 10:15:18AM -0800, John Stultz wrote: >> On 01/13/2014 09:51 AM, Richard Cochran wrote: >> > >> > - Linux 3.12.7-nohz-plain-20140106 nohz-plain.log >> > - Linux 3.12.7-nohz-plain-20140106 NOHZ=OFF periodic-plain.log >> > - Linux 3.12.7-nohz-fix-20140106-00001-gd753140 nohz-fix.log > > Regression check: > > - Linux homeboy 3.12.7-nohz-fix-20140106-00001-gd753140 NOHZ=OFF > nohz-fix-periodic.log > >> > The performance in the log files as reflected in the clock offset is >> > summarized in this table. The values are in nanoseconds. >> > >> > | | periodic-plain | nohz-fix | nohz-plain | >> > |---------+----------------+---------------+---------------| >> > | minimum | -1.599000e+03 | -1.051000e+03 | -5.373700e+04 | >> > | maximum | +1.311000e+03 | +1.048000e+03 | +6.389500e+04 | >> > | mean | +9.880240e-02 | -7.747305e+01 | +1.597904e+01 | >> > | stddev | +4.610021e+02 | +3.960978e+02 | +1.491263e+04 | > > Comparing the nohz=off case with and without the patch, the three hour > test looks like this. > > | | periodic-plain | nohz-fix-periodic | > |---------+----------------+-------------------+ > | minimum | -1.599000e+03 | -1.427000e+03 | > | maximum | +1.311000e+03 | +1.279000e+03 | > | mean | +9.880240e-02 | -2.710778e+01 | > | stddev | +4.610021e+02 | +3.974372e+02 | > >> > http://linuxptp.sourceforge.net/nohz-fix/current_nohz.png >> > http://linuxptp.sourceforge.net/nohz-fix/periodic_vs_fix.png > > I also made a third graph showing before and after the patch. > > http://linuxptp.sourceforge.net/nohz-fix/nohz_regression.png > >> If you do get a chance to look again, I'd also be interested if running >> with nohz=off w/ the fix doesn't show any regression compared to the >> unmodified nohz=off case. > > Looks like there is no regression. That's great to hear! Looks like we might be able to queue it for 3.15... Thanks so much again for the detailed testing! I really appreciate it! thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On 02/07/2014 03:45 AM, Miroslav Lichvar wrote: > On Mon, Jan 06, 2014 at 07:57:03PM -0800, John Stultz wrote: >> Got a few cycles to take another look at this, and tried to address >> Miroslav's latest comments. Please let me know if you have further >> thoughts! > I've had finally some time to look at this, sorry for the delay. > >> I also dropped the tk->ntp_error adjustments, as I've never quite >> been able to recall how that was correct, and it doesn't seem to >> have any affect in the simulator. Will have to proceed carefully >> with testing here. > I see some effect of the ntp_error correction in the simulator, but it > seems rather disruptive than helpful. Perhaps the correction was meant > to account the fact that for the ntp_error accumulation ntp_tick > should change at the time when mult is changed instead of start of the > tick. I tried to find that correction and came up with this: Yes, so I actually re-added the logic a few days after sending that patch out. I realized the issue is that for the accumulation point, we're adjusting the time forward or backwards, so with the new freq the non-accumulated offset lines up with the current time. Thus the ntp_error is the error at that accumulation point, which also needs to be adjusted appropriately (apologies its much easier to see when drawn). > (ntp_tick1 - ntp_tick2) * offset / interval + offset > > That doesn't look usable. But I don't think it's really necessary to > have any correction for that and I'm for dropping that line from the > code as your patch does. I'll have to try to look over your suggestion here another day. I unfortunately have a head cold at the moment, so any complicated math is a bit treacherous. :) > Anyway, it seems there is a different problem in the code. I modified > the simulator to see how the code works when the clocksource frequency > is not divisible by HZ. With TSC_FREQ set to 3579545 (acpi_pm freq) > ntp_error doesn't settle down and the clock has a large frequency > error. > > On top of your patch, this fixes the problem for me: > > --- a/kernel/time/timekeeping.c > +++ b/kernel/time/timekeeping.c > @@ -1065,7 +1065,7 @@ static __always_inline void timekeeping_freqadjust(struct timekeeper *tk, > > /* Calculate current error per tick */ > tick_error = ntp_tick_length() >> tk->ntp_error_shift; > - tick_error -= tk->xtime_interval; > + tick_error -= tk->xtime_interval + tk->xtime_remainder; > > /* Don't worry about correcting it if its small */ > if (likely(abs(tick_error) < 2*interval)) > > My patch has this problem too. The original code seems to be affected > to some extent, it's able to converge to the correct frequency, but > there is some residual ntp error. Adding xtime_remainder to > timekeeping_bigadjust() fixes that. > > I've some comments on the performance of the proposed code, I'll send > them in a separate mail later. Ok.. I'll look into this as well. Thanks so much for your review and fixes! thanks! -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hey Miroslav! Once again, a few months pass and I finally get some more time to look at this. :( Sorry for how slow this has been going. On 02/12/2014 07:42 AM, Miroslav Lichvar wrote: > On Mon, Jan 06, 2014 at 07:57:03PM -0800, John Stultz wrote: >> Got a few cycles to take another look at this, and tried to address >> Miroslav's latest comments. Please let me know if you have further >> thoughts! > In the simulations this version of the patch does indeed work better > than the previous one. I tested it with the ntp_error correction added > back as per your previous mail. It converges, the measured frequency > matches the value set by adjtimex and the ntp error stays close to > zero. I'm concerned about two things now and I'm not sure if they can > be fixed easily. > > One is that the convergence still seems too slow to me. > > ./tk_test -t 500 -n 4000 > samples: 1-500 reg: 1-500 slope: 1.00 dev: 1947.5 max: 36940.9 freq: -100.08081 > samples: 501-1000 reg: 501-1000 slope: 1.00 dev: 430.7 max: 1669.6 freq: -100.07168 > samples: 1001-1500 reg: 1001-1500 slope: 1.00 dev: 675.3 max: 3172.9 freq: -100.07393 > samples: 1501-2000 reg: 1501-2000 slope: 1.00 dev: 453.9 max: 2223.4 freq: -100.07177 > samples: 2001-2500 reg: 2001-2500 slope: 1.00 dev: 2601.2 max: 10875.4 freq: -100.03978 > samples: 2501-3000 reg: 2501-3000 slope: 1.00 dev: 185.6 max: 1251.6 freq: -99.99987 > samples: 3001-3500 reg: 3001-3500 slope: 1.00 dev: 160.1 max: 1181.7 freq: -99.99996 > samples: 3501-4000 reg: 3501-4000 slope: 1.00 dev: 150.7 max: 1103.2 freq: -99.99990 > > You can see in this test it takes about 2500 updates to correct the > initial ntp error and settle down. That's with 1GHz clocksource. In > some tests I did with smaller clock frequencies or different frequency > offsets it took much longer than that. So I started to look into this slow update issue. I was a little confused, as the logarithmic approximation done in the frequency correction shouldn't let *that* much initial error accumulate before get to the +1/0 adjustments. It ends up this is more a reflection of a different part of your patch. Particularly the tk->ntp_tick storage. Bascially the ntp_tick variable is a cache of the ntp_tick_length() value, however it doesn't get set to ntp_tick_length() until *after* you do the first frequency correction. Basically this avoids accumulating any error until after the first correction is made. My main concern is that this seems like an accounting error. By basically avoiding accumulating the initial error it seems it would never be corrected, no? If I add a similar ntp_tick caching to my current implementation, it converges practically as fast (with the only initial error coming from the logarithmic approximation of the divide being quite small). Similarly, if you remove the usage of tk->ntp_tick in the error accumulation loop, using ntp_tick_length() your patch behaves quite similarly to mine. Does this align with your view of the code? Or am I misunderstanding something? > $ ./tk_test -s 2500 -n 5000 > samples: 1-5000 reg: 2501-5000 slope: 1.00 dev: 135942.4 max: 390761.8 freq: -100.00000 > > Here the regression line is calculated only through the latter half of > the samples (where it was already settled down) and we can see the > total ntp error corrected since the first update is around 390 > microseconds. > > I'm not saying ntpd or linuxptp can't work with this. The PLL and PI > controllers seem to handle this slowness in the frequency adjustment > well. I'm more worried about designs that may change the frequency > quickly and rely on accurate prediction of the clock in their feedback > loop. > > The other thing I'm concerned about is that the multiplier doesn't > change only between two closest values when the loop has settled down, > but in a larger range. With the 1GHz clock I see the multiplier is > moving in range of 8387767-8387772, which makes the ntp error several > times larger than it would be if the multiplier switched just between > 8387769 and 8387770. So this point is valid. I spent some time reworking things and I'm not totally happy with it currently (there's a little mess to it), but its much closer to your implementation logically, but again just avoids the divide and keeps the accounting closer to what we already have. I'll hopefully clean up my current work and send it out tomorrow. Also I do really appreciate your submissions here and apologize I've not been able to put more time towards it recently. I also know having me rewrite your patch over and over with various mistakes is probably frustrating, but I do really want to make sure I both understand all the subtleties of your changes (which resynthesizing helps) as well as make sure the accounting is really correct (or at least not changed subtlety without clear reason). Thanks so much again! -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
On Wed, Apr 23, 2014 at 09:22:45PM -0700, John Stultz wrote: > On 02/12/2014 07:42 AM, Miroslav Lichvar wrote: > > You can see in this test it takes about 2500 updates to correct the > > initial ntp error and settle down. That's with 1GHz clocksource. In > > some tests I did with smaller clock frequencies or different frequency > > offsets it took much longer than that. > > So I started to look into this slow update issue. I was a little > confused, as the logarithmic approximation done in the frequency > correction shouldn't let *that* much initial error accumulate before get > to the +1/0 adjustments. > > It ends up this is more a reflection of a different part of your patch. > Particularly the tk->ntp_tick storage. Bascially the ntp_tick variable > is a cache of the ntp_tick_length() value, however it doesn't get set to > ntp_tick_length() until *after* you do the first frequency correction. > Basically this avoids accumulating any error until after the first > correction is made. Yes, that was one of the main points of the patch. Postponing the tick length change should remove the biggest source of the ntp error. > My main concern is that this seems like an accounting error. By > basically avoiding accumulating the initial error it seems it would > never be corrected, no? Not if we don't consider it to be something that should be corrected. When the ntp tick length is changed (by adjtimex call or on second overflow), to which ticks should be this change applied? The current code always accumulates ntp error with the current tick length, i.e. the change is effectively applied to already passed ticks since last accumulation. I'm not saying this is necessarily wrong, but it causes large ntp errors. I'm proposing to look at the frequency change in a different way and apply it at the current time in the current tick when the clock is updated. In my understanding, there are three sources of ntp error in the current code: - change in the tick length is not effectively applied at the current time in the clock update - mult is controlled by an iterative method - insufficient resolution of mult I think the first source can be removed by postpoing the tick length change as explained above. The second source can be removed by calculating mult precisely with division instead of an iterative method. There is probably nothing we can do about the third source (except switching to 64-bit mult), but it's small, predictable and can be handled easily and cheaply by the +1/0 mult adjustment. I'm still not convinced the clock can be controlled quickly and accurately without information about when will be the next clock update if the first and possibly second source of ntp error remain there. As you have probably seen when working on the patches, the requirements are in conflict and it's difficult or maybe not even possible to get something working well with all different update intervals, clock multipliers and frequency changes. From my view, as someone involved in development of algorithms controlling clocks, I'd like the clock to be as deterministic as possible. When I set the frequency, the kernel shouldn't be correcting some large unknown phase error behind my back. I still wouldn't know when exactly was the frequency actually set, but if that information was exported by adjtimex (I have some ideas how to do that), it would be perfect for me. Thanks for still working on this.
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 87b4f00..15354d4 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -1048,54 +1048,50 @@ static int __init timekeeping_init_ops(void) device_initcall(timekeeping_init_ops); /* - * If the error is already larger, we look ahead even further - * to compensate for late or lost adjustments. + * Calculate the future error caused by incorrect freq value + * and adjust the timekeeper to correct that. */ -static __always_inline int timekeeping_bigadjust(struct timekeeper *tk, - s64 error, s64 *interval, - s64 *offset) +static __always_inline void timekeeping_freqadjust(struct timekeeper *tk, + s64 interval, + s64 offset) { s64 tick_error, i; - u32 look_ahead, adj; - s32 error2, mult; + u32 adj; + s32 mult = 1; - /* - * Use the current error value to determine how much to look ahead. - * The larger the error the slower we adjust for it to avoid problems - * with losing too many ticks, otherwise we would overadjust and - * produce an even larger error. The smaller the adjustment the - * faster we try to adjust for it, as lost ticks can do less harm - * here. This is tuned so that an error of about 1 msec is adjusted - * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks). - */ - error2 = tk->ntp_error >> (NTP_SCALE_SHIFT + 22 - 2 * SHIFT_HZ); - error2 = abs(error2); - for (look_ahead = 0; error2 > 0; look_ahead++) - error2 >>= 2; + /* Calculate current error per tick */ + tick_error = ntp_tick_length() >> tk->ntp_error_shift; + tick_error -= tk->xtime_interval; - /* - * Now calculate the error in (1 << look_ahead) ticks, but first - * remove the single look ahead already included in the error. - */ - tick_error = ntp_tick_length() >> (tk->ntp_error_shift + 1); - tick_error -= tk->xtime_interval >> 1; - error = ((error - tick_error) >> look_ahead) + tick_error; + /* Don't worry about correcting it if its small */ + if (likely(abs(tick_error) < 2*interval)) + return; - /* Finally calculate the adjustment shift value. */ - i = *interval; - mult = 1; - if (error < 0) { - error = -error; - *interval = -*interval; - *offset = -*offset; - mult = -1; + if (tick_error < 0) { + interval = -interval; + offset = -offset; + mult = -mult; } - for (adj = 0; error > i; adj++) - error >>= 1; - *interval <<= adj; - *offset <<= adj; - return mult << adj; + /* Sort out the magnitude of the correction */ + tick_error = abs(tick_error); + i = abs(interval); + for (adj = 0; tick_error > i; adj++) + tick_error >>= 1; + + /* scale the corrections */ + interval <<= adj; + offset <<= adj; + mult <<= adj; + + /* + * Apply the correction to the timekeeper. + * See long comment in timekeeping_adjust to explain the math. + */ + tk->mult += mult; + tk->xtime_interval += interval; + tk->xtime_nsec -= offset; + } /* @@ -1108,65 +1104,23 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset) s64 error, interval = tk->cycle_interval; int adj; - /* - * The point of this is to check if the error is greater than half - * an interval. - * - * First we shift it down from NTP_SHIFT to clocksource->shifted nsecs. - * - * Note we subtract one in the shift, so that error is really error*2. - * This "saves" dividing(shifting) interval twice, but keeps the - * (error > interval) comparison as still measuring if error is - * larger than half an interval. - * - * Note: It does not "save" on aggravation when reading the code. - */ - error = tk->ntp_error >> (tk->ntp_error_shift - 1); - if (error > interval) { - /* - * We now divide error by 4(via shift), which checks if - * the error is greater than twice the interval. - * If it is greater, we need a bigadjust, if its smaller, - * we can adjust by 1. - */ - error >>= 2; - /* - * XXX - In update_wall_time, we round up to the next - * nanosecond, and store the amount rounded up into - * the error. This causes the likely below to be unlikely. - * - * The proper fix is to avoid rounding up by using - * the high precision tk->xtime_nsec instead of - * xtime.tv_nsec everywhere. Fixing this will take some - * time. - */ - if (likely(error <= interval)) - adj = 1; - else - adj = timekeeping_bigadjust(tk, error, &interval, &offset); - } else { - if (error < -interval) { - /* See comment above, this is just switched for the negative */ - error >>= 2; - if (likely(error >= -interval)) { - adj = -1; - interval = -interval; - offset = -offset; - } else { - adj = timekeeping_bigadjust(tk, error, &interval, &offset); - } - } else { - goto out_adjust; - } - } + /* First correct for the current frequency error */ + timekeeping_freqadjust(tk, interval, offset); + + + /* Next make a small adjustment to fix any cumulative error */ + error = tk->ntp_error >> tk->ntp_error_shift; + if (likely(abs(error) <= interval/2)) + goto out_adjust; + + if (error < 0) { + adj = -1; + interval = -interval; + offset = -offset; + } else + adj = 1; + - if (unlikely(tk->clock->maxadj && - (tk->mult + adj > tk->clock->mult + tk->clock->maxadj))) { - printk_once(KERN_WARNING - "Adjusting %s more than 11%% (%ld vs %ld)\n", - tk->clock->name, (long)tk->mult + adj, - (long)tk->clock->mult + tk->clock->maxadj); - } /* * So the following can be confusing. * @@ -1213,15 +1167,21 @@ static void timekeeping_adjust(struct timekeeper *tk, s64 offset) * xtime_nsec_2 = xtime_nsec_1 - offset * Which simplfies to: * xtime_nsec -= offset - * - * XXX - TODO: Doc ntp_error calculation. */ tk->mult += adj; tk->xtime_interval += interval; tk->xtime_nsec -= offset; - tk->ntp_error -= (interval - offset) << tk->ntp_error_shift; out_adjust: + + if (unlikely(tk->clock->maxadj && + (tk->mult > tk->clock->mult + tk->clock->maxadj))) { + printk_once(KERN_WARNING + "Adjusting %s more than 11%% (%ld vs %ld)\n", + tk->clock->name, (long)tk->mult, + (long)tk->clock->mult + tk->clock->maxadj); + } + /* * It may be possible that when we entered this function, xtime_nsec * was very small. Further, if we're slightly speeding the clocksource @@ -1241,7 +1201,6 @@ out_adjust: tk->xtime_nsec = 0; tk->ntp_error += neg << tk->ntp_error_shift; } - } /**
Got a few cycles to take another look at this, and tried to address Miroslav's latest comments. Please let me know if you have further thoughts! thanks -john The existing timekeeping_adjust logic has always been complicated to understand. Further, since it was developed prior to NOHZ becoming common, its not surprising it performs poorly when NOHZ is enabled. Since Miroslav pointed out the problematic nature of the existing code in the NOHZ case, I've tried to refactor the code to perform better. The problem with the previous approach was that it tried to adjust for the total cumulative error using a scaled dampening factor. This resulted in large errors to be corrected slowly, while small errors were corrected quickly. With NOHZ the timekeeping code doesn't know how far out the next tick will be, so this results in bad over-correction to small errors, and insufficient correction to large errors. Inspired by Miroslav's patch, I've refactored the code to try to address the correction in two steps. 1) Check the future freq error for the next tick, and if the frequency error is large, try to make sure we correct that error as best we can. 2) Then make a small single unit adjustment to correct any cumulative error that we've collected. This method performs fairly well in the simulator Miroslav created. I also dropped the tk->ntp_error adjustments, as I've never quite been able to recall how that was correct, and it doesn't seem to have any affect in the simulator. Will have to proceed carefully with testing here. I still think this is probably 3.15 or later material, but I'd be very interested in feedback, thoughts, and testing. Cc: Miroslav Lichvar <mlichvar@redhat.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: John Stultz <john.stultz@linaro.org> --- kernel/time/timekeeping.c | 163 +++++++++++++++++----------------------------- 1 file changed, 61 insertions(+), 102 deletions(-)