Message ID | 20200605111657.28773-1-rasmus.villemoes@prevas.dk |
---|---|
State | New |
Headers | show |
Series | [RFC] watchdog: base rate-limiting on get_ticks() rather than get_timer() | expand |
On 05.06.20 13:16, Rasmus Villemoes wrote: > On powerpc, get_timer() is implemented using a volatile variable that > gets incremented from the decrementer interrupt handler. Hence, when > interrupts are disabled, time as measured by get_timer() ceases to > pass. > > Interrupts are necessarily disabled during bootm (see the comment in > bootm_disable_interrupts() for why). But after interrupts are > disabled, there's still lots of work to do - e.g. decompressing the > kernel image to the right load address. Unfortunately, the > rate-limiting logic in wdt_uclass.c's watchdog_reset function means > that WATCHDOG_RESET() becomes a no-op, since get_timer() never > progresses past the next_reset. > > This, combined with an external gpio-triggered watchdog that must be > petted at least every 800ms, means our board gets reset before booting > into linux. > > Now, at least on powerpc, get_ticks() continues to return sensible > results whether or not interrupts are enabled. So this fixes the above > problem for our board - but I don't know if get_ticks() can be assumed > to always work. > > Signed-off-by: Rasmus Villemoes <rasmus.villemoes at prevas.dk> > --- > This is what I had in mind. I also considered making it a config knob > (possibly just auto-selected based on $ARCH) whether to use > get_timer() or get_ticks(), but that becomes quite ugly. I hesitate a bit, moving with this generic code from get_timer() to get_ticks() for all boards. Did you test this patch on other platforms, like some ARM boards? Please note that I don't reject it - just asking. Thanks, Stefan > drivers/watchdog/wdt-uclass.c | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/drivers/watchdog/wdt-uclass.c b/drivers/watchdog/wdt-uclass.c > index 4cdb7bd64c..7be4e9b5bc 100644 > --- a/drivers/watchdog/wdt-uclass.c > +++ b/drivers/watchdog/wdt-uclass.c > @@ -16,14 +16,15 @@ DECLARE_GLOBAL_DATA_PTR; > > #define WATCHDOG_TIMEOUT_SECS (CONFIG_WATCHDOG_TIMEOUT_MSECS / 1000) > > -/* > - * Reset every 1000ms, or however often is required as indicated by a > - * hw_margin_ms property. > - */ > -static ulong reset_period = 1000; > +static u64 ratelimit_ticks; > > int initr_watchdog(void) > { > + /* > + * Reset every 1000ms, or however often is required as indicated by a > + * hw_margin_ms property. > + */ > + u32 reset_period = 1000; > u32 timeout = WATCHDOG_TIMEOUT_SECS; > > /* > @@ -48,6 +49,7 @@ int initr_watchdog(void) > 4 * reset_period) / 4; > } > > + ratelimit_ticks = usec_to_tick(reset_period * 1000); > wdt_start(gd->watchdog_dev, timeout * 1000, 0); > gd->flags |= GD_FLG_WDT_READY; > printf("WDT: Started with%s servicing (%ds timeout)\n", > @@ -117,17 +119,17 @@ int wdt_expire_now(struct udevice *dev, ulong flags) > */ > void watchdog_reset(void) > { > - static ulong next_reset; > - ulong now; > + static u64 last_reset; > + u64 now; > > /* Exit if GD is not ready or watchdog is not initialized yet */ > if (!gd || !(gd->flags & GD_FLG_WDT_READY)) > return; > > /* Do not reset the watchdog too often */ > - now = get_timer(0); > - if (time_after(now, next_reset)) { > - next_reset = now + reset_period; > + now = get_ticks(); > + if (now - last_reset >= ratelimit_ticks) { > + last_reset = now; > wdt_reset(gd->watchdog_dev); > } > } > Viele Gr??e, Stefan
On 05/06/2020 13.48, Stefan Roese wrote: > On 05.06.20 13:16, Rasmus Villemoes wrote: >> This is what I had in mind. I also considered making it a config knob >> (possibly just auto-selected based on $ARCH) whether to use >> get_timer() or get_ticks(), but that becomes quite ugly. > > I hesitate a bit, moving with this generic code from get_timer() > to get_ticks() for all boards. Did you test this patch on other > platforms, like some ARM boards? > > Please note that I don't reject it - just asking. Yeah, I'm not really too happy about it myself, exactly because it affects all arches/platforms. And no, I don't have other hardware handy unfortunately. So it's very much an RFC where I hope someone with knowledge of the various arches can say whether one can expect get_ticks() to be at least as widely available as get_timer(). Rasmus
On 05.06.20 14:11, Rasmus Villemoes wrote: > On 05/06/2020 13.48, Stefan Roese wrote: >> On 05.06.20 13:16, Rasmus Villemoes wrote: >>> This is what I had in mind. I also considered making it a config knob >>> (possibly just auto-selected based on $ARCH) whether to use >>> get_timer() or get_ticks(), but that becomes quite ugly. >> >> I hesitate a bit, moving with this generic code from get_timer() >> to get_ticks() for all boards. Did you test this patch on other >> platforms, like some ARM boards? >> >> Please note that I don't reject it - just asking. > > Yeah, I'm not really too happy about it myself, exactly because it > affects all arches/platforms. And no, I don't have other hardware handy > unfortunately. So it's very much an RFC where I hope someone with > knowledge of the various arches can say whether one can expect > get_ticks() to be at least as widely available as get_timer(). I'll try to test it on a non-powerpc platform soon'ish. Thanks, Stefan
On 05.06.20 14:13, Stefan Roese wrote: > On 05.06.20 14:11, Rasmus Villemoes wrote: >> On 05/06/2020 13.48, Stefan Roese wrote: >>> On 05.06.20 13:16, Rasmus Villemoes wrote: >>>> This is what I had in mind. I also considered making it a config knob >>>> (possibly just auto-selected based on $ARCH) whether to use >>>> get_timer() or get_ticks(), but that becomes quite ugly. >>> >>> I hesitate a bit, moving with this generic code from get_timer() >>> to get_ticks() for all boards. Did you test this patch on other >>> platforms, like some ARM boards? >>> >>> Please note that I don't reject it - just asking. >> >> Yeah, I'm not really too happy about it myself, exactly because it >> affects all arches/platforms. And no, I don't have other hardware handy >> unfortunately. So it's very much an RFC where I hope someone with >> knowledge of the various arches can say whether one can expect >> get_ticks() to be at least as widely available as get_timer(). > > I'll try to test it on a non-powerpc platform soon'ish. I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688) and it works there without any issues AFAICT. Two more remarks: Could you please run a build test with this patch applied for all boards (Travis, Azure...)? And do you see an issue, that the timer-wrap fixed by Chris with this patch [1] will not re-occur with your "ticks" approach? Thanks, Stefan [1] Author: Chris Packham <judge.packham at gmail.com> 2020-02-24 01:20:33 Committer: Stefan Roese <sr at denx.de> 2020-03-16 11:25:12 Parent: db41d65a97f335167e1fbc0400a83333b5157703 (common: Move hang() to the same header as panic()) Child: b4d9452c442769e6dc25649ac02db2d5ed5ea0c8 (watchdog: move initr_watchdog() to wdt-uclass.c) Branches: master, watchdog-ratelimit-2020-06-05-rfc and many more (84) Follows: v2020.04-rc3 Precedes: v2020.04-rc4 watchdog: Handle timer wrap around On some platforms/architectures the value from get_timer() can wrap. This is particularly problematic when long-running code needs to measure a time difference as is the case with watchdog_reset() which tries to avoid tickling the watchdog too frequently. Use time_after() from time.h instead of a plain > comparison to avoid any issues with the time wrapping on a system that has been sitting in u-boot for a long time. Signed-off-by: Chris Packham <judge.packham at gmail.com> Reviewed-by: Stefan Roese <sr at denx.de>
On 05/06/2020 15.37, Stefan Roese wrote: > On 05.06.20 14:13, Stefan Roese wrote: >> On 05.06.20 14:11, Rasmus Villemoes wrote: >>> On 05/06/2020 13.48, Stefan Roese wrote: >>>> On 05.06.20 13:16, Rasmus Villemoes wrote: >>>>> This is what I had in mind. I also considered making it a config knob >>>>> (possibly just auto-selected based on $ARCH) whether to use >>>>> get_timer() or get_ticks(), but that becomes quite ugly. >>>> >>>> I hesitate a bit, moving with this generic code from get_timer() >>>> to get_ticks() for all boards. Did you test this patch on other >>>> platforms, like some ARM boards? >>>> >>>> Please note that I don't reject it - just asking. >>> >>> Yeah, I'm not really too happy about it myself, exactly because it >>> affects all arches/platforms. And no, I don't have other hardware handy >>> unfortunately. So it's very much an RFC where I hope someone with >>> knowledge of the various arches can say whether one can expect >>> get_ticks() to be at least as widely available as get_timer(). >> >> I'll try to test it on a non-powerpc platform soon'ish. > > I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688) > and it works there without any issues AFAICT. Thanks. > Two more remarks: > > Could you please run a build test with this patch applied for all > boards (Travis, Azure...)? I may need a few more pointers than that. What am I supposed to do exactly? > And do you see an issue, that the timer-wrap fixed by Chris with this > patch [1] will not re-occur with your "ticks" approach? No, it will not re-occur, because this does the comparison the right way, namely by subtracting the absolute times and comparing that delta to the threshold (that's also what the time_after does under the hood). The wrong way is saying now >= last_reset + threshold (or as I think the code used to do, computing next_reset = now + threshold and then asking now >= next_reset). Rasmus
On 05.06.20 16:18, Rasmus Villemoes wrote: > On 05/06/2020 15.37, Stefan Roese wrote: >> On 05.06.20 14:13, Stefan Roese wrote: >>> On 05.06.20 14:11, Rasmus Villemoes wrote: >>>> On 05/06/2020 13.48, Stefan Roese wrote: >>>>> On 05.06.20 13:16, Rasmus Villemoes wrote: >>>>>> This is what I had in mind. I also considered making it a config knob >>>>>> (possibly just auto-selected based on $ARCH) whether to use >>>>>> get_timer() or get_ticks(), but that becomes quite ugly. >>>>> >>>>> I hesitate a bit, moving with this generic code from get_timer() >>>>> to get_ticks() for all boards. Did you test this patch on other >>>>> platforms, like some ARM boards? >>>>> >>>>> Please note that I don't reject it - just asking. >>>> >>>> Yeah, I'm not really too happy about it myself, exactly because it >>>> affects all arches/platforms. And no, I don't have other hardware handy >>>> unfortunately. So it's very much an RFC where I hope someone with >>>> knowledge of the various arches can say whether one can expect >>>> get_ticks() to be at least as widely available as get_timer(). >>> >>> I'll try to test it on a non-powerpc platform soon'ish. >> >> I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688) >> and it works there without any issues AFAICT. > > Thanks. > >> Two more remarks: >> >> Could you please run a build test with this patch applied for all >> boards (Travis, Azure...)? > > I may need a few more pointers than that. What am I supposed to do exactly? Not sure if there is some documentation on how to use the Travis, Gitlab or Azure build. It mainly uses buildman [1] internally, so you can use buildman locally if you don't want to push the build into the cloud. >> And do you see an issue, that the timer-wrap fixed by Chris with this >> patch [1] will not re-occur with your "ticks" approach? > > No, it will not re-occur, because this does the comparison the right > way, namely by subtracting the absolute times and comparing that delta > to the threshold (that's also what the time_after does under the hood). > The wrong way is saying now >= last_reset + threshold (or as I think the > code used to do, computing next_reset = now + threshold and then asking > now >= next_reset). Ah, good to know. Thanks for confirming. Thanks, Stefan [1] tools/buildman/README
On Fri, Jun 05, 2020 at 04:23:21PM +0200, Stefan Roese wrote: > On 05.06.20 16:18, Rasmus Villemoes wrote: > > On 05/06/2020 15.37, Stefan Roese wrote: > > > On 05.06.20 14:13, Stefan Roese wrote: > > > > On 05.06.20 14:11, Rasmus Villemoes wrote: > > > > > On 05/06/2020 13.48, Stefan Roese wrote: > > > > > > On 05.06.20 13:16, Rasmus Villemoes wrote: > > > > > > > This is what I had in mind. I also considered making it a config knob > > > > > > > (possibly just auto-selected based on $ARCH) whether to use > > > > > > > get_timer() or get_ticks(), but that becomes quite ugly. > > > > > > > > > > > > I hesitate a bit, moving with this generic code from get_timer() > > > > > > to get_ticks() for all boards. Did you test this patch on other > > > > > > platforms, like some ARM boards? > > > > > > > > > > > > Please note that I don't reject it - just asking. > > > > > > > > > > Yeah, I'm not really too happy about it myself, exactly because it > > > > > affects all arches/platforms. And no, I don't have other hardware handy > > > > > unfortunately. So it's very much an RFC where I hope someone with > > > > > knowledge of the various arches can say whether one can expect > > > > > get_ticks() to be at least as widely available as get_timer(). > > > > > > > > I'll try to test it on a non-powerpc platform soon'ish. > > > > > > I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688) > > > and it works there without any issues AFAICT. > > > > Thanks. > > > > > Two more remarks: > > > > > > Could you please run a build test with this patch applied for all > > > boards (Travis, Azure...)? > > > > I may need a few more pointers than that. What am I supposed to do exactly? > > Not sure if there is some documentation on how to use the Travis, > Gitlab or Azure build. It mainly uses buildman [1] internally, so you > can use buildman locally if you don't want to push the build into > the cloud. So, here's a recently-I-learned. If you submit a pull request on GitHub to https://github.com/u-boot/u-boot that will trigger Travis and Azure runs: https://travis-ci.org/github/u-boot/u-boot/pull_requests and somewhere under: https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary Note that sometimes Travis will have connectivity failures and you have to be me to tell it to re-run the job. Azure only has failures when we get those to-be-debugged random failures of sandbox/qemu targets.
On 05/06/2020 16.34, Tom Rini wrote: > On Fri, Jun 05, 2020 at 04:23:21PM +0200, Stefan Roese wrote: >> On 05.06.20 16:18, Rasmus Villemoes wrote: >>> On 05/06/2020 15.37, Stefan Roese wrote: >>>> Could you please run a build test with this patch applied for all >>>> boards (Travis, Azure...)? >>> >>> I may need a few more pointers than that. What am I supposed to do exactly? >> >> Not sure if there is some documentation on how to use the Travis, >> Gitlab or Azure build. It mainly uses buildman [1] internally, so you >> can use buildman locally if you don't want to push the build into >> the cloud. > > So, here's a recently-I-learned. If you submit a pull request on GitHub > to https://github.com/u-boot/u-boot that will trigger Travis and Azure > runs: > https://travis-ci.org/github/u-boot/u-boot/pull_requests > and somewhere under: > https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary > > Note that sometimes Travis will have connectivity failures and you have > to be me to tell it to re-run the job. Azure only has failures when we > get those to-be-debugged random failures of sandbox/qemu targets. > Thanks! FWIW, https://travis-ci.org/github/u-boot/u-boot/builds/695074829 and https://dev.azure.com/u-boot/u-boot/_build/results?buildId=713&view=results are all green - but how many of those actually build wdt-uclass.c I don't know. Rasmus
Le 09/04/2021 à 16:12, Rasmus Villemoes a écrit : > On powerpc, time as measured by get_timer() ceases to pass when > interrupts are disabled (since on powerpc get_timer() returns the > value of a volatile variable that gets updated via a timer > interrupt). That in turn means the watchdog_reset() function provided > by CONFIG_WDT ceases to work due to the ratelimiting it imposes. > > Normally, interrupts are just disabled very briefly. However, during > bootm, they are disabled for good prior to decompressing the kernel > image, which can be a somewhat time-consuming operation. Even when we > manage to decompress the kernel and do the other preparation steps and > hand over control to the kernel, the kernel also takes some time > before it is ready to assume responsibility for handling the > watchdog. The end result is that the board gets reset prematurely. > > The ratelimiting isn't really strictly needed (prior to DM WDT, no > such thing existed), so just disable it when we know that time no > longer passes and have watchdog_reset() (e.g. called from > decompression loop) unconditionally reset the watchdog timer. Do we need to make it that complicated ? I think before the generic implementation, powerpc didn't have a rate limitation at all for pinging the watchdog, why not go back this direction, all the time ? I mean we could simply set reset_period to 0 at all time for powerpc ( and change the test to time_after_eq() instead of time_after() ). > > Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> > --- > > I previously sent a patch to change the ratelimiting to be based on > get_ticks() instead of get_timer(), but that has gone nowhere > [1]. This is an alternative which only affects powerpc (and only > boards that have enabled CONFIG_WDT). I hope the watchdog maintainers > will accept at least one of these, or suggest a third alternative, so > I don't have to keep some out-of-tree patch applied without knowing if > that's the direction upstream will take. > > [1] https://patchwork.ozlabs.org/project/uboot/patch/20200605111657.28773-1-rasmus.villemoes@prevas.dk/ > >
On 09/04/2021 16.37, Christophe Leroy wrote: > > > Le 09/04/2021 à 16:12, Rasmus Villemoes a écrit : >> The ratelimiting isn't really strictly needed (prior to DM WDT, no >> such thing existed), so just disable it when we know that time no >> longer passes and have watchdog_reset() (e.g. called from >> decompression loop) unconditionally reset the watchdog timer. > > > Do we need to make it that complicated ? I think before the generic > implementation, powerpc didn't have a rate limitation at all for pinging > the watchdog, why not go back this direction, all the time ? > > I mean we could simply set reset_period to 0 at all time for powerpc ( > and change the test to time_after_eq() instead of time_after() ). Maybe. I think I'd still prefer the one that changes the rate-limiting to be based on get_ticks() instead of get_timer(), which did seem to work everywhere. IIRC, Stefan previously rejected having a config knob or board hook for disabling the ratelimiting - changing the code [the s/after/after_eq/] to effectively allow that by passing hw_margin_ms=0 in DT would probably work, but seems quite hacky (especially when DT is supposed to describe hardware). I've also floated the option of making get_timer() return something sensible even with interrupts off on ppc - that would also solve the problem, and would benefit other generic code that uses get_timer() without knowing its limitation on ppc. https://lists.denx.de/pipermail/u-boot/2020-June/414842.html . So, there are lots of ways of solving it. Which direction to take really depends on which viewpoint one has: (1) the watchdog code should, if rate-limiting at all, use an implementation that works on all platforms [the get_ticks option] (2) the watchdog code should, if rate-limiting at all, provide a way for a board/platform to override or disable that (3) get_timer() on ppc is broken, it needs to provide sensible results at all times, even when interrupts are disabled for tens or hundreds of milliseconds. Rasmus
On 12.04.21 09:04, Rasmus Villemoes wrote: > On 09/04/2021 16.37, Christophe Leroy wrote: >> >> >> Le 09/04/2021 à 16:12, Rasmus Villemoes a écrit : > >>> The ratelimiting isn't really strictly needed (prior to DM WDT, no >>> such thing existed), so just disable it when we know that time no >>> longer passes and have watchdog_reset() (e.g. called from >>> decompression loop) unconditionally reset the watchdog timer. >> >> >> Do we need to make it that complicated ? I think before the generic >> implementation, powerpc didn't have a rate limitation at all for pinging >> the watchdog, why not go back this direction, all the time ? >> >> I mean we could simply set reset_period to 0 at all time for powerpc ( >> and change the test to time_after_eq() instead of time_after() ). > > Maybe. I think I'd still prefer the one that changes the rate-limiting > to be based on get_ticks() instead of get_timer(), which did seem to > work everywhere. > > IIRC, Stefan previously rejected having a config knob or board hook for > disabling the ratelimiting - changing the code [the s/after/after_eq/] > to effectively allow that by passing hw_margin_ms=0 in DT would probably > work, but seems quite hacky (especially when DT is supposed to describe > hardware). I'm not sure if I explicitly rejected any patches, sorry it's been too long. ;) But I also find this idea from Christophe quite applealing, as it's impact on the common code is minimal. Either you set the DT property to 0. Or if you don't have the possibility to do this (no DT on your platform?), then add a new Kconfig option to configure the default value for "reset_period" via Kconfig. > I've also floated the option of making get_timer() return something > sensible even with interrupts off on ppc - that would also solve the > problem, and would benefit other generic code that uses get_timer() > without knowing its limitation on ppc. > https://lists.denx.de/pipermail/u-boot/2020-June/414842.html . > > So, there are lots of ways of solving it. Which direction to take really > depends on which viewpoint one has: > > (1) the watchdog code should, if rate-limiting at all, use an > implementation that works on all platforms [the get_ticks option] > (2) the watchdog code should, if rate-limiting at all, provide a way for > a board/platform to override or disable that > (3) get_timer() on ppc is broken, it needs to provide sensible results > at all times, even when interrupts are disabled for tens or hundreds of > milliseconds. "hw_margin_ms=0" would be my favorite way right now (see above). If this does not work for you, I'm also fine with the latest patch you sent to the list. Thanks, Stefan
diff --git a/drivers/watchdog/wdt-uclass.c b/drivers/watchdog/wdt-uclass.c index 4cdb7bd64c..7be4e9b5bc 100644 --- a/drivers/watchdog/wdt-uclass.c +++ b/drivers/watchdog/wdt-uclass.c @@ -16,14 +16,15 @@ DECLARE_GLOBAL_DATA_PTR; #define WATCHDOG_TIMEOUT_SECS (CONFIG_WATCHDOG_TIMEOUT_MSECS / 1000) -/* - * Reset every 1000ms, or however often is required as indicated by a - * hw_margin_ms property. - */ -static ulong reset_period = 1000; +static u64 ratelimit_ticks; int initr_watchdog(void) { + /* + * Reset every 1000ms, or however often is required as indicated by a + * hw_margin_ms property. + */ + u32 reset_period = 1000; u32 timeout = WATCHDOG_TIMEOUT_SECS; /* @@ -48,6 +49,7 @@ int initr_watchdog(void) 4 * reset_period) / 4; } + ratelimit_ticks = usec_to_tick(reset_period * 1000); wdt_start(gd->watchdog_dev, timeout * 1000, 0); gd->flags |= GD_FLG_WDT_READY; printf("WDT: Started with%s servicing (%ds timeout)\n", @@ -117,17 +119,17 @@ int wdt_expire_now(struct udevice *dev, ulong flags) */ void watchdog_reset(void) { - static ulong next_reset; - ulong now; + static u64 last_reset; + u64 now; /* Exit if GD is not ready or watchdog is not initialized yet */ if (!gd || !(gd->flags & GD_FLG_WDT_READY)) return; /* Do not reset the watchdog too often */ - now = get_timer(0); - if (time_after(now, next_reset)) { - next_reset = now + reset_period; + now = get_ticks(); + if (now - last_reset >= ratelimit_ticks) { + last_reset = now; wdt_reset(gd->watchdog_dev); } }
On powerpc, get_timer() is implemented using a volatile variable that gets incremented from the decrementer interrupt handler. Hence, when interrupts are disabled, time as measured by get_timer() ceases to pass. Interrupts are necessarily disabled during bootm (see the comment in bootm_disable_interrupts() for why). But after interrupts are disabled, there's still lots of work to do - e.g. decompressing the kernel image to the right load address. Unfortunately, the rate-limiting logic in wdt_uclass.c's watchdog_reset function means that WATCHDOG_RESET() becomes a no-op, since get_timer() never progresses past the next_reset. This, combined with an external gpio-triggered watchdog that must be petted at least every 800ms, means our board gets reset before booting into linux. Now, at least on powerpc, get_ticks() continues to return sensible results whether or not interrupts are enabled. So this fixes the above problem for our board - but I don't know if get_ticks() can be assumed to always work. Signed-off-by: Rasmus Villemoes <rasmus.villemoes at prevas.dk> --- This is what I had in mind. I also considered making it a config knob (possibly just auto-selected based on $ARCH) whether to use get_timer() or get_ticks(), but that becomes quite ugly. drivers/watchdog/wdt-uclass.c | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)