Message ID | 5839859.DvuYhMxLoT@rjwysocki.net
---|---
State | Superseded
Series | [RFC/RFT,v0.1] ACPI: OSL: Use usleep_range() in acpi_os_sleep()
On Thu, Nov 21, 2024 at 7:39 PM Hans de Goede <hdegoede@redhat.com> wrote:
>
> Hi,
>
> On 21-Nov-24 2:15 PM, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > As stated by Len in [1], the extra delay added by msleep() to the
> > sleep time value passed to it can be significant, roughly between
> > 1.5 ms on systems with HZ = 1000 and as much as 15 ms on systems with
> > HZ = 100, which is hardly acceptable, at least for small sleep time
> > values.
> >
> > Address this by using usleep_range() in acpi_os_sleep() instead of
> > msleep(). For short sleep times this is a no-brainer, but even for
> > long sleeps usleep_range() should be preferred because timer wheel
> > timers are optimized for cancellation before they expire and this
> > particular timer is not going to be canceled.
> >
> > Add at least 50 us on top of the requested sleep time in case the
> > timer can be subject to coalescing, which is consistent with what's
> > done in user space in this context [2], but for sleeps longer than 5 ms
> > use 1% of the requested sleep time for this purpose.
> >
> > The rationale here is that longer sleeps don't need that much timer
> > precision as a rule, and making the timer a more likely candidate for
> > coalescing in these cases is generally desirable. It starts at 5 ms so
> > that the delta between the requested sleep time and the effective
> > deadline is a continuous function of the former.
> >
> > Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
> > Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
> > Reported-by: Len Brown <lenb@kernel.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > ---
> >
> > This is a follow-up to the discussion started by [1] above, and since
> > the beginning of it I have changed my mind a bit, as you can see.
> >
> > Given Arjan's feedback, I've concluded that using usleep_range() for
> > all sleep values is the right choice and that some slack should be
> > used there. I've taken 50 us as the minimum value of it because that's
> > what is used in user space FWICT, and I'm not convinced that shorter
> > values would be suitable here.
> >
> > The other part, using 1% of the sleep time as the slack for longer
> > sleeps, is likely more controversial. It is roughly based on the
> > observation that if one timer interrupt is sufficient for something,
> > then using two of them will be wasteful, even if only somewhat.
> >
> > Anyway, please let me know what you think. I'd rather do whatever
> > the majority of you are comfortable with.
>
> I know it is a bit early for this, but the patch looks good to me, so:
>
> Reviewed-by: Hans de Goede <hdegoede@redhat.com>

Thank you!
On 11/21/2024 07:15, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> As stated by Len in [1], the extra delay added by msleep() to the
> sleep time value passed to it can be significant, roughly between
> 1.5 ms on systems with HZ = 1000 and as much as 15 ms on systems with
> HZ = 100, which is hardly acceptable, at least for small sleep time
> values.
>
> Address this by using usleep_range() in acpi_os_sleep() instead of
> msleep(). For short sleep times this is a no-brainer, but even for
> long sleeps usleep_range() should be preferred because timer wheel
> timers are optimized for cancellation before they expire and this
> particular timer is not going to be canceled.
>
> Add at least 50 us on top of the requested sleep time in case the
> timer can be subject to coalescing, which is consistent with what's
> done in user space in this context [2], but for sleeps longer than 5 ms
> use 1% of the requested sleep time for this purpose.
>
> The rationale here is that longer sleeps don't need that much timer
> precision as a rule, and making the timer a more likely candidate for
> coalescing in these cases is generally desirable. It starts at 5 ms so
> that the delta between the requested sleep time and the effective
> deadline is a continuous function of the former.
>
> Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
> Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
> Reported-by: Len Brown <lenb@kernel.org>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

You probably should also pick up this tag from the earlier version.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263

> ---
>
> This is a follow-up to the discussion started by [1] above, and since
> the beginning of it I have changed my mind a bit, as you can see.
>
> Given Arjan's feedback, I've concluded that using usleep_range() for
> all sleep values is the right choice and that some slack should be
> used there. I've taken 50 us as the minimum value of it because that's
> what is used in user space FWICT, and I'm not convinced that shorter
> values would be suitable here.
>
> The other part, using 1% of the sleep time as the slack for longer
> sleeps, is likely more controversial. It is roughly based on the
> observation that if one timer interrupt is sufficient for something,
> then using two of them will be wasteful, even if only somewhat.
>
> Anyway, please let me know what you think. I'd rather do whatever
> the majority of you are comfortable with.

Generally I'm fine with this.

I'm about to head out on US holiday, but I will forward this to folks
that aren't and get some testing input on it to bring back later when
I'm back.

> ---
>  drivers/acpi/osl.c |   22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/acpi/osl.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/osl.c
> +++ linux-pm/drivers/acpi/osl.c
> @@ -607,7 +607,27 @@ acpi_status acpi_os_remove_interrupt_han
>
>  void acpi_os_sleep(u64 ms)
>  {
> -	msleep(ms);
> +	u64 usec = ms * USEC_PER_MSEC, delta_us = 50;
> +
> +	/*
> +	 * Use a hrtimer because the timer wheel timers are optimized for
> +	 * cancellation before they expire and this timer is not going to be
> +	 * canceled.
> +	 *
> +	 * Set the delta between the requested sleep time and the effective
> +	 * deadline to at least 50 us in case there is an opportunity for timer
> +	 * coalescing.
> +	 *
> +	 * Moreover, longer sleeps can be assumed to need somewhat less timer
> +	 * precision, so sacrifice some of it for making the timer a more likely
> +	 * candidate for coalescing by setting the delta to 1% of the sleep time
> +	 * if it is above 5 ms (this value is chosen so that the delta is a
> +	 * continuous function of the sleep time).
> +	 */
> +	if (ms > 5)
> +		delta_us = (USEC_PER_MSEC / 100) * ms;
> +
> +	usleep_range(usec, usec + delta_us);
>  }
>
>  void acpi_os_stall(u32 us)
On 11/22/2024 13:27, Rafael J. Wysocki wrote:
> On Thu, Nov 21, 2024 at 11:27 PM Mario Limonciello
> <mario.limonciello@amd.com> wrote:
>>
>> On 11/21/2024 07:15, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> [...]
>>>
>>> Reported-by: Len Brown <lenb@kernel.org>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> You probably should also pick up this tag from the earlier version.
>>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
>
> Good point.
>
>>> [...]
>>>
>>> Anyway, please let me know what you think. I'd rather do whatever
>>> the majority of you are comfortable with.
>>
>> Generally I'm fine with this.
>>
>> I'm about to head out on US holiday, but I will forward this to folks
>> that aren't and get some testing input on it to bring back later when
>> I'm back.
>
> Thanks!

Hi Rafael,

I loaded this onto my personal laptop before the holiday and also got
others in AMD to do testing on a wider variety of client hardware.
No concerns were raised with this patch.

Feel free to include:

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
On Mon, Dec 2, 2024 at 10:54 PM Mario Limonciello
<mario.limonciello@amd.com> wrote:
>
> On 11/22/2024 13:27, Rafael J. Wysocki wrote:
> > On Thu, Nov 21, 2024 at 11:27 PM Mario Limonciello
> > <mario.limonciello@amd.com> wrote:
> >>
> >> On 11/21/2024 07:15, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> [...]
> >>>
> >>> Reported-by: Len Brown <lenb@kernel.org>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> You probably should also pick up this tag from the earlier version.
> >>
> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> >
> > Good point.
> >
> >>> [...]
> >>>
> >>> Anyway, please let me know what you think. I'd rather do whatever
> >>> the majority of you are comfortable with.
> >>
> >> Generally I'm fine with this.
> >>
> >> I'm about to head out on US holiday, but I will forward this to folks
> >> that aren't and get some testing input on it to bring back later when
> >> I'm back.
> >
> > Thanks!
>
> Hi Rafael,
>
> I loaded this onto my personal laptop before the holiday and also got
> others in AMD to do testing on a wider variety of client hardware.
> No concerns were raised with this patch.
>
> Feel free to include:
>
> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
> Tested-by: Mario Limonciello <mario.limonciello@amd.com>

Thank you!
On Thu, Nov 21, 2024 at 8:15 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> As stated by Len in [1], the extra delay added by msleep() to the
> sleep time value passed to it can be significant, roughly between
> 1.5 ms on systems with HZ = 1000 and as much as 15 ms on systems with
> HZ = 100, which is hardly acceptable, at least for small sleep time
> values.

Maybe the problem statement is clearer with a concrete example:

msleep(5) on the default HZ=250 on a modern PC takes about 11.9 ms.
This results in over 800 ms of spurious system resume delay on
systems such as the Dell XPS-13-9300, which use ASL Sleep(5ms) in a
tight loop. (Yes, this additional cost used to be over 1200 ms before
the v6.12 msleep rounding fix.)

> -	msleep(ms);
> +	u64 usec = ms * USEC_PER_MSEC, delta_us = 50;

> +	if (ms > 5)
> +		delta_us = (USEC_PER_MSEC / 100) * ms;

I measured 100 resume cycles on the Dell XPS 13 9300 on 4 kernels.
Here is the measured fastest kernel resume time in msec for each:

1. 1921.292  v6.12 msleep (baseline)
2. 1115.579  v6.12 delta_us = (USEC_PER_MSEC / 100) * ms (this patch)
3. 1113.396  v6.12 delta_us = 50
4. 1107.835  v6.12 delta_us = 0

(I didn't average the 100 runs, because random very long device
hiccups throw off the average.)

So any of #2, #3 and #4 is a huge step forward from what is shipping today!

So considering #2 vs #3 vs #4...

I agree that it is a problem for the timer sub-system to work to
maintain a 1 ns granularity that it can't actually deliver.

I think it is fine for the timer sub-system to allow callers to opt
into timer slack -- some callers may actually know what number to use.

However, I don't think that the timer sub-system should force callers
to guess how much slack is appropriate. I think that a caller with 0
slack should be internally rounded up by the timer sub-system to the
granularity that it can actually deliver with the timer that is
currently in use on that system.

Note also that slack of 0 doesn't mean that no coalescing can happen.
A slack=0 timer can land within the slack of another timer, and the
other timer will be pulled forward to coalesce.

The 50 usec default for user timer slack is certainly a magic number,
born of tests of interesting workloads on interesting systems on a
certain date. It may not be the right number for other workloads, or
other systems with other timers on other dates.

My opinion...

I don't see a justification for increasing timer slack with
increasing duration. User-space timers don't pay this additional
delay, so why should the ASL programmer?

Also, the graduated increase of slack with duration is a guess no
more valid than the guess of a flat 50 usec.

A flat 50 or a flat 0 have the virtue of being simple -- they will be
simpler to understand and maintain in the future.

But I can live with any of these options, since they are all a big
step forward.

thanks,
Len Brown, Intel Open Source Technology Center
On Wed, Dec 4, 2024 at 10:41 PM Len Brown <lenb@kernel.org> wrote:
>
> On Thu, Nov 21, 2024 at 8:15 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > As stated by Len in [1], the extra delay added by msleep() to the
> > sleep time value passed to it can be significant, roughly between
> > 1.5 ms on systems with HZ = 1000 and as much as 15 ms on systems with
> > HZ = 100, which is hardly acceptable, at least for small sleep time
> > values.
>
> Maybe the problem statement is clearer with a concrete example:
>
> msleep(5) on the default HZ=250 on a modern PC takes about 11.9 ms.
> This results in over 800 ms of spurious system resume delay on
> systems such as the Dell XPS-13-9300, which use ASL Sleep(5ms) in a
> tight loop.

Sure, I can add this information to the changelog.

> (Yes, this additional cost used to be over 1200 ms before the v6.12
> msleep rounding fix.)
>
> > -	msleep(ms);
> > +	u64 usec = ms * USEC_PER_MSEC, delta_us = 50;
>
> > +	if (ms > 5)
> > +		delta_us = (USEC_PER_MSEC / 100) * ms;
>
> I measured 100 resume cycles on the Dell XPS 13 9300 on 4 kernels.
> Here is the measured fastest kernel resume time in msec for each:
>
> 1. 1921.292  v6.12 msleep (baseline)
> 2. 1115.579  v6.12 delta_us = (USEC_PER_MSEC / 100) * ms (this patch)
> 3. 1113.396  v6.12 delta_us = 50
> 4. 1107.835  v6.12 delta_us = 0
>
> (I didn't average the 100 runs, because random very long device
> hiccups throw off the average.)
>
> So any of #2, #3 and #4 is a huge step forward from what is shipping today!
>
> So considering #2 vs #3 vs #4...
>
> I agree that it is a problem for the timer sub-system to work to
> maintain a 1 ns granularity that it can't actually deliver.
>
> I think it is fine for the timer sub-system to allow callers to opt
> into timer slack -- some callers may actually know what number to use.
>
> However, I don't think that the timer sub-system should force callers
> to guess how much slack is appropriate. I think that a caller with 0
> slack should be internally rounded up by the timer sub-system to the
> granularity that it can actually deliver with the timer that is
> currently in use on that system.
>
> Note also that slack of 0 doesn't mean that no coalescing can happen.
> A slack=0 timer can land within the slack of another timer, and the
> other timer will be pulled forward to coalesce.
>
> The 50 usec default for user timer slack is certainly a magic number,
> born of tests of interesting workloads on interesting systems on a
> certain date. It may not be the right number for other workloads, or
> other systems with other timers on other dates.
>
> My opinion...
>
> I don't see a justification for increasing timer slack with
> increasing duration. User-space timers don't pay this additional
> delay, so why should the ASL programmer?
>
> Also, the graduated increase of slack with duration is a guess no
> more valid than the guess of a flat 50 usec.
>
> A flat 50 or a flat 0 have the virtue of being simple -- they will be
> simpler to understand and maintain in the future.
>
> But I can live with any of these options, since they are all a big
> step forward.

All right, thanks!
Index: linux-pm/drivers/acpi/osl.c
===================================================================
--- linux-pm.orig/drivers/acpi/osl.c
+++ linux-pm/drivers/acpi/osl.c
@@ -607,7 +607,27 @@ acpi_status acpi_os_remove_interrupt_han
 
 void acpi_os_sleep(u64 ms)
 {
-	msleep(ms);
+	u64 usec = ms * USEC_PER_MSEC, delta_us = 50;
+
+	/*
+	 * Use a hrtimer because the timer wheel timers are optimized for
+	 * cancellation before they expire and this timer is not going to be
+	 * canceled.
+	 *
+	 * Set the delta between the requested sleep time and the effective
+	 * deadline to at least 50 us in case there is an opportunity for timer
+	 * coalescing.
+	 *
+	 * Moreover, longer sleeps can be assumed to need somewhat less timer
+	 * precision, so sacrifice some of it for making the timer a more likely
+	 * candidate for coalescing by setting the delta to 1% of the sleep time
+	 * if it is above 5 ms (this value is chosen so that the delta is a
+	 * continuous function of the sleep time).
+	 */
+	if (ms > 5)
+		delta_us = (USEC_PER_MSEC / 100) * ms;
+
+	usleep_range(usec, usec + delta_us);
 }
 
 void acpi_os_stall(u32 us)