diff mbox series

[RFC/RFT,v0.1] ACPI: OSL: Use usleep_range() in acpi_os_sleep()

Message ID 5839859.DvuYhMxLoT@rjwysocki.net
State Superseded
Headers show
Series [RFC/RFT,v0.1] ACPI: OSL: Use usleep_range() in acpi_os_sleep() | expand

Commit Message

Rafael J. Wysocki Nov. 21, 2024, 1:15 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

As stated by Len in [1], the extra delay added by msleep() to the
sleep time value passed to it can be significant, roughly between
1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
HZ = 100, which is hardly acceptable, at least for small sleep time
values.

Address this by using usleep_range() in acpi_os_sleep() instead of
msleep().  For short sleep times this is a no-brainer, but even for
long sleeps usleep_range() should be preferred because timer wheel
timers are optimized for cancellation before they expire and this
particular timer is not going to be canceled.

Add at least 50 us on top of the requested sleep time in case the
timer can be subject to coalescing, which is consistent with what's
done in user space in this context [2], but for sleeps longer than 5 ms
use 1% of the requested sleep time for this purpose.

The rationale here is that longer sleeps don't need that much of a timer
precision as a rule and making the timer a more likely candidate for
coalescing in these cases is generally desirable.  It starts at 5 ms so
that the delta between the requested sleep time and the effective
deadline is a contiuous function of the former.

Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
Reported-by: Len Brown <lenb@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This is a follow-up to the discussion started by [1] above and since
the beginning of it I have changed my mind a bit, as you can see.

Given Arjan's feedback, I've concluded that using usleep_range() for
all sleep values is the right choice and that some slack should be
used there.  I've taken 50 us as the minimum value of it because that's
what is used in user space FWICT and I'm not convinced that shorter
values would be suitable here.

The other part, using 1% of the sleep time as the slack for longer
sleeps, is likely more controversial.  It is roughly based on the
observation that if one timer interrupt is sufficient for something,
then using two of them will be wasteful even if this is just somewhat.

Anyway, please let me know what you think.  I'd rather do whatever
the majority of you are comfortable with.

---
 drivers/acpi/osl.c |   22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki Nov. 22, 2024, 7:27 p.m. UTC | #1
On Thu, Nov 21, 2024 at 11:27 PM Mario Limonciello
<mario.limonciello@amd.com> wrote:
>
> On 11/21/2024 07:15, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > As stated by Len in [1], the extra delay added by msleep() to the
> > sleep time value passed to it can be significant, roughly between
> > 1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
> > HZ = 100, which is hardly acceptable, at least for small sleep time
> > values.
> >
> > Address this by using usleep_range() in acpi_os_sleep() instead of
> > msleep().  For short sleep times this is a no-brainer, but even for
> > long sleeps usleep_range() should be preferred because timer wheel
> > timers are optimized for cancellation before they expire and this
> > particular timer is not going to be canceled.
> >
> > Add at least 50 us on top of the requested sleep time in case the
> > timer can be subject to coalescing, which is consistent with what's
> > done in user space in this context [2], but for sleeps longer than 5 ms
> > use 1% of the requested sleep time for this purpose.
> >
> > The rationale here is that longer sleeps don't need that much of a timer
> > precision as a rule and making the timer a more likely candidate for
> > coalescing in these cases is generally desirable.  It starts at 5 ms so
> > that the delta between the requested sleep time and the effective
> > deadline is a contiuous function of the former.
> >
> > Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
> > Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
> > Reported-by: Len Brown <lenb@kernel.org>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> You probably should also pick up this tag from the earlier version.
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263

Good point.

> > ---
> >
> > This is a follow-up to the discussion started by [1] above and since
> > the beginning of it I have changed my mind a bit, as you can see.
> >
> > Given Arjan's feedback, I've concluded that using usleep_range() for
> > all sleep values is the right choice and that some slack should be
> > used there.  I've taken 50 us as the minimum value of it because that's
> > what is used in user space FWICT and I'm not convinced that shorter
> > values would be suitable here.
> >
> > The other part, using 1% of the sleep time as the slack for longer
> > sleeps, is likely more controversial.  It is roughly based on the
> > observation that if one timer interrupt is sufficient for something,
> > then using two of them will be wasteful even if this is just somewhat.
> >
> > Anyway, please let me know what you think.  I'd rather do whatever
> > the majority of you are comfortable with.
>
> Generally I'm fine with this.
>
> I'm about to head on US holiday, but I will forward this to folks that
> aren't and get some testing input on it to bring back later when I'm back.

Thanks!
Mario Limonciello Dec. 2, 2024, 9:54 p.m. UTC | #2
On 11/22/2024 13:27, Rafael J. Wysocki wrote:
> On Thu, Nov 21, 2024 at 11:27 PM Mario Limonciello
> <mario.limonciello@amd.com> wrote:
>>
>> On 11/21/2024 07:15, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> As stated by Len in [1], the extra delay added by msleep() to the
>>> sleep time value passed to it can be significant, roughly between
>>> 1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
>>> HZ = 100, which is hardly acceptable, at least for small sleep time
>>> values.
>>>
>>> Address this by using usleep_range() in acpi_os_sleep() instead of
>>> msleep().  For short sleep times this is a no-brainer, but even for
>>> long sleeps usleep_range() should be preferred because timer wheel
>>> timers are optimized for cancellation before they expire and this
>>> particular timer is not going to be canceled.
>>>
>>> Add at least 50 us on top of the requested sleep time in case the
>>> timer can be subject to coalescing, which is consistent with what's
>>> done in user space in this context [2], but for sleeps longer than 5 ms
>>> use 1% of the requested sleep time for this purpose.
>>>
>>> The rationale here is that longer sleeps don't need that much of a timer
>>> precision as a rule and making the timer a more likely candidate for
>>> coalescing in these cases is generally desirable.  It starts at 5 ms so
>>> that the delta between the requested sleep time and the effective
>>> deadline is a contiuous function of the former.
>>>
>>> Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
>>> Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
>>> Reported-by: Len Brown <lenb@kernel.org>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> You probably should also pick up this tag from the earlier version.
>>
>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> 
> Good point.
> 
>>> ---
>>>
>>> This is a follow-up to the discussion started by [1] above and since
>>> the beginning of it I have changed my mind a bit, as you can see.
>>>
>>> Given Arjan's feedback, I've concluded that using usleep_range() for
>>> all sleep values is the right choice and that some slack should be
>>> used there.  I've taken 50 us as the minimum value of it because that's
>>> what is used in user space FWICT and I'm not convinced that shorter
>>> values would be suitable here.
>>>
>>> The other part, using 1% of the sleep time as the slack for longer
>>> sleeps, is likely more controversial.  It is roughly based on the
>>> observation that if one timer interrupt is sufficient for something,
>>> then using two of them will be wasteful even if this is just somewhat.
>>>
>>> Anyway, please let me know what you think.  I'd rather do whatever
>>> the majority of you are comfortable with.
>>
>> Generally I'm fine with this.
>>
>> I'm about to head on US holiday, but I will forward this to folks that
>> aren't and get some testing input on it to bring back later when I'm back.
> 
> Thanks!

Hi Rafael,

I loaded this onto my personal laptop before the holiday and also got 
others in AMD to do testing on a wider variety of client hardware.
No concerns were raised with this patch.

Feel free to include:

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Tested-by: Mario Limonciello <mario.limonciello@amd.com>
Rafael J. Wysocki Dec. 2, 2024, 9:57 p.m. UTC | #3
On Mon, Dec 2, 2024 at 10:54 PM Mario Limonciello
<mario.limonciello@amd.com> wrote:
>
> On 11/22/2024 13:27, Rafael J. Wysocki wrote:
> > On Thu, Nov 21, 2024 at 11:27 PM Mario Limonciello
> > <mario.limonciello@amd.com> wrote:
> >>
> >> On 11/21/2024 07:15, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> As stated by Len in [1], the extra delay added by msleep() to the
> >>> sleep time value passed to it can be significant, roughly between
> >>> 1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
> >>> HZ = 100, which is hardly acceptable, at least for small sleep time
> >>> values.
> >>>
> >>> Address this by using usleep_range() in acpi_os_sleep() instead of
> >>> msleep().  For short sleep times this is a no-brainer, but even for
> >>> long sleeps usleep_range() should be preferred because timer wheel
> >>> timers are optimized for cancellation before they expire and this
> >>> particular timer is not going to be canceled.
> >>>
> >>> Add at least 50 us on top of the requested sleep time in case the
> >>> timer can be subject to coalescing, which is consistent with what's
> >>> done in user space in this context [2], but for sleeps longer than 5 ms
> >>> use 1% of the requested sleep time for this purpose.
> >>>
> >>> The rationale here is that longer sleeps don't need that much of a timer
> >>> precision as a rule and making the timer a more likely candidate for
> >>> coalescing in these cases is generally desirable.  It starts at 5 ms so
> >>> that the delta between the requested sleep time and the effective
> >>> deadline is a contiuous function of the former.
> >>>
> >>> Link: https://lore.kernel.org/linux-pm/c7db7e804c453629c116d508558eaf46477a2d73.1731708405.git.len.brown@intel.com/ [1]
> >>> Link: https://lore.kernel.org/linux-pm/CAJvTdK=Q1kwWA6Wxn8Zcf0OicDEk6cHYFAvQVizgA47mXu63+g@mail.gmail.com/ [2]
> >>> Reported-by: Len Brown <lenb@kernel.org>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> You probably should also pick up this tag from the earlier version.
> >>
> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> >
> > Good point.
> >
> >>> ---
> >>>
> >>> This is a follow-up to the discussion started by [1] above and since
> >>> the beginning of it I have changed my mind a bit, as you can see.
> >>>
> >>> Given Arjan's feedback, I've concluded that using usleep_range() for
> >>> all sleep values is the right choice and that some slack should be
> >>> used there.  I've taken 50 us as the minimum value of it because that's
> >>> what is used in user space FWICT and I'm not convinced that shorter
> >>> values would be suitable here.
> >>>
> >>> The other part, using 1% of the sleep time as the slack for longer
> >>> sleeps, is likely more controversial.  It is roughly based on the
> >>> observation that if one timer interrupt is sufficient for something,
> >>> then using two of them will be wasteful even if this is just somewhat.
> >>>
> >>> Anyway, please let me know what you think.  I'd rather do whatever
> >>> the majority of you are comfortable with.
> >>
> >> Generally I'm fine with this.
> >>
> >> I'm about to head on US holiday, but I will forward this to folks that
> >> aren't and get some testing input on it to bring back later when I'm back.
> >
> > Thanks!
>
> Hi Rafael,
>
> I loaded this onto my personal laptop before the holiday and also got
> others in AMD to do testing on a wider variety of client hardware.
> No concerns were raised with this patch.
>
> Feel free to include:
>
> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
> Tested-by: Mario Limonciello <mario.limonciello@amd.com>

Thank you!
Len Brown Dec. 4, 2024, 9:41 p.m. UTC | #4
On Thu, Nov 21, 2024 at 8:15 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> As stated by Len in [1], the extra delay added by msleep() to the
> sleep time value passed to it can be significant, roughly between
> 1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
> HZ = 100, which is hardly acceptable, at least for small sleep time
> values.

Maybe the problem statement is more clear with a concrete example:

msleep(5) on the default HZ=250 on a modern PC takes about 11.9 ms.
This results in over 800 ms of spurious system resume delay
on systems such as the Dell XPS-13-9300, which use ASL Sleep(5ms)
in a tight loop.

(yes, this additional cost used to be over 1200 ms before the v6.12
msleep rounding fix)

> -       msleep(ms);
> +       u64 usec = ms * USEC_PER_MSEC, delta_us = 50;

> +       if (ms > 5)
> +               delta_us = (USEC_PER_MSEC / 100) * ms

I measured 100 resume cycles on the Dell XPS 13 9300 on 4 kernels.
Here is the measured fastest kernel resume time in msec for each:

1. 1921.292 v6.12 msleep (baseline)
2. 1115.579 v6.12 delta_us = (USEC_PER_MSEC / 100) * ms (this patch)
3. 1113.396 v6.12 delta_us = 50
4. 1107.835 v6.12 delta_us = 0

(I didn't average the 100 runs, because random very long device
hiccups  throw off the average)

So any of #2, #3 and #4 are a huge step forward from what is shipping today!

So considering #2 vs #3 vs #4....

I agree that it is a problem for the timer sub-system to work to
maintain a 1ns granularity
that it can't actually deliver.

I think it is fine for the timer sub-system to allow calls to opt into
timer slack --
some callers may actually know what number to use.

However, I don't think that the timer sub-system should force callers to guess
how much slack is appropriate.  I think that a caller with 0 slack
should be internally
rounded up by the timer sub-system to the granularity that it can
actually deliver
with the timer that is currently in use on that system.

Note also that slack of 0 doesn't mean that no coalescing can happen.
A slack=0 timer can land within the slack another timer, and the other
timer will be pulled forward to coalesce.

The 50 usec default for user timer slack is certainly a magic number born
of tests of interesting workloads on interesting systems on a certain date.
It may not be the right number for other workloads, or other systems
with other timers on other dates.

My opinion...

I don't see a justification for increasing timer slack with increasing duration.
User-space timers don't pay this additional delay, why should the ASL
programmer?

Also, the graduated increasing slack with duration is a guess no more
valid than the guess of a flat 50 usec.

A flat 50 or a flat 0 have the virtue of being simple -- they will be simpler
to understand and maintain in the future.

But I can live with any of these options, since they are all a big step forward.

thanks,
Len Brown, Intel Open Source Technology Center
Rafael J. Wysocki Dec. 5, 2024, 10:56 a.m. UTC | #5
On Wed, Dec 4, 2024 at 10:41 PM Len Brown <lenb@kernel.org> wrote:
>
> On Thu, Nov 21, 2024 at 8:15 AM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > As stated by Len in [1], the extra delay added by msleep() to the
> > sleep time value passed to it can be significant, roughly between
> > 1.5 ns on systems with HZ = 1000 and as much as 15 ms on systems with
> > HZ = 100, which is hardly acceptable, at least for small sleep time
> > values.
>
> Maybe the problem statement is more clear with a concrete example:
>
> msleep(5) on the default HZ=250 on a modern PC takes about 11.9 ms.
> This results in over 800 ms of spurious system resume delay
> on systems such as the Dell XPS-13-9300, which use ASL Sleep(5ms)
> in a tight loop.

Sure, I can add this information to the changelog.

> (yes, this additional cost used to be over 1200 ms before the v6.12
> msleep rounding fix)
>
> > -       msleep(ms);
> > +       u64 usec = ms * USEC_PER_MSEC, delta_us = 50;
>
> > +       if (ms > 5)
> > +               delta_us = (USEC_PER_MSEC / 100) * ms
>
> I measured 100 resume cycles on the Dell XPS 13 9300 on 4 kernels.
> Here is the measured fastest kernel resume time in msec for each:
>
> 1. 1921.292 v6.12 msleep (baseline)
> 2. 1115.579 v6.12 delta_us = (USEC_PER_MSEC / 100) * ms (this patch)
> 3. 1113.396 v6.12 delta_us = 50
> 4. 1107.835 v6.12 delta_us = 0
>
> (I didn't average the 100 runs, because random very long device
> hiccups  throw off the average)
>
> So any of #2, #3 and #4 are a huge step forward from what is shipping today!
>
> So considering #2 vs #3 vs #4....
>
> I agree that it is a problem for the timer sub-system to work to
> maintain a 1ns granularity
> that it can't actually deliver.
>
> I think it is fine for the timer sub-system to allow calls to opt into
> timer slack --
> some callers may actually know what number to use.
>
> However, I don't think that the timer sub-system should force callers to guess
> how much slack is appropriate.  I think that a caller with 0 slack
> should be internally
> rounded up by the timer sub-system to the granularity that it can
> actually deliver
> with the timer that is currently in use on that system.
>
> Note also that slack of 0 doesn't mean that no coalescing can happen.
> A slack=0 timer can land within the slack another timer, and the other
> timer will be pulled forward to coalesce.
>
> The 50 usec default for user timer slack is certainly a magic number born
> of tests of interesting workloads on interesting systems on a certain date.
> It may not be the right number for other workloads, or other systems
> with other timers on other dates.
>
> My opinion...
>
> I don't see a justification for increasing timer slack with increasing duration.
> User-space timers don't pay this additional delay, why should the ASL
> programmer?
>
> Also, the graduated increasing slack with duration is a guess no more
> valid than the guess of a flat 50 usec.
>
> A flat 50 or a flat 0 have the virtue of being simple -- they will be simpler
> to understand and maintain in the future.
>
> But I can live with any of these options, since they are all a big step forward.

All right, thanks!
diff mbox series

Patch

Index: linux-pm/drivers/acpi/osl.c
===================================================================
--- linux-pm.orig/drivers/acpi/osl.c
+++ linux-pm/drivers/acpi/osl.c
@@ -607,7 +607,27 @@  acpi_status acpi_os_remove_interrupt_han
 
 void acpi_os_sleep(u64 ms)
 {
-	msleep(ms);
+	u64 usec = ms * USEC_PER_MSEC, delta_us = 50;
+
+	/*
+	 * Use a hrtimer because the timer wheel timers are optimized for
+	 * cancellation before they expire and this timer is not going to be
+	 * canceled.
+	 *
+	 * Set the delta between the requested sleep time and the effective
+	 * deadline to at least 50 us in case there is an opportunity for timer
+	 * coalescing.
+	 *
+	 * Moreover, longer sleeps can be assumed to need somewhat less timer
+	 * precision, so sacrifice some of it for making the timer a more likely
+	 * candidate for coalescing by setting the delta to 1% of the sleep time
+	 * if it is above 5 ms (this value is chosen so that the delta is a
+	 * continuous function of the sleep time).
+	 */
+	if (ms > 5)
+		delta_us = (USEC_PER_MSEC / 100) * ms;
+
+	usleep_range(usec, usec + delta_us);
 }
 
 void acpi_os_stall(u32 us)